[jira] [Commented] (SPARK-42650) link issue SPARK-42550

2023-03-02 Thread XiDuo You (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696070#comment-17696070
 ] 

XiDuo You commented on SPARK-42650:
---

To be clear, this is an issue in Spark 3.2.3. Spark 3.2.1, 3.3.x and master are
fine.

It can be reproduced by:

{code:java}

CREATE TABLE IF NOT EXISTS spark32_overwrite(amt1 int) STORED AS ORC;
CREATE TABLE IF NOT EXISTS spark32_overwrite2(amt1 long) STORED AS ORC;

INSERT OVERWRITE TABLE spark32_overwrite2 select 644164;

set spark.sql.ansi.enabled=true;
INSERT OVERWRITE TABLE spark32_overwrite select amt1 from (select cast(amt1 as 
int) as amt1 from spark32_overwrite2 distribute by amt1);

{code}
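
A spark-shell sketch of checking the symptom reported in the linked SPARK-42550 after 
running this repro (the warehouse path below is an assumption; adjust it to your 
environment):

{code:scala}
// Run the failing INSERT OVERWRITE from the repro above, then check whether
// the target table's directory still exists. Path is illustrative only.
import scala.util.Try
spark.conf.set("spark.sql.ansi.enabled", "true")
val failed = Try(spark.sql(
  """INSERT OVERWRITE TABLE spark32_overwrite
    |SELECT amt1 FROM (SELECT CAST(amt1 AS int) AS amt1
    |                  FROM spark32_overwrite2 DISTRIBUTE BY amt1)""".stripMargin)).isFailure
val dir = new java.io.File("spark-warehouse/spark32_overwrite")
println(s"insert failed: $failed, table dir still exists: ${dir.exists}")
{code}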


> link issue SPARK-42550
> --
>
> Key: SPARK-42650
> URL: https://issues.apache.org/jira/browse/SPARK-42650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3
>Reporter: kevinshin
>Priority: Major
>
> When using
> [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/],
> if an `insert overwrite` statement hits an exception, a non-partitioned
> table loses its home directory and a partitioned table loses its partition
> directories.
>  
> My spark-defaults.conf config:
> spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension
>  
> Because I can't reopen SPARK-42550, please see that issue for details and
> reproduction steps:
> https://issues.apache.org/jira/browse/SPARK-42550
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42660) Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696059#comment-17696059
 ] 

Apache Spark commented on SPARK-42660:
--

User 'mskapilks' has created a pull request for this issue:
https://github.com/apache/spark/pull/40266
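
For context, a sketch of the query shape this rule targets (table and column names 
are made up):

{code:scala}
// RewritePredicateSubquery rewrites the IN subquery below into a left semi
// join; the improvement tracked here is to infer filters for that generated
// join as well.
spark.sql("""
  SELECT * FROM orders o
  WHERE o.customer_id IN (SELECT c.id FROM customers c WHERE c.active)
""").explain(true)
{code}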

> Infer filters for Join produced by IN and EXISTS clause 
> (RewritePredicateSubquery rule)
> ---
>
> Key: SPARK-42660
> URL: https://issues.apache.org/jira/browse/SPARK-42660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Kapil Singh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42660) Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42660:


Assignee: Apache Spark

> Infer filters for Join produced by IN and EXISTS clause 
> (RewritePredicateSubquery rule)
> ---
>
> Key: SPARK-42660
> URL: https://issues.apache.org/jira/browse/SPARK-42660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Kapil Singh
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42660) Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42660:


Assignee: (was: Apache Spark)

> Infer filters for Join produced by IN and EXISTS clause 
> (RewritePredicateSubquery rule)
> ---
>
> Key: SPARK-42660
> URL: https://issues.apache.org/jira/browse/SPARK-42660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Kapil Singh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42609) Add tests for grouping() and grouping_id() functions

2023-03-02 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42609.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40259
[https://github.com/apache/spark/pull/40259]

> Add tests for grouping() and grouping_id() functions
> 
>
> Key: SPARK-42609
> URL: https://issues.apache.org/jira/browse/SPARK-42609
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42556) Dataset.colregex should link a plan_id when it only matches a single column.

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42556:


Assignee: Apache Spark

> Dataset.colregex should link a plan_id when it only matches a single column.
> 
>
> Key: SPARK-42556
> URL: https://issues.apache.org/jira/browse/SPARK-42556
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Apache Spark
>Priority: Major
>
> When colregex returns a single column, it should link the plan's plan_id. For
> reference, here is the non-connect Dataset code that does this:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1512]
> This also needs to be fixed for the Python client.
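
A usage sketch of the single-column case with the classic Scala API (column name is 
made up); the Connect clients should behave the same way:

{code:scala}
// colRegex resolving to exactly one column. Per this ticket, the Connect
// clients should tag the returned Column with the originating plan's id,
// as the classic Dataset implementation referenced above does.
val df = spark.range(3).withColumnRenamed("id", "colA")
val single = df.colRegex("`colA`")
df.select(single).show()
{code}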



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42660) Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

2023-03-02 Thread Kapil Singh (Jira)
Kapil Singh created SPARK-42660:
---

 Summary: Infer filters for Join produced by IN and EXISTS clause 
(RewritePredicateSubquery rule)
 Key: SPARK-42660
 URL: https://issues.apache.org/jira/browse/SPARK-42660
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.1
Reporter: Kapil Singh






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42556) Dataset.colregex should link a plan_id when it only matches a single column.

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42556:


Assignee: (was: Apache Spark)

> Dataset.colregex should link a plan_id when it only matches a single column.
> 
>
> Key: SPARK-42556
> URL: https://issues.apache.org/jira/browse/SPARK-42556
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> When colregex returns a single column, it should link the plan's plan_id. For
> reference, here is the non-connect Dataset code that does this:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1512]
> This also needs to be fixed for the Python client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42556) Dataset.colregex should link a plan_id when it only matches a single column.

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696046#comment-17696046
 ] 

Apache Spark commented on SPARK-42556:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/40265

> Dataset.colregex should link a plan_id when it only matches a single column.
> 
>
> Key: SPARK-42556
> URL: https://issues.apache.org/jira/browse/SPARK-42556
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> When colregex returns a single column, it should link the plan's plan_id. For
> reference, here is the non-connect Dataset code that does this:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1512]
> This also needs to be fixed for the Python client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42473) An explicit cast will be needed when INSERT OVERWRITE SELECT UNION ALL

2023-03-02 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-42473.
-
Fix Version/s: 3.3.3
 Assignee: Runyao.Chen
   Resolution: Fixed

> An explicit cast will be needed when INSERT OVERWRITE SELECT UNION ALL
> --
>
> Key: SPARK-42473
> URL: https://issues.apache.org/jira/browse/SPARK-42473
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 3.3.1
> Environment: spark 3.3.1
>Reporter: kevinshin
>Assignee: Runyao.Chen
>Priority: Major
> Fix For: 3.3.3
>
>
> *When a 'union all' query has one select branch that uses a Literal as the
> column value and the other branch has a computed expression in the same
> column, the whole statement fails to compile. An explicit cast is needed.*
> For example:
> explain
> INSERT OVERWRITE TABLE test.spark33_decimal_orc
> select null as amt1, cast('256.99' as decimal(20,8)) as amt2
> union all
> select cast('200.99' as decimal(20,8))/100 as amt1, cast('256.99' as decimal(20,8)) as amt2;
> *will get this error:*
> org.apache.spark.sql.catalyst.expressions.Literal cannot be cast to
> org.apache.spark.sql.catalyst.expressions.AnsiCast
> The SQL needs to be changed to:
> explain
> INSERT OVERWRITE TABLE test.spark33_decimal_orc
> select null as amt1, cast('256.99' as decimal(20,8)) as amt2
> union all
> select cast(cast('200.99' as decimal(20,8))/100 as decimal(20,8)) as amt1, cast('256.99' as decimal(20,8)) as amt2;
>  
> *But this is not needed in Spark 3.2.1; is this a bug in Spark 3.3.1?*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696042#comment-17696042
 ] 

Apache Spark commented on SPARK-42635:
--

User 'chenhao-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/40264

> Several counter-intuitive behaviours in the TimestampAdd expression
> ---
>
> Key: SPARK-42635
> URL: https://issues.apache.org/jira/browse/SPARK-42635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
> Fix For: 3.4.1
>
>
> 1. When the time is close to a daylight saving time transition, the result may
> be discontinuous and not monotonic.
> We currently have:
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------------+
> |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------------+
> |                                                     2011-03-13 03:59:59|
> +------------------------------------------------------------------------+
> scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------+
> |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------+
> |                                               2011-03-13 03:00:00|
> +------------------------------------------------------------------+
> {code}
>  
> In the second query, adding one more second sets the time back one hour
> instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12
> 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due
> to the daylight saving time transition.
> The root cause of the problem is that the Spark code at
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790]
>  wrongly assumes every day has {{MICROS_PER_DAY}} microseconds, and does the
> day and time-in-day split before looking at the timezone.
> 2. Adding month, quarter, and year silently ignores Int overflow during unit
> conversion.
> The root cause is
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246].
>  {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking
> overflow. Note that we do have overflow checking when adding the amount to the
> timestamp, so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {code:scala}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {code:scala}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42650) link issue SPARK-42550

2023-03-02 Thread kevinshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696037#comment-17696037
 ] 

kevinshin commented on SPARK-42650:
---

Spark and Kyuubi both belong to Apache.

Could the Apache community help figure out the details of this issue? Will this
issue persist in future releases?

> link issue SPARK-42550
> --
>
> Key: SPARK-42650
> URL: https://issues.apache.org/jira/browse/SPARK-42650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3
>Reporter: kevinshin
>Priority: Major
>
> When using
> [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/],
> if an `insert overwrite` statement hits an exception, a non-partitioned
> table loses its home directory and a partitioned table loses its partition
> directories.
>  
> My spark-defaults.conf config:
> spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension
>  
> Because I can't reopen SPARK-42550, please see that issue for details and
> reproduction steps:
> https://issues.apache.org/jira/browse/SPARK-42550
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression

2023-03-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-42635.
--
Fix Version/s: 3.4.1
   Resolution: Fixed

Issue resolved by pull request 40237
[https://github.com/apache/spark/pull/40237]

> Several counter-intuitive behaviours in the TimestampAdd expression
> ---
>
> Key: SPARK-42635
> URL: https://issues.apache.org/jira/browse/SPARK-42635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
> Fix For: 3.4.1
>
>
> 1. When the time is close to a daylight saving time transition, the result may
> be discontinuous and not monotonic.
> We currently have:
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------------+
> |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------------+
> |                                                     2011-03-13 03:59:59|
> +------------------------------------------------------------------------+
> scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------+
> |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------+
> |                                               2011-03-13 03:00:00|
> +------------------------------------------------------------------+
> {code}
>  
> In the second query, adding one more second sets the time back one hour
> instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12
> 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due
> to the daylight saving time transition.
> The root cause of the problem is that the Spark code at
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790]
>  wrongly assumes every day has {{MICROS_PER_DAY}} microseconds, and does the
> day and time-in-day split before looking at the timezone.
> 2. Adding month, quarter, and year silently ignores Int overflow during unit
> conversion.
> The root cause is
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246].
>  {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking
> overflow. Note that we do have overflow checking when adding the amount to the
> timestamp, so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {code:scala}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {code:scala}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42635) Several counter-intuitive behaviours in the TimestampAdd expression

2023-03-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-42635:


Assignee: Chenhao Li

> Several counter-intuitive behaviours in the TimestampAdd expression
> ---
>
> Key: SPARK-42635
> URL: https://issues.apache.org/jira/browse/SPARK-42635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.3.2
>Reporter: Chenhao Li
>Assignee: Chenhao Li
>Priority: Major
>
> 1. When the time is close to a daylight saving time transition, the result may
> be discontinuous and not monotonic.
> We currently have:
> {code:scala}
> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
> scala> spark.sql("select timestampadd(second, 24 * 3600 - 1, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------------+
> |timestampadd(second, ((24 * 3600) - 1), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------------+
> |                                                     2011-03-13 03:59:59|
> +------------------------------------------------------------------------+
> scala> spark.sql("select timestampadd(second, 24 * 3600, timestamp'2011-03-12 03:00:00')").show
> +------------------------------------------------------------------+
> |timestampadd(second, (24 * 3600), TIMESTAMP '2011-03-12 03:00:00')|
> +------------------------------------------------------------------+
> |                                               2011-03-13 03:00:00|
> +------------------------------------------------------------------+
> {code}
>  
> In the second query, adding one more second sets the time back one hour
> instead. Plus, there are only {{23 * 3600}} seconds from {{2011-03-12
> 03:00:00}} to {{2011-03-13 03:00:00}}, instead of {{24 * 3600}} seconds, due
> to the daylight saving time transition.
> The root cause of the problem is that the Spark code at
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L790]
>  wrongly assumes every day has {{MICROS_PER_DAY}} microseconds, and does the
> day and time-in-day split before looking at the timezone.
> 2. Adding month, quarter, and year silently ignores Int overflow during unit
> conversion.
> The root cause is
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L1246].
>  {{quantity}} is multiplied by {{3}} or {{MONTHS_PER_YEAR}} without checking
> overflow. Note that we do have overflow checking when adding the amount to the
> timestamp, so the behavior is inconsistent.
> This can cause counter-intuitive results like this:
> {code:scala}
> scala> spark.sql("select timestampadd(quarter, 1431655764, timestamp'1970-01-01')").show
> +------------------------------------------------------------------+
> |timestampadd(quarter, 1431655764, TIMESTAMP '1970-01-01 00:00:00')|
> +------------------------------------------------------------------+
> |                                               1969-09-01 00:00:00|
> +------------------------------------------------------------------+
> {code}
> 3. Adding sub-month units (week, day, hour, minute, second, millisecond,
> microsecond) silently ignores Long overflow during unit conversion.
> This is similar to the previous problem:
> {code:scala}
> scala> spark.sql("select timestampadd(day, 106751992, timestamp'1970-01-01')").show(false)
> +-------------------------------------------------------------+
> |timestampadd(day, 106751992, TIMESTAMP '1970-01-01 00:00:00')|
> +-------------------------------------------------------------+
> |-290308-12-22 15:58:10.448384                                |
> +-------------------------------------------------------------+
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42656) Spark Connect Scala Client Shell Script

2023-03-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42656.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40257
[https://github.com/apache/spark/pull/40257]

> Spark Connect Scala Client Shell Script
> ---
>
> Key: SPARK-42656
> URL: https://issues.apache.org/jira/browse/SPARK-42656
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Major
> Fix For: 3.4.0
>
>
> Add a shell script that runs the Scala client in a Scala REPL, allowing users to
> connect to Spark Connect.
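
A minimal sketch of a session in that REPL, assuming the script binds the Connect 
session to a `spark` value the way spark-shell does (an assumption, not taken from 
this ticket) and that a Spark Connect server is reachable:

{code:scala}
// Ordinary DataFrame code then runs against the remote Spark Connect server.
spark.range(5).selectExpr("id", "id * id AS squared").show()
{code}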



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42656) Spark Connect Scala Client Shell Script

2023-03-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42656:


Assignee: Zhen Li

> Spark Connect Scala Client Shell Script
> ---
>
> Key: SPARK-42656
> URL: https://issues.apache.org/jira/browse/SPARK-42656
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Zhen Li
>Assignee: Zhen Li
>Priority: Major
>
> Add a shell script that runs the Scala client in a Scala REPL, allowing users to
> connect to Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42659) Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42659:


Assignee: (was: Apache Spark)

> Reimplement `FPGrowthModel.transform` with dataframe operations
> ---
>
> Key: SPARK-42659
> URL: https://issues.apache.org/jira/browse/SPARK-42659
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42659) Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42659:


Assignee: Apache Spark

> Reimplement `FPGrowthModel.transform` with dataframe operations
> ---
>
> Key: SPARK-42659
> URL: https://issues.apache.org/jira/browse/SPARK-42659
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42659) Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696019#comment-17696019
 ] 

Apache Spark commented on SPARK-42659:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40263

> Reimplement `FPGrowthModel.transform` with dataframe operations
> ---
>
> Key: SPARK-42659
> URL: https://issues.apache.org/jira/browse/SPARK-42659
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42659) Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17696018#comment-17696018
 ] 

Apache Spark commented on SPARK-42659:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40263

> Reimplement `FPGrowthModel.transform` with dataframe operations
> ---
>
> Key: SPARK-42659
> URL: https://issues.apache.org/jira/browse/SPARK-42659
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42659) Reimplement `FPGrowthModel.transform` with dataframe operations

2023-03-02 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-42659:
-

 Summary: Reimplement `FPGrowthModel.transform` with dataframe 
operations
 Key: SPARK-42659
 URL: https://issues.apache.org/jira/browse/SPARK-42659
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 3.5.0
Reporter: Ruifeng Zheng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42651) Optimize global sort to driver sort

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42651:


Assignee: (was: Apache Spark)

> Optimize global sort to driver sort
> ---
>
> Key: SPARK-42651
> URL: https://issues.apache.org/jira/browse/SPARK-42651
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> If the plan's output is small enough, it is more efficient to sort all rows on
> the driver side, which saves one shuffle.
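
A sketch of the pattern in question (names are made up):

{code:scala}
// Illustrative only: a small input followed by a global ORDER BY. The Sort
// currently requires an Exchange (range partitioning); the idea tracked here
// is that for sufficiently small plans the rows could instead be sorted on
// the driver, saving that shuffle.
import org.apache.spark.sql.functions.col
val small = spark.range(0, 1000).toDF("id")
small.orderBy(col("id").desc).explain()
{code}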



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42651) Optimize global sort to driver sort

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695991#comment-17695991
 ] 

Apache Spark commented on SPARK-42651:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/40262

> Optimize global sort to driver sort
> ---
>
> Key: SPARK-42651
> URL: https://issues.apache.org/jira/browse/SPARK-42651
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> If the plan's output is small enough, it is more efficient to sort all rows on
> the driver side, which saves one shuffle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42651) Optimize global sort to driver sort

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42651:


Assignee: Apache Spark

> Optimize global sort to driver sort
> ---
>
> Key: SPARK-42651
> URL: https://issues.apache.org/jira/browse/SPARK-42651
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Major
>
> If the plan's output is small enough, it is more efficient to sort all rows on
> the driver side, which saves one shuffle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42651) Optimize global sort to driver sort

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695992#comment-17695992
 ] 

Apache Spark commented on SPARK-42651:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/40262

> Optimize global sort to driver sort
> ---
>
> Key: SPARK-42651
> URL: https://issues.apache.org/jira/browse/SPARK-42651
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> If the plan's output is small enough, it is more efficient to sort all rows on
> the driver side, which saves one shuffle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42650) link issue SPARK-42550

2023-03-02 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695989#comment-17695989
 ] 

Yuming Wang commented on SPARK-42650:
-

It seems like a Kyuubi bug?

> link issue SPARK-42550
> --
>
> Key: SPARK-42650
> URL: https://issues.apache.org/jira/browse/SPARK-42650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3
>Reporter: kevinshin
>Priority: Major
>
> When using
> [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/],
> if an `insert overwrite` statement hits an exception, a non-partitioned
> table loses its home directory and a partitioned table loses its partition
> directories.
>  
> My spark-defaults.conf config:
> spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension
>  
> Because I can't reopen SPARK-42550, please see that issue for details and
> reproduction steps:
> https://issues.apache.org/jira/browse/SPARK-42550
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42556) Dataset.colregex should link a plan_id when it only matches a single column.

2023-03-02 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695965#comment-17695965
 ] 

jiaan.geng commented on SPARK-42556:


I'm working on it.

> Dataset.colregex should link a plan_id when it only matches a single column.
> 
>
> Key: SPARK-42556
> URL: https://issues.apache.org/jira/browse/SPARK-42556
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> When colregex returns a single column, it should link the plan's plan_id. For
> reference, here is the non-connect Dataset code that does this:
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L1512]
> This also needs to be fixed for the Python client.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42604) Implement functions.typedlit

2023-03-02 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695964#comment-17695964
 ] 

jiaan.geng commented on SPARK-42604:


I will take a look!

> Implement functions.typedlit
> 
>
> Key: SPARK-42604
> URL: https://issues.apache.org/jira/browse/SPARK-42604
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> We need to add functions.typedlit. This requires a change to the connect 
> protocol. See SPARK-42579
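
For reference, a usage sketch of the existing Scala `functions.typedLit` that the 
Connect client would mirror:

{code:scala}
// typedLit keeps the Scala type of the literal (e.g. Seq, Map, case class),
// unlike lit(), which only handles simple types.
import org.apache.spark.sql.functions.typedLit
val seqCol = typedLit(Seq(1, 2, 3))
val mapCol = typedLit(Map("a" -> 1, "b" -> 2))
spark.range(1).select(seqCol.as("xs"), mapCol.as("m")).show(truncate = false)
{code}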



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42647) Remove aliases from deprecated numpy data types

2023-03-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-42647:
-
Priority: Minor  (was: Major)

> Remove aliases from deprecated numpy data types
> ---
>
> Key: SPARK-42647
> URL: https://issues.apache.org/jira/browse/SPARK-42647
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>Reporter: Aimilios Tsouvelekakis
>Assignee: Aimilios Tsouvelekakis
>Priority: Minor
> Fix For: 3.3.3, 3.4.1
>
>
> Numpy has started changing the aliases of some of its data types. This means
> that users on the latest version of numpy will face either warnings or errors
> depending on the type they are using. This affects all users of numpy >
> 1.20.0. One of the types was fixed back in September with this
> [pull|https://github.com/apache/spark/pull/37817] request.
> The problem can be split into 2 types:
> [numpy 1.24.0|https://github.com/numpy/numpy/pull/22607]: The scalar type
> aliases ending in a 0 bit size (np.object0, np.str0, np.bytes0, np.void0,
> np.int0, np.uint0, as well as np.bool8) are now deprecated and will eventually
> be removed. As of numpy 1.25.0 they give a warning.
> [numpy 1.20.0|https://github.com/numpy/numpy/pull/14882]: Using the aliases
> of builtin types like np.int has been deprecated since numpy 1.20.0, and the
> aliases were removed in numpy 1.24.0.
> The changes are needed so pyspark can be compatible with the latest numpy and
> avoid
>  * attribute errors on data types deprecated since version 1.20.0:
> [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
>  * warnings on data types deprecated since version 1.24.0:
> [https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations]
>  
> From my research I see the following:
> The only functional changes are related to the conversion.py file.
> The rest of the changes are inside tests, in the user guide, or in some
> docstrings describing specific functions. Since I am not an expert in these
> tests, I defer to the reviewer and to people with more experience in the
> pyspark code.
> These types are aliases for classic python types, so they should work with
> all numpy versions
> [1|https://numpy.org/devdocs/release/1.20.0-notes.html],
> [2|https://stackoverflow.com/questions/74844262/how-can-i-solve-error-module-numpy-has-no-attribute-float-in-python].
>  The error or warning comes from the call into numpy.
>  
> For the affected versions I chose 3.3 and onwards, but I see that 3.2 is also
> still in the 18-month maintenance cadence, as it was released in
> October 2021.
>  
> The pull request: [https://github.com/apache/spark/pull/40220]
> Best Regards,
> Aimilios



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42647) Remove aliases from deprecated numpy data types

2023-03-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-42647:


Assignee: Aimilios Tsouvelekakis

> Remove aliases from deprecated numpy data types
> ---
>
> Key: SPARK-42647
> URL: https://issues.apache.org/jira/browse/SPARK-42647
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>Reporter: Aimilios Tsouvelekakis
>Assignee: Aimilios Tsouvelekakis
>Priority: Major
>
> Numpy has started changing the aliases of some of its data types. This means
> that users on the latest version of numpy will face either warnings or errors
> depending on the type they are using. This affects all users of numpy >
> 1.20.0. One of the types was fixed back in September with this
> [pull|https://github.com/apache/spark/pull/37817] request.
> The problem can be split into 2 types:
> [numpy 1.24.0|https://github.com/numpy/numpy/pull/22607]: The scalar type
> aliases ending in a 0 bit size (np.object0, np.str0, np.bytes0, np.void0,
> np.int0, np.uint0, as well as np.bool8) are now deprecated and will eventually
> be removed. As of numpy 1.25.0 they give a warning.
> [numpy 1.20.0|https://github.com/numpy/numpy/pull/14882]: Using the aliases
> of builtin types like np.int has been deprecated since numpy 1.20.0, and the
> aliases were removed in numpy 1.24.0.
> The changes are needed so pyspark can be compatible with the latest numpy and
> avoid
>  * attribute errors on data types deprecated since version 1.20.0:
> [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
>  * warnings on data types deprecated since version 1.24.0:
> [https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations]
>  
> From my research I see the following:
> The only functional changes are related to the conversion.py file.
> The rest of the changes are inside tests, in the user guide, or in some
> docstrings describing specific functions. Since I am not an expert in these
> tests, I defer to the reviewer and to people with more experience in the
> pyspark code.
> These types are aliases for classic python types, so they should work with
> all numpy versions
> [1|https://numpy.org/devdocs/release/1.20.0-notes.html],
> [2|https://stackoverflow.com/questions/74844262/how-can-i-solve-error-module-numpy-has-no-attribute-float-in-python].
>  The error or warning comes from the call into numpy.
>  
> For the affected versions I chose 3.3 and onwards, but I see that 3.2 is also
> still in the 18-month maintenance cadence, as it was released in
> October 2021.
>  
> The pull request: [https://github.com/apache/spark/pull/40220]
> Best Regards,
> Aimilios



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42647) Remove aliases from deprecated numpy data types

2023-03-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42647.
--
Fix Version/s: 3.3.3
   3.4.1
   Resolution: Fixed

Issue resolved by pull request 40220
[https://github.com/apache/spark/pull/40220]

> Remove aliases from deprecated numpy data types
> ---
>
> Key: SPARK-42647
> URL: https://issues.apache.org/jira/browse/SPARK-42647
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0, 3.3.1, 3.3.3, 3.3.2, 3.4.0, 3.4.1
>Reporter: Aimilios Tsouvelekakis
>Assignee: Aimilios Tsouvelekakis
>Priority: Major
> Fix For: 3.3.3, 3.4.1
>
>
> Numpy has started changing the aliases of some of its data types. This means
> that users on the latest version of numpy will face either warnings or errors
> depending on the type they are using. This affects all users of numpy >
> 1.20.0. One of the types was fixed back in September with this
> [pull|https://github.com/apache/spark/pull/37817] request.
> The problem can be split into 2 types:
> [numpy 1.24.0|https://github.com/numpy/numpy/pull/22607]: The scalar type
> aliases ending in a 0 bit size (np.object0, np.str0, np.bytes0, np.void0,
> np.int0, np.uint0, as well as np.bool8) are now deprecated and will eventually
> be removed. As of numpy 1.25.0 they give a warning.
> [numpy 1.20.0|https://github.com/numpy/numpy/pull/14882]: Using the aliases
> of builtin types like np.int has been deprecated since numpy 1.20.0, and the
> aliases were removed in numpy 1.24.0.
> The changes are needed so pyspark can be compatible with the latest numpy and
> avoid
>  * attribute errors on data types deprecated since version 1.20.0:
> [https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations]
>  * warnings on data types deprecated since version 1.24.0:
> [https://numpy.org/devdocs/release/1.24.0-notes.html#deprecations]
>  
> From my research I see the following:
> The only functional changes are related to the conversion.py file.
> The rest of the changes are inside tests, in the user guide, or in some
> docstrings describing specific functions. Since I am not an expert in these
> tests, I defer to the reviewer and to people with more experience in the
> pyspark code.
> These types are aliases for classic python types, so they should work with
> all numpy versions
> [1|https://numpy.org/devdocs/release/1.20.0-notes.html],
> [2|https://stackoverflow.com/questions/74844262/how-can-i-solve-error-module-numpy-has-no-attribute-float-in-python].
>  The error or warning comes from the call into numpy.
>  
> For the affected versions I chose 3.3 and onwards, but I see that 3.2 is also
> still in the 18-month maintenance cadence, as it was released in
> October 2021.
>  
> The pull request: [https://github.com/apache/spark/pull/40220]
> Best Regards,
> Aimilios



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41718) Numpy 1.24 breaks PySpark due to use of `np.bool` instead of `np.bool_` in many places

2023-03-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-41718.
--
Resolution: Duplicate

> Numpy 1.24 breaks PySpark due to use of `np.bool` instead of `np.bool_` in 
> many places
> --
>
> Key: SPARK-41718
> URL: https://issues.apache.org/jira/browse/SPARK-41718
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Phillip Cloud
>Priority: Major
>
> In numpy 1.24, `numpy.bool` was removed (it was deprecated prior to 1.24). 
> This causes many APIs in pyspark to stop working because an AttributeError is 
> raised. The alternative is to use `numpy.bool_` (trailing underscore).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42615) Refactor the AnalyzePlan RPC and add `session.version`

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695943#comment-17695943
 ] 

Apache Spark commented on SPARK-42615:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40261

> Refactor the AnalyzePlan RPC and add `session.version`
> --
>
> Key: SPARK-42615
> URL: https://issues.apache.org/jira/browse/SPARK-42615
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-41497) Accumulator undercounting in the case of retry task with rdd cache

2023-03-02 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-41497.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 39459
[https://github.com/apache/spark/pull/39459]

> Accumulator undercounting in the case of retry task with rdd cache
> --
>
> Key: SPARK-41497
> URL: https://issues.apache.org/jira/browse/SPARK-41497
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.2, 3.3.1
>Reporter: wuyi
>Assignee: Tengfei Huang
>Priority: Major
> Fix For: 3.5.0
>
>
> An accumulator can be undercounted when a retried task has an rdd cache. See
> the example below; you can also find the complete and reproducible
> example at
> [https://github.com/apache/spark/compare/master...Ngone51:spark:fix-acc]
>   
> {code:scala}
> test("SPARK-XXX") {
>   // Set up a cluster with 2 executors
>   val conf = new SparkConf()
> .setMaster("local-cluster[2, 1, 
> 1024]").setAppName("TaskSchedulerImplSuite")
>   sc = new SparkContext(conf)
>   // Set up a custom task scheduler. The scheduler will fail the first task 
> attempt of the job
>   // submitted below. In particular, the first attempt would succeed in its 
> computation
>   // (accumulator accounting, result caching) but fail to report its success 
> status due
>   // to the concurrent executor loss. The second task attempt would succeed.
>   taskScheduler = setupSchedulerWithCustomStatusUpdate(sc)
>   val myAcc = sc.longAccumulator("myAcc")
>   // Initiate a rdd with only one partition so there's only one task and 
> specify the storage level
>   // with MEMORY_ONLY_2 so that the rdd result will be cached on both two 
> executors.
>   val rdd = sc.parallelize(0 until 10, 1).mapPartitions { iter =>
> myAcc.add(100)
> iter.map(x => x + 1)
>   }.persist(StorageLevel.MEMORY_ONLY_2)
>   // This will pass since the second task attempt will succeed
>   assert(rdd.count() === 10)
>   // This will fail due to `myAcc.add(100)` won't be executed during the 
> second task attempt's
>   // execution. Because the second task attempt will load the rdd cache 
> directly instead of
>   // executing the task function so `myAcc.add(100)` is skipped.
>   assert(myAcc.value === 100)
> } {code}
>  
> We could also hit this issue with decommission even if the rdd only has one
> copy. For example, decommission could migrate the rdd cache block to another
> executor (the result is effectively the same as having 2 copies) and the
> decommissioned executor could be lost before the task reports its success
> status to the driver.
>  
> The issue is a bit more complicated to fix than expected. I have tried
> several fixes, but none of them is ideal:
> Option 1: Clean up any rdd cache related to the failed task: in practice, 
> this option can already fix the issue in most cases. However, theoretically, 
> rdd cache could be reported to the driver right after the driver cleans up 
> the failed task's caches due to asynchronous communication. So this option 
> can’t resolve the issue thoroughly;
> Option 2: Disallow rdd cache reuse across the task attempts for the same 
> task: this option can 100% fix the issue. The problem is this way can also 
> affect the case where rdd cache can be reused across the attempts (e.g., when 
> there is no accumulator operation in the task), which can have perf 
> regression;
> Option 3: Introduce accumulator cache: first, this requires a new framework 
> for supporting accumulator cache; second, the driver should improve its logic 
> to distinguish whether the accumulator cache value should be reported to the 
> user to avoid overcounting. For example, in the case of task retry, the value 
> should be reported. However, in the case of rdd cache reuse, the value 
> shouldn’t be reported (should it?);
> Option 4: Validate task success when a task tries to load the rdd cache: this
> approach defines an rdd cache as valid/accessible only if the producing task
> has succeeded. This could be either overkill or a bit complex (because Spark
> currently cleans up the task state once it's finished, so we would need to
> maintain a structure that records whether a task once succeeded or not).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41497) Accumulator undercounting in the case of retry task with rdd cache

2023-03-02 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-41497:
---

Assignee: Tengfei Huang

> Accumulator undercounting in the case of retry task with rdd cache
> --
>
> Key: SPARK-41497
> URL: https://issues.apache.org/jira/browse/SPARK-41497
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.8, 3.0.3, 3.1.3, 3.2.2, 3.3.1
>Reporter: wuyi
>Assignee: Tengfei Huang
>Priority: Major
>
> An accumulator can be undercounted when a retried task has an rdd cache. See
> the example below; you can also find the complete and reproducible
> example at
> [https://github.com/apache/spark/compare/master...Ngone51:spark:fix-acc]
>   
> {code:scala}
> test("SPARK-XXX") {
>   // Set up a cluster with 2 executors
>   val conf = new SparkConf()
> .setMaster("local-cluster[2, 1, 
> 1024]").setAppName("TaskSchedulerImplSuite")
>   sc = new SparkContext(conf)
>   // Set up a custom task scheduler. The scheduler will fail the first task 
> attempt of the job
>   // submitted below. In particular, the first attempt would succeed in its 
> computation
>   // (accumulator accounting, result caching) but fail to report its success 
> status due
>   // to the concurrent executor loss. The second task attempt would succeed.
>   taskScheduler = setupSchedulerWithCustomStatusUpdate(sc)
>   val myAcc = sc.longAccumulator("myAcc")
>   // Initiate a rdd with only one partition so there's only one task and 
> specify the storage level
>   // with MEMORY_ONLY_2 so that the rdd result will be cached on both two 
> executors.
>   val rdd = sc.parallelize(0 until 10, 1).mapPartitions { iter =>
> myAcc.add(100)
> iter.map(x => x + 1)
>   }.persist(StorageLevel.MEMORY_ONLY_2)
>   // This will pass since the second task attempt will succeed
>   assert(rdd.count() === 10)
>   // This will fail due to `myAcc.add(100)` won't be executed during the 
> second task attempt's
>   // execution. Because the second task attempt will load the rdd cache 
> directly instead of
>   // executing the task function so `myAcc.add(100)` is skipped.
>   assert(myAcc.value === 100)
> } {code}
>  
> We could also hit this issue with decommission even if the rdd only has one
> copy. For example, decommission could migrate the rdd cache block to another
> executor (the result is effectively the same as having 2 copies) and the
> decommissioned executor could be lost before the task reports its success
> status to the driver.
>  
> The issue is a bit more complicated to fix than expected. I have tried
> several fixes, but none of them is ideal:
> Option 1: Clean up any rdd cache related to the failed task: in practice, 
> this option can already fix the issue in most cases. However, theoretically, 
> rdd cache could be reported to the driver right after the driver cleans up 
> the failed task's caches due to asynchronous communication. So this option 
> can’t resolve the issue thoroughly;
> Option 2: Disallow rdd cache reuse across the task attempts for the same 
> task: this option can 100% fix the issue. The problem is this way can also 
> affect the case where rdd cache can be reused across the attempts (e.g., when 
> there is no accumulator operation in the task), which can have perf 
> regression;
> Option 3: Introduce accumulator cache: first, this requires a new framework 
> for supporting accumulator cache; second, the driver should improve its logic 
> to distinguish whether the accumulator cache value should be reported to the 
> user to avoid overcounting. For example, in the case of task retry, the value 
> should be reported. However, in the case of rdd cache reuse, the value 
> shouldn’t be reported (should it?);
> Option 4: Validate task success when a task tries to load the rdd cache: this
> approach defines an rdd cache as valid/accessible only if the producing task
> has succeeded. This could be either overkill or a bit complex (because Spark
> currently cleans up the task state once it's finished, so we would need to
> maintain a structure that records whether a task once succeeded or not).



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42630) Make `parse_data_type` use new proto message `DDLParse`

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695928#comment-17695928
 ] 

Apache Spark commented on SPARK-42630:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40260

> Make `parse_data_type` use new proto message `DDLParse`
> ---
>
> Key: SPARK-42630
> URL: https://issues.apache.org/jira/browse/SPARK-42630
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42630) Make `parse_data_type` use new proto message `DDLParse`

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695927#comment-17695927
 ] 

Apache Spark commented on SPARK-42630:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/40260

> Make `parse_data_type` use new proto message `DDLParse`
> ---
>
> Key: SPARK-42630
> URL: https://issues.apache.org/jira/browse/SPARK-42630
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42658) Handle timeouts and CRC failures during artifact transfer

2023-03-02 Thread Venkata Sai Akhil Gudesa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkata Sai Akhil Gudesa updated SPARK-42658:
-
Description: 
We would need a retry mechanism on the client side to handle CRC failures during 
artifact transfer, because the server discards data that fails the CRC check, which 
may lead to missing artifacts during UDF execution.

We also require a timeout policy to prevent indefinitely waiting for the server 
reply.

  was:We would need a retry mechanism on the client side to handle CRC failures 
during artifact transfer. The server would discard data that fails CRC and 
hence, may lead to missing artifacts during UDF execution. 


> Handle timeouts and CRC failures during artifact transfer
> -
>
> Key: SPARK-42658
> URL: https://issues.apache.org/jira/browse/SPARK-42658
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> We would need a retry mechanism on the client side to handle CRC failures during 
> artifact transfer, because the server discards data that fails the CRC check, which 
> may lead to missing artifacts during UDF execution. 
> We also require a timeout policy to prevent indefinitely waiting for the 
> server reply.
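
A minimal, transport-agnostic sketch of such a client-side policy is below. 
`sendChunk` is a hypothetical stand-in for the actual Spark Connect transfer call 
(assumed to return the CRC the server computed for the received bytes); the real 
message names and semantics are whatever the protocol defines.

{code:scala}
import java.util.zip.CRC32
import scala.concurrent.duration._

// Sketch only: resend a chunk until the server-side CRC matches, bounded by a retry
// budget and an overall deadline. `sendChunk` is a hypothetical transport call.
def transferWithRetry(
    chunk: Array[Byte],
    sendChunk: Array[Byte] => Long,
    maxRetries: Int = 3,
    timeout: FiniteDuration = 30.seconds): Unit = {
  val crc = new CRC32
  crc.update(chunk)
  val localCrc = crc.getValue
  val deadline = timeout.fromNow
  var attempt = 0
  var done = false
  while (!done) {
    attempt += 1
    if (deadline.isOverdue()) throw new RuntimeException("artifact transfer timed out")
    if (sendChunk(chunk) == localCrc) {
      done = true                            // server accepted the chunk
    } else if (attempt >= maxRetries) {
      throw new RuntimeException("CRC mismatch after retries")
    }
    // else: the server discarded the chunk; loop and resend
  }
}
{code}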



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42658) Handle timeouts and CRC failures during artifact transfer

2023-03-02 Thread Venkata Sai Akhil Gudesa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkata Sai Akhil Gudesa updated SPARK-42658:
-
Summary: Handle timeouts and CRC failures during artifact transfer  (was: 
Handle CRC failures during artifact transfer)

> Handle timeouts and CRC failures during artifact transfer
> -
>
> Key: SPARK-42658
> URL: https://issues.apache.org/jira/browse/SPARK-42658
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> We would need a retry mechanism on the client side to handle CRC failures during 
> artifact transfer. The server discards data that fails the CRC check, which may 
> lead to missing artifacts during UDF execution. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42658) Handle CRC failures during artifact transfer

2023-03-02 Thread Venkata Sai Akhil Gudesa (Jira)
Venkata Sai Akhil Gudesa created SPARK-42658:


 Summary: Handle CRC failures during artifact transfer
 Key: SPARK-42658
 URL: https://issues.apache.org/jira/browse/SPARK-42658
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.4.0
Reporter: Venkata Sai Akhil Gudesa


We would need a retry mechanism on the client side to handle CRC failures during 
artifact transfer. The server discards data that fails the CRC check, which may 
lead to missing artifacts during UDF execution. 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42657) Support to find and transfer client-side REPL classfiles to server as artifacts

2023-03-02 Thread Venkata Sai Akhil Gudesa (Jira)
Venkata Sai Akhil Gudesa created SPARK-42657:


 Summary: Support to find and transfer client-side REPL classfiles 
to server as artifacts  
 Key: SPARK-42657
 URL: https://issues.apache.org/jira/browse/SPARK-42657
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.4.0
Reporter: Venkata Sai Akhil Gudesa


To run UDFs that are defined in the client-side REPL, we require a mechanism that 
can find the local REPL classfiles and then utilise the mechanism from 
https://issues.apache.org/jira/browse/SPARK-42653 to transfer them to the server as 
artifacts.
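
A generic sketch of the client-side discovery half is below. Where the REPL writes 
its compiled wrappers is an assumption (it differs between the stock Scala REPL and 
Ammonite), so `replOutputDir` is a placeholder, and the transfer itself would go 
through the SPARK-42653 mechanism. Assumes Scala 2.13 for CollectionConverters.

{code:scala}
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Sketch only: walk a REPL output directory and collect the generated .class files
// so they can be shipped to the server via the artifact-transfer mechanism.
// `replOutputDir` is a placeholder; the real location depends on the REPL in use.
def collectReplClassFiles(replOutputDir: Path): Seq[Path] = {
  val stream = Files.walk(replOutputDir)
  try stream.iterator().asScala.filter(_.toString.endsWith(".class")).toSeq
  finally stream.close()
}
{code}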



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42657) Support to find and transfer client-side REPL classfiles to server as artifacts

2023-03-02 Thread Venkata Sai Akhil Gudesa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkata Sai Akhil Gudesa updated SPARK-42657:
-
Epic Link: SPARK-42554

> Support to find and transfer client-side REPL classfiles to server as 
> artifacts  
> -
>
> Key: SPARK-42657
> URL: https://issues.apache.org/jira/browse/SPARK-42657
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> To run UDFs that are defined in the client-side REPL, we require a mechanism that 
> can find the local REPL classfiles and then utilise the mechanism from 
> https://issues.apache.org/jira/browse/SPARK-42653 to transfer them to the server 
> as artifacts.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42609) Add tests for grouping() and grouping_id() functions

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42609:


Assignee: Apache Spark  (was: Rui Wang)

> Add tests for grouping() and grouping_id() functions
> 
>
> Key: SPARK-42609
> URL: https://issues.apache.org/jira/browse/SPARK-42609
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42609) Add tests for grouping() and grouping_id() functions

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42609:


Assignee: Rui Wang  (was: Apache Spark)

> Add tests for grouping() and grouping_id() functions
> 
>
> Key: SPARK-42609
> URL: https://issues.apache.org/jira/browse/SPARK-42609
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42609) Add tests for grouping() and grouping_id() functions

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42609?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695912#comment-17695912
 ] 

Apache Spark commented on SPARK-42609:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/40259

> Add tests for grouping() and grouping_id() functions
> 
>
> Key: SPARK-42609
> URL: https://issues.apache.org/jira/browse/SPARK-42609
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42640) Remove stale entries from the excluding rules for CompatibilitySuite

2023-03-02 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-42640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-42640.
---
Fix Version/s: 3.4.1
   Resolution: Fixed

> Remove stale entries from the excluding rules for CompatibilitySuite
> --
>
> Key: SPARK-42640
> URL: https://issues.apache.org/jira/browse/SPARK-42640
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
> Fix For: 3.4.1
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42640) Remove stale entries from the excluding rules for CompatibilitySuite

2023-03-02 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-42640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell updated SPARK-42640:
--
Epic Link: SPARK-42554

> Remove stale entries from the excluding rules for CompatibilitySuite
> --
>
> Key: SPARK-42640
> URL: https://issues.apache.org/jira/browse/SPARK-42640
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.1
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42655) Incorrect ambiguous column reference error

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42655:


Assignee: Apache Spark

> Incorrect ambiguous column reference error
> --
>
> Key: SPARK-42655
> URL: https://issues.apache.org/jira/browse/SPARK-42655
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shrikant Prasad
>Assignee: Apache Spark
>Priority: Major
>
> val df1 = 
> sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
> "col5")
> val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
> val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
> df2.select("id").show()
>  
> This query runs fine.
>  
> But when we change the casing of the op_cols to have a mix of upper & lower 
> case ("id" & "ID"), it throws an ambiguous col ref error:
>  
> val df1 = 
> sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
> "col5")
> val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
> val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
> df3.select("id").show()
> org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could 
> be: id, id.
>   at 
> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:363)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:112)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpressionByPlanChildren$1(Analyzer.scala:1857)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpression$2(Analyzer.scala:1787)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.innerResolve$1(Analyzer.scala:1794)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpression(Analyzer.scala:1812)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionByPlanChildren(Analyzer.scala:1863)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$17.$anonfun$applyOrElse$94(Analyzer.scala:1577)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209)
>  
> Since Spark is case-insensitive, it should also work in the second case, where the 
> column list has both upper- and lower-case column names.
> It also works fine in Spark 2.3.
>  
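
One user-side workaround (a sketch only, not a fix for the analyzer itself) is to 
de-duplicate the projection list case-insensitively before calling select, assuming 
the default spark.sql.caseSensitive=false and the `df1` defined in the reproduction 
above:

{code:scala}
// Workaround sketch: drop case-insensitive duplicates ("id" vs "ID") before select,
// so only one reference reaches the analyzer. Assumes df1 from the reproduction above.
val opColsMixedCase = List("id", "col2", "col3", "col4", "col5", "ID")
val dedupedCols = opColsMixedCase.foldLeft(Vector.empty[String]) { (acc, c) =>
  if (acc.exists(_.equalsIgnoreCase(c))) acc else acc :+ c
}
val df3 = df1.select(dedupedCols.head, dedupedCols.tail: _*)
df3.select("id").show()
{code}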



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42655) Incorrect ambiguous column reference error

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695892#comment-17695892
 ] 

Apache Spark commented on SPARK-42655:
--

User 'shrprasa' has created a pull request for this issue:
https://github.com/apache/spark/pull/40258

> Incorrect ambiguous column reference error
> --
>
> Key: SPARK-42655
> URL: https://issues.apache.org/jira/browse/SPARK-42655
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shrikant Prasad
>Priority: Major
>
> val df1 = 
> sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
> "col5")
> val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
> val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
> df2.select("id").show()
>  
> This query runs fine.
>  
> But when we change the casing of the op_cols to have a mix of upper & lower 
> case ("id" & "ID"), it throws an ambiguous col ref error:
>  
> val df1 = 
> sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
> "col5")
> val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
> val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
> df3.select("id").show()
> org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could 
> be: id, id.
>   at 
> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:363)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:112)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpressionByPlanChildren$1(Analyzer.scala:1857)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpression$2(Analyzer.scala:1787)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.innerResolve$1(Analyzer.scala:1794)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpression(Analyzer.scala:1812)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionByPlanChildren(Analyzer.scala:1863)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$17.$anonfun$applyOrElse$94(Analyzer.scala:1577)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209)
>  
> Since Spark is case-insensitive, it should also work in the second case, where the 
> column list has both upper- and lower-case column names.
> It also works fine in Spark 2.3.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42655) Incorrect ambiguous column reference error

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42655:


Assignee: (was: Apache Spark)

> Incorrect ambiguous column reference error
> --
>
> Key: SPARK-42655
> URL: https://issues.apache.org/jira/browse/SPARK-42655
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shrikant Prasad
>Priority: Major
>
> val df1 = 
> sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
> "col5")
> val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
> val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
> df2.select("id").show()
>  
> This query runs fine.
>  
> But when we change the casing of the op_cols to have a mix of upper & lower 
> case ("id" & "ID"), it throws an ambiguous col ref error:
>  
> val df1 = 
> sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
> "col5")
> val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
> val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
> df3.select("id").show()
> org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could 
> be: id, id.
>   at 
> org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:363)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:112)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpressionByPlanChildren$1(Analyzer.scala:1857)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpression$2(Analyzer.scala:1787)
>   at 
> org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:60)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.innerResolve$1(Analyzer.scala:1794)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpression(Analyzer.scala:1812)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionByPlanChildren(Analyzer.scala:1863)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$17.$anonfun$applyOrElse$94(Analyzer.scala:1577)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:286)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209)
>  
> Since Spark is case-insensitive, it should also work in the second case, where the 
> column list has both upper- and lower-case column names.
> It also works fine in Spark 2.3.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42656) Spark Connect Scala Client Shell Script

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42656:


Assignee: (was: Apache Spark)

> Spark Connect Scala Client Shell Script
> ---
>
> Key: SPARK-42656
> URL: https://issues.apache.org/jira/browse/SPARK-42656
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Zhen Li
>Priority: Major
>
> Adding a shell script to run the Scala client in a Scala REPL, allowing users to 
> connect to Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42656) Spark Connect Scala Client Shell Script

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42656:


Assignee: Apache Spark

> Spark Connect Scala Client Shell Script
> ---
>
> Key: SPARK-42656
> URL: https://issues.apache.org/jira/browse/SPARK-42656
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Zhen Li
>Assignee: Apache Spark
>Priority: Major
>
> Adding a shell script to run the Scala client in a Scala REPL, allowing users to 
> connect to Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42656) Spark Connect Scala Client Shell Script

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695879#comment-17695879
 ] 

Apache Spark commented on SPARK-42656:
--

User 'zhenlineo' has created a pull request for this issue:
https://github.com/apache/spark/pull/40257

> Spark Connect Scala Client Shell Script
> ---
>
> Key: SPARK-42656
> URL: https://issues.apache.org/jira/browse/SPARK-42656
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Zhen Li
>Priority: Major
>
> Adding a shell script to run the Scala client in a Scala REPL, allowing users to 
> connect to Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42656) Spark Connect Scala Client Shell Script

2023-03-02 Thread Zhen Li (Jira)
Zhen Li created SPARK-42656:
---

 Summary: Spark Connect Scala Client Shell Script
 Key: SPARK-42656
 URL: https://issues.apache.org/jira/browse/SPARK-42656
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.4.0
Reporter: Zhen Li


Adding a shell script to run the Scala client in a Scala REPL, allowing users to 
connect to Spark Connect.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36604) timestamp type column analyze result is wrong

2023-03-02 Thread Ritika Maheshwari (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695869#comment-17695869
 ] 

Ritika Maheshwari commented on SPARK-36604:
---

Seems to be working correctly in Spark 3.3.0

spark-sql> insert into a values(cast('2021-08-15 15:30:01' as timestamp)
         > );
23/03/02 11:04:11 WARN ObjectStore: Failed to get database global_temp, 
returning NoSuchObjectException
Time taken: 3.278 seconds
spark-sql> select * from a;
2021-08-15 15:30:01
Time taken: 0.782 seconds, Fetched 1 row(s)
spark-sql> analyze table a compute statistics for columns a;
Time taken: 1.882 seconds
spark-sql> desc formatted a a;
col_name        a
data_type       timestamp
comment NULL
min     2021-08-15 15:30:01.00 -0700
max     2021-08-15 15:30:01.00 -0700
num_nulls       0
distinct_count  1
avg_col_len     8
max_col_len     8
histogram       NULL
Time taken: 0.095 seconds, Fetched 10 row(s)
spark-sql> desc a;
a                       timestamp                                   
Time taken: 0.059 seconds, Fetched 1 row(s)
spark-sql>

> timestamp type column analyze result is wrong
> -
>
> Key: SPARK-36604
> URL: https://issues.apache.org/jira/browse/SPARK-36604
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2
> Environment: Spark 3.1.1
>Reporter: YuanGuanhu
>Priority: Major
>
> when we create a table with a timestamp column, the min and max values in the 
> analyze result for that column are wrong
> eg:
> {code}
> > select * from a;
> {code}
> {code}
> 2021-08-15 15:30:01
> Time taken: 2.789 seconds, Fetched 1 row(s)
> spark-sql> desc formatted a a;
> col_name a
> data_type timestamp
> comment NULL
> min 2021-08-15 07:30:01.00
> max 2021-08-15 07:30:01.00
> num_nulls 0
> distinct_count 1
> avg_col_len 8
> max_col_len 8
> histogram NULL
> Time taken: 0.278 seconds, Fetched 10 row(s)
> spark-sql> desc a;
> a timestamp NULL
> Time taken: 1.432 seconds, Fetched 1 row(s)
> {code}
>  
> reproduce step:
> {code}
> create table a(a timestamp);
> insert into a select '2021-08-15 15:30:01';
> analyze table a compute statistics for columns a;
> desc formatted a a;
> select * from a;
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42655) Incorrect ambiguous column reference error

2023-03-02 Thread Shrikant Prasad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shrikant Prasad updated SPARK-42655:

Description: 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
df2.select("id").show()
 
This query runs fine.
 
But when we change the casing of the op_cols to have a mix of upper & lower case 
("id" & "ID"), it throws an ambiguous col ref error:
 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_mixed_case = List("id","col2","col3","col4", "col5", "ID")
val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
df3.select("id").show()



org.apache.spark.sql.AnalysisException: Reference 'id' is ambiguous, could be: 
id, id.

  at 
org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:363)

  at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveChildren(LogicalPlan.scala:112)

  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpressionByPlanChildren$1(Analyzer.scala:1857)

  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$resolveExpression$2(Analyzer.scala:1787)

  at 
org.apache.spark.sql.catalyst.analysis.package$.withPosition(package.scala:60)

  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.innerResolve$1(Analyzer.scala:1794)

  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpression(Analyzer.scala:1812)

  at 
org.apache.spark.sql.catalyst.analysis.Analyzer.resolveExpressionByPlanChildren(Analyzer.scala:1863)

  at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$17.$anonfun$applyOrElse$94(Analyzer.scala:1577)

  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:193)

  at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)

  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:193)

  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:204)

  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:209)

  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)

  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)

  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)

  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)

  at scala.collection.TraversableLike.map(TraversableLike.scala:286)

  at scala.collection.TraversableLike.map$(TraversableLike.scala:279)

  at scala.collection.AbstractTraversable.map(Traversable.scala:108)

  at 
org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:209)

 


Since Spark is case-insensitive, it should also work in the second case, where the 
column list has both upper- and lower-case column names.

It also works fine in Spark 2.3.
 

  was:
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
This query runs fine.
 
But when we change the casing of the op_cols to have mix of upper & lower case 
("id" & "ID") it throws an ambiguous col ref error:
 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
Since, Spark is case insensitive, it should work for second case also when we 
have upper and lower case column names in the column list.

It also works fine in Spark 2.3.
 


> Incorrect ambiguous column reference error
> --
>
> Key: SPARK-42655
> URL: https://issues.apache.org/jira/browse/SPARK-42655
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shrikant Prasad
>Priority: Major
>
> val df1 = 
> sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
> "col5")
> val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
> val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
> df2.select("id").show()
>  
> This query runs fine.
>  
> But when we change the casing of the op_cols to have mix of upper & lower 
> case ("id" & "ID") it throws an ambiguous col ref error:
>  
> val df1 = 
> 

[jira] [Updated] (SPARK-42655) Incorrect ambiguous column reference error

2023-03-02 Thread Shrikant Prasad (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shrikant Prasad updated SPARK-42655:

Description: 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
This query runs fine.
 
But when we change the casing of the op_cols to have mix of upper & lower case 
("id" & "ID") it throws an ambiguous col ref error:
 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
Since, Spark is case insensitive, it should work for second case also when we 
have upper and lower case column names in the column list.

It also works fine in Spark 2.3.
 

  was:
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
This query runs fine.
 
But when we change the casing of the op_cols to have mix of upper & lower case 
("id" & "ID") it throws an ambiguous col ref error:
 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
Since, Spark is case insensitive, it should work for second case also when we 
have upper and lower case column names in the column list.
 


> Incorrect ambiguous column reference error
> --
>
> Key: SPARK-42655
> URL: https://issues.apache.org/jira/browse/SPARK-42655
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Shrikant Prasad
>Priority: Major
>
> val df1 = 
> sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
> "col5")
> val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
> val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
> df2.select("id").show() 
>  
> This query runs fine.
>  
> But when we change the casing of the op_cols to have mix of upper & lower 
> case ("id" & "ID") it throws an ambiguous col ref error:
>  
> val df1 = 
> sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
> "col5")
> val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID")
> val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
> df2.select("id").show() 
>  
> Since, Spark is case insensitive, it should work for second case also when we 
> have upper and lower case column names in the column list.
> It also works fine in Spark 2.3.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42655) Incorrect ambiguous column reference error

2023-03-02 Thread Shrikant Prasad (Jira)
Shrikant Prasad created SPARK-42655:
---

 Summary: Incorrect ambiguous column reference error
 Key: SPARK-42655
 URL: https://issues.apache.org/jira/browse/SPARK-42655
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Shrikant Prasad


val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
This query runs fine.
 
But when we change the casing of the op_cols to have a mix of upper & lower case 
("id" & "ID"), it throws an ambiguous col ref error:
 
val df1 = 
sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", 
"col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_same_case .head, op_cols_same_case .tail: _*)
df2.select("id").show() 
 
Since Spark is case-insensitive, it should also work in the second case, where the 
column list has both upper- and lower-case column names.
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42599) Make `CompatibilitySuite` as a tool like `dev/mima`

2023-03-02 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-42599?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hövell resolved SPARK-42599.
---
Fix Version/s: 3.4.0
 Assignee: Yang Jie
   Resolution: Fixed

> Make `CompatibilitySuite` as a tool like `dev/mima`
> ---
>
> Key: SPARK-42599
> URL: https://issues.apache.org/jira/browse/SPARK-42599
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Major
> Fix For: 3.4.0
>
>
> Using maven to test `CompatibilitySuite` requires some pre-work (the sql and 
> connect-client-jvm modules need to be built with maven before the test), so when 
> we run `mvn package test`, the following errors occur:
>  
> {code:java}
> CompatibilitySuite:
> - compatibility MiMa tests *** FAILED ***
>   java.lang.AssertionError: assertion failed: Failed to find the jar inside 
> folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
>   at scala.Predef$.assert(Predef.scala:223)
>   at 
> org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
>   at 
> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
>   at 
> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
>   at 
> org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$1(CompatibilitySuite.scala:69)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   ...
> - compatibility API tests: Dataset *** FAILED ***
>   java.lang.AssertionError: assertion failed: Failed to find the jar inside 
> folder: /home/bjorn/spark-3.4.0/connector/connect/client/jvm/target
>   at scala.Predef$.assert(Predef.scala:223)
>   at 
> org.apache.spark.sql.connect.client.util.IntegrationTestUtils$.findJar(IntegrationTestUtils.scala:67)
>   at 
> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar$lzycompute(CompatibilitySuite.scala:57)
>   at 
> org.apache.spark.sql.connect.client.CompatibilitySuite.clientJar(CompatibilitySuite.scala:53)
>   at 
> org.apache.spark.sql.connect.client.CompatibilitySuite.$anonfun$new$7(CompatibilitySuite.scala:110)
>   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22) {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26314) support Confluent encoded Avro in Spark Structured Streaming

2023-03-02 Thread Gustavo Martin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-26314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695828#comment-17695828
 ] 

Gustavo Martin commented on SPARK-26314:


My team just stumbled upon this problem :(

I was hoping Spark would make use of Avro's capabilities for finding the right 
schema associated with an event when using a Schema Registry.

 

> support Confluent encoded Avro in Spark Structured Streaming
> 
>
> Key: SPARK-26314
> URL: https://issues.apache.org/jira/browse/SPARK-26314
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: David Ahern
>Priority: Major
>
> As Avro has now been added as a first class citizen,
> [https://spark.apache.org/docs/latest/sql-data-sources-avro.html]
> please make Confluent-encoded Avro work out of the box with Spark Structured 
> Streaming.
> As described in the link below, Avro messages on Kafka encoded with the Confluent 
> serializer also need to be decoded with the Confluent deserializer. It would be 
> great if this worked out of the box
> [https://developer.ibm.com/answers/questions/321440/ibm-iidr-cdc-db2-to-kafka.html?smartspace=blockchain]
> here are details on the Confluent encoding
> [https://www.sderosiaux.com/articles/2017/03/02/serializing-data-efficiently-with-apache-avro-and-dealing-with-a-schema-registry/#encodingdecoding-the-messages-with-the-schema-id]
> It's been a year since i worked on anything to do with Avro and Spark 
> Structured Streaming, but i had to take an approach such as this when getting 
> it to work.  This is what i  used as a reference at that time
> [https://github.com/tubular/confluent-spark-avro]
> Also, here is another link i found that someone has done in the meantime
> [https://github.com/AbsaOSS/ABRiS]
>  
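
For reference, a commonly used stop-gap (a sketch, not a built-in Spark feature) is 
to strip Confluent's 5-byte wire-format header (1 magic byte + a 4-byte schema id) 
and decode the remainder with from_avro against a single writer schema fetched from 
the registry out of band. It assumes Spark 3.x with the spark-avro package, a Kafka 
source DataFrame `kafkaDf`, and a schema JSON string `schemaJson`; it only works when 
every message was written with that one schema, which is exactly the per-message 
schema resolution this ticket asks for.

{code:scala}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.avro.functions.from_avro

// Sketch: kafkaDf is a DataFrame from the Kafka source with a binary `value` column;
// schemaJson is the writer schema (JSON) fetched from the Schema Registry out of band.
// Only valid if every message was written with that one schema.
val stripConfluentHeader = udf((payload: Array[Byte]) => payload.drop(5))
val decoded = kafkaDf.select(
  from_avro(stripConfluentHeader(col("value")), schemaJson).as("event"))
{code}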



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42653) Artifact transfer from Scala/JVM client to Server

2023-03-02 Thread Venkata Sai Akhil Gudesa (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Venkata Sai Akhil Gudesa updated SPARK-42653:
-
Epic Link: SPARK-42554

> Artifact transfer from Scala/JVM client to Server
> -
>
> Key: SPARK-42653
> URL: https://issues.apache.org/jira/browse/SPARK-42653
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> In the decoupled client-server architecture of Spark Connect, a remote client 
> may use a local JAR or a new class in their UDF that may not be present on 
> the server. To handle these cases of missing "artifacts", we need to 
> implement a mechanism to transfer artifacts from the client side over to the 
> server side as per the protocol defined in 
> https://github.com/apache/spark/pull/40147 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42653) Artifact transfer from Scala/JVM client to Server

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695820#comment-17695820
 ] 

Apache Spark commented on SPARK-42653:
--

User 'vicennial' has created a pull request for this issue:
https://github.com/apache/spark/pull/40256

> Artifact transfer from Scala/JVM client to Server
> -
>
> Key: SPARK-42653
> URL: https://issues.apache.org/jira/browse/SPARK-42653
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> In the decoupled client-server architecture of Spark Connect, a remote client 
> may use a local JAR or a new class in their UDF that may not be present on 
> the server. To handle these cases of missing "artifacts", we need to 
> implement a mechanism to transfer artifacts from the client side over to the 
> server side as per the protocol defined in 
> https://github.com/apache/spark/pull/40147 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42653) Artifact transfer from Scala/JVM client to Server

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42653:


Assignee: (was: Apache Spark)

> Artifact transfer from Scala/JVM client to Server
> -
>
> Key: SPARK-42653
> URL: https://issues.apache.org/jira/browse/SPARK-42653
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> In the decoupled client-server architecture of Spark Connect, a remote client 
> may use a local JAR or a new class in their UDF that may not be present on 
> the server. To handle these cases of missing "artifacts", we need to 
> implement a mechanism to transfer artifacts from the client side over to the 
> server side as per the protocol defined in 
> https://github.com/apache/spark/pull/40147 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42653) Artifact transfer from Scala/JVM client to Server

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42653:


Assignee: Apache Spark

> Artifact transfer from Scala/JVM client to Server
> -
>
> Key: SPARK-42653
> URL: https://issues.apache.org/jira/browse/SPARK-42653
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Assignee: Apache Spark
>Priority: Major
>
> In the decoupled client-server architecture of Spark Connect, a remote client 
> may use a local JAR or a new class in their UDF that may not be present on 
> the server. To handle these cases of missing "artifacts", we need to 
> implement a mechanism to transfer artifacts from the client side over to the 
> server side as per the protocol defined in 
> https://github.com/apache/spark/pull/40147 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42653) Artifact transfer from Scala/JVM client to Server

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695819#comment-17695819
 ] 

Apache Spark commented on SPARK-42653:
--

User 'vicennial' has created a pull request for this issue:
https://github.com/apache/spark/pull/40256

> Artifact transfer from Scala/JVM client to Server
> -
>
> Key: SPARK-42653
> URL: https://issues.apache.org/jira/browse/SPARK-42653
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Venkata Sai Akhil Gudesa
>Priority: Major
>
> In the decoupled client-server architecture of Spark Connect, a remote client 
> may use a local JAR or a new class in their UDF that may not be present on 
> the server. To handle these cases of missing "artifacts", we need to 
> implement a mechanism to transfer artifacts from the client side over to the 
> server side as per the protocol defined in 
> https://github.com/apache/spark/pull/40147 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42654) Upgrade dropwizard metrics 4.2.17

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42654:


Assignee: Apache Spark

> Upgrade dropwizard metrics 4.2.17
> -
>
> Key: SPARK-42654
> URL: https://issues.apache.org/jira/browse/SPARK-42654
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Major
>
> * [https://github.com/dropwizard/metrics/releases/tag/v4.2.16]
>  * [https://github.com/dropwizard/metrics/releases/tag/v4.2.17]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42558) Implement DataFrameStatFunctions

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42558:


Assignee: Apache Spark

> Implement DataFrameStatFunctions
> 
>
> Key: SPARK-42558
> URL: https://issues.apache.org/jira/browse/SPARK-42558
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Apache Spark
>Priority: Major
>
> Implement DataFrameStatFunctions for connect, and hook it up to Dataset.
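
For context, a small sketch of the existing (non-Connect) API surface being mirrored, 
assuming a running SparkSession `spark` (e.g. in spark-shell); `df.stat` is the 
DataFrameStatFunctions handle that the Connect Dataset needs to expose as well:

{code:scala}
import spark.implicits._

// df.stat returns DataFrameStatFunctions; approxQuantile and crosstab are two of the
// methods the Connect client would mirror.
val df = Seq((1, 10.0), (2, 20.0), (3, 30.0)).toDF("k", "v")
val medianAndP90 = df.stat.approxQuantile("v", Array(0.5, 0.9), 0.0)
val pairCounts   = df.stat.crosstab("k", "k")
pairCounts.show()
{code}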



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42558) Implement DataFrameStatFunctions

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695816#comment-17695816
 ] 

Apache Spark commented on SPARK-42558:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40255

> Implement DataFrameStatFunctions
> 
>
> Key: SPARK-42558
> URL: https://issues.apache.org/jira/browse/SPARK-42558
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> Implement DataFrameStatFunctions for connect, and hook it up to Dataset.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42654) Upgrade dropwizard metrics 4.2.17

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695814#comment-17695814
 ] 

Apache Spark commented on SPARK-42654:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/40254

> Upgrade dropwizard metrics 4.2.17
> -
>
> Key: SPARK-42654
> URL: https://issues.apache.org/jira/browse/SPARK-42654
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> * [https://github.com/dropwizard/metrics/releases/tag/v4.2.16]
>  * [https://github.com/dropwizard/metrics/releases/tag/v4.2.17]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42558) Implement DataFrameStatFunctions

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42558:


Assignee: (was: Apache Spark)

> Implement DataFrameStatFunctions
> 
>
> Key: SPARK-42558
> URL: https://issues.apache.org/jira/browse/SPARK-42558
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> Implement DataFrameStatFunctions for connect, and hook it up to Dataset.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42654) Upgrade dropwizard metrics 4.2.17

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42654:


Assignee: (was: Apache Spark)

> Upgrade dropwizard metrics 4.2.17
> -
>
> Key: SPARK-42654
> URL: https://issues.apache.org/jira/browse/SPARK-42654
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.5.0
>Reporter: Yang Jie
>Priority: Major
>
> * [https://github.com/dropwizard/metrics/releases/tag/v4.2.16]
>  * [https://github.com/dropwizard/metrics/releases/tag/v4.2.17]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42654) Upgrade dropwizard metrics 4.2.17

2023-03-02 Thread Yang Jie (Jira)
Yang Jie created SPARK-42654:


 Summary: Upgrade dropwizard metrics 4.2.17
 Key: SPARK-42654
 URL: https://issues.apache.org/jira/browse/SPARK-42654
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: Yang Jie


* [https://github.com/dropwizard/metrics/releases/tag/v4.2.16]
 * [https://github.com/dropwizard/metrics/releases/tag/v4.2.17]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42653) Artifact transfer from Scala/JVM client to Server

2023-03-02 Thread Venkata Sai Akhil Gudesa (Jira)
Venkata Sai Akhil Gudesa created SPARK-42653:


 Summary: Artifact transfer from Scala/JVM client to Server
 Key: SPARK-42653
 URL: https://issues.apache.org/jira/browse/SPARK-42653
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.4.0
Reporter: Venkata Sai Akhil Gudesa


In the decoupled client-server architecture of Spark Connect, a remote client 
may use a local JAR or a new class in their UDF that may not be present on the 
server. To handle these cases of missing "artifacts", we need to implement a 
mechanism to transfer artifacts from the client side over to the server side as 
per the protocol defined in https://github.com/apache/spark/pull/40147 
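A rough sketch of how this could look from the Scala client (hypothetical usage only; the remote address, the addArtifact method name, and its signature are assumptions here, not the finalized API from the linked protocol PR):

{code:scala}
// Hypothetical sketch of the client -> server artifact flow described above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .remote("sc://localhost:15002") // Spark Connect endpoint (illustrative)
  .getOrCreate()

// Ship a local JAR containing UDF classes to the server before using them.
spark.addArtifact("/path/to/my-udfs.jar")

// Once the artifact is registered on the server side, UDFs that reference
// classes from the JAR can be executed through this session.
{code}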



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42553) NonReserved keyword "interval" can't be column name

2023-03-02 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-42553:
-
Fix Version/s: 3.3.3

> NonReserved keyword "interval" can't be column name
> ---
>
> Key: SPARK-42553
> URL: https://issues.apache.org/jira/browse/SPARK-42553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.2.3, 3.3.2
> Environment: Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 
> 1.8.0_345)
> Spark version 3.2.3-SNAPSHOT
>Reporter: jiang13021
>Assignee: jiang13021
>Priority: Major
> Fix For: 3.3.3, 3.4.1
>
>
> INTERVAL is a Non-Reserved keyword in spark. "Non-Reserved keywords" have a 
> special meaning in particular contexts and can be used as identifiers in 
> other contexts. So by design, interval can be used as a column name.
> {code:java}
> scala> spark.sql("select interval from mytable")
> org.apache.spark.sql.catalyst.parser.ParseException:
> at least one time unit should be given for interval literal(line 1, pos 7)== 
> SQL ==
> select interval from mytable
> ---^^^  at 
> org.apache.spark.sql.errors.QueryParsingErrors$.invalidIntervalLiteralError(QueryParsingErrors.scala:196)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$parseIntervalLiteral$1(AstBuilder.scala:2481)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.parseIntervalLiteral(AstBuilder.scala:2466)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitInterval$1(AstBuilder.scala:2432)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:2431)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalContext.accept(SqlBaseParser.java:17308)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitIntervalLiteral(SqlBaseBaseVisitor.java:1581)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalLiteralContext.accept(SqlBaseParser.java:16929)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitConstantDefault(SqlBaseBaseVisitor.java:1511)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$ConstantDefaultContext.accept(SqlBaseParser.java:15905)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitValueExpressionDefault(SqlBaseBaseVisitor.java:1392)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$ValueExpressionDefaultContext.accept(SqlBaseParser.java:15298)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPredicated$1(AstBuilder.scala:1548)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:1547)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$PredicatedContext.accept(SqlBaseParser.java:14745)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitExpression(SqlBaseBaseVisitor.java:1343)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$ExpressionContext.accept(SqlBaseParser.java:14606)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitNamedExpression$1(AstBuilder.scala:1434)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:1433)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$NamedExpressionContext.accept(SqlBaseParser.java:14124)
>   at 
> 

[jira] [Resolved] (SPARK-42622) StackOverflowError reading json that does not conform to schema

2023-03-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-42622.
--
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40219
[https://github.com/apache/spark/pull/40219]

> StackOverflowError reading json that does not conform to schema
> ---
>
> Key: SPARK-42622
> URL: https://issues.apache.org/jira/browse/SPARK-42622
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.4.0
>Reporter: Jelmer Kuperus
>Assignee: Jelmer Kuperus
>Priority: Critical
> Fix For: 3.4.0
>
>
> Databricks Runtime 12.1 uses a pre-release version of Spark 3.4.x; we 
> encountered the following problem.
>  
> !https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42622) StackOverflowError reading json that does not conform to schema

2023-03-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-42622:


Assignee: Jelmer Kuperus

> StackOverflowError reading json that does not conform to schema
> ---
>
> Key: SPARK-42622
> URL: https://issues.apache.org/jira/browse/SPARK-42622
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.4.0
>Reporter: Jelmer Kuperus
>Assignee: Jelmer Kuperus
>Priority: Critical
>
> Databricks Runtime 12.1 uses a pre-release version of Spark 3.4.x; we 
> encountered the following problem.
>  
> !https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42622) StackOverflowError reading json that does not conform to schema

2023-03-02 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-42622:
-
Fix Version/s: 3.4.1
   (was: 3.4.0)
 Priority: Major  (was: Critical)

> StackOverflowError reading json that does not conform to schema
> ---
>
> Key: SPARK-42622
> URL: https://issues.apache.org/jira/browse/SPARK-42622
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.4.0
>Reporter: Jelmer Kuperus
>Assignee: Jelmer Kuperus
>Priority: Major
> Fix For: 3.4.1
>
>
> Databricks Runtime 12.1 uses a pre-release version of Spark 3.4.x; we 
> encountered the following problem.
>  
> !https://user-images.githubusercontent.com/133639/221866500-99f187a0-8db3-42a7-85ca-b027fdec160d.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42652) Improve the multiple watermarking policy documentation

2023-03-02 Thread Sandeep Chandran (Jira)
Sandeep Chandran created SPARK-42652:


 Summary: Improve the multiple watermarking policy documentation
 Key: SPARK-42652
 URL: https://issues.apache.org/jira/browse/SPARK-42652
 Project: Spark
  Issue Type: Documentation
  Components: Structured Streaming
Affects Versions: 3.3.2
Reporter: Sandeep Chandran


It would be better to add some examples to the documentation on handling multiple watermarks.

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#policy-for-handling-multiple-watermarks
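For instance, an example along the following lines could be added (a minimal sketch only; the rate sources, delays, and console sink are illustrative, and spark.sql.streaming.multipleWatermarkPolicy is the configuration that chooses between the "min" and "max" policies):

{code:scala}
// Sketch of a multiple-watermark example for the guide. The sources and
// delays are placeholders; the policy configuration is the point.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder().appName("MultipleWatermarksExample").getOrCreate()

// "min" (the default) lets the slowest stream hold back the global watermark;
// "max" advances it with the fastest stream, possibly dropping late data
// from the slower one.
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "min")

val fast = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .withWatermark("timestamp", "10 minutes")

val slow = spark.readStream.format("rate").option("rowsPerSecond", "5").load()
  .withWatermark("timestamp", "2 hours")

// Union the two streams; the engine combines their watermarks per the policy.
val counts = fast.union(slow)
  .groupBy(window(col("timestamp"), "5 minutes"))
  .count()

val query = counts.writeStream
  .outputMode("append")
  .format("console")
  .start()
{code}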



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42558) Implement DataFrameStatFunctions

2023-03-02 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695706#comment-17695706
 ] 

Yang Jie commented on SPARK-42558:
--

It seems DataFrameStatFunctions can only be partially implemented: bloomFilter and 
countMinSketch have no corresponding protocol yet.

 

> Implement DataFrameStatFunctions
> 
>
> Key: SPARK-42558
> URL: https://issues.apache.org/jira/browse/SPARK-42558
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>
> Implement DataFrameStatFunctions for connect, and hook it up to Dataset.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42553) NonReserved keyword "interval" can't be column name

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695696#comment-17695696
 ] 

Apache Spark commented on SPARK-42553:
--

User 'jiang13021' has created a pull request for this issue:
https://github.com/apache/spark/pull/40253

> NonReserved keyword "interval" can't be column name
> ---
>
> Key: SPARK-42553
> URL: https://issues.apache.org/jira/browse/SPARK-42553
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0, 3.3.1, 3.2.3, 3.3.2
> Environment: Scala version 2.12.15 (OpenJDK 64-Bit Server VM, Java 
> 1.8.0_345)
> Spark version 3.2.3-SNAPSHOT
>Reporter: jiang13021
>Assignee: jiang13021
>Priority: Major
> Fix For: 3.4.1
>
>
> INTERVAL is a Non-Reserved keyword in spark. "Non-Reserved keywords" have a 
> special meaning in particular contexts and can be used as identifiers in 
> other contexts. So by design, interval can be used as a column name.
> {code:java}
> scala> spark.sql("select interval from mytable")
> org.apache.spark.sql.catalyst.parser.ParseException:
> at least one time unit should be given for interval literal(line 1, pos 7)== 
> SQL ==
> select interval from mytable
> ---^^^  at 
> org.apache.spark.sql.errors.QueryParsingErrors$.invalidIntervalLiteralError(QueryParsingErrors.scala:196)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$parseIntervalLiteral$1(AstBuilder.scala:2481)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.parseIntervalLiteral(AstBuilder.scala:2466)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitInterval$1(AstBuilder.scala:2432)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:2431)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitInterval(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalContext.accept(SqlBaseParser.java:17308)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitIntervalLiteral(SqlBaseBaseVisitor.java:1581)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$IntervalLiteralContext.accept(SqlBaseParser.java:16929)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitConstantDefault(SqlBaseBaseVisitor.java:1511)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$ConstantDefaultContext.accept(SqlBaseParser.java:15905)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitValueExpressionDefault(SqlBaseBaseVisitor.java:1392)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$ValueExpressionDefaultContext.accept(SqlBaseParser.java:15298)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPredicated$1(AstBuilder.scala:1548)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:1547)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitPredicated(AstBuilder.scala:57)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$PredicatedContext.accept(SqlBaseParser.java:14745)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitChildren(AstBuilder.scala:71)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseBaseVisitor.visitExpression(SqlBaseBaseVisitor.java:1343)
>   at 
> org.apache.spark.sql.catalyst.parser.SqlBaseParser$ExpressionContext.accept(SqlBaseParser.java:14606)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.typedVisit(AstBuilder.scala:61)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.expression(AstBuilder.scala:1412)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitNamedExpression$1(AstBuilder.scala:1434)
>   at 
> org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:133)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:1433)
>   at 
> org.apache.spark.sql.catalyst.parser.AstBuilder.visitNamedExpression(AstBuilder.scala:57)
>   at 
> 

[jira] [Commented] (SPARK-42555) Add JDBC to DataFrameReader

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695691#comment-17695691
 ] 

Apache Spark commented on SPARK-42555:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/40252

> Add JDBC to DataFrameReader
> ---
>
> Key: SPARK-42555
> URL: https://issues.apache.org/jira/browse/SPARK-42555
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42555) Add JDBC to DataFrameReader

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42555:


Assignee: Apache Spark

> Add JDBC to DataFrameReader
> ---
>
> Key: SPARK-42555
> URL: https://issues.apache.org/jira/browse/SPARK-42555
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42555) Add JDBC to DataFrameReader

2023-03-02 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-42555:


Assignee: (was: Apache Spark)

> Add JDBC to DataFrameReader
> ---
>
> Key: SPARK-42555
> URL: https://issues.apache.org/jira/browse/SPARK-42555
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41725) Remove the workaround of sql(...).collect back in PySpark tests

2023-03-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-41725:


Assignee: Hyukjin Kwon

> Remove the workaround of sql(...).collect back in PySpark tests
> ---
>
> Key: SPARK-41725
> URL: https://issues.apache.org/jira/browse/SPARK-41725
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> See https://github.com/apache/spark/pull/39224/files#r1057436437
> We don't have to call `collect` for every `sql`, but Spark Connect requires it. We 
> should remove these workarounds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42651) Optimize global sort to driver sort

2023-03-02 Thread XiDuo You (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

XiDuo You updated SPARK-42651:
--
Description: If the plan's size is small enough, it is more efficient to sort all 
rows on the driver side, which saves one shuffle.
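For illustration, the kind of query this targets (a minimal sketch; the rule name, size threshold, and final plan shape are not specified in this ticket, and the explain output in the comments is abbreviated and approximate):

{code:scala}
// Illustrative only: a global ORDER BY currently plans a range-partitioning
// shuffle followed by a distributed sort, even for a tiny input. The idea is
// to sort such small results on the driver and skip that exchange.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("GlobalSortExample").master("local[*]").getOrCreate()

val small = spark.range(0, 1000).selectExpr("id", "id % 7 AS bucket")

small.orderBy("bucket").explain()
// Roughly (abbreviated):
//   Sort [bucket ASC NULLS FIRST], true
//   +- Exchange rangepartitioning(bucket ASC NULLS FIRST, 200)   <- the shuffle to save
//      +- Project [id, (id % 7) AS bucket]
//         +- Range (0, 1000, ...)
{code}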

> Optimize global sort to driver sort
> ---
>
> Key: SPARK-42651
> URL: https://issues.apache.org/jira/browse/SPARK-42651
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: XiDuo You
>Priority: Major
>
> If the plan's size is small enough, it is more efficient to sort all rows on 
> the driver side, which saves one shuffle.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42393) Support for Pandas/Arrow Functions API

2023-03-02 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42393:
-

Assignee: Xinrong Meng

> Support for Pandas/Arrow Functions API
> --
>
> Key: SPARK-42393
> URL: https://issues.apache.org/jira/browse/SPARK-42393
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 3.4.0, 3.5.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42499) Support for Runtime SQL configuration

2023-03-02 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42499:
-

Assignee: Takuya Ueshin

> Support for Runtime SQL configuration
> -
>
> Key: SPARK-42499
> URL: https://issues.apache.org/jira/browse/SPARK-42499
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, PySpark
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42651) Optimize global sort to driver sort

2023-03-02 Thread XiDuo You (Jira)
XiDuo You created SPARK-42651:
-

 Summary: Optimize global sort to driver sort
 Key: SPARK-42651
 URL: https://issues.apache.org/jira/browse/SPARK-42651
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.0
Reporter: XiDuo You






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33628) Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the HiveClientImpl

2023-03-02 Thread Maxim Martynov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695654#comment-17695654
 ] 

Maxim Martynov commented on SPARK-33628:


Can anyone review this pull request?

> Use the Hive.getPartitionsByNames method instead of Hive.getPartitions in the 
> HiveClientImpl
> 
>
> Key: SPARK-33628
> URL: https://issues.apache.org/jira/browse/SPARK-33628
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: jinhai
>Priority: Major
> Attachments: image-2020-12-02-16-57-43-619.png, 
> image-2020-12-03-14-38-19-221.png
>
>
> When partitions are tracked by the catalog, Spark will compute all custom 
> partition locations, especially when dynamic partitions are used and the field 
> staticPartitions is empty.
> The poor performance of the listPartitions method results in a long period 
> of no response on the Driver.
> When reading 12253 partitions, getPartitionsByNames takes 2 seconds, 
> while getPartitions takes 457 seconds, nearly 8 minutes.
> !image-2020-12-02-16-57-43-619.png|width=783,height=54!
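A hedged sketch of the suggested call pattern (the method names follow this ticket; exact Hive signatures vary by version, and the real HiveClientImpl/shim code is more involved):

{code:scala}
// Illustrative sketch only, not the actual HiveClientImpl change: when the
// partition names are already known, fetch just those partitions from the
// metastore instead of listing every partition of the table.
import scala.collection.JavaConverters._
import org.apache.hadoop.hive.ql.metadata.{Hive, Partition, Table}

def loadPartitions(hive: Hive, table: Table, partNames: Seq[String]): Seq[Partition] = {
  if (partNames.nonEmpty) {
    // Targeted metastore call; per the ticket, ~2 seconds for 12253 partitions.
    hive.getPartitionsByNames(table, partNames.asJava).asScala.toSeq
  } else {
    // Full listing; per the ticket, ~457 seconds for the same table.
    hive.getPartitions(table).asScala.toSeq
  }
}
{code}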



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42578) Add JDBC to DataFrameWriter

2023-03-02 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42578?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695650#comment-17695650
 ] 

jiaan.geng commented on SPARK-42578:


I will take a look!

> Add JDBC to DataFrameWriter
> ---
>
> Key: SPARK-42578
> URL: https://issues.apache.org/jira/browse/SPARK-42578
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41725) Remove the workaround of sql(...).collect back in PySpark tests

2023-03-02 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695643#comment-17695643
 ] 

Apache Spark commented on SPARK-41725:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/40251

> Remove the workaround of sql(...).collect back in PySpark tests
> ---
>
> Key: SPARK-41725
> URL: https://issues.apache.org/jira/browse/SPARK-41725
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.4.0
>
>
> See https://github.com/apache/spark/pull/39224/files#r1057436437
> We don't have to call `collect` for every `sql`, but Spark Connect requires it. We 
> should remove these workarounds.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42555) Add JDBC to DataFrameReader

2023-03-02 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17695633#comment-17695633
 ] 

jiaan.geng commented on SPARK-42555:


I will take a look!

> Add JDBC to DataFrameReader
> ---
>
> Key: SPARK-42555
> URL: https://issues.apache.org/jira/browse/SPARK-42555
> Project: Spark
>  Issue Type: New Feature
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Herman van Hövell
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42641) Upgrade buf to v1.15.0

2023-03-02 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42641.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40243
[https://github.com/apache/spark/pull/40243]

> Upgrade buf to v1.15.0
> --
>
> Key: SPARK-42641
> URL: https://issues.apache.org/jira/browse/SPARK-42641
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
> Fix For: 3.4.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42641) Upgrade buf to v1.15.0

2023-03-02 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42641:
-

Assignee: Ruifeng Zheng

> Upgrade buf to v1.15.0
> --
>
> Key: SPARK-42641
> URL: https://issues.apache.org/jira/browse/SPARK-42641
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.4.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42650) link issue SPARK-42550

2023-03-02 Thread kevinshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kevinshin updated SPARK-42650:
--
Description: 
When using 
[KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/],
if an `insert overwrite` statement hits an exception, a non-partitioned table's 
home directory will be lost, and a partitioned table will lose its partition directories.
 
spark-defaults.conf: 
spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension
 

Because I can't reopen SPARK-42550, for details and reproduction steps please refer to: 

https://issues.apache.org/jira/browse/SPARK-42550

 

  was:
When use 
[KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/]
 and when a{{ insert overwrite}} statment meet exception ,a no partion table's 
home directory will lost ,partion table will lost partion directory.
 
spark-defaults.conf: 
spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension
 

because I can't reopen SPARK-42550 , for detail and reproduce please reference: 

https://issues.apache.org/jira/browse/SPARK-42550

 


> link issue SPARK-42550
> --
>
> Key: SPARK-42650
> URL: https://issues.apache.org/jira/browse/SPARK-42650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3
>Reporter: kevinshin
>Priority: Major
>
> When using 
> [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/],
> if an `insert overwrite` statement hits an exception, a non-partitioned table's 
> home directory will be lost, and a partitioned table will lose its partition directories.
>  
> spark-defaults.conf: 
> spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension
>  
> Because I can't reopen SPARK-42550, for details and reproduction steps please 
> refer to: 
> https://issues.apache.org/jira/browse/SPARK-42550
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42650) link issue SPARK-42550

2023-03-02 Thread kevinshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kevinshin updated SPARK-42650:
--
Description: 
When using 
[KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/],
if an `insert overwrite` statement hits an exception, a non-partitioned table's 
home directory will be lost, and a partitioned table will lose its partition directories.
 
my spark-defaults.conf config : 
spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension
 

Because I can't reopen SPARK-42550, for details and reproduction steps please refer to: 

https://issues.apache.org/jira/browse/SPARK-42550

 

  was:
When use 
[KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/]
 and when a `insert overwrite` statment meet exception ,a no partion table's 
home directory will lost ,partion table will lost partion directory.
 
spark-defaults.conf: 
spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension
 

because I can't reopen SPARK-42550 , for detail and reproduce please reference: 

https://issues.apache.org/jira/browse/SPARK-42550

 


> link issue SPARK-42550
> --
>
> Key: SPARK-42650
> URL: https://issues.apache.org/jira/browse/SPARK-42650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3
>Reporter: kevinshin
>Priority: Major
>
> When using 
> [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/],
> if an `insert overwrite` statement hits an exception, a non-partitioned table's 
> home directory will be lost, and a partitioned table will lose its partition directories.
>  
> my spark-defaults.conf config : 
> spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension
>  
> Because I can't reopen SPARK-42550, for details and reproduction steps please 
> refer to: 
> https://issues.apache.org/jira/browse/SPARK-42550
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42650) link issue SPARK-42550

2023-03-02 Thread kevinshin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kevinshin updated SPARK-42650:
--
Description: 
When using 
[KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/],
if an {{insert overwrite}} statement hits an exception, a non-partitioned table's 
home directory will be lost, and a partitioned table will lose its partition directories.
 
spark-defaults.conf: 
spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension
 

Because I can't reopen SPARK-42550, for details and reproduction steps please refer to: 

https://issues.apache.org/jira/browse/SPARK-42550

 

  was:https://issues.apache.org/jira/browse/SPARK-42550


> link issue SPARK-42550
> --
>
> Key: SPARK-42650
> URL: https://issues.apache.org/jira/browse/SPARK-42650
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.3
>Reporter: kevinshin
>Priority: Major
>
> When using 
> [KyuubiSparkSQLExtension|https://kyuubi.readthedocs.io/en/v1.6.1-incubating/extensions/engines/spark/],
> if an {{insert overwrite}} statement hits an exception, a non-partitioned 
> table's home directory will be lost, and a partitioned table will lose its partition directories.
>  
> spark-defaults.conf: 
> spark.sql.extensions org.apache.kyuubi.sql.KyuubiSparkSQLExtension
>  
> Because I can't reopen SPARK-42550, for details and reproduction steps please 
> refer to: 
> https://issues.apache.org/jira/browse/SPARK-42550
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42650) link issue SPARK-42550

2023-03-02 Thread kevinshin (Jira)
kevinshin created SPARK-42650:
-

 Summary: link issue SPARK-42550
 Key: SPARK-42650
 URL: https://issues.apache.org/jira/browse/SPARK-42650
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.3
Reporter: kevinshin


https://issues.apache.org/jira/browse/SPARK-42550



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42642) Make Python the first code example tab in the Spark documentation

2023-03-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42642.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40250
[https://github.com/apache/spark/pull/40250]

> Make Python the first code example tab in the Spark documentation
> -
>
> Key: SPARK-42642
> URL: https://issues.apache.org/jira/browse/SPARK-42642
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Allan Folting
>Assignee: Allan Folting
>Priority: Major
> Fix For: 3.5.0
>
> Attachments: Screenshot 2023-03-01 at 8.10.08 PM.png, Screenshot 
> 2023-03-01 at 8.10.22 PM.png
>
>
> Python is the most approachable and most popular language, so it should be the 
> default language in code examples. This change makes Python the first code example 
> tab consistently across the documentation, where applicable.
> This is continuing the work started with:
> https://issues.apache.org/jira/browse/SPARK-42493
> where these two pages were updated:
> [https://spark.apache.org/docs/latest/sql-getting-started.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html]
>  
> Pages being updated now:
> [https://spark.apache.org/docs/latest/ml-classification-regression.html]
> [https://spark.apache.org/docs/latest/ml-clustering.html]
> [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]
> [https://spark.apache.org/docs/latest/ml-datasource.html]
> [https://spark.apache.org/docs/latest/ml-features.html]
> [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]
> [https://spark.apache.org/docs/latest/ml-migration-guide.html]
> [https://spark.apache.org/docs/latest/ml-pipeline.html]
> [https://spark.apache.org/docs/latest/ml-statistics.html]
> [https://spark.apache.org/docs/latest/ml-tuning.html]
>  
> [https://spark.apache.org/docs/latest/mllib-clustering.html]
> [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html]
> [https://spark.apache.org/docs/latest/mllib-data-types.html]
> [https://spark.apache.org/docs/latest/mllib-decision-tree.html]
> [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html]
> [https://spark.apache.org/docs/latest/mllib-ensembles.html]
> [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html]
> [https://spark.apache.org/docs/latest/mllib-feature-extraction.html]
> [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html]
> [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html]
> [https://spark.apache.org/docs/latest/mllib-linear-methods.html]
> [https://spark.apache.org/docs/latest/mllib-naive-bayes.html]
> [https://spark.apache.org/docs/latest/mllib-statistics.html]
>  
> [https://spark.apache.org/docs/latest/quick-start.html]
>  
> [https://spark.apache.org/docs/latest/rdd-programming-guide.html]
>  
> [https://spark.apache.org/docs/latest/sql-data-sources-avro.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-csv.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-json.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-parquet.html]
> sql-data-sources-protobuf.html
> [https://spark.apache.org/docs/latest/sql-data-sources-text.html]
> [https://spark.apache.org/docs/latest/sql-migration-guide.html]
> [https://spark.apache.org/docs/latest/sql-performance-tuning.html]
> [https://spark.apache.org/docs/latest/sql-ref-datatypes.html]
>  
> [https://spark.apache.org/docs/latest/streaming-kinesis-integration.html]
> [https://spark.apache.org/docs/latest/streaming-programming-guide.html]
>  
> [https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html]
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42642) Make Python the first code example tab in the Spark documentation

2023-03-02 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42642:


Assignee: Allan Folting

> Make Python the first code example tab in the Spark documentation
> -
>
> Key: SPARK-42642
> URL: https://issues.apache.org/jira/browse/SPARK-42642
> Project: Spark
>  Issue Type: Documentation
>  Components: Spark Core
>Affects Versions: 3.5.0
>Reporter: Allan Folting
>Assignee: Allan Folting
>Priority: Major
> Attachments: Screenshot 2023-03-01 at 8.10.08 PM.png, Screenshot 
> 2023-03-01 at 8.10.22 PM.png
>
>
> Python is the most approachable and most popular language, so it should be the 
> default language in code examples. This change makes Python the first code example 
> tab consistently across the documentation, where applicable.
> This is continuing the work started with:
> https://issues.apache.org/jira/browse/SPARK-42493
> where these two pages were updated:
> [https://spark.apache.org/docs/latest/sql-getting-started.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html]
>  
> Pages being updated now:
> [https://spark.apache.org/docs/latest/ml-classification-regression.html]
> [https://spark.apache.org/docs/latest/ml-clustering.html]
> [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html]
> [https://spark.apache.org/docs/latest/ml-datasource.html]
> [https://spark.apache.org/docs/latest/ml-features.html]
> [https://spark.apache.org/docs/latest/ml-frequent-pattern-mining.html]
> [https://spark.apache.org/docs/latest/ml-migration-guide.html]
> [https://spark.apache.org/docs/latest/ml-pipeline.html]
> [https://spark.apache.org/docs/latest/ml-statistics.html]
> [https://spark.apache.org/docs/latest/ml-tuning.html]
>  
> [https://spark.apache.org/docs/latest/mllib-clustering.html]
> [https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html]
> [https://spark.apache.org/docs/latest/mllib-data-types.html]
> [https://spark.apache.org/docs/latest/mllib-decision-tree.html]
> [https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html]
> [https://spark.apache.org/docs/latest/mllib-ensembles.html]
> [https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html]
> [https://spark.apache.org/docs/latest/mllib-feature-extraction.html]
> [https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html]
> [https://spark.apache.org/docs/latest/mllib-isotonic-regression.html]
> [https://spark.apache.org/docs/latest/mllib-linear-methods.html]
> [https://spark.apache.org/docs/latest/mllib-naive-bayes.html]
> [https://spark.apache.org/docs/latest/mllib-statistics.html]
>  
> [https://spark.apache.org/docs/latest/quick-start.html]
>  
> [https://spark.apache.org/docs/latest/rdd-programming-guide.html]
>  
> [https://spark.apache.org/docs/latest/sql-data-sources-avro.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-binaryFile.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-csv.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-json.html]
> [https://spark.apache.org/docs/latest/sql-data-sources-parquet.html]
> sql-data-sources-protobuf.html
> [https://spark.apache.org/docs/latest/sql-data-sources-text.html]
> [https://spark.apache.org/docs/latest/sql-migration-guide.html]
> [https://spark.apache.org/docs/latest/sql-performance-tuning.html]
> [https://spark.apache.org/docs/latest/sql-ref-datatypes.html]
>  
> [https://spark.apache.org/docs/latest/streaming-kinesis-integration.html]
> [https://spark.apache.org/docs/latest/streaming-programming-guide.html]
>  
> [https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html]
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


