[jira] [Updated] (SPARK-48652) Casting Issue in Spark SQL: String Column Compared to Integer Value Yields Empty Results

2024-06-18 Thread Abhishek Singh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Singh updated SPARK-48652:
---
Description: 
In Spark SQL, comparing a string column to an integer value can lead to 
unexpected results: implicit type casting can make the comparison return an 
empty result set.
{code:java}
case class Person(id: String, name: String)
val personDF = Seq(Person("a", "amit"), Person("b", "abhishek")).toDF()
personDF.createOrReplaceTempView("person_ddf")
val sqlQuery = "SELECT * FROM person_ddf WHERE id <> -1"
val resultDF = spark.sql(sqlQuery)
resultDF.show() // Empty result due to type casting issue 

{code}
Below are the logical and physical plans I am getting:
{code:java}
== Parsed Logical Plan ==
'Project [*]
+- 'Filter NOT ('id = -1)
   +- 'UnresolvedRelation [person_ddf], [], false

== Analyzed Logical Plan ==
id: string, name: string
Project [id#356, name#357]
+- Filter NOT (cast(id#356 as int) = -1)
   +- SubqueryAlias person_ddf
      +- View (`person_ddf`, [id#356,name#357])
         +- LocalRelation [id#356, name#357]{code}

*But when I use the same query and table in Redshift, which is based on 
PostgreSQL, I get the desired result.*


{code:java}
select * from person where id <> -1; {code}

The explain plan obtained in Redshift:


{code:java}
XN Seq Scan on person  (cost=0.00..0.03 rows=1 width=336)
  Filter: ((id)::text <> '-1'::text) {code}
 

In the execution plan for Spark, the ID column is cast to an integer, while in 
Redshift, the ID column is cast to varchar.

Shouldn't Spark SQL handle this the same way as Redshift, using the datatype of 
the ID column rather than the datatype of -1?
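
A minimal workaround sketch (my own assumption, not an official fix): quote the 
literal so the comparison stays string-to-string and no cast is applied to the 
id column.

{code:java}
// Comparing against a string literal avoids the implicit cast(id as int).
val fixedQuery = "SELECT * FROM person_ddf WHERE id <> '-1'"
spark.sql(fixedQuery).show()
// +---+--------+
// | id|    name|
// +---+--------+
// |  a|    amit|
// |  b|abhishek|
// +---+--------+
{code}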

 

  was:
In Spark SQL, comparing a string column to an integer value can lead to 
unexpected results due to implicit type casting. When a string column is 
compared to an integer, Spark attempts to cast the strings to integers, which 
fails for non-numeric strings, resulting in an empty result set.


{code:java}
case class Person(id: String, name: String)
val personDF = Seq(Person("a", "amit"), Person("b", "abhishek")).toDF()
personDF.createOrReplaceTempView("person_ddf")
val sqlQuery = "SELECT * FROM person_ddf WHERE id <> -1"
val resultDF = spark.sql(sqlQuery)
resultDF.show() // Empty result due to type casting issue 

{code}
Below are the logical and physical plans I am getting:
{code:java}
== Parsed Logical Plan ==
'Project [*]
+- 'Filter NOT ('id = -1)
   +- 'UnresolvedRelation [person_ddf], [], false

== Analyzed Logical Plan ==
id: string, name: string
Project [id#356, name#357]
+- Filter NOT (cast(id#356 as int) = -1)
   +- SubqueryAlias person_ddf
      +- View (`person_ddf`, [id#356,name#357])
         +- LocalRelation [id#356, name#357]

== Optimized Logical Plan ==
LocalRelation <empty>, [id#356, name#357]

== Physical Plan ==
LocalTableScan <empty>, [id#356, name#357]

== Physical Plan ==
LocalTableScan (1) {code}


> Casting Issue in Spark SQL: String Column Compared to Integer Value Yields 
> Empty Results
> 
>
> Key: SPARK-48652
> URL: https://issues.apache.org/jira/browse/SPARK-48652
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 3.3.2
>Reporter: Abhishek Singh
>Priority: Minor
>
> In Spark SQL, comparing a string column to an integer value can lead to 
> unexpected results: implicit type casting can make the comparison return an 
> empty result set.
> {code:java}
> case class Person(id: String, name: String)
> val personDF = Seq(Person("a", "amit"), Person("b", "abhishek")).toDF()
> personDF.createOrReplaceTempView("person_ddf")
> val sqlQuery = "SELECT * FROM person_ddf WHERE id <> -1"
> val resultDF = spark.sql(sqlQuery)
> resultDF.show() // Empty result due to type casting issue 
> {code}
> Below are the logical and physical plans I am getting:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'Filter NOT ('id = -1)
>+- 'UnresolvedRelation [person_ddf], [], false
> == Analyzed Logical Plan ==
> id: string, name: string
> Project [id#356, name#357]
> +- Filter NOT (cast(id#356 as int) = -1)
>+- SubqueryAlias person_ddf
>   +- View (`person_ddf`, [id#356,name#357])
>  +- LocalRelation [id#356, name#357]{code}
> *But when I use the same query and table in Redshift, which is based on 
> PostgreSQL, I get the desired result.*
> {code:java}
> select * from person where id <> -1; {code}
> The explain plan obtained in Redshift:
> {code:java}
> XN Seq Scan on person  (cost=0.00..0.03 rows=1 width=336)
>   Filter: ((id)::text <> '-1'::text) {code}
>  
> In the execution plan for Spark, the ID column is cast to an integer, while 
> in Redshift, the ID column is cast to varchar.
> Shouldn't Spark SQL handle this the same way as Redshift, using the datatype 
> of the ID column rather than the datatype of -1?

[jira] [Updated] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect

2024-06-18 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-48660:

Description: 
How to reproduce:

{code:sql}
CREATE TABLE order_history_version_audit_rno (
  eventid STRING,
  id STRING,
  referenceid STRING,
  type STRING,
  referencetype STRING,
  sellerid BIGINT,
  buyerid BIGINT,
  producerid STRING,
  versionid INT,
  changedocuments ARRAY<STRUCT<...: BIGINT, changeDetails: STRING>>,
  dt STRING,
  hr STRING)
USING parquet
PARTITIONED BY (dt, hr);

explain cost
CREATE TABLE order_history_version_audit_rno
USING parquet
PARTITIONED BY (dt)
CLUSTERED BY (id) INTO 1000 buckets
AS SELECT * FROM order_history_version_audit_rno
WHERE dt >= '2023-11-29';
{code}


{noformat}
spark-sql (default)> 
   > explain cost
   > CREATE TABLE order_history_version_audit_rno
   > USING parquet
   > PARTITIONED BY (dt)
   > CLUSTERED BY (id) INTO 1000 buckets
   > AS SELECT * FROM order_history_version_audit_rno
   > WHERE dt >= '2023-11-29';
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand 
`spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
[eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, 
versionid, changedocuments, hr, dt]
   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
hr#16, dt#15]
  +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
dt#15, hr#16]
 +- Filter (dt#15 >= 2023-11-29)
+- SubqueryAlias 
spark_catalog.default.order_history_version_audit_rno
   +- Relation 
spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
 parquet

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand
   +- CreateDataSourceTableAsSelectCommand 
`spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
[eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, 
versionid, changedocuments, hr, dt]
 +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
hr#16, dt#15]
+- Project [eventid#5, id#6, referenceid#7, type#8, 
referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, 
changedocuments#14, dt#15, hr#16]
   +- Filter (dt#15 >= 2023-11-29)
  +- SubqueryAlias 
spark_catalog.default.order_history_version_audit_rno
 +- Relation 
spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
 parquet
{noformat}

If we remove the CREATE TABLE:

{noformat}
   > explain cost 
   > SELECT * FROM order_history_version_audit_rno
   > WHERE dt >= '2023-11-29';
== Optimized Logical Plan ==
Filter (isnotnull(dt#15) AND (dt#15 >= 2023-11-29)), Statistics(sizeInBytes=1.0 
B)
+- Relation 
spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
 parquet, Statistics(sizeInBytes=0.0 B)

== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet 
spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
 Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 
paths)[], PartitionFilters: [isnotnull(dt#15), (dt#15 >= 2023-11-29)], 
PushedFilters: [], ReadSchema: 
struct<...>
{noformat}

  was:
How to reproduce:

{code:sql}
CREATE TABLE order_history_version_audit_rno (
  eventid STRING,
  id STRING,
  referenceid STRING,
  type STRING,
  referencetype STRING,
  sellerid BIGINT,
  buyerid BIGINT,
  producerid STRING,
  versionid INT,
  changedocuments ARRAY<STRUCT<...: BIGINT, changeDetails: STRING>>,
  dt STRING,
  hr STRING)
USING parquet
PARTITIONED BY (dt, hr);

explain cost
CREATE TABLE order_history_version_audit_rno
USING parquet
PARTITIONED BY (dt)
CLUSTERED BY (id) INTO 1000 buckets
AS SELECT * FROM order_history_version_audit_rno
WHERE dt >= '2023-11-29';
{code}


{noformat}
spark-sql (default)> 
   > explain cost
   > CREATE TABLE order_history_version_audit_rno
   > USING parquet
   > PARTITIONED BY (dt)
   > CLUSTERED BY (id) INTO 1000 buckets
   > AS SELECT * FROM order_history_version_audit_rno
   > WHERE dt >= '2023-11-29';
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand 
`spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
[eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, 
versionid, 

[jira] [Commented] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect

2024-06-18 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856122#comment-17856122
 ] 

Wei Guo commented on SPARK-48660:
-

I am working on this, and thank you for the recommendation [~yangjie01].

> The result of explain is incorrect for CreateTableAsSelect
> --
>
> Key: SPARK-48660
> URL: https://issues.apache.org/jira/browse/SPARK-48660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> CREATE TABLE order_history_version_audit_rno (
>   eventid STRING,
>   id STRING,
>   referenceid STRING,
>   type STRING,
>   referencetype STRING,
>   sellerid BIGINT,
>   buyerid BIGINT,
>   producerid STRING,
>   versionid INT,
>   changedocuments ARRAY<STRUCT<...: BIGINT, changeDetails: STRING>>,
>   dt STRING,
>   hr STRING)
> USING parquet
> PARTITIONED BY (dt, hr);
> explain cost
> CREATE TABLE order_history_version_audit_rno
> USING parquet
> PARTITIONED BY (dt)
> CLUSTERED BY (id) INTO 1000 buckets
> AS SELECT * FROM order_history_version_audit_rno
> WHERE dt >= '2023-11-29';
> {code}
> {noformat}
> spark-sql (default)> 
>> explain cost
>> CREATE TABLE order_history_version_audit_rno
>> USING parquet
>> PARTITIONED BY (dt)
>> CLUSTERED BY (id) INTO 1000 buckets
>> AS SELECT * FROM order_history_version_audit_rno
>> WHERE dt >= '2023-11-29';
> == Optimized Logical Plan ==
> CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>+- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
>   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> dt#15, hr#16]
>  +- Filter (dt#15 >= 2023-11-29)
> +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>+- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> == Physical Plan ==
> Execute CreateDataSourceTableAsSelectCommand
>+- CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>  +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
> +- Project [eventid#5, id#6, referenceid#7, type#8, 
> referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, 
> changedocuments#14, dt#15, hr#16]
>+- Filter (dt#15 >= 2023-11-29)
>   +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>  +- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> {noformat}






[jira] [Comment Edited] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect

2024-06-18 Thread Wei Guo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-48660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17856122#comment-17856122
 ] 

Wei Guo edited comment on SPARK-48660 at 6/19/24 4:18 AM:
--

I am working on this, and thank you for the recommendation [~LuciferYang] 


was (Author: wayne guo):
I am working on this, and thank you for the recommendation [~yangjie01].

> The result of explain is incorrect for CreateTableAsSelect
> --
>
> Key: SPARK-48660
> URL: https://issues.apache.org/jira/browse/SPARK-48660
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0, 3.5.1
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> CREATE TABLE order_history_version_audit_rno (
>   eventid STRING,
>   id STRING,
>   referenceid STRING,
>   type STRING,
>   referencetype STRING,
>   sellerid BIGINT,
>   buyerid BIGINT,
>   producerid STRING,
>   versionid INT,
>   changedocuments ARRAY<STRUCT<...: BIGINT, changeDetails: STRING>>,
>   dt STRING,
>   hr STRING)
> USING parquet
> PARTITIONED BY (dt, hr);
> explain cost
> CREATE TABLE order_history_version_audit_rno
> USING parquet
> PARTITIONED BY (dt)
> CLUSTERED BY (id) INTO 1000 buckets
> AS SELECT * FROM order_history_version_audit_rno
> WHERE dt >= '2023-11-29';
> {code}
> {noformat}
> spark-sql (default)> 
>> explain cost
>> CREATE TABLE order_history_version_audit_rno
>> USING parquet
>> PARTITIONED BY (dt)
>> CLUSTERED BY (id) INTO 1000 buckets
>> AS SELECT * FROM order_history_version_audit_rno
>> WHERE dt >= '2023-11-29';
> == Optimized Logical Plan ==
> CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>+- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
>   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> dt#15, hr#16]
>  +- Filter (dt#15 >= 2023-11-29)
> +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>+- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> == Physical Plan ==
> Execute CreateDataSourceTableAsSelectCommand
>+- CreateDataSourceTableAsSelectCommand 
> `spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
> [eventid, id, referenceid, type, referencetype, sellerid, buyerid, 
> producerid, versionid, changedocuments, hr, dt]
>  +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
> sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
> hr#16, dt#15]
> +- Project [eventid#5, id#6, referenceid#7, type#8, 
> referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, 
> changedocuments#14, dt#15, hr#16]
>+- Filter (dt#15 >= 2023-11-29)
>   +- SubqueryAlias 
> spark_catalog.default.order_history_version_audit_rno
>  +- Relation 
> spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
>  parquet
> {noformat}






[jira] [Created] (SPARK-48661) Upgrade RoaringBitmap to 1.1.0

2024-06-18 Thread Wei Guo (Jira)
Wei Guo created SPARK-48661:
---

 Summary: Upgrade RoaringBitmap to 1.1.0
 Key: SPARK-48661
 URL: https://issues.apache.org/jira/browse/SPARK-48661
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Affects Versions: 4.0.0
Reporter: Wei Guo









[jira] [Updated] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect

2024-06-18 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-48660:

Description: 
How to reproduce:

{code:sql}
CREATE TABLE order_history_version_audit_rno (
  eventid STRING,
  id STRING,
  referenceid STRING,
  type STRING,
  referencetype STRING,
  sellerid BIGINT,
  buyerid BIGINT,
  producerid STRING,
  versionid INT,
  changedocuments ARRAY<STRUCT<...: BIGINT, changeDetails: STRING>>,
  dt STRING,
  hr STRING)
USING parquet
PARTITIONED BY (dt, hr);

explain cost
CREATE TABLE order_history_version_audit_rno
USING parquet
PARTITIONED BY (dt)
CLUSTERED BY (id) INTO 1000 buckets
AS SELECT * FROM order_history_version_audit_rno
WHERE dt >= '2023-11-29';
{code}


{noformat}
spark-sql (default)> 
   > explain cost
   > CREATE TABLE order_history_version_audit_rno
   > USING parquet
   > PARTITIONED BY (dt)
   > CLUSTERED BY (id) INTO 1000 buckets
   > AS SELECT * FROM order_history_version_audit_rno
   > WHERE dt >= '2023-11-29';
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand 
`spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
[eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, 
versionid, changedocuments, hr, dt]
   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
hr#16, dt#15]
  +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
dt#15, hr#16]
 +- Filter (dt#15 >= 2023-11-29)
+- SubqueryAlias 
spark_catalog.default.order_history_version_audit_rno
   +- Relation 
spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
 parquet

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand
   +- CreateDataSourceTableAsSelectCommand 
`spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
[eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, 
versionid, changedocuments, hr, dt]
 +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
hr#16, dt#15]
+- Project [eventid#5, id#6, referenceid#7, type#8, 
referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, 
changedocuments#14, dt#15, hr#16]
   +- Filter (dt#15 >= 2023-11-29)
  +- SubqueryAlias 
spark_catalog.default.order_history_version_audit_rno
 +- Relation 
spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
 parquet
{noformat}


  was:
How to reproduce:
{noformat}
spark-sql (default)> 
   > explain cost
   > CREATE TABLE order_history_version_audit_rno
   > USING parquet
   > PARTITIONED BY (dt)
   > CLUSTERED BY (id) INTO 1000 buckets
   > AS SELECT * FROM order_history_version_audit_rno
   > WHERE dt >= '2023-11-29';
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand 
`spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
[eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, 
versionid, changedocuments, hr, dt]
   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
hr#16, dt#15]
  +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
dt#15, hr#16]
 +- Filter (dt#15 >= 2023-11-29)
+- SubqueryAlias 
spark_catalog.default.order_history_version_audit_rno
   +- Relation 
spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
 parquet

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand
   +- CreateDataSourceTableAsSelectCommand 
`spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
[eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, 
versionid, changedocuments, hr, dt]
 +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
hr#16, dt#15]
+- Project [eventid#5, id#6, referenceid#7, 

[jira] [Created] (SPARK-48660) The result of explain is incorrect for CreateTableAsSelect

2024-06-18 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-48660:
---

 Summary: The result of explain is incorrect for CreateTableAsSelect
 Key: SPARK-48660
 URL: https://issues.apache.org/jira/browse/SPARK-48660
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.5.1, 3.5.0, 4.0.0
Reporter: Yuming Wang


How to reproduce:
{noformat}
spark-sql (default)> 
   > explain cost
   > CREATE TABLE order_history_version_audit_rno
   > USING parquet
   > PARTITIONED BY (dt)
   > CLUSTERED BY (id) INTO 1000 buckets
   > AS SELECT * FROM order_history_version_audit_rno
   > WHERE dt >= '2023-11-29';
== Optimized Logical Plan ==
CreateDataSourceTableAsSelectCommand 
`spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
[eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, 
versionid, changedocuments, hr, dt]
   +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
hr#16, dt#15]
  +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
dt#15, hr#16]
 +- Filter (dt#15 >= 2023-11-29)
+- SubqueryAlias 
spark_catalog.default.order_history_version_audit_rno
   +- Relation 
spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
 parquet

== Physical Plan ==
Execute CreateDataSourceTableAsSelectCommand
   +- CreateDataSourceTableAsSelectCommand 
`spark_catalog`.`default`.`order_history_version_audit_rno`, ErrorIfExists, 
[eventid, id, referenceid, type, referencetype, sellerid, buyerid, producerid, 
versionid, changedocuments, hr, dt]
 +- Project [eventid#5, id#6, referenceid#7, type#8, referencetype#9, 
sellerid#10L, buyerid#11L, producerid#12, versionid#13, changedocuments#14, 
hr#16, dt#15]
+- Project [eventid#5, id#6, referenceid#7, type#8, 
referencetype#9, sellerid#10L, buyerid#11L, producerid#12, versionid#13, 
changedocuments#14, dt#15, hr#16]
   +- Filter (dt#15 >= 2023-11-29)
  +- SubqueryAlias 
spark_catalog.default.order_history_version_audit_rno
 +- Relation 
spark_catalog.default.order_history_version_audit_rno[eventid#5,id#6,referenceid#7,type#8,referencetype#9,sellerid#10L,buyerid#11L,producerid#12,versionid#13,changedocuments#14,dt#15,hr#16]
 parquet
{noformat}







[jira] [Created] (SPARK-48659) Unify v1 and v2 ALTER TABLE .. SET TBLPROPERTIES tests

2024-06-18 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-48659:
---

 Summary: Unify v1 and v2 ALTER TABLE .. SET TBLPROPERTIES tests
 Key: SPARK-48659
 URL: https://issues.apache.org/jira/browse/SPARK-48659
 Project: Spark
  Issue Type: Improvement
  Components: SQL, Tests
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Reopened] (SPARK-48567) Pyspark StreamingQuery lastProgress and friend should return actual StreamingQueryProgress

2024-06-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-48567:
--
  Assignee: (was: Wei Liu)

Reverted at 
https://github.com/apache/spark/commit/d067fc6c1635dfe7730223021e912e78637bb791

> Pyspark StreamingQuery lastProgress and friend should return actual 
> StreamingQueryProgress
> --
>
> Key: SPARK-48567
> URL: https://issues.apache.org/jira/browse/SPARK-48567
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Updated] (SPARK-48567) Pyspark StreamingQuery lastProgress and friend should return actual StreamingQueryProgress

2024-06-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-48567:
-
Fix Version/s: (was: 4.0.0)

> Pyspark StreamingQuery lastProgress and friend should return actual 
> StreamingQueryProgress
> --
>
> Key: SPARK-48567
> URL: https://issues.apache.org/jira/browse/SPARK-48567
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SS
>Affects Versions: 4.0.0
>Reporter: Wei Liu
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48651) Document configuring different JDK for Spark on YARN

2024-06-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48651.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47010
[https://github.com/apache/spark/pull/47010]

> Document configuring different JDK for Spark on YARN
> 
>
> Key: SPARK-48651
> URL: https://issues.apache.org/jira/browse/SPARK-48651
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48651) Document configuring different JDK for Spark on YARN

2024-06-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48651:
---

Assignee: Cheng Pan

> Document configuring different JDK for Spark on YARN
> 
>
> Key: SPARK-48651
> URL: https://issues.apache.org/jira/browse/SPARK-48651
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48658) Encode/Decode functions report coding error instead of mojibake

2024-06-18 Thread Kent Yao (Jira)
Kent Yao created SPARK-48658:


 Summary: Encode/Decode functions report coding error instead of 
mojibake
 Key: SPARK-48658
 URL: https://issues.apache.org/jira/browse/SPARK-48658
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: Kent Yao









[jira] [Resolved] (SPARK-48601) Fix Spark internal error when setting null value for jdbc option

2024-06-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48601.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46955
[https://github.com/apache/spark/pull/46955]

> Fix Spark internal error when setting null value for jdbc option
> 
>
> Key: SPARK-48601
> URL: https://issues.apache.org/jira/browse/SPARK-48601
> Project: Spark
>  Issue Type: Task
>  Components: Connect
>Affects Versions: 3.4.3
>Reporter: Stevo Mitric
>Assignee: Stevo Mitric
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> When setting a null value for any JDBC option, a Spark internal error is 
> thrown, caused by a java.lang.NullPointerException.
>  
> Make this exception more user-friendly and explain what is causing it.
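
A sketch of the assumed shape of the repro (hypothetical connection details, 
not taken from the ticket):

{code:java}
// Any JDBC option with a null value used to surface a raw
// java.lang.NullPointerException as an internal Spark error.
spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/test") // hypothetical URL
  .option("dbtable", null.asInstanceOf[String])      // null option value
  .load()
{code}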






[jira] [Assigned] (SPARK-48649) Add "ignoreInvalidPartitionPaths" and "spark.sql.files.ignoreInvalidPartitionPaths" configs to allow ignoring invalid partition paths

2024-06-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48649:
---

Assignee: Ivan Sadikov

> Add "ignoreInvalidPartitionPaths" and 
> "spark.sql.files.ignoreInvalidPartitionPaths" configs to allow ignoring 
> invalid partition paths
> -
>
> Key: SPARK-48649
> URL: https://issues.apache.org/jira/browse/SPARK-48649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
>  Labels: pull-request-available
>
> Given a table directory with invalid partitions such as:
> {code:java}
> table/
>   invalid/...
>   part=1/...
>   part=2/...
>   part=3/...{code}
> a SQL query reading all of the partitions would fail with 
> {code:java}
> java.lang.AssertionError: assertion failed: Conflicting directory structures 
> detected. Suspicious paths: 
>  table 
>  table/invalid {code}
>  
> I propose to add a data source option and Spark SQL config to ignore invalid 
> partition paths. The config will be disabled by default to retain the current 
> behaviour.
> {code:java}
> spark.conf.set("spark.sql.files.ignoreInvalidPartitionPaths", "true"){code}
> {code:java}
> spark.read.format("parquet").option("ignoreInvalidPartitionPaths", 
> "true").load(...)  {code}






[jira] [Resolved] (SPARK-48649) Add "ignoreInvalidPartitionPaths" and "spark.sql.files.ignoreInvalidPartitionPaths" configs to allow ignoring invalid partition paths

2024-06-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48649.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47006
[https://github.com/apache/spark/pull/47006]

> Add "ignoreInvalidPartitionPaths" and 
> "spark.sql.files.ignoreInvalidPartitionPaths" configs to allow ignoring 
> invalid partition paths
> -
>
> Key: SPARK-48649
> URL: https://issues.apache.org/jira/browse/SPARK-48649
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Ivan Sadikov
>Assignee: Ivan Sadikov
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Given a table directory with invalid partitions such as:
> {code:java}
> table/
>   invalid/...
>   part=1/...
>   part=2/...
>   part=3/...{code}
> a SQL query reading all of the partitions would fail with 
> {code:java}
> java.lang.AssertionError: assertion failed: Conflicting directory structures 
> detected. Suspicious paths: 
>  table 
>  table/invalid {code}
>  
> I propose to add a data source option and Spark SQL config to ignore invalid 
> partition paths. The config will be disabled by default to retain the current 
> behaviour.
> {code:java}
> spark.conf.set("spark.sql.files.ignoreInvalidPartitionPaths", "true"){code}
> {code:java}
> spark.read.format("parquet").option("ignoreInvalidPartitionPaths", 
> "true").load(...)  {code}






[jira] [Created] (SPARK-48657) The document is out of date and needs to be updated

2024-06-18 Thread BrevinFu (Jira)
BrevinFu created SPARK-48657:


 Summary: The document is out of date and needs to be updated
 Key: SPARK-48657
 URL: https://issues.apache.org/jira/browse/SPARK-48657
 Project: Spark
  Issue Type: IT Help
  Components: Examples, Java API, SQL
Affects Versions: 3.5.1
 Environment: Windows 10, Java
Reporter: BrevinFu


I am looking for a data source implementation for Spark SQL 3.5.1 that can 
accept MQTT and REST interfaces. Through a Google search I found that the 
latest implementations are two years old, and Java implementations are very 
few. I also found that custom data sources come in v1 and v2 variants, plus 
unbounded tables, and I am confused about which implementation I should use 
for 3.5.1 and how to implement it. Can you update the documentation or help 
me? Thank you.






[jira] [Assigned] (SPARK-48634) Avoid statically initialize threadpool at ExecutePlanResponseReattachableIterator

2024-06-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48634:


Assignee: Hyukjin Kwon

> Avoid statically initialize threadpool at 
> ExecutePlanResponseReattachableIterator
> -
>
> Key: SPARK-48634
> URL: https://issues.apache.org/jira/browse/SPARK-48634
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> Avoid having ExecutePlanResponseReattachableIterator._release_thread_pool 
> statically initialize a ThreadPool, which might be dragged in during pickling.






[jira] [Resolved] (SPARK-48634) Avoid statically initialize threadpool at ExecutePlanResponseReattachableIterator

2024-06-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48634.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46993
[https://github.com/apache/spark/pull/46993]

> Avoid statically initialize threadpool at 
> ExecutePlanResponseReattachableIterator
> -
>
> Key: SPARK-48634
> URL: https://issues.apache.org/jira/browse/SPARK-48634
> Project: Spark
>  Issue Type: Bug
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Avoid having ExecutePlanResponseReattachableIterator._release_thread_pool 
> statically initialize a ThreadPool, which might be dragged in during pickling.






[jira] [Created] (SPARK-48656) ArrayIndexOutOfBoundsException in CartesianRDD getPartitions

2024-06-18 Thread Nick Young (Jira)
Nick Young created SPARK-48656:
--

 Summary: ArrayIndexOutOfBoundsException in CartesianRDD 
getPartitions
 Key: SPARK-48656
 URL: https://issues.apache.org/jira/browse/SPARK-48656
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 4.0.0
Reporter: Nick Young


{code:java}
val rdd1 = spark.sparkContext.parallelize(Seq(1, 2, 3), numSlices = 65536)
val rdd2 = spark.sparkContext.parallelize(Seq(1, 2, 3), numSlices = 65536)
rdd2.cartesian(rdd1).partitions
{code}

Throws `ArrayIndexOutOfBoundsException: 0` at CartesianRDD.scala:69 because 
`s1.index * numPartitionsInRdd2 + s2.index` overflows and wraps to 0. We should 
provide a better error message indicating that the number of partitions 
overflowed, so it is easier for the user to debug.
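
A self-contained sketch of the wrap-around (assuming 32-bit Int arithmetic for 
the partition index, as the line above suggests):

{code:java}
// 65536 * 65536 = 2^32, which wraps to 0 in a 32-bit Int, so the partitions
// array is allocated with length 0 and the first write throws
// ArrayIndexOutOfBoundsException: 0.
val n = 65536
val total: Int = n * n
assert(total == 0)
{code}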






[jira] [Resolved] (SPARK-48646) Refine Python data source API docstring and type hints

2024-06-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48646?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48646.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47003
[https://github.com/apache/spark/pull/47003]

> Refine Python data source API docstring and type hints
> --
>
> Key: SPARK-48646
> URL: https://issues.apache.org/jira/browse/SPARK-48646
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Assignee: Allison Wang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Improve the type hints and docstrings for datasource.py






[jira] [Created] (SPARK-48655) SPJ: Add tests for shuffle skipping for aggregate queries

2024-06-18 Thread Szehon Ho (Jira)
Szehon Ho created SPARK-48655:
-

 Summary: SPJ: Add tests for shuffle skipping for aggregate queries
 Key: SPARK-48655
 URL: https://issues.apache.org/jira/browse/SPARK-48655
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 4.0.0
Reporter: Szehon Ho









[jira] [Created] (SPARK-48654) Kafka source should allow "enable.auto.commit" setting

2024-06-18 Thread Raghu Angadi (Jira)
Raghu Angadi created SPARK-48654:


 Summary: Kafka source should allow "enable.auto.commit" setting
 Key: SPARK-48654
 URL: https://issues.apache.org/jira/browse/SPARK-48654
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.4.3
Reporter: Raghu Angadi


The Kafka source does not allow setting the "enable.auto.commit" 
configuration, and it is not clear why. We should remove this restriction, 
especially with the new admin-client based consumer (which is the current 
default).
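
For illustration, a sketch of the option being rejected today (broker and 
topic names are hypothetical):

{code:java}
// Setting kafka.enable.auto.commit currently fails option validation in the
// Kafka source; the proposal is to allow it.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .option("kafka.enable.auto.commit", "true") // rejected today
  .load()
{code}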






[jira] [Updated] (SPARK-48586) Remove lock acquisition in doMaintenance() by making a deep copy of RocksDBFileManager in load()

2024-06-18 Thread Riya Verma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48586?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Riya Verma updated SPARK-48586:
---
Summary: Remove lock acquisition in doMaintenance() by making a deep copy 
of RocksDBFileManager in load()  (was: Remove lock contention between 
maintenance and task threads)

> Remove lock acquisition in doMaintenance() by making a deep copy of 
> RocksDBFileManager in load()
> 
>
> Key: SPARK-48586
> URL: https://issues.apache.org/jira/browse/SPARK-48586
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.4.3
>Reporter: Riya Verma
>Priority: Major
>  Labels: pull-request-available
>
> Currently the lock of the *RocksDB* state store is acquired when uploading 
> the snapshot inside maintenance tasks when change log checkpointing is 
> enabled, which causes lock contention between query processing tasks and 
> state maintenance thread. To eliminate the lock contention, lock acquisition 
> inside maintenance tasks should be avoided. To prevent race conditions 
> between task and maintenance threads, we can ensure that *RocksDBFileManager* 
> has a linear history by making a deep copy of *RocksDBFileManager* every time 
> a previous version is loaded. The original file manager is not affected by 
> future state updates, and the new file manager is not affected by background 
> tasks that attempt to upload a snapshot.
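
A conceptual, self-contained sketch of the linear-history idea (hypothetical 
types, not Spark's actual internals): each load() hands the task thread a deep 
copy, so the maintenance thread can read the old manager without taking the 
lock.

{code:java}
// Hypothetical stand-in for RocksDBFileManager's mutable bookkeeping.
case class FileManagerState(files: Map[String, Long])

var current = FileManagerState(Map("000001.sst" -> 1L))

// The maintenance thread keeps a reference to the old state for its upload.
val forMaintenance = current

// load() switches the task thread to a copy; since `files` is immutable,
// copying the case class guarantees `forMaintenance` never changes.
current = current.copy(files = current.files + ("000002.sst" -> 2L))

assert(forMaintenance.files.size == 1) // unaffected by later updates
{code}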






[jira] [Updated] (SPARK-48653) Fix Python data source error class references

2024-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48653:
---
Labels: pull-request-available  (was: )

> Fix Python data source error class references
> -
>
> Key: SPARK-48653
> URL: https://issues.apache.org/jira/browse/SPARK-48653
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Fix invalid error class references.






[jira] [Created] (SPARK-48653) Fix Python data source error class references

2024-06-18 Thread Allison Wang (Jira)
Allison Wang created SPARK-48653:


 Summary: Fix Python data source error class references
 Key: SPARK-48653
 URL: https://issues.apache.org/jira/browse/SPARK-48653
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Fix invalid error class references.






[jira] [Created] (SPARK-48652) Casting Issue in Spark SQL: String Column Compared to Integer Value Yields Empty Results

2024-06-18 Thread Abhishek Singh (Jira)
Abhishek Singh created SPARK-48652:
--

 Summary: Casting Issue in Spark SQL: String Column Compared to 
Integer Value Yields Empty Results
 Key: SPARK-48652
 URL: https://issues.apache.org/jira/browse/SPARK-48652
 Project: Spark
  Issue Type: Question
  Components: Spark Core, SQL
Affects Versions: 3.3.2
Reporter: Abhishek Singh


In Spark SQL, comparing a string column to an integer value can lead to 
unexpected results due to implicit type casting. When a string column is 
compared to an integer, Spark attempts to cast the strings to integers, which 
fails for non-numeric strings, resulting in an empty result set.


{code:java}
case class Person(id: String, name: String)
val personDF = Seq(Person("a", "amit"), Person("b", "abhishek")).toDF()
personDF.createOrReplaceTempView("person_ddf")
val sqlQuery = "SELECT * FROM person_ddf WHERE id <> -1"
val resultDF = spark.sql(sqlQuery)
resultDF.show() // Empty result due to type casting issue 

{code}
Below are the logical and physical plans I am getting:
{code:java}
== Parsed Logical Plan ==
'Project [*]
+- 'Filter NOT ('id = -1)
   +- 'UnresolvedRelation [person_ddf], [], false

== Analyzed Logical Plan ==
id: string, name: string
Project [id#356, name#357]
+- Filter NOT (cast(id#356 as int) = -1)
   +- SubqueryAlias person_ddf
      +- View (`person_ddf`, [id#356,name#357])
         +- LocalRelation [id#356, name#357]

== Optimized Logical Plan ==
LocalRelation <empty>, [id#356, name#357]

== Physical Plan ==
LocalTableScan <empty>, [id#356, name#357]

== Physical Plan ==
LocalTableScan (1) {code}






[jira] [Updated] (SPARK-48573) Upgrade ICU version

2024-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48573:
---
Labels: pull-request-available  (was: )

> Upgrade ICU version
> ---
>
> Key: SPARK-48573
> URL: https://issues.apache.org/jira/browse/SPARK-48573
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-48573) Upgrade ICU version

2024-06-18 Thread Mihailo Milosevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihailo Milosevic updated SPARK-48573:
--
Parent: (was: SPARK-46837)
Issue Type: Bug  (was: Sub-task)

> Upgrade ICU version
> ---
>
> Key: SPARK-48573
> URL: https://issues.apache.org/jira/browse/SPARK-48573
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>







[jira] [Updated] (SPARK-48573) Upgrade ICU version

2024-06-18 Thread Mihailo Milosevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihailo Milosevic updated SPARK-48573:
--
Epic Link: SPARK-46830

> Upgrade ICU version
> ---
>
> Key: SPARK-48573
> URL: https://issues.apache.org/jira/browse/SPARK-48573
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>







[jira] [Updated] (SPARK-48573) Upgrade ICU version

2024-06-18 Thread Mihailo Milosevic (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mihailo Milosevic updated SPARK-48573:
--
Summary: Upgrade ICU version  (was: TBD)

> Upgrade ICU version
> ---
>
> Key: SPARK-48573
> URL: https://issues.apache.org/jira/browse/SPARK-48573
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>







[jira] [Updated] (SPARK-48651) Document configuring different JDK for Spark on YARN

2024-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48651:
---
Labels: pull-request-available  (was: )

> Document configuring different JDK for Spark on YARN
> 
>
> Key: SPARK-48651
> URL: https://issues.apache.org/jira/browse/SPARK-48651
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: Cheng Pan
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-48651) Document configuring different JDK for Spark on YARN

2024-06-18 Thread Cheng Pan (Jira)
Cheng Pan created SPARK-48651:
-

 Summary: Document configuring different JDK for Spark on YARN
 Key: SPARK-48651
 URL: https://issues.apache.org/jira/browse/SPARK-48651
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 4.0.0
Reporter: Cheng Pan









[jira] [Assigned] (SPARK-48280) Improve collation testing surface area using expression walking

2024-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48280:
--

Assignee: (was: Apache Spark)

> Improve collation testing surface area using expression walking
> ---
>
> Key: SPARK-48280
> URL: https://issues.apache.org/jira/browse/SPARK-48280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-48280) Improve collation testing surface area using expression walking

2024-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot reassigned SPARK-48280:
--

Assignee: Apache Spark

> Improve collation testing surface area using expression walking
> ---
>
> Key: SPARK-48280
> URL: https://issues.apache.org/jira/browse/SPARK-48280
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Mihailo Milosevic
>Assignee: Apache Spark
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48459) Implement DataFrameQueryContext in Spark Connect

2024-06-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48459.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46789
[https://github.com/apache/spark/pull/46789]

> Implement DataFrameQueryContext in Spark Connect
> 
>
> Key: SPARK-48459
> URL: https://issues.apache.org/jira/browse/SPARK-48459
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Implements the same https://github.com/apache/spark/pull/45377 in Spark 
> Connect






[jira] [Assigned] (SPARK-48459) Implement DataFrameQueryContext in Spark Connect

2024-06-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-48459:


Assignee: Hyukjin Kwon

> Implement DataFrameQueryContext in Spark Connect
> 
>
> Key: SPARK-48459
> URL: https://issues.apache.org/jira/browse/SPARK-48459
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>  Labels: pull-request-available
>
> Implements the same https://github.com/apache/spark/pull/45377 in Spark 
> Connect






[jira] [Updated] (SPARK-48650) Display correct call site from IPython Notebook

2024-06-18 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-48650:
---
Labels: pull-request-available  (was: )

> Display correct call site from IPython Notebook
> ---
>
> Key: SPARK-48650
> URL: https://issues.apache.org/jira/browse/SPARK-48650
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 4.0.0
>Reporter: Haejoon Lee
>Priority: Major
>  Labels: pull-request-available
>
> Current IPython Notebook does not show proper DataFrameQueryContext






[jira] [Created] (SPARK-48650) Display correct call site from IPython Notebook

2024-06-18 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-48650:
---

 Summary: Display correct call site from IPython Notebook
 Key: SPARK-48650
 URL: https://issues.apache.org/jira/browse/SPARK-48650
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 4.0.0
Reporter: Haejoon Lee


Current IPython Notebook does not show proper DataFrameQueryContext






[jira] [Resolved] (SPARK-48342) [M0] Parser support

2024-06-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-48342.
-
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46665
[https://github.com/apache/spark/pull/46665]

> [M0] Parser support
> ---
>
> Key: SPARK-48342
> URL: https://issues.apache.org/jira/browse/SPARK-48342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>
> Implement parsing for SQL scripting, with all supporting changes for the 
> upcoming interpreter implementation and future extensions of the parser:
>  * Parser - support only compound statements
>  * Parser testing
>  
> For more details, the design doc can be found in the parent Jira item.






[jira] [Assigned] (SPARK-48342) [M0] Parser support

2024-06-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-48342:
---

Assignee: David Milicevic

> [M0] Parser support
> ---
>
> Key: SPARK-48342
> URL: https://issues.apache.org/jira/browse/SPARK-48342
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 4.0.0
>Reporter: David Milicevic
>Assignee: David Milicevic
>Priority: Major
>  Labels: pull-request-available
>
> Implement parsing for SQL scripting, with all supporting changes for the 
> upcoming interpreter implementation and future extensions of the parser:
>  * Parser - support only compound statements
>  * Parser testing
>  
> For more details, the design doc can be found in the parent Jira item.






[jira] [Resolved] (SPARK-48585) Make `JdbcDialect.classifyException` throw out the original exception

2024-06-18 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie resolved SPARK-48585.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 46937
[https://github.com/apache/spark/pull/46937]

> Make `JdbcDialect.classifyException` throw out the original exception
> -
>
> Key: SPARK-48585
> URL: https://issues.apache.org/jira/browse/SPARK-48585
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-48585) Make `JdbcDialect.classifyException` throw out the original exception

2024-06-18 Thread Yang Jie (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Jie reassigned SPARK-48585:


Assignee: BingKun Pan

> Make `JdbcDialect.classifyException` throw out the original exception
> -
>
> Key: SPARK-48585
> URL: https://issues.apache.org/jira/browse/SPARK-48585
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Critical
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-48647) Refine the error message for YearMonthIntervalType in df.collect

2024-06-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-48647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-48647.
--
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 47004
[https://github.com/apache/spark/pull/47004]

> Refine the error message for YearMonthIntervalType in df.collect
> 
>
> Key: SPARK-48647
> URL: https://issues.apache.org/jira/browse/SPARK-48647
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>



