[jira] [Created] (SPARK-41231) Built-in SQL Function Improvement
Ruifeng Zheng created SPARK-41231: - Summary: Built-in SQL Function Improvement Key: SPARK-41231 URL: https://issues.apache.org/jira/browse/SPARK-41231 Project: Spark Issue Type: New Feature Components: PySpark, SQL Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41229) When using `db_name.temp_table_name`, an exception will be thrown
[ https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jingxiong zhong updated SPARK-41229: Description: SQL1: ```with table_hive1 as (select * from db1.table_hive) select * from db1.table_hive1;``` It throws the exception `org.apache.spark.sql.AnalysisException: Table or view not found: db1.table_hive1;`, but Spark 2.4.3 works well. SQL2: ```with table_hive1 as (select * from db1.table_hive) select * from table_hive1;``` It works well. I'm a little confused. Is this syntax with a database name not supported? was: SQL1: ```with table_hive1 as (select * from db1.table_hive) select * from db1.table_hive1;``` It throws the exception `org.apache.spark.sql.AnalysisException: Table or view not found: bigdata_qa.zjx_hive1;`, but Spark 2.4.3 works well. SQL2: ```with table_hive1 as (select * from db1.table_hive) select * from table_hive1;``` It works well. I'm a little confused. Is this syntax with a database name not supported? > When using `db_name.temp_table_name`, an exception will be thrown > --- > > Key: SPARK-41229 > URL: https://issues.apache.org/jira/browse/SPARK-41229 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: spark3.2.0 > hadoop2.7.3 > hive-ms 2.3.9 >Reporter: jingxiong zhong >Priority: Blocker > > SQL1: > ```with table_hive1 as (select * from db1.table_hive) > select * from db1.table_hive1;``` > It throws the exception `org.apache.spark.sql.AnalysisException: Table or > view not found: db1.table_hive1;`, but Spark 2.4.3 works well. > SQL2: > ```with table_hive1 as (select * from db1.table_hive) > select * from table_hive1;``` > It works well. > I'm a little confused. Is this syntax with a database name not supported? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
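The failing/working pair in the report can be reproduced outside Spark: in standard SQL, a CTE introduces an unqualified alias, so qualifying it with a database (schema) name makes the analyzer look for a real catalog table instead of the CTE. A minimal sketch of that behavior using SQLite (an analogy to illustrate the name-resolution rule, not Spark itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base (x INTEGER)")
conn.execute("INSERT INTO base VALUES (1)")

# Unqualified reference: the CTE alias resolves normally.
ok = conn.execute("WITH t AS (SELECT * FROM base) SELECT * FROM t").fetchall()

# Schema-qualified reference: `main.t` bypasses the CTE and is looked up
# in the catalog, where no such table exists -- analogous to Spark's
# `Table or view not found` AnalysisException.
err = None
try:
    conn.execute("WITH t AS (SELECT * FROM base) SELECT * FROM main.t")
except sqlite3.OperationalError as e:
    err = e

print(ok)   # [(1,)]
print(err)  # the qualified lookup fails to resolve
```

This suggests the 3.2.0 behavior is standard-conforming and the 2.4.3 behavior (resolving `db1.table_hive1` to the CTE) was the anomaly, though the Jira thread does not settle that here.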
[jira] [Commented] (SPARK-41228) Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION
[ https://issues.apache.org/jira/browse/SPARK-41228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637600#comment-17637600 ] Apache Spark commented on SPARK-41228: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/38769 > Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION > --- > > Key: SPARK-41228 > URL: https://issues.apache.org/jira/browse/SPARK-41228 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > The error class name is tricky, so we should fix it properly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41228) Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION
[ https://issues.apache.org/jira/browse/SPARK-41228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41228: Assignee: (was: Apache Spark) > Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION > --- > > Key: SPARK-41228 > URL: https://issues.apache.org/jira/browse/SPARK-41228 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > The error class name is tricky, so we should fix it properly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41228) Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION
[ https://issues.apache.org/jira/browse/SPARK-41228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637599#comment-17637599 ] Apache Spark commented on SPARK-41228: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/38769 > Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION > --- > > Key: SPARK-41228 > URL: https://issues.apache.org/jira/browse/SPARK-41228 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > The error class name is tricky, so we should fix it properly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41228) Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION
[ https://issues.apache.org/jira/browse/SPARK-41228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41228: Assignee: Apache Spark > Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION > --- > > Key: SPARK-41228 > URL: https://issues.apache.org/jira/browse/SPARK-41228 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Apache Spark >Priority: Major > > The error class name is tricky, so we should fix it properly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41230) Remove `str` from Aggregate expression type
[ https://issues.apache.org/jira/browse/SPARK-41230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41230: Assignee: Apache Spark > Remove `str` from Aggregate expression type > --- > > Key: SPARK-41230 > URL: https://issues.apache.org/jira/browse/SPARK-41230 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
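Removing `str` from the Aggregate expression type means normalizing strings into column expressions at the API boundary instead of carrying a union type through the plan. A hypothetical sketch of that pattern (the class and function names are illustrative, not the actual Spark Connect `plan.py` code):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """Illustrative stand-in for a column expression."""
    name: str


def to_column(expr):
    """Normalize at the boundary so plan nodes never hold bare strings."""
    if isinstance(expr, Column):
        return expr
    if isinstance(expr, str):
        return Column(expr)
    raise TypeError(f"unsupported expression type: {type(expr).__name__}")


class Aggregate:
    """Sketch of a plan node that stores Column only, never str."""
    def __init__(self, grouping):
        self.grouping = [to_column(g) for g in grouping]


# Callers may still pass strings for convenience; the node holds Columns.
agg = Aggregate(["name", Column("age")])
names = [c.name for c in agg.grouping]
print(names)  # ['name', 'age']
```

Keeping the union out of the plan node means downstream code (serialization, analysis) needs only one case instead of two.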
[jira] [Commented] (SPARK-41230) Remove `str` from Aggregate expression type
[ https://issues.apache.org/jira/browse/SPARK-41230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637597#comment-17637597 ] Apache Spark commented on SPARK-41230: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38768 > Remove `str` from Aggregate expression type > --- > > Key: SPARK-41230 > URL: https://issues.apache.org/jira/browse/SPARK-41230 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41230) Remove `str` from Aggregate expression type
[ https://issues.apache.org/jira/browse/SPARK-41230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41230: Assignee: (was: Apache Spark) > Remove `str` from Aggregate expression type > --- > > Key: SPARK-41230 > URL: https://issues.apache.org/jira/browse/SPARK-41230 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41230) Remove `str` from Aggregate expression type
[ https://issues.apache.org/jira/browse/SPARK-41230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637596#comment-17637596 ] Apache Spark commented on SPARK-41230: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/38768 > Remove `str` from Aggregate expression type > --- > > Key: SPARK-41230 > URL: https://issues.apache.org/jira/browse/SPARK-41230 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41229) When using `db_name.temp_table_name`, an exception will be thrown
[ https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637595#comment-17637595 ] jingxiong zhong commented on SPARK-41229: - [~cloud_fan] Could you help me with this? > When using `db_name.temp_table_name`, an exception will be thrown > --- > > Key: SPARK-41229 > URL: https://issues.apache.org/jira/browse/SPARK-41229 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: spark3.2.0 > hadoop2.7.3 > hive-ms 2.3.9 >Reporter: jingxiong zhong >Priority: Blocker > > SQL1: > ```with table_hive1 as (select * from db1.table_hive) > select * from db1.table_hive1;``` > It throws the exception `org.apache.spark.sql.AnalysisException: Table or > view not found: bigdata_qa.zjx_hive1;`, but Spark 2.4.3 works well. > SQL2: > ```with table_hive1 as (select * from db1.table_hive) > select * from table_hive1;``` > It works well. > I'm a little confused. Is this syntax with a database name not supported? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41230) Remove `str` from Aggregate expression type
[ https://issues.apache.org/jira/browse/SPARK-41230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-41230: - Summary: Remove `str` from Aggregate expression type (was: Remove `str` from Aggregate) > Remove `str` from Aggregate expression type > --- > > Key: SPARK-41230 > URL: https://issues.apache.org/jira/browse/SPARK-41230 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41230) Remove `str` from Aggregate
[ https://issues.apache.org/jira/browse/SPARK-41230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-41230: - Summary: Remove `str` from Aggregate (was: Remove `str` from Class Aggregate in Plan.py) > Remove `str` from Aggregate > --- > > Key: SPARK-41230 > URL: https://issues.apache.org/jira/browse/SPARK-41230 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41227) Implement `DataFrame.crossJoin`
[ https://issues.apache.org/jira/browse/SPARK-41227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637593#comment-17637593 ] Rui Wang commented on SPARK-41227: -- +1 to having this to match the existing PySpark API. > Implement `DataFrame.crossJoin` > --- > > Key: SPARK-41227 > URL: https://issues.apache.org/jira/browse/SPARK-41227 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41183) Add an extension API to do plan normalization for caching
[ https://issues.apache.org/jira/browse/SPARK-41183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637594#comment-17637594 ] Apache Spark commented on SPARK-41183: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/38767 > Add an extension API to do plan normalization for caching > - > > Key: SPARK-41183 > URL: https://issues.apache.org/jira/browse/SPARK-41183 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41230) Remove `str` from Class Aggregate in Plan.py
Rui Wang created SPARK-41230: Summary: Remove `str` from Class Aggregate in Plan.py Key: SPARK-41230 URL: https://issues.apache.org/jira/browse/SPARK-41230 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637592#comment-17637592 ] Apache Spark commented on SPARK-35531: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/38765 > Can not insert into hive bucket table if create table with upper case schema > > > Key: SPARK-35531 > URL: https://issues.apache.org/jira/browse/SPARK-35531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.1, 3.2.0 >Reporter: Hongyi Zhang >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0, 3.1.4 > > > > > create table TEST1( > V1 BIGINT, > S1 INT) > partitioned by (PK BIGINT) > clustered by (V1) > sorted by (S1) > into 200 buckets > STORED AS PARQUET; > > insert into test1 > select > * from values(1,1,1); > > > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
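The `HiveException` above stems from a case-sensitive comparison: the DDL declares the bucket column as `V1`, while the Hive metastore stores column names lower-cased (`v1`, `s1`). A hypothetical helper showing the mismatch and the case-insensitive check that avoids it (illustrative only, not the actual Hive/Spark validation code):

```python
def bucket_cols_match(bucket_cols, table_cols, case_sensitive=False):
    """Check that every bucket column is one of the table's columns."""
    if case_sensitive:
        known = set(table_cols)
        return all(c in known for c in bucket_cols)
    known = {c.lower() for c in table_cols}
    return all(c.lower() in known for c in bucket_cols)


# Metastore columns are lower-cased; the DDL wrote the bucket column as "V1".
strict = bucket_cols_match(["V1"], ["v1", "s1"], case_sensitive=True)
relaxed = bucket_cols_match(["V1"], ["v1", "s1"], case_sensitive=False)
print(strict)   # False -> "Bucket columns V1 is not part of the table columns"
print(relaxed)  # True  -> insert succeeds
```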
[jira] [Commented] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema
[ https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637591#comment-17637591 ] Apache Spark commented on SPARK-35531: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/38765 > Can not insert into hive bucket table if create table with upper case schema > > > Key: SPARK-35531 > URL: https://issues.apache.org/jira/browse/SPARK-35531 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.1, 3.2.0 >Reporter: Hongyi Zhang >Assignee: angerszhu >Priority: Major > Fix For: 3.3.0, 3.1.4 > > > > > create table TEST1( > V1 BIGINT, > S1 INT) > partitioned by (PK BIGINT) > clustered by (V1) > sorted by (S1) > into 200 buckets > STORED AS PARQUET; > > insert into test1 > select > * from values(1,1,1); > > > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not > part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), > FieldSchema(name:s1, type:int, comment:null)] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41229) When using `db_name.temp_table_name`, an exception will be thrown
[ https://issues.apache.org/jira/browse/SPARK-41229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jingxiong zhong updated SPARK-41229: Description: SQL1: ```with table_hive1 as (select * from db1.table_hive) select * from db1.table_hive1;``` It throws the exception `org.apache.spark.sql.AnalysisException: Table or view not found: bigdata_qa.zjx_hive1;`, but Spark 2.4.3 works well. SQL2: ```with table_hive1 as (select * from db1.table_hive) select * from table_hive1;``` It works well. I'm a little confused. Is this syntax with a database name not supported? was: ```with table_hive1 as (select * from db1.table_hive) select * from db1.table_hive1;``` It throws the exception `org.apache.spark.sql.AnalysisException: Table or view not found: bigdata_qa.zjx_hive1;`, but Spark 2.4.3 works well. ```with table_hive1 as (select * from db1.table_hive) select * from table_hive1;``` It works well. I'm a little confused. Is this syntax with a database name not supported? > When using `db_name.temp_table_name`, an exception will be thrown > --- > > Key: SPARK-41229 > URL: https://issues.apache.org/jira/browse/SPARK-41229 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 > Environment: spark3.2.0 > hadoop2.7.3 > hive-ms 2.3.9 >Reporter: jingxiong zhong >Priority: Blocker > > SQL1: > ```with table_hive1 as (select * from db1.table_hive) > select * from db1.table_hive1;``` > It throws the exception `org.apache.spark.sql.AnalysisException: Table or > view not found: bigdata_qa.zjx_hive1;`, but Spark 2.4.3 works well. > SQL2: > ```with table_hive1 as (select * from db1.table_hive) > select * from table_hive1;``` > It works well. > I'm a little confused. Is this syntax with a database name not supported? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41229) When using `db_name.temp_table_name`, an exception will be thrown
jingxiong zhong created SPARK-41229: --- Summary: When using `db_name.temp_table_name`, an exception will be thrown Key: SPARK-41229 URL: https://issues.apache.org/jira/browse/SPARK-41229 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.2.0 Environment: spark3.2.0 hadoop2.7.3 hive-ms 2.3.9 Reporter: jingxiong zhong ```with table_hive1 as (select * from db1.table_hive) select * from db1.table_hive1;``` It throws the exception `org.apache.spark.sql.AnalysisException: Table or view not found: bigdata_qa.zjx_hive1;`, but Spark 2.4.3 works well. ```with table_hive1 as (select * from db1.table_hive) select * from table_hive1;``` It works well. I'm a little confused. Is this syntax with a database name not supported? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0
[ https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637587#comment-17637587 ] Raza Jafri commented on SPARK-41219: Thank you for looking into this issue. I have also noticed that `IntegralDivide` has the output dataType = LongType, so why is it also overriding `resultDecimalType`? It will never be called AFAIK; it's only called from `dataType` in `BinaryArithmetic`. > Regression in IntegralDivide returning null instead of 0 > > > Key: SPARK-41219 > URL: https://issues.apache.org/jira/browse/SPARK-41219 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Raza Jafri >Priority: Major > > There seems to be a regression in Spark 3.4's integral divide. > > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +---------------------------------+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +---------------------------------+ > |                             null| > |                             null| > +---------------------------------+ > {code} > > While in Spark 3.3.0: > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +---------------------------------+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +---------------------------------+ > |                                0| > |                                0| > +---------------------------------+ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
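The Spark 3.3.0 result the reporter expects matches plain integral division on the decimal values: `div` drops the fractional part of the quotient, so a value smaller than 1 divided by 100 yields 0, not null. A quick sanity check of that arithmetic with Python's decimal module (this illustrates only the expected arithmetic, not Spark's decimal type resolution where the regression lives):

```python
from decimal import Decimal

results = []
for s in ("0.5944910", "0.3314242"):
    a = Decimal(s)            # fits DECIMAL(7,7): seven digits, all after the point
    results.append(a // 100)  # integral division, like SQL's `div`

print(results)  # [Decimal('0'), Decimal('0')] -- the Spark 3.3.0 behavior
```

(For these positive inputs, floor division and truncation toward zero agree, so `//` is a faithful stand-in for `div`.)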
[jira] [Resolved] (SPARK-40948) Introduce new error class: PATH_NOT_FOUND
[ https://issues.apache.org/jira/browse/SPARK-40948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-40948. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38575 [https://github.com/apache/spark/pull/38575] > Introduce new error class: PATH_NOT_FOUND > - > > Key: SPARK-40948 > URL: https://issues.apache.org/jira/browse/SPARK-40948 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > Recently we added many error classes with the LEGACY_ERROR_TEMP_ prefix. > We should update them to use proper error class names. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40948) Introduce new error class: PATH_NOT_FOUND
[ https://issues.apache.org/jira/browse/SPARK-40948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-40948: Assignee: Haejoon Lee > Introduce new error class: PATH_NOT_FOUND > - > > Key: SPARK-40948 > URL: https://issues.apache.org/jira/browse/SPARK-40948 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > Recently we added many error classes with the LEGACY_ERROR_TEMP_ prefix. > We should update them to use proper error class names. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41228) Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION
[ https://issues.apache.org/jira/browse/SPARK-41228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637567#comment-17637567 ] Haejoon Lee commented on SPARK-41228: - I'm working on this > Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION > --- > > Key: SPARK-41228 > URL: https://issues.apache.org/jira/browse/SPARK-41228 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Priority: Major > > The error class name is tricky, so we should fix it properly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41228) Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION
Haejoon Lee created SPARK-41228: --- Summary: Rename COLUMN_NOT_IN_GROUP_BY_CLAUSE to MISSING_AGGREGATION Key: SPARK-41228 URL: https://issues.apache.org/jira/browse/SPARK-41228 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Haejoon Lee The error class name is tricky, so we should fix it properly. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41206) Assign a name to the error class _LEGACY_ERROR_TEMP_1233
[ https://issues.apache.org/jira/browse/SPARK-41206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637557#comment-17637557 ] Apache Spark commented on SPARK-41206: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38764 > Assign a name to the error class _LEGACY_ERROR_TEMP_1233 > > > Key: SPARK-41206 > URL: https://issues.apache.org/jira/browse/SPARK-41206 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Assign a proper name to the legacy error class _LEGACY_ERROR_TEMP_1233 and > make it visible to users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41227) Implement `DataFrame.crossJoin`
Xinrong Meng created SPARK-41227: Summary: Implement `DataFrame.crossJoin` Key: SPARK-41227 URL: https://issues.apache.org/jira/browse/SPARK-41227 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Xinrong Meng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
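For reference, `crossJoin` produces the Cartesian product of two DataFrames' rows with no join condition. Its semantics can be sketched with plain Python via `itertools.product` (an illustration of the expected behavior, independent of the Connect implementation):

```python
from itertools import product

left = [("Alice", 1), ("Bob", 2)]   # rows of the left DataFrame
right = [("HR",), ("Eng",)]         # rows of the right DataFrame

# crossJoin: every left row paired with every right row, columns concatenated.
crossed = [l + r for l, r in product(left, right)]

print(len(crossed))  # 4 == len(left) * len(right)
print(crossed[0])    # ('Alice', 1, 'HR')
```

The quadratic row count is why Spark keeps this as a separate, explicit API rather than a default join behavior.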
[jira] [Commented] (SPARK-41201) Implement `DataFrame.SelectExpr` in Python client
[ https://issues.apache.org/jira/browse/SPARK-41201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637521#comment-17637521 ] Apache Spark commented on SPARK-41201: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38763 > Implement `DataFrame.SelectExpr` in Python client > - > > Key: SPARK-41201 > URL: https://issues.apache.org/jira/browse/SPARK-41201 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41201) Implement `DataFrame.SelectExpr` in Python client
[ https://issues.apache.org/jira/browse/SPARK-41201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41201. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38723 [https://github.com/apache/spark/pull/38723] > Implement `DataFrame.SelectExpr` in Python client > - > > Key: SPARK-41201 > URL: https://issues.apache.org/jira/browse/SPARK-41201 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41201) Implement `DataFrame.SelectExpr` in Python client
[ https://issues.apache.org/jira/browse/SPARK-41201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-41201: Assignee: Rui Wang > Implement `DataFrame.SelectExpr` in Python client > - > > Key: SPARK-41201 > URL: https://issues.apache.org/jira/browse/SPARK-41201 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41226) Refactor Spark types by introducing physical types
[ https://issues.apache.org/jira/browse/SPARK-41226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated SPARK-41226:

Description:
I am creating this one for Desmond Cheong since he can't sign up for an account because of https://infra.apache.org/blog/jira-public-signup-disabled.html

His description for this improvement:

The Spark type system currently supports multiple data types with the same physical representation in memory. For example, {{DateType}} and {{YearMonthIntervalType}} are both implemented using {{IntegerType}}. Because of this, operations on data types often involve case matching where multiple data types map to the same effects. To simplify this case-matching logic, we can introduce the notion of logical and physical data types, where multiple logical data types can be implemented with the same physical data type, and then perform case matching on physical data types. Some areas that can use this logical/physical type separation are:
* {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
* {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
* {{getAccessor}} in {{InternalRow.scala}}
* {{externalDataTypeFor}} in {{RowEncoder.scala}}
* {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
* {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
* {{doValidate}} in {{literals.scala}}

> Refactor Spark types by introducing physical types
> --
>
> Key: SPARK-41226
> URL: https://issues.apache.org/jira/browse/SPARK-41226
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Priority: Major
>
> I am creating this one for Desmond Cheong since he can't sign up for an account because of https://infra.apache.org/blog/jira-public-signup-disabled.html
>
> His description for this improvement:
> The Spark type system currently supports multiple data types with the same physical representation in memory. For example, {{DateType}} and {{YearMonthIntervalType}} are both implemented using {{IntegerType}}. Because of this, operations on data types often involve case matching where multiple data types map to the same effects. To simplify this case-matching logic, we can introduce the notion of logical and physical data types, where multiple logical data types can be implemented with the same physical data type, and then perform case matching on physical data types. Some areas that can use this logical/physical type separation are:
> * {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
> * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
> * {{getAccessor}} in {{InternalRow.scala}}
> * {{externalDataTypeFor}} in {{RowEncoder.scala}}
> * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
> * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
> * {{doValidate}} in {{literals.scala}}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
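The case-matching consolidation proposed above can be sketched as follows. This is a minimal Python illustration of the idea, not Spark's actual internals; the class names, the `PHYSICAL_TYPE` table, and the `get_int`/`get_long` row methods are all hypothetical stand-ins.

```python
# Illustrative sketch: several logical types share one physical type, so
# accessor dispatch matches on the physical type once instead of once per
# logical type. Names here are hypothetical, not Spark's actual API.

class PhysicalIntegerType: pass
class PhysicalLongType: pass

PHYSICAL_TYPE = {
    "DateType": PhysicalIntegerType,
    "YearMonthIntervalType": PhysicalIntegerType,
    "IntegerType": PhysicalIntegerType,
    "TimestampType": PhysicalLongType,
    "LongType": PhysicalLongType,
}

def accessor_for(logical_type: str):
    """Return a row accessor by case-matching on the physical type."""
    physical = PHYSICAL_TYPE[logical_type]
    if physical is PhysicalIntegerType:
        return lambda row, i: row.get_int(i)
    if physical is PhysicalLongType:
        return lambda row, i: row.get_long(i)
    raise NotImplementedError(logical_type)
```

With this split, adding a new integer-backed logical type only requires a new entry in the mapping, not a new branch in every accessor, copier, and code generator.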
[jira] [Updated] (SPARK-39591) SPIP: Asynchronous Offset Management in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-39591:

Shepherd: Jungtaek Lim

> SPIP: Asynchronous Offset Management in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Boyang Jerry Peng
> Priority: Major
> Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the offset to process up to for the current batch is persisted to durable storage, and at the end of every micro-batch a marker indicating the completion of the current micro-batch is persisted to durable storage. For pipelines that, for example, read from Kafka and write to Kafka, where end-to-end exactly-once is not supported and latency is sensitive, we can allow users to configure offset commits to be written asynchronously; the commit operation then does not contribute to the batch duration, effectively lowering the overall latency of the pipeline.
>
> SPIP Doc: https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing
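The core idea of the SPIP, taking the offset commit off the micro-batch critical path, can be sketched with a background committer thread. This is an illustration of the concept only, not Spark's implementation; the queue-based committer and the in-memory `commit_log` are assumptions of the sketch.

```python
# Sketch of asynchronous offset commits: the micro-batch loop enqueues the
# commit and moves on; a background thread performs the durable write.
import queue
import threading

commit_log = []        # stands in for durable storage
pending = queue.Queue()

def committer():
    while True:
        batch_id, offset = pending.get()
        if batch_id is None:        # shutdown sentinel
            return
        commit_log.append((batch_id, offset))  # durable write, off the hot path

t = threading.Thread(target=committer)
t.start()

for batch_id, offset in enumerate([100, 200, 300]):
    # ... process the micro-batch up to `offset` ...
    pending.put((batch_id, offset))  # enqueue commit; do not block the batch

pending.put((None, None))
t.join()
```

Because the commit no longer sits inside the batch, its latency stops contributing to batch duration; the trade-off (which is why the SPIP scopes this to pipelines without end-to-end exactly-once) is that on failure the last few offsets may not yet be durably committed.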
[jira] [Assigned] (SPARK-41226) Refactor Spark types by introducing physical types
[ https://issues.apache.org/jira/browse/SPARK-41226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41226:

Assignee: Apache Spark

> Refactor Spark types by introducing physical types
> --
>
> Key: SPARK-41226
> URL: https://issues.apache.org/jira/browse/SPARK-41226
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Assignee: Apache Spark
> Priority: Major
>
> I am creating this one for Desmond Cheong since he can't sign up for an account because of https://infra.apache.org/blog/jira-public-signup-disabled.html
>
> His description for this improvement:
> The Spark type system currently supports multiple data types with the same physical representation in memory. For example, {{DateType}} and {{YearMonthIntervalType}} are both implemented using {{IntegerType}}. Because of this, operations on data types often involve case matching where multiple data types map to the same effects. To simplify this case-matching logic, we can introduce the notion of logical and physical data types, where multiple logical data types can be implemented with the same physical data type, and then perform case matching on physical data types. Some areas that can use this logical/physical type separation are:
> * {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
> * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
> * {{getAccessor}} in {{InternalRow.scala}}
> * {{externalDataTypeFor}} in {{RowEncoder.scala}}
> * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
> * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
> * {{doValidate}} in {{literals.scala}}
[jira] [Assigned] (SPARK-41226) Refactor Spark types by introducing physical types
[ https://issues.apache.org/jira/browse/SPARK-41226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41226:

Assignee: (was: Apache Spark)

> Refactor Spark types by introducing physical types
> --
>
> Key: SPARK-41226
> URL: https://issues.apache.org/jira/browse/SPARK-41226
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Priority: Major
>
> I am creating this one for Desmond Cheong since he can't sign up for an account because of https://infra.apache.org/blog/jira-public-signup-disabled.html
>
> His description for this improvement:
> The Spark type system currently supports multiple data types with the same physical representation in memory. For example, {{DateType}} and {{YearMonthIntervalType}} are both implemented using {{IntegerType}}. Because of this, operations on data types often involve case matching where multiple data types map to the same effects. To simplify this case-matching logic, we can introduce the notion of logical and physical data types, where multiple logical data types can be implemented with the same physical data type, and then perform case matching on physical data types. Some areas that can use this logical/physical type separation are:
> * {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
> * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
> * {{getAccessor}} in {{InternalRow.scala}}
> * {{externalDataTypeFor}} in {{RowEncoder.scala}}
> * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
> * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
> * {{doValidate}} in {{literals.scala}}
[jira] [Commented] (SPARK-41226) Refactor Spark types by introducing physical types
[ https://issues.apache.org/jira/browse/SPARK-41226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637464#comment-17637464 ] Apache Spark commented on SPARK-41226:

User 'desmondcheongzx' has created a pull request for this issue: https://github.com/apache/spark/pull/38750

> Refactor Spark types by introducing physical types
> --
>
> Key: SPARK-41226
> URL: https://issues.apache.org/jira/browse/SPARK-41226
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Priority: Major
>
> I am creating this one for Desmond Cheong since he can't sign up for an account because of https://infra.apache.org/blog/jira-public-signup-disabled.html
>
> His description for this improvement:
> The Spark type system currently supports multiple data types with the same physical representation in memory. For example, {{DateType}} and {{YearMonthIntervalType}} are both implemented using {{IntegerType}}. Because of this, operations on data types often involve case matching where multiple data types map to the same effects. To simplify this case-matching logic, we can introduce the notion of logical and physical data types, where multiple logical data types can be implemented with the same physical data type, and then perform case matching on physical data types. Some areas that can use this logical/physical type separation are:
> * {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
> * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
> * {{getAccessor}} in {{InternalRow.scala}}
> * {{externalDataTypeFor}} in {{RowEncoder.scala}}
> * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
> * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
> * {{doValidate}} in {{literals.scala}}
[jira] [Commented] (SPARK-41226) Refactor Spark types by introducing physical types
[ https://issues.apache.org/jira/browse/SPARK-41226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637463#comment-17637463 ] Apache Spark commented on SPARK-41226:

User 'desmondcheongzx' has created a pull request for this issue: https://github.com/apache/spark/pull/38750

> Refactor Spark types by introducing physical types
> --
>
> Key: SPARK-41226
> URL: https://issues.apache.org/jira/browse/SPARK-41226
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Priority: Major
>
> I am creating this one for Desmond Cheong since he can't sign up for an account because of https://infra.apache.org/blog/jira-public-signup-disabled.html
>
> His description for this improvement:
> The Spark type system currently supports multiple data types with the same physical representation in memory. For example, {{DateType}} and {{YearMonthIntervalType}} are both implemented using {{IntegerType}}. Because of this, operations on data types often involve case matching where multiple data types map to the same effects. To simplify this case-matching logic, we can introduce the notion of logical and physical data types, where multiple logical data types can be implemented with the same physical data type, and then perform case matching on physical data types. Some areas that can use this logical/physical type separation are:
> * {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
> * {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
> * {{getAccessor}} in {{InternalRow.scala}}
> * {{externalDataTypeFor}} in {{RowEncoder.scala}}
> * {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
> * {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
> * {{doValidate}} in {{literals.scala}}
[jira] [Created] (SPARK-41226) Refactor Spark types by introducing physical types
Gengliang Wang created SPARK-41226:

Summary: Refactor Spark types by introducing physical types
Key: SPARK-41226
URL: https://issues.apache.org/jira/browse/SPARK-41226
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 3.4.0
Reporter: Gengliang Wang

I am creating this one for Desmond Cheong since he can't sign up for an account because of https://infra.apache.org/blog/jira-public-signup-disabled.html

His description for this improvement:

The Spark type system currently supports multiple data types with the same physical representation in memory. For example, {{DateType}} and {{YearMonthIntervalType}} are both implemented using {{IntegerType}}. Because of this, operations on data types often involve case matching where multiple data types map to the same effects. To simplify this case-matching logic, we can introduce the notion of logical and physical data types, where multiple logical data types can be implemented with the same physical data type, and then perform case matching on physical data types. Some areas that can use this logical/physical type separation are:
* {{SpecializedGettersReader}} in {{SpecializedGettersReader.java}}
* {{copy}} in {{ColumnarBatchRow.java}} and {{ColumnarRow.java}}
* {{getAccessor}} in {{InternalRow.scala}}
* {{externalDataTypeFor}} in {{RowEncoder.scala}}
* {{unsafeWriter}} in {{InterpretedUnsafeProjection.scala}}
* {{getValue}} and {{javaType}} in {{CodeGenerator.scala}}
* {{doValidate}} in {{literals.scala}}
[jira] [Assigned] (SPARK-41225) Disable unsupported functions
[ https://issues.apache.org/jira/browse/SPARK-41225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41225: Assignee: (was: Apache Spark) > Disable unsupported functions > - > > Key: SPARK-41225 > URL: https://issues.apache.org/jira/browse/SPARK-41225 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > > Disable unsupported functions and throw a proper NotImplementedError in the > Python client.
[jira] [Assigned] (SPARK-41225) Disable unsupported functions
[ https://issues.apache.org/jira/browse/SPARK-41225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41225: Assignee: Apache Spark > Disable unsupported functions > - > > Key: SPARK-41225 > URL: https://issues.apache.org/jira/browse/SPARK-41225 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Assignee: Apache Spark >Priority: Major > > Disable unsupported functions and throw a proper NotImplementedError in the > Python client.
[jira] [Commented] (SPARK-41225) Disable unsupported functions
[ https://issues.apache.org/jira/browse/SPARK-41225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637457#comment-17637457 ] Apache Spark commented on SPARK-41225: -- User 'grundprinzip' has created a pull request for this issue: https://github.com/apache/spark/pull/38762 > Disable unsupported functions > - > > Key: SPARK-41225 > URL: https://issues.apache.org/jira/browse/SPARK-41225 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Martin Grund >Priority: Major > > Disable unsupported functions and throw a proper NotImplementedError in the > Python client.
[jira] [Created] (SPARK-41225) Disable unsupported functions
Martin Grund created SPARK-41225: Summary: Disable unsupported functions Key: SPARK-41225 URL: https://issues.apache.org/jira/browse/SPARK-41225 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Martin Grund Disable unsupported functions and throw a proper NotImplementedError in the Python client.
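Disabling an unsupported function in a Python client usually amounts to a stub that fails loudly. The sketch below illustrates the pattern described in the ticket; the `DataFrame` class, the `checkpoint` method, and the error message are hypothetical, not the actual Spark Connect code.

```python
# Hypothetical sketch: an unsupported client method raises a proper
# NotImplementedError instead of silently misbehaving.

class DataFrame:
    def select(self, *cols):
        # Supported operation (stubbed out for the sketch).
        return DataFrame()

    def checkpoint(self):
        # Not supported by this client: fail with a clear, typed error.
        raise NotImplementedError("checkpoint() is not implemented")

err = None
try:
    DataFrame().checkpoint()
except NotImplementedError as e:
    err = str(e)  # callers can catch the specific error type
```

Raising `NotImplementedError` (rather than, say, a generic `Exception`) lets callers distinguish "this client doesn't support the call yet" from a real failure.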
[jira] [Updated] (SPARK-39591) SPIP: Offset Management Improvements in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-39591:

Description:
Currently in Structured Streaming, at the beginning of every micro-batch the offset to process up to for the current batch is persisted to durable storage, and at the end of every micro-batch a marker indicating the completion of the current micro-batch is persisted to durable storage. For pipelines that, for example, read from Kafka and write to Kafka, where end-to-end exactly-once is not supported and latency is sensitive, we can allow users to configure offset commits to be written asynchronously; the commit operation then does not contribute to the batch duration, effectively lowering the overall latency of the pipeline.

SPIP Doc: https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing

was:
Currently in Structured Streaming, at the beginning of every micro-batch the offset to process up to for the current batch is persisted to durable storage, and at the end of every micro-batch a marker indicating the completion of the current micro-batch is persisted to durable storage. For pipelines that, for example, read from Kafka and write to Kafka, where end-to-end exactly-once is not supported and latency is sensitive, we can allow users to configure offset commits to be written asynchronously; the commit operation then does not contribute to the batch duration, effectively lowering the overall latency of the pipeline.

> SPIP: Offset Management Improvements in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Boyang Jerry Peng
> Priority: Major
> Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the offset to process up to for the current batch is persisted to durable storage, and at the end of every micro-batch a marker indicating the completion of the current micro-batch is persisted to durable storage. For pipelines that, for example, read from Kafka and write to Kafka, where end-to-end exactly-once is not supported and latency is sensitive, we can allow users to configure offset commits to be written asynchronously; the commit operation then does not contribute to the batch duration, effectively lowering the overall latency of the pipeline.
>
> SPIP Doc: https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing
[jira] [Updated] (SPARK-39591) SPIP: Asynchronous Offset Management in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-39591:

Summary: SPIP: Asynchronous Offset Management in Structured Streaming (was: SPIP: Offset Management Improvements in Structured Streaming)

> SPIP: Asynchronous Offset Management in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Boyang Jerry Peng
> Priority: Major
> Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the offset to process up to for the current batch is persisted to durable storage, and at the end of every micro-batch a marker indicating the completion of the current micro-batch is persisted to durable storage. For pipelines that, for example, read from Kafka and write to Kafka, where end-to-end exactly-once is not supported and latency is sensitive, we can allow users to configure offset commits to be written asynchronously; the commit operation then does not contribute to the batch duration, effectively lowering the overall latency of the pipeline.
>
> SPIP Doc: https://docs.google.com/document/d/1iPiI4YoGCM0i61pBjkxcggU57gHKf2jVwD7HWMHgH-Y/edit?usp=sharing
[jira] [Updated] (SPARK-39591) SPIP: Offset Management Improvements in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-39591:

Summary: SPIP: Offset Management Improvements in Structured Streaming (was: Offset Management Improvements in Structured Streaming)

> SPIP: Offset Management Improvements in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Boyang Jerry Peng
> Priority: Major
>
> Currently in Structured Streaming, at the beginning of every micro-batch the offset to process up to for the current batch is persisted to durable storage, and at the end of every micro-batch a marker indicating the completion of the current micro-batch is persisted to durable storage. For pipelines that, for example, read from Kafka and write to Kafka, where end-to-end exactly-once is not supported and latency is sensitive, we can allow users to configure offset commits to be written asynchronously; the commit operation then does not contribute to the batch duration, effectively lowering the overall latency of the pipeline.
[jira] [Updated] (SPARK-39591) SPIP: Offset Management Improvements in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-39591:

Labels: SPIP (was: )

> SPIP: Offset Management Improvements in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Boyang Jerry Peng
> Priority: Major
> Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the offset to process up to for the current batch is persisted to durable storage, and at the end of every micro-batch a marker indicating the completion of the current micro-batch is persisted to durable storage. For pipelines that, for example, read from Kafka and write to Kafka, where end-to-end exactly-once is not supported and latency is sensitive, we can allow users to configure offset commits to be written asynchronously; the commit operation then does not contribute to the batch duration, effectively lowering the overall latency of the pipeline.
[jira] [Updated] (SPARK-39591) Offset Management Improvements in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-39591:

Labels: SPIP (was: )

> Offset Management Improvements in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Boyang Jerry Peng
> Priority: Major
> Labels: SPIP
>
> Currently in Structured Streaming, at the beginning of every micro-batch the offset to process up to for the current batch is persisted to durable storage, and at the end of every micro-batch a marker indicating the completion of the current micro-batch is persisted to durable storage. For pipelines that, for example, read from Kafka and write to Kafka, where end-to-end exactly-once is not supported and latency is sensitive, we can allow users to configure offset commits to be written asynchronously; the commit operation then does not contribute to the batch duration, effectively lowering the overall latency of the pipeline.
[jira] [Updated] (SPARK-39591) Offset Management Improvements in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-39591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Boyang Jerry Peng updated SPARK-39591:

Labels: (was: SPIP)

> Offset Management Improvements in Structured Streaming
> --
>
> Key: SPARK-39591
> URL: https://issues.apache.org/jira/browse/SPARK-39591
> Project: Spark
> Issue Type: Improvement
> Components: Structured Streaming
> Affects Versions: 3.3.0
> Reporter: Boyang Jerry Peng
> Priority: Major
>
> Currently in Structured Streaming, at the beginning of every micro-batch the offset to process up to for the current batch is persisted to durable storage, and at the end of every micro-batch a marker indicating the completion of the current micro-batch is persisted to durable storage. For pipelines that, for example, read from Kafka and write to Kafka, where end-to-end exactly-once is not supported and latency is sensitive, we can allow users to configure offset commits to be written asynchronously; the commit operation then does not contribute to the batch duration, effectively lowering the overall latency of the pipeline.
[jira] [Updated] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-41053:

Description:
After SPARK-18085, the Spark History Server (SHS) became more scalable for processing large applications by supporting a persistent KV store (LevelDB/RocksDB) as the storage layer. For the live Spark UI, however, all the data is still stored in memory, which can put memory pressure on the Spark driver for large applications. For better Spark UI scalability and driver stability, I propose to:
* *Support storing all the UI data in a persistent KV store.* RocksDB/LevelDB provide low memory overhead, and their write/read performance is fast enough to serve the live UI's workload. The SHS can leverage the persistent KV store to speed up its startup.
* *Support a new Protobuf serializer for all the UI data.* The new serializer is supposed to be faster, according to benchmarks. It will be the default serializer for the persistent KV store of the live UI; for event logs it is optional. The current serializer for UI data is JSON, and when writing to the persistent KV store there is gzip compression. Since RocksDB/LevelDB have their own compression support, the new serializer won't compress the output before writing to the persistent KV store.

Here is a benchmark of writing/reading 100,000 SQLExecutionUIData to/from RocksDB:

|*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total Size(MB)*|*Result total size in memory(MB)*|
|*Spark's KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
|*Protobuf*|109.9|34.3|858|2105|

I am also proposing to support only RocksDB, rather than both LevelDB and RocksDB, in the live UI.

SPIP: https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing
SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj

was:
After SPARK-18085, the Spark History Server (SHS) became more scalable for processing large applications by supporting a persistent KV store (LevelDB/RocksDB) as the storage layer. For the live Spark UI, however, all the data is still stored in memory, which can put memory pressure on the Spark driver for large applications. For better Spark UI scalability and driver stability, I propose to:
* *Support storing all the UI data in a persistent KV store.* RocksDB/LevelDB provide low memory overhead, and their write/read performance is fast enough to serve the live UI's workload. The SHS can leverage the persistent KV store to speed up its startup.
* *Support a new Protobuf serializer for all the UI data.* The new serializer is supposed to be faster, according to benchmarks. It will be the default serializer for the persistent KV store of the live UI; for event logs it is optional. The current serializer for UI data is JSON, and when writing to the persistent KV store there is gzip compression. Since RocksDB/LevelDB have their own compression support, the new serializer won't compress the output before writing to the persistent KV store.

Here is a benchmark of writing/reading 100,000 SQLExecutionUIData to/from RocksDB:

|*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total Size(MB)*|*Result total size in memory(MB)*|
|*Spark's KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
|*Protobuf*|109.9|34.3|858|2105|

I am also proposing to support only RocksDB, rather than both LevelDB and RocksDB, in the live UI.

SPIP: https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing

> Better Spark UI scalability and Driver stability for large applications
> ---
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
> Issue Type: Umbrella
> Components: Spark Core, Web UI
> Affects Versions: 3.4.0
> Reporter: Gengliang Wang
> Priority: Major
> Attachments: Better Spark UI scalability and Driver stability for large applications.pdf
>
> After SPARK-18085, the Spark History Server (SHS) became more scalable for processing large applications by supporting a persistent KV store (LevelDB/RocksDB) as the storage layer. For the live Spark UI, however, all the data is still stored in memory, which can put memory pressure on the Spark driver for large applications. For better Spark UI scalability and driver stability, I propose to:
> * *Support storing all the UI data in a persistent KV store.* RocksDB/LevelDB provide low memory overhead, and their write/read performance is fast enough to serve the live UI's workload. The SHS can leverage the persistent KV store to speed up its startup.
> * *Support a new Protobuf serializer for all the UI data.* The new serializer is supposed to be faster, according to benchmarks.
[jira] [Updated] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-41053: --- Attachment: Better Spark UI scalability and Driver stability for large applications.pdf > Better Spark UI scalability and Driver stability for large applications > --- > > Key: SPARK-41053 > URL: https://issues.apache.org/jira/browse/SPARK-41053 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > Attachments: Better Spark UI scalability and Driver stability for > large applications.pdf > > > After SPARK-18085, the Spark history server(SHS) becomes more scalable for > processing large applications by supporting a persistent > KV-store(LevelDB/RocksDB) as the storage layer. > As for the live Spark UI, all the data is still stored in memory, which can > bring memory pressures to the Spark driver for large applications. > For better Spark UI scalability and Driver stability, I propose to > * {*}Support storing all the UI data in a persistent KV store{*}. > RocksDB/LevelDB provides low memory overhead. Their write/read performance is > fast enough to serve the write/read workload for live UI. SHS can leverage > the persistent KV store to fasten its startup. > * *Support a new Protobuf serializer for all the UI data.* The new > serializer is supposed to be faster, according to benchmarks. It will be the > default serializer for the persistent KV store of live UI. As for event logs, > it is optional. The current serializer for UI data is JSON. When writing > persistent KV-store, there is GZip compression. Since there is compression > support in RocksDB/LevelDB, the new serializer won’t compress the output > before writing to the persistent KV store. 
Here is a benchmark of writing/reading 100,000 SQLExecutionUIData to/from RocksDB: > > |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total Size(MB)*|*Result total size in memory(MB)*| > |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868| > |*Protobuf*|109.9|34.3|858|2105| > I am also proposing to support only RocksDB, instead of both LevelDB & RocksDB, in > the live UI. > SPIP: > [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
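The rationale for not GZip-compressing before the KV-store write (RocksDB/LevelDB already compress internally) can be illustrated with a minimal, Spark-independent sketch: compressing already-compressed bytes costs CPU and buys almost nothing. The payload below is a made-up stand-in for serialized UI data, not Spark's actual format.

```scala
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPOutputStream

object CompressionSketch {
  // Gzip a byte array in memory (stand-in for compressing before the KV write).
  def gzip(data: Array[Byte]): Array[Byte] = {
    val buf = new ByteArrayOutputStream()
    val gz = new GZIPOutputStream(buf)
    gz.write(data)
    gz.close()
    buf.toByteArray
  }

  def main(args: Array[String]): Unit = {
    // Highly repetitive payload, like JSON-encoded UI events.
    val payload = ("{\"jobId\":1,\"status\":\"RUNNING\"}" * 1000).getBytes("UTF-8")
    val once = gzip(payload)
    val twice = gzip(once) // compressing already-compressed data
    // The second pass shrinks essentially nothing: gzip output is near-incompressible,
    // so letting the KV store compress once is the cheaper design.
    println(s"raw=${payload.length} once=${once.length} twice=${twice.length}")
  }
}
```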
[jira] [Resolved] (SPARK-41054) Support disk-based KVStore in live UI
[ https://issues.apache.org/jira/browse/SPARK-41054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-41054. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38567 [https://github.com/apache/spark/pull/38567] > Support disk-based KVStore in live UI > - > > Key: SPARK-41054 > URL: https://issues.apache.org/jira/browse/SPARK-41054 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-37313) Child stage using merged output or not should be based on the availability of merged output from parent stage
[ https://issues.apache.org/jira/browse/SPARK-37313 ] Mars deleted comment on SPARK-37313: -- was (Author: JIRAUSER290821): as comment said [https://github.com/apache/spark/pull/34461#issuecomment-964557253] I'm working on this Issue and trying to implement this functionality [~minyang] [~mridul] > Child stage using merged output or not should be based on the availability of > merged output from parent stage > - > > Key: SPARK-37313 > URL: https://issues.apache.org/jira/browse/SPARK-37313 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.1 >Reporter: Minchu Yang >Priority: Minor > > As discussed in the > [thread|https://github.com/apache/spark/pull/34461#pullrequestreview-799701494] > in SPARK-37023, during a stage retry, if the parent stage has already generated > merged output in the previous attempt, with the current behavior the child stage > would not be able to fetch the merged output, as this is controlled by > dependency.shuffleMergeEnabled (see current implementation > [here|https://github.com/apache/spark/blob/31b6f614d3173c8a5852243bf7d0b6200788432d/core/src/main/scala/org/apache/spark/shuffle/sort/SortShuffleManager.scala#L134-L136]) > during the stage retry. > Instead of using a single variable to control behavior at both the mapper side > (push side) and the reducer side (using merged output), whether the child stage uses > merged output or not should only be based on whether merged output is available > for it to use (as discussed > [here|https://github.com/apache/spark/pull/34461#issuecomment-964557253]). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40988) Test case for insert partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637252#comment-17637252 ] Apache Spark commented on SPARK-40988: -- User 'rangareddy' has created a pull request for this issue: https://github.com/apache/spark/pull/38761 > Test case for insert partition should verify value > --- > > Key: SPARK-40988 > URL: https://issues.apache.org/jira/browse/SPARK-40988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Ranga Reddy >Priority: Minor > > Spark3 has not validated the Partition Column type while inserting the data > but on the Hive side exception is thrown while inserting different type > values. > *Spark Code:* > > {code:java} > scala> val tableName="test_partition_table" > tableName: String = test_partition_table > scala>scala> spark.sql(s"DROP TABLE IF EXISTS $tableName") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) > PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("SHOW tables").show(truncate=false) > +-+-+---+ > |namespace|tableName |isTemporary| > +-+-+---+ > |default |test_partition_table |false | > +-+-+---+ > scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, > false) > +--+-+ > |key |value| > +--+-+ > |spark.sql.sources.validatePartitionColumns|true | > +--+-+ > scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, > 'Ranga')""") > res4: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions > $tableName").show(50, false) > +-+ > |partition| > +-+ > |age=25 | > +-+ > scala> spark.sql(s"select * from $tableName").show(50, false) > +---+-+---+ > |id |name |age| > +---+-+---+ > |1 |Ranga|25 | > +---+-+---+ > scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") > VALUES (2, 
'Nishanth')""") > res7: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions > $tableName").show(50, false) > ++ > |partition | > ++ > |age=25 | > |age=test_age| > ++ > scala> spark.sql(s"select * from $tableName").show(50, false) > +---+++ > |id |name |age | > +---+++ > |1 |Ranga |25 | > |2 |Nishanth|null| > +---+++ {code} > *Hive Code:* > > > {code:java} > > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, > > 'Nishanth'); > Error: Error while compiling statement: FAILED: SemanticException [Error > 10248]: Cannot add partition column age of type string as it cannot be > converted to type int (state=42000,code=10248){code} > > *Expected Result:* > When *spark.sql.sources.validatePartitionColumns=true* it needs to be > validated the datatype value and exception needs to be thrown if we provide > wrong data type value. > *Reference:* > [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
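The validation Hive performs here, and which the reporter expects Spark to perform when `spark.sql.sources.validatePartitionColumns=true`, can be sketched without Spark: a value destined for an INT partition column must actually parse as an integer. `isValidIntPartitionValue` below is a hypothetical helper for illustration, not a Spark API.

```scala
object PartitionValidationSketch {
  // Hypothetical check mirroring what validatePartitionColumns=true should do:
  // a value written to an INT partition column must parse as an Int; otherwise
  // the insert should fail instead of silently producing a null partition value.
  def isValidIntPartitionValue(value: String): Boolean =
    scala.util.Try(value.trim.toInt).isSuccess

  def main(args: Array[String]): Unit = {
    println(isValidIntPartitionValue("25"))       // legal: age=25
    println(isValidIntPartitionValue("test_age")) // should be rejected, as Hive does
  }
}
```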
[jira] [Assigned] (SPARK-40988) Test case for insert partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40988: Assignee: Apache Spark > Test case for insert partition should verify value > --- > > Key: SPARK-40988 > URL: https://issues.apache.org/jira/browse/SPARK-40988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Ranga Reddy >Assignee: Apache Spark >Priority: Minor > > Spark3 has not validated the Partition Column type while inserting the data > but on the Hive side exception is thrown while inserting different type > values. > *Spark Code:* > > {code:java} > scala> val tableName="test_partition_table" > tableName: String = test_partition_table > scala>scala> spark.sql(s"DROP TABLE IF EXISTS $tableName") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) > PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("SHOW tables").show(truncate=false) > +-+-+---+ > |namespace|tableName |isTemporary| > +-+-+---+ > |default |test_partition_table |false | > +-+-+---+ > scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, > false) > +--+-+ > |key |value| > +--+-+ > |spark.sql.sources.validatePartitionColumns|true | > +--+-+ > scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, > 'Ranga')""") > res4: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions > $tableName").show(50, false) > +-+ > |partition| > +-+ > |age=25 | > +-+ > scala> spark.sql(s"select * from $tableName").show(50, false) > +---+-+---+ > |id |name |age| > +---+-+---+ > |1 |Ranga|25 | > +---+-+---+ > scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") > VALUES (2, 'Nishanth')""") > res7: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions > 
$tableName").show(50, false) > ++ > |partition | > ++ > |age=25 | > |age=test_age| > ++ > scala> spark.sql(s"select * from $tableName").show(50, false) > +---+++ > |id |name |age | > +---+++ > |1 |Ranga |25 | > |2 |Nishanth|null| > +---+++ {code} > *Hive Code:* > > > {code:java} > > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, > > 'Nishanth'); > Error: Error while compiling statement: FAILED: SemanticException [Error > 10248]: Cannot add partition column age of type string as it cannot be > converted to type int (state=42000,code=10248){code} > > *Expected Result:* > When *spark.sql.sources.validatePartitionColumns=true* it needs to be > validated the datatype value and exception needs to be thrown if we provide > wrong data type value. > *Reference:* > [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40988) Test case for insert partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40988: Assignee: (was: Apache Spark) > Test case for insert partition should verify value > --- > > Key: SPARK-40988 > URL: https://issues.apache.org/jira/browse/SPARK-40988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Ranga Reddy >Priority: Minor > > Spark3 has not validated the Partition Column type while inserting the data > but on the Hive side exception is thrown while inserting different type > values. > *Spark Code:* > > {code:java} > scala> val tableName="test_partition_table" > tableName: String = test_partition_table > scala>scala> spark.sql(s"DROP TABLE IF EXISTS $tableName") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) > PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("SHOW tables").show(truncate=false) > +-+-+---+ > |namespace|tableName |isTemporary| > +-+-+---+ > |default |test_partition_table |false | > +-+-+---+ > scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, > false) > +--+-+ > |key |value| > +--+-+ > |spark.sql.sources.validatePartitionColumns|true | > +--+-+ > scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, > 'Ranga')""") > res4: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions > $tableName").show(50, false) > +-+ > |partition| > +-+ > |age=25 | > +-+ > scala> spark.sql(s"select * from $tableName").show(50, false) > +---+-+---+ > |id |name |age| > +---+-+---+ > |1 |Ranga|25 | > +---+-+---+ > scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") > VALUES (2, 'Nishanth')""") > res7: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions > $tableName").show(50, 
false) > ++ > |partition | > ++ > |age=25 | > |age=test_age| > ++ > scala> spark.sql(s"select * from $tableName").show(50, false) > +---+++ > |id |name |age | > +---+++ > |1 |Ranga |25 | > |2 |Nishanth|null| > +---+++ {code} > *Hive Code:* > > > {code:java} > > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, > > 'Nishanth'); > Error: Error while compiling statement: FAILED: SemanticException [Error > 10248]: Cannot add partition column age of type string as it cannot be > converted to type int (state=42000,code=10248){code} > > *Expected Result:* > When *spark.sql.sources.validatePartitionColumns=true* it needs to be > validated the datatype value and exception needs to be thrown if we provide > wrong data type value. > *Reference:* > [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40988) Test case for insert partition should verify value
[ https://issues.apache.org/jira/browse/SPARK-40988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637251#comment-17637251 ] Apache Spark commented on SPARK-40988: -- User 'rangareddy' has created a pull request for this issue: https://github.com/apache/spark/pull/38761 > Test case for insert partition should verify value > --- > > Key: SPARK-40988 > URL: https://issues.apache.org/jira/browse/SPARK-40988 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0, 3.2.0, 3.3.0 >Reporter: Ranga Reddy >Priority: Minor > > Spark3 has not validated the Partition Column type while inserting the data > but on the Hive side exception is thrown while inserting different type > values. > *Spark Code:* > > {code:java} > scala> val tableName="test_partition_table" > tableName: String = test_partition_table > scala>scala> spark.sql(s"DROP TABLE IF EXISTS $tableName") > res0: org.apache.spark.sql.DataFrame = [] > scala> spark.sql(s"CREATE EXTERNAL TABLE $tableName ( id INT, name STRING ) > PARTITIONED BY (age INT) LOCATION 'file:/tmp/spark-warehouse/$tableName'") > res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("SHOW tables").show(truncate=false) > +-+-+---+ > |namespace|tableName |isTemporary| > +-+-+---+ > |default |test_partition_table |false | > +-+-+---+ > scala> spark.sql("SET spark.sql.sources.validatePartitionColumns").show(50, > false) > +--+-+ > |key |value| > +--+-+ > |spark.sql.sources.validatePartitionColumns|true | > +--+-+ > scala> spark.sql(s"""INSERT INTO $tableName partition (age=25) VALUES (1, > 'Ranga')""") > res4: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions > $tableName").show(50, false) > +-+ > |partition| > +-+ > |age=25 | > +-+ > scala> spark.sql(s"select * from $tableName").show(50, false) > +---+-+---+ > |id |name |age| > +---+-+---+ > |1 |Ranga|25 | > +---+-+---+ > scala> spark.sql(s"""INSERT INTO $tableName partition (age=\"test_age\") > VALUES (2, 
'Nishanth')""") > res7: org.apache.spark.sql.DataFrame = []scala> spark.sql(s"show partitions > $tableName").show(50, false) > ++ > |partition | > ++ > |age=25 | > |age=test_age| > ++ > scala> spark.sql(s"select * from $tableName").show(50, false) > +---+++ > |id |name |age | > +---+++ > |1 |Ranga |25 | > |2 |Nishanth|null| > +---+++ {code} > *Hive Code:* > > > {code:java} > > INSERT INTO test_partition_table partition (age="test_age2") VALUES (3, > > 'Nishanth'); > Error: Error while compiling statement: FAILED: SemanticException [Error > 10248]: Cannot add partition column age of type string as it cannot be > converted to type int (state=42000,code=10248){code} > > *Expected Result:* > When *spark.sql.sources.validatePartitionColumns=true* it needs to be > validated the datatype value and exception needs to be thrown if we provide > wrong data type value. > *Reference:* > [https://spark.apache.org/docs/3.3.1/sql-migration-guide.html#data-sources] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0
[ https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637239#comment-17637239 ] Apache Spark commented on SPARK-41219: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/38760 > Regression in IntegralDivide returning null instead of 0 > > > Key: SPARK-41219 > URL: https://issues.apache.org/jira/browse/SPARK-41219 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Raza Jafri >Priority: Major > > There seems to be a regression in Spark 3.4 Integral Divide > > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +-+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +-+ > | null| > | null| > +-+ > {code} > > While in Spark 3.3.0 > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +-+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +-+ > | 0| > | 0| > +-+ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41219) Regression in IntegralDivide returning null instead of 0
[ https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41219: Assignee: (was: Apache Spark) > Regression in IntegralDivide returning null instead of 0 > > > Key: SPARK-41219 > URL: https://issues.apache.org/jira/browse/SPARK-41219 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Raza Jafri >Priority: Major > > There seems to be a regression in Spark 3.4 Integral Divide > > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +-+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +-+ > | null| > | null| > +-+ > {code} > > While in Spark 3.3.0 > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +-+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +-+ > | 0| > | 0| > +-+ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41219) Regression in IntegralDivide returning null instead of 0
[ https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41219: Assignee: Apache Spark > Regression in IntegralDivide returning null instead of 0 > > > Key: SPARK-41219 > URL: https://issues.apache.org/jira/browse/SPARK-41219 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Raza Jafri >Assignee: Apache Spark >Priority: Major > > There seems to be a regression in Spark 3.4 Integral Divide > > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +-+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +-+ > | null| > | null| > +-+ > {code} > > While in Spark 3.3.0 > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +-+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +-+ > | 0| > | 0| > +-+ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41184) Fill NA tests are flaky
[ https://issues.apache.org/jira/browse/SPARK-41184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637223#comment-17637223 ] Apache Spark commented on SPARK-41184: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/38759 > Fill NA tests are flaky > --- > > Key: SPARK-41184 > URL: https://issues.apache.org/jira/browse/SPARK-41184 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > Connect's fill.na tests for Python are flaky. We need to disable them and > investigate what is going on with the typing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41184) Fill NA tests are flaky
[ https://issues.apache.org/jira/browse/SPARK-41184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637222#comment-17637222 ] Apache Spark commented on SPARK-41184: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/38759 > Fill NA tests are flaky > --- > > Key: SPARK-41184 > URL: https://issues.apache.org/jira/browse/SPARK-41184 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > Connect's fill.na tests for Python are flaky. We need to disable them and > investigate what is going on with the typing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41184) Fill NA tests are flaky
[ https://issues.apache.org/jira/browse/SPARK-41184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637221#comment-17637221 ] Apache Spark commented on SPARK-41184: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/38759 > Fill NA tests are flaky > --- > > Key: SPARK-41184 > URL: https://issues.apache.org/jira/browse/SPARK-41184 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > Connect's fill.na tests for Python are flaky. We need to disable them and > investigate what is going on with the typing. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41165) Arrow collect should factor in failures
[ https://issues.apache.org/jira/browse/SPARK-41165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637219#comment-17637219 ] Apache Spark commented on SPARK-41165: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/38759 > Arrow collect should factor in failures > --- > > Key: SPARK-41165 > URL: https://issues.apache.org/jira/browse/SPARK-41165 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > Connect's arrow collect path does not factor in failures. If a failure occurs > the collect code path will hang. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
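The hang described in SPARK-41165 can be reproduced in miniature with plain Scala Futures: if the result promise is only ever completed on success, a failure leaves the waiter blocked forever. The sketch below is an illustration of the fail-fast pattern, not Connect's actual code; all names are made up.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

object FailFastCollectSketch {
  // Wait for an asynchronous collect result, propagating failures to the waiter.
  def collectResult[T](work: Future[T]): T = {
    val done = Promise[T]()
    work.onComplete {
      case Success(v) => done.success(v)
      // Without this case, a failed collect never completes the promise and
      // Await.result below blocks until the timeout -- the reported hang.
      case Failure(e) => done.failure(e)
    }
    Await.result(done.future, 10.seconds)
  }
}
```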
[jira] [Commented] (SPARK-41165) Arrow collect should factor in failures
[ https://issues.apache.org/jira/browse/SPARK-41165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637220#comment-17637220 ] Apache Spark commented on SPARK-41165: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/38759 > Arrow collect should factor in failures > --- > > Key: SPARK-41165 > URL: https://issues.apache.org/jira/browse/SPARK-41165 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 3.4.0 > > > Connect's arrow collect path does not factor in failures. If a failure occurs > the collect code path will hang. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41224) Optimize Arrow collect to stream the result from server to client
[ https://issues.apache.org/jira/browse/SPARK-41224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637218#comment-17637218 ] Apache Spark commented on SPARK-41224: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/38759 > Optimize Arrow collect to stream the result from server to client > - > > Key: SPARK-41224 > URL: https://issues.apache.org/jira/browse/SPARK-41224 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/apache/spark/pull/38468 implemented Arrow-based collect > but they cannot stream the result from server to the client. We can stream > them if the first partition is collected first -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41224) Optimize Arrow collect to stream the result from server to client
[ https://issues.apache.org/jira/browse/SPARK-41224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41224: Assignee: Apache Spark > Optimize Arrow collect to stream the result from server to client > - > > Key: SPARK-41224 > URL: https://issues.apache.org/jira/browse/SPARK-41224 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > https://github.com/apache/spark/pull/38468 implemented Arrow-based collect > but they cannot stream the result from server to the client. We can stream > them if the first partition is collected first -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41224) Optimize Arrow collect to stream the result from server to client
[ https://issues.apache.org/jira/browse/SPARK-41224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41224: Assignee: (was: Apache Spark) > Optimize Arrow collect to stream the result from server to client > - > > Key: SPARK-41224 > URL: https://issues.apache.org/jira/browse/SPARK-41224 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > https://github.com/apache/spark/pull/38468 implemented Arrow-based collect > but they cannot stream the result from server to the client. We can stream > them if the first partition is collected first -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41219) Regression in IntegralDivide returning null instead of 0
[ https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637216#comment-17637216 ] XiDuo You edited comment on SPARK-41219 at 11/22/22 11:59 AM: -- it seems the root reason is decimal.toPrecision will break when change to decimal(0, 0) {code:java} val df = Seq(0).toDF("a") // return 0 df.selectExpr("cast(0 as decimal(0,0))").show // return 0 df.select(lit(BigDecimal(0)) as "c").show // return null df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))").show {code} was (Author: ulysses): it seems the root reason is decimal.toPrecision will break when change to decimal(0, 0) {code:java} val df = Seq(0).toDF("a") // return 0 df.selectExpr("cast(0 as decimal(0,0))").show // return null df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))").show {code} > Regression in IntegralDivide returning null instead of 0 > > > Key: SPARK-41219 > URL: https://issues.apache.org/jira/browse/SPARK-41219 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Raza Jafri >Priority: Major > > There seems to be a regression in Spark 3.4 Integral Divide > > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +-+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +-+ > | null| > | null| > +-+ > {code} > > While in Spark 3.3.0 > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +-+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +-+ > | 0| > | 0| > +-+ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0
[ https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637216#comment-17637216 ] XiDuo You commented on SPARK-41219: --- it seems the root reason is decimal.toPrecision will break when change to decimal(0, 0) {code:java} val df = Seq(0).toDF("a") // return 0 df.selectExpr("cast(0 as decimal(0,0))").show // return null df.select(lit(BigDecimal(0)) as "c").selectExpr("cast(c as decimal(0,0))").show {code} > Regression in IntegralDivide returning null instead of 0 > > > Key: SPARK-41219 > URL: https://issues.apache.org/jira/browse/SPARK-41219 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Raza Jafri >Priority: Major > > There seems to be a regression in Spark 3.4 Integral Divide > > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +-+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +-+ > | null| > | null| > +-+ > {code} > > While in Spark 3.3.0 > {code:java} > scala> val df = Seq("0.5944910","0.3314242").toDF("a") > df: org.apache.spark.sql.DataFrame = [a: string] > scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show > +-+ > |(CAST(a AS DECIMAL(7,7)) div 100)| > +-+ > | 0| > | 0| > +-+ > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
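The expected 3.3.0 semantics (`div` yields 0, not null) agree with plain integral division on `java.math.BigDecimal`, as this Spark-independent check shows. `integralDiv` is an illustrative helper, not Spark's implementation.

```scala
object IntegralDivideSketch {
  // div on decimals is integral division: the fractional part of the quotient
  // is discarded, so a tiny dividend over a large divisor yields exactly 0.
  def integralDiv(a: String, b: String): java.math.BigDecimal =
    new java.math.BigDecimal(a).divideToIntegralValue(new java.math.BigDecimal(b))

  def main(args: Array[String]): Unit = {
    // 0.5944910 fits DECIMAL(7,7); dividing by 100 should yield 0, never null.
    println(integralDiv("0.5944910", "100"))
    println(integralDiv("0.3314242", "100"))
  }
}
```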
[jira] [Created] (SPARK-41224) Optimize Arrow collect to stream the result from server to client
Hyukjin Kwon created SPARK-41224: Summary: Optimize Arrow collect to stream the result from server to client Key: SPARK-41224 URL: https://issues.apache.org/jira/browse/SPARK-41224 Project: Spark Issue Type: Task Components: Connect Affects Versions: 3.4.0 Reporter: Hyukjin Kwon https://github.com/apache/spark/pull/38468 implemented Arrow-based collect but they cannot stream the result from server to the client. We can stream them if the first partition is collected first -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
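The streaming idea in SPARK-41224 -- emit each partition's rows to the client as soon as that partition completes, in partition order, instead of waiting for the whole result -- can be sketched with plain Scala Futures. This is an illustration under made-up names, not Connect's actual Arrow collect path.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

object StreamingCollectSketch {
  // Stream partition results in order: the iterator blocks only on the next
  // partition in sequence, so rows flow to the consumer as soon as partition 0
  // is done rather than after every partition has finished.
  def collectStreaming(partitions: Seq[Future[Seq[Int]]]): Iterator[Seq[Int]] =
    partitions.iterator.map(f => Await.result(f, 10.seconds))

  def main(args: Array[String]): Unit = {
    val parts = Seq(Future(Seq(1, 2)), Future(Seq(3, 4)))
    collectStreaming(parts).foreach(println)
  }
}
```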
[jira] [Assigned] (SPARK-41180) Assign an error class to "Cannot parse the data type"
[ https://issues.apache.org/jira/browse/SPARK-41180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41180: Assignee: (was: Apache Spark) > Assign an error class to "Cannot parse the data type" > - > > Key: SPARK-41180 > URL: https://issues.apache.org/jira/browse/SPARK-41180 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Priority: Major > > The code below shows the issue: > {code} > > select from_csv('1', 'a InvalidType'); > org.apache.spark.sql.AnalysisException > { > "errorClass" : "LEGACY", > "messageParameters" : { > "message" : "Cannot parse the data type: \n[PARSE_SYNTAX_ERROR] Syntax > error at or near 'InvalidType': extra input 'InvalidType'(line 1, pos > 2)\n\n== SQL ==\na InvalidType\n--^^^\n\nFailed fallback parsing: \nDataType > invalidtype is not supported.(line 1, pos 2)\n\n== SQL ==\na > InvalidType\n--^^^\n; line 1 pos 7" > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41180) Assign an error class to "Cannot parse the data type"
[ https://issues.apache.org/jira/browse/SPARK-41180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41180: Assignee: Apache Spark > Assign an error class to "Cannot parse the data type" > - > > Key: SPARK-41180 > URL: https://issues.apache.org/jira/browse/SPARK-41180 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > The code below shows the issue: > {code} > > select from_csv('1', 'a InvalidType'); > org.apache.spark.sql.AnalysisException > { > "errorClass" : "LEGACY", > "messageParameters" : { > "message" : "Cannot parse the data type: \n[PARSE_SYNTAX_ERROR] Syntax > error at or near 'InvalidType': extra input 'InvalidType'(line 1, pos > 2)\n\n== SQL ==\na InvalidType\n--^^^\n\nFailed fallback parsing: \nDataType > invalidtype is not supported.(line 1, pos 2)\n\n== SQL ==\na > InvalidType\n--^^^\n; line 1 pos 7" > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41223) Upgrade slf4j to 2.0.4
[ https://issues.apache.org/jira/browse/SPARK-41223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637207#comment-17637207 ] Apache Spark commented on SPARK-41223: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38758 > Upgrade slf4j to 2.0.4 > -- > > Key: SPARK-41223 > URL: https://issues.apache.org/jira/browse/SPARK-41223 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://www.slf4j.org/news.html#2.0.4 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41223) Upgrade slf4j to 2.0.4
[ https://issues.apache.org/jira/browse/SPARK-41223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637205#comment-17637205 ] Apache Spark commented on SPARK-41223: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38758 > Upgrade slf4j to 2.0.4 > -- > > Key: SPARK-41223 > URL: https://issues.apache.org/jira/browse/SPARK-41223 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://www.slf4j.org/news.html#2.0.4 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41223) Upgrade slf4j to 2.0.4
[ https://issues.apache.org/jira/browse/SPARK-41223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41223: Assignee: Apache Spark > Upgrade slf4j to 2.0.4 > -- > > Key: SPARK-41223 > URL: https://issues.apache.org/jira/browse/SPARK-41223 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > https://www.slf4j.org/news.html#2.0.4 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41223) Upgrade slf4j to 2.0.4
[ https://issues.apache.org/jira/browse/SPARK-41223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41223: Assignee: (was: Apache Spark) > Upgrade slf4j to 2.0.4 > -- > > Key: SPARK-41223 > URL: https://issues.apache.org/jira/browse/SPARK-41223 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: Yang Jie >Priority: Minor > > https://www.slf4j.org/news.html#2.0.4 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0
[ https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637201#comment-17637201 ] XiDuo You commented on SPARK-41219: --- I'm looking at this
> Regression in IntegralDivide returning null instead of 0
> --------------------------------------------------------
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Raza Jafri
> Priority: Major
>
> There seems to be a regression in IntegralDivide in Spark 3.4:
>
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                             null|
> |                             null|
> +---------------------------------+
> {code}
>
> While in Spark 3.3.0:
>
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                                0|
> |                                0|
> +---------------------------------+
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41223) Upgrade slf4j to 2.0.4
Yang Jie created SPARK-41223: Summary: Upgrade slf4j to 2.0.4 Key: SPARK-41223 URL: https://issues.apache.org/jira/browse/SPARK-41223 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0 Reporter: Yang Jie https://www.slf4j.org/news.html#2.0.4 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41222) Unify the typing definitions
[ https://issues.apache.org/jira/browse/SPARK-41222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637199#comment-17637199 ] Apache Spark commented on SPARK-41222: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38757 > Unify the typing definitions > > > Key: SPARK-41222 > URL: https://issues.apache.org/jira/browse/SPARK-41222 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41222) Unify the typing definitions
[ https://issues.apache.org/jira/browse/SPARK-41222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41222: Assignee: (was: Apache Spark) > Unify the typing definitions > > > Key: SPARK-41222 > URL: https://issues.apache.org/jira/browse/SPARK-41222 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41222) Unify the typing definitions
[ https://issues.apache.org/jira/browse/SPARK-41222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41222: Assignee: Apache Spark > Unify the typing definitions > > > Key: SPARK-41222 > URL: https://issues.apache.org/jira/browse/SPARK-41222 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41222) Unify the typing definitions
[ https://issues.apache.org/jira/browse/SPARK-41222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637195#comment-17637195 ] Apache Spark commented on SPARK-41222: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/38757 > Unify the typing definitions > > > Key: SPARK-41222 > URL: https://issues.apache.org/jira/browse/SPARK-41222 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41222) Unify the typing definitions
Ruifeng Zheng created SPARK-41222: - Summary: Unify the typing definitions Key: SPARK-41222 URL: https://issues.apache.org/jira/browse/SPARK-41222 Project: Spark Issue Type: Sub-task Components: Connect, PySpark Affects Versions: 3.4.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41220) Range partitioner sample supports column pruning
[ https://issues.apache.org/jira/browse/SPARK-41220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637149#comment-17637149 ] Apache Spark commented on SPARK-41220: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/38756 > Range partitioner sample supports column pruning > > > Key: SPARK-41220 > URL: https://issues.apache.org/jira/browse/SPARK-41220 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > When doing a global sort, we first sample the data to compute range bounds, then use the range partitioner to do the shuffle exchange. > The issue is that the sample plan is coupled with the shuffle plan, which prevents us from optimizing the sample plan. The sample plan only needs the sort-order columns, but the shuffle plan carries all data columns. So, at a minimum, we can do column pruning for the sample plan so that it fetches only the ordering columns. > A common example is: `OPTIMIZE table ZORDER BY columns` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41220) Range partitioner sample supports column pruning
[ https://issues.apache.org/jira/browse/SPARK-41220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41220: Assignee: (was: Apache Spark) > Range partitioner sample supports column pruning > > > Key: SPARK-41220 > URL: https://issues.apache.org/jira/browse/SPARK-41220 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > When doing a global sort, we first sample the data to compute range bounds, then use the range partitioner to do the shuffle exchange. > The issue is that the sample plan is coupled with the shuffle plan, which prevents us from optimizing the sample plan. The sample plan only needs the sort-order columns, but the shuffle plan carries all data columns. So, at a minimum, we can do column pruning for the sample plan so that it fetches only the ordering columns. > A common example is: `OPTIMIZE table ZORDER BY columns` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41220) Range partitioner sample supports column pruning
[ https://issues.apache.org/jira/browse/SPARK-41220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41220: Assignee: Apache Spark > Range partitioner sample supports column pruning > > > Key: SPARK-41220 > URL: https://issues.apache.org/jira/browse/SPARK-41220 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Apache Spark >Priority: Major > > When doing a global sort, we first sample the data to compute range bounds, then use the range partitioner to do the shuffle exchange. > The issue is that the sample plan is coupled with the shuffle plan, which prevents us from optimizing the sample plan. The sample plan only needs the sort-order columns, but the shuffle plan carries all data columns. So, at a minimum, we can do column pruning for the sample plan so that it fetches only the ordering columns. > A common example is: `OPTIMIZE table ZORDER BY columns` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41220) Range partitioner sample supports column pruning
[ https://issues.apache.org/jira/browse/SPARK-41220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637148#comment-17637148 ] Apache Spark commented on SPARK-41220: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/38756 > Range partitioner sample supports column pruning > > > Key: SPARK-41220 > URL: https://issues.apache.org/jira/browse/SPARK-41220 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Priority: Major > > When doing a global sort, we first sample the data to compute range bounds, then use the range partitioner to do the shuffle exchange. > The issue is that the sample plan is coupled with the shuffle plan, which prevents us from optimizing the sample plan. The sample plan only needs the sort-order columns, but the shuffle plan carries all data columns. So, at a minimum, we can do column pruning for the sample plan so that it fetches only the ordering columns. > A common example is: `OPTIMIZE table ZORDER BY columns` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41135) Rename UNSUPPORTED_EMPTY_LOCATION to INVALID_EMPTY_LOCATION
[ https://issues.apache.org/jira/browse/SPARK-41135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-41135: Assignee: Haejoon Lee > Rename UNSUPPORTED_EMPTY_LOCATION to INVALID_EMPTY_LOCATION > --- > > Key: SPARK-41135 > URL: https://issues.apache.org/jira/browse/SPARK-41135 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > > The name of `UNSUPPORTED_EMPTY_LOCATION` can be improved with its error > message -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41135) Rename UNSUPPORTED_EMPTY_LOCATION to INVALID_EMPTY_LOCATION
[ https://issues.apache.org/jira/browse/SPARK-41135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-41135. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38650 [https://github.com/apache/spark/pull/38650] > Rename UNSUPPORTED_EMPTY_LOCATION to INVALID_EMPTY_LOCATION > --- > > Key: SPARK-41135 > URL: https://issues.apache.org/jira/browse/SPARK-41135 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.4.0 > > > The name of `UNSUPPORTED_EMPTY_LOCATION` can be improved with its error > message -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41221) Add the error class INVALID_FORMAT
[ https://issues.apache.org/jira/browse/SPARK-41221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637129#comment-17637129 ] Apache Spark commented on SPARK-41221: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/38755 > Add the error class INVALID_FORMAT > -- > > Key: SPARK-41221 > URL: https://issues.apache.org/jira/browse/SPARK-41221 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Introduce new error class for the errors related to invalid format or pattern. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41221) Add the error class INVALID_FORMAT
[ https://issues.apache.org/jira/browse/SPARK-41221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41221: Assignee: Max Gekk (was: Apache Spark) > Add the error class INVALID_FORMAT > -- > > Key: SPARK-41221 > URL: https://issues.apache.org/jira/browse/SPARK-41221 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Introduce new error class for the errors related to invalid format or pattern. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41221) Add the error class INVALID_FORMAT
[ https://issues.apache.org/jira/browse/SPARK-41221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637128#comment-17637128 ] Apache Spark commented on SPARK-41221: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/38755 > Add the error class INVALID_FORMAT > -- > > Key: SPARK-41221 > URL: https://issues.apache.org/jira/browse/SPARK-41221 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > > Introduce new error class for the errors related to invalid format or pattern. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41221) Add the error class INVALID_FORMAT
[ https://issues.apache.org/jira/browse/SPARK-41221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41221: Assignee: Apache Spark (was: Max Gekk) > Add the error class INVALID_FORMAT > -- > > Key: SPARK-41221 > URL: https://issues.apache.org/jira/browse/SPARK-41221 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Apache Spark >Priority: Major > > Introduce new error class for the errors related to invalid format or pattern. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41221) Add the error class INVALID_FORMAT
Max Gekk created SPARK-41221: Summary: Add the error class INVALID_FORMAT Key: SPARK-41221 URL: https://issues.apache.org/jira/browse/SPARK-41221 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Assignee: Max Gekk Introduce new error class for the errors related to invalid format or pattern. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41212) Implement `DataFrame.isEmpty`
[ https://issues.apache.org/jira/browse/SPARK-41212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-41212. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38734 [https://github.com/apache/spark/pull/38734] > Implement `DataFrame.isEmpty` > - > > Key: SPARK-41212 > URL: https://issues.apache.org/jira/browse/SPARK-41212 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41212) Implement `DataFrame.isEmpty`
[ https://issues.apache.org/jira/browse/SPARK-41212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-41212: - Assignee: Ruifeng Zheng > Implement `DataFrame.isEmpty` > - > > Key: SPARK-41212 > URL: https://issues.apache.org/jira/browse/SPARK-41212 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Ruifeng Zheng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41219) Regression in IntegralDivide returning null instead of 0
[ https://issues.apache.org/jira/browse/SPARK-41219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637102#comment-17637102 ] Yuming Wang commented on SPARK-41219: - cc [~ulysses]
> Regression in IntegralDivide returning null instead of 0
> --------------------------------------------------------
>
> Key: SPARK-41219
> URL: https://issues.apache.org/jira/browse/SPARK-41219
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.4.0
> Reporter: Raza Jafri
> Priority: Major
>
> There seems to be a regression in IntegralDivide in Spark 3.4:
>
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                             null|
> |                             null|
> +---------------------------------+
> {code}
>
> While in Spark 3.3.0:
>
> {code:java}
> scala> val df = Seq("0.5944910","0.3314242").toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: string]
> scala> df.selectExpr("cast(a as decimal(7,7)) div 100").show
> +---------------------------------+
> |(CAST(a AS DECIMAL(7,7)) div 100)|
> +---------------------------------+
> |                                0|
> |                                0|
> +---------------------------------+
> {code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases
[ https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17637096#comment-17637096 ] Gabor Roczei commented on SPARK-38230: -- Hi [~coalchan], [Your pull request|https://github.com/apache/spark/pull/35549] has been automatically closed by the GitHub action. I would like to create a new pull request based on yours and continue working on this, if you agree. > InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions > in most cases > --- > > Key: SPARK-38230 > URL: https://issues.apache.org/jira/browse/SPARK-38230 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2 >Reporter: Coal Chan >Priority: Major > > In `org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand`, > `sparkSession.sessionState.catalog.listPartitions` calls the method > `org.apache.hadoop.hive.metastore.listPartitionsPsWithAuth` of the Hive metastore > client, and this method produces multiple queries per partition on the Hive metastore DB. So when you insert into a table which has many > partitions (e.g. 10k), it produces too many queries on the Hive metastore > DB (e.g. n * 10k = 10nk), which puts a lot of strain on the database. > In fact, it calls `listPartitions` only in order to get the locations of > partitions and compute `customPartitionLocations`. But in most cases there are no > custom partition locations, so it is enough to fetch just the partition names by > calling `listPartitionNames`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
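The saving described above can be made concrete with a toy metastore: fetching full partition objects costs queries per partition, while fetching only the names is a single call. A hypothetical pure-Python sketch (not the actual Hive metastore client API; the per-partition query count of 3 is illustrative):

```python
class ToyMetastore:
    """Counts DB queries issued by two ways of listing partitions."""

    def __init__(self, partitions):
        self._partitions = partitions  # name -> location
        self.queries = 0

    def list_partitions(self):
        # Full metadata: assume several queries per partition (location,
        # parameters, privileges, ...), as with listPartitionsPsWithAuth.
        details = []
        for name, location in self._partitions.items():
            self.queries += 3  # illustrative per-partition query count
            details.append({"name": name, "location": location})
        return details

    def list_partition_names(self):
        # Names only: one query for the whole table.
        self.queries += 1
        return list(self._partitions)

ms = ToyMetastore({f"dt=2022-11-{d:02d}": f"/warehouse/dt={d}" for d in range(1, 31)})
ms.list_partitions()
heavy = ms.queries
ms.queries = 0
ms.list_partition_names()
print(heavy, ms.queries)  # 90 vs 1 queries for 30 partitions
```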
[jira] [Created] (SPARK-41220) Range partitioner sample supports column pruning
XiDuo You created SPARK-41220: - Summary: Range partitioner sample supports column pruning Key: SPARK-41220 URL: https://issues.apache.org/jira/browse/SPARK-41220 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: XiDuo You When doing a global sort, we first sample the data to compute range bounds, then use the range partitioner to do the shuffle exchange. The issue is that the sample plan is coupled with the shuffle plan, which prevents us from optimizing the sample plan. The sample plan only needs the sort-order columns, but the shuffle plan carries all data columns. So, at a minimum, we can do column pruning for the sample plan so that it fetches only the ordering columns. A common example is: `OPTIMIZE table ZORDER BY columns` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
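The optimization in SPARK-41220 can be illustrated outside Spark: computing range bounds needs only the ordering column(s), so the sample can project away every other column. A hypothetical pure-Python sketch of range-bounds sampling with column pruning (not Spark's RangePartitioner):

```python
import bisect

rows = [(5, "e"), (1, "a"), (9, "i"), (3, "c"), (7, "g")]  # (sort_key, payload)

# Column pruning: sample only the ordering column, not the whole rows.
sample_keys = sorted(key for key, _payload in rows)

# Pick range bounds that split the sorted sample into 2 partitions.
num_partitions = 2
bounds = [sample_keys[len(sample_keys) * i // num_partitions]
          for i in range(1, num_partitions)]

def partition_of(key: int) -> int:
    # bisect finds which range a key falls into, i.e. its shuffle partition.
    return bisect.bisect_right(bounds, key)

print(bounds, partition_of(3), partition_of(7))  # [5] 0 1
```

The payload column never participates in the sampling step, which is exactly the pruning the ticket proposes for the sample plan.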