[jira] [Updated] (SPARK-45108) Improve the InjectRuntimeFilter check for a probable shuffle

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45108:
---
Labels: pull-request-available  (was: )

> Improve the InjectRuntimeFilter check for a probable shuffle
> --
>
> Key: SPARK-45108
> URL: https://issues.apache.org/jira/browse/SPARK-45108
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: jiaan.geng
>Priority: Major
>  Labels: pull-request-available
>
> InjectRuntimeFilter needs to check whether a join will probably shuffle. The 
> current code, however, may call isProbablyShuffleJoin twice when the right 
> side of the Join node is used as the application side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-45108) Improve the InjectRuntimeFilter check for a probable shuffle

2023-09-08 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-45108:
--

 Summary: Improve the InjectRuntimeFilter check for a probable shuffle
 Key: SPARK-45108
 URL: https://issues.apache.org/jira/browse/SPARK-45108
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 4.0.0
Reporter: jiaan.geng


InjectRuntimeFilter needs to check whether a join will probably shuffle. The 
current code, however, may call isProbablyShuffleJoin twice when the right 
side of the Join node is used as the application side.






[jira] [Updated] (SPARK-42750) Support INSERT INTO by name

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-42750:
---
Labels: pull-request-available  (was: )

> Support INSERT INTO by name
> ---
>
> Key: SPARK-42750
> URL: https://issues.apache.org/jira/browse/SPARK-42750
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Jose Torres
>Assignee: Jia Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> In some use cases, users have incoming DataFrames with fixed column names 
> that might differ from the canonical column order. Currently there is no way 
> to handle this easily through the INSERT INTO API; the user has to make sure 
> the columns are in the right order, as they would when inserting a tuple. We 
> should add an optional BY NAME clause, such that:
> INSERT INTO tgt BY NAME <source>
> takes each column of <source> and inserts it into the column in `tgt` that 
> has the same name, according to the configured `resolver` logic.
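A Spark-free sketch of the intended by-name semantics: the target table's column order is fixed, and each incoming column is matched by name. The function name and the case-insensitive matcher are hypothetical stand-ins for Spark's configured `resolver`.

```python
def align_by_name(target_columns, row_as_dict, resolver=None):
    """Reorder an incoming row's values to the target table's column order,
    matching columns by name rather than by position."""
    if resolver is None:
        # Stand-in for the configured resolver: case-insensitive match.
        resolver = lambda a, b: a.lower() == b.lower()
    aligned = []
    for tgt in target_columns:
        matches = [v for name, v in row_as_dict.items() if resolver(name, tgt)]
        if len(matches) != 1:
            raise ValueError(f"cannot resolve column {tgt!r} by name")
        aligned.append(matches[0])
    return aligned

# An incoming row whose columns are not in the table's canonical order:
row = {"name": "a", "ID": 1}
print(align_by_name(["id", "name"], row))  # → [1, 'a']
```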






[jira] [Updated] (SPARK-45107) Refine docstring of `explode`

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45107:
---
Labels: pull-request-available  (was: )

> Refine docstring of `explode`
> -
>
> Key: SPARK-45107
> URL: https://issues.apache.org/jira/browse/SPARK-45107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 4.0.0
>Reporter: Allison Wang
>Priority: Major
>  Labels: pull-request-available
>
> Refine the docstring of `explode`.






[jira] [Assigned] (SPARK-44866) Spark wrongly maps the BOOLEAN type to BIT(1) in Snowflake

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44866:
-

Assignee: Hayssam Saleh

> Spark wrongly maps the BOOLEAN type to BIT(1) in Snowflake
> -
>
> Key: SPARK-44866
> URL: https://issues.apache.org/jira/browse/SPARK-44866
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Hayssam Saleh
>Assignee: Hayssam Saleh
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In Snowflake, the Boolean type is represented by the BOOLEAN data type 
> ([https://docs.snowflake.com/en/sql-reference/data-types-logical]), but Spark 
> relies on the default JdbcDialect to generate the mapping, which maps 
> _Boolean_ to _BIT(1)_.
> This should probably be handled by a Snowflake-specific dialect.
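The proposed fix can be sketched in miniature: a Snowflake-specific dialect that overrides only the Boolean mapping and defers everything else to the default. This is a hypothetical pure-Python model of the idea, not Spark's actual JdbcDialect API.

```python
# Default mapping, as the generic dialect would produce it (illustrative).
DEFAULT_JDBC_TYPES = {
    "BooleanType": "BIT(1)",
    "IntegerType": "INTEGER",
    "StringType": "TEXT",
}

class DefaultDialect:
    def get_jdbc_type(self, catalyst_type):
        return DEFAULT_JDBC_TYPES[catalyst_type]

class SnowflakeDialect(DefaultDialect):
    # Snowflake has a native BOOLEAN type, so override just that mapping
    # and fall back to the default dialect for everything else.
    def get_jdbc_type(self, catalyst_type):
        if catalyst_type == "BooleanType":
            return "BOOLEAN"
        return super().get_jdbc_type(catalyst_type)
```

The design point is the same as in Spark's dialect mechanism: a per-database dialect only needs to override the mappings that differ from the default.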






[jira] [Resolved] (SPARK-44866) Spark wrongly maps the BOOLEAN type to BIT(1) in Snowflake

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44866.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42545
[https://github.com/apache/spark/pull/42545]

> Spark wrongly maps the BOOLEAN type to BIT(1) in Snowflake
> -
>
> Key: SPARK-44866
> URL: https://issues.apache.org/jira/browse/SPARK-44866
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Hayssam Saleh
>Assignee: Hayssam Saleh
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In Snowflake, the Boolean type is represented by the BOOLEAN data type 
> ([https://docs.snowflake.com/en/sql-reference/data-types-logical]), but Spark 
> relies on the default JdbcDialect to generate the mapping, which maps 
> _Boolean_ to _BIT(1)_.
> This should probably be handled by a Snowflake-specific dialect.






[jira] [Created] (SPARK-45107) Refine docstring of `explode`

2023-09-08 Thread Allison Wang (Jira)
Allison Wang created SPARK-45107:


 Summary: Refine docstring of `explode`
 Key: SPARK-45107
 URL: https://issues.apache.org/jira/browse/SPARK-45107
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark
Affects Versions: 4.0.0
Reporter: Allison Wang


Refine the docstring of `explode`.






[jira] [Resolved] (SPARK-44819) Make Python the first language in all Spark code snippet

2023-09-08 Thread Allison Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Allison Wang resolved SPARK-44819.
--
Resolution: Duplicate

Fixed in https://issues.apache.org/jira/browse/SPARK-42642

> Make Python the first language in all Spark code snippet
> 
>
> Key: SPARK-44819
> URL: https://issues.apache.org/jira/browse/SPARK-44819
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Allison Wang
>Priority: Major
> Attachments: Screenshot 2023-08-15 at 11.59.11.png
>
>
> Currently, the first and default language for all code snippets is Scala. For 
> instance: https://spark.apache.org/docs/latest/quick-start.html
> We should make Python the first language for all the code snippets.






[jira] [Updated] (SPARK-44866) Spark wrongly maps the BOOLEAN type to BIT(1) in Snowflake

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44866:
---
Labels: pull-request-available  (was: )

> Spark wrongly maps the BOOLEAN type to BIT(1) in Snowflake
> -
>
> Key: SPARK-44866
> URL: https://issues.apache.org/jira/browse/SPARK-44866
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: Hayssam Saleh
>Priority: Major
>  Labels: pull-request-available
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In Snowflake, the Boolean type is represented by the BOOLEAN data type 
> ([https://docs.snowflake.com/en/sql-reference/data-types-logical]), but Spark 
> relies on the default JdbcDialect to generate the mapping, which maps 
> _Boolean_ to _BIT(1)_.
> This should probably be handled by a Snowflake-specific dialect.






[jira] [Updated] (SPARK-43299) JVM Client throw StreamingQueryException when error handling is implemented

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43299:
---
Labels: pull-request-available  (was: )

> JVM Client throw StreamingQueryException when error handling is implemented
> ---
>
> Key: SPARK-43299
> URL: https://issues.apache.org/jira/browse/SPARK-43299
> Project: Spark
>  Issue Type: Task
>  Components: Connect, Structured Streaming
>Affects Versions: 3.5.0
>Reporter: Wei Liu
>Priority: Major
>  Labels: pull-request-available
>
> Currently, the awaitTermination() method of the Connect JVM client's 
> StreamingQuery does not throw an error when an exception occurs.
>
> In Python Connect this is handled directly by the Python client's 
> error-handling framework, but no equivalent exists in the JVM client yet.
>
> We should verify that awaitTermination() throws a StreamingQueryException 
> once the JVM client implements error handling.






[jira] [Resolved] (SPARK-45104) Upgrade graphlib-dot.min.js to 1.0.2

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45104.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42853
[https://github.com/apache/spark/pull/42853]

> Upgrade graphlib-dot.min.js to 1.0.2
> 
>
> Key: SPARK-45104
> URL: https://issues.apache.org/jira/browse/SPARK-45104
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 4.0.0
>
>







[jira] [Assigned] (SPARK-45104) Upgrade graphlib-dot.min.js to 1.0.2

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45104:
-

Assignee: Kent Yao

> Upgrade graphlib-dot.min.js to 1.0.2
> 
>
> Key: SPARK-45104
> URL: https://issues.apache.org/jira/browse/SPARK-45104
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-45104) Upgrade graphlib-dot.min.js to 1.0.2

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45104:
--
Summary: Upgrade graphlib-dot.min.js to 1.0.2  (was: Upgrade 
graphlib-dot.min.js from 0.6.4 to 1.0.2)

> Upgrade graphlib-dot.min.js to 1.0.2
> 
>
> Key: SPARK-45104
> URL: https://issues.apache.org/jira/browse/SPARK-45104
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-45075) Alter table with invalid default value will not report error

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45075.
---
Fix Version/s: 3.5.1
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 42810
[https://github.com/apache/spark/pull/42810]

> Alter table with invalid default value will not report error
> 
>
> Key: SPARK-45075
> URL: https://issues.apache.org/jira/browse/SPARK-45075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Jia Fan
>Assignee: Jia Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0, 3.4.2
>
>
> create table t(i boolean, s bigint);
> alter table t alter column s set default badvalue;
>  
> The code does not report an error on DataSource V2, which is not aligned with V1.
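The expected V1-aligned behavior amounts to validating the default eagerly at ALTER time. Below is a minimal, hypothetical sketch of such a check; the function name and the type table are illustrative, and the real fix lives in Spark's DSv2 ALTER TABLE path.

```python
def validate_default_value(column_type, default_expr):
    """Eagerly check that a default expression can be cast to the column
    type, so ALTER TABLE fails fast instead of silently accepting it."""
    casts = {
        "bigint": int,
        "double": float,
        "boolean": lambda s: {"true": True, "false": False}[s.lower()],
    }
    try:
        casts[column_type](default_expr)
    except (KeyError, ValueError):
        raise ValueError(
            f"invalid default value {default_expr!r} for type {column_type}")

validate_default_value("bigint", "42")            # accepted
try:
    validate_default_value("bigint", "badvalue")  # rejected eagerly
except ValueError as e:
    print(e)
```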






[jira] [Updated] (SPARK-45075) Alter table with invalid default value will not report error

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45075:
---
Labels: pull-request-available  (was: )

> Alter table with invalid default value will not report error
> 
>
> Key: SPARK-45075
> URL: https://issues.apache.org/jira/browse/SPARK-45075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Jia Fan
>Priority: Major
>  Labels: pull-request-available
>
> create table t(i boolean, s bigint);
> alter table t alter column s set default badvalue;
>  
> The code does not report an error on DataSource V2, which is not aligned with V1.






[jira] [Assigned] (SPARK-45075) Alter table with invalid default value will not report error

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45075:
-

Assignee: Jia Fan

> Alter table with invalid default value will not report error
> 
>
> Key: SPARK-45075
> URL: https://issues.apache.org/jira/browse/SPARK-45075
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0
>Reporter: Jia Fan
>Assignee: Jia Fan
>Priority: Major
>  Labels: pull-request-available
>
> create table t(i boolean, s bigint);
> alter table t alter column s set default badvalue;
>  
> The code does not report an error on DataSource V2, which is not aligned with V1.






[jira] [Assigned] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44805:
-

Assignee: Bruce Robbins

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.4.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness, pull-request-available
>
> When union-ing two DataFrames read from Parquet that contain nested 
> structures (two array-typed fields, one of doubles and one of integers), 
> data from the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if the nested vectorized reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub






[jira] [Resolved] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44805.
---
Fix Version/s: 3.3.4
   3.5.1
   4.0.0
   3.4.2
   Resolution: Fixed

Issue resolved by pull request 42850
[https://github.com/apache/spark/pull/42850]

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.4.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: correctness, pull-request-available
> Fix For: 3.3.4, 3.5.1, 4.0.0, 3.4.2
>
>
> When union-ing two DataFrames read from Parquet that contain nested 
> structures (two array-typed fields, one of doubles and one of integers), 
> data from the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if the nested vectorized reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub






[jira] [Updated] (SPARK-45088) Make `getitem` work with duplicated columns

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45088:
---
Labels: pull-request-available  (was: )

> Make `getitem` work with duplicated columns
> ---
>
> Key: SPARK-45088
> URL: https://issues.apache.org/jira/browse/SPARK-45088
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Updated] (SPARK-45104) Upgrade graphlib-dot.min.js from 0.6.4 to 1.0.2

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45104?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45104:
---
Labels: pull-request-available  (was: )

> Upgrade graphlib-dot.min.js from 0.6.4 to 1.0.2
> ---
>
> Key: SPARK-45104
> URL: https://issues.apache.org/jira/browse/SPARK-45104
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 4.0.0
>Reporter: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Resolved] (SPARK-45106) percentile_cont gets internal error when user input fails runtime replacement's input type check

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45106.
---
Fix Version/s: 3.5.1
 Assignee: Bruce Robbins
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/42857

>  percentile_cont gets internal error when user input fails runtime 
> replacement's input type check
> -
>
> Key: SPARK-45106
> URL: https://issues.apache.org/jira/browse/SPARK-45106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Bruce Robbins
>Assignee: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1
>
>
> This query throws an internal error rather than producing a useful error 
> message:
> {noformat}
> select percentile_cont(b) WITHIN GROUP (ORDER BY a DESC) as x 
> from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b);
> [INTERNAL_ERROR] Cannot resolve the runtime replaceable expression 
> "percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)".
> org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot resolve the runtime 
> replaceable expression "percentile_cont(a, b)". The replacement is 
> unresolved: "percentile(a, b, 1)".
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:92)
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:96)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:313)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:277)
> ...
> {noformat}
> It should instead inform the user that the input expression must be foldable.
> {{PercentileCont}} does not check the user's input. If the runtime 
> replacement (an instance of {{Percentile}}) rejects the user's input, the 
> runtime replacement ends up unresolved.
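A miniature, hypothetical model of the missing guard: have the user-facing expression validate the percentage argument itself (foldable, within [0, 1]) before the replacement is built, so the failure surfaces as a clear error rather than an unresolved internal expression. Class and function names below are illustrative, not Spark's.

```python
class Literal:
    # A foldable expression: its value is known at analysis time.
    foldable = True
    def __init__(self, value):
        self.value = value

class ColumnRef:
    # A column reference: not foldable, value unknown until runtime.
    foldable = False
    def __init__(self, name):
        self.name = name

def check_percentile_cont_input(percentage):
    # User-facing check performed by the outer expression, so the failure
    # surfaces as a clear error instead of an INTERNAL_ERROR raised later
    # when the runtime replacement fails to resolve.
    if not getattr(percentage, "foldable", False):
        raise ValueError("percentile_cont: the percentage argument must be "
                         "a foldable expression")
    if not 0.0 <= percentage.value <= 1.0:
        raise ValueError("percentile_cont: percentage must be between 0 and 1")

check_percentile_cont_input(Literal(0.25))      # accepted
try:
    check_percentile_cont_input(ColumnRef("b"))  # column ref: not foldable
except ValueError as e:
    print(e)
```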






[jira] [Updated] (SPARK-45098) Custom jekyll-redirect-from redirect.html template

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45098:
---
Labels: pull-request-available  (was: )

> Custom jekyll-redirect-from redirect.html template
> -
>
> Key: SPARK-45098
> URL: https://issues.apache.org/jira/browse/SPARK-45098
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Assigned] (SPARK-45098) Custom jekyll-redirect-from redirect.html template

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-45098:
-

Assignee: Kent Yao

> Custom jekyll-redirect-from redirect.html template
> -
>
> Key: SPARK-45098
> URL: https://issues.apache.org/jira/browse/SPARK-45098
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>







[jira] [Resolved] (SPARK-45098) Custom jekyll-redirect-from redirect.html template

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-45098.
---
Fix Version/s: 3.5.1
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42848
[https://github.com/apache/spark/pull/42848]

> Custom jekyll-redirect-from redirect.html template
> -
>
> Key: SPARK-45098
> URL: https://issues.apache.org/jira/browse/SPARK-45098
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 3.5.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.1, 4.0.0
>
>







[jira] [Updated] (SPARK-45106) percentile_cont gets internal error when user input fails runtime replacement's input type check

2023-09-08 Thread Bruce Robbins (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bruce Robbins updated SPARK-45106:
--
Affects Version/s: 3.3.2

>  percentile_cont gets internal error when user input fails runtime 
> replacement's input type check
> -
>
> Key: SPARK-45106
> URL: https://issues.apache.org/jira/browse/SPARK-45106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
>
> This query throws an internal error rather than producing a useful error 
> message:
> {noformat}
> select percentile_cont(b) WITHIN GROUP (ORDER BY a DESC) as x 
> from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b);
> [INTERNAL_ERROR] Cannot resolve the runtime replaceable expression 
> "percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)".
> org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot resolve the runtime 
> replaceable expression "percentile_cont(a, b)". The replacement is 
> unresolved: "percentile(a, b, 1)".
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:92)
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:96)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:313)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:277)
> ...
> {noformat}
> It should instead inform the user that the input expression must be foldable.
> {{PercentileCont}} does not check the user's input. If the runtime 
> replacement (an instance of {{Percentile}}) rejects the user's input, the 
> runtime replacement ends up unresolved.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-44986) There should be a gap at the bottom of the HTML

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-44986.
---
Fix Version/s: 4.0.0
   Resolution: Fixed

Issue resolved by pull request 42702
[https://github.com/apache/spark/pull/42702]

> There should be a gap at the bottom of the HTML
> ---
>
> Key: SPARK-44986
> URL: https://issues.apache.org/jira/browse/SPARK-44986
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 4.0.0
>
> Attachments: image-2023-08-28-16-46-04-705.png, 
> image-2023-08-28-16-47-11-582.png
>
>
> Before:
> !image-2023-08-28-16-47-11-582.png|width=794,height=392!
>  
> After:
> !image-2023-08-28-16-46-04-705.png|width=744,height=329!






[jira] [Assigned] (SPARK-44986) There should be a gap at the bottom of the HTML

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-44986:
-

Assignee: BingKun Pan

> There should be a gap at the bottom of the HTML
> ---
>
> Key: SPARK-44986
> URL: https://issues.apache.org/jira/browse/SPARK-44986
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Attachments: image-2023-08-28-16-46-04-705.png, 
> image-2023-08-28-16-47-11-582.png
>
>
> Before:
> !image-2023-08-28-16-47-11-582.png|width=794,height=392!
>  
> After:
> !image-2023-08-28-16-46-04-705.png|width=744,height=329!






[jira] [Updated] (SPARK-44986) There should be a gap at the bottom of the HTML

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44986:
---
Labels: pull-request-available  (was: )

> There should be a gap at the bottom of the HTML
> ---
>
> Key: SPARK-44986
> URL: https://issues.apache.org/jira/browse/SPARK-44986
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 4.0.0
>
> Attachments: image-2023-08-28-16-46-04-705.png, 
> image-2023-08-28-16-47-11-582.png
>
>
> Before:
> !image-2023-08-28-16-47-11-582.png|width=794,height=392!
>  
> After:
> !image-2023-08-28-16-46-04-705.png|width=744,height=329!






[jira] [Updated] (SPARK-45100) reflect() fails with an internal error on NULL class and method

2023-09-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-45100:
-
Fix Version/s: 3.3.4

> reflect() fails with an internal error on NULL class and method
> ---
>
> Key: SPARK-45100
> URL: https://issues.apache.org/jira/browse/SPARK-45100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 3.5.0, 4.0.0, 3.3.4
>
>
> The example below demonstrates the issue:
> {code:sql}
> spark-sql (default)> select reflect('java.util.UUID', CAST(NULL AS STRING));
> [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. 
> You hit a bug in Spark or the Spark plugins you use. Please, report this bug 
> to the corresponding communities or vendors, and provide the full stack trace.
> {code}
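The NULL-guard idea can be illustrated in pure Python (a loose analogue of the SQL function, not Spark's implementation; the None-in, None-out behavior is an assumption about the desired semantics, not the confirmed fix):

```python
import importlib

def reflect(module_name, func_name):
    """Loose Python analogue of Spark SQL's reflect(): look up a callable
    by name and return str() of its result.  NULL (None) inputs are
    guarded up front instead of surfacing as an internal error later."""
    if module_name is None or func_name is None:
        return None  # assumed semantics: NULL in, NULL out
    func = getattr(importlib.import_module(module_name), func_name)
    return str(func())

print(reflect("uuid", None) is None)       # True: no internal error
print(len(reflect("uuid", "uuid4")))       # 36 (a formatted UUID string)
```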






[jira] [Updated] (SPARK-44647) Support SPJ when join key is subset of partition keys

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44647:
---
Labels: pull-request-available  (was: )

> Support SPJ when join key is subset of partition keys
> -
>
> Key: SPARK-44647
> URL: https://issues.apache.org/jira/browse/SPARK-44647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Szehon Ho
>Priority: Major
>  Labels: pull-request-available
>







[jira] [Commented] (SPARK-43203) Fix DROP table behavior in session catalog

2023-09-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763137#comment-17763137
 ] 

Dongjoon Hyun commented on SPARK-43203:
---

This is backported to branch-3.4 via https://github.com/apache/spark/pull/41765 

> Fix DROP table behavior in session catalog
> --
>
> Key: SPARK-43203
> URL: https://issues.apache.org/jira/browse/SPARK-43203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Anton Okolnychyi
>Assignee: Jia Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 3.5.0
>
>
> DROP table behavior is not working correctly in 3.4.0 because we always 
> invoke V1 drop logic if the identifier looks like a V1 identifier. This is a 
> big blocker for external data sources that provide custom session catalogs.
> See [here|https://github.com/apache/spark/pull/37879/files#r1170501180] for 
> details.
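The gist of the problem can be sketched in a few lines of Python (class names and return values are illustrative, not Spark's actual code): only fall back to the V1 drop path when the session catalog has not been replaced by a custom implementation, instead of dispatching on identifier shape alone.

```python
class V2SessionCatalog:
    """Stands in for the built-in session catalog; external data sources
    may install a subclass with custom drop behavior."""
    def drop_table(self, ident):
        return f"v2 drop of {ident}"

class CustomCatalog(V2SessionCatalog):
    def drop_table(self, ident):
        return f"custom drop of {ident}"

def drop_table(catalog, ident, looks_like_v1_ident):
    # Buggy behavior would be `if looks_like_v1_ident: return "v1 drop"`,
    # which sends custom session catalogs down the V1 path too.  This
    # sketch also requires the catalog to be the *built-in* one before
    # taking the V1 path.
    if looks_like_v1_ident and type(catalog) is V2SessionCatalog:
        return "v1 drop"
    return catalog.drop_table(ident)

print(drop_table(V2SessionCatalog(), "t1", True))  # built-in: V1 path
print(drop_table(CustomCatalog(), "t1", True))     # custom: its own drop
```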






[jira] [Updated] (SPARK-43203) Fix DROP table behavior in session catalog

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-43203:
--
Fix Version/s: 3.4.2

> Fix DROP table behavior in session catalog
> --
>
> Key: SPARK-43203
> URL: https://issues.apache.org/jira/browse/SPARK-43203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Anton Okolnychyi
>Assignee: Jia Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 3.5.0
>
>
> DROP table behavior is not working correctly in 3.4.0 because we always 
> invoke V1 drop logic if the identifier looks like a V1 identifier. This is a 
> big blocker for external data sources that provide custom session catalogs.
> See [here|https://github.com/apache/spark/pull/37879/files#r1170501180] for 
> details.






[jira] [Updated] (SPARK-43203) Fix DROP table behavior in session catalog

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-43203:
---
Labels: pull-request-available  (was: )

> Fix DROP table behavior in session catalog
> --
>
> Key: SPARK-43203
> URL: https://issues.apache.org/jira/browse/SPARK-43203
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Anton Okolnychyi
>Assignee: Jia Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0
>
>
> DROP table behavior is not working correctly in 3.4.0 because we always 
> invoke V1 drop logic if the identifier looks like a V1 identifier. This is a 
> big blocker for external data sources that provide custom session catalogs.
> See [here|https://github.com/apache/spark/pull/37879/files#r1170501180] for 
> details.






[jira] [Updated] (SPARK-45100) reflect() fails with an internal error on NULL class and method

2023-09-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-45100:
--
Fix Version/s: 3.4.2

> reflect() fails with an internal error on NULL class and method
> ---
>
> Key: SPARK-45100
> URL: https://issues.apache.org/jira/browse/SPARK-45100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 3.5.0, 4.0.0
>
>
> The example below demonstrates the issue:
> {code:sql}
> spark-sql (default)> select reflect('java.util.UUID', CAST(NULL AS STRING));
> [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. 
> You hit a bug in Spark or the Spark plugins you use. Please, report this bug 
> to the corresponding communities or vendors, and provide the full stack trace.
> {code}






[jira] [Commented] (SPARK-45100) reflect() fails with an internal error on NULL class and method

2023-09-08 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17763131#comment-17763131
 ] 

Dongjoon Hyun commented on SPARK-45100:
---

This is backported to branch-3.4 via https://github.com/apache/spark/pull/42855

> reflect() fails with an internal error on NULL class and method
> ---
>
> Key: SPARK-45100
> URL: https://issues.apache.org/jira/browse/SPARK-45100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0, 4.0.0
>
>
> The example below demonstrates the issue:
> {code:sql}
> spark-sql (default)> select reflect('java.util.UUID', CAST(NULL AS STRING));
> [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. 
> You hit a bug in Spark or the Spark plugins you use. Please, report this bug 
> to the corresponding communities or vendors, and provide the full stack trace.
> {code}






[jira] [Updated] (SPARK-45100) reflect() fails with an internal error on NULL class and method

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45100:
---
Labels: pull-request-available  (was: )

> reflect() fails with an internal error on NULL class and method
> ---
>
> Key: SPARK-45100
> URL: https://issues.apache.org/jira/browse/SPARK-45100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.5.0, 4.0.0
>
>
> The example below demonstrates the issue:
> {code:sql}
> spark-sql (default)> select reflect('java.util.UUID', CAST(NULL AS STRING));
> [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. 
> You hit a bug in Spark or the Spark plugins you use. Please, report this bug 
> to the corresponding communities or vendors, and provide the full stack trace.
> {code}






[jira] [Updated] (SPARK-44805) Data lost after union using spark.sql.parquet.enableNestedColumnVectorizedReader=true

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-44805:
---
Labels: correctness pull-request-available  (was: correctness)

> Data lost after union using 
> spark.sql.parquet.enableNestedColumnVectorizedReader=true
> -
>
> Key: SPARK-44805
> URL: https://issues.apache.org/jira/browse/SPARK-44805
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.1, 3.4.1
> Environment: pySpark, linux, hadoop, parquet. 
>Reporter: Jakub Wozniak
>Priority: Major
>  Labels: correctness, pull-request-available
>
> When union-ing two DataFrames read from Parquet that contain nested structures 
> (a struct with two array fields, one of doubles and one of integers), data from 
> the second field seems to be lost (zeros are set instead). 
> This seems to be the case only if the nested vectorized reader is used 
> (spark.sql.parquet.enableNestedColumnVectorizedReader=true). 
> The following Python code reproduces the problem: 
> {code:java}
> from pyspark.sql import SparkSession
> from pyspark.sql.types import *
> # PREPARING DATA
> data1 = []
> data2 = []
> for i in range(2): 
>     data1.append( (([1,2,3],[1,1,2]),i))
>     data2.append( (([1.0,2.0,3.0],[1,1]),i+10))
> schema1 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(IntegerType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> schema2 = StructType([
>         StructField('value', StructType([
>              StructField('f1', ArrayType(DoubleType()), True),
>              StructField('f2', ArrayType(IntegerType()), True)             
>              ])),
>          StructField('id', IntegerType(), True)
> ])
> spark = SparkSession.builder.getOrCreate()
> data_dir = "/user//"
> df1 = spark.createDataFrame(data1, schema1)
> df1.write.mode('overwrite').parquet(data_dir + "data1") 
> df2 = spark.createDataFrame(data2, schema2)
> df2.write.mode('overwrite').parquet(data_dir + "data2") 
> # READING DATA
> parquet1 = spark.read.parquet(data_dir + "data1")
> parquet2 = spark.read.parquet(data_dir + "data2")
> # UNION
> out = parquet1.union(parquet2)
> parquet1.select("value.f2").distinct().show()
> out.select("value.f2").distinct().show()
> print(parquet1.collect())
> print(out.collect()) {code}
> Output: 
> {code:java}
> +---------+
> |       f2|
> +---------+
> |[1, 1, 2]|
> +---------+
> +---------+
> |       f2|
> +---------+
> |[0, 0, 0]|
> |   [1, 1]|
> +---------+
> [
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=0), 
> Row(value=Row(f1=[1, 2, 3], f2=[1, 1, 2]), id=1)
> ]
> [
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=0), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[0, 0, 0]), id=1), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=10), 
> Row(value=Row(f1=[1.0, 2.0, 3.0], f2=[1, 1]), id=11)
> ] {code}
> Please notice that values for the field f2 are lost after the union is done. 
> This only happens when this data is read from parquet files. 
> Could you please look into this? 
> Best regards,
> Jakub
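Until a fix lands, a possible mitigation (an assumption based on the flag named in this ticket, not a confirmed workaround) is to turn the nested-column vectorized reader off for the affected reads:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Fall back to the non-vectorized reader for nested columns; likely slower,
# but it sidesteps the code path this ticket implicates.
spark.conf.set("spark.sql.parquet.enableNestedColumnVectorizedReader", "false")
```

This is a session-level configuration change, so it affects all subsequent Parquet reads in the session.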






[jira] [Updated] (SPARK-45106) percentile_cont gets internal error when user input fails runtime replacement's input type check

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45106:
---
Labels: pull-request-available  (was: )

>  percentile_cont gets internal error when user input fails runtime 
> replacement's input type check
> -
>
> Key: SPARK-45106
> URL: https://issues.apache.org/jira/browse/SPARK-45106
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0, 4.0.0
>Reporter: Bruce Robbins
>Priority: Major
>  Labels: pull-request-available
>
> This query throws an internal error rather than producing a useful error 
> message:
> {noformat}
> select percentile_cont(b) WITHIN GROUP (ORDER BY a DESC) as x 
> from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b);
> [INTERNAL_ERROR] Cannot resolve the runtime replaceable expression 
> "percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)".
> org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot resolve the runtime 
> replaceable expression "percentile_cont(a, b)". The replacement is 
> unresolved: "percentile(a, b, 1)".
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:92)
>   at 
> org.apache.spark.SparkException$.internalError(SparkException.scala:96)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:313)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:277)
> ...
> {noformat}
> It should instead inform the user that the input expression must be foldable.
> {{PercentileCont}} does not check the user's input. If the runtime 
> replacement (an instance of {{Percentile}}) rejects the user's input, the 
> runtime replacement ends up unresolved.






[jira] [Created] (SPARK-45106) percentile_cont gets internal error when user input fails runtime replacement's input type check

2023-09-08 Thread Bruce Robbins (Jira)
Bruce Robbins created SPARK-45106:
-

 Summary:  percentile_cont gets internal error when user input 
fails runtime replacement's input type check
 Key: SPARK-45106
 URL: https://issues.apache.org/jira/browse/SPARK-45106
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1, 3.5.0, 4.0.0
Reporter: Bruce Robbins


This query throws an internal error rather than producing a useful error 
message:
{noformat}
select percentile_cont(b) WITHIN GROUP (ORDER BY a DESC) as x 
from (values (12, 0.25), (13, 0.25), (22, 0.25)) as (a, b);

[INTERNAL_ERROR] Cannot resolve the runtime replaceable expression 
"percentile_cont(a, b)". The replacement is unresolved: "percentile(a, b, 1)".
org.apache.spark.SparkException: [INTERNAL_ERROR] Cannot resolve the runtime 
replaceable expression "percentile_cont(a, b)". The replacement is unresolved: 
"percentile(a, b, 1)".
at 
org.apache.spark.SparkException$.internalError(SparkException.scala:92)
at 
org.apache.spark.SparkException$.internalError(SparkException.scala:96)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6(CheckAnalysis.scala:313)
at 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis0$6$adapted(CheckAnalysis.scala:277)
...
{noformat}
It should instead inform the user that the input expression must be foldable.

{{PercentileCont}} does not check the user's input. If the runtime replacement 
(an instance of {{Percentile}}) rejects the user's input, the runtime 
replacement ends up unresolved.







[jira] [Updated] (SPARK-45105) Make hyperlinks in documents clickable

2023-09-08 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated SPARK-45105:
---
Labels: pull-request-available  (was: )

> Make hyperlinks in documents clickable
> --
>
> Key: SPARK-45105
> URL: https://issues.apache.org/jira/browse/SPARK-45105
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 4.0.0
>Reporter: BingKun Pan
>Priority: Minor
>  Labels: pull-request-available
>







[jira] [Created] (SPARK-45105) Make hyperlinks in documents clickable

2023-09-08 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-45105:
---

 Summary: Make hyperlinks in documents clickable
 Key: SPARK-45105
 URL: https://issues.apache.org/jira/browse/SPARK-45105
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Affects Versions: 4.0.0
Reporter: BingKun Pan









[jira] [Resolved] (SPARK-45100) reflect() fails with an internal error on NULL class and method

2023-09-08 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45100?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-45100.
--
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42849
[https://github.com/apache/spark/pull/42849]

> reflect() fails with an internal error on NULL class and method
> ---
>
> Key: SPARK-45100
> URL: https://issues.apache.org/jira/browse/SPARK-45100
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> The example below demonstrates the issue:
> {code:sql}
> spark-sql (default)> select reflect('java.util.UUID', CAST(NULL AS STRING));
> [INTERNAL_ERROR] The Spark SQL phase analysis failed with an internal error. 
> You hit a bug in Spark or the Spark plugins you use. Please, report this bug 
> to the corresponding communities or vendors, and provide the full stack trace.
> {code}


