[jira] [Created] (SPARK-42992) Introduce PySparkRuntimeError

2023-03-31 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-42992:
---

 Summary: Introduce PySparkRuntimeError
 Key: SPARK-42992
 URL: https://issues.apache.org/jira/browse/SPARK-42992
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Introduce PySparkRuntimeError to cover RuntimeError in a PySpark-specific way.
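A minimal sketch (assumed, not the actual implementation) of what such an error could look like, following the existing PySpark error framework of an error class plus message parameters; the class body and the example error class name are illustrative only:

{code:python}
from typing import Dict


class PySparkRuntimeError(RuntimeError):
    """RuntimeError raised from PySpark, carrying a machine-readable error class."""

    def __init__(self, error_class: str, message_parameters: Dict[str, str]):
        self.error_class = error_class
        self.message_parameters = message_parameters
        super().__init__(f"[{error_class}] {message_parameters}")


# Call sites would replace bare RuntimeErrors with classified ones, e.g.:
# raise PySparkRuntimeError("JAVA_GATEWAY_EXITED", {"reason": "gateway process died"})
{code}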



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42991) Disable string +/- interval in ANSI mode

2023-03-31 Thread Yuming Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707158#comment-17707158
 ] 

Yuming Wang commented on SPARK-42991:
-

https://github.com/apache/spark/pull/40616
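For context, a minimal sketch of the behaviour the change targets (assuming an existing SparkSession; the exact post-change error is defined by the PR above):

{code:python}
# A string operand in interval arithmetic is currently accepted via implicit
# casting even under ANSI mode; the proposal is to reject it when ANSI is on.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT '2023-03-31' + INTERVAL '1' DAY").show()  # accepted today, to be disabled
{code}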

> Disable string +/- interval in ANSI mode
> 
>
> Key: SPARK-42991
> URL: https://issues.apache.org/jira/browse/SPARK-42991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42991) Disable string type +/- interval in ANSI mode

2023-03-31 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-42991:

Summary: Disable string type +/- interval in ANSI mode  (was: Disable 
string +/- interval in ANSI mode)

> Disable string type +/- interval in ANSI mode
> -
>
> Key: SPARK-42991
> URL: https://issues.apache.org/jira/browse/SPARK-42991
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: Yuming Wang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-42972) ExecutorAllocationManager cannot allocate new instances when all executors down.

2023-03-31 Thread liang yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707143#comment-17707143
 ] 

liang yu edited comment on SPARK-42972 at 3/31/23 7:13 AM:
---

When using Structured Streaming with dynamic allocation enabled, there is a bug 
that can make the program hang. Here is how it happens:

{code:scala}
  private def manageAllocation(): Unit = synchronized {
    logInfo(s"Managing executor allocation with ratios = [$scalingUpRatio, $scalingDownRatio]")
    if (batchProcTimeCount > 0) {
      val averageBatchProcTime = batchProcTimeSum / batchProcTimeCount
      val ratio = averageBatchProcTime.toDouble / batchDurationMs
      // When the ratio is lower than scalingDownRatio, the client tries to kill executors.
      // But if all executors have already died unexpectedly, there are no executors left
      // to kill and no new ones are requested.
      logInfo(s"Average: $averageBatchProcTime, ratio = $ratio")
      if (ratio >= scalingUpRatio) {
        logDebug("Requesting executors")
        val numNewExecutors = math.max(math.round(ratio).toInt, 1)
        requestExecutors(numNewExecutors)
      } else if (ratio <= scalingDownRatio) {
        logDebug("Killing executors")
        killExecutor()
      }
    }
    batchProcTimeSum = 0
    batchProcTimeCount = 0
    // With no executors, no further batches complete, so batchProcTimeCount stays 0,
    // requestExecutors is never called again, and the application hangs indefinitely.
  }
{code}

 




was (Author: JIRAUSER299608):
When the ratio is lower than the scalingDownRatio, the client will try to kill 
executors, but if all executors have died unexpectedly, the program will hang 
because there are no executors to kill.
Then there will be no more batch jobs to complete, batchProcTimeCount will 
always be 0, and the program will be stuck indefinitely.

> ExecutorAllocationManager cannot allocate new instances when all executors 
> down.
> 
>
> Key: SPARK-42972
> URL: https://issues.apache.org/jira/browse/SPARK-42972
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Jiandan Yang 
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-42972) ExecutorAllocationManager cannot allocate new instances when all executors down.

2023-03-31 Thread liang yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707143#comment-17707143
 ] 

liang yu edited comment on SPARK-42972 at 3/31/23 7:16 AM:
---

When using Structured Streaming with dynamic allocation enabled, there is a bug 
that can make the program hang. Here is how it happens:

{code:scala}
  private def manageAllocation(): Unit = synchronized {
    logInfo(s"Managing executor allocation with ratios = [$scalingUpRatio, $scalingDownRatio]")
    if (batchProcTimeCount > 0) {
      val averageBatchProcTime = batchProcTimeSum / batchProcTimeCount
      val ratio = averageBatchProcTime.toDouble / batchDurationMs
      // When the ratio is lower than scalingDownRatio, the client tries to kill executors.
      // But if all executors have already died unexpectedly, there are no executors left
      // to kill and no new ones are requested.
      logInfo(s"Average: $averageBatchProcTime, ratio = $ratio")
      if (ratio >= scalingUpRatio) {
        logDebug("Requesting executors")
        val numNewExecutors = math.max(math.round(ratio).toInt, 1)
        requestExecutors(numNewExecutors)
      } else if (ratio <= scalingDownRatio) {
        logDebug("Killing executors")
        killExecutor()
      }
    }
    batchProcTimeSum = 0
    batchProcTimeCount = 0
    // With no executors, no further batches complete, so batchProcTimeCount stays 0,
    // requestExecutors is never called again, and the application hangs indefinitely.
  }
{code}

When the ratio is lower than the scalingDownRatio, the client will try to kill 
executors, but if all executors have died unexpectedly at the same time, the 
program will hang because there are no executors to kill. Then there will be no 
more batch jobs to complete, batchProcTimeCount will always be 0, and the 
program will be stuck indefinitely, because the last decision was to kill 
executors and requestExecutors will never be triggered again.




was (Author: JIRAUSER299608):
When using Structured Streaming with dynamic allocation enabled, there is a bug 
that can make the program hang. Here is how it happens:

{code:scala}
  private def manageAllocation(): Unit = synchronized {
    logInfo(s"Managing executor allocation with ratios = [$scalingUpRatio, $scalingDownRatio]")
    if (batchProcTimeCount > 0) {
      val averageBatchProcTime = batchProcTimeSum / batchProcTimeCount
      val ratio = averageBatchProcTime.toDouble / batchDurationMs
      // When the ratio is lower than scalingDownRatio, the client tries to kill executors.
      // But if all executors have already died unexpectedly, there are no executors left
      // to kill and no new ones are requested.
      logInfo(s"Average: $averageBatchProcTime, ratio = $ratio")
      if (ratio >= scalingUpRatio) {
        logDebug("Requesting executors")
        val numNewExecutors = math.max(math.round(ratio).toInt, 1)
        requestExecutors(numNewExecutors)
      } else if (ratio <= scalingDownRatio) {
        logDebug("Killing executors")
        killExecutor()
      }
    }
    batchProcTimeSum = 0
    batchProcTimeCount = 0
    // With no executors, no further batches complete, so batchProcTimeCount stays 0,
    // requestExecutors is never called again, and the application hangs indefinitely.
  }
{code}

 



> ExecutorAllocationManager cannot allocate new instances when all executors 
> down.
> 
>
> Key: SPARK-42972
> URL: https://issues.apache.org/jira/browse/SPARK-42972
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Jiandan Yang 
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42972) ExecutorAllocationManager cannot allocate new instances when all executors down.

2023-03-31 Thread liang yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707177#comment-17707177
 ] 

liang yu commented on SPARK-42972:
--

I created a PR: [PR-40621|https://github.com/apache/spark/pull/40621]
[~vkolpakov], could you please help review it?

> ExecutorAllocationManager cannot allocate new instances when all executors 
> down.
> 
>
> Key: SPARK-42972
> URL: https://issues.apache.org/jira/browse/SPARK-42972
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.3.2
>Reporter: Jiandan Yang 
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42993) Make Torch Distributor support Spark Connect

2023-03-31 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-42993:
-

 Summary: Make Torch Distributor support Spark Connect
 Key: SPARK-42993
 URL: https://issues.apache.org/jira/browse/SPARK-42993
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, ML
Affects Versions: 3.5.0
Reporter: Ruifeng Zheng
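For reference, a minimal sketch of the usage this would enable (the Connect address is illustrative, and the distributor API shown is the one that already exists for classic PySpark sessions):

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.torch.distributor import TorchDistributor

# Connect to a Spark Connect server instead of creating a local session.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()


def train(learning_rate):
    # User-defined PyTorch training loop; runs on the executors.
    return learning_rate


distributor = TorchDistributor(num_processes=2, local_mode=False, use_gpu=False)
result = distributor.run(train, 1e-3)
{code}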






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42994) Support sc.resources in Connect

2023-03-31 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-42994:
-

 Summary: Support sc.resources in Connect
 Key: SPARK-42994
 URL: https://issues.apache.org/jira/browse/SPARK-42994
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, ML
Affects Versions: 3.5.0
Reporter: Ruifeng Zheng
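For reference, a minimal sketch of the classic-PySpark entry point this presumably mirrors:

{code:python}
from pyspark.sql import SparkSession

# Classic PySpark: resource information is exposed through the SparkContext.
spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.resources)  # e.g. {'gpu': ResourceInformation(...)} when configured

# A Spark Connect session has no client-side SparkContext, so an equivalent
# way to query resources would need to be exposed for Connect.
{code}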






--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42978) Derby&PG: RENAME cannot qualify a new-table-Name with a schema-Name.

2023-03-31 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-42978.
--
Fix Version/s: 3.5.0
 Assignee: Kent Yao
   Resolution: Fixed

Issue resolved by [https://github.com/apache/spark/pull/40602]

>  Derby&PG: RENAME cannot qualify a new-table-Name with a schema-Name.
> -
>
> Key: SPARK-42978
> URL: https://issues.apache.org/jira/browse/SPARK-42978
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.5.0
>
>
> https://db.apache.org/derby/docs/10.2/ref/rrefnewtablename.html#rrefnewtablename



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42995) Migrate Spark Connect DataFrame errors into error class

2023-03-31 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-42995:
---

 Summary: Migrate Spark Connect DataFrame errors into error class
 Key: SPARK-42995
 URL: https://issues.apache.org/jira/browse/SPARK-42995
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, PySpark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


We should migrate all the errors into error classes to leverage the PySpark error 
framework.
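A minimal sketch of the general shape of such a migration (the method, error class name, and parameters below are illustrative, not the actual Spark Connect code):

{code:python}
from pyspark.errors import PySparkValueError


# Before: a plain built-in exception with a free-form message.
def drop_before(self, *cols):
    if not cols:
        raise ValueError("cols must not be empty")


# After: an error-framework exception carrying an error class and message
# parameters, so the failure is classified, documented, and testable.
def drop_after(self, *cols):
    if not cols:
        raise PySparkValueError(
            error_class="CANNOT_BE_EMPTY",  # illustrative error class name
            message_parameters={"item": "cols"},
        )
{code}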



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42935) Optimize shuffle for union spark plan

2023-03-31 Thread Jeff Min (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Min updated SPARK-42935:
-
Description: 
Union plan does not take full advantage of its children's output partitioning 
when that partitioning cannot match the parent plan's required distribution. For 
example, table1 and table2 are both bucketed tables with bucket column id and 
100 buckets. We apply a row_number window function after unioning the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
set spark.sql.unionRequiredDistributionPushdown.enabled=true;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}
The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28, id#35, name#36 DESC NULLS LAST
  +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
     +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
        +- Union
           :- FileScan csv spark_catalog.default.table1id#35,name#36
           +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
 

Although the two tables are bucketed by the id column, there is still an exchange 
after the union. The reason is that the union plan's output partitioning is null.

We can introduce a new idea to optimize away the exchange:
 # First, introduce a new RDD that consists of parent RDDs with the same number 
of partitions; its ith partition corresponds to the ith partition of each parent 
RDD.

 # Then, push the required distribution down to the union plan's children. If a 
child's output partitioning matches the required distribution, we can remove the 
shuffle for that child.

After these changes, the physical plan no longer contains an exchange:
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#7, name#8 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#0, id#7, name#8 DESC NULLS LAST
  +- Sort id#7 ASC NULLS FIRST, name#8 DESC NULLS LAST, false, 0
     +- UnionZip ClusteredDistribution(ArrayBuffer(id#7),false,None), 
ClusteredDistribution(ArrayBuffer(id#9),false,None), hashpartitioning(id#7, 200)
        :- FileScan csv spark_catalog.default.table1id#7,name#8
        +- FileScan csv spark_catalog.default.table2id#9,name#10 {code}
 

 

  was:
Union plan does not take full advantage of its children's output partitioning 
when that partitioning cannot match the parent plan's required distribution. For 
example, table1 and table2 are both bucketed tables with bucket column id and 
100 buckets. We apply a row_number window function after unioning the two 
tables.
{code:sql}
create table table1 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table1 values(1, "s1");
insert into table1 values(2, "s2");
​
create table table2 (id int, name string) using csv CLUSTERED BY (id) INTO 100 
BUCKETS;
insert into table2 values(1, "s3");
​
set spark.sql.shuffle.partitions=100;
explain select *, row_number() over(partition by id order by name desc) 
id_row_number from (select * from table1 union all select * from table2);{code}
The physical plan is 
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number() windowspecdefinition(id#35, name#36 DESC NULLS LAST, 
specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS 
id_row_number#28, id#35, name#36 DESC NULLS LAST
  +- Sort id#35 ASC NULLS FIRST, name#36 DESC NULLS LAST, false, 0
     +- Exchange hashpartitioning(id#35, 100), ENSURE_REQUIREMENTS, [plan_id=88]
        +- Union
           :- FileScan csv spark_catalog.default.table1id#35,name#36
           +- FileScan csv spark_catalog.default.table2id#37,name#38 {code}
 

Although the two tables are bucketed by the id column, there is still an exchange 
after the union. The reason is that the union plan's output partitioning is null.

We can introduce a new idea to optimize away the exchange:
 # First, introduce a new RDD that consists of parent RDDs with the same number 
of partitions; its ith partition corresponds to the ith partition of each parent 
RDD.

 # Then, push the required distribution down to the union plan's children. If a 
child's output partitioning matches the required distribution, we can remove the 
shuffle for that child.

After these changes, the physical plan no longer contains an exchange:
{code:bash}
AdaptiveSparkPlan isFinalPlan=false
+- Window row_number()

[jira] [Created] (SPARK-42996) Adding reason for test failure on Spark Connect parity tests.

2023-03-31 Thread Haejoon Lee (Jira)
Haejoon Lee created SPARK-42996:
---

 Summary: Adding reason for test failure on Spark Connect parity 
tests.
 Key: SPARK-42996
 URL: https://issues.apache.org/jira/browse/SPARK-42996
 Project: Spark
  Issue Type: Sub-task
  Components: Connect, Pandas API on Spark
Affects Versions: 3.5.0
Reporter: Haejoon Lee


Add a concrete failure reason to each skipped parity test instead of just "Fails 
in Spark Connect, should enable".



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy

2023-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-42918.
-
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40545
[https://github.com/apache/spark/pull/40545]

> Generalize handling of metadata attributes in FileSourceStrategy
> 
>
> Key: SPARK-42918
> URL: https://issues.apache.org/jira/browse/SPARK-42918
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.1
>Reporter: Johan Lasperas
>Priority: Minor
> Fix For: 3.5.0
>
>
> A first step towards allowing file format implementations to inject custom 
> metadata fields into plans is to make the handling of metadata attributes in 
> `FileSourceStrategy` more generic.
> Today in `FileSourceStrategy` , the lists of constant and generated metadata 
> fields are created manually, checking for known generated fields on one hand 
> and considering the remaining fields as constant metadata fields. We need 
> instead to introduce a way of declaring metadata fields as generated or 
> constant directly in `FileFormat` and propagate that information to 
> `FileSourceStrategy`.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42918) Generalize handling of metadata attributes in FileSourceStrategy

2023-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-42918:
---

Assignee: Johan Lasperas

> Generalize handling of metadata attributes in FileSourceStrategy
> 
>
> Key: SPARK-42918
> URL: https://issues.apache.org/jira/browse/SPARK-42918
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.4.1
>Reporter: Johan Lasperas
>Assignee: Johan Lasperas
>Priority: Minor
> Fix For: 3.5.0
>
>
> A first step towards allowing file format implementations to inject custom 
> metadata fields into plans is to make the handling of metadata attributes in 
> `FileSourceStrategy` more generic.
> Today in `FileSourceStrategy` , the lists of constant and generated metadata 
> fields are created manually, checking for known generated fields on one hand 
> and considering the remaining fields as constant metadata fields. We need 
> instead to introduce a way of declaring metadata fields as generated or 
> constant directly in `FileFormat` and propagate that information to 
> `FileSourceStrategy`.
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42977) spark sql Disable vectorized failed

2023-03-31 Thread Jacek Laskowski (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707256#comment-17707256
 ] 

Jacek Laskowski commented on SPARK-42977:
-

Unless you can reproduce it without Iceberg, it's probably an Iceberg issue and 
should be reported in https://github.com/apache/iceberg/issues.

> spark sql Disable vectorized failed
> ---
>
> Key: SPARK-42977
> URL: https://issues.apache.org/jira/browse/SPARK-42977
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: liu
>Priority: Major
> Fix For: 3.3.2
>
>
> spark-sql config
> {code:java}
> ./spark-sql --packages 
> org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.2.0\
>     --conf   
> spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
>  \
>     --conf 
> spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
>     --conf spark.sql.catalog.spark_catalog.type=hive \
>     --conf spark.sql.iceberg.handle-timestamp-without-timezone=true \
>     --conf spark.sql.parquet.binaryAsString=true \
>     --conf spark.sql.parquet.enableVectorizedReader=false \
>     --conf spark.sql.parquet.enableNestedColumnVectorizedReader=true \
>     --conf spark.sql.parquet.recordLevelFilter=true  {code}
>  
> Now that I have configured spark.sql.parquet.enableVectorizedReader=false, 
> when I query an Iceberg Parquet table the following error occurs:
>  
>    
> {code:java}
> at scala.collection.AbstractIterable.foreach(Iterable.scala:56)     at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processLine(SparkSQLCLIDriver.scala:498)
>      at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:286)
>      at 
> org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
>      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)     at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
>     at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>      at java.lang.reflect.Method.invoke(Method.java:498)     at 
> org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)  
>    at 
> org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:958)
>      at 
> org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)     at 
> org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)     at 
> org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)     at 
> org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1046)  
>    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1055)     
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) Caused by: 
> java.lang.UnsupportedOperationException: Cannot support vectorized reads for 
> column [hzxm] optional binary hzxm = 8 with encoding DELTA_BYTE_ARRAY. 
> Disable vectorized reads to read this table/file     at 
> org.apache.iceberg.arrow.vectorized.parquet.VectorizedPageIterator.initDataReader(VectorizedPageIterator.java:100)
>      at 
> org.apache.iceberg.parquet.BasePageIterator.initFromPage(BasePageIterator.java:140)
>      at 
> org.apache.iceberg.parquet.BasePageIterator$1.visit(BasePageIterator.java:105)
>      at 
> org.apache.iceberg.parquet.BasePageIterator$1.visit(BasePageIterator.java:96) 
>     at 
> org.apache.iceberg.shaded.org.apache.parquet.column.page.DataPageV2.accept(DataPageV2.java:192)
>      at 
> org.apache.iceberg.parquet.BasePageIterator.setPage(BasePageIterator.java:95) 
>     at 
> org.apache.iceberg.parquet.BaseColumnIterator.advance(BaseColumnIterator.java:61)
>      at 
> org.apache.iceberg.parquet.BaseColumnIterator.setPageSource(BaseColumnIterator.java:50)
>      at 
> org.apache.iceberg.arrow.vectorized.parquet.VectorizedColumnIterator.setRowGroupInfo(Vec
>  {code}
>  
>  
> *{color:#FF}Caused by: java.lang.UnsupportedOperationException: Cannot 
> support vectorized reads for column [hzxm] optional binary hzxm = 8 with 
> encoding DELTA_BYTE_ARRAY. Disable vectorized reads to read this 
> table/file{color}*
>  
>  
> Now it seems that this parameter has not taken effect. How can I turn off 
> vectorized reads so that I can successfully query the table?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42663) Fix `default_session` to work properly

2023-03-31 Thread Wencong Liu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707279#comment-17707279
 ] 

Wencong Liu commented on SPARK-42663:
-

Hello [~itholic], I am a beginner in Spark and very interested in this ticket. I 
understand that the key point of this issue is to make Spark remember the 
previously set config values. Does this problem only apply to 
"default_index_type", or is it just an example? If you could point me to the 
specific code path, I would be happy to take on this issue. :)

> Fix `default_session` to work properly
> --
>
> Key: SPARK-42663
> URL: https://issues.apache.org/jira/browse/SPARK-42663
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, Pandas API on Spark
>Affects Versions: 3.5.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Currently, default_session is not working properly in Spark Connect as below:
> {code:java}
> >>> spark = default_session()
> >>> spark.conf.set("default_index_type", "sequence")
> >>> spark.conf.get("default_index_type")
> 'sequence'
> >>>
> >>> spark = default_session()
> >>> spark.conf.get("default_index_type")
> Traceback (most recent call last):
> ...
> pyspark.errors.exceptions.connect.SparkConnectGrpcException: 
> (java.util.NoSuchElementException) default_index_type
> {code}
> It should work as expected in regular PySpark as below:
> {code:java}
> >>> spark = default_session()
> >>> spark.conf.set("default_index_type", "sequence")
> >>> spark.conf.get("default_index_type")
> 'sequence'
> >>>
> >>> spark = default_session()
> >>> spark.conf.get("default_index_type")
> 'sequence'{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42973) Upgrade buf to v1.16.0

2023-03-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-42973:
-

Assignee: BingKun Pan

> Upgrade buf to v1.16.0
> --
>
> Key: SPARK-42973
> URL: https://issues.apache.org/jira/browse/SPARK-42973
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Connect
>Affects Versions: 3.4.1
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42973) Upgrade buf to v1.16.0

2023-03-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-42973:
--
Affects Version/s: 3.5.0
   (was: 3.4.1)

> Upgrade buf to v1.16.0
> --
>
> Key: SPARK-42973
> URL: https://issues.apache.org/jira/browse/SPARK-42973
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Connect
>Affects Versions: 3.5.0
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42973) Upgrade buf to v1.16.0

2023-03-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-42973.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 40596
[https://github.com/apache/spark/pull/40596]

> Upgrade buf to v1.16.0
> --
>
> Key: SPARK-42973
> URL: https://issues.apache.org/jira/browse/SPARK-42973
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Connect
>Affects Versions: 3.4.1
>Reporter: BingKun Pan
>Assignee: BingKun Pan
>Priority: Minor
> Fix For: 3.5.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42997) TableOutputResolver must use correct column paths in error messages for arrays and maps

2023-03-31 Thread Anton Okolnychyi (Jira)
Anton Okolnychyi created SPARK-42997:


 Summary: TableOutputResolver must use correct column paths in 
error messages for arrays and maps
 Key: SPARK-42997
 URL: https://issues.apache.org/jira/browse/SPARK-42997
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.2, 3.3.1, 3.3.0, 3.3.3, 3.4.0, 3.4.1, 3.5.0
Reporter: Anton Okolnychyi


TableOutputResolver must use correct column paths in error messages for arrays 
and maps.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42860) Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode

2023-03-31 Thread xiaochen zhou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707402#comment-17707402
 ] 

xiaochen zhou commented on SPARK-42860:
---

https://github.com/apache/spark/pull/40626

> Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode
> ---
>
> Key: SPARK-42860
> URL: https://issues.apache.org/jira/browse/SPARK-42860
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: arindam patra
>Priority: Blocker
>
> We have a service that submits Spark SQL jobs to a Spark cluster.
> We want to validate the SQL query before submitting the job. We are currently 
> using df.explain(extended=true), which generates the parsed, analysed, and 
> optimised logical plans as well as the physical plan.
> But generating the optimised logical plan can take a long time; for example, 
> if a filter is applied on a partitioned column, Spark will list all 
> directories to find the required ones.
> For query validation this is unnecessary, so it would be great to have an 
> explain mode that prints only the parsed and analysed logical plans.
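For reference, a minimal sketch of the explain modes PySpark exposes today; none of them stops at the analysed plan, which is the gap this ticket describes (the session and DataFrame are illustrative):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(10).filter("id % 2 = 0")

df.explain(mode="simple")     # physical plan only
df.explain(mode="extended")   # parsed + analysed + optimised logical plans + physical plan
df.explain(mode="cost")       # logical plan with statistics, plus physical plan
df.explain(mode="formatted")  # physical plan outline with node details
# The proposal is an additional mode that stops after analysis, avoiding the
# optimisation cost (e.g. partition listing) during query validation.
{code}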



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42998) Fix DataFrame.collect with null struct.

2023-03-31 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-42998:
-

 Summary: Fix DataFrame.collect with null struct.
 Key: SPARK-42998
 URL: https://issues.apache.org/jira/browse/SPARK-42998
 Project: Spark
  Issue Type: Sub-task
  Components: Connect
Affects Versions: 3.4.0
Reporter: Takuya Ueshin


In Spark Connect:

{code:python}
>>> df = spark.sql("values (1, struct('a' as x)), (null, null) as t(a, b)")
>>> df.show()
+----+----+
|   a|   b|
+----+----+
|   1| {a}|
|null|null|
+----+----+

>>> df.collect()
[Row(a=1, b=Row(x='a')), Row(a=None, b=)]
{code}

whereas PySpark:

{code:python}
>>> df.collect()
[Row(a=1, b=Row(x='a')), Row(a=None, b=None)]
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42999) Impl Dataset#foreach, foreachPartitions

2023-03-31 Thread Zhen Li (Jira)
Zhen Li created SPARK-42999:
---

 Summary: Impl Dataset#foreach, foreachPartitions
 Key: SPARK-42999
 URL: https://issues.apache.org/jira/browse/SPARK-42999
 Project: Spark
  Issue Type: Improvement
  Components: Connect
Affects Versions: 3.5.0
Reporter: Zhen Li


Implement the missing methods in the Scala client Dataset API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-42860) Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode

2023-03-31 Thread xiaochen zhou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707402#comment-17707402
 ] 

xiaochen zhou edited comment on SPARK-42860 at 4/1/23 1:16 AM:
---

[https://github.com/apache/spark/pull/40631]


was (Author: zxcoccer):
https://github.com/apache/spark/pull/40626

> Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode
> ---
>
> Key: SPARK-42860
> URL: https://issues.apache.org/jira/browse/SPARK-42860
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: arindam patra
>Priority: Blocker
>
> We have a service that submits Spark SQL jobs to a Spark cluster.
> We want to validate the SQL query before submitting the job. We are currently 
> using df.explain(extended=true), which generates the parsed, analysed, and 
> optimised logical plans as well as the physical plan.
> But generating the optimised logical plan can take a long time; for example, 
> if a filter is applied on a partitioned column, Spark will list all 
> directories to find the required ones.
> For query validation this is unnecessary, so it would be great to have an 
> explain mode that prints only the parsed and analysed logical plans.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-42860) Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode

2023-03-31 Thread xiaochen zhou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707402#comment-17707402
 ] 

xiaochen zhou edited comment on SPARK-42860 at 4/1/23 1:19 AM:
---

Hi, I would like to deal with it. Can you assign this ticket to me?

[https://github.com/apache/spark/pull/40631]


was (Author: zxcoccer):
[https://github.com/apache/spark/pull/40631]

> Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode
> ---
>
> Key: SPARK-42860
> URL: https://issues.apache.org/jira/browse/SPARK-42860
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: arindam patra
>Priority: Blocker
>
> We have a service that submits Spark SQL jobs to a Spark cluster.
> We want to validate the SQL query before submitting the job. We are currently 
> using df.explain(extended=true), which generates the parsed, analysed, and 
> optimised logical plans as well as the physical plan.
> But generating the optimised logical plan can take a long time; for example, 
> if a filter is applied on a partitioned column, Spark will list all 
> directories to find the required ones.
> For query validation this is unnecessary, so it would be great to have an 
> explain mode that prints only the parsed and analysed logical plans.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-42860) Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode

2023-03-31 Thread xiaochen zhou (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707402#comment-17707402
 ] 

xiaochen zhou edited comment on SPARK-42860 at 4/1/23 1:20 AM:
---

[https://github.com/apache/spark/pull/40631]


was (Author: zxcoccer):
Hi, I would like to deal with it. Can you assign this ticket to me?

[https://github.com/apache/spark/pull/40631]

> Add analysed logical mode in org.apache.spark.sql.execution.ExplainMode
> ---
>
> Key: SPARK-42860
> URL: https://issues.apache.org/jira/browse/SPARK-42860
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: arindam patra
>Priority: Blocker
>
> We have a service that submits Spark SQL jobs to a Spark cluster.
> We want to validate the SQL query before submitting the job. We are currently 
> using df.explain(extended=true), which generates the parsed, analysed, and 
> optimised logical plans as well as the physical plan.
> But generating the optimised logical plan can take a long time; for example, 
> if a filter is applied on a partitioned column, Spark will list all 
> directories to find the required ones.
> For query validation this is unnecessary, so it would be great to have an 
> explain mode that prints only the parsed and analysed logical plans.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-42998) Fix DataFrame.collect with null struct.

2023-03-31 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-42998.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 40627
[https://github.com/apache/spark/pull/40627]

> Fix DataFrame.collect with null struct.
> ---
>
> Key: SPARK-42998
> URL: https://issues.apache.org/jira/browse/SPARK-42998
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.4.0
>
>
> In Spark Connect:
> {code:python}
> >>> df = spark.sql("values (1, struct('a' as x)), (null, null) as t(a, b)")
> >>> df.show()
> +----+----+
> |   a|   b|
> +----+----+
> |   1| {a}|
> |null|null|
> +----+----+
> >>> df.collect()
> [Row(a=1, b=Row(x='a')), Row(a=None, b=)]
> {code}
> whereas PySpark:
> {code:python}
> >>> df.collect()
> [Row(a=1, b=Row(x='a')), Row(a=None, b=None)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-42998) Fix DataFrame.collect with null struct.

2023-03-31 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-42998:
-

Assignee: Takuya Ueshin

> Fix DataFrame.collect with null struct.
> ---
>
> Key: SPARK-42998
> URL: https://issues.apache.org/jira/browse/SPARK-42998
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>
> In Spark Connect:
> {code:python}
> >>> df = spark.sql("values (1, struct('a' as x)), (null, null) as t(a, b)")
> >>> df.show()
> +----+----+
> |   a|   b|
> +----+----+
> |   1| {a}|
> |null|null|
> +----+----+
> >>> df.collect()
> [Row(a=1, b=Row(x='a')), Row(a=None, b=)]
> {code}
> whereas PySpark:
> {code:python}
> >>> df.collect()
> [Row(a=1, b=Row(x='a')), Row(a=None, b=None)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42993) Make Torch Distributor compatible with Spark Connect

2023-03-31 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng updated SPARK-42993:
--
Summary: Make Torch Distributor compatible with Spark Connect  (was: Make 
Torch Distributor support Spark Connect)

> Make Torch Distributor compatible with Spark Connect
> 
>
> Key: SPARK-42993
> URL: https://issues.apache.org/jira/browse/SPARK-42993
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42993) Make Torch Distributor compatible with Spark Connect

2023-03-31 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707493#comment-17707493
 ] 

Snoot.io commented on SPARK-42993:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/40607

> Make Torch Distributor compatible with Spark Connect
> 
>
> Key: SPARK-42993
> URL: https://issues.apache.org/jira/browse/SPARK-42993
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, ML
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41628) Support async query execution

2023-03-31 Thread Jia Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707501#comment-17707501
 ] 

Jia Fan commented on SPARK-41628:
-

I'm working on it

> Support async query execution
> -
>
> Key: SPARK-41628
> URL: https://issues.apache.org/jira/browse/SPARK-41628
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Martin Grund
>Priority: Major
>
> Today the query execution is completely synchronous; add an additional 
> asynchronous API that allows clients to submit a query and poll for the result.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-37829) An outer-join using joinWith on DataFrames returns Rows with null fields instead of null values

2023-03-31 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-37829?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17707502#comment-17707502
 ] 

Clément de Groc commented on SPARK-37829:
-

I'm not planning to resume. I don't know that part of the codebase well enough 
to submit a better fix than the one I already submitted in my PR.

> An outer-join using joinWith on DataFrames returns Rows with null fields 
> instead of null values
> ---
>
> Key: SPARK-37829
> URL: https://issues.apache.org/jira/browse/SPARK-37829
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.0.3, 3.1.0, 3.1.1, 3.1.2, 3.2.0
>Reporter: Clément de Groc
>Priority: Major
>
> Doing an outer-join using {{joinWith}} on {{{}DataFrame{}}}s used to return 
> missing values as {{null}} in Spark 2.4.8, but returns them as {{Rows}} with 
> {{null}} values in Spark 3+.
> The issue can be reproduced with [the following 
> test|https://github.com/cdegroc/spark/commit/79f4d6a1ec6c69b10b72dbc8f92ab6490d5ef5e5]
>  that succeeds on Spark 2.4.8 but fails starting from Spark 3.0.0.
> The problem only arises when working with DataFrames: Datasets of case 
> classes work as expected as demonstrated by [this other 
> test|https://github.com/apache/spark/blob/v3.0.0/sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala#L1200-L1223].
> I couldn't find an explanation for this change in the Migration guide so I'm 
> assuming this is a bug.
> A {{git bisect}} pointed me to [that 
> commit|https://github.com/apache/spark/commit/cd92f25be5a221e0d4618925f7bc9dfd3bb8cb59].
> Reverting the commit solves the problem.
> A similar solution,  but without reverting, is shown 
> [here|https://github.com/cdegroc/spark/commit/684c675bf070876a475a9b225f6c2f92edce4c8a].
> Happy to help if you think of another approach / can provide some guidance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org