[jira] [Updated] (SPARK-44348) Reenable Session-based artifact test cases

2023-07-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-44348:
-
Summary: Reenable Session-based artifact test cases  (was: Reeanble 
Session-based artifact test cases)

> Reenable Session-based artifact test cases
> --
>
> Key: SPARK-44348
> URL: https://issues.apache.org/jira/browse/SPARK-44348
> Project: Spark
>  Issue Type: Task
>  Components: PySpark, Tests
>Affects Versions: 3.5.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Several tests in https://github.com/apache/spark/pull/41495 were skipped.
> They should be investigated and re-enabled.






[jira] [Created] (SPARK-44349) Add math functions to SparkR

2023-07-09 Thread Ruifeng Zheng (Jira)
Ruifeng Zheng created SPARK-44349:
-

 Summary: Add math functions to SparkR
 Key: SPARK-44349
 URL: https://issues.apache.org/jira/browse/SPARK-44349
 Project: Spark
  Issue Type: Sub-task
  Components: R
Affects Versions: 3.5.0
Reporter: Ruifeng Zheng









[jira] [Created] (SPARK-44348) Reeanble Session-based artifact test cases

2023-07-09 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-44348:


 Summary: Reeanble Session-based artifact test cases
 Key: SPARK-44348
 URL: https://issues.apache.org/jira/browse/SPARK-44348
 Project: Spark
  Issue Type: Task
  Components: PySpark, Tests
Affects Versions: 3.5.0
Reporter: Hyukjin Kwon


Several tests in https://github.com/apache/spark/pull/41495 were skipped.
They should be investigated and re-enabled.






[jira] [Commented] (SPARK-44324) Move CaseInsensitiveMap to sql/api

2023-07-09 Thread Snoot.io (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741453#comment-17741453
 ] 

Snoot.io commented on SPARK-44324:
--

User 'amaliujia' has created a pull request for this issue:
https://github.com/apache/spark/pull/41882

> Move CaseInsensitiveMap to sql/api
> --
>
> Key: SPARK-44324
> URL: https://issues.apache.org/jira/browse/SPARK-44324
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, SQL
>Affects Versions: 3.5.0
>Reporter: Rui Wang
>Assignee: Rui Wang
>Priority: Major
>







[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation

2023-07-09 Thread Sandish Kumar HN (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741449#comment-17741449
 ] 

Sandish Kumar HN commented on SPARK-24815:
--

[~pavan0831] I'm excited to see what you come up with. You can create a pull 
request with the design details, or you can attach a Google design doc to the 
pull request. Either way, I'm looking forward to reviewing your work.

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing
> containers to match the actual workload. On multi-tenant clusters, it ensures
> that a Spark job takes no more resources than necessary. In cloud
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured
> streaming job, the batch dynamic allocation algorithm kicks in. It requests
> more executors if the task backlog reaches a certain size, and removes
> executors if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a
> particular implementation in SparkContext.scala (this should be a separate
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the
> batch algorithm. Eventually, continuous processing might need its own
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when
> Core's dynamic allocation is enabled.
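For context, a minimal sketch of the configuration the description refers to. All setting names below are stock Spark dynamic allocation options; the streaming-aware behaviour proposed in this ticket does not exist yet, so this is the batch setup a streaming job inherits today:

{code:python}
from pyspark.sql import SparkSession

# Stock (batch-oriented) dynamic allocation settings. A structured streaming
# job launched with this configuration falls through to the batch algorithm
# described above: scale up on task backlog, scale down on executor idleness.
spark = (
    SparkSession.builder
    .appName("streaming-with-batch-dra")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # Executors idle longer than this are released, regardless of whether
    # the next micro-batch is about to start.
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)
{code}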






[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation

2023-07-09 Thread Pavan Kotikalapudi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741448#comment-17741448
 ] 

Pavan Kotikalapudi commented on SPARK-24815:


Hi, I added some features to the traditional DRA to make it work for the
structured streaming use case, based on heuristics around the trigger interval.
It has been working as expected for the last few months at my company. I would
like to contribute this back to the community. Should I send a pull request
directly for review, or a document explaining how the DRA is modified for the
structured streaming use case?

> Structured Streaming should support dynamic allocation
> --
>
> Key: SPARK-24815
> URL: https://issues.apache.org/jira/browse/SPARK-24815
> Project: Spark
>  Issue Type: Improvement
>  Components: Scheduler, Spark Core, Structured Streaming
>Affects Versions: 2.3.1
>Reporter: Karthik Palaniappan
>Priority: Minor
>
> For batch jobs, dynamic allocation is very useful for adding and removing
> containers to match the actual workload. On multi-tenant clusters, it ensures
> that a Spark job takes no more resources than necessary. In cloud
> environments, it enables autoscaling.
> However, if you set spark.dynamicAllocation.enabled=true and run a structured
> streaming job, the batch dynamic allocation algorithm kicks in. It requests
> more executors if the task backlog reaches a certain size, and removes
> executors if they idle for a certain period of time.
> Quick thoughts:
> 1) Dynamic allocation should be pluggable, rather than hardcoded to a
> particular implementation in SparkContext.scala (this should be a separate
> JIRA).
> 2) We should make a structured streaming algorithm that's separate from the
> batch algorithm. Eventually, continuous processing might need its own
> algorithm.
> 3) Spark should print a warning if you run a structured streaming job when
> Core's dynamic allocation is enabled.






[jira] [Resolved] (SPARK-41651) Test parity: pyspark.sql.tests.test_dataframe

2023-07-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41651.
--
Resolution: Done

> Test parity: pyspark.sql.tests.test_dataframe
> -
>
> Key: SPARK-41651
> URL: https://issues.apache.org/jira/browse/SPARK-41651
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Sandeep Singh
>Priority: Major
>
> After https://github.com/apache/spark/pull/39041 (SPARK-41528), we now reuse
> the same test cases; see
> {{python/pyspark/sql/tests/connect/test_parity_dataframe.py}}.
> We should remove all the test cases defined there and fix Spark Connect
> behaviours accordingly.
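For context, the parity suite the description points to follows the pattern sketched below (modeled on the test_parity_dataframe.py layout in the Spark source tree; the overridden test name is illustrative):

{code:python}
import unittest

# The Connect parity suite inherits the stock DataFrame tests and overrides
# only the cases that do not yet pass under Spark Connect. Removing one of
# these skip-overrides is the "remove the test cases defined there" step.
from pyspark.sql.tests.test_dataframe import DataFrameTestsMixin
from pyspark.testing.connectutils import ReusedConnectTestCase


class DataFrameParityTests(DataFrameTestsMixin, ReusedConnectTestCase):
    @unittest.skip("Fails in Spark Connect, should enable.")
    def test_toDF_with_schema_string(self):  # illustrative override
        super().test_toDF_with_schema_string()
{code}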






[jira] [Resolved] (SPARK-42018) Test parity: pyspark.sql.tests.test_types

2023-07-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42018.
--
Resolution: Done

> Test parity: pyspark.sql.tests.test_types
> -
>
> Key: SPARK-42018
> URL: https://issues.apache.org/jira/browse/SPARK-42018
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect, Tests
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See https://issues.apache.org/jira/browse/SPARK-41652 and 
> https://issues.apache.org/jira/browse/SPARK-41651






[jira] [Resolved] (SPARK-41652) Test parity: pyspark.sql.tests.test_functions

2023-07-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41652.
--
Resolution: Done

> Test parity: pyspark.sql.tests.test_functions
> -
>
> Key: SPARK-41652
> URL: https://issues.apache.org/jira/browse/SPARK-41652
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Sandeep Singh
>Priority: Major
>
> After https://github.com/apache/spark/pull/39041 (SPARK-41528), we now reuse
> the same test cases; see
> {{python/pyspark/sql/tests/connect/test_parity_functions.py}}.
> We should remove all the test cases defined there and fix Spark Connect
> behaviours accordingly.






[jira] [Resolved] (SPARK-41997) Test parity: pyspark.sql.tests.test_readwriter

2023-07-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-41997.
--
Resolution: Done

> Test parity: pyspark.sql.tests.test_readwriter
> --
>
> Key: SPARK-41997
> URL: https://issues.apache.org/jira/browse/SPARK-41997
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> See https://issues.apache.org/jira/browse/SPARK-41652 and 
> https://issues.apache.org/jira/browse/SPARK-41651






[jira] [Resolved] (SPARK-42006) Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and test_column

2023-07-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-42006.
--
Resolution: Done

> Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and 
> test_column
> ---
>
> Key: SPARK-42006
> URL: https://issues.apache.org/jira/browse/SPARK-42006
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> See https://issues.apache.org/jira/browse/SPARK-41652 and 
> https://issues.apache.org/jira/browse/SPARK-41651






[jira] [Assigned] (SPARK-42006) Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and test_column

2023-07-09 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-42006:


Assignee: Hyukjin Kwon

> Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and 
> test_column
> ---
>
> Key: SPARK-42006
> URL: https://issues.apache.org/jira/browse/SPARK-42006
> Project: Spark
>  Issue Type: Umbrella
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> See https://issues.apache.org/jira/browse/SPARK-41652 and 
> https://issues.apache.org/jira/browse/SPARK-41651






[jira] [Resolved] (SPARK-44342) Replace SQLContext with SparkSession for GenTPCDSData

2023-07-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-44342.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41900
[https://github.com/apache/spark/pull/41900]

> Replace SQLContext with SparkSession for GenTPCDSData
> -
>
> Key: SPARK-44342
> URL: https://issues.apache.org/jira/browse/SPARK-44342
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.5.0
>
>
> SQLContext is a legacy API for Spark SQL, but GenTPCDSData still uses it
> directly.
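As an illustration of the API change (GenTPCDSData itself is Scala, so this Python analogue is only a sketch of the migration, not the actual patch):

{code:python}
from pyspark import SparkContext
from pyspark.sql import SQLContext, SparkSession

sc = SparkContext.getOrCreate()

# Legacy entry point, superseded since Spark 2.0 but still present.
sqlContext = SQLContext(sc)
df_old = sqlContext.range(3)

# Current entry point: SparkSession unifies the old contexts.
spark = SparkSession.builder.getOrCreate()
df_new = spark.range(3)
{code}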






[jira] [Assigned] (SPARK-44342) Replace SQLContext with SparkSession for GenTPCDSData

2023-07-09 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-44342:


Assignee: jiaan.geng

> Replace SQLContext with SparkSession for GenTPCDSData
> -
>
> Key: SPARK-44342
> URL: https://issues.apache.org/jira/browse/SPARK-44342
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.5.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> SQLContext is a legacy API for Spark SQL, but GenTPCDSData still uses it
> directly.






[jira] [Resolved] (SPARK-44329) Add hll_sketch_agg, hll_union_agg, to_varchar, try_aes_decrypt to Scala and Python

2023-07-09 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44329.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41907
[https://github.com/apache/spark/pull/41907]

> Add hll_sketch_agg, hll_union_agg, to_varchar, try_aes_decrypt to Scala and 
> Python
> --
>
> Key: SPARK-44329
> URL: https://issues.apache.org/jira/browse/SPARK-44329
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: BingKun Pan
>Priority: Major
> Fix For: 3.5.0
>
>
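As a usage sketch of the new API (hedged: assumes the PySpark 3.5 functions this ticket tracks; hll_sketch_estimate is the companion function used to read the sketch back):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Approximate distinct counting: aggregate values into an HLL sketch,
# then estimate the cardinality from the sketch.
df = spark.range(100).select((F.col("id") % 10).alias("v"))
sketch = df.agg(F.hll_sketch_agg("v").alias("sketch"))
sketch.select(F.hll_sketch_estimate("sketch").alias("approx_distinct")).show()
{code}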







[jira] [Assigned] (SPARK-44329) Add hll_sketch_agg, hll_union_agg, to_varchar, try_aes_decrypt to Scala and Python

2023-07-09 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44329:
-

Assignee: BingKun Pan

> Add hll_sketch_agg, hll_union_agg, to_varchar, try_aes_decrypt to Scala and 
> Python
> --
>
> Key: SPARK-44329
> URL: https://issues.apache.org/jira/browse/SPARK-44329
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark, SQL
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: BingKun Pan
>Priority: Major
>







[jira] [Created] (SPARK-44347) Upgrade janino to 3.1.10

2023-07-09 Thread BingKun Pan (Jira)
BingKun Pan created SPARK-44347:
---

 Summary: Upgrade janino to 3.1.10
 Key: SPARK-44347
 URL: https://issues.apache.org/jira/browse/SPARK-44347
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.5.0
Reporter: BingKun Pan









[jira] [Assigned] (SPARK-44331) Add bitmap functions to Scala and Python

2023-07-09 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng reassigned SPARK-44331:
-

Assignee: BingKun Pan

> Add bitmap functions to Scala and Python
> 
>
> Key: SPARK-44331
> URL: https://issues.apache.org/jira/browse/SPARK-44331
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: BingKun Pan
>Priority: Major
>
> * bitmap_bucket_number
> * bitmap_bit_position
> * bitmap_construct_agg
> * bitmap_count
> * bitmap_or_agg
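A brief sketch of how these compose (hedged: assumes the PySpark 3.5 API this ticket tracks):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Distinct counting with bitmaps: assign each value to a bucket, set its
# bit position within the bucket's bitmap, then count set bits per bucket.
df = spark.createDataFrame([(1,), (2,), (3,), (3,)], ["v"])
buckets = df.groupBy(F.bitmap_bucket_number("v").alias("bucket")).agg(
    F.bitmap_construct_agg(F.bitmap_bit_position("v")).alias("bitmap")
)
buckets.select("bucket", F.bitmap_count("bitmap").alias("n_distinct")).show()
{code}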






[jira] [Resolved] (SPARK-44331) Add bitmap functions to Scala and Python

2023-07-09 Thread Ruifeng Zheng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruifeng Zheng resolved SPARK-44331.
---
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41902
[https://github.com/apache/spark/pull/41902]

> Add bitmap functions to Scala and Python
> 
>
> Key: SPARK-44331
> URL: https://issues.apache.org/jira/browse/SPARK-44331
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect, PySpark
>Affects Versions: 3.5.0
>Reporter: Ruifeng Zheng
>Assignee: BingKun Pan
>Priority: Major
> Fix For: 3.5.0
>
>
> * bitmap_bucket_number
> * bitmap_bit_position
> * bitmap_construct_agg
> * bitmap_count
> * bitmap_or_agg






[jira] [Commented] (SPARK-44323) Scala None shows up as null for Aggregator BUF or OUT

2023-07-09 Thread koert kuipers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741400#comment-17741400
 ] 

koert kuipers commented on SPARK-44323:
---

Not sure why the pull request isn't getting linked automatically, but it's here:

https://github.com/apache/spark/pull/41903

> Scala None shows up as null for Aggregator BUF or OUT  
> ---
>
> Key: SPARK-44323
> URL: https://issues.apache.org/jira/browse/SPARK-44323
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: koert kuipers
>Priority: Major
>
> When upgrading from Spark 3.3.1 to Spark 3.4.1, we suddenly started getting
> null pointer exceptions in Aggregators (classes extending
> org.apache.spark.sql.expressions.Aggregator) that use Scala Option for BUF
> and/or OUT. Basically, None now shows up as null.
> After adding a simple test case and doing a binary search over commits, we
> identified SPARK-37829 as the cause.
> We first observed the issue as an NPE inside Aggregator.merge because None
> was null. I am having a hard time replicating that in a Spark unit test, but
> I did manage to get a None to become null in the output. A simple test that
> now fails:
>  
> {code:java}
> diff --git a/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
> index e9daa825dd4..a1959d7065d 100644
> --- a/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
> +++ b/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala
> @@ -228,6 +228,16 @@ case class FooAgg(s: Int) extends Aggregator[Row, Int, Int] {
>    def outputEncoder: Encoder[Int] = Encoders.scalaInt
>  }
>  
> +object OptionStringAgg extends Aggregator[Option[String], Option[String], Option[String]] {
> +  override def zero: Option[String] = None
> +  override def reduce(b: Option[String], a: Option[String]): Option[String] = merge(b, a)
> +  override def finish(reduction: Option[String]): Option[String] = reduction
> +  override def merge(b1: Option[String], b2: Option[String]): Option[String] =
> +    b1.map{ b1v => b2.map{ b2v => b1v ++ b2v }.getOrElse(b1v) }.orElse(b2)
> +  override def bufferEncoder: Encoder[Option[String]] = ExpressionEncoder()
> +  override def outputEncoder: Encoder[Option[String]] = ExpressionEncoder()
> +}
> +
>  class DatasetAggregatorSuite extends QueryTest with SharedSparkSession {
>    import testImplicits._
>  
> @@ -432,4 +442,15 @@ class DatasetAggregatorSuite extends QueryTest with SharedSparkSession {
>      val agg = df.select(mode(col("a"))).as[String]
>      checkDataset(agg, "3")
>    }
> +
> +  test("typed aggregation: option string") {
> +    val ds = Seq((1, Some("a")), (1, None), (1, Some("c")), (2, None)).toDS()
> +
> +    checkDataset(
> +      ds.groupByKey(_._1).mapValues(_._2).agg(
> +        OptionStringAgg.toColumn
> +      ),
> +      (1, Some("ac")), (2, None)
> +    )
> +  }
>  }
> {code}






[jira] [Created] (SPARK-44346) Python worker exited unexpectedly - java.io.EOFException on DataInputStream.readInt - cluster doesn't terminate

2023-07-09 Thread Dmitry Goldenberg (Jira)
Dmitry Goldenberg created SPARK-44346:
-

 Summary: Python worker exited unexpectedly - java.io.EOFException 
on DataInputStream.readInt - cluster doesn't terminate
 Key: SPARK-44346
 URL: https://issues.apache.org/jira/browse/SPARK-44346
 Project: Spark
  Issue Type: Question
  Components: Spark Core
Affects Versions: 3.3.2
 Environment: AWS EMR emr-6.11.0
Spark 3.3.2
pandas 1.3.5
pyarrow 12.0.0

"spark.sql.shuffle.partitions": "210",
"spark.default.parallelism": "210",
"spark.yarn.stagingDir": "hdfs:///tmp",
"spark.sql.adaptive.enabled": "true",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.sql.execution.arrow.pyspark.enabled": "true",
"spark.dynamicAllocation.enabled": "false",
"hive.metastore.client.factory.class": 
"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
Reporter: Dmitry Goldenberg


I am getting the exception below as a WARN. Apparently, a worker crashes.

Multiple issues here:
- What is the cause of the crash? Is it something to do with pyarrow, perhaps
some kind of versioning mismatch?
- Error handling in Spark. The error is too low-level to make sense of. Can it
be caught in Spark and dealt with properly?
- The cluster doesn't recover or cleanly terminate. It essentially just hangs,
and EMR doesn't terminate it either.

Stack traces:

```
23/07/05 22:43:47 WARN TaskSetManager: Lost task 1.0 in stage 81.0 (TID 2761) (ip-10-2-250-114.awsinternal.audiomack.com executor 2): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:592)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:574)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:763)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:505)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:955)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    at org.apache.spark.sql.execution.AbstractUnsafeExternalRowSorter.sort(AbstractUnsafeExternalRowSorter.java:50)
    at org.apache.spark.sql.execution.SortExecBase.$anonfun$doExecute$1(SortExec.scala:346)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:138)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:392)
    at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:748)
    ... 26 more

23/07/05 22:43:47 INFO TaskSetManager: Starting task 1.1 in stage 81.0 (TID 2763) (ip-10-2-250-114.awsinternal.audiomack.com, executor 2, partition 1, NODE_LOCAL, 5020 bytes) taskResourceAssignments Map()
23/07/05 23:30:17 INFO TaskSetManager: Finished task 2.0 in stage 81.0 (TID 2762) in 8603522 ms on ip-10-2-250-114.awsinternal.audiomack.com (executor 2) (1/3)
23/07/05 23:39:09 INFO TaskSetManager: Finished task 0.0 in stage 81.0 (TID 2760) in 9135125 ms on ip-10-2-250-114.awsinternal.audiomack.com (executor 2) (2/3)
```

[jira] [Resolved] (SPARK-38477) Use error classes in org.apache.spark.storage

2023-07-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38477.
--
Fix Version/s: 3.5.0
   Resolution: Fixed

Issue resolved by pull request 41575
[https://github.com/apache/spark/pull/41575]

> Use error classes in org.apache.spark.storage
> -
>
> Key: SPARK-38477
> URL: https://issues.apache.org/jira/browse/SPARK-38477
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
> Fix For: 3.5.0
>
>







[jira] [Assigned] (SPARK-38477) Use error classes in org.apache.spark.storage

2023-07-09 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk reassigned SPARK-38477:


Assignee: Bo Zhang

> Use error classes in org.apache.spark.storage
> -
>
> Key: SPARK-38477
> URL: https://issues.apache.org/jira/browse/SPARK-38477
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
>







[jira] [Commented] (SPARK-43389) spark.read.csv throws NullPointerException when lineSep is set to None

2023-07-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741360#comment-17741360
 ] 

ASF GitHub Bot commented on SPARK-43389:


User 'gdhuper' has created a pull request for this issue:
https://github.com/apache/spark/pull/41904

> spark.read.csv throws NullPointerException when lineSep is set to None
> --
>
> Key: SPARK-43389
> URL: https://issues.apache.org/jira/browse/SPARK-43389
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.3.1
>Reporter: Zach Liu
>Priority: Trivial
>
> lineSep was defined as Optional[str], yet I'm unable to explicitly set it to
> None:
> reader = spark.read.format("csv")
> read_options = {'inferSchema': False, 'header': True, 'mode': 'DROPMALFORMED',
> 'sep': '\t', 'escape': '\\', 'multiLine': False, 'lineSep': None}
> for option, option_value in read_options.items():
>     reader = reader.option(option, option_value)
> df = reader.load("s3://")
> This raises an exception:
> py4j.protocol.Py4JJavaError: An error occurred while calling o126.load.
> : java.lang.NullPointerException
>   at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:51)
>   at scala.collection.immutable.StringOps.length(StringOps.scala:51)
>   at scala.collection.IndexedSeqOptimized.isEmpty(IndexedSeqOptimized.scala:30)
>   at scala.collection.IndexedSeqOptimized.isEmpty$(IndexedSeqOptimized.scala:30)
>   at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:33)
>   at scala.collection.TraversableOnce.nonEmpty(TraversableOnce.scala:143)
>   at scala.collection.TraversableOnce.nonEmpty$(TraversableOnce.scala:143)
>   at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:33)
>   at org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:216)
>   at scala.Option.map(Option.scala:230)
>   at org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:215)
>   at org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:47)
>   at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:60)
>   at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:210)
>   at scala.Option.orElse(Option.scala:447)
>   at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
>   at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:411)
>   at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
>   at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
>   at scala.Option.getOrElse(Option.scala:189)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210)
>   at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
>   at py4j.Gateway.invoke(Gateway.java:282)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
>   at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
>   at java.lang.Thread.run(Thread.java:750)


