[jira] [Updated] (SPARK-44348) Reenable Session-based artifact test cases
[ https://issues.apache.org/jira/browse/SPARK-44348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-44348: - Summary: Reenable Session-based artifact test cases (was: Reeanble Session-based artifact test cases) > Reenable Session-based artifact test cases > -- > > Key: SPARK-44348 > URL: https://issues.apache.org/jira/browse/SPARK-44348 > Project: Spark > Issue Type: Task > Components: PySpark, Tests >Affects Versions: 3.5.0 >Reporter: Hyukjin Kwon >Priority: Major > > Several tests in https://github.com/apache/spark/pull/41495 were skipped. > Should be investigated and reenabled. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44349) Add math functions to SparkR
Ruifeng Zheng created SPARK-44349: - Summary: Add math functions to SparkR Key: SPARK-44349 URL: https://issues.apache.org/jira/browse/SPARK-44349 Project: Spark Issue Type: Sub-task Components: R Affects Versions: 3.5.0 Reporter: Ruifeng Zheng
[jira] [Created] (SPARK-44348) Reeanble Session-based artifact test cases
Hyukjin Kwon created SPARK-44348: Summary: Reeanble Session-based artifact test cases Key: SPARK-44348 URL: https://issues.apache.org/jira/browse/SPARK-44348 Project: Spark Issue Type: Task Components: PySpark, Tests Affects Versions: 3.5.0 Reporter: Hyukjin Kwon Several tests in https://github.com/apache/spark/pull/41495 were skipped. Should be investigated and reenabled.
[jira] [Commented] (SPARK-44324) Move CaseInsensitiveMap to sql/api
[ https://issues.apache.org/jira/browse/SPARK-44324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741453#comment-17741453 ] Snoot.io commented on SPARK-44324: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/41882 > Move CaseInsensitiveMap to sql/api > -- > > Key: SPARK-44324 > URL: https://issues.apache.org/jira/browse/SPARK-44324 > Project: Spark > Issue Type: Sub-task > Components: Connect, SQL >Affects Versions: 3.5.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major >
[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741449#comment-17741449 ] Sandish Kumar HN commented on SPARK-24815: -- [~pavan0831] I'm excited to see what you come up with. You can create a pull request with the design details, or you can attach a Google design doc to the pull request. Either way, I'm looking forward to reviewing your work. > Structured Streaming should support dynamic allocation > -- > > Key: SPARK-24815 > URL: https://issues.apache.org/jira/browse/SPARK-24815 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core, Structured Streaming >Affects Versions: 2.3.1 >Reporter: Karthik Palaniappan >Priority: Minor > > For batch jobs, dynamic allocation is very useful for adding and removing > containers to match the actual workload. On multi-tenant clusters, it ensures > that a Spark job is taking no more resources than necessary. In cloud > environments, it enables autoscaling. > However, if you set spark.dynamicAllocation.enabled=true and run a structured > streaming job, the batch dynamic allocation algorithm kicks in. It requests > more executors if the task backlog is a certain size, and removes executors > if they idle for a certain period of time. > Quick thoughts: > 1) Dynamic allocation should be pluggable, rather than hardcoded to a > particular implementation in SparkContext.scala (this should be a separate > JIRA). > 2) We should make a structured streaming algorithm that's separate from the > batch algorithm. Eventually, continuous processing might need its own > algorithm. > 3) Spark should print a warning if you run a structured streaming job when > Core's dynamic allocation is enabled
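The batch heuristic the description refers to (request executors while the task backlog is large, release executors that sit idle past a timeout) can be sketched as a toy function. This is an illustrative approximation only, not Spark's actual ExecutorAllocationManager logic, and all names and thresholds here are made up:

```python
# Toy sketch of a batch-style dynamic-allocation heuristic (illustrative only;
# not Spark's real implementation). All parameter names are hypothetical.

def desired_executors(current, backlog_tasks, idle_seconds_per_executor,
                      backlog_threshold=1, idle_timeout_s=60,
                      min_executors=1, max_executors=10):
    """Return the executor count this toy heuristic would target."""
    if backlog_tasks > backlog_threshold:
        # Tasks are queued: ask for one more executor.
        target = current + 1
    else:
        # No backlog: release executors idle past the timeout.
        idle = sum(1 for s in idle_seconds_per_executor if s > idle_timeout_s)
        target = current - idle
    # Clamp to the configured bounds.
    return max(min_executors, min(max_executors, target))
```

The point of the ticket is that for a streaming job the "backlog" signal resets every micro-batch, so this kind of heuristic thrashes; a streaming-aware policy would look at trigger-interval utilization instead.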
[jira] [Commented] (SPARK-24815) Structured Streaming should support dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-24815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741448#comment-17741448 ] Pavan Kotikalapudi commented on SPARK-24815: Hi, I added some features to traditional DRA to make it work for the structured streaming use case, based on heuristics around the trigger interval. It has been working as expected for the last few months at my company. I would like to contribute back to the community. Should I directly send a pull request for review, or a document explaining how the DRA is modified for the structured streaming use case? > Structured Streaming should support dynamic allocation > -- > > Key: SPARK-24815 > URL: https://issues.apache.org/jira/browse/SPARK-24815 > Project: Spark > Issue Type: Improvement > Components: Scheduler, Spark Core, Structured Streaming >Affects Versions: 2.3.1 >Reporter: Karthik Palaniappan >Priority: Minor > > For batch jobs, dynamic allocation is very useful for adding and removing > containers to match the actual workload. On multi-tenant clusters, it ensures > that a Spark job is taking no more resources than necessary. In cloud > environments, it enables autoscaling. > However, if you set spark.dynamicAllocation.enabled=true and run a structured > streaming job, the batch dynamic allocation algorithm kicks in. It requests > more executors if the task backlog is a certain size, and removes executors > if they idle for a certain period of time. > Quick thoughts: > 1) Dynamic allocation should be pluggable, rather than hardcoded to a > particular implementation in SparkContext.scala (this should be a separate > JIRA). > 2) We should make a structured streaming algorithm that's separate from the > batch algorithm. Eventually, continuous processing might need its own > algorithm. 
> 3) Spark should print a warning if you run a structured streaming job when > Core's dynamic allocation is enabled
[jira] [Resolved] (SPARK-41651) Test parity: pyspark.sql.tests.test_dataframe
[ https://issues.apache.org/jira/browse/SPARK-41651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41651. -- Resolution: Done > Test parity: pyspark.sql.tests.test_dataframe > - > > Key: SPARK-41651 > URL: https://issues.apache.org/jira/browse/SPARK-41651 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > > After https://github.com/apache/spark/pull/39041 (SPARK-41528), we now reuse > the same test cases, see > {{python/pyspark/sql/tests/connect/test_parity_dataframe.py}}. > We should remove all the test cases defined there, and fix Spark Connect > behaviours accordingly.
[jira] [Resolved] (SPARK-42018) Test parity: pyspark.sql.tests.test_types
[ https://issues.apache.org/jira/browse/SPARK-42018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42018. -- Resolution: Done > Test parity: pyspark.sql.tests.test_types > - > > Key: SPARK-42018 > URL: https://issues.apache.org/jira/browse/SPARK-42018 > Project: Spark > Issue Type: Umbrella > Components: Connect, Tests >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://issues.apache.org/jira/browse/SPARK-41652 and > https://issues.apache.org/jira/browse/SPARK-41651
[jira] [Resolved] (SPARK-41652) Test parity: pyspark.sql.tests.test_functions
[ https://issues.apache.org/jira/browse/SPARK-41652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41652. -- Resolution: Done > Test parity: pyspark.sql.tests.test_functions > - > > Key: SPARK-41652 > URL: https://issues.apache.org/jira/browse/SPARK-41652 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Sandeep Singh >Priority: Major > > After https://github.com/apache/spark/pull/39041 (SPARK-41528), we now reuse > the same test cases, see > {{python/pyspark/sql/tests/connect/test_parity_functions.py}}. > We should remove all the test cases defined there, and fix Spark Connect > behaviours accordingly.
[jira] [Resolved] (SPARK-41997) Test parity: pyspark.sql.tests.test_readwriter
[ https://issues.apache.org/jira/browse/SPARK-41997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-41997. -- Resolution: Done > Test parity: pyspark.sql.tests.test_readwriter > -- > > Key: SPARK-41997 > URL: https://issues.apache.org/jira/browse/SPARK-41997 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Priority: Major > > See https://issues.apache.org/jira/browse/SPARK-41652 and > https://issues.apache.org/jira/browse/SPARK-41651
[jira] [Resolved] (SPARK-42006) Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and test_column
[ https://issues.apache.org/jira/browse/SPARK-42006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-42006. -- Resolution: Done > Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and > test_column > --- > > Key: SPARK-42006 > URL: https://issues.apache.org/jira/browse/SPARK-42006 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > See https://issues.apache.org/jira/browse/SPARK-41652 and > https://issues.apache.org/jira/browse/SPARK-41651
[jira] [Assigned] (SPARK-42006) Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and test_column
[ https://issues.apache.org/jira/browse/SPARK-42006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-42006: Assignee: Hyukjin Kwon > Test parity: pyspark.sql.tests.test_group, test_serde, test_datasources and > test_column > --- > > Key: SPARK-42006 > URL: https://issues.apache.org/jira/browse/SPARK-42006 > Project: Spark > Issue Type: Umbrella > Components: Connect >Affects Versions: 3.4.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > See https://issues.apache.org/jira/browse/SPARK-41652 and > https://issues.apache.org/jira/browse/SPARK-41651
[jira] [Resolved] (SPARK-44342) Replace SQLContext with SparkSession for GenTPCDSData
[ https://issues.apache.org/jira/browse/SPARK-44342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-44342. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41900 [https://github.com/apache/spark/pull/41900] > Replace SQLContext with SparkSession for GenTPCDSData > - > > Key: SPARK-44342 > URL: https://issues.apache.org/jira/browse/SPARK-44342 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.5.0 > > > The SQLContext is an old API for Spark SQL. > But GenTPCDSData still uses it directly.
[jira] [Assigned] (SPARK-44342) Replace SQLContext with SparkSession for GenTPCDSData
[ https://issues.apache.org/jira/browse/SPARK-44342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao reassigned SPARK-44342: Assignee: jiaan.geng > Replace SQLContext with SparkSession for GenTPCDSData > - > > Key: SPARK-44342 > URL: https://issues.apache.org/jira/browse/SPARK-44342 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > The SQLContext is an old API for Spark SQL. > But GenTPCDSData still uses it directly.
[jira] [Resolved] (SPARK-44329) Add hll_sketch_agg, hll_union_agg, to_varchar, try_aes_decrypt to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-44329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44329. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41907 [https://github.com/apache/spark/pull/41907] > Add hll_sketch_agg, hll_union_agg, to_varchar, try_aes_decrypt to Scala and > Python > -- > > Key: SPARK-44329 > URL: https://issues.apache.org/jira/browse/SPARK-44329 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: BingKun Pan >Priority: Major > Fix For: 3.5.0 > >
[jira] [Assigned] (SPARK-44329) Add hll_sketch_agg, hll_union_agg, to_varchar, try_aes_decrypt to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-44329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-44329: - Assignee: BingKun Pan > Add hll_sketch_agg, hll_union_agg, to_varchar, try_aes_decrypt to Scala and > Python > -- > > Key: SPARK-44329 > URL: https://issues.apache.org/jira/browse/SPARK-44329 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark, SQL >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: BingKun Pan >Priority: Major >
[jira] [Created] (SPARK-44347) Upgrade janino to 3.1.10
BingKun Pan created SPARK-44347: --- Summary: Upgrade janino to 3.1.10 Key: SPARK-44347 URL: https://issues.apache.org/jira/browse/SPARK-44347 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.5.0 Reporter: BingKun Pan
[jira] [Assigned] (SPARK-44331) Add bitmap functions to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-44331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng reassigned SPARK-44331: - Assignee: BingKun Pan > Add bitmap functions to Scala and Python > > > Key: SPARK-44331 > URL: https://issues.apache.org/jira/browse/SPARK-44331 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: BingKun Pan >Priority: Major > > * bitmap_bucket_number > * bitmap_bit_position > * bitmap_construct_agg > * bitmap_count > * bitmap_or_agg
[jira] [Resolved] (SPARK-44331) Add bitmap functions to Scala and Python
[ https://issues.apache.org/jira/browse/SPARK-44331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ruifeng Zheng resolved SPARK-44331. --- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41902 [https://github.com/apache/spark/pull/41902] > Add bitmap functions to Scala and Python > > > Key: SPARK-44331 > URL: https://issues.apache.org/jira/browse/SPARK-44331 > Project: Spark > Issue Type: Sub-task > Components: Connect, PySpark >Affects Versions: 3.5.0 >Reporter: Ruifeng Zheng >Assignee: BingKun Pan >Priority: Major > Fix For: 3.5.0 > > > * bitmap_bucket_number > * bitmap_bit_position > * bitmap_construct_agg > * bitmap_count > * bitmap_or_agg
[jira] [Commented] (SPARK-44323) Scala None shows up as null for Aggregator BUF or OUT
[ https://issues.apache.org/jira/browse/SPARK-44323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741400#comment-17741400 ] koert kuipers commented on SPARK-44323: --- Not sure why the pull request isn't getting linked automatically, but it's here: https://github.com/apache/spark/pull/41903 > Scala None shows up as null for Aggregator BUF or OUT > --- > > Key: SPARK-44323 > URL: https://issues.apache.org/jira/browse/SPARK-44323 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: koert kuipers >Priority: Major > > when doing an upgrade from spark 3.3.1 to spark 3.4.1 we suddenly started > getting null pointer exceptions in Aggregators (classes extending > org.apache.spark.sql.expressions.Aggregator) that use scala Option for BUF > and/or OUT. basically None is now showing up as null. > after adding a simple test case and doing a binary search on commits we > landed on SPARK-37829 being the cause. > we observed the issue at first with NPE inside Aggregator.merge because None > was null. i am having a hard time replicating that in a spark unit test, but > i did manage to get a None become null in the output. 
simple test that now > fails: > > {code:java} > diff --git > a/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala > b/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala > index e9daa825dd4..a1959d7065d 100644 > --- > a/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala > +++ > b/sql/core/src/test/scala/org/apache/spark/sql/DatasetAggregatorSuite.scala > @@ -228,6 +228,16 @@ case class FooAgg(s: Int) extends Aggregator[Row, Int, > Int] { > def outputEncoder: Encoder[Int] = Encoders.scalaInt > } > > +object OptionStringAgg extends Aggregator[Option[String], Option[String], > Option[String]] { > + override def zero: Option[String] = None > + override def reduce(b: Option[String], a: Option[String]): Option[String] > = merge(b, a) > + override def finish(reduction: Option[String]): Option[String] = reduction > + override def merge(b1: Option[String], b2: Option[String]): Option[String] > = > + b1.map{ b1v => b2.map{ b2v => b1v ++ b2v }.getOrElse(b1v) }.orElse(b2) > + override def bufferEncoder: Encoder[Option[String]] = ExpressionEncoder() > + override def outputEncoder: Encoder[Option[String]] = ExpressionEncoder() > +} > + > class DatasetAggregatorSuite extends QueryTest with SharedSparkSession { > import testImplicits._ > > @@ -432,4 +442,15 @@ class DatasetAggregatorSuite extends QueryTest with > SharedSparkSession { > val agg = df.select(mode(col("a"))).as[String] > checkDataset(agg, "3") > } > + > + test("typed aggregation: option string") { > + val ds = Seq((1, Some("a")), (1, None), (1, Some("c")), (2, None)).toDS() > + > + checkDataset( > + ds.groupByKey(_._1).mapValues(_._2).agg( > + OptionStringAgg.toColumn > + ), > + (1, Some("ac")), (2, None) > + ) > + } > } > {code}
[jira] [Created] (SPARK-44346) Python worker exited unexpectedly - java.io.EOFException on DataInputStream.readInt - cluster doesn't terminate
Dmitry Goldenberg created SPARK-44346: - Summary: Python worker exited unexpectedly - java.io.EOFException on DataInputStream.readInt - cluster doesn't terminate Key: SPARK-44346 URL: https://issues.apache.org/jira/browse/SPARK-44346 Project: Spark Issue Type: Question Components: Spark Core Affects Versions: 3.3.2 Environment: AWS EMR emr-6.11.0 Spark 3.3.2 pandas 1.3.5 pyarrow 12.0.0 "spark.sql.shuffle.partitions": "210", "spark.default.parallelism": "210", "spark.yarn.stagingDir": "hdfs:///tmp", "spark.sql.adaptive.enabled": "true", "spark.serializer": "org.apache.spark.serializer.KryoSerializer", "spark.sql.execution.arrow.pyspark.enabled": "true", "spark.dynamicAllocation.enabled": "false", "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory", Reporter: Dmitry Goldenberg I am getting the below exception as a WARN. Apparently, a worker crashes. Multiple issues here: - What is the cause of the crash? Is it something to do with pyarrow; some kind of a versioning mismatch? - Error handling in Spark. The error is too low-level to make sense of. Can it be caught in Spark and dealt with properly? - The cluster doesn't recover or cleanly terminate. It essentially just hangs. EMR doesn't terminate it either. 
Stack traces: ``` 23/07/05 22:43:47 WARN TaskSetManager: Lost task 1.0 in stage 81.0 (TID 2761) (ip-10-2-250-114.awsinternal.audiomack.com executor 2): org.apache.spark.SparkException: Python worker exited unexpectedly (crashed) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:592) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:574) at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:763) at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:740) at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:505) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:491) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:955) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490) at org.apache.spark.sql.execution.AbstractUnsafeExternalRowSorter.sort(AbstractUnsafeExternalRowSorter.java:50) at org.apache.spark.sql.execution.SortExecBase.$anonfun$doExecute$1(SortExec.scala:346) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:890) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:890) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365) at org.apache.spark.rdd.RDD.iterator(RDD.scala:329) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:138) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1516) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:750) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:748) ... 26 more 23/07/05 22:43:47 INFO TaskSetManager: Starting task 1.1 in stage 81.0 (TID 2763) (ip-10-2-250-114.awsinternal.audiomack.com, executor 2, partition 1, NODE_LOCAL, 5020 bytes) taskResourceAssignments Map() 23/07/05 23:30:17 INFO TaskSetManager: Finished task 2.0 in stage 81.0 (TID 2762) in 8603522 ms on ip-10-2-250-114.awsinternal.audiomack.com (executor 2) (1/3) 23/07/05 23:39:09 INFO TaskSetManager: Finished task 0.0 in stage 81.0 (TID 2760) in 9135125 ms on ip-10-2-250-114.awsinternal.audiomack.com (executor 2) (2/3)
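The java.io.EOFException at the root of the trace is the JVM side of the PySpark worker protocol trying to read a 4-byte length/command int from a stream the crashed worker has already closed. A minimal Python sketch of what a readInt-style call does, to illustrate why a dead worker surfaces as EOF rather than a descriptive error (illustrative only, not Spark's actual protocol code):

```python
import io
import struct

def read_int(stream):
    """Mimic java.io.DataInputStream.readInt: read a 4-byte big-endian int,
    raising EOFError if the stream ends before all 4 bytes arrive. This is
    the shape of failure the JVM reports when the Python worker dies
    mid-protocol: the socket closes and the next readInt hits end-of-stream."""
    data = stream.read(4)
    if len(data) < 4:
        raise EOFError("stream ended before a full 4-byte int was read")
    return struct.unpack(">i", data)[0]
```

Because the EOF only says "the stream ended", the real cause (an OOM-killed worker, a pyarrow incompatibility, etc.) has to be found in the worker-side logs, which matches the reporter's complaint that the error is too low-level.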
[jira] [Resolved] (SPARK-38477) Use error classes in org.apache.spark.storage
[ https://issues.apache.org/jira/browse/SPARK-38477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk resolved SPARK-38477. -- Fix Version/s: 3.5.0 Resolution: Fixed Issue resolved by pull request 41575 [https://github.com/apache/spark/pull/41575] > Use error classes in org.apache.spark.storage > - > > Key: SPARK-38477 > URL: https://issues.apache.org/jira/browse/SPARK-38477 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > Fix For: 3.5.0 > >
[jira] [Assigned] (SPARK-38477) Use error classes in org.apache.spark.storage
[ https://issues.apache.org/jira/browse/SPARK-38477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk reassigned SPARK-38477: Assignee: Bo Zhang > Use error classes in org.apache.spark.storage > - > > Key: SPARK-38477 > URL: https://issues.apache.org/jira/browse/SPARK-38477 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major >
[jira] [Commented] (SPARK-43389) spark.read.csv throws NullPointerException when lineSep is set to None
[ https://issues.apache.org/jira/browse/SPARK-43389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17741360#comment-17741360 ] ASF GitHub Bot commented on SPARK-43389: User 'gdhuper' has created a pull request for this issue: https://github.com/apache/spark/pull/41904 > spark.read.csv throws NullPointerException when lineSep is set to None > -- > > Key: SPARK-43389 > URL: https://issues.apache.org/jira/browse/SPARK-43389 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.3.1 >Reporter: Zach Liu >Priority: Trivial > > lineSep was defined as Optional[str] yet I'm unable to explicitly set it to > None: > reader = spark.read.format("csv") > read_options={'inferSchema': False, 'header': True, 'mode': 'DROPMALFORMED', > 'sep': '\t', 'escape': '\\', 'multiLine': False, 'lineSep': None} > for option, option_value in read_options.items(): > reader = reader.option(option, option_value) > df = reader.load("s3://") > raises exception: > py4j.protocol.Py4JJavaError: An error occurred while calling o126.load. 
> : java.lang.NullPointerException > at > scala.collection.immutable.StringOps$.length$extension(StringOps.scala:51) > at scala.collection.immutable.StringOps.length(StringOps.scala:51) > at > scala.collection.IndexedSeqOptimized.isEmpty(IndexedSeqOptimized.scala:30) > at > scala.collection.IndexedSeqOptimized.isEmpty$(IndexedSeqOptimized.scala:30) > at scala.collection.immutable.StringOps.isEmpty(StringOps.scala:33) > at scala.collection.TraversableOnce.nonEmpty(TraversableOnce.scala:143) > at scala.collection.TraversableOnce.nonEmpty$(TraversableOnce.scala:143) > at scala.collection.immutable.StringOps.nonEmpty(StringOps.scala:33) > at > org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:216) > at scala.Option.map(Option.scala:230) > at > org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:215) > at > org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:47) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:60) > at > org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$11(DataSource.scala:210) > at scala.Option.orElse(Option.scala:447) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:411) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210) > at scala.Option.getOrElse(Option.scala:189) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:210) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:185) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) > at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) > at py4j.Gateway.invoke(Gateway.java:282) > at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) > at py4j.commands.CallCommand.execute(CallCommand.java:79) > at > py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182) > at py4j.ClientServerConnection.run(ClientServerConnection.java:106) > at java.lang.Thread.run(Thread.java:750)
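Since the NPE comes from passing a literal null option through to the JVM-side CSVOptions, one caller-side workaround is simply to drop None-valued entries before the option loop. This is a sketch of that idea using the read_options dict from the report, not the fix proposed in the linked pull request:

```python
# read_options as given in the bug report; lineSep is explicitly None.
read_options = {'inferSchema': False, 'header': True, 'mode': 'DROPMALFORMED',
                'sep': '\t', 'escape': '\\', 'multiLine': False,
                'lineSep': None}

# Workaround sketch: omit None-valued options entirely instead of forwarding
# them, since the JVM side receives a Python None as null and then calls
# String methods on it.
effective_options = {k: v for k, v in read_options.items() if v is not None}
```

The filtered dict can then be fed to the same `reader.option(...)` loop; omitting `lineSep` makes Spark fall back to its default line-separator detection instead of crashing.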