[jira] [Commented] (SPARK-40460) Streaming metrics is zero when select _metadata
[ https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606433#comment-17606433 ]

Jungtaek Lim commented on SPARK-40460:
--------------------------------------

[~yaohua] Just to clarify: the streaming metadata column for DSv1 seems to have been introduced in Spark 3.3 (https://issues.apache.org/jira/browse/SPARK-38323). Do I understand correctly? If so, the affected versions don't seem to be correct.

> Streaming metrics is zero when select _metadata
> -----------------------------------------------
>
>                 Key: SPARK-40460
>                 URL: https://issues.apache.org/jira/browse/SPARK-40460
>             Project: Spark
>          Issue Type: Bug
>          Components: Structured Streaming
>    Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2
>            Reporter: Yaohua Zhao
>            Assignee: Yaohua Zhao
>            Priority: Major
>             Fix For: 3.4.0
>
> Streaming metrics report all 0 (`processedRowsPerSecond`, etc.) when selecting the `_metadata` column, because the logical plan from the batch and the actual planned logical plan are mismatched:
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ProgressReporter.scala#L348

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
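For context, a simplified and purely hypothetical model of the failure mode in plain Python: if progress reporting associates per-source input-row counts by matching leaf nodes of the batch's logical plan against the executed plan, then an extra `_metadata` projection that rewrites a leaf leaves nothing to match, and the reported metrics fall back to 0. The names below are invented for illustration; this is not Spark's actual ProgressReporter code.

```python
# Simplified sketch of how per-source row counts can end up as 0 when the
# batch logical plan and the executed plan no longer match (hypothetical
# model, not Spark's actual implementation).

def processed_rows(batch_leaves, executed_counts):
    """Sum counts only for leaves present in both plans; unmatched -> 0."""
    return sum(executed_counts.get(leaf, 0) for leaf in batch_leaves)

# Plans match: the file-source leaf is found and its count is reported.
assert processed_rows(["FileSource"], {"FileSource": 1000}) == 1000

# Selecting _metadata rewrites the executed leaf, so nothing matches -> 0,
# which is what surfaces as processedRowsPerSecond = 0 in the progress.
assert processed_rows(["FileSource"], {"FileSourceWithMetadata": 1000}) == 0
```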
[jira] [Resolved] (SPARK-40460) Streaming metrics is zero when select _metadata
[ https://issues.apache.org/jira/browse/SPARK-40460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-40460.
----------------------------------
    Fix Version/s: 3.4.0
         Assignee: Yaohua Zhao
       Resolution: Fixed

Issue resolved via https://github.com/apache/spark/pull/37905
[jira] [Resolved] (SPARK-40482) Revert SPARK-24544 Print actual failure cause when look up function failed
[ https://issues.apache.org/jira/browse/SPARK-40482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan resolved SPARK-40482.
---------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37896
https://github.com/apache/spark/pull/37896

> Revert SPARK-24544 Print actual failure cause when look up function failed
> --------------------------------------------------------------------------
>
>                 Key: SPARK-40482
>                 URL: https://issues.apache.org/jira/browse/SPARK-40482
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Dongjoon Hyun
>            Priority: Trivial
>             Fix For: 3.4.0
[jira] [Assigned] (SPARK-40482) Revert SPARK-24544 Print actual failure cause when look up function failed
[ https://issues.apache.org/jira/browse/SPARK-40482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan reassigned SPARK-40482:
-----------------------------------
    Assignee: Wenchen Fan
[jira] [Commented] (SPARK-40367) Total size of serialized results of 3730 tasks (64.0 GB) is bigger than spark.driver.maxResultSize (64.0 GB)
[ https://issues.apache.org/jira/browse/SPARK-40367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606414#comment-17606414 ]

Senthil Kumar commented on SPARK-40367:
---------------------------------------

Hi [~jackyjfhu],

Check whether you are sending more bytes/rows to the driver than "spark.driver.maxResultSize" allows. If so, keep increasing "spark.driver.maxResultSize" until it fixes this issue. But while increasing spark.driver.maxResultSize, be careful that it does not exceed the driver memory.

_Note: driver-memory > spark.driver.maxResultSize > rows/bytes sent to driver_

> Total size of serialized results of 3730 tasks (64.0 GB) is bigger than spark.driver.maxResultSize (64.0 GB)
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-40367
>                 URL: https://issues.apache.org/jira/browse/SPARK-40367
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.2
>            Reporter: jackyjfhu
>            Priority: Blocker
>
> I use this code: spark.sql("xx").selectExpr(spark.table(target).columns:_*).write.mode("overwrite").insertInto(target), and I get this error:
>
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Total size of serialized results of 3730 tasks (64.0 GB) is bigger than spark.driver.maxResultSize (64.0 GB)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1609)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1597)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1596)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1596)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
>   at scala.Option.foreach(Option.scala:257)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1830)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1779)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1768)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>   at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
>   at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:939)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
>   at org.apache.spark.rdd.RDD.collect(RDD.scala:938)
>   at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:304)
>   at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:76)
>   at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
>   at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:97)
>   at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
>   at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
>   at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>   at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
> --conf spark.driver.maxResultSize=64g
> --conf spark.sql.broadcastTimeout=36000
> --conf spark.sql.autoBroadcastJoinThreshold=204857600
> --conf spark.memory.offHeap.enabled=true
> --conf spark.memory.offHeap.size=4g
> --num-executors 500
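The sizing rule in the comment above can be illustrated with a toy driver-side check in plain Python. This is a simplified, hypothetical model of the limit, not Spark's actual TaskSetManager code; the function name and message format are invented.

```python
# Toy model of the driver-side result-size limit: accumulate serialized
# task-result sizes and fail once the running total exceeds maxResultSize.
# Hypothetical sketch, not Spark's actual implementation.

GB = 1 << 30
MB = 1 << 20

def check_results(result_sizes, max_result_size):
    total = 0
    for size in result_sizes:
        total += size
        if total > max_result_size:
            raise RuntimeError(
                "Total size of serialized results (%d bytes) is bigger than "
                "spark.driver.maxResultSize (%d bytes)" % (total, max_result_size))
    return total

# Within the limit, the total size is returned.
check_results([100 * MB] * 10, 64 * GB)

# 3730 tasks of ~18 MB each exceed a 64 GB limit, as in the report above.
try:
    check_results([18 * MB] * 3730, 64 * GB)
except RuntimeError:
    pass  # this is the failure mode the reporter hit
```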
[jira] [Updated] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference
[ https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiaonan Yang updated SPARK-40474:
---------------------------------
    Description:
In https://issues.apache.org/jira/browse/SPARK-39469 we introduced support for the date type in CSV schema inference. The schema inference behavior on date/time columns is now:
* For a column containing only dates, we infer it as Date type
* For a column containing only timestamps, we infer it as Timestamp type
* For a column containing a mixture of dates and timestamps, we infer it as Timestamp type

However, the last scenario turned out to be too ambitious: supporting it introduced significant code complexity and raised performance concerns. Thus, we want to simplify its behavior:
* For a column containing a mixture of dates and timestamps, we infer it as String type

    was:
In this ticket, we introduced the support of date type in CSV schema inference. The schema inference behavior on date time columns now is:
* For a column only containing dates, we will infer it as Date type
* For a column only containing timestamps, we will infer it as Timestamp type
* For a column containing a mixture of dates and timestamps, we will infer it as Timestamp type

However, we found that we are too ambitious on the last scenario, to support which we have introduced much complexity in code and caused a lot of performance concerns. Thus, we want to simplify the behavior of the last scenario as:
* For a column containing a mixture of dates and timestamps, we will infer it as String type

> Infer columns with mixed date and timestamp as String in CSV schema inference
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-40474
>                 URL: https://issues.apache.org/jira/browse/SPARK-40474
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.4.0
>            Reporter: Xiaonan Yang
>            Priority: Major
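The simplified rule described above can be sketched in a few lines of plain Python. This is a hypothetical model of the inference decision only, not Spark's CSV inference code; the function name is invented.

```python
def infer_datetime_type(observed):
    """observed: iterable of 'date'/'timestamp' tags seen in one CSV column."""
    kinds = set(observed)
    if kinds == {"date"}:
        return "DateType"
    if kinds == {"timestamp"}:
        return "TimestampType"
    # Mixed dates and timestamps: fall back to StringType instead of
    # promoting everything to TimestampType (the simplification proposed above).
    return "StringType"

assert infer_datetime_type(["date", "date"]) == "DateType"
assert infer_datetime_type(["timestamp"]) == "TimestampType"
assert infer_datetime_type(["date", "timestamp"]) == "StringType"
```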
[jira] [Assigned] (SPARK-40483) Add `CONNECT` label
[ https://issues.apache.org/jira/browse/SPARK-40483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40483:
------------------------------------
    Assignee: Apache Spark

> Add `CONNECT` label
> -------------------
>
>                 Key: SPARK-40483
>                 URL: https://issues.apache.org/jira/browse/SPARK-40483
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Connect, Project Infra
>    Affects Versions: 3.4.0
>            Reporter: Hyukjin Kwon
>            Priority: Minor
[jira] [Assigned] (SPARK-40483) Add `CONNECT` label
[ https://issues.apache.org/jira/browse/SPARK-40483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40483:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-40483) Add `CONNECT` label
[ https://issues.apache.org/jira/browse/SPARK-40483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606404#comment-17606404 ]

Apache Spark commented on SPARK-40483:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/37925
[jira] [Updated] (SPARK-40483) Add `CONNECT` label
[ https://issues.apache.org/jira/browse/SPARK-40483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-40483:
---------------------------------
        Parent: SPARK-39375
    Issue Type: Sub-task (was: Improvement)
[jira] [Created] (SPARK-40483) Add `CONNECT` label
Hyukjin Kwon created SPARK-40483:
---------------------------------

             Summary: Add `CONNECT` label
                 Key: SPARK-40483
                 URL: https://issues.apache.org/jira/browse/SPARK-40483
             Project: Spark
          Issue Type: Improvement
          Components: Connect, Project Infra
    Affects Versions: 3.4.0
            Reporter: Hyukjin Kwon
[jira] [Commented] (SPARK-40472) Improve pyspark.sql.function example experience
[ https://issues.apache.org/jira/browse/SPARK-40472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606398#comment-17606398 ]

Hyukjin Kwon commented on SPARK-40472:
--------------------------------------

I was thinking about this too, but maybe it's fine as is: we can assume that users are visiting the page of a package, and they would likely know that they need to import the package they are visiting.

> Improve pyspark.sql.function example experience
> -----------------------------------------------
>
>                 Key: SPARK-40472
>                 URL: https://issues.apache.org/jira/browse/SPARK-40472
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 3.3.0
>            Reporter: deshanxiao
>            Priority: Minor
>
> There are many examples in pyspark.sql.function:
> {code:java}
> Examples
> --------
> >>> df = spark.range(1)
> >>> df.select(lit(5).alias('height'), df.id).show()
> +------+---+
> |height| id|
> +------+---+
> |     5|  0|
> +------+---+ {code}
> We can add import statements so that users can run the examples directly.
[jira] [Updated] (SPARK-40474) Infer columns with mixed date and timestamp as String in CSV schema inference
[ https://issues.apache.org/jira/browse/SPARK-40474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon updated SPARK-40474:
---------------------------------
    Fix Version/s: (was: 3.4.0)
[jira] [Resolved] (SPARK-40404) Fix the wrong description related to `spark.shuffle.service.db` in the document
[ https://issues.apache.org/jira/browse/SPARK-40404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-40404.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37853
https://github.com/apache/spark/pull/37853

> Fix the wrong description related to `spark.shuffle.service.db` in the document
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-40404
>                 URL: https://issues.apache.org/jira/browse/SPARK-40404
>             Project: Spark
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 3.4.0
>            Reporter: Yang Jie
>            Assignee: Yang Jie
>            Priority: Minor
>             Fix For: 3.4.0
>
> From the context of the PR for SPARK-17321, YarnShuffleService persists data into LevelDB/RocksDB when Yarn NM recovery is enabled. This behavior is not controlled by `spark.shuffle.service.db.enabled` and is not always enabled, so the description of `spark.shuffle.service.db.enabled` in `spark-standalone.md` is misleading.
[jira] [Assigned] (SPARK-40404) Fix the wrong description related to `spark.shuffle.service.db` in the document
[ https://issues.apache.org/jira/browse/SPARK-40404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-40404:
-------------------------------------
    Assignee: Yang Jie
[jira] [Assigned] (SPARK-40482) Revert SPARK-24544 Print actual failure cause when look up function failed
[ https://issues.apache.org/jira/browse/SPARK-40482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40482:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-40482) Revert SPARK-24544 Print actual failure cause when look up function failed
[ https://issues.apache.org/jira/browse/SPARK-40482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40482:
------------------------------------
    Assignee: Apache Spark
[jira] [Commented] (SPARK-40482) Revert SPARK-24544 Print actual failure cause when look up function failed
[ https://issues.apache.org/jira/browse/SPARK-40482?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606383#comment-17606383 ]

Apache Spark commented on SPARK-40482:
--------------------------------------

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/37896
[jira] [Created] (SPARK-40482) Revert SPARK-24544 Print actual failure cause when look up function failed
Dongjoon Hyun created SPARK-40482:
----------------------------------

             Summary: Revert SPARK-24544 Print actual failure cause when look up function failed
                 Key: SPARK-40482
                 URL: https://issues.apache.org/jira/browse/SPARK-40482
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.4.0
            Reporter: Dongjoon Hyun
[jira] [Assigned] (SPARK-40424) Refactor ChromeUIHistoryServerSuite to test rocksdb
[ https://issues.apache.org/jira/browse/SPARK-40424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun reassigned SPARK-40424:
-------------------------------------
    Assignee: Yang Jie

> Refactor ChromeUIHistoryServerSuite to test rocksdb
> ---------------------------------------------------
>
>                 Key: SPARK-40424
>                 URL: https://issues.apache.org/jira/browse/SPARK-40424
>             Project: Spark
>          Issue Type: Improvement
>          Components: Tests
>    Affects Versions: 3.4.0
>            Reporter: Yang Jie
>            Assignee: Yang Jie
>            Priority: Minor
>
> ChromeUIHistoryServerSuite currently only tests the LevelDB backend.
[jira] [Resolved] (SPARK-40424) Refactor ChromeUIHistoryServerSuite to test rocksdb
[ https://issues.apache.org/jira/browse/SPARK-40424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-40424.
-----------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

Issue resolved by pull request 37878
https://github.com/apache/spark/pull/37878
[jira] [Updated] (SPARK-40468) Column pruning is not handled correctly in CSV when _corrupt_record is used
[ https://issues.apache.org/jira/browse/SPARK-40468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-40468:
----------------------------------
    Labels: correctness (was: )

> Column pruning is not handled correctly in CSV when _corrupt_record is used
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-40468
>                 URL: https://issues.apache.org/jira/browse/SPARK-40468
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0, 3.2.2, 3.4.0
>            Reporter: Ivan Sadikov
>            Assignee: Ivan Sadikov
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.4.0, 3.3.2
>
> I have found that depending on the name of the corrupt record column in CSV, the field is populated incorrectly. Here is an example:
> {code:java}
> 1,a
> /tmp/file.csv
> ===
> val df = spark.read
>   .schema("c1 int, c2 string, x string, _corrupt_record string")
>   .csv("file:/tmp/file.csv")
>   .withColumn("x", lit("A"))
>
> Result:
> +---+---+---+---------------+
> |c1 |c2 |x  |_corrupt_record|
> +---+---+---+---------------+
> |1  |a  |A  |1,a            |
> +---+---+---+---------------+{code}
>
> However, if you rename the {{_corrupt_record}} column to something else, the result is different:
> {code:java}
> val df = spark.read
>   .option("columnNameCorruptRecord", "corrupt_record")
>   .schema("c1 int, c2 string, x string, corrupt_record string")
>   .csv("file:/tmp/file.csv")
>   .withColumn("x", lit("A"))
>
> Result:
> +---+---+---+--------------+
> |c1 |c2 |x  |corrupt_record|
> +---+---+---+--------------+
> |1  |a  |A  |null          |
> +---+---+---+--------------+{code}
>
> This is due to an inconsistency in CSVFileFormat: when enabling columnPruning, we check the SQLConf option for corrupt records, but the CSV reader relies on the {{columnNameCorruptRecord}} option instead.
> Also, this disables column pruning, which used to work in Spark versions prior to https://github.com/apache/spark/commit/959694271e30879c944d7fd5de2740571012460a.
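The inconsistency described above can be modeled in a few lines of plain Python: one code path consults the session-wide corrupt-record column name while another honors the per-read option, so renaming the column makes the two paths disagree. All names here are hypothetical; this is not Spark's actual CSVFileFormat code.

```python
# Hypothetical model of the inconsistency: the pruning decision consults the
# global conf's corrupt-record name, while the parser consults the per-read
# option. Not Spark's actual implementation.

GLOBAL_CORRUPT_NAME = "_corrupt_record"  # session-wide default name

def column_pruning_enabled(schema, read_options):
    # Bug-style check: looks at the *global* name only, ignoring read_options.
    return GLOBAL_CORRUPT_NAME not in schema

def parser_corrupt_column(read_options):
    # The parser honors the *per-read* option instead.
    return read_options.get("columnNameCorruptRecord", GLOBAL_CORRUPT_NAME)

# With the default name both paths agree, and pruning is disabled:
assert column_pruning_enabled(["c1", "_corrupt_record"], {}) is False

# With a renamed corrupt column the paths disagree: pruning stays enabled,
# so the column the parser would populate is pruned away (and reads as null).
opts = {"columnNameCorruptRecord": "corrupt_record"}
assert column_pruning_enabled(["c1", "corrupt_record"], opts) is True
assert parser_corrupt_column(opts) == "corrupt_record"
```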
[jira] [Commented] (SPARK-40481) Ignore stage fetch failure caused by decommissioned executor
[ https://issues.apache.org/jira/browse/SPARK-40481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606359#comment-17606359 ]

Apache Spark commented on SPARK-40481:
--------------------------------------

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/37924

> Ignore stage fetch failure caused by decommissioned executor
> ------------------------------------------------------------
>
>                 Key: SPARK-40481
>                 URL: https://issues.apache.org/jira/browse/SPARK-40481
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Zhongwei Zhu
>            Priority: Minor
>
> When executor decommission is enabled, there can be many stage failures caused by FetchFailed from decommissioned executors, which in turn can fail the whole job. It would be better not to count such failures toward `spark.stage.maxConsecutiveAttempts`.
[jira] [Assigned] (SPARK-40481) Ignore stage fetch failure caused by decommissioned executor
[ https://issues.apache.org/jira/browse/SPARK-40481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40481:
------------------------------------
    Assignee: Apache Spark
[jira] [Commented] (SPARK-40481) Ignore stage fetch failure caused by decommissioned executor
[ https://issues.apache.org/jira/browse/SPARK-40481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606358#comment-17606358 ]

Apache Spark commented on SPARK-40481:
--------------------------------------

User 'warrenzhu25' has created a pull request for this issue:
https://github.com/apache/spark/pull/37924
[jira] [Assigned] (SPARK-40481) Ignore stage fetch failure caused by decommissioned executor
[ https://issues.apache.org/jira/browse/SPARK-40481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-40481:
------------------------------------
    Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-40481) Ignore stage fetch failure caused by decommissioned executor
Zhongwei Zhu created SPARK-40481:
---------------------------------

             Summary: Ignore stage fetch failure caused by decommissioned executor
                 Key: SPARK-40481
                 URL: https://issues.apache.org/jira/browse/SPARK-40481
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core
    Affects Versions: 3.3.0
            Reporter: Zhongwei Zhu

When executor decommission is enabled, there can be many stage failures caused by FetchFailed from decommissioned executors, which in turn can fail the whole job. It would be better not to count such failures toward `spark.stage.maxConsecutiveAttempts`.
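The proposal above can be sketched as a toy counting rule in plain Python: count consecutive stage failures toward `spark.stage.maxConsecutiveAttempts`, but skip fetch failures whose source executor was decommissioned. This is a hypothetical sketch with invented names, not Spark's scheduler code.

```python
# Toy model: decide whether to abort a stage after a run of FetchFailed
# events, ignoring failures blamed on decommissioned executors.
# Hypothetical sketch, not Spark's actual DAGScheduler logic.

def should_abort(failures, decommissioned, max_consecutive_attempts=4):
    """failures: executor ids blamed for each consecutive FetchFailed."""
    counted = sum(1 for ex in failures if ex not in decommissioned)
    return counted >= max_consecutive_attempts

# Four fetch failures, but three came from decommissioned executors:
# only one counts, so the stage (and job) survives.
assert should_abort(["e1", "e2", "e3", "e4"], {"e1", "e2", "e3"}) is False

# Four failures from healthy executors still abort the stage.
assert should_abort(["e1", "e2", "e3", "e4"], set()) is True
```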
[jira] [Commented] (SPARK-40334) Implement `GroupBy.prod`.
[ https://issues.apache.org/jira/browse/SPARK-40334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606345#comment-17606345 ] Apache Spark commented on SPARK-40334: -- User 'ayudovin' has created a pull request for this issue: https://github.com/apache/spark/pull/37923 > Implement `GroupBy.prod`. > - > > Key: SPARK-40334 > URL: https://issues.apache.org/jira/browse/SPARK-40334 > Project: Spark > Issue Type: Sub-task > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Haejoon Lee >Assignee: Artsiom Yudovin >Priority: Major > > We should implement `GroupBy.prod` to increase pandas API coverage. > pandas docs: > https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.GroupBy.prod.html
[jira] [Assigned] (SPARK-40334) Implement `GroupBy.prod`.
[ https://issues.apache.org/jira/browse/SPARK-40334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40334: Assignee: Apache Spark (was: Artsiom Yudovin)
[jira] [Assigned] (SPARK-40334) Implement `GroupBy.prod`.
[ https://issues.apache.org/jira/browse/SPARK-40334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40334: Assignee: Artsiom Yudovin (was: Apache Spark)
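For context, `GroupBy.prod` computes the product of each group's values. A minimal pure-Python sketch of the semantics being ported — this illustrates the pandas behavior linked in the issue, not the pandas-on-Spark implementation itself:

```python
import math
from collections import defaultdict

def groupby_prod(keys, values):
    """Product of values per group key, mirroring pandas GroupBy.prod semantics."""
    groups = defaultdict(list)
    for k, v in zip(keys, values):
        groups[k].append(v)
    return {k: math.prod(vs) for k, vs in groups.items()}

# Conceptually equivalent to df.groupby("key")["val"].prod() in pandas.
print(groupby_prod(["a", "a", "b"], [2, 3, 4]))  # {'a': 6, 'b': 4}
```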
[jira] [Commented] (SPARK-40480) Remove push-based shuffle data after query finished
[ https://issues.apache.org/jira/browse/SPARK-40480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606314#comment-17606314 ] Apache Spark commented on SPARK-40480: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/37922
[jira] [Assigned] (SPARK-40480) Remove push-based shuffle data after query finished
[ https://issues.apache.org/jira/browse/SPARK-40480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40480: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-40480) Remove push-based shuffle data after query finished
[ https://issues.apache.org/jira/browse/SPARK-40480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606313#comment-17606313 ] Apache Spark commented on SPARK-40480: -- User 'wankunde' has created a pull request for this issue: https://github.com/apache/spark/pull/37922
[jira] [Assigned] (SPARK-40480) Remove push-based shuffle data after query finished
[ https://issues.apache.org/jira/browse/SPARK-40480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40480: Assignee: Apache Spark
[jira] [Created] (SPARK-40480) Remove push-based shuffle data after query finished
Wan Kun created SPARK-40480: --- Summary: Remove push-based shuffle data after query finished Key: SPARK-40480 URL: https://issues.apache.org/jira/browse/SPARK-40480 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 3.3.0 Reporter: Wan Kun Currently, Spark cleans up ordinary shuffle data files but not push-based shuffle files. In our production cluster, the push-based shuffle service creates too many merged shuffle data files because several Spark Thrift Servers are running. Could we clean up the merged data files after the query finishes?
[jira] [Commented] (SPARK-40479) Migrate unexpected input type error to an error class
[ https://issues.apache.org/jira/browse/SPARK-40479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606256#comment-17606256 ] Apache Spark commented on SPARK-40479: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37921
[jira] [Commented] (SPARK-40479) Migrate unexpected input type error to an error class
[ https://issues.apache.org/jira/browse/SPARK-40479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17606257#comment-17606257 ] Apache Spark commented on SPARK-40479: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/37921
[jira] [Assigned] (SPARK-40479) Migrate unexpected input type error to an error class
[ https://issues.apache.org/jira/browse/SPARK-40479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40479: Assignee: Apache Spark
[jira] [Assigned] (SPARK-40479) Migrate unexpected input type error to an error class
[ https://issues.apache.org/jira/browse/SPARK-40479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40479: Assignee: (was: Apache Spark)
[jira] [Created] (SPARK-40479) Migrate unexpected input type error to an error class
Max Gekk created SPARK-40479: Summary: Migrate unexpected input type error to an error class Key: SPARK-40479 URL: https://issues.apache.org/jira/browse/SPARK-40479 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Max Gekk Migrate the function `ExpectsInputTypes.checkInputDataTypes` onto `DataTypeMismatch` and introduce a new error class.
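For readers unfamiliar with the pattern, this migration is toward "error classes": instead of building a free-form error string at the call site, the failure carries an error class name plus message parameters, and the text is rendered from a central template registry. A minimal sketch of the idea — the class name and template below are made up for the example, not Spark's actual error-classes definitions:

```python
# Illustrative sketch of the error-class pattern: structured errors are
# identified by a class name and carry message parameters; the human-readable
# text comes from a central template registry. Names here are hypothetical.
ERROR_CLASSES = {
    "DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE": (
        "Parameter {paramIndex} requires the {requiredType} type, "
        "however {inputSql} has the type {inputType}."
    ),
}

def render_error(error_class, **params):
    """Render a structured error from its class name and message parameters."""
    return ERROR_CLASSES[error_class].format(**params)

msg = render_error(
    "DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE",
    paramIndex=1, requiredType='"INT"', inputSql='"a"', inputType='"STRING"',
)
print(msg)
```

Centralizing the templates keeps error text consistent and machine-checkable, which is the point of migrating `checkInputDataTypes` away from ad-hoc strings.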
[jira] [Resolved] (SPARK-39512) Document the Spark Docker container release process
[ https://issues.apache.org/jira/browse/SPARK-39512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang resolved SPARK-39512. - Fix Version/s: 3.4.0 Assignee: Holden Karau Resolution: Fixed > Document the Spark Docker container release process > --- > > Key: SPARK-39512 > URL: https://issues.apache.org/jira/browse/SPARK-39512 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > Fix For: 3.4.0 > >