[jira] [Created] (SPARK-40662) Serialization of MapStatuses is sometimes much larger on Scala 2.13
Emil Ejbyfeldt created SPARK-40662: -- Summary: Serialization of MapStatuses is sometimes much larger on Scala 2.13 Key: SPARK-40662 URL: https://issues.apache.org/jira/browse/SPARK-40662 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.3.0 Reporter: Emil Ejbyfeldt We have observed a case where the same job run against Spark on Scala 2.13 fails, going out of memory because the broadcast for the MapStatuses is huge. In the logs around the time the job fails, it tries to create a broadcast of size 4.8 GiB. ``` 2022-09-18 22:46:01,418 INFO memory.MemoryStore: Block broadcast_17 stored as values in memory (estimated size 4.8 GiB, free 12.9 GiB) ``` The same broadcast of the MapStatuses for the same job running on 2.12 is 391.5 MiB: ``` 2022-09-18 16:11:58,753 INFO memory.MemoryStore: Block broadcast_17 stored as values in memory (estimated size 391.5 MiB, free 26.4 GiB) ``` So in this particular case the broadcast for the MapStatuses is more than 10x larger when using 2.13. This is not universal for all MapStatus broadcasts, as we have many other jobs using Scala 2.13 where the broadcast is roughly the same size. This has been observed on 3.3.0, but I also tested against 3.3.1-rc2 and a build of 3.4.0-SNAPSHOT, and both of those also reproduced the issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
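[This is not Spark code. As a hedged illustration of the kind of check used to diagnose a regression like the one above, the sketch below measures serialized size with Python's pickle and shows how the same logical data can serialize to very different sizes depending on the concrete container type, analogous to how a different internal collection representation in Scala 2.13 can inflate a broadcast of the same MapStatuses.]

```python
import pickle

def serialized_size(obj) -> int:
    """Size in bytes of the pickled representation of obj."""
    return len(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))

# Same logical mapping index -> value, held in two different containers.
as_list = list(range(100_000))             # index is implicit in the position
as_dict = {i: i for i in range(100_000)}   # index stored explicitly as a key

print("list:", serialized_size(as_list), "bytes")
print("dict:", serialized_size(as_dict), "bytes")
# The dict serializes every key as well as every value, so it is much larger,
# even though both containers carry the same information.
```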
[jira] [Assigned] (SPARK-40660) Switch to XORShiftRandom to distribute elements
[ https://issues.apache.org/jira/browse/SPARK-40660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40660: --- Assignee: Yuming Wang > Switch to XORShiftRandom to distribute elements > --- > > Key: SPARK-40660 > URL: https://issues.apache.org/jira/browse/SPARK-40660 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > {code:scala} > import java.util.Random > import org.apache.spark.util.random.XORShiftRandom > import scala.util.hashing > def distribution(count: Int, partition: Int) = { > println((1 to count).map(partitionId => new > Random(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > Random(hashing.byteswap32(partitionId)).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > XORShiftRandom(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > } > distribution(200, 4) > {code} > {noformat} > 200 > 50. 60. 46. 44 > 55. 48. 43. 54 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40660) Switch to XORShiftRandom to distribute elements
[ https://issues.apache.org/jira/browse/SPARK-40660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40660. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38106 [https://github.com/apache/spark/pull/38106] > Switch to XORShiftRandom to distribute elements > --- > > Key: SPARK-40660 > URL: https://issues.apache.org/jira/browse/SPARK-40660 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.4.0 > > > {code:scala} > import java.util.Random > import org.apache.spark.util.random.XORShiftRandom > import scala.util.hashing > def distribution(count: Int, partition: Int) = { > println((1 to count).map(partitionId => new > Random(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > Random(hashing.byteswap32(partitionId)).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > XORShiftRandom(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > } > distribution(200, 4) > {code} > {noformat} > 200 > 50. 60. 46. 44 > 55. 48. 43. 54 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40587) SELECT * shouldn't be empty project list in proto.
[ https://issues.apache.org/jira/browse/SPARK-40587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-40587: --- Assignee: Rui Wang > SELECT * shouldn't be empty project list in proto. > -- > > Key: SPARK-40587 > URL: https://issues.apache.org/jira/browse/SPARK-40587 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > > The current proto uses an empty project list for `SELECT *`. However, this is > implicit, making it hard to differentiate `not set` from `set but empty`. For > longer-term proto compatibility, we should always use explicit fields for > passing through information.
[jira] [Resolved] (SPARK-40587) SELECT * shouldn't be empty project list in proto.
[ https://issues.apache.org/jira/browse/SPARK-40587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-40587. - Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38023 [https://github.com/apache/spark/pull/38023] > SELECT * shouldn't be empty project list in proto. > -- > > Key: SPARK-40587 > URL: https://issues.apache.org/jira/browse/SPARK-40587 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > > > The current proto uses an empty project list for `SELECT *`. However, this is > implicit, making it hard to differentiate `not set` from `set but empty`. For > longer-term proto compatibility, we should always use explicit fields for > passing through information.
[jira] [Commented] (SPARK-40659) Schema evolution for protobuf (and Avro too?)
[ https://issues.apache.org/jira/browse/SPARK-40659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612849#comment-17612849 ] Raghu Angadi commented on SPARK-40659: -- For 1): No schema evolution is necessary. We keep reading old and latest messages. For 2): Schema evolution is for this case, so that we don't drop fields. Say a streaming application reads from Kafka and writes all the fields to a delta table. This pipeline keeps running for a long time. Meanwhile the customer adds a new field 'zip_code' to the schema. What should happen? * (a) Without schema evolution: the 'zip_code' field would be dropped and would not appear in the destination table. * (b) With schema evolution: we create a new column 'zip_code' and populate the column. We want (b). In terms of implementation, if we throw a specific error, Structured Streaming stops the pipeline and restarts it, which will fetch the new schema and handle 'zip_code' correctly. > Schema evolution for protobuf (and Avro too?) > - > > Key: SPARK-40659 > URL: https://issues.apache.org/jira/browse/SPARK-40659 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > Protobuf & Avro should support schema evolution in streaming. We need to > throw a specific error message when we detect a newer version of the schema > in the schema registry. > A couple of options for detecting a version change at runtime: > * How do we detect a newer version from the schema registry? It is contacted only > during planning currently. > * We could detect the version id in incoming messages. > ** What if the id in the incoming message is newer than what our > schema-registry reports after the restart? > *** This indicates delayed syncs between the customer's schema-registry servers > (should be rare). We can keep erroring out until it is fixed. > *** Make sure we log the schema id used during planning.
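[A hedged sketch, not the proposed implementation: the restart-on-schema-change control flow described in the comment above, with a hypothetical in-memory registry dict standing in for a real schema registry. On a version bump the pipeline fails with a specific error, and the restart re-plans against the latest schema so new fields like 'zip_code' are picked up instead of silently dropped.]

```python
class SchemaChangedError(Exception):
    """Specific error raised when a newer schema version is detected mid-stream."""

def run_pipeline(registry: dict, batches: list) -> list:
    """Process batches against the schema version fetched at planning time."""
    planned_version = registry["version"]  # schema is fetched during planning only
    processed = []
    for batch in batches:
        if batch["schema_version"] > planned_version:
            # Fail the query with a specific error instead of dropping new fields.
            raise SchemaChangedError(batch["schema_version"])
        processed.append(batch)
    return processed

def run_with_restarts(registry: dict, batches: list, max_restarts: int = 3) -> list:
    """Restart the pipeline on schema change, refetching the schema each time."""
    for _ in range(max_restarts):
        try:
            return run_pipeline(registry, batches)
        except SchemaChangedError as e:
            registry["version"] = e.args[0]  # restart picks up the newer schema
    raise RuntimeError("schema registry kept changing")
```

Under these assumptions a batch carrying schema version 2 aborts the first run, and the second run plans against version 2 and succeeds.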
[jira] [Commented] (SPARK-40651) Drop Hadoop2 binary distribution from release process
[ https://issues.apache.org/jira/browse/SPARK-40651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612845#comment-17612845 ] Yang Jie commented on SPARK-40651: -- Is there an overall removal plan? What can I do to help? > Drop Hadoop2 binary distribution from release process > - > > Key: SPARK-40651 > URL: https://issues.apache.org/jira/browse/SPARK-40651 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major >
[jira] [Assigned] (SPARK-40661) Upgrade `jetty-http` from 9.4.48.v20220622 to 9.4.49.v20220914
[ https://issues.apache.org/jira/browse/SPARK-40661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40661: Assignee: (was: Apache Spark) > Upgrade `jetty-http` from 9.4.48.v20220622 to 9.4.49.v20220914 > -- > > Key: SPARK-40661 > URL: https://issues.apache.org/jira/browse/SPARK-40661 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40661) Upgrade `jetty-http` from 9.4.48.v20220622 to 9.4.49.v20220914
[ https://issues.apache.org/jira/browse/SPARK-40661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40661: Assignee: Apache Spark > Upgrade `jetty-http` from 9.4.48.v20220622 to 9.4.49.v20220914 > -- > > Key: SPARK-40661 > URL: https://issues.apache.org/jira/browse/SPARK-40661 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40661) Upgrade `jetty-http` from 9.4.48.v20220622 to 9.4.49.v20220914
[ https://issues.apache.org/jira/browse/SPARK-40661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612840#comment-17612840 ] Apache Spark commented on SPARK-40661: -- User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/38107 > Upgrade `jetty-http` from 9.4.48.v20220622 to 9.4.49.v20220914 > -- > > Key: SPARK-40661 > URL: https://issues.apache.org/jira/browse/SPARK-40661 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.4.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40661) Upgrade `jetty-http` from 9.4.48.v20220622 to 9.4.49.v20220914
BingKun Pan created SPARK-40661: --- Summary: Upgrade `jetty-http` from 9.4.48.v20220622 to 9.4.49.v20220914 Key: SPARK-40661 URL: https://issues.apache.org/jira/browse/SPARK-40661 Project: Spark Issue Type: Improvement Components: Build Affects Versions: 3.4.0 Reporter: BingKun Pan -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40660) Switch to XORShiftRandom to distribute elements
[ https://issues.apache.org/jira/browse/SPARK-40660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612839#comment-17612839 ] Apache Spark commented on SPARK-40660: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/38106 > Switch to XORShiftRandom to distribute elements > --- > > Key: SPARK-40660 > URL: https://issues.apache.org/jira/browse/SPARK-40660 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import java.util.Random > import org.apache.spark.util.random.XORShiftRandom > import scala.util.hashing > def distribution(count: Int, partition: Int) = { > println((1 to count).map(partitionId => new > Random(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > Random(hashing.byteswap32(partitionId)).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > XORShiftRandom(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > } > distribution(200, 4) > {code} > {noformat} > 200 > 50. 60. 46. 44 > 55. 48. 43. 54 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40660) Switch to XORShiftRandom to distribute elements
[ https://issues.apache.org/jira/browse/SPARK-40660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40660: Assignee: (was: Apache Spark) > Switch to XORShiftRandom to distribute elements > --- > > Key: SPARK-40660 > URL: https://issues.apache.org/jira/browse/SPARK-40660 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import java.util.Random > import org.apache.spark.util.random.XORShiftRandom > import scala.util.hashing > def distribution(count: Int, partition: Int) = { > println((1 to count).map(partitionId => new > Random(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > Random(hashing.byteswap32(partitionId)).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > XORShiftRandom(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > } > distribution(200, 4) > {code} > {noformat} > 200 > 50. 60. 46. 44 > 55. 48. 43. 54 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40660) Switch to XORShiftRandom to distribute elements
[ https://issues.apache.org/jira/browse/SPARK-40660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612838#comment-17612838 ] Apache Spark commented on SPARK-40660: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/38106 > Switch to XORShiftRandom to distribute elements > --- > > Key: SPARK-40660 > URL: https://issues.apache.org/jira/browse/SPARK-40660 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import java.util.Random > import org.apache.spark.util.random.XORShiftRandom > import scala.util.hashing > def distribution(count: Int, partition: Int) = { > println((1 to count).map(partitionId => new > Random(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > Random(hashing.byteswap32(partitionId)).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > XORShiftRandom(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > } > distribution(200, 4) > {code} > {noformat} > 200 > 50. 60. 46. 44 > 55. 48. 43. 54 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40660) Switch to XORShiftRandom to distribute elements
[ https://issues.apache.org/jira/browse/SPARK-40660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40660: Assignee: Apache Spark > Switch to XORShiftRandom to distribute elements > --- > > Key: SPARK-40660 > URL: https://issues.apache.org/jira/browse/SPARK-40660 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > {code:scala} > import java.util.Random > import org.apache.spark.util.random.XORShiftRandom > import scala.util.hashing > def distribution(count: Int, partition: Int) = { > println((1 to count).map(partitionId => new > Random(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > Random(hashing.byteswap32(partitionId)).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > XORShiftRandom(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > } > distribution(200, 4) > {code} > {noformat} > 200 > 50. 60. 46. 44 > 55. 48. 43. 54 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40645) Throw exception for Collect() and recommend to use toPandas()
[ https://issues.apache.org/jira/browse/SPARK-40645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-40645: Assignee: Rui Wang > Throw exception for Collect() and recommend to use toPandas() > - > > Key: SPARK-40645 > URL: https://issues.apache.org/jira/browse/SPARK-40645 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > > The current Connect `Collect()` returns a Pandas DataFrame, which does not match > the PySpark DataFrame API: > https://github.com/apache/spark/blob/ceb8527413288b4d5c54d3afd76d00c9e26817a1/python/pyspark/sql/connect/data_frame.py#L227. > The underlying implementation has been generating a Pandas DataFrame, though. In > this case, we can choose to use `toPandas()` and throw an exception for > `Collect()`.
[jira] [Resolved] (SPARK-40645) Throw exception for Collect() and recommend to use toPandas()
[ https://issues.apache.org/jira/browse/SPARK-40645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-40645. -- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38089 [https://github.com/apache/spark/pull/38089] > Throw exception for Collect() and recommend to use toPandas() > - > > Key: SPARK-40645 > URL: https://issues.apache.org/jira/browse/SPARK-40645 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > > > The current Connect `Collect()` returns a Pandas DataFrame, which does not match > the PySpark DataFrame API: > https://github.com/apache/spark/blob/ceb8527413288b4d5c54d3afd76d00c9e26817a1/python/pyspark/sql/connect/data_frame.py#L227. > The underlying implementation has been generating a Pandas DataFrame, though. In > this case, we can choose to use `toPandas()` and throw an exception for > `Collect()`.
[jira] [Updated] (SPARK-40660) Switch to XORShiftRandom to distribute elements
[ https://issues.apache.org/jira/browse/SPARK-40660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-40660: Summary: Switch to XORShiftRandom to distribute elements (was: Switch XORShiftRandom to distribute elements) > Switch to XORShiftRandom to distribute elements > --- > > Key: SPARK-40660 > URL: https://issues.apache.org/jira/browse/SPARK-40660 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > import java.util.Random > import org.apache.spark.util.random.XORShiftRandom > import scala.util.hashing > def distribution(count: Int, partition: Int) = { > println((1 to count).map(partitionId => new > Random(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > Random(hashing.byteswap32(partitionId)).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > println((1 to count).map(partitionId => new > XORShiftRandom(partitionId).nextInt(partition)) > .groupBy(f => f) > .map(_._2.size).mkString(". ")) > } > distribution(200, 4) > {code} > {noformat} > 200 > 50. 60. 46. 44 > 55. 48. 43. 54 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40660) Switch XORShiftRandom to distribute elements
Yuming Wang created SPARK-40660: --- Summary: Switch XORShiftRandom to distribute elements Key: SPARK-40660 URL: https://issues.apache.org/jira/browse/SPARK-40660 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Yuming Wang {code:scala} import java.util.Random import org.apache.spark.util.random.XORShiftRandom import scala.util.hashing def distribution(count: Int, partition: Int) = { println((1 to count).map(partitionId => new Random(partitionId).nextInt(partition)) .groupBy(f => f) .map(_._2.size).mkString(". ")) println((1 to count).map(partitionId => new Random(hashing.byteswap32(partitionId)).nextInt(partition)) .groupBy(f => f) .map(_._2.size).mkString(". ")) println((1 to count).map(partitionId => new XORShiftRandom(partitionId).nextInt(partition)) .groupBy(f => f) .map(_._2.size).mkString(". ")) } distribution(200, 4) {code} {noformat} 200 50. 60. 46. 44 55. 48. 43. 54 {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
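[The Scala experiment above can be reproduced outside the JVM. Below is my own Python reimplementation (not Spark or JDK code) of `java.util.Random`'s seeding plus the power-of-two branch of `nextInt`, and of `scala.util.hashing.byteswap32`, showing why sequential raw seeds collapse into a single partition (the "200" line in the output above) while byteswapped seeds spread across all four. The `XORShiftRandom` variant is omitted here.]

```python
MASK48 = (1 << 48) - 1
MULT = 0x5DEECE66D  # java.util.Random's LCG multiplier

def java_first_next_int(seed: int, bound: int) -> int:
    """First draw of java.util.Random(seed).nextInt(bound), bound a power of two."""
    s = (seed ^ MULT) & MASK48            # Random(seed) scrambles the raw seed
    s = (s * MULT + 0xB) & MASK48         # one LCG step (next(31))
    return (bound * (s >> 17)) >> 31      # power-of-two branch of nextInt

def byteswap32(v: int) -> int:
    """scala.util.hashing.byteswap32: multiply, reverse bytes, multiply again."""
    hc = (v * 0x9E3775CD) & 0xFFFFFFFF
    hc = int.from_bytes(hc.to_bytes(4, "big"), "little")  # Integer.reverseBytes
    hc = (hc * 0x9E3775CD) & 0xFFFFFFFF
    return hc - (1 << 32) if hc >= (1 << 31) else hc      # back to a signed Int

def distribution(count: int, partitions: int, key=lambda s: s) -> dict:
    """How many of the seeds 1..count land in each partition."""
    buckets = {}
    for seed in range(1, count + 1):
        p = java_first_next_int(key(seed), partitions)
        buckets[p] = buckets.get(p, 0) + 1
    return buckets

print(distribution(200, 4))                  # raw sequential seeds: one bucket
print(distribution(200, 4, key=byteswap32))  # byteswapped seeds: all four buckets
```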
[jira] [Updated] (SPARK-39725) Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622
[ https://issues.apache.org/jira/browse/SPARK-39725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-39725: -- Fix Version/s: 3.3.2 > Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622 > > > Key: SPARK-39725 > URL: https://issues.apache.org/jira/browse/SPARK-39725 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0, 3.3.2 > > Attachments: jetty-io-spark.png > > > [Release note |https://github.com/eclipse/jetty.project/releases] > [CVE-2022-2047|https://nvd.nist.gov/vuln/detail/CVE-2022-2047] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39725) Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622
[ https://issues.apache.org/jira/browse/SPARK-39725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612831#comment-17612831 ] Dongjoon Hyun commented on SPARK-39725: --- This landed to branch-3.3 via [https://github.com/apache/spark/pull/38098] > Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622 > > > Key: SPARK-39725 > URL: https://issues.apache.org/jira/browse/SPARK-39725 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 > > Attachments: jetty-io-spark.png > > > [Release note |https://github.com/eclipse/jetty.project/releases] > [CVE-2022-2047|https://nvd.nist.gov/vuln/detail/CVE-2022-2047] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40281) Memory Profiler on Executors
[ https://issues.apache.org/jira/browse/SPARK-40281?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xinrong Meng updated SPARK-40281: - Description: Profiling is critical to performance engineering. Memory consumption is a key indicator of how efficient a PySpark program is. There is an existing effort on memory profiling of Python programs, Memory Profiler ([https://pypi.org/project/memory-profiler/]). PySpark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program. On the driver side, PySpark is a regular Python process; thus, we can profile it as a normal Python program using Memory Profiler. However, on the executor side, we are missing such a memory profiler. Since executors are distributed on different nodes in the cluster, we need to aggregate profiles. Furthermore, Python worker processes are spawned per executor for Python/Pandas UDF execution, which makes memory profiling more intricate. The umbrella proposes to implement a Memory Profiler on Executors. was: Profiling is critical to performance engineering. Memory consumption is a key indicator of how efficient a PySpark program is. There is an existing effort on memory profiling of Python programs, Memory Profiler ([https://pypi.org/project/memory-profiler/]). PySpark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the driver program. On the driver side, PySpark is a regular Python process; thus, we can profile it as a normal Python program using Memory Profiler. However, on the executor side, we are missing such a memory profiler. Since executors are distributed on different nodes in the cluster, we need to need to aggregate profiles. Furthermore, Python worker processes are spawned per executor for Python/Pandas UDF execution, which makes memory profiling more intricate. The umbrella proposes to implement a Memory Profiler on Executors. > Memory Profiler on Executors > > > Key: SPARK-40281 > URL: https://issues.apache.org/jira/browse/SPARK-40281 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Xinrong Meng >Priority: Major > > Profiling is critical to performance engineering. Memory consumption is a key > indicator of how efficient a PySpark program is. There is an existing effort > on memory profiling of Python programs, Memory Profiler > ([https://pypi.org/project/memory-profiler/]). > PySpark applications run as independent sets of processes on a cluster, > coordinated by the SparkContext object in the driver program. On the driver > side, PySpark is a regular Python process; thus, we can profile it as a > normal Python program using Memory Profiler. > However, on the executor side, we are missing such a memory profiler. Since > executors are distributed on different nodes in the cluster, we need to > aggregate profiles. Furthermore, Python worker processes are spawned per > executor for Python/Pandas UDF execution, which makes memory > profiling more intricate. > The umbrella proposes to implement a Memory Profiler on Executors.
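[Memory Profiler, linked above, is the tool named for the driver side. As a dependency-free illustration of the kind of per-step memory measurement the umbrella wants on executors, here is my own sketch using Python's stdlib tracemalloc; it is not part of the proposal, and `build_rows` is a hypothetical stand-in for a UDF body.]

```python
import tracemalloc

def profile_step(fn, *args):
    """Run fn and report its peak incremental memory use in bytes."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()  # (current, peak) since start()
    finally:
        tracemalloc.stop()
    return result, peak

def build_rows(n: int) -> list:
    # Stand-in for a Python/Pandas UDF body executing in a worker process.
    return [(i, str(i)) for i in range(n)]

rows, peak = profile_step(build_rows, 100_000)
print(f"peak ~{peak / 1e6:.1f} MB for {len(rows)} rows")
```

In a real executor-side profiler, numbers like `peak` would have to be collected per Python worker and aggregated across the cluster, which is exactly the intricacy the description calls out.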
[jira] [Resolved] (SPARK-40428) Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources during abnormal shutdown
[ https://issues.apache.org/jira/browse/SPARK-40428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-40428. -- Fix Version/s: 3.4.0 Assignee: Holden Karau Resolution: Fixed > Add a shutdownhook to CoarseGrained scheduler to avoid dangling resources > during abnormal shutdown > -- > > Key: SPARK-40428 > URL: https://issues.apache.org/jira/browse/SPARK-40428 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core >Affects Versions: 3.4.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > Fix For: 3.4.0 > > > Add a shutdown hook in the CoarseGrainedSchedulerBackend to call stop > since we've got zombie pods hanging around since the resource tie isn't > perfect. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
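[The fix above lives in Spark's CoarseGrainedSchedulerBackend. As a general, hedged illustration of the pattern (register a hook so stop() also runs on abnormal shutdown, and make stop() idempotent so the hook plus a normal stop is harmless), here is a toy Python sketch using atexit in place of a JVM shutdown hook; the class and fields are hypothetical.]

```python
import atexit
import threading

class ToySchedulerBackend:
    """Toy backend holding external resources (think: executor pods)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._stopped = False
        self.released = 0
        # Ensure stop() also runs if the process exits without a clean stop,
        # analogous to registering a JVM shutdown hook.
        atexit.register(self.stop)

    def stop(self):
        with self._lock:
            if self._stopped:      # idempotent: hook + explicit stop is safe
                return
            self._stopped = True
            self.released += 1     # release pods/resources exactly once

backend = ToySchedulerBackend()
backend.stop()   # normal shutdown path
backend.stop()   # second call (e.g. the registered hook) is a no-op
```

The idempotence guard matters: without it, the shutdown hook firing after a normal stop would double-release resources.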
[jira] [Commented] (SPARK-40540) Migrate compilation errors onto error classes
[ https://issues.apache.org/jira/browse/SPARK-40540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612829#comment-17612829 ] Apache Spark commented on SPARK-40540: -- User 'itholic' has created a pull request for this issue: https://github.com/apache/spark/pull/38104 > Migrate compilation errors onto error classes > - > > Key: SPARK-40540 > URL: https://issues.apache.org/jira/browse/SPARK-40540 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Fix For: 3.4.0 > > > Use temporary error classes in the compilation exceptions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40585) Support double-quoted identifiers
[ https://issues.apache.org/jira/browse/SPARK-40585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-40585. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38022 [https://github.com/apache/spark/pull/38022] > Support double-quoted identifiers > - > > Key: SPARK-40585 > URL: https://issues.apache.org/jira/browse/SPARK-40585 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: Serge Rielau >Priority: Major > Fix For: 3.4.0 > > > In many SQL dialects, identifiers can be unquoted or quoted with double quotes. > In Spark, double-quoted literals are strings. > In this proposal we allow for a config: > double_quoted_identifiers > which, when set, switches the interpretation from string to identifier. > Note that backticks are still allowed. > Also the treatment of escapes is not changed as part of this work.
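[A toy classifier, not Spark's lexer, sketching the semantics the proposal describes: backticks always delimit identifiers, while double quotes flip between string literal and identifier under the config. The config name `double_quoted_identifiers` is taken from the issue text; the final Spark config key may differ.]

```python
def classify(token: str, double_quoted_identifiers: bool = False) -> tuple:
    """Classify a quoted SQL token as ('string', body) or ('identifier', body)."""
    if token.startswith("`") and token.endswith("`"):
        return ("identifier", token[1:-1])     # backticks: always identifiers
    if token.startswith('"') and token.endswith('"'):
        # The config switches double quotes from string literal to identifier.
        kind = "identifier" if double_quoted_identifiers else "string"
        return (kind, token[1:-1])
    return ("identifier", token)               # unquoted: a plain identifier

print(classify('"col"'))                                  # ('string', 'col')
print(classify('"col"', double_quoted_identifiers=True))  # ('identifier', 'col')
print(classify('`col`'))                                  # ('identifier', 'col')
```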
[jira] [Assigned] (SPARK-40585) Support double-quoted identifiers
[ https://issues.apache.org/jira/browse/SPARK-40585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-40585: -- Assignee: Serge Rielau > Support double-quoted identifiers > - > > Key: SPARK-40585 > URL: https://issues.apache.org/jira/browse/SPARK-40585 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Assignee: Serge Rielau >Priority: Major > > In many SQL dialects, identifiers can be unquoted or quoted with double quotes. > In Spark, double-quoted literals imply strings. > In this proposal we allow for a config: > double_quoted_identifiers > which, when set, switches the interpretation from string to identifier. > Note that back ticks are still allowed. > Also, the treatment of escapes is not changed as part of this work. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40617) Assertion failed in ExecutorMetricsPoller "task count shouldn't below 0"
[ https://issues.apache.org/jira/browse/SPARK-40617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-40617: --- Fix Version/s: 3.2.0 > Assertion failed in ExecutorMetricsPoller "task count shouldn't below 0" > > > Key: SPARK-40617 > URL: https://issues.apache.org/jira/browse/SPARK-40617 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Attila Zsolt Piros >Assignee: Attila Zsolt Piros >Priority: Major > Fix For: 3.2.0, 3.4.0, 3.3.1 > > > Spurious failures because of the assert: > {noformat} > 22/09/29 09:46:24 ERROR SparkUncaughtExceptionHandler: Uncaught exception in > thread Thread[Executor task launch worker for task 3063.0 in stage 1997.0 > (TID 677249),5,main] > java.lang.AssertionError: assertion failed: task count shouldn't below 0 > at scala.Predef$.assert(Predef.scala:223) > at > org.apache.spark.executor.ExecutorMetricsPoller.decrementCount$1(ExecutorMetricsPoller.scala:130) > at > org.apache.spark.executor.ExecutorMetricsPoller.$anonfun$onTaskCompletion$3(ExecutorMetricsPoller.scala:135) > at > java.base/java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1822) > at > org.apache.spark.executor.ExecutorMetricsPoller.onTaskCompletion(ExecutorMetricsPoller.scala:135) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:737) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > 22/09/29 09:46:24 INFO MemoryStore: MemoryStore cleared > 22/09/29 09:46:24 INFO BlockManager: BlockManager stopped > 22/09/29 09:46:24 INFO ShutdownHookManager: Shutdown hook called > 22/09/29 09:46:24 INFO ShutdownHookManager: Deleting directory > 
/mnt/yarn/usercache/hadoop/appcache/application_1664443624160_0001/spark-93efc2d4-84de-494b-a3b7-2cb1c3a45426 > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
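The stack trace above shows the assertion firing inside a `computeIfPresent` that decrements a per-stage task count. A hypothetical simplification of that bookkeeping (not the actual ExecutorMetricsPoller code) shows a defensive variant that clamps at zero instead of asserting, so a spurious double-completion cannot drive the count negative:

```scala
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Hypothetical sketch of per-stage task counting similar in spirit to what the
// stack trace points at; names and structure are illustrative only.
val taskCounts = new ConcurrentHashMap[(Int, Int), AtomicLong]() // (stageId, attemptId) -> running tasks

def onTaskStart(stage: (Int, Int)): Unit =
  taskCounts.computeIfAbsent(stage, _ => new AtomicLong(0)).incrementAndGet()

def onTaskCompletion(stage: (Int, Int)): Unit =
  taskCounts.computeIfPresent(stage, (_, count) => {
    // Clamp instead of asserting "task count shouldn't below 0".
    if (count.get() > 0) count.decrementAndGet()
    count
  })
```

`computeIfPresent` runs the remapping function atomically for the key, which is why the decrement and the zero guard can live together without extra locking.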
[jira] [Commented] (SPARK-40659) Schema evolution for protobuf (and Avro too?)
[ https://issues.apache.org/jira/browse/SPARK-40659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612811#comment-17612811 ] Mohan Parthasarathy commented on SPARK-40659: - [~rangadi] A few clarifications. I am trying to understand the conditions under which the error is thrown. Using Confluent schema registry terminology, let's take a couple of examples: 1) BACKWARD: Assuming the schema has evolved as per the rules, a consumer using the latest schema can read messages written with both the old and the latest schema. 2) FORWARD: Similarly, a consumer using the older schema can read messages written with a later schema; it would just ignore the new fields. In these cases, it will continue to work. Why would we throw an error in these cases? What other cases need an error to be thrown? Could you elaborate? > Schema evolution for protobuf (and Avro too?) > - > > Key: SPARK-40659 > URL: https://issues.apache.org/jira/browse/SPARK-40659 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > Protobuf & Avro should support schema evolution in streaming. We need to > throw a specific error message when we detect a newer version of the schema > in the schema registry. > A couple of options for detecting a version change at runtime: > * How do we detect a newer version from the schema registry? It is contacted only > during planning currently. > * We could detect the version id in incoming messages. > ** What if the id in the incoming message is newer than what our > schema-registry reports after the restart? > *** This indicates delayed syncs between the customer's schema-registry servers > (should be rare). We can keep erroring out until it is fixed. > *** Make sure we log the schema id used during planning. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
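The second option discussed above, detecting the version id in incoming messages, could be sketched as below. Confluent's wire format frames each message as one magic byte (0) followed by a 4-byte big-endian schema id and then the payload; the error-handling policy here is illustrative, not the issue's final design:

```scala
import java.nio.ByteBuffer

// Confluent wire format: 1 magic byte (0) + 4-byte big-endian schema id + payload.
def schemaIdOf(message: Array[Byte]): Int = {
  require(message.length >= 5 && message(0) == 0, "not a Confluent-framed message")
  ByteBuffer.wrap(message, 1, 4).getInt
}

// Hypothetical runtime check: compare against the id captured during planning,
// which the issue suggests logging for exactly this purpose.
def checkSchemaVersion(message: Array[Byte], planningTimeId: Int): Unit = {
  val id = schemaIdOf(message)
  if (id != planningTimeId)
    throw new IllegalStateException(
      s"Message uses schema id $id but the query was planned with id $planningTimeId; " +
        "restart the query to pick up the evolved schema")
}
```

This check works without contacting the registry per record, which matters since the registry is only consulted during planning.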
[jira] [Created] (SPARK-40659) Schema evolution for protobuf (and Avro too?)
Raghu Angadi created SPARK-40659: Summary: Schema evolution for protobuf (and Avro too?) Key: SPARK-40659 URL: https://issues.apache.org/jira/browse/SPARK-40659 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.3.0 Reporter: Raghu Angadi Protobuf & Avro should support schema evolution in streaming. We need to throw a specific error message when we detect a newer version of the schema in the schema registry. A couple of options for detecting a version change at runtime: * How do we detect a newer version from the schema registry? It is contacted only during planning currently. * We could detect the version id in incoming messages. ** What if the id in the incoming message is newer than what our schema-registry reports after the restart? *** This indicates delayed syncs between the customer's schema-registry servers (should be rare). We can keep erroring out until it is fixed. *** Make sure we log the schema id used during planning. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40658) Protobuf v2 & v3 support
Raghu Angadi created SPARK-40658: Summary: Protobuf v2 & v3 support Key: SPARK-40658 URL: https://issues.apache.org/jira/browse/SPARK-40658 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.3.0 Reporter: Raghu Angadi We want to ensure Protobuf functions support both Protobuf version 2 and version 3 schemas (e.g. descriptor file or compiled classes with v2 and v3). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40657) Add support for compiled classes (Java classes)
Raghu Angadi created SPARK-40657: Summary: Add support for compiled classes (Java classes) Key: SPARK-40657 URL: https://issues.apache.org/jira/browse/SPARK-40657 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.3.0 Reporter: Raghu Angadi For some users, it is more convenient to provide compiled classes rather than a descriptor file. We can support Java compiled classes. Python could also use the same, since all the processing happens in Scala. Supporting Python compiled classes is out of scope for this; it is not clear how well we could support that, short of using a Python UDF. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40654) Protobuf support MVP with descriptor files
[ https://issues.apache.org/jira/browse/SPARK-40654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40654: Assignee: Apache Spark > Protobuf support MVP with descriptor files > -- > > Key: SPARK-40654 > URL: https://issues.apache.org/jira/browse/SPARK-40654 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Assignee: Apache Spark >Priority: Major > > This is the MVP implementation of protobuf support with descriptor files. > Currently in PR https://github.com/apache/spark/pull/37972 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40654) Protobuf support MVP with descriptor files
[ https://issues.apache.org/jira/browse/SPARK-40654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40654: Assignee: (was: Apache Spark) > Protobuf support MVP with descriptor files > -- > > Key: SPARK-40654 > URL: https://issues.apache.org/jira/browse/SPARK-40654 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > This is the MVP implementation of protobuf support with descriptor files. > Currently in PR https://github.com/apache/spark/pull/37972 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40654) Protobuf support MVP with descriptor files
[ https://issues.apache.org/jira/browse/SPARK-40654?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612802#comment-17612802 ] Apache Spark commented on SPARK-40654: -- User 'SandishKumarHN' has created a pull request for this issue: https://github.com/apache/spark/pull/37972 > Protobuf support MVP with descriptor files > -- > > Key: SPARK-40654 > URL: https://issues.apache.org/jira/browse/SPARK-40654 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > This is the MVP implementation of protobuf support with descriptor files. > Currently in PR https://github.com/apache/spark/pull/37972 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40655) Protobuf functions in Python
[ https://issues.apache.org/jira/browse/SPARK-40655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40655: Assignee: (was: Apache Spark) > Protobuf functions in Python > - > > Key: SPARK-40655 > URL: https://issues.apache.org/jira/browse/SPARK-40655 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > Add Python support for Protobuf functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-40648: - Assignee: Yang Jie (was: Apache Spark) > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0, 3.3.2 > > > SPARK-40490 made the test cases related to `YarnShuffleIntegrationSuite` > verify the registeredExecFile reload scenario again, so we need > to add `@ExtendedLevelDBTest` to the test cases using LevelDB so that > `macOS/Apple Silicon` can skip the relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40656) Schema-registry support for Protobuf format
Raghu Angadi created SPARK-40656: Summary: Schema-registry support for Protobuf format Key: SPARK-40656 URL: https://issues.apache.org/jira/browse/SPARK-40656 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.3.0 Reporter: Raghu Angadi Add support for reading protobuf schema (definition) from Confluent schema-registry. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40648: -- Fix Version/s: 3.3.2 > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0, 3.3.2 > > > SPARK-40490 made the test cases related to `YarnShuffleIntegrationSuite` > verify the registeredExecFile reload scenario again, so we need > to add `@ExtendedLevelDBTest` to the test cases using LevelDB so that > `macOS/Apple Silicon` can skip the relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612799#comment-17612799 ] Dongjoon Hyun commented on SPARK-40648: --- This landed on branch-3.3 via [https://github.com/apache/spark/pull/38096] > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0, 3.3.2 > > > SPARK-40490 made the test cases related to `YarnShuffleIntegrationSuite` > verify the registeredExecFile reload scenario again, so we need > to add `@ExtendedLevelDBTest` to the test cases using LevelDB so that > `macOS/Apple Silicon` can skip the relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40655) Protobuf functions in Python
[ https://issues.apache.org/jira/browse/SPARK-40655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612800#comment-17612800 ] Apache Spark commented on SPARK-40655: -- User 'SandishKumarHN' has created a pull request for this issue: https://github.com/apache/spark/pull/38100 > Protobuf functions in Python > - > > Key: SPARK-40655 > URL: https://issues.apache.org/jira/browse/SPARK-40655 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > Add Python support for Protobuf functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40655) Protobuf functions in Python
[ https://issues.apache.org/jira/browse/SPARK-40655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40655: Assignee: Apache Spark > Protobuf functions in Python > - > > Key: SPARK-40655 > URL: https://issues.apache.org/jira/browse/SPARK-40655 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Assignee: Apache Spark >Priority: Major > > Add Python support for Protobuf functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40655) Protobuf functions in Python
Raghu Angadi created SPARK-40655: Summary: Protobuf functions in Python Key: SPARK-40655 URL: https://issues.apache.org/jira/browse/SPARK-40655 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.3.0 Reporter: Raghu Angadi Add Python support for Protobuf functions -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40654) Protobuf support MVP with descriptor files
Raghu Angadi created SPARK-40654: Summary: Protobuf support MVP with descriptor files Key: SPARK-40654 URL: https://issues.apache.org/jira/browse/SPARK-40654 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.3.0 Reporter: Raghu Angadi This is the MVP implementation of protobuf support with descriptor files. Currently in PR https://github.com/apache/spark/pull/37972 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40648. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38095 [https://github.com/apache/spark/pull/38095] > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > Fix For: 3.4.0 > > > SPARK-40490 made the test cases related to `YarnShuffleIntegrationSuite` > verify the registeredExecFile reload scenario again, so we need > to add `@ExtendedLevelDBTest` to the test cases using LevelDB so that > `macOS/Apple Silicon` can skip the relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40652) Add MASK_PHONE and TRY_MASK_PHONE functions
[ https://issues.apache.org/jira/browse/SPARK-40652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612796#comment-17612796 ] Apache Spark commented on SPARK-40652: -- User 'dtenedor' has created a pull request for this issue: https://github.com/apache/spark/pull/38101 > Add MASK_PHONE and TRY_MASK_PHONE functions > --- > > Key: SPARK-40652 > URL: https://issues.apache.org/jira/browse/SPARK-40652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40652) Add MASK_PHONE and TRY_MASK_PHONE functions
[ https://issues.apache.org/jira/browse/SPARK-40652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40652: Assignee: (was: Apache Spark) > Add MASK_PHONE and TRY_MASK_PHONE functions > --- > > Key: SPARK-40652 > URL: https://issues.apache.org/jira/browse/SPARK-40652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner
[ https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-40636: - Assignee: Zhongwei Zhu > Fix wrong remained shuffles log in BlockManagerDecommissioner > - > > Key: SPARK-40636 > URL: https://issues.apache.org/jira/browse/SPARK-40636 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > > BlockManagerDecommissioner should log correct remained shuffles. > {code:java} > 4 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:15.035 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:45.069 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
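The log lines quoted in the description keep reporting the full total as "remained" even as shuffles are migrated; the remaining count should be the total minus what has already been added. A hypothetical sketch of the corrected bookkeeping (not the actual BlockManagerDecommissioner code):

```scala
// Hypothetical sketch: track already-migrated shuffles so the "remained"
// figure actually shrinks between log lines.
var migrated = Set.empty[Int]

def migrationLogLine(allShuffles: Set[Int], newlyAdded: Set[Int]): String = {
  migrated ++= newlyAdded
  val remained = allShuffles.size - migrated.size
  s"${newlyAdded.size} of ${allShuffles.size} local shuffles are added. " +
    s"In total, $remained shuffles are remained."
}
```

With this, the sequence from the description would log 20 remaining after the first 4 are added, instead of repeating 24.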
[jira] [Assigned] (SPARK-40652) Add MASK_PHONE and TRY_MASK_PHONE functions
[ https://issues.apache.org/jira/browse/SPARK-40652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40652: Assignee: Apache Spark > Add MASK_PHONE and TRY_MASK_PHONE functions > --- > > Key: SPARK-40652 > URL: https://issues.apache.org/jira/browse/SPARK-40652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40652) Add MASK_PHONE and TRY_MASK_PHONE functions
[ https://issues.apache.org/jira/browse/SPARK-40652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612795#comment-17612795 ] Apache Spark commented on SPARK-40652: -- User 'dtenedor' has created a pull request for this issue: https://github.com/apache/spark/pull/38101 > Add MASK_PHONE and TRY_MASK_PHONE functions > --- > > Key: SPARK-40652 > URL: https://issues.apache.org/jira/browse/SPARK-40652 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner
[ https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-40636. --- Fix Version/s: 3.3.2 3.2.3 3.4.0 Resolution: Fixed Issue resolved by pull request 38078 [https://github.com/apache/spark/pull/38078] > Fix wrong remained shuffles log in BlockManagerDecommissioner > - > > Key: SPARK-40636 > URL: https://issues.apache.org/jira/browse/SPARK-40636 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.3.2, 3.2.3, 3.4.0 > > > BlockManagerDecommissioner should log correct remained shuffles. > {code:java} > 4 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:15.035 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:45.069 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40653) Protobuf Support in Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-40653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612793#comment-17612793 ] Raghu Angadi commented on SPARK-40653: -- cc: [~sanysand...@gmail.com] , [~mparthas] > Protobuf Support in Structured Streaming > > > Key: SPARK-40653 > URL: https://issues.apache.org/jira/browse/SPARK-40653 > Project: Spark > Issue Type: Epic > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: Raghu Angadi >Priority: Major > > Add support for Protobuf messages in streaming sources. This would be similar > to Avro format support. This includes features like schema-registry, Python > support, schema evolution, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40653) Protobuf Support in Structured Streaming
Raghu Angadi created SPARK-40653: Summary: Protobuf Support in Structured Streaming Key: SPARK-40653 URL: https://issues.apache.org/jira/browse/SPARK-40653 Project: Spark Issue Type: Epic Components: Structured Streaming Affects Versions: 3.3.0 Reporter: Raghu Angadi Add support for Protobuf messages in streaming sources. This would be similar to Avro format support. This includes features like schema-registry, Python support, schema evolution, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40652) Add MASK_PHONE and TRY_MASK_PHONE functions
Daniel created SPARK-40652: -- Summary: Add MASK_PHONE and TRY_MASK_PHONE functions Key: SPARK-40652 URL: https://issues.apache.org/jira/browse/SPARK-40652 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.4.0 Reporter: Daniel -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
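The ticket carries no description, but by analogy with the rest of the MASK function family the function presumably redacts the digits of a phone number while preserving its formatting. A purely hypothetical sketch (the actual MASK_PHONE and TRY_MASK_PHONE semantics are defined by the linked PR, not by this code):

```scala
// Hypothetical illustration only: replace every digit, keep punctuation.
// Spark's real MASK_PHONE behavior is defined by the implementing PR.
def maskPhone(phone: String, mask: Char = '*'): String =
  phone.map(c => if (c.isDigit) mask else c)
```

A TRY_ variant would, following the usual TRY_ convention, return null instead of raising an error on invalid input.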
[jira] [Commented] (SPARK-40651) Drop Hadoop2 binary distribution from release process
[ https://issues.apache.org/jira/browse/SPARK-40651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612764#comment-17612764 ] Apache Spark commented on SPARK-40651: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/38099 > Drop Hadoop2 binary distribution from release process > - > > Key: SPARK-40651 > URL: https://issues.apache.org/jira/browse/SPARK-40651 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40651) Drop Hadoop2 binary distribution from release process
[ https://issues.apache.org/jira/browse/SPARK-40651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612762#comment-17612762 ] Apache Spark commented on SPARK-40651: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/38099 > Drop Hadoop2 binary distribution from release process > - > > Key: SPARK-40651 > URL: https://issues.apache.org/jira/browse/SPARK-40651 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40651) Drop Hadoop2 binary distribution from release process
[ https://issues.apache.org/jira/browse/SPARK-40651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40651: Assignee: Apache Spark > Drop Hadoop2 binary distribution from release process > - > > Key: SPARK-40651 > URL: https://issues.apache.org/jira/browse/SPARK-40651 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40651) Drop Hadoop2 binary distribution from release process
[ https://issues.apache.org/jira/browse/SPARK-40651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40651: Assignee: (was: Apache Spark) > Drop Hadoop2 binary distribution from release process > - > > Key: SPARK-40651 > URL: https://issues.apache.org/jira/browse/SPARK-40651 > Project: Spark > Issue Type: Task > Components: Project Infra >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40651) Drop Hadoop2 binary distribution from release process
Dongjoon Hyun created SPARK-40651: - Summary: Drop Hadoop2 binary distribution from release process Key: SPARK-40651 URL: https://issues.apache.org/jira/browse/SPARK-40651 Project: Spark Issue Type: Task Components: Project Infra Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40650) Infer date type for Json schema inference
Xiaonan Yang created SPARK-40650: Summary: Infer date type for Json schema inference Key: SPARK-40650 URL: https://issues.apache.org/jira/browse/SPARK-40650 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.4.0 Reporter: Xiaonan Yang Fix For: 3.4.0 In ticket https://issues.apache.org/jira/browse/SPARK-39469, we introduced date type support in CSV schema inference. We want to introduce similar support for the JSON data source. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40649) Infer date type for Json schema inference
[ https://issues.apache.org/jira/browse/SPARK-40649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaonan Yang updated SPARK-40649: - Fix Version/s: (was: 3.4.0) > Infer date type for Json schema inference > - > > Key: SPARK-40649 > URL: https://issues.apache.org/jira/browse/SPARK-40649 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: Xiaonan Yang >Assignee: Jonathan Cui >Priority: Major > > 1. If a column contains only dates, it should be of “date” type in the > inferred schema > * If the date format and the timestamp format are identical (e.g. both are > /mm/dd), entries will default to being interpreted as Date > 2. If a column contains dates and timestamps, it should be of “timestamp” > type in the inferred schema > > A similar issue was opened in the past but was reverted due to the lack of > strict pattern matching. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-40649) Infer date type for Json schema inference
[ https://issues.apache.org/jira/browse/SPARK-40649?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiaonan Yang resolved SPARK-40649. -- Resolution: Duplicate > Infer date type for Json schema inference > - > > Key: SPARK-40649 > URL: https://issues.apache.org/jira/browse/SPARK-40649 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.1 >Reporter: Xiaonan Yang >Assignee: Jonathan Cui >Priority: Major > Fix For: 3.4.0 > > > 1. If a column contains only dates, it should be of “date” type in the > inferred schema > * If the date format and the timestamp format are identical (e.g. both are > /mm/dd), entries will default to being interpreted as Date > 2. If a column contains dates and timestamps, it should be of “timestamp” > type in the inferred schema > > A similar issue was opened in the past but was reverted due to the lack of > strict pattern matching. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40649) Infer date type for Json schema inference
Xiaonan Yang created SPARK-40649: Summary: Infer date type for Json schema inference Key: SPARK-40649 URL: https://issues.apache.org/jira/browse/SPARK-40649 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.1 Reporter: Xiaonan Yang Assignee: Jonathan Cui Fix For: 3.4.0 1. If a column contains only dates, it should be of “date” type in the inferred schema * If the date format and the timestamp format are identical (e.g. both are /mm/dd), entries will default to being interpreted as Date 2. If a column contains dates and timestamps, it should be of “timestamp” type in the inferred schema A similar issue was opened in the past but was reverted due to the lack of strict pattern matching. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
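The two inference rules above can be sketched as a small, self-contained Scala example. This is an illustrative model only, not Spark's actual JSON schema-inference code; the type names, the `yyyy-MM-dd` date pattern, and ISO-instant timestamps are assumptions made for the sketch.

```scala
import java.time.{Instant, LocalDate}
import java.time.format.DateTimeFormatter
import scala.util.Try

object DateInferenceSketch {
  private val dateFmt = DateTimeFormatter.ofPattern("yyyy-MM-dd")

  // Classify a single JSON string value; strict parsing rejects trailing text,
  // so a full timestamp never matches the date pattern.
  def inferOne(v: String): String =
    if (Try(LocalDate.parse(v, dateFmt)).isSuccess) "date"
    else if (Try(Instant.parse(v)).isSuccess) "timestamp"
    else "string"

  // Rule 1: identical types stay as they are (all dates => "date").
  // Rule 2: a mix of dates and timestamps widens to "timestamp".
  // Anything else falls back to "string".
  def merge(a: String, b: String): String = (a, b) match {
    case (x, y) if x == y                              => x
    case ("date", "timestamp") | ("timestamp", "date") => "timestamp"
    case _                                             => "string"
  }

  def inferColumn(values: Seq[String]): String =
    values.map(inferOne).reduce(merge)
}
```

Under these assumptions, a column of only `"2022-09-03"`-style values infers as date, while mixing in `"2022-09-03T12:00:00Z"` widens the column to timestamp.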
[jira] [Commented] (SPARK-39725) Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622
[ https://issues.apache.org/jira/browse/SPARK-39725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612727#comment-17612727 ] Apache Spark commented on SPARK-39725: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/38098 > Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622 > > > Key: SPARK-39725 > URL: https://issues.apache.org/jira/browse/SPARK-39725 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 > > Attachments: jetty-io-spark.png > > > [Release note |https://github.com/eclipse/jetty.project/releases] > [CVE-2022-2047|https://nvd.nist.gov/vuln/detail/CVE-2022-2047] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39725) Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622
[ https://issues.apache.org/jira/browse/SPARK-39725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612711#comment-17612711 ] Bjørn Jørgensen commented on SPARK-39725: - I created a branch from 3.3, but it started to build against master. https://github.com/bjornjorgensen/spark/tree/3.3_jetty_48.xx > Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622 > > > Key: SPARK-39725 > URL: https://issues.apache.org/jira/browse/SPARK-39725 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 > > Attachments: jetty-io-spark.png > > > [Release note |https://github.com/eclipse/jetty.project/releases] > [CVE-2022-2047|https://nvd.nist.gov/vuln/detail/CVE-2022-2047] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39725) Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622
[ https://issues.apache.org/jira/browse/SPARK-39725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612707#comment-17612707 ] Sean R. Owen commented on SPARK-39725: -- [~bjornjorgensen] well, it would need to be a change vs branch-3.3, which is already on 9.4.46: https://github.com/apache/spark/blob/branch-3.3/pom.xml#L136 But it's a simple change yes. > Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622 > > > Key: SPARK-39725 > URL: https://issues.apache.org/jira/browse/SPARK-39725 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 > > Attachments: jetty-io-spark.png > > > [Release note |https://github.com/eclipse/jetty.project/releases] > [CVE-2022-2047|https://nvd.nist.gov/vuln/detail/CVE-2022-2047] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-39725) Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622
[ https://issues.apache.org/jira/browse/SPARK-39725 ] Bjørn Jørgensen deleted comment on SPARK-39725: - was (Author: bjornjorgensen): [~srowen] LIke [this|https://github.com/bjornjorgensen/spark/tree/3.3-etty_48.v20220622] one? > Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622 > > > Key: SPARK-39725 > URL: https://issues.apache.org/jira/browse/SPARK-39725 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 > > Attachments: jetty-io-spark.png > > > [Release note |https://github.com/eclipse/jetty.project/releases] > [CVE-2022-2047|https://nvd.nist.gov/vuln/detail/CVE-2022-2047] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39725) Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622
[ https://issues.apache.org/jira/browse/SPARK-39725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612704#comment-17612704 ] Bjørn Jørgensen commented on SPARK-39725: - [~srowen] Like [this|https://github.com/bjornjorgensen/spark/tree/3.3-etty_48.v20220622] one? > Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622 > > > Key: SPARK-39725 > URL: https://issues.apache.org/jira/browse/SPARK-39725 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 > > Attachments: jetty-io-spark.png > > > [Release note |https://github.com/eclipse/jetty.project/releases] > [CVE-2022-2047|https://nvd.nist.gov/vuln/detail/CVE-2022-2047] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39725) Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622
[ https://issues.apache.org/jira/browse/SPARK-39725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612689#comment-17612689 ] Sean R. Owen commented on SPARK-39725: -- I don't know if this affects Spark, but I think it's fine to back-port this update to 3.3.x. [~bjornjorgensen] are you able to do that now? we could get it in for 3.3.1. Or I can. > Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622 > > > Key: SPARK-39725 > URL: https://issues.apache.org/jira/browse/SPARK-39725 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 > > Attachments: jetty-io-spark.png > > > [Release note |https://github.com/eclipse/jetty.project/releases] > [CVE-2022-2047|https://nvd.nist.gov/vuln/detail/CVE-2022-2047] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612680#comment-17612680 ] Apache Spark commented on SPARK-40648: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38097 > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > SPARK-40490 make the test case related to `YarnShuffleIntegrationSuite` > starts to verify the registeredExecFile reload test scenario again,so we need > to add `@ExtendedLevelDBTest` for the test case using LevelDB so that the > `MacOs/Apple Silicon` can skip relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-39725) Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622
[ https://issues.apache.org/jira/browse/SPARK-39725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612681#comment-17612681 ] phoebe chen commented on SPARK-39725: - Thanks(y) > Upgrade jetty-http from 9.4.46.v20220331 to 9.4.48.v20220622 > > > Key: SPARK-39725 > URL: https://issues.apache.org/jira/browse/SPARK-39725 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Bjørn Jørgensen >Priority: Major > Fix For: 3.4.0 > > Attachments: jetty-io-spark.png > > > [Release note |https://github.com/eclipse/jetty.project/releases] > [CVE-2022-2047|https://nvd.nist.gov/vuln/detail/CVE-2022-2047] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612679#comment-17612679 ] Apache Spark commented on SPARK-40648: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38097 > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > SPARK-40490 make the test case related to `YarnShuffleIntegrationSuite` > starts to verify the registeredExecFile reload test scenario again,so we need > to add `@ExtendedLevelDBTest` for the test case using LevelDB so that the > `MacOs/Apple Silicon` can skip relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612676#comment-17612676 ] Apache Spark commented on SPARK-40648: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38096 > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > SPARK-40490 make the test case related to `YarnShuffleIntegrationSuite` > starts to verify the registeredExecFile reload test scenario again,so we need > to add `@ExtendedLevelDBTest` for the test case using LevelDB so that the > `MacOs/Apple Silicon` can skip relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612674#comment-17612674 ] Apache Spark commented on SPARK-40648: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38096 > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > SPARK-40490 make the test case related to `YarnShuffleIntegrationSuite` > starts to verify the registeredExecFile reload test scenario again,so we need > to add `@ExtendedLevelDBTest` for the test case using LevelDB so that the > `MacOs/Apple Silicon` can skip relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40648: Assignee: Apache Spark > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > SPARK-40490 make the test case related to `YarnShuffleIntegrationSuite` > starts to verify the registeredExecFile reload test scenario again,so we need > to add `@ExtendedLevelDBTest` for the test case using LevelDB so that the > `MacOs/Apple Silicon` can skip relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612653#comment-17612653 ] Apache Spark commented on SPARK-40648: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/38095 > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Priority: Major > > SPARK-40490 make the test case related to `YarnShuffleIntegrationSuite` > starts to verify the registeredExecFile reload test scenario again,so we need > to add `@ExtendedLevelDBTest` for the test case using LevelDB so that the > `MacOs/Apple Silicon` can skip relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40648: Assignee: Apache Spark > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Major > > SPARK-40490 make the test case related to `YarnShuffleIntegrationSuite` > starts to verify the registeredExecFile reload test scenario again,so we need > to add `@ExtendedLevelDBTest` for the test case using LevelDB so that the > `MacOs/Apple Silicon` can skip relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40648: Assignee: (was: Apache Spark) > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Priority: Major > > SPARK-40490 make the test case related to `YarnShuffleIntegrationSuite` > starts to verify the registeredExecFile reload test scenario again,so we need > to add `@ExtendedLevelDBTest` for the test case using LevelDB so that the > `MacOs/Apple Silicon` can skip relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40648) Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-40648: - Summary: Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module (was: Add `@ExtendedLevelDBTest` to the testing leveldb in the yarn module) > Add `@ExtendedLevelDBTest` to the leveldb relevant tests in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Priority: Major > > SPARK-40490 make the test case related to `YarnShuffleIntegrationSuite` > starts to verify the registeredExecFile reload test scenario again,so we need > to add `@ExtendedLevelDBTest` for the test case using LevelDB so that the > `MacOs/Apple Silicon` can skip relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40648) Add `@ExtendedLevelDBTest` to the testing leveldb in the yarn module
[ https://issues.apache.org/jira/browse/SPARK-40648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-40648: - Summary: Add `@ExtendedLevelDBTest` to the testing leveldb in the yarn module (was: Add `@ExtendedLevelDBTest` for the case of testing leveldb in the yarn module) > Add `@ExtendedLevelDBTest` to the testing leveldb in the yarn module > -- > > Key: SPARK-40648 > URL: https://issues.apache.org/jira/browse/SPARK-40648 > Project: Spark > Issue Type: Improvement > Components: Tests, YARN >Affects Versions: 3.2.2, 3.4.0, 3.3.1 >Reporter: Yang Jie >Priority: Major > > SPARK-40490 make the test case related to `YarnShuffleIntegrationSuite` > starts to verify the registeredExecFile reload test scenario again,so we need > to add `@ExtendedLevelDBTest` for the test case using LevelDB so that the > `MacOs/Apple Silicon` can skip relevant tests through > `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-40648) Add `@ExtendedLevelDBTest` for the case of testing leveldb in the yarn module
Yang Jie created SPARK-40648: Summary: Add `@ExtendedLevelDBTest` for the case of testing leveldb in the yarn module Key: SPARK-40648 URL: https://issues.apache.org/jira/browse/SPARK-40648 Project: Spark Issue Type: Improvement Components: Tests, YARN Affects Versions: 3.2.2, 3.4.0, 3.3.1 Reporter: Yang Jie SPARK-40490 made the test cases related to `YarnShuffleIntegrationSuite` verify the registeredExecFile reload scenario again, so we need to add `@ExtendedLevelDBTest` to the test cases using LevelDB so that macOS/Apple Silicon machines can skip the relevant tests via `-Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
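Conceptually, tag-based exclusion just filters the candidate test set against the tag names passed on the command line. The following self-contained Scala sketch models that behaviour; `TestCase` and `selectTests` are hypothetical names for illustration, not Spark's or the test framework's actual API.

```scala
// A minimal model of tag-based test exclusion, in the spirit of
// -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest.
case class TestCase(name: String, tags: Set[String])

object TagFilterSketch {
  // Keep only tests carrying none of the excluded tags.
  def selectTests(all: Seq[TestCase], excluded: Set[String]): Seq[TestCase] =
    all.filterNot(t => t.tags.exists(excluded.contains))
}
```

Excluding `org.apache.spark.tags.ExtendedLevelDBTest` in this model drops the LevelDB-backed suites while leaving all other tests selected, which is the effect the issue asks for on macOS/Apple Silicon.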
[jira] [Commented] (SPARK-40618) Bug in MergeScalarSubqueries rule attempting to merge nested subquery with parent
[ https://issues.apache.org/jira/browse/SPARK-40618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612517#comment-17612517 ] Apache Spark commented on SPARK-40618: -- User 'peter-toth' has created a pull request for this issue: https://github.com/apache/spark/pull/38093 > Bug in MergeScalarSubqueries rule attempting to merge nested subquery with > parent > - > > Key: SPARK-40618 > URL: https://issues.apache.org/jira/browse/SPARK-40618 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Daniel >Assignee: Daniel >Priority: Major > Fix For: 3.4.0 > > > There is a bug in the `MergeScalarSubqueries` rule for queries with subquery > expressions nested inside each other, wherein the rule attempts to merge the > nested subquery with its enclosing parent subquery. The result is not a valid > plan and raises an exception in the optimizer. Here is a minimal reproducing > case: > > ``` > sql("create table test(col int) using csv") > checkAnswer(sql("select(select sum((select sum(col) from test)) from test)"), > Row(null)) > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40647) DAGScheduler should fail job until all related running tasks have been killed
[ https://issues.apache.org/jira/browse/SPARK-40647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40647: Assignee: (was: Apache Spark) > DAGScheduler should fail job until all related running tasks have been killed > - > > Key: SPARK-40647 > URL: https://issues.apache.org/jira/browse/SPARK-40647 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: Wechar >Priority: Major > > *Issue Description* > The staging directory within the table location is sometimes not removed when {{CTAS}} fails. > This is a problem when the new table is a managed table and we want to recreate it. > *Root Cause* > SchedulerBackend kills tasks via the {{KillTask}} message, which is asynchronous, so the job may already have failed while its tasks are still running and creating tmp files. > Even though the running tasks will eventually fail and delete the files they generated, the temporary directory is left behind. > *Solution* > Before failing a job, we should make sure that all related running tasks have been killed.
> *How to Reproduce* > Step 1: create a source table and insert data to make the file number exceed > 20 on HDFS > {code:sql} > -- create source table > CREATE TABLE IF NOT EXISTS default.test_wechar > (name string) > PARTITIONED BY (grass_date date) > STORED AS PARQUET > -- insert data 24 times > insert into default.test_wechar partition (grass_date='2022-09-03') > select uuid() > lateral view explode(sequence(1,2000)) as temp_view; > {code} > Step 2: create a new path for the new table and set its quota to 20 > {code:bash} > $hadoop fs -count -q hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp > 20 19none inf1 >0 0 > hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp > {code} > Step 3: create the new table from the source table > {code:sql} > create table if not exists default.test_wechar_tmp > location 'hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp' > as select * from default.test_wechar; > {code} > > Step 4: check the location of the new table after the job fails > {code:bash} > $hadoop fs -ls /user/weiqiang.yu/tmp/test_wechar_tmp/* > Found 1 items > drwxrwxr-x - weiqiang.yu weiqiang.yu 0 2022-10-04 12:56 > /user/weiqiang.yu/tmp/test_wechar_tmp/.hive-staging_hive_2022-10-04_12-56-21_545_2745177084386740362-1/-ext-1 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
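The race between the asynchronous kill and the job-failure cleanup can be illustrated with a small, self-contained Scala sketch. All names here are hypothetical and this is not Spark's DAGScheduler code; it only shows the proposed ordering: send the kill, block until every task has acknowledged termination, and only then run the cleanup that would otherwise race with still-running tasks recreating staging files.

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}
import java.util.concurrent.atomic.AtomicBoolean

object KillThenCleanupSketch {
  // Returns true if all tasks terminated within the timeout and cleanup ran.
  def failJob(runningTasks: Seq[Thread],
              allKilled: CountDownLatch,
              cleanupRan: AtomicBoolean): Boolean = {
    runningTasks.foreach(_.interrupt())             // analogue of async KillTask
    val done = allKilled.await(5, TimeUnit.SECONDS) // block until tasks exit
    if (done) cleanupRan.set(true)                  // safe: no task can recreate files
    done
  }
}
```

In this model the cleanup flag is only set after the latch confirms every task has exited, which is exactly the ordering the issue proposes.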
[jira] [Commented] (SPARK-40647) DAGScheduler should fail job until all related running tasks have been killed
[ https://issues.apache.org/jira/browse/SPARK-40647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612503#comment-17612503 ] Apache Spark commented on SPARK-40647: -- User 'wecharyu' has created a pull request for this issue: https://github.com/apache/spark/pull/38092 > DAGScheduler should fail job until all related running tasks have been killed > - > > Key: SPARK-40647 > URL: https://issues.apache.org/jira/browse/SPARK-40647 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: Wechar >Priority: Major > > *Issue Description* > The staging directory within table location is not removed when {{CTAS}} > fails sometimes. > It is a trouble if the new table is a Managed Table when we want to recreate > it. > *Root Cause* > SchedulerBackend kills tasks via {{KillTask}} message which is asynchronous, > so we may have failed a job but the tasks are still running and create the > tmp file. Even if the running tasks will failed and delete the generated file > finally, but the temporary directory was left. > *Solution* > Before killing a job, we should make sure that all related running tasks have > been killed. 
> *How to Reproduce* > Step 1: create a source table and insert data to make the file number exceeds > 20 on HDFS > {code:sql} > -- create source table > CREATE TABLE IF NOT EXISTS default.test_wechar > (name string) > PARTITIONED BY (grass_date date) > STORED AS PARQUET > -- insert data 24 times > insert into default.test_wechar partition (grass_date='2022-09-03') > select uuid() > lateral view explode(sequence(1,2000)) as temp_view; > {code} > Step 2: create a new path for new table and setQuota to 20 > {code:bash} > $hadoop fs -count -q hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp > 20 19none inf1 >0 0 > hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp > {code} > Step 3: create new table from source table > {code:sql} > create table if not exists default.test_wechar_tmp > location 'hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp' > as select * from default.test_wechar; > {code} > > Step 4: check the location of new table after the job failed > {code:bash} > $hadoop fs -ls /user/weiqiang.yu/tmp/test_wechar_tmp/* > Found 1 items > drwxrwxr-x - weiqiang.yu weiqiang.yu 0 2022-10-04 12:56 > /user/weiqiang.yu/tmp/test_wechar_tmp/.hive-staging_hive_2022-10-04_12-56-21_545_2745177084386740362-1/-ext-1 > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-40647) DAGScheduler should fail job until all related running tasks have been killed
[ https://issues.apache.org/jira/browse/SPARK-40647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-40647: Assignee: Apache Spark > DAGScheduler should fail job until all related running tasks have been killed > - > > Key: SPARK-40647 > URL: https://issues.apache.org/jira/browse/SPARK-40647 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.2 >Reporter: Wechar >Assignee: Apache Spark >Priority: Major > > *Issue Description* > The staging directory within table location is not removed when {{CTAS}} > fails sometimes. > It is a trouble if the new table is a Managed Table when we want to recreate > it. > *Root Cause* > SchedulerBackend kills tasks via {{KillTask}} message which is asynchronous, > so we may have failed a job but the tasks are still running and create the > tmp file. Even if the running tasks will failed and delete the generated file > finally, but the temporary directory was left. > *Solution* > Before killing a job, we should make sure that all related running tasks have > been killed. 
> *How to Reproduce*
> Step 1: create a source table and insert data so that the file count exceeds
> 20 on HDFS
> {code:sql}
> -- create source table
> CREATE TABLE IF NOT EXISTS default.test_wechar
> (name string)
> PARTITIONED BY (grass_date date)
> STORED AS PARQUET;
>
> -- insert data 24 times
> insert into default.test_wechar partition (grass_date='2022-09-03')
> select uuid()
> lateral view explode(sequence(1,2000)) as temp_view;
> {code}
> Step 2: create a new path for the new table and set its name quota to 20
> (e.g. with {{hdfs dfsadmin -setQuota 20 <path>}})
> {code:bash}
> $ hadoop fs -count -q hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp
>           20              19            none             inf            1            0                  0 hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp
> {code}
> Step 3: create the new table from the source table
> {code:sql}
> create table if not exists default.test_wechar_tmp
> location 'hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp'
> as select * from default.test_wechar;
> {code}
> Step 4: check the location of the new table after the job fails
> {code:bash}
> $ hadoop fs -ls /user/weiqiang.yu/tmp/test_wechar_tmp/*
> Found 1 items
> drwxrwxr-x   - weiqiang.yu weiqiang.yu          0 2022-10-04 12:56 /user/weiqiang.yu/tmp/test_wechar_tmp/.hive-staging_hive_2022-10-04_12-56-21_545_2745177084386740362-1/-ext-1
> {code}
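The proposed solution (only fail the job after all related running tasks are confirmed killed) can be sketched as a toy synchronization outside Spark. {{TaskSet}}, {{on_task_killed}} and {{fail_job}} are hypothetical names for illustration, not Spark's actual classes:

```python
import threading

class TaskSet:
    """Toy model of a running task set whose kill is asynchronous."""
    def __init__(self, n_tasks):
        self.n_tasks = n_tasks
        self._remaining = n_tasks
        self._lock = threading.Lock()
        self._all_killed = threading.Event()

    def on_task_killed(self):
        # Executor-side acknowledgement that one task is really dead.
        with self._lock:
            self._remaining -= 1
            if self._remaining == 0:
                self._all_killed.set()

    def wait_all_killed(self, timeout=None):
        # The fix: block until every task has acknowledged the kill.
        return self._all_killed.wait(timeout)

def fail_job(task_set):
    # KillTask messages are asynchronous: acknowledgements arrive from
    # executor threads at arbitrary times.
    for _ in range(task_set.n_tasks):
        threading.Thread(target=task_set.on_task_killed).start()
    # Only mark the job failed once all tasks are confirmed killed, so no
    # straggler can still write into the staging directory.
    if task_set.wait_all_killed(timeout=5):
        return "job failed cleanly"
    return "timed out waiting for kills"

print(fail_job(TaskSet(3)))  # job failed cleanly
```

The barrier-style wait is the whole idea: the cleanup of the staging directory only runs after the last acknowledgement, so nothing can recreate files under it afterwards.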
[jira] [Commented] (SPARK-40647) DAGScheduler should fail job until all related running tasks have been killed
[ https://issues.apache.org/jira/browse/SPARK-40647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612504#comment-17612504 ]

Apache Spark commented on SPARK-40647:
--------------------------------------

User 'wecharyu' has created a pull request for this issue:
https://github.com/apache/spark/pull/38092

> DAGScheduler should fail job until all related running tasks have been killed
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-40647
>                 URL: https://issues.apache.org/jira/browse/SPARK-40647
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.2
>            Reporter: Wechar
>            Priority: Major
>
> *Issue Description*
> The staging directory within the table location is sometimes not removed
> when {{CTAS}} fails.
> This is a problem when the new table is a managed table and we want to
> recreate it.
> *Root Cause*
> SchedulerBackend kills tasks via the {{KillTask}} message, which is
> asynchronous, so a job may already be marked as failed while its tasks are
> still running and creating temporary files. Even though the running tasks
> will eventually fail and delete the files they generated, the temporary
> staging directory is left behind.
> *Solution*
> Before failing a job, we should make sure that all related running tasks
> have been killed.
> *How to Reproduce*
> Step 1: create a source table and insert data so that the file count exceeds
> 20 on HDFS
> {code:sql}
> -- create source table
> CREATE TABLE IF NOT EXISTS default.test_wechar
> (name string)
> PARTITIONED BY (grass_date date)
> STORED AS PARQUET;
>
> -- insert data 24 times
> insert into default.test_wechar partition (grass_date='2022-09-03')
> select uuid()
> lateral view explode(sequence(1,2000)) as temp_view;
> {code}
> Step 2: create a new path for the new table and set its name quota to 20
> (e.g. with {{hdfs dfsadmin -setQuota 20 <path>}})
> {code:bash}
> $ hadoop fs -count -q hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp
>           20              19            none             inf            1            0                  0 hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp
> {code}
> Step 3: create the new table from the source table
> {code:sql}
> create table if not exists default.test_wechar_tmp
> location 'hdfs://Test-DMP01/user/weiqiang.yu/tmp/test_wechar_tmp'
> as select * from default.test_wechar;
> {code}
> Step 4: check the location of the new table after the job fails
> {code:bash}
> $ hadoop fs -ls /user/weiqiang.yu/tmp/test_wechar_tmp/*
> Found 1 items
> drwxrwxr-x   - weiqiang.yu weiqiang.yu          0 2022-10-04 12:56 /user/weiqiang.yu/tmp/test_wechar_tmp/.hive-staging_hive_2022-10-04_12-56-21_545_2745177084386740362-1/-ext-1
> {code}
[jira] [Commented] (SPARK-40096) Finalize shuffle merge slow due to connection creation fails
[ https://issues.apache.org/jira/browse/SPARK-40096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17612497#comment-17612497 ]

Apache Spark commented on SPARK-40096:
--------------------------------------

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/38091

> Finalize shuffle merge slow due to connection creation fails
> -------------------------------------------------------------
>
>                 Key: SPARK-40096
>                 URL: https://issues.apache.org/jira/browse/SPARK-40096
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Wan Kun
>            Assignee: Wan Kun
>            Priority: Major
>             Fix For: 3.4.0
>
>
> *How to reproduce this issue*
> * Enable push-based shuffle
> * Remove some merger nodes before sending the finalize RPCs
> * The driver tries to connect to those merger shuffle services and sends
> the finalize RPCs one by one; each connection creation times out after
> SPARK_NETWORK_IO_CONNECTIONCREATIONTIMEOUT_KEY (120s by default)
>
> We can send these RPCs in the *shuffleMergeFinalizeScheduler* thread pool
> and handle the connection creation exceptions there.
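The problem above is that each lost merger costs a full connection-creation timeout, paid serially. The proposed direction (issue the finalize RPCs through a thread pool and handle per-connection failures) can be sketched in Python; {{send_finalize}} and the merger names are hypothetical stand-ins, not Spark's actual RPC layer:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def send_finalize(merger):
    # Stand-in for the FinalizeShuffleMerge RPC; "lost" mergers behave
    # like a connection-creation timeout.
    if merger.startswith("lost"):
        raise TimeoutError(f"connect to {merger} timed out")
    return f"{merger}: merged"

def finalize_all(mergers, pool_size=4):
    results, failures = [], []
    # Send all finalize RPCs concurrently instead of one by one, so lost
    # mergers cost roughly one timeout in total, not one timeout each.
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(send_finalize, m) for m in mergers]
        for fut in as_completed(futures):
            try:
                results.append(fut.result())
            except TimeoutError as e:
                # Handle the connection-creation failure instead of
                # letting it block the driver's finalize loop.
                failures.append(str(e))
    return sorted(results), sorted(failures)

ok, bad = finalize_all(["host1", "lost-host2", "host3"])
print(ok)   # ['host1: merged', 'host3: merged']
print(bad)  # ['connect to lost-host2 timed out']
```

With the serial approach, N unreachable mergers cost roughly N x 120 s; with a pool, the timeouts overlap and the healthy mergers finalize immediately.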