[jira] [Updated] (SPARK-45878) ConcurrentModificationException in CliSuite
[ https://issues.apache.org/jira/browse/SPARK-45878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45878: --- Labels: pull-request-available (was: ) > ConcurrentModificationException in CliSuite > --- > > Key: SPARK-45878 > URL: https://issues.apache.org/jira/browse/SPARK-45878 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > > {code:java} > // code placeholder > java.util.ConcurrentModificationException: mutation occurred during iteration > [info] at > scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43) > [info] at > scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47) > [info] at > scala.collection.IterableOnceOps.addString(IterableOnce.scala:1247) > [info] at > scala.collection.IterableOnceOps.addString$(IterableOnce.scala:1241) > [info] at scala.collection.AbstractIterable.addString(Iterable.scala:933) > [info] at scala.collection.IterableOnceOps.mkString(IterableOnce.scala:1191) > [info] at > scala.collection.IterableOnceOps.mkString$(IterableOnce.scala:1189) > [info] at scala.collection.AbstractIterable.mkString(Iterable.scala:933) > [info] at scala.collection.IterableOnceOps.mkString(IterableOnce.scala:1204) > [info] at > scala.collection.IterableOnceOps.mkString$(IterableOnce.scala:1204) > [info] at scala.collection.AbstractIterable.mkString(Iterable.scala:933) > [info] at > org.apache.spark.sql.hive.thriftserver.CliSuite.runCliWithin(CliSuite.scala:205) > [info] at > org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$new$20(CliSuite.scala:501) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45878) ConcurrentModificationException in CliSuite
Kent Yao created SPARK-45878: Summary: ConcurrentModificationException in CliSuite Key: SPARK-45878 URL: https://issues.apache.org/jira/browse/SPARK-45878 Project: Spark Issue Type: Bug Components: SQL, Tests Affects Versions: 4.0.0 Reporter: Kent Yao {code:java} // code placeholder java.util.ConcurrentModificationException: mutation occurred during iteration [info] at scala.collection.mutable.MutationTracker$.checkMutations(MutationTracker.scala:43) [info] at scala.collection.mutable.CheckedIndexedSeqView$CheckedIterator.hasNext(CheckedIndexedSeqView.scala:47) [info] at scala.collection.IterableOnceOps.addString(IterableOnce.scala:1247) [info] at scala.collection.IterableOnceOps.addString$(IterableOnce.scala:1241) [info] at scala.collection.AbstractIterable.addString(Iterable.scala:933) [info] at scala.collection.IterableOnceOps.mkString(IterableOnce.scala:1191) [info] at scala.collection.IterableOnceOps.mkString$(IterableOnce.scala:1189) [info] at scala.collection.AbstractIterable.mkString(Iterable.scala:933) [info] at scala.collection.IterableOnceOps.mkString(IterableOnce.scala:1204) [info] at scala.collection.IterableOnceOps.mkString$(IterableOnce.scala:1204) [info] at scala.collection.AbstractIterable.mkString(Iterable.scala:933) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.runCliWithin(CliSuite.scala:205) [info] at org.apache.spark.sql.hive.thriftserver.CliSuite.$anonfun$new$20(CliSuite.scala:501) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
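The stack trace points at a mutable buffer that the CLI output-capture thread keeps appending to while the test thread iterates it via mkString. Below is a minimal sketch of that race and one way to avoid it, assuming a shared ArrayBuffer as in CliSuite; the names are illustrative, not the actual CliSuite code.
{code:scala}
import scala.collection.mutable.ArrayBuffer

object ConcurrentMkStringSketch {
  def main(args: Array[String]): Unit = {
    val lines = new ArrayBuffer[String]()   // shared between capture thread and test thread

    val captureThread = new Thread(() => {
      (1 to 100000).foreach(i => lines.synchronized { lines += s"captured line $i" })
    })
    captureThread.start()

    // Unsafe: lines.mkString("\n") here can throw ConcurrentModificationException,
    // because Scala 2.13's ArrayBuffer iterator tracks mutations made mid-iteration.
    // Safer: take an immutable snapshot under the same lock the writer uses.
    val snapshot = lines.synchronized(lines.toList)
    println(s"captured so far: ${snapshot.size}")

    captureThread.join()
  }
}
{code}
The actual fix for the suite may differ; the point is that the failure-message path must not iterate the live buffer while the capture thread is still appending to it.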
[jira] [Created] (SPARK-45877) ExecutorFailureTracker support for standalone mode
Kent Yao created SPARK-45877: Summary: ExecutorFailureTracker support for standalone mode Key: SPARK-45877 URL: https://issues.apache.org/jira/browse/SPARK-45877 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Kent Yao ExecutorFailureTracker now works for K8s and YARN; I think it is also an important feature for standalone mode to have -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45876) Filters are not pushed down across lateral view
Alexander Petrossian (PAF) created SPARK-45876: -- Summary: Filters are not pushed down across lateral view Key: SPARK-45876 URL: https://issues.apache.org/jira/browse/SPARK-45876 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0 Reporter: Alexander Petrossian (PAF) {code:python} from pyspark.sql import SparkSession spark = SparkSession.builder.config("spark.sql.catalogImplementation", "hive").appName("Write ORC File").getOrCreate() spark.sql('drop TABLE if exists test').show() spark.sql('CREATE EXTERNAL TABLE test (request struct>>)' 'ROW FORMAT SERDE "org.apache.hadoop.hive.ql.io.orc.OrcSerde" ' 'STORED AS INPUTFORMAT "org.apache.hadoop.hive.ql.io.orc.OrcInputFormat" ' 'OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat" ' 'LOCATION "testfolder"').show() spark.sql("select request from test lateral view explode(request.characteristic) cTable as c where c.value='7964000'").explain() {code} shows {code} == Physical Plan == *(1) Project [request#2] +- *(1) Filter (isnotnull(c#4.value) AND (c#4.value = 7964000)) +- *(1) Generate explode(request#2.characteristic), [request#2], false, [c#4] +- *(1) ColumnarToRow +- FileScan orc spark_catalog.default.test[request#2] Batched: true, DataFilters: [], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/Users/paf/Downloads/spark-warehouse/testfolder], PartitionFilters: [], PushedFilters: [], ReadSchema: struct>>> {code} Which is extremely slow. Suppose I search for a column value, which is totally out of min/max statistics range. Search could have been much faster, but no. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45875) Remove `MissingStageTableRowData` from `core` module
[ https://issues.apache.org/jira/browse/SPARK-45875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45875: --- Labels: pull-request-available (was: ) > Remove `MissingStageTableRowData` from `core` module > - > > Key: SPARK-45875 > URL: https://issues.apache.org/jira/browse/SPARK-45875 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45875) Remove `MissingStageTableRowData` from `core` module
Yang Jie created SPARK-45875: Summary: Remove `MissingStageTableRowData` from `core` module Key: SPARK-45875 URL: https://issues.apache.org/jira/browse/SPARK-45875 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45874) Remove Java version check from `IsolatedClientLoader`
[ https://issues.apache.org/jira/browse/SPARK-45874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45874: --- Labels: pull-request-available (was: ) > Remove Java version check from `IsolatedClientLoader` > - > > Key: SPARK-45874 > URL: https://issues.apache.org/jira/browse/SPARK-45874 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > Labels: pull-request-available > > {code:java} > val rootClassLoader: ClassLoader = > if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9)) { > // In Java 9, the boot classloader can see few JDK classes. The intended > parent > // classloader for delegation is now the platform classloader. > // See http://java9.wtf/class-loading/ > val platformCL = > classOf[ClassLoader].getMethod("getPlatformClassLoader"). > invoke(null).asInstanceOf[ClassLoader] > // Check to make sure that the root classloader does not know about Hive. > > assert(Try(platformCL.loadClass("org.apache.hadoop.hive.conf.HiveConf")).isFailure) > platformCL > } else { > // The boot classloader is represented by null (the instance itself isn't > accessible) > // and before Java 9 can see all JDK classes > null > } {code} > Spark 4.0.0 has a minimum requirement of Java 17, so the version check for > Java 9 is not necessary. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45874) Remove Java version check from `IsolatedClientLoader`
Yang Jie created SPARK-45874: Summary: Remove Java version check from `IsolatedClientLoader` Key: SPARK-45874 URL: https://issues.apache.org/jira/browse/SPARK-45874 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Yang Jie {code:java} val rootClassLoader: ClassLoader = if (SystemUtils.isJavaVersionAtLeast(JavaVersion.JAVA_9)) { // In Java 9, the boot classloader can see few JDK classes. The intended parent // classloader for delegation is now the platform classloader. // See http://java9.wtf/class-loading/ val platformCL = classOf[ClassLoader].getMethod("getPlatformClassLoader"). invoke(null).asInstanceOf[ClassLoader] // Check to make sure that the root classloader does not know about Hive. assert(Try(platformCL.loadClass("org.apache.hadoop.hive.conf.HiveConf")).isFailure) platformCL } else { // The boot classloader is represented by null (the instance itself isn't accessible) // and before Java 9 can see all JDK classes null } {code} Spark 4.0.0 has a minimum requirement of Java 17, so the version check for Java 9 is not necessary. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
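With the Java 9 branch gone, the remaining logic collapses to the platform-classloader path. A possible simplified form, sketched from the snippet quoted above (not necessarily the exact patch):
{code:scala}
import scala.util.Try

// Spark 4.0 requires Java 17, so the platform classloader always exists and is the
// intended delegation parent; the pre-Java-9 "null boot classloader" branch can be
// dropped, along with the reflective getPlatformClassLoader lookup.
val rootClassLoader: ClassLoader = {
  val platformCL = ClassLoader.getPlatformClassLoader
  // Sanity check: the root classloader must not be able to see Hive classes.
  assert(Try(platformCL.loadClass("org.apache.hadoop.hive.conf.HiveConf")).isFailure)
  platformCL
}
{code}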
[jira] [Updated] (SPARK-45847) CliSuite flakiness due to non-sequential guarantee for stdout
[ https://issues.apache.org/jira/browse/SPARK-45847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-45847: - Fix Version/s: 3.4.2 > CliSuite flakiness due to non-sequential guarantee for stdout > > > Key: SPARK-45847 > URL: https://issues.apache.org/jira/browse/SPARK-45847 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.5.0, 4.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Labels: pull-request-available > Fix For: 3.4.2, 4.0.0, 3.5.1 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45873) Make ExecutorFailureTracker more tolerant when app remains sufficient resources
[ https://issues.apache.org/jira/browse/SPARK-45873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45873: --- Labels: pull-request-available (was: ) > Make ExecutorFailureTracker more tolerant when app remains sufficient > resources > > > Key: SPARK-45873 > URL: https://issues.apache.org/jira/browse/SPARK-45873 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Spark Core, YARN >Affects Versions: 4.0.0 >Reporter: Kent Yao >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45873) Make ExecutorFailureTracker more tolerant when app remains sufficient resources
Kent Yao created SPARK-45873: Summary: Make ExecutorFailureTracker more tolerant when app remains sufficient resources Key: SPARK-45873 URL: https://issues.apache.org/jira/browse/SPARK-45873 Project: Spark Issue Type: Improvement Components: Kubernetes, Spark Core, YARN Affects Versions: 4.0.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45872) Update plugin for SBOM generation to 2.7.10
Vinod Anandan created SPARK-45872: - Summary: Update plugin for SBOM generation to 2.7.10 Key: SPARK-45872 URL: https://issues.apache.org/jira/browse/SPARK-45872 Project: Spark Issue Type: Improvement Components: Project Infra Affects Versions: 3.5.0 Reporter: Vinod Anandan Update the CycloneDX Maven plugin for SBOM generation to 2.7.10 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45871) Change `.toBuffer.toSeq` to `.toSeq`
[ https://issues.apache.org/jira/browse/SPARK-45871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45871: --- Labels: pull-request-available (was: ) > Change `.toBuffer.toSeq` to `.toSeq` > > > Key: SPARK-45871 > URL: https://issues.apache.org/jira/browse/SPARK-45871 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Yang Jie >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45871) Change `.toBuffer.toSeq` to `.toSeq`
Yang Jie created SPARK-45871: Summary: Change `.toBuffer.toSeq` to `.toSeq` Key: SPARK-45871 URL: https://issues.apache.org/jira/browse/SPARK-45871 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: Yang Jie -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
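For context, a minimal illustration of why the intermediate buffer is redundant under Scala 2.13 collection semantics (the example values are arbitrary):
{code:scala}
// In Scala 2.13, IterableOnceOps.toSeq already materializes an immutable Seq,
// so going through a mutable Buffer first only adds an extra copy.
val it = Iterator(1, 2, 3)

// Before: it.toBuffer.toSeq  -- copies into an ArrayBuffer, then into an immutable Seq.
// After: one copy, same immutable result type.
val result: Seq[Int] = it.toSeq
println(result)  // List(1, 2, 3)
{code}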
[jira] [Updated] (SPARK-45814) ArrowConverters.createEmptyArrowBatch may cause memory leak
[ https://issues.apache.org/jira/browse/SPARK-45814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie updated SPARK-45814: - Fix Version/s: 4.0.0 3.5.1 > ArrowConverters.createEmptyArrowBatch may cause memory leak > --- > > Key: SPARK-45814 > URL: https://issues.apache.org/jira/browse/SPARK-45814 > Project: Spark > Issue Type: Bug > Components: Connect, SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: xie shuiahu >Assignee: xie shuiahu >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.2, 4.0.0, 3.5.1 > > > ArrowConverters.createEmptyArrowBatch doesn't call hasNext; if TaskContext.get > is None, a memory leak happens -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45814) ArrowConverters.createEmptyArrowBatch may cause memory leak
[ https://issues.apache.org/jira/browse/SPARK-45814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yang Jie resolved SPARK-45814. -- Fix Version/s: 3.4.2 Resolution: Fixed Issue resolved by pull request 43728 [https://github.com/apache/spark/pull/43728] > ArrowConverters.createEmptyArrowBatch may cause memory leak > --- > > Key: SPARK-45814 > URL: https://issues.apache.org/jira/browse/SPARK-45814 > Project: Spark > Issue Type: Bug > Components: Connect, SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: xie shuiahu >Assignee: xie shuiahu >Priority: Minor > Labels: pull-request-available > Fix For: 3.4.2 > > > ArrowConverters.createEmptyArrowBatch doesn't call hasNext; if TaskContext.get > is None, a memory leak happens -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
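A hedged reduction of the leak pattern as described (class and field names are illustrative, not the actual ArrowConverters code): cleanup is only wired up when hasNext is first called, so a caller that never calls hasNext, and that runs where TaskContext.get() is null, never releases the allocator.
{code:scala}
import org.apache.arrow.memory.RootAllocator
import org.apache.spark.TaskContext

class LeakyEmptyBatchIterator extends Iterator[Array[Byte]] {
  private val allocator = new RootAllocator(Long.MaxValue)  // must eventually be close()d

  override def hasNext: Boolean = {
    // Cleanup is registered lazily here; if hasNext is never invoked, or if there is
    // no TaskContext (driver-side use), allocator.close() is never scheduled.
    Option(TaskContext.get()).foreach(_.addTaskCompletionListener[Unit](_ => allocator.close()))
    false  // empty batch
  }

  override def next(): Array[Byte] = throw new NoSuchElementException("empty batch")
}
{code}
A fix along the lines of the linked PR would close the allocator eagerly (or register cleanup unconditionally) instead of relying on hasNext being called inside a task.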
[jira] [Updated] (SPARK-43305) Add Java17 dockerfiles for 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-43305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43305: --- Labels: pull-request-available (was: ) > Add Java17 dockerfiles for 3.5.0 > > > Key: SPARK-43305 > URL: https://issues.apache.org/jira/browse/SPARK-43305 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.5.0 >Reporter: Yikun Jiang >Assignee: Yikun Jiang >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43305) Add Java17 dockerfiles for 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-43305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784669#comment-17784669 ] Yikun Jiang commented on SPARK-43305: - Resolved by https://github.com/apache/spark-docker/pull/56 > Add Java17 dockerfiles for 3.5.0 > > > Key: SPARK-43305 > URL: https://issues.apache.org/jira/browse/SPARK-43305 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.5.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43305) Add Java17 dockerfiles for 3.5.0
[ https://issues.apache.org/jira/browse/SPARK-43305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yikun Jiang updated SPARK-43305: Summary: Add Java17 dockerfiles for 3.5.0 (was: Add Java17 dockerfiles for 3.4.0) > Add Java17 dockerfiles for 3.5.0 > > > Key: SPARK-43305 > URL: https://issues.apache.org/jira/browse/SPARK-43305 > Project: Spark > Issue Type: Sub-task > Components: Spark Docker >Affects Versions: 3.5.0 >Reporter: Yikun Jiang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated
[ https://issues.apache.org/jira/browse/SPARK-45373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45373: - Shepherd: (was: Peter Toth) > Minimizing calls to HiveMetaStore layer for getting partitions, when tables > are repeated > - > > Key: SPARK-45373 > URL: https://issues.apache.org/jira/browse/SPARK-45373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.1 > > > In the rule PruneFileSourcePartitions where the CatalogFileIndex gets > converted to InMemoryFileIndex, the HMS calls can get very expensive if : > 1) The translated filter string for push down to HMS layer becomes empty , > resulting in fetching of all partitions and same table is referenced multiple > times in the query. > 2) Or just in case same table is referenced multiple times in the query with > different partition filters. > In such cases current code would result in multiple calls to HMS layer. > This can be avoided by grouping the tables based on CatalogFileIndex and > passing a common minimum filter ( filter1 || filter2) and getting a base > PrunedInmemoryFileIndex which can become a basis for each of the specific > table. > Opened following PR for ticket: > [SPARK-45373-PR|https://github.com/apache/spark/pull/43183] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
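The grouping idea can be sketched roughly as follows (types and helper names are illustrative, not the proposed patch): filters within one scan are AND-ed, the per-scan filters for the same CatalogFileIndex are OR-ed, and the metastore is asked once with that combined predicate.
{code:scala}
import org.apache.spark.sql.catalyst.expressions.{And, Expression, Or}

// One entry per scan of the same table (i.e. sharing one CatalogFileIndex).
def combinedPartitionFilter(filtersPerScan: Seq[Seq[Expression]]): Option[Expression] = {
  // If any scan has no partition filter, the union already covers all partitions,
  // so there is nothing useful to push to the Hive metastore.
  if (filtersPerScan.isEmpty || filtersPerScan.exists(_.isEmpty)) None
  else {
    val perScan = filtersPerScan.map(_.reduce((a, b) => And(a, b)))
    Some(perScan.reduce((a, b) => Or(a, b)))
  }
}
{code}
Each scan would then re-filter the shared pruned listing with its own predicate, as the description outlines.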
[jira] [Updated] (SPARK-33152) SPIP: Constraint Propagation code causes OOM issues or increasing compilation time to hours
[ https://issues.apache.org/jira/browse/SPARK-33152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-33152: - Affects Version/s: 3.5.0 (was: 2.4.0) (was: 3.0.1) (was: 3.1.2) > SPIP: Constraint Propagation code causes OOM issues or increasing compilation > time to hours > --- > > Key: SPARK-33152 > URL: https://issues.apache.org/jira/browse/SPARK-33152 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: SPIP > Original Estimate: 168h > Remaining Estimate: 168h > > h2. Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon. > Proposing new algorithm to create, store and use constraints for removing > redundant filters & inferring new filters. > The current algorithm has subpar performance in complex expression scenarios > involving aliases( with certain use cases the compilation time can go into > hours), potential to cause OOM, may miss removing redundant filters in > different scenarios, may miss creating IsNotNull constraints in different > scenarios, does not push compound predicates in Join. > # This issue if not fixed can cause OutOfMemory issue or unacceptable query > compilation times. > Have added a test "plan equivalence with case statements and performance > comparison with benefit of more than 10x conservatively" in > org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite. > *With this PR the compilation time is 247 ms vs 13958 ms without the change* > # It is more effective in filter pruning as is evident in some of the tests > in org.apache.spark.sql.catalyst.plans.OptimizedConstraintPropagationSuite > where current code is not able to identify the redundant filter in some cases. > # It is able to generate a better optimized plan for join queries as it can > push compound predicates. > # The current logic can miss a lot of possible cases of removing redundant > predicates, as it fails to take into account if same attribute or its aliases > are repeated multiple times in a complex expression. > # There are cases where some of the optimizer rules involving removal of > redundant predicates fail to remove on the basis of constraint data. In some > cases the rule works, just by the virtue of previous rules helping it out to > cover the inaccuracy. That the ConstraintPropagation rule & its function of > removal of redundant filters & addition of new inferred filters is dependent > on the working of some of the other unrelated previous optimizer rules is > behaving, is indicative of issues. > # It does away with all the EqualNullSafe constraints as this logic does not > need those constraints to be created. > # There is at least one test in existing ConstraintPropagationSuite which is > missing a IsNotNull constraints because the code incorrectly generated a > EqualsNullSafeConstraint instead of EqualTo constraint, when using the > existing Constraints code. With these changes, the test correctly creates an > EqualTo constraint, resulting in an inferred IsNotNull constraint > # It does away with the current combinatorial logic of evaluation all the > constraints can cause compilation to run into hours or cause OOM. The number > of constraints stored is exactly the same as the number of filters encountered > h2. Q2. What problem is this proposal NOT designed to solve? 
> It mainly focuses on compile time performance, but in some cases can benefit > run time characteristics too, like inferring IsNotNull filter or pushing down > compound predicates on the join, which currently may get missed/ does not > happen , respectively, by the present code. > h2. Q3. How is it done today, and what are the limits of current practice? > Current ConstraintsPropagation code, pessimistically tries to generates all > the possible combinations of constraints , based on the aliases ( even then > it may miss a lot of combinations if the expression is a complex expression > involving same attribute repeated multiple times within the expression and > there are many aliases to that column). There are query plans in our > production env, which can result in intermediate number of constraints going > into hundreds of thousands, causing OOM or taking time running into hours. > Also there are cases where it incorrectly generates an EqualNullSafe > constraint instead of EqualTo constraint , thus missing a possible IsNull > constraint on column. > Also it only pushes single column predicate on the other side of the join. > The constraints generated , in some cases, are missing the required ones, and > the plan
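As a concrete instance of the redundancy being discussed (an illustrative query, not taken from the SPIP, assuming an active SparkSession bound to `spark`): a filter restated through an alias should be pruned via constraints, and an IsNotNull constraint should be inferable from it.
{code:scala}
// The outer filter on b duplicates the inner filter on a (b is just an alias of a),
// so constraint propagation should remove it; it should also let the optimizer
// infer isnotnull(id) for pushdown and join purposes.
val df = spark.range(100).selectExpr("id AS a")
  .filter("a > 5")
  .selectExpr("a AS b")
  .filter("b > 5")   // redundant given the constraint a > 5 and the alias b -> a

println(df.queryExecution.optimizedPlan)
{code}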
[jira] [Updated] (SPARK-45373) Minimizing calls to HiveMetaStore layer for getting partitions, when tables are repeated
[ https://issues.apache.org/jira/browse/SPARK-45373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-45373: - Affects Version/s: 3.5.0 (was: 4.0.0) > Minimizing calls to HiveMetaStore layer for getting partitions, when tables > are repeated > - > > Key: SPARK-45373 > URL: https://issues.apache.org/jira/browse/SPARK-45373 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Minor > Labels: pull-request-available > Fix For: 3.5.1 > > > In the rule PruneFileSourcePartitions where the CatalogFileIndex gets > converted to InMemoryFileIndex, the HMS calls can get very expensive if : > 1) The translated filter string for push down to HMS layer becomes empty , > resulting in fetching of all partitions and same table is referenced multiple > times in the query. > 2) Or just in case same table is referenced multiple times in the query with > different partition filters. > In such cases current code would result in multiple calls to HMS layer. > This can be avoided by grouping the tables based on CatalogFileIndex and > passing a common minimum filter ( filter1 || filter2) and getting a base > PrunedInmemoryFileIndex which can become a basis for each of the specific > table. > Opened following PR for ticket: > [SPARK-45373-PR|https://github.com/apache/spark/pull/43183] -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44662) SPIP: Improving performance of BroadcastHashJoin queries with stream side join key on non partition columns
[ https://issues.apache.org/jira/browse/SPARK-44662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Asif updated SPARK-44662: - Affects Version/s: 3.5.0 (was: 3.5.1) > SPIP: Improving performance of BroadcastHashJoin queries with stream side > join key on non partition columns > --- > > Key: SPARK-44662 > URL: https://issues.apache.org/jira/browse/SPARK-44662 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: pull-request-available > Attachments: perf results broadcast var pushdown - Partitioned > TPCDS.pdf > > > h2. *Q1. What are you trying to do? Articulate your objectives using > absolutely no jargon.* > On the lines of DPP which helps DataSourceV2 relations when the joining key > is a partition column, the same concept can be extended over to the case > where joining key is not a partition column. > The Keys of BroadcastHashJoin are already available before actual evaluation > of the stream iterator. These keys can be pushed down to the DataSource as a > SortedSet. > For non partition columns, the DataSources like iceberg have max/min stats on > column available at manifest level, and for formats like parquet , they have > max/min stats at various storage level. The passed SortedSet can be used to > prune using ranges at both driver level ( manifests files) as well as > executor level ( while actually going through chunks , row groups etc at > parquet level) > If the data is stored as Columnar Batch format , then it would not be > possible to filter out individual row at DataSource level, even though we > have keys. > But at the scan level, ( ColumnToRowExec) it is still possible to filter out > as many rows as possible , if the query involves nested joins. Thus reducing > the number of rows to join at the higher join levels. > Will be adding more details.. > h2. *Q2. What problem is this proposal NOT designed to solve?* > This can only help in BroadcastHashJoin's performance if the join is Inner or > Left Semi. > This will also not work if there are nodes like Expand, Generator , Aggregate > (without group by on keys not part of joining column etc) below the > BroadcastHashJoin node being targeted. > h2. *Q3. How is it done today, and what are the limits of current practice?* > Currently this sort of pruning at DataSource level is being done using DPP > (Dynamic Partition Pruning ) and IFF one of the join key column is a > Partitioning column ( so that cost of DPP query is justified and way less > than amount of data it will be filtering by skipping partitions). > The limitation is that DPP type approach is not implemented ( intentionally I > believe), if the join column is a non partition column ( because of cost of > "DPP type" query would most likely be way high as compared to any possible > pruning ( especially if the column is not stored in a sorted manner). > h2. *Q4. What is new in your approach and why do you think it will be > successful?* > 1) This allows pruning on non partition column based joins. > 2) Because it piggy backs on Broadcasted Keys, there is no extra cost of "DPP > type" query. > 3) The Data can be used by DataSource to prune at driver (possibly) and also > at executor level ( as in case of parquet which has max/min at various > structure levels) > 4) The big benefit should be seen in multilevel nested join queries. In the > current code base, if I am correct, only one join's pruning filter would get > pushed at scan level. 
Since it is on partition key may be that is sufficient. > But if it is a nested Join query , and may be involving different columns on > streaming side for join, each such filter push could do significant pruning. > This requires some handling in case of AQE, as the stream side iterator ( & > hence stage evaluation needs to be delayed, till all the available join > filters in the nested tree are pushed at their respective target > BatchScanExec). > h4. *Single Row Filteration* > 5) In case of nested broadcasted joins, if the datasource is column vector > oriented , then what spark would get is a ColumnarBatch. But because scans > have Filters from multiple joins, they can be retrieved and can be applied in > code generated at ColumnToRowExec level, using a new "containsKey" method on > HashedRelation. Thus only those rows which satisfy all the > BroadcastedHashJoins ( whose keys have been pushed) , will be used for join > evaluation. > The code is already there , the PR on spark side is > [spark-broadcast-var|https://github.com/apache/spark/pull/43373]. For non > partition table TPCDS run on laptop with TPCDS data size of ( scale factor
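The "single row filtration" idea in Q4(5) reduces to something like the following sketch (types are deliberately generic; the SPIP proposes a containsKey method on HashedRelation rather than a plain Set): while converting a ColumnarBatch to rows, drop any row whose join key is not present on the broadcast side.
{code:scala}
// Sketch only: keep rows whose join key appears among the broadcasted keys, so
// later joins in a nested plan see fewer rows.
def filterByBroadcastKeys[R, K](rows: Iterator[R], keyOf: R => K, broadcastKeys: Set[K]): Iterator[R] =
  rows.filter(row => broadcastKeys.contains(keyOf(row)))

// Example usage with plain tuples standing in for rows:
val rows = Iterator(("a", 1), ("b", 2), ("c", 3))
val kept = filterByBroadcastKeys[(String, Int), String](rows, _._1, Set("a", "c"))
println(kept.toList)  // List((a,1), (c,3))
{code}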
[jira] [Resolved] (SPARK-45850) Upgrade oracle jdbc driver to 23.3.0.23.09
[ https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45850. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43662 [https://github.com/apache/spark/pull/43662] > Upgrade oracle jdbc driver to 23.3.0.23.09 > --- > > Key: SPARK-45850 > URL: https://issues.apache.org/jira/browse/SPARK-45850 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: BingKun Pan >Priority: Minor > Labels: pull-request-available > Fix For: 4.0.0 > > > Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc > driver version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45869) Revisit and Improve Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45869: -- Labels: releasenotes (was: ) > Revisit and Improve Spark Standalone Cluster > > > Key: SPARK-45869 > URL: https://issues.apache.org/jira/browse/SPARK-45869 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: releasenotes > Fix For: 4.0.0 > > > This is an experimental internal configuration for advance users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45756) Support `spark.master.useAppNameAsAppId.enabled`
[ https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45756: -- Parent: SPARK-45869 Issue Type: Sub-task (was: Improvement) > Support `spark.master.useAppNameAsAppId.enabled` > > > Key: SPARK-45756 > URL: https://issues.apache.org/jira/browse/SPARK-45756 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45756) Support `spark.master.useAppNameAsAppId.enabled`
[ https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45756: -- Labels: pull-request-available (was: pull-request-available releasenotes) > Support `spark.master.useAppNameAsAppId.enabled` > > > Key: SPARK-45756 > URL: https://issues.apache.org/jira/browse/SPARK-45756 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45869) Revisit and Improve Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45869: -- Summary: Revisit and Improve Spark Standalone Cluster (was: Support `spark.master.useAppNameAsAppId.enabled`) > Revisit and Improve Spark Standalone Cluster > > > Key: SPARK-45869 > URL: https://issues.apache.org/jira/browse/SPARK-45869 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 4.0.0 > > > This is an experimental internal configuration for advance users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45756) Support `spark.master.useAppNameAsAppId.enabled`
[ https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45756: -- Summary: Support `spark.master.useAppNameAsAppId.enabled` (was: Revisit and Improve Spark Standalone Cluster) > Support `spark.master.useAppNameAsAppId.enabled` > > > Key: SPARK-45756 > URL: https://issues.apache.org/jira/browse/SPARK-45756 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available, releasenotes > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`
[ https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45869: -- Labels: (was: pull-request-available) > Support `spark.master.useAppNameAsAppId.enabled` > > > Key: SPARK-45869 > URL: https://issues.apache.org/jira/browse/SPARK-45869 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 4.0.0 > > > This is an experimental internal configuration for advance users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`
[ https://issues.apache.org/jira/browse/SPARK-45869 ] Dongjoon Hyun deleted comment on SPARK-45869: --- was (Author: dongjoon): This is resolved via [https://github.com/apache/spark/pull/43743] > Support `spark.master.useAppNameAsAppId.enabled` > > > Key: SPARK-45869 > URL: https://issues.apache.org/jira/browse/SPARK-45869 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > This is an experimental internal configuration for advance users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`
[ https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45869: -- Parent: (was: SPARK-45756) Issue Type: Improvement (was: Sub-task) > Support `spark.master.useAppNameAsAppId.enabled` > > > Key: SPARK-45869 > URL: https://issues.apache.org/jira/browse/SPARK-45869 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > This is an experimental internal configuration for advance users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45850) Upgrade oracle jdbc driver to 23.3.0.23.09
[ https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-45850: Component/s: Tests > Upgrade oracle jdbc driver to 23.3.0.23.09 > --- > > Key: SPARK-45850 > URL: https://issues.apache.org/jira/browse/SPARK-45850 > Project: Spark > Issue Type: Improvement > Components: Project Infra, Tests >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > > Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc > driver version. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45850) Upgrade oracle jdbc driver to 23.3.0.23.09
[ https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] BingKun Pan updated SPARK-45850: Summary: Upgrade oracle jdbc driver to 23.3.0.23.09 (was: Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc driver version ) > Upgrade oracle jdbc driver to 23.3.0.23.09 > --- > > Key: SPARK-45850 > URL: https://issues.apache.org/jira/browse/SPARK-45850 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`
[ https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45869. --- Fix Version/s: 4.0.0 Resolution: Fixed This is resolved via [https://github.com/apache/spark/pull/43743] > Support `spark.master.useAppNameAsAppId.enabled` > > > Key: SPARK-45869 > URL: https://issues.apache.org/jira/browse/SPARK-45869 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > This is an experimental internal configuration for advance users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`
[ https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45869: --- Labels: pull-request-available (was: ) > Support `spark.master.useAppNameAsAppId.enabled` > > > Key: SPARK-45869 > URL: https://issues.apache.org/jira/browse/SPARK-45869 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > > This is an experimental internal configuration for advance users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`
[ https://issues.apache.org/jira/browse/SPARK-45869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45869: - Assignee: Dongjoon Hyun > Support `spark.master.useAppNameAsAppId.enabled` > > > Key: SPARK-45869 > URL: https://issues.apache.org/jira/browse/SPARK-45869 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > This is an experimental internal configuration for advance users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45798) Assert server-side session ID in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-45798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-45798: Assignee: Martin Grund > Assert server-side session ID in Spark Connect > -- > > Key: SPARK-45798 > URL: https://issues.apache.org/jira/browse/SPARK-45798 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Labels: pull-request-available > > When accessing the Spark Session remotely, it is possible that the server has > silently restarted and we lose temporary state such as views or > function definitions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45798) Assert server-side session ID in Spark Connect
[ https://issues.apache.org/jira/browse/SPARK-45798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-45798. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43664 [https://github.com/apache/spark/pull/43664] > Assert server-side session ID in Spark Connect > -- > > Key: SPARK-45798 > URL: https://issues.apache.org/jira/browse/SPARK-45798 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Assignee: Martin Grund >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > When accessing the Spark Session remotely, it is possible that the server has > silently restarted and we lose temporary state such as views or > function definitions. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45756) Revisit and Improve Spark Standalone Cluster
[ https://issues.apache.org/jira/browse/SPARK-45756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45756: --- Labels: pull-request-available releasenotes (was: releasenotes) > Revisit and Improve Spark Standalone Cluster > > > Key: SPARK-45756 > URL: https://issues.apache.org/jira/browse/SPARK-45756 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available, releasenotes > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45869) Support `spark.master.useAppNameAsAppId.enabled`
Dongjoon Hyun created SPARK-45869: - Summary: Support `spark.master.useAppNameAsAppId.enabled` Key: SPARK-45869 URL: https://issues.apache.org/jira/browse/SPARK-45869 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun This is an experimental internal configuration for advanced users. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45600) Make Python data source registration session level
[ https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45600: --- Labels: pull-request-available (was: ) > Make Python data source registration session level > -- > > Key: SPARK-45600 > URL: https://issues.apache.org/jira/browse/SPARK-45600 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > Labels: pull-request-available > > Currently, registered data sources are stored in `sharedState` and can be > accessed across multiple sessions. This, however, will not work with Spark > Connect. We should make this registration session level, and support static > registration (e.g. using pip install) in the future. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45600) Make Python data source registration session level
[ https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45600: - Description: Currently, registered data sources are stored in `sharedState` and can be accessed across multiple sessions. This, however, will not work with Spark Connect. We should make this registration session level, and support static registration (e.g. using pip install) in the future. (was: Currently we have added a few instance variables to store information for Python data source reader. We should have a dedicated reader class for Python data source to make the current DataFrameReader clean.) > Make Python data source registration session level > -- > > Key: SPARK-45600 > URL: https://issues.apache.org/jira/browse/SPARK-45600 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Currently, registered data sources are stored in `sharedState` and can be > accessed across multiple sessions. This, however, will not work with Spark > Connect. We should make this registration session level, and support static > registration (e.g. using pip install) in the future. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45600) Make data source registration session level
[ https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45600: - Summary: Make data source registration session level (was: Separate the Python data source logic from DataFrameReader) > Make data source registration session level > --- > > Key: SPARK-45600 > URL: https://issues.apache.org/jira/browse/SPARK-45600 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Currently we have added a few instance variables to store information for > Python data source reader. We should have a dedicated reader class for Python > data source to make the current DataFrameReader clean. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45600) Make Python data source registration session level
[ https://issues.apache.org/jira/browse/SPARK-45600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-45600: - Summary: Make Python data source registration session level (was: Make data source registration session level) > Make Python data source registration session level > -- > > Key: SPARK-45600 > URL: https://issues.apache.org/jira/browse/SPARK-45600 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 4.0.0 >Reporter: Allison Wang >Priority: Major > > Currently we have added a few instance variables to store information for > Python data source reader. We should have a dedicated reader class for Python > data source to make the current DataFrameReader clean. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43905) Consolidate BlockId parsing and creation
[ https://issues.apache.org/jira/browse/SPARK-43905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-43905: --- Labels: pull-request-available (was: ) > Consolidate BlockId parsing and creation > > > Key: SPARK-43905 > URL: https://issues.apache.org/jira/browse/SPARK-43905 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Henry Mai >Priority: Minor > Labels: pull-request-available > > Consolidate BlockId parsing and creation. > This helps to cut down on errors arising from parsing the BlockId and also > eliminates the need to manually synchronize the code across different places > that parse and create BlockIds. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
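For reference, the kind of round trip the consolidation is about: BlockId names should be produced and parsed in one place. The snippet uses the existing public BlockId.apply parser in org.apache.spark.storage; the shuffle block name is just an example.
{code:scala}
import org.apache.spark.storage.{BlockId, ShuffleBlockId}

// Creation and parsing should stay in sync: the name rendered by a BlockId must
// be parseable back into the same BlockId.
val id: BlockId = BlockId("shuffle_0_1_2")
assert(id == ShuffleBlockId(0, 1L, 2))
assert(id.name == "shuffle_0_1_2")
{code}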
[jira] [Updated] (SPARK-44179) When a task failed and the inferred task for that task is still executing, the number of dynamically scheduled executors will be calculated incorrectly
[ https://issues.apache.org/jira/browse/SPARK-44179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44179: --- Labels: pull-request-available (was: ) > When a task failed and the inferred task for that task is still executing, > the number of dynamically scheduled executors will be calculated incorrectly > --- > > Key: SPARK-44179 > URL: https://issues.apache.org/jira/browse/SPARK-44179 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.1 >Reporter: liangyongyuan >Priority: Major > Labels: pull-request-available > > Assuming a stage has Task 1, with Task 1.0 and a speculative task Task 1.1 > running concurrently, the dynamic scheduler calculates the number of > executors as 2 (pendingTask: 0, pendingSpeculative: 0, running: 2). > At this point, Task 1.0 fails, and the dynamic scheduler recalculates the > number of executors as 2 (pendingTask: 1, pendingSpeculative: 0, running: 1). > Due to the failure of Task 1.0, copyRunning(1) becomes 1. As a result, Task 1 > is speculated again and a SparkListenerSpeculativeTaskSubmitted event is > triggered. However, the dynamic scheduler's calculation for the number of > executors becomes 3 (pendingTask: 1, pendingSpeculative: 1, running: 1), > which is obviously not as expected. > Then, Task 1.2 starts, and it is marked as a speculative task. However, the > dynamic scheduler still calculates the number of executors as 3 (pendingTask: > 1, pendingSpeculative: 1, running: 1), which again is not as expected. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44595) Make the user session cache number and cache time be configurable in spark connect service
[ https://issues.apache.org/jira/browse/SPARK-44595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44595: --- Labels: pull-request-available (was: ) > Make the user session cache number and cache time be configurable in spark > connect service > -- > > Key: SPARK-44595 > URL: https://issues.apache.org/jira/browse/SPARK-44595 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Min Zhao >Priority: Minor > Labels: pull-request-available > > Now, the cache size of user session is 100, the cache timeout is 3600. Make > them modifiable to meet diverse scenarios. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45731) Update partition statistics with ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-45731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun resolved SPARK-45731. -- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43629 [https://github.com/apache/spark/pull/43629] > Update partition statistics with ANALYZE TABLE command > -- > > Key: SPARK-45731 > URL: https://issues.apache.org/jira/browse/SPARK-45731 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently the {{ANALYZE TABLE}} command only updates table-level stats but not > partition stats, even though it can be applied to both non-partitioned and > partitioned tables. It seems to make sense for it to update partition stats as > well. > Note that users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, > but the syntax is more verbose as they need to specify all the partition > columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
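To make the difference between the two syntaxes concrete (table, column, and partition names are just examples, assuming an active SparkSession bound to `spark`):
{code:scala}
spark.sql("CREATE TABLE sales (amount INT, dt STRING) USING parquet PARTITIONED BY (dt)")
spark.sql("INSERT INTO sales PARTITION (dt = '2023-11-09') VALUES (1), (2)")

// Table-level stats only (the behavior this ticket changes):
spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

// Existing, more verbose form that also refreshes partition-level stats and
// requires spelling out the partition columns:
spark.sql("ANALYZE TABLE sales PARTITION (dt) COMPUTE STATISTICS")
{code}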
[jira] [Assigned] (SPARK-45731) Update partition statistics with ANALYZE TABLE command
[ https://issues.apache.org/jira/browse/SPARK-45731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chao Sun reassigned SPARK-45731: Assignee: Chao Sun > Update partition statistics with ANALYZE TABLE command > -- > > Key: SPARK-45731 > URL: https://issues.apache.org/jira/browse/SPARK-45731 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.5.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: pull-request-available > > Currently the {{ANALYZE TABLE}} command only updates table-level stats but not > partition stats, even though it can be applied to both non-partitioned and > partitioned tables. It seems to make sense for it to update partition stats as > well. > Note that users can use {{ANALYZE TABLE PARTITION(..)}} to get the same effect, > but the syntax is more verbose as they need to specify all the partition > columns. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45867) Support `spark.worker.idPattern`
[ https://issues.apache.org/jira/browse/SPARK-45867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-45867. --- Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43740 [https://github.com/apache/spark/pull/43740] > Support `spark.worker.idPattern` > > > Key: SPARK-45867 > URL: https://issues.apache.org/jira/browse/SPARK-45867 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45867) Support `spark.worker.idPattern`
[ https://issues.apache.org/jira/browse/SPARK-45867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45867: - Assignee: Dongjoon Hyun > Support `spark.worker.idPattern` > > > Key: SPARK-45867 > URL: https://issues.apache.org/jira/browse/SPARK-45867 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45868) Make spark.table use the same parser with vanilla spark
[ https://issues.apache.org/jira/browse/SPARK-45868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45868: --- Labels: pull-request-available (was: ) > Make spark.table use the same parser with vanilla spark > --- > > Key: SPARK-45868 > URL: https://issues.apache.org/jira/browse/SPARK-45868 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45868) Make spark.table use the same parser with vanilla spark
Ruifeng Zheng created SPARK-45868: - Summary: Make spark.table use the same parser with vanilla spark Key: SPARK-45868 URL: https://issues.apache.org/jira/browse/SPARK-45868 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45867) Support `spark.worker.idPattern`
[ https://issues.apache.org/jira/browse/SPARK-45867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45867: --- Labels: pull-request-available (was: ) > Support `spark.worker.idPattern` > > > Key: SPARK-45867 > URL: https://issues.apache.org/jira/browse/SPARK-45867 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 4.0.0 >Reporter: Dongjoon Hyun >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45867) Support `spark.worker.idPattern`
Dongjoon Hyun created SPARK-45867: - Summary: Support `spark.worker.idPattern` Key: SPARK-45867 URL: https://issues.apache.org/jira/browse/SPARK-45867 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 4.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45866) Reuse of exchange in AQE does not happen when run time filters are pushed down to the underlying Scan ( like iceberg )
Asif created SPARK-45866: Summary: Reuse of exchange in AQE does not happen when run time filters are pushed down to the underlying Scan ( like iceberg ) Key: SPARK-45866 URL: https://issues.apache.org/jira/browse/SPARK-45866 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.5.1 Reporter: Asif In certain types of queries, e.g. TPCDS Query 14b, the reuse of exchange does not happen in AQE, resulting in performance degradation. The Spark TPCDS tests are unable to catch the problem because the InMemoryScan used for testing does not implement equals & hashCode correctly, in the sense that it does not take into account the pushed-down runtime filters. In concrete Scan implementations, e.g. Iceberg's SparkBatchQueryScan, the equality check, apart from other things, also involves the runtime filters pushed (which is correct). In Spark the issue is this: for a given stage being materialized, just before materialization starts, the runtime filters are confined to the BatchScanExec level. Only when the actual RDD corresponding to the BatchScanExec is being evaluated do the runtime filters get pushed to the underlying Scan. Now if a new stage is created and it checks in the stageCache using its canonicalized plan to see if a stage can be reused, it fails to find the reusable stage even if the stage exists, because the canonicalized spark plan present in the stage cache now has the runtime filters pushed to the Scan, so the incoming canonicalized spark plan does not match the key as their underlying scans differ: the incoming spark plan's scan does not have runtime filters, while the canonicalized spark plan present as the key in the stage cache has the scan with runtime filters pushed. The fix I have worked on is to provide two methods in the SupportsRuntimeV2Filtering interface: default boolean equalToIgnoreRuntimeFilters(Scan other) { return this.equals(other); } default int hashCodeIgnoreRuntimeFilters() { return this.hashCode(); } In BatchScanExec, if the scan implements SupportsRuntimeV2Filtering, then instead of batch.equals it should call scan.equalToIgnoreRuntimeFilters, and the underlying Scan implementations should provide equality which excludes runtime filters. Similarly, the hashCode of BatchScanExec should use scan.hashCodeIgnoreRuntimeFilters instead of batch.hashCode. Will be creating a PR with a bug test for review. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
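The proposal above can be modeled outside of Spark with a small self-contained Scala sketch. Everything here (the Scan trait, ToyScan) is a toy stand-in for illustration; only the two method names mirror the proposal, and nothing below is Spark's real DataSource V2 API.

{code:scala}
// Toy model of the proposed equality-ignoring-runtime-filters contract.
trait Scan {
  def equalToIgnoreRuntimeFilters(other: Scan): Boolean = this == other
  def hashCodeIgnoreRuntimeFilters(): Int = this.hashCode()
}

// A scan whose plain equals/hashCode (from the case class) include pushed runtime filters,
// while the *IgnoreRuntimeFilters variants deliberately exclude them.
final case class ToyScan(table: String, runtimeFilters: Seq[String]) extends Scan {
  override def equalToIgnoreRuntimeFilters(other: Scan): Boolean = other match {
    case ToyScan(t, _) => t == table
    case _             => false
  }
  override def hashCodeIgnoreRuntimeFilters(): Int = table.hashCode
}

object ExchangeReuseSketch extends App {
  val beforePushdown = ToyScan("db.t", runtimeFilters = Seq.empty)
  val afterPushdown  = ToyScan("db.t", runtimeFilters = Seq("part_col IN (...)"))

  println(beforePushdown == afterPushdown)                           // false: plain equality breaks reuse
  println(beforePushdown.equalToIgnoreRuntimeFilters(afterPushdown)) // true: filter-ignoring equality allows reuse
}
{code}

A stage-cache lookup keyed on the filter-ignoring equality and hash would then match the canonicalized plan both before and after the filters are pushed down, which is the reuse behavior the ticket is after.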
[jira] [Created] (SPARK-45865) Add user guide for window operations
Allison Wang created SPARK-45865: Summary: Add user guide for window operations Key: SPARK-45865 URL: https://issues.apache.org/jira/browse/SPARK-45865 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add a simple user guide for window operations. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45864) Add user guide for groupby and aggregate
Allison Wang created SPARK-45864: Summary: Add user guide for groupby and aggregate Key: SPARK-45864 URL: https://issues.apache.org/jira/browse/SPARK-45864 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add a simple user guide to showcase common DataFrame operations involving group by and aggregate functions (min, max, count, sum, etc) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45863) Add user guide for column selections
Allison Wang created SPARK-45863: Summary: Add user guide for column selections Key: SPARK-45863 URL: https://issues.apache.org/jira/browse/SPARK-45863 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add a simple user guide for column selections in PySpark. This should cover the following API: lit, df.col, and cover common column operations such as: removing a column from a data frame, adding new columns, dropping a duplicate column, etc. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45862) Add user guide for basic dataframe operations
Allison Wang created SPARK-45862: Summary: Add user guide for basic dataframe operations Key: SPARK-45862 URL: https://issues.apache.org/jira/browse/SPARK-45862 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add a simple user guide for basic DataFrame operations. This user guide should include the following APIs: select, filter, collect, show -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45861) Add user guide for dataframe creation
Allison Wang created SPARK-45861: Summary: Add user guide for dataframe creation Key: SPARK-45861 URL: https://issues.apache.org/jira/browse/SPARK-45861 Project: Spark Issue Type: Sub-task Components: Documentation, PySpark Affects Versions: 4.0.0 Reporter: Allison Wang Add a simple user guide for data frame creation. This user guide should cover the following APIs: # df.createDataFrame # spark.read.format(...) (can be csv, json, parquet) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug
[ https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Emil Ejbyfeldt updated SPARK-45592: --- Description: The following query should return 100 {code:java} import org.apache.spark.storage.StorageLevel val df = spark.range(0, 100, 1, 5).map(l => (l, l)) val ee = df.select($"_1".as("src"), $"_2".as("dst")) .persist(StorageLevel.MEMORY_AND_DISK) ee.count() val minNbrs1 = ee .groupBy("src").agg(min(col("dst")).as("min_number")) .persist(StorageLevel.MEMORY_AND_DISK) val join = ee.join(minNbrs1, "src") join.count(){code} but on spark 3.5.0 there is a correctness bug causing it to return `104800` or some other smaller value. was: The following query should return 100 {code:java} import org.apache.spark.storage.StorageLevelval df = spark.range(0, 100, 1, 5).map(l => (l, l)) val ee = df.select($"_1".as("src"), $"_2".as("dst")) .persist(StorageLevel.MEMORY_AND_DISK) ee.count() val minNbrs1 = ee .groupBy("src").agg(min(col("dst")).as("min_number")) .persist(StorageLevel.MEMORY_AND_DISK) val join = ee.join(minNbrs1, "src") join.count(){code} but on spark 3.5.0 there is a correctness bug causing it to return `104800` or some other smaller value. > AQE and InMemoryTableScanExec correctness bug > - > > Key: SPARK-45592 > URL: https://issues.apache.org/jira/browse/SPARK-45592 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Emil Ejbyfeldt >Assignee: Emil Ejbyfeldt >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 4.0.0, 3.5.1 > > > The following query should return 100 > {code:java} > import org.apache.spark.storage.StorageLevel > val df = spark.range(0, 100, 1, 5).map(l => (l, l)) > val ee = df.select($"_1".as("src"), $"_2".as("dst")) > .persist(StorageLevel.MEMORY_AND_DISK) > ee.count() > val minNbrs1 = ee > .groupBy("src").agg(min(col("dst")).as("min_number")) > .persist(StorageLevel.MEMORY_AND_DISK) > val join = ee.join(minNbrs1, "src") > join.count(){code} > but on spark 3.5.0 there is a correctness bug causing it to return `104800` > or some other smaller value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45860) ClassCastException with SerializedLambda in Spark Cluster Mode
Abhilash created SPARK-45860: Summary: ClassCastException with SerializedLambda in Spark Cluster Mode Key: SPARK-45860 URL: https://issues.apache.org/jira/browse/SPARK-45860 Project: Spark Issue Type: Bug Components: Spark Core, Spark Submit Affects Versions: 3.4.1, 3.2.1 Environment: *Environment* Java Version: 11 Spring Boot Version: 2.7.10 Spark Version: 3.2.1 Reporter: Abhilash h3. Issue Description Running a Spark application in cluster mode results in a `{*}java.lang.ClassCastException{*}` related to `{*}java.lang.invoke.SerializedLambda{*}`. This issue seems to be specific to Spark cluster mode, and it doesn't occur when running the application locally without Spring Boot. h3. Steps to Reproduce # Create a dummy dataset {code:java} Dataset<String> dummyData = spark.createDataset(Arrays.asList("Abhi", "Andrii", "Rick", "Duc"), Encoders.STRING()); {code} # Call the flatMap function to transform the data {code:java} Dataset<TestData> transformedData = dummyData.flatMap(new TestDataFlatMap(), Encoders.bean(TestData.class)); {code} # Call any action on the transformed dataset {code:java} transformedData.show(); {code} # Running this Spark application with the spark-submit command in cluster mode with Spring Boot results in the mentioned ClassCastException. h3. *Complete Code:* {code:java} @SpringBootApplication(exclude = {org.springframework.boot.autoconfigure.gson.GsonAutoConfiguration.class}) public class SampleSparkJob{ public static void main(String[] args) { SpringApplication.run(DataIngestionServiceApplication.class, args); SparkSession spark = SparkSession.builder() .appName("SampleSparkJob") .master("local[*]") .getOrCreate(); Dataset<String> dummyData = spark.createDataset(Arrays.asList("Abhi", "Andrii", "Rick", "Duc"), Encoders.STRING()); Dataset<TestData> transformedData = dummyData.flatMap(new TestDataFlatMap(), Encoders.bean(TestData.class)); transformedData.show(); transformedData.write().mode("append").parquet("outputpath"); spark.stop(); } }{code} {code:java} class TestDataFlatMap implements FlatMapFunction<String, TestData>, Serializable { @Override public Iterator<TestData> call(String name) { return Arrays.asList(new TestData(name)).iterator(); } }{code} {code:java} @Data @AllArgsConstructor public class TestData implements Serializable { private String name; } {code} h3. 
Stack trace: {code:java} WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (10.248.66.38 executor 0): java.lang.ClassCastException: cannot assign instance of java.lang.invoke.SerializedLambda to field org.apache.spark.rdd.MapPartitionsRDD.f of type scala.Function3 in instance of org.apache.spark.rdd.MapPartitionsRDD at java.base/java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2076) at java.base/java.io.ObjectStreamClass$FieldReflector.checkObjectFieldValueTypes(ObjectStreamClass.java:2039) at java.base/java.io.ObjectStreamClass.checkObjFieldValueTypes(ObjectStreamClass.java:1293) at java.base/java.io.ObjectInputStream.defaultCheckFieldValues(ObjectInputStream.java:2512) at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2419) at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228) at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687) at java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2496) at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2390) at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228) at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687) at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:489) at java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:447) at scala.collection.immutable.List$SerializationProxy.readObject(List.scala:527) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at java.base/java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1046) at java.base/java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2357) at java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2228) at java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1687) at
[jira] [Updated] (SPARK-45859) Make UDF objects in ml.functions lazy
[ https://issues.apache.org/jira/browse/SPARK-45859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45859: --- Labels: pull-request-available (was: ) > Make UDF objects in ml.functions lazy > - > > Key: SPARK-45859 > URL: https://issues.apache.org/jira/browse/SPARK-45859 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0, 3.3.0, 3.4.0, 3.5.0, 4.0.0, 3.0, 3.1 >Reporter: Ruifeng Zheng >Priority: Major > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45859) Make UDF objects in ml.functions lazy
Ruifeng Zheng created SPARK-45859: - Summary: Make UDF objects in ml.functions lazy Key: SPARK-45859 URL: https://issues.apache.org/jira/browse/SPARK-45859 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 3.0, 3.1, 3.5.0, 3.4.0, 3.3.0, 3.2.0, 4.0.0 Reporter: Ruifeng Zheng -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784567#comment-17784567 ] Asif commented on SPARK-45658: -- I also think that during canonicalization of DynamicPruningSubquery, the pruning key's canonicalization should be done on the basis of the enclosing Plan which contains the DynamicPruningSubquery Expression > Canonicalization of DynamicPruningSubquery is broken > > > Key: SPARK-45658 > URL: https://issues.apache.org/jira/browse/SPARK-45658 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > The canonicalization of (buildKeys: Seq[Expression]) in the class > DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by > calling > buildKeys.map(_.canonicalized) > The above would result in incorrect canonicalization as it would not be > normalizing the exprIds relative to buildQuery output > The fix is to use the buildQuery : LogicalPlan's output to normalize the > buildKeys expression > as given below, using the standard approach. > buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)), > Will be filing a PR and bug test for the same. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44609) ExecutorPodsAllocator doesn't create new executors if no pod snapshot captured pod creation
[ https://issues.apache.org/jira/browse/SPARK-44609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-44609: --- Labels: pull-request-available (was: ) > ExecutorPodsAllocator doesn't create new executors if no pod snapshot > captured pod creation > --- > > Key: SPARK-44609 > URL: https://issues.apache.org/jira/browse/SPARK-44609 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Scheduler >Affects Versions: 3.4.1 >Reporter: Alibi Yeslambek >Priority: Major > Labels: pull-request-available > > There’s the following race condition in ExecutorPodsAllocator when running a > spark application with static allocation on kubernetes with numExecutors >= 1: > * Driver requests an executor > * exec-1 gets created and registers with driver > * exec-1 is moved from {{newlyCreatedExecutors}} to > {{schedulerKnownNewlyCreatedExecs}} > * exec-1 got deleted very quickly (~1-30 sec) after registration > * {{ExecutorPodsWatchSnapshotSource}} fails to catch the creation of the pod > (e.g. websocket connection was reset, k8s-apiserver was down, etc.) > * {{ExecutorPodsPollingSnapshotSource}} fails to catch the creation because > it runs every 30 secs, but the executor was removed much more quickly after creation > * exec-1 is never removed from {{schedulerKnownNewlyCreatedExecs}} > * {{ExecutorPodsAllocator}} will never request a new executor because its > slot is occupied by exec-1, due to {{schedulerKnownNewlyCreatedExecs}} never > being cleared. > > Put up a fix here https://github.com/apache/spark/pull/42297 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
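The leaked-slot bookkeeping described in the bullet list can be sketched with two sets whose names mirror the description. This is a toy model under the assumption of numExecutors = 1, not the actual ExecutorPodsAllocator code.

{code:scala}
import scala.collection.mutable

object LeakedSlotSketch extends App {
  val maxExecutors = 1
  val newlyCreatedExecutors = mutable.Set.empty[String]
  val schedulerKnownNewlyCreatedExecs = mutable.Set.empty[String]

  // exec-1 is requested, created, and registers with the driver:
  newlyCreatedExecutors += "exec-1"
  schedulerKnownNewlyCreatedExecs += "exec-1"
  newlyCreatedExecutors -= "exec-1"

  // exec-1 is deleted shortly after, but neither the watch nor the polling snapshot source
  // ever reported its creation, so nothing clears it from schedulerKnownNewlyCreatedExecs.

  def outstanding: Int = newlyCreatedExecutors.size + schedulerKnownNewlyCreatedExecs.size

  // The allocator believes its only slot is still taken and never requests a replacement:
  println(s"outstanding=$outstanding, canRequestMore=${outstanding < maxExecutors}") // outstanding=1, canRequestMore=false
}
{code}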
[jira] [Assigned] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug
[ https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-45592: - Assignee: Emil Ejbyfeldt (was: Apache Spark) > AQE and InMemoryTableScanExec correctness bug > - > > Key: SPARK-45592 > URL: https://issues.apache.org/jira/browse/SPARK-45592 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Emil Ejbyfeldt >Assignee: Emil Ejbyfeldt >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 4.0.0, 3.5.1 > > > The following query should return 100 > {code:java} > import org.apache.spark.storage.StorageLevelval > df = spark.range(0, 100, 1, 5).map(l => (l, l)) > val ee = df.select($"_1".as("src"), $"_2".as("dst")) > .persist(StorageLevel.MEMORY_AND_DISK) > ee.count() > val minNbrs1 = ee > .groupBy("src").agg(min(col("dst")).as("min_number")) > .persist(StorageLevel.MEMORY_AND_DISK) > val join = ee.join(minNbrs1, "src") > join.count(){code} > but on spark 3.5.0 there is a correctness bug causing it to return `104800` > or some other smaller value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug
[ https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45592: -- Target Version/s: 3.4.2 > AQE and InMemoryTableScanExec correctness bug > - > > Key: SPARK-45592 > URL: https://issues.apache.org/jira/browse/SPARK-45592 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Emil Ejbyfeldt >Assignee: Apache Spark >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 4.0.0, 3.5.1 > > > The following query should return 100 > {code:java} > import org.apache.spark.storage.StorageLevelval > df = spark.range(0, 100, 1, 5).map(l => (l, l)) > val ee = df.select($"_1".as("src"), $"_2".as("dst")) > .persist(StorageLevel.MEMORY_AND_DISK) > ee.count() > val minNbrs1 = ee > .groupBy("src").agg(min(col("dst")).as("min_number")) > .persist(StorageLevel.MEMORY_AND_DISK) > val join = ee.join(minNbrs1, "src") > join.count(){code} > but on spark 3.5.0 there is a correctness bug causing it to return `104800` > or some other smaller value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45592) AQE and InMemoryTableScanExec correctness bug
[ https://issues.apache.org/jira/browse/SPARK-45592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-45592: -- Affects Version/s: 3.4.1 > AQE and InMemoryTableScanExec correctness bug > - > > Key: SPARK-45592 > URL: https://issues.apache.org/jira/browse/SPARK-45592 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 >Reporter: Emil Ejbyfeldt >Assignee: Apache Spark >Priority: Blocker > Labels: correctness, pull-request-available > Fix For: 4.0.0, 3.5.1 > > > The following query should return 100 > {code:java} > import org.apache.spark.storage.StorageLevelval > df = spark.range(0, 100, 1, 5).map(l => (l, l)) > val ee = df.select($"_1".as("src"), $"_2".as("dst")) > .persist(StorageLevel.MEMORY_AND_DISK) > ee.count() > val minNbrs1 = ee > .groupBy("src").agg(min(col("dst")).as("min_number")) > .persist(StorageLevel.MEMORY_AND_DISK) > val join = ee.join(minNbrs1, "src") > join.count(){code} > but on spark 3.5.0 there is a correctness bug causing it to return `104800` > or some other smaller value. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45858) Consistent FetchFailed/NoSuchFileExceptions when decommissioning is enabled
Alibi Yeslambek created SPARK-45858: --- Summary: Consistent FetchFailed/NoSuchFileExceptions when decommissioning is enabled Key: SPARK-45858 URL: https://issues.apache.org/jira/browse/SPARK-45858 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.5.0 Reporter: Alibi Yeslambek Decommissioning causes FetchFailures with NoSuchFileException due to multiple tasks on the same partition from different stage attempts sharing a single MapStatus object. Is there any workaround/config flag that I’m missing that will fix the issue, or is this rather a bug? *Example* Here are the same tasks from different stage attempts for the same partition: {code:java} INFO [2023-11-07T17:50:03.399091Z] org.apache.spark.scheduler.TaskSetManager: Starting task 16.0 in stage 11.1 (TID 1810) (10.0.158.211, executor 5, partition 81, PROCESS_LOCAL, 4743 bytes) taskResourceAssignments Map() INFO [2023-11-07T17:51:20.229168Z] org.apache.spark.scheduler.TaskSetManager: Starting task 13.0 in stage 11.2 (TID 1836) (10.0.187.67, executor 6, partition 81, PROCESS_LOCAL, 4743 bytes) taskResourceAssignments Map() {code} The latest mapStatus.location for partition 81 will be that of the latest succeeded task (exec-6), i.e.: {code:java} mapStatus(81).location = BlockManagerId(6, 10.0.187.67, 7079, None){code} Which means that multiple MapIDs point to the same MapIndex and share one MapStatus object. In this example: {code:java} mapIdToMapIndex(1810) = 81 mapIdToMapIndex(1836) = 81 {code} Now if we decommission exec-5, all of its blocks (including 1810) will be migrated and the driver mapStatuses will be updated. {code:java} INFO [2023-11-07T17:57:23.545274Z] org.apache.spark.ShuffleStatus: Updating map output for 1810 to BlockManagerId(4, 10.0.153.179, 7079, None){code} Which updates mapStatus.location for partition 81 to exec-4: {code:java} mapStatus(81).location = BlockManagerId(4, 10.0.153.179, 7079, None){code} And when a task from a different stage tries to fetch the block for {{MapId: 1836}}, the driver will return its location as exec-4, whereas in fact it is still on exec-6. The task will fail with a FetchFailure caused by NoSuchFileException, because the actual block is on exec-6. 
{code:java} WARN [2023-11-07T17:58:40.008602Z] org.apache.spark.scheduler.TaskSetManager: Lost task 14.0 in stage 16.0 (TID 1988) (10.0.156.83 executor 24): FetchFailed(BlockManagerId(4, 10.0.153.179, 7079, None), shuffleId=5, mapIndex=81, mapId=1836, reduceId=84, message= org.apache.spark.shuffle.FetchFailedException at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:1167) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:903) at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:84) at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.sort_addToSorter_0$(generated.java:31) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage10.processNext(generated.java:43) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:776) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage12.smj_findNextJoinRows_0$(generated.java:40) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage12.processNext(generated.java:101) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$2.hasNext(WholeStageCodegenExec.scala:795) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140) at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52) at org.apache.spark.scheduler.Task.run(Task.scala:131) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$1(Executor.scala:516) at
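The shared-MapStatus behavior described above can be reduced to a few lines. The sketch below is a toy model with made-up types; the real bookkeeping lives in Spark's MapOutputTracker/ShuffleStatus, and only the IDs (1810, 1836, partition 81, executor names) are taken from the example.

{code:scala}
// Toy model: two map task attempts (mapIds) for the same partition index share one status object.
final class ToyMapStatus(var location: String)

object SharedMapStatusSketch extends App {
  val statusForPartition81 = new ToyMapStatus("exec-6")   // written by the latest successful attempt (TID 1836)
  val mapIdToMapIndex = Map(1810L -> 81, 1836L -> 81)     // both attempts resolve to index 81
  val mapStatuses = Map(81 -> statusForPartition81)

  // Decommissioning migrates the blocks of mapId 1810 and "updates" partition 81's location:
  mapStatuses(mapIdToMapIndex(1810L)).location = "exec-4"

  // A later fetch for mapId 1836 is now also directed to exec-4, even though its shuffle file
  // still lives on exec-6; this is the FetchFailed / NoSuchFileException scenario above.
  println(mapStatuses(mapIdToMapIndex(1836L)).location)   // exec-4
}
{code}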
[jira] [Updated] (SPARK-45658) Canonicalization of DynamicPruningSubquery is broken
[ https://issues.apache.org/jira/browse/SPARK-45658?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45658: --- Labels: pull-request-available (was: ) > Canonicalization of DynamicPruningSubquery is broken > > > Key: SPARK-45658 > URL: https://issues.apache.org/jira/browse/SPARK-45658 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0 >Reporter: Asif >Priority: Major > Labels: pull-request-available > > The canonicalization of (buildKeys: Seq[Expression]) in the class > DynamicPruningSubquery is broken, as the buildKeys are canonicalized just by > calling > buildKeys.map(_.canonicalized) > The above would result in incorrect canonicalization as it would not be > normalizing the exprIds relative to buildQuery output > The fix is to use the buildQuery : LogicalPlan's output to normalize the > buildKeys expression > as given below, using the standard approach. > buildKeys.map(QueryPlan.normalizeExpressions(_, buildQuery.output)), > Will be filing a PR and bug test for the same. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45857) Enforce the error classes in sub-classes of AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-45857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-45857: - Description: Make the error class in sub-classes of AnalysisException mandatory to enforce callers to always set it. This simplifies migration on error classes. (was: Make the error class in sub-classes of ParseException mandatory to enforce callers to always set it. This simplifies migration on error classes.) > Enforce the error classes in sub-classes of AnalysisException > - > > Key: SPARK-45857 > URL: https://issues.apache.org/jira/browse/SPARK-45857 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Make the error class in sub-classes of AnalysisException mandatory to enforce > callers to always set it. This simplifies migration on error classes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45857) Enforce the error classes in sub-classes of AnalysisException
Max Gekk created SPARK-45857: Summary: Enforce the error classes in sub-classes of AnalysisException Key: SPARK-45857 URL: https://issues.apache.org/jira/browse/SPARK-45857 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 4.0.0 Reporter: Max Gekk Assignee: Max Gekk Fix For: 4.0.0 Make the error class in ParseException mandatory to enforce callers to always set it. This simplifies migration on error classes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45857) Enforce the error classes in sub-classes of AnalysisException
[ https://issues.apache.org/jira/browse/SPARK-45857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Max Gekk updated SPARK-45857: - Description: Make the error class in sub-classes of ParseException mandatory to enforce callers to always set it. This simplifies migration on error classes. (was: Make the error class in ParseException mandatory to enforce callers to always set it. This simplifies migration on error classes.) > Enforce the error classes in sub-classes of AnalysisException > - > > Key: SPARK-45857 > URL: https://issues.apache.org/jira/browse/SPARK-45857 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Max Gekk >Assignee: Max Gekk >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Make the error class in sub-classes of ParseException mandatory to enforce > callers to always set it. This simplifies migration on error classes. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45855) Unable to set compression codec for Hive CTAS
[ https://issues.apache.org/jira/browse/SPARK-45855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Robertson resolved SPARK-45855. --- Resolution: Fixed I found this is fixed in 3.5.0 and I strongly suspect it is caused by the same thing documented and fixed in #43504 > Unable to set compression codec for Hive CTAS > - > > Key: SPARK-45855 > URL: https://issues.apache.org/jira/browse/SPARK-45855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 > Environment: Spark 3.4.0 > Stackable.tech release 23.7.0 which runs spark on K8s. >Reporter: Tim Robertson >Priority: Major > Fix For: 3.5.0 > > > Hi, > We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't > find anything in the release notes to indicate why, so I wonder if this is a > bug. Thank you for looking. > Here we're using our own custom codec, but we noticed we can't set gzip > either. > {{ SparkConf conf = spark.sparkContext().conf();}} > {{ conf.set("hive.exec.compress.output", "true");}} > {{ conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }} > {{ spark.sql("CREATE TABLE b AS SELECT id FROM a");}} > This will create the table, but it writes uncompressed files, where Spark > 3.3.0 would write compressed files. > Any advice is appreciated and I can help run tests. We run Spark on K8S using > the stackable.tech distribution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45855) Unable to set compression codec for Hive CTAS
[ https://issues.apache.org/jira/browse/SPARK-45855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784533#comment-17784533 ] Tim Robertson commented on SPARK-45855: --- I suspect it is this https://issues.apache.org/jira/browse/SPARK-43504 > Unable to set compression codec for Hive CTAS > - > > Key: SPARK-45855 > URL: https://issues.apache.org/jira/browse/SPARK-45855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 > Environment: Spark 3.4.0 > Stackable.tech release 23.7.0 which runs spark on K8s. >Reporter: Tim Robertson >Priority: Major > > Hi, > We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't > find anything in the release notes to indicate why, so I wonder if this is a > bug. Thank you for looking. > Here we're using our own custom codec, but we noticed we can't set gzip > either. > {{ SparkConf conf = spark.sparkContext().conf();}} > {{ conf.set("hive.exec.compress.output", "true");}} > {{ conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }} > {{ spark.sql("CREATE TABLE b AS SELECT id FROM a");}} > This will create the table, but it writes uncompressed files, where Spark > 3.3.0 would write compressed files. > Any advice is appreciated and I can help run tests. We run Spark on K8S using > the stackable.tech distribution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45855) Unable to set compression codec for Hive CTAS
[ https://issues.apache.org/jira/browse/SPARK-45855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Robertson updated SPARK-45855: -- Fix Version/s: 3.5.0 > Unable to set compression codec for Hive CTAS > - > > Key: SPARK-45855 > URL: https://issues.apache.org/jira/browse/SPARK-45855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 > Environment: Spark 3.4.0 > Stackable.tech release 23.7.0 which runs spark on K8s. >Reporter: Tim Robertson >Priority: Major > Fix For: 3.5.0 > > > Hi, > We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't > find anything in the release notes to indicate why, so I wonder if this is a > bug. Thank you for looking. > Here we're using our own custom codec, but we noticed we can't set gzip > either. > {{ SparkConf conf = spark.sparkContext().conf();}} > {{ conf.set("hive.exec.compress.output", "true");}} > {{ conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }} > {{ spark.sql("CREATE TABLE b AS SELECT id FROM a");}} > This will create the table, but it writes uncompressed files, where Spark > 3.3.0 would write compressed files. > Any advice is appreciated and I can help run tests. We run Spark on K8S using > the stackable.tech distribution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45855) Unable to set compression codec for Hive CTAS
[ https://issues.apache.org/jira/browse/SPARK-45855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784530#comment-17784530 ] Tim Robertson commented on SPARK-45855: --- This also seems to fail with 3.4.1 but seems to be fixed in 3.5.0. I'm yet to find out why, so I can link it and close this. > Unable to set compression codec for Hive CTAS > - > > Key: SPARK-45855 > URL: https://issues.apache.org/jira/browse/SPARK-45855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 > Environment: Spark 3.4.0 > Stackable.tech release 23.7.0 which runs spark on K8s. >Reporter: Tim Robertson >Priority: Major > > Hi, > We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't > find anything in the release notes to indicate why, so I wonder if this is a > bug. Thank you for looking. > Here we're using our own custom codec, but we noticed we can't set gzip > either. > {{ SparkConf conf = spark.sparkContext().conf();}} > {{ conf.set("hive.exec.compress.output", "true");}} > {{ conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }} > {{ spark.sql("CREATE TABLE b AS SELECT id FROM a");}} > This will create the table, but it writes uncompressed files, where Spark > 3.3.0 would write compressed files. > Any advice is appreciated and I can help run tests. We run Spark on K8S using > the stackable.tech distribution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45856) Move ArtifactManager from Spark Connect into SparkSession (sql/core)
[ https://issues.apache.org/jira/browse/SPARK-45856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45856: --- Labels: pull-request-available (was: ) > Move ArtifactManager from Spark Connect into SparkSession (sql/core) > > > Key: SPARK-45856 > URL: https://issues.apache.org/jira/browse/SPARK-45856 > Project: Spark > Issue Type: Improvement > Components: Connect, SQL >Affects Versions: 4.0.0 >Reporter: Venkata Sai Akhil Gudesa >Priority: Major > Labels: pull-request-available > > The `ArtifactManager` that currently lies in the connect package can be moved > into the wider sql/core package (e.g SparkSession) to expand the scope. This > is possible because the `ArtifactManager` is tied solely to the > `SparkSession#sessionUUID` and hence can be cleanly detached from Spark > Connect and be made generally available. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45855) Unable to set compression codec for Hive CTAS
[ https://issues.apache.org/jira/browse/SPARK-45855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Robertson updated SPARK-45855: -- Summary: Unable to set compression codec for Hive CTAS (was: Unable to set codec for Hive CTAS) > Unable to set compression codec for Hive CTAS > - > > Key: SPARK-45855 > URL: https://issues.apache.org/jira/browse/SPARK-45855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 > Environment: Spark 3.4.0 > Stackable.tech release 23.7.0 which runs spark on K8s. >Reporter: Tim Robertson >Priority: Major > > Hi, > We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't > find anything in the release notes to indicate why, so I wonder if this is a > bug. Thank you for looking. > Here we're using our own custom codec, but we noticed we can't set gzip > either. > {{ SparkConf conf = spark.sparkContext().conf();}} > {{ conf.set("hive.exec.compress.output", "true");}} > {{ conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }} > {{ spark.sql("CREATE TABLE b AS SELECT id FROM a");}} > This will create the table, but it writes uncompressed files, where Spark > 3.3.0 would write compressed files. > Any advice is appreciated and I can help run tests. We run Spark on K8S using > the stackable.tech distribution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45855) Unable to set codec for Hive CTAS
[ https://issues.apache.org/jira/browse/SPARK-45855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Robertson updated SPARK-45855: -- Description: Hi, We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't find anything in the release notes to indicate why, so I wonder if this is a bug. Thank you for looking. Here we're using our own custom codec, but we noticed we can't set gzip either. {{ SparkConf conf = spark.sparkContext().conf();}} {{ conf.set("hive.exec.compress.output", "true");}} {{ conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }} {{ spark.sql("CREATE TABLE b AS SELECT id FROM a");}} This will create the table, but it writes uncompressed files, where Spark 3.3.0 would write compressed files. Any advice is appreciated and I can help run tests. We run Spark on K8S using the stackable.tech distribution. was: Hi, We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't find anything in the release notes to indicate why, so I wonder if this is a bug. Thank you for looking. Here we're using our own custom codec, but we noticed we can't set gzip either. {{ SparkConf conf = spark.sparkContext().conf();}} {{ conf.set("hive.exec.compress.output", "true");}} {{ conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }} {{ spark.sql("CREATE TABLE b AS SELECT id FROM a");}} Any advice is appreciated and I can help run tests. We run Spark on K8S using the stackable.tech distribution. > Unable to set codec for Hive CTAS > - > > Key: SPARK-45855 > URL: https://issues.apache.org/jira/browse/SPARK-45855 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 > Environment: Spark 3.4.0 > Stackable.tech release 23.7.0 which runs spark on K8s. >Reporter: Tim Robertson >Priority: Major > > Hi, > We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't > find anything in the release notes to indicate why, so I wonder if this is a > bug. Thank you for looking. > Here we're using our own custom codec, but we noticed we can't set gzip > either. > {{ SparkConf conf = spark.sparkContext().conf();}} > {{ conf.set("hive.exec.compress.output", "true");}} > {{ conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }} > {{ spark.sql("CREATE TABLE b AS SELECT id FROM a");}} > This will create the table, but it writes uncompressed files, where Spark > 3.3.0 would write compressed files. > Any advice is appreciated and I can help run tests. We run Spark on K8S using > the stackable.tech distribution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45855) Unable to set codec for Hive CTAS
Tim Robertson created SPARK-45855: - Summary: Unable to set codec for Hive CTAS Key: SPARK-45855 URL: https://issues.apache.org/jira/browse/SPARK-45855 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Environment: Spark 3.4.0 Stackable.tech release 23.7.0 which runs spark on K8s. Reporter: Tim Robertson Hi, We've discovered code that worked in Spark 3.3.0 doesn't in 3.4.0. I can't find anything in the release notes to indicate why, so I wonder if this is a bug. Thank you for looking. Here we're using our own custom codec, but we noticed we can't set gzip either. {{ SparkConf conf = spark.sparkContext().conf();}} {{ conf.set("hive.exec.compress.output", "true");}} {{ conf.set("mapred.output.compression.codec", D2Codec.class.getName()); }} {{ spark.sql("CREATE TABLE b AS SELECT id FROM a");}} Any advice is appreciated and I can help run tests. We run Spark on K8S using the stackable.tech distribution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45849) Remove unnecessary toSeq when encoding Set to catalyst
[ https://issues.apache.org/jira/browse/SPARK-45849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45849: --- Labels: pull-request-available (was: ) > Remove unnecessary toSeq when encoding Set to catalyst > -- > > Key: SPARK-45849 > URL: https://issues.apache.org/jira/browse/SPARK-45849 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: Emil Ejbyfeldt >Priority: Minor > Labels: pull-request-available > > Currently when encoding Sets to catalyst we first convert them into a Seq. > There is no good reason to do this, as the interface we are targeting for > encoding is only `Iterable`, which is implemented by Set. So by using Iterable > instead of Seq in some places we should be able to avoid this extra copy when > encoding Sets. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
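A small self-contained Scala sketch of the point being made: code that only needs to iterate can accept Iterable directly, so passing a Set does not require a toSeq copy first. The encodeAll helper is a made-up stand-in for the catalyst serializer path, not Spark code.

{code:scala}
object IterableEncodingSketch extends App {
  // Stand-in for a serializer loop: it only needs to iterate the input once.
  def encodeAll[T](values: Iterable[T])(encodeOne: T => String): Array[String] =
    values.iterator.map(encodeOne).toArray

  val s = Set(1, 2, 3)

  val viaSeq = encodeAll(s.toSeq)(_.toString) // extra intermediate Seq copy
  val direct = encodeAll(s)(_.toString)       // Set is already an Iterable; no copy needed

  println(viaSeq.sorted.sameElements(direct.sorted)) // true: identical result, one less allocation
}
{code}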
[jira] [Created] (SPARK-45854) spark.catalog.listTables fails with ParseException after upgrading to Spark 3.4.1 from 3.3.1
Andrej Zachar created SPARK-45854: - Summary: spark.catalog.listTables fails with ParseException after upgrading to Spark 3.4.1 from 3.3.1 Key: SPARK-45854 URL: https://issues.apache.org/jira/browse/SPARK-45854 Project: Spark Issue Type: Bug Components: PySpark, Spark Core, Spark Submit Affects Versions: 3.4.1, 3.4.0 Reporter: Andrej Zachar After upgrading to Spark 3.4.1, the listTables() method in PySpark now throws a ParseException with the message "Syntax error at or near end of input.". This did not occur in previous versions of Spark, such as 3.3.1. Install Spark version 3.4.1. Run pyspark ```bash {{pyspark --packages io.delta:delta-core_2.12:2.4.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"}} ``` Attempt to list tables using ```console {{spark.range(1).createTempView("test_view")}} {{spark.catalog.listTables()}} ``` Expected result: The listTables() method should return a list of tables without throwing any exceptions. Actual result: {{Traceback (most recent call last):}} {{File "", line 1, in }} {{File ".venv/lib/python3.10/site-packages/pyspark/sql/catalog.py", line 302, in listTables}} {{iter = self._jcatalog.listTables(dbName).toLocalIterator()}} {{File ".venv/lib/python3.10/site-packages/pyspark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in _{_}call{_}_}} {{File ".venv/lib/python3.10/site-packages/pyspark/errors/exceptions/captured.py", line 175, in deco}} {{raise converted from None}} {{pyspark.errors.exceptions.captured.ParseException:}} {{[PARSE_SYNTAX_ERROR] Syntax error at or near end of input.(line 1, pos 0)}} == SQL == ^^^ >>> The same code worked correctly in Spark version 3.3.1. No changes were made to the code aside from upgrading Spark. Thank you for considering this issue! Any assistance in resolving it would be greatly appreciated. Best regards, Andrej -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45853) Add Iceberg and Hudi to third party projects
Yuming Wang created SPARK-45853: --- Summary: Add Iceberg and Hudi to third party projects Key: SPARK-45853 URL: https://issues.apache.org/jira/browse/SPARK-45853 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 4.0.0 Reporter: Yuming Wang {noformat} Error: org.apache.hive.service.cli.HiveSQLException: Error running query: java.util.concurrent.ExecutionException: org.apache.spark.SparkClassNotFoundException: [DATA_SOURCE_NOT_FOUND] Failed to find the data source: iceberg. Please find packages at `https://spark.apache.org/third-party-projects.html`. at org.apache.spark.sql.hive.thriftserver.HiveThriftServerErrors$.runningQueryError(HiveThriftServerErrors.scala:46) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.org$apache$spark$sql$hive$thriftserver$SparkExecuteStatementOperation$$execute(SparkExecuteStatementOperation.scala:262) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.$anonfun$run$2(SparkExecuteStatementOperation.scala:166) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties(SparkOperation.scala:79) at org.apache.spark.sql.hive.thriftserver.SparkOperation.withLocalProperties$(SparkOperation.scala:63) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withLocalProperties(SparkExecuteStatementOperation.scala:41) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:166) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2$$anon$3.run(SparkExecuteStatementOperation.scala:161) at java.base/java.security.AccessController.doPrivileged(AccessController.java:712) at java.base/javax.security.auth.Subject.doAs(Subject.java:439) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1878) at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$anon$2.run(SparkExecuteStatementOperation.scala:175) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) at java.base/java.lang.Thread.run(Thread.java:833) {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45850) Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc driver version
[ https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45850: --- Labels: pull-request-available (was: ) > Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc > driver version > --- > > Key: SPARK-45850 > URL: https://issues.apache.org/jira/browse/SPARK-45850 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > Labels: pull-request-available > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45852) Gracefully deal with recursion exception during Spark Connect logging
[ https://issues.apache.org/jira/browse/SPARK-45852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated SPARK-45852: --- Labels: pull-request-available (was: ) > Gracefully deal with recursion exception during Spark Connect logging > - > > Key: SPARK-45852 > URL: https://issues.apache.org/jira/browse/SPARK-45852 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Martin Grund >Priority: Major > Labels: pull-request-available > > ``` > from google.protobuf.text_format import MessageToString > from pyspark.sql.functions import col, lit > df = spark.range(10) > for x in range(800): > df = df.withColumn(f"next{x}", lit(1)) > MessageToString(df._plan.to_proto(spark._client), as_one_line=True) > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45852) Gracefully deal with recursion exception during Spark Connect logging
Martin Grund created SPARK-45852: Summary: Gracefully deal with recursion exception during Spark Connect logging Key: SPARK-45852 URL: https://issues.apache.org/jira/browse/SPARK-45852 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.5.0 Reporter: Martin Grund ``` from google.protobuf.text_format import MessageToString from pyspark.sql.functions import col, lit df = spark.range(10) for x in range(800): df = df.withColumn(f"next{x}", lit(1)) MessageToString(df._plan.to_proto(spark._client), as_one_line=True) ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
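The repro in the report loses its indentation in the mail body; a minimal runnable form, assuming an active Spark Connect session bound to `spark`, is sketched below. Stacking ~800 projections makes the generated plan proto deeply nested, and rendering it to text can hit Python's recursion limit, which is the failure the Connect logging path should handle gracefully:
{code:python}
from google.protobuf.text_format import MessageToString
from pyspark.sql.functions import lit

# Assumes `spark` is a Spark Connect session (pyspark.sql.connect).
df = spark.range(10)
for x in range(800):
    # Each withColumn wraps the previous plan, nesting the proto one level deeper.
    df = df.withColumn(f"next{x}", lit(1))

# Rendering the deeply nested plan to text can raise a RecursionError;
# the Connect client logging should not propagate that to the user.
MessageToString(df._plan.to_proto(spark._client), as_one_line=True)
{code}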
[jira] [Assigned] (SPARK-45815) Provide an interface for Streaming sources to add _metadata columns
[ https://issues.apache.org/jira/browse/SPARK-45815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-45815: --- Assignee: Yaohua Zhao > Provide an interface for Streaming sources to add _metadata columns > --- > > Key: SPARK-45815 > URL: https://issues.apache.org/jira/browse/SPARK-45815 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.5.1 >Reporter: Yaohua Zhao >Assignee: Yaohua Zhao >Priority: Major > Labels: pull-request-available > > Currently, only the native V1 file-based streaming source can read the > `_metadata` column: > [https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63] > > Our goal is to create an interface that allows other streaming sources to add > `{{{}_metadata`{}}} columns. For instance, we would like the Delta Streaming > source, which you can find here: > [https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49], > to extend this interface and provide the `{{{}_metadata`{}}} column for its > underlying storage format, such as Parquet. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45815) Provide an interface for Streaming sources to add _metadata columns
[ https://issues.apache.org/jira/browse/SPARK-45815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-45815. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 43692 [https://github.com/apache/spark/pull/43692] > Provide an interface for Streaming sources to add _metadata columns > --- > > Key: SPARK-45815 > URL: https://issues.apache.org/jira/browse/SPARK-45815 > Project: Spark > Issue Type: Improvement > Components: SQL, Structured Streaming >Affects Versions: 3.5.1 >Reporter: Yaohua Zhao >Assignee: Yaohua Zhao >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > Currently, only the native V1 file-based streaming source can read the > `_metadata` column: > [https://github.com/apache/spark/blob/370870b7a0303e4a2c4b3dea1b479b4fcbc93f8d/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingRelation.scala#L63] > > Our goal is to create an interface that allows other streaming sources to add > `{{{}_metadata`{}}} columns. For instance, we would like the Delta Streaming > source, which you can find here: > [https://github.com/delta-io/delta/blob/master/spark/src/main/scala/org/apache/spark/sql/delta/sources/DeltaDataSource.scala#L49], > to extend this interface and provide the `{{{}_metadata`{}}} column for its > underlying storage format, such as Parquet. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
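For reference, the existing behavior the description mentions can be exercised from PySpark against a V1 file-based streaming source; the paths and schema below are hypothetical, and only file sources expose `_metadata` today, which is exactly the gap the proposed interface addresses:
{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("metadata-column-demo").getOrCreate()

# File-based streaming sources expose a hidden _metadata struct column;
# the ticket proposes an interface so non-file sources (e.g. Delta) can do the same.
stream = (
    spark.readStream
    .format("json")
    .schema("id LONG, value STRING")      # streaming file reads need an explicit schema
    .load("/tmp/metadata-demo/input")     # hypothetical input directory
    .select("id", "value", "_metadata.file_path", "_metadata.file_size")
)

query = (
    stream.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/metadata-demo/checkpoint")  # hypothetical path
    .start()
)
{code}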
[jira] [Resolved] (SPARK-44886) Introduce CLUSTER BY SQL clause to CREATE/REPLACE TABLE
[ https://issues.apache.org/jira/browse/SPARK-44886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-44886. - Fix Version/s: 4.0.0 Resolution: Fixed Issue resolved by pull request 42577 [https://github.com/apache/spark/pull/42577] > Introduce CLUSTER BY SQL clause to CREATE/REPLACE TABLE > --- > > Key: SPARK-44886 > URL: https://issues.apache.org/jira/browse/SPARK-44886 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > > This proposes to introduce CLUSTER BY clause to CREATE/REPLACE SQL syntax: > {code:java} > CREATE TABLE tbl(a int, b string) CLUSTER BY (a, b){code} > This doesn't introduce a default implementation for clustering, but it's up > to the catalog/datasource implementation to utilize the clustering > information (e.g., Delta, Iceberg, etc.). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-44886) Introduce CLUSTER BY SQL clause to CREATE/REPLACE TABLE
[ https://issues.apache.org/jira/browse/SPARK-44886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-44886: --- Assignee: Terry Kim > Introduce CLUSTER BY SQL clause to CREATE/REPLACE TABLE > --- > > Key: SPARK-44886 > URL: https://issues.apache.org/jira/browse/SPARK-44886 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 4.0.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Labels: pull-request-available > > This proposes to introduce CLUSTER BY clause to CREATE/REPLACE SQL syntax: > {code:java} > CREATE TABLE tbl(a int, b string) CLUSTER BY (a, b){code} > This doesn't introduce a default implementation for clustering, but it's up > to the catalog/datasource implementation to utilize the clustering > information (e.g., Delta, Iceberg, etc.). -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
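A short sketch of driving the new clause from PySpark, using the same syntax as the example in the description; the REPLACE variant assumes the target catalog supports REPLACE TABLE, and in all cases the catalog/data source must act on the clustering information itself:
{code:python}
# Requires a Spark build with SPARK-44886; table names are illustrative.
spark.sql("CREATE TABLE tbl (a INT, b STRING) CLUSTER BY (a, b)")

# REPLACE TABLE only works against a catalog that supports it (e.g. a V2 catalog).
spark.sql("CREATE OR REPLACE TABLE tbl (a INT, b STRING) CLUSTER BY (a)")
{code}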
[jira] [Created] (SPARK-45851) (Scala) Support different retry policies for connect client
Alice Sayutina created SPARK-45851: -- Summary: (Scala) Support different retry policies for connect client Key: SPARK-45851 URL: https://issues.apache.org/jira/browse/SPARK-45851 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 4.0.0 Reporter: Alice Sayutina Support multiple retry policies defined at the same time. Each policy determines which error types it can retry and how exactly. For instance, networking errors should generally be retried differently than a remote resource not yet being available. Relevant Python ticket: SPARK-45733 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45733) (Python) Support different retry policies for connect client
[ https://issues.apache.org/jira/browse/SPARK-45733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alice Sayutina updated SPARK-45733: --- Summary: (Python) Support different retry policies for connect client (was: Classify errors into different classes and support different retry policies.) > (Python) Support different retry policies for connect client > > > Key: SPARK-45733 > URL: https://issues.apache.org/jira/browse/SPARK-45733 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 4.0.0 >Reporter: Alice Sayutina >Priority: Major > Labels: pull-request-available > > Support multiple retry policies defined at the same time. Each policy > determines which error types it can retry and how exactly. > For instance, networking errors should generally be retried differently than > a remote resource not yet being available. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
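To make the design concrete, here is a minimal, hypothetical sketch of "several retry policies active at once, each owning its own error classification and backoff"; the class and function names are invented for illustration and are not the actual pyspark.sql.connect client API:
{code:python}
import random
import time


class RetryPolicy:
    """Hypothetical policy: decides which errors it covers and how to back off."""

    def __init__(self, name, can_retry, max_retries, initial_backoff_s, multiplier):
        self.name = name
        self.can_retry = can_retry          # predicate: Exception -> bool
        self.max_retries = max_retries
        self.initial_backoff_s = initial_backoff_s
        self.multiplier = multiplier


def retry_with_policies(operation, policies):
    attempts = {p.name: 0 for p in policies}
    backoff = {p.name: p.initial_backoff_s for p in policies}
    while True:
        try:
            return operation()
        except Exception as e:  # sketch only; real code would narrow this
            # The first policy that claims the error and still has budget handles it.
            policy = next((p for p in policies
                           if p.can_retry(e) and attempts[p.name] < p.max_retries), None)
            if policy is None:
                raise
            attempts[policy.name] += 1
            time.sleep(backoff[policy.name] * random.uniform(0.5, 1.5))
            backoff[policy.name] *= policy.multiplier


# Example wiring: transient network errors retry quickly and often,
# "resource not ready" style errors retry fewer times with longer waits.
policies = [
    RetryPolicy("network", lambda e: isinstance(e, ConnectionError), 5, 0.05, 2.0),
    RetryPolicy("resource", lambda e: isinstance(e, TimeoutError), 3, 1.0, 4.0),
]
# result = retry_with_policies(lambda: do_rpc(), policies)  # do_rpc is hypothetical
{code}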
[jira] [Commented] (SPARK-45850) Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc driver version
[ https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784335#comment-17784335 ] ASF GitHub Bot commented on SPARK-45850: User 'panbingkun' has created a pull request for this issue: https://github.com/apache/spark/pull/43662 > Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc > driver version > --- > > Key: SPARK-45850 > URL: https://issues.apache.org/jira/browse/SPARK-45850 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45850) Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc driver version
[ https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45850: -- Assignee: Apache Spark > Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc > driver version > --- > > Key: SPARK-45850 > URL: https://issues.apache.org/jira/browse/SPARK-45850 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Assignee: Apache Spark >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-45850) Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc driver version
[ https://issues.apache.org/jira/browse/SPARK-45850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot reassigned SPARK-45850: -- Assignee: (was: Apache Spark) > Attempt to stabilize `OracleIntegrationSuite ` by upgrading the oracle jdbc > driver version > --- > > Key: SPARK-45850 > URL: https://issues.apache.org/jira/browse/SPARK-45850 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 4.0.0 >Reporter: BingKun Pan >Priority: Minor > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45282) Join loses records for cached datasets
[ https://issues.apache.org/jira/browse/SPARK-45282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784335#comment-17784320 ] Emil Ejbyfeldt commented on SPARK-45282: Created this [https://github.com/apache/spark/pull/43729] to backport the fix to 3.4; from my manual test, it solved the reproduction in this ticket. > Join loses records for cached datasets > -- > > Key: SPARK-45282 > URL: https://issues.apache.org/jira/browse/SPARK-45282 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0 > Environment: spark 3.4.1 on apache hadoop 3.3.6 or kubernetes 1.26 or > databricks 13.3 >Reporter: koert kuipers >Priority: Blocker > Labels: CorrectnessBug, correctness, pull-request-available > > we observed this issue on spark 3.4.1 but it is also present on 3.5.0. it is > not present on spark 3.3.1. > it only shows up in distributed environment. i cannot replicate in unit test. > however i did get it to show up on hadoop cluster, kubernetes, and on > databricks 13.3 > the issue is that records are dropped when two cached dataframes are joined. > it seems in spark 3.4.1 in queryplan some Exchanges are dropped as an > optimization while in spark 3.3.1 these Exchanges are still present. it seems > to be an issue with AQE with canChangeCachedPlanOutputPartitioning=true. > to reproduce on a distributed cluster, these settings are needed: > {code:java} > spark.sql.adaptive.advisoryPartitionSizeInBytes 33554432 > spark.sql.adaptive.coalescePartitions.parallelismFirst false > spark.sql.adaptive.enabled true > spark.sql.optimizer.canChangeCachedPlanOutputPartitioning true {code} > code using Scala to reproduce is: > {code:java} > import java.util.UUID > import org.apache.spark.sql.functions.col > import spark.implicits._ > val data = (1 to 100).toDS().map(i => > UUID.randomUUID().toString).persist() > val left = data.map(k => (k, 1)) > val right = data.map(k => (k, k)) // if i change this to k => (k, 1) it works! > println("number of left " + left.count()) > println("number of right " + right.count()) > println("number of (left join right) " + > left.toDF("key", "value1").join(right.toDF("key", "value2"), "key").count() > ) > val left1 = left > .toDF("key", "value1") > .repartition(col("key")) // comment out this line to make it work > .persist() > println("number of left1 " + left1.count()) > val right1 = right > .toDF("key", "value2") > .repartition(col("key")) // comment out this line to make it work > .persist() > println("number of right1 " + right1.count()) > println("number of (left1 join right1) " + left1.join(right1, > "key").count()) // this gives incorrect result{code} > this produces the following output: > {code:java} > number of left 100 > number of right 100 > number of (left join right) 100 > number of left1 100 > number of right1 100 > number of (left1 join right1) 859531 {code} > note that the last number (the incorrect one) actually varies depending on > settings and cluster size etc. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org