[jira] [Commented] (SPARK-31420) Infinite timeline redraw in job details page
[ https://issues.apache.org/jira/browse/SPARK-31420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081188#comment-17081188 ] Kousuke Saruta commented on SPARK-31420: O.K, I'll look into this. > Infinite timeline redraw in job details page > > > Key: SPARK-31420 > URL: https://issues.apache.org/jira/browse/SPARK-31420 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.0.0, 3.1.0 >Reporter: Gengliang Wang >Assignee: Kousuke Saruta >Priority: Major > Attachments: timeline.mov > > > In the job page, the timeline section keeps changing the position style and > shaking. We can see that there is a warning "infinite loop in redraw" from > the console, which can be related to > https://github.com/visjs/vis-timeline/issues/17 > I am using the history server with the events under > "core/src/test/resources/spark-events" to reproduce. > I have also uploaded a screen recording. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31422) Fix NPE when BlockManagerSource is used after BlockManagerMaster stops
Dongjoon Hyun created SPARK-31422: - Summary: Fix NPE when BlockManagerSource is used after BlockManagerMaster stops Key: SPARK-31422 URL: https://issues.apache.org/jira/browse/SPARK-31422 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.5, 2.3.4, 3.0.0 Reporter: Dongjoon Hyun

In `SparkEnv.stop`, the following stop sequence is used.
{code}
blockManager.stop()
blockManager.master.stop()
metricsSystem.stop()
{code}
During `metricsSystem.stop`, a sink can invoke `BlockManagerSource` and end up with an NPE.
{code}
sinks.foreach(_.stop)
registry.removeMatching((_: String, _: Metric) => true)
{code}
{code}
java.lang.NullPointerException
  at org.apache.spark.storage.BlockManagerMaster.getStorageStatus(BlockManagerMaster.scala:170)
  at org.apache.spark.storage.BlockManagerSource$$anonfun$10.apply(BlockManagerSource.scala:63)
  at org.apache.spark.storage.BlockManagerSource$$anonfun$10.apply(BlockManagerSource.scala:63)
  at org.apache.spark.storage.BlockManagerSource$$anon$1.getValue(BlockManagerSource.scala:31)
  at org.apache.spark.storage.BlockManagerSource$$anon$1.getValue(BlockManagerSource.scala:30)
{code}
Since SparkContext registers `BlockManagerSource` and never deregisters it, we should guard against the NPE inside `BlockManagerMaster`.
{code}
_env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
{code}
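The race described above can be modeled outside Spark. This is a minimal, pure-Python sketch of the guard pattern (all names are hypothetical; the real fix would live in the Scala `BlockManagerMaster.getStorageStatus`): once the master is stopped its endpoint reference is cleared, so a metrics gauge that fires late should see an empty snapshot rather than a null dereference.

```python
# Pure-Python model of the stop-ordering bug and the null-guard fix.
# Hypothetical classes standing in for Spark's Scala BlockManagerMaster
# and BlockManagerSource.

class BlockManagerMaster:
    def __init__(self):
        self.driver_endpoint = object()  # stand-in for the RPC endpoint

    def stop(self):
        self.driver_endpoint = None

    def get_storage_status(self):
        # Guard: after stop() the endpoint is gone; return an empty
        # snapshot instead of dereferencing None (the Scala analogue
        # would null-check driverEndpoint before asking it anything).
        if self.driver_endpoint is None:
            return []
        return ["status-of-block-manager"]  # placeholder payload


class BlockManagerSource:
    """Metrics gauge that may fire after the master has stopped."""

    def __init__(self, master):
        self.master = master

    def gauge_value(self):
        return len(self.master.get_storage_status())


master = BlockManagerMaster()
source = BlockManagerSource(master)
before = source.gauge_value()   # 1 while the master is alive
master.stop()
after = source.gauge_value()    # 0 after stop, instead of an NPE
```

Without the `None` check, `gauge_value()` after `master.stop()` would be the Python analogue of the NullPointerException in the stack trace above.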
[jira] [Commented] (SPARK-29245) CCE during creating HiveMetaStoreClient
[ https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081174#comment-17081174 ] Fokko Driesprong commented on SPARK-29245: -- Thanks! On Iceberg we have a similar issue, for reference: [https://github.com/apache/incubator-iceberg/pull/577]

> CCE during creating HiveMetaStoreClient
> --
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Dongjoon Hyun
> Priority: Major
>
> From a `master` branch build, when I try to connect to an external HMS, I hit the following.
> {code}
> 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 'bootstrap')
>   at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:200)
>   at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70)
> {code}
> With HIVE-21508, I can get the following.
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>       /_/
>
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4)
> Type in expressions to have them evaluated.
> Type :help for more information.
>
> scala> sql("show databases").show
> +------------+
> |databaseName|
> +------------+
> | . |
> ...
> {code}
> With 2.3.7-SNAPSHOT, the following basic tests are tested.
> - SHOW DATABASES / TABLES
> - DESC DATABASE / TABLE
> - CREATE / DROP / USE DATABASE
> - CREATE / DROP / INSERT / LOAD / SELECT TABLE
[jira] [Created] (SPARK-31421) transformAllExpressions cannot transform expressions in ScalarSubquery
Yuming Wang created SPARK-31421: --- Summary: transformAllExpressions cannot transform expressions in ScalarSubquery Key: SPARK-31421 URL: https://issues.apache.org/jira/browse/SPARK-31421 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang

Example:
{code:scala}
spark.sql("CREATE TABLE t1 (c1 bigint, c2 bigint)")
val sqlStr = "SELECT COUNT(*) FROM t1 HAVING SUM(c1) > (SELECT AVG(c2) FROM t1)"
println(normalizeExprIds(spark.sql(sqlStr).queryExecution.optimizedPlan).treeString)
{code}
The output:
{noformat}
Project [count(1)#0L]
+- Filter (isnotnull(sum(c1#2L)#0L) AND (cast(sum(c1#2L)#0L as double) > scalar-subquery#0 []))
   :  +- Aggregate [avg(c2#3L) AS avg(c2)#6]
   :     +- Project [c2#3L]
   :        +- Relation[c1#2L,c2#3L] parquet
   +- Aggregate [count(1) AS count(1)#0L, sum(c1#0L) AS sum(c1#2L)#0L]
      +- Project [c1#0L]
         +- Relation[c1#0L,c2#0L] parquet
{noformat}
Another example query: https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q14b.sql
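The shape of the bug can be sketched with a toy tree outside Spark (hypothetical classes; the real `transformAllExpressions` lives on Catalyst's `QueryPlan`/`TreeNode` in Scala). A `ScalarSubquery` expression carries a whole nested *plan*, so a transform that only maps over the outer plan's expressions never descends into the subquery unless it recurses explicitly:

```python
# Pure-Python model: a plan holds expressions; a ScalarSubquery expression
# embeds a nested plan that a naive expression transform never visits.

class Expr:
    def __init__(self, name):
        self.name = name


class ScalarSubquery(Expr):
    def __init__(self, plan):
        super().__init__("scalar-subquery")
        self.plan = plan  # nested query plan, not an ordinary child expr


class Plan:
    def __init__(self, exprs):
        self.exprs = list(exprs)


def transform_all_expressions(plan, rule, recurse_into_subqueries):
    """Apply `rule` to every expression; optionally descend into subplans."""
    for e in plan.exprs:
        rule(e)
        if recurse_into_subqueries and isinstance(e, ScalarSubquery):
            transform_all_expressions(e.plan, rule, True)


inner = Plan([Expr("avg(c2)")])
outer = Plan([Expr("sum(c1)"), ScalarSubquery(inner)])

seen = []
transform_all_expressions(outer, lambda e: seen.append(e.name), False)
missed = "avg(c2)" not in seen   # the reported behaviour: inner expr skipped

seen2 = []
transform_all_expressions(outer, lambda e: seen2.append(e.name), True)
found = "avg(c2)" in seen2       # recursing into the subquery plan fixes it
```

The fix direction suggested by the model is simply to make the traversal recurse into plans embedded in subquery expressions.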
[jira] [Issue Comment Deleted] (SPARK-31256) Dropna doesn't work for struct columns
[ https://issues.apache.org/jira/browse/SPARK-31256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] JinxinTang updated SPARK-31256: --- Comment: was deleted (was: I will try to solve it)

> Dropna doesn't work for struct columns
> --
>
> Key: SPARK-31256
> URL: https://issues.apache.org/jira/browse/SPARK-31256
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.5
> Environment: Spark 2.4.5
> Python 3.7.4
> Reporter: Michael Souder
> Priority: Major
>
> Dropna using a subset with a column from a struct drops the entire data frame.
> {code:python}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, None)], schema=['age', 'height', 'name'])
> df.show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> | 15|    80| null|
> +---+------+-----+
> # this works just fine
> df.dropna(subset=['name']).show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> +---+------+-----+
> # now add a struct column
> df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 'name'))
> df_with_struct.show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> |15 |80    |null |[15, 80,]     |
> +---+------+-----+--------------+
> # now dropna drops the whole dataframe when you use struct_col
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+----+----------+
> |age|height|name|struct_col|
> +---+------+----+----------+
> +---+------+----+----------+
> {code}
> I've tested the above code in Spark 2.4.4 with python 3.7.4 and Spark 2.3.1 with python 3.6.8 and in both, the result looks like:
> {code:python}
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> +---+------+-----+--------------+
> {code}
[jira] [Commented] (SPARK-31256) Dropna doesn't work for struct columns
[ https://issues.apache.org/jira/browse/SPARK-31256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081084#comment-17081084 ] JinxinTang commented on SPARK-31256: I will try to solve it

> Dropna doesn't work for struct columns
> --
>
> Key: SPARK-31256
> URL: https://issues.apache.org/jira/browse/SPARK-31256
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.4.5
> Environment: Spark 2.4.5
> Python 3.7.4
> Reporter: Michael Souder
> Priority: Major
>
> Dropna using a subset with a column from a struct drops the entire data frame.
> {code:python}
> import pyspark.sql.functions as F
> df = spark.createDataFrame([(5, 80, 'Alice'), (10, None, 'Bob'), (15, 80, None)], schema=['age', 'height', 'name'])
> df.show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> | 15|    80| null|
> +---+------+-----+
> # this works just fine
> df.dropna(subset=['name']).show()
> +---+------+-----+
> |age|height| name|
> +---+------+-----+
> |  5|    80|Alice|
> | 10|  null|  Bob|
> +---+------+-----+
> # now add a struct column
> df_with_struct = df.withColumn('struct_col', F.struct('age', 'height', 'name'))
> df_with_struct.show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> |15 |80    |null |[15, 80,]     |
> +---+------+-----+--------------+
> # now dropna drops the whole dataframe when you use struct_col
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+----+----------+
> |age|height|name|struct_col|
> +---+------+----+----------+
> +---+------+----+----------+
> {code}
> I've tested the above code in Spark 2.4.4 with python 3.7.4 and Spark 2.3.1 with python 3.6.8 and in both, the result looks like:
> {code:python}
> df_with_struct.dropna(subset=['struct_col.name']).show(truncate=False)
> +---+------+-----+--------------+
> |age|height|name |struct_col    |
> +---+------+-----+--------------+
> |5  |80    |Alice|[5, 80, Alice]|
> |10 |null  |Bob  |[10,, Bob]    |
> +---+------+-----+--------------+
> {code}
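Until the `dropna` subset handling is fixed, one hedged workaround is to express the null check on the nested field directly. This sketch assumes the `df_with_struct` frame from the repro above and a running SparkSession; it is a suggestion, not the reporter's or a committer's fix.

```python
# Workaround sketch (assumes a live SparkSession and the df_with_struct
# frame built in the repro above): filter on the nested field explicitly
# instead of going through dropna's subset handling.
import pyspark.sql.functions as F

kept = df_with_struct.where(F.col("struct_col.name").isNotNull())
kept.show(truncate=False)  # the Alice and Bob rows should survive
```

The dotted path in `F.col("struct_col.name")` resolves the nested struct field, so the filter behaves the way `dropna(subset=['struct_col.name'])` was expected to.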
[jira] [Resolved] (SPARK-25154) Support NOT IN sub-queries inside nested OR conditions.
[ https://issues.apache.org/jira/browse/SPARK-25154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-25154. -- Fix Version/s: 3.1.0 Assignee: Dilip Biswal Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28158 > Support NOT IN sub-queries inside nested OR conditions. > --- > > Key: SPARK-25154 > URL: https://issues.apache.org/jira/browse/SPARK-25154 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 3.1.0 > Reporter: Dilip Biswal > Assignee: Dilip Biswal > Priority: Major > Fix For: 3.1.0 > > > Currently NOT IN subqueries (predicated null-aware subqueries) are not allowed > inside OR expressions. We currently catch this condition in checkAnalysis and > throw an error. > We need to extend our subquery rewrite framework to support this type of query.
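For concreteness, the query shape this change enables looks like the following. The table and column names are invented for illustration; before this fix, Spark's `checkAnalysis` rejected a null-aware NOT IN subquery nested under an OR.

```python
# Hypothetical example of the query shape unlocked by SPARK-25154:
# a NOT IN (null-aware) subquery sitting under an OR condition.
query = """
SELECT *
FROM s1
WHERE s1.a > 5
   OR s1.b NOT IN (SELECT t1.b FROM t1 WHERE t1.c = s1.c)
"""
# With a SparkSession `spark` (assumed), this would be run as:
# spark.sql(query)
```

Prior to the fix, the NOT IN predicate was only supported at the top level of the WHERE clause (possibly under AND), because the rewrite to a null-aware anti join could not handle the disjunction.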
[jira] [Updated] (SPARK-25154) Support NOT IN sub-queries inside nested OR conditions.
[ https://issues.apache.org/jira/browse/SPARK-25154?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-25154: - Issue Type: Improvement (was: Bug) > Support NOT IN sub-queries inside nested OR conditions. > --- > > Key: SPARK-25154 > URL: https://issues.apache.org/jira/browse/SPARK-25154 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.1.0 > Reporter: Dilip Biswal > Assignee: Dilip Biswal > Priority: Major > Fix For: 3.1.0 > > > Currently NOT IN subqueries (predicated null-aware subqueries) are not allowed > inside OR expressions. We currently catch this condition in checkAnalysis and > throw an error. > We need to extend our subquery rewrite framework to support this type of query.
[jira] [Commented] (SPARK-31420) Infinite timeline redraw in job details page
[ https://issues.apache.org/jira/browse/SPARK-31420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17081017#comment-17081017 ] Gengliang Wang commented on SPARK-31420: [~sarutak] I looked into it but I am not familiar with this part. Could you please check it? > Infinite timeline redraw in job details page > > > Key: SPARK-31420 > URL: https://issues.apache.org/jira/browse/SPARK-31420 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 3.0.0, 3.1.0 > Reporter: Gengliang Wang > Assignee: Kousuke Saruta > Priority: Major > Attachments: timeline.mov > > > In the job page, the timeline section keeps changing the position style and > shaking. We can see that there is a warning "infinite loop in redraw" from > the console, which can be related to > https://github.com/visjs/vis-timeline/issues/17 > I am using the history server with the events under > "core/src/test/resources/spark-events" to reproduce. > I have also uploaded a screen recording.
[jira] [Created] (SPARK-31420) Infinite timeline redraw in job details page
Gengliang Wang created SPARK-31420: -- Summary: Infinite timeline redraw in job details page Key: SPARK-31420 URL: https://issues.apache.org/jira/browse/SPARK-31420 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 3.0.0, 3.1.0 Reporter: Gengliang Wang Assignee: Kousuke Saruta Attachments: timeline.mov In the job page, the timeline section keeps changing the position style and shaking. We can see that there is a warning "infinite loop in redraw" from the console, which can be related to https://github.com/visjs/vis-timeline/issues/17 I am using the history server with the events under "core/src/test/resources/spark-events" to reproduce. I have also uploaded a screen recording.
[jira] [Updated] (SPARK-31420) Infinite timeline redraw in job details page
[ https://issues.apache.org/jira/browse/SPARK-31420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-31420: --- Attachment: timeline.mov > Infinite timeline redraw in job details page > > > Key: SPARK-31420 > URL: https://issues.apache.org/jira/browse/SPARK-31420 > Project: Spark > Issue Type: Bug > Components: Web UI > Affects Versions: 3.0.0, 3.1.0 > Reporter: Gengliang Wang > Assignee: Kousuke Saruta > Priority: Major > Attachments: timeline.mov > > > In the job page, the timeline section keeps changing the position style and > shaking. We can see that there is a warning "infinite loop in redraw" from > the console, which can be related to > https://github.com/visjs/vis-timeline/issues/17 > I am using the history server with the events under > "core/src/test/resources/spark-events" to reproduce. > I have also uploaded a screen recording.
[jira] [Created] (SPARK-31419) Document Table-valued Function and Inline Table
Huaxin Gao created SPARK-31419: -- Summary: Document Table-valued Function and Inline Table Key: SPARK-31419 URL: https://issues.apache.org/jira/browse/SPARK-31419 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Huaxin Gao Document Table-valued Function and Inline Table
[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code
[ https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-31417: - Description: *I would like some help to explore if the following feature makes sense and what's the best way to implement it. Allow users to use "isin" (or similar) predicates with huge Lists (which might be broadcasted).* As of now (AFAIK), users can only provide a list/sequence of elements which will be sent as part of the "In" predicate, which is part of the task, to the executors. So when this list is huge, this causes tasks to be "huge" (specially when the task number is big). I'm coming from https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list (I'm the creator of the post, and also the 2nd answer). I know this is not common, but can be useful for people reading from "servers" which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 minute "calculating plan" to send the BIG tasks to the executors) versus 30 minutes doing a full-scan (even if semi-filtered, most of the data has the same nature and thus is hard to filter, unless I use this "200.000" list) on my table. More or less I have a clear view (explained in stackoverflow) on what to modify If I wanted to create my own "BroadcastIn" in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala, however with my limited "idea", anyone who implemented In pushdown would have to re-adapt it to the new "BroadcastIn" I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working for a corporation which uses that version). 
Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an expert in Scala syntax/syntactic sugar, without creating a new predicate *anyone has any idea on how to extend current case class "In" while keeping the compatibility with PushDown implementations, but not having only Seq[Expression] in the case class, but also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I made a test-run with a predicate which receives and Broadcast variable and uses the value inside, and it works much better. was: *I would like some help to explore if the following feature makes sense and what's the best way to implement it. Allow users to use "isin" (or similar) predicates with huge Lists (which might be broadcasted).* As of now (AFAIK), users can only provide a list/sequence of elements which will be sent as part of the "In" predicate, which is part of the task, to the executors. So when this list is huge, this causes tasks to be "huge" (specially when the task number is big). I'm coming from https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list (I'm the creator of the post, and also the 2nd answer). I know this is not common, but can be useful for people reading from "servers" which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 minute "calculating plan" to send the BIG tasks to the executors) versus 30 minutes doing a full-scan (even if semi-filtered, most of the data has the same nature and thus is hard to filter, unless I use this "200.000" list) on my table. 
More or less I have a clear view (explained in stackoverflow) on what to modify If I wanted to create my own "BroadcastIn" in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala, however with my limited "idea", anyone who implemented In pushdown would have to re-adapt it to the new "BroadcastIn" I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working for a corporation which uses that version). Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend current case class "In" while keeping the compatibility with PushDown implementations, but not having only Seq[Expression] in the case class, but also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I made a test-run with a predicate which receives and Broadcast variable and uses the value inside, and it works much better. > Allow broadcast variables when using isin code > -- > > Key: SPARK-31417 > URL: https://issues.apache.org/jira/browse/SPARK-31417 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0
[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code
[ https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-31417: - Description: *I would like some help to explore if the following feature makes sense and what's the best way to implement it. Allow users to use "isin" (or similar) predicates with huge Lists (which might be broadcasted).* As of now (AFAIK), users can only provide a list/sequence of elements which will be sent as part of the "In" predicate, which is part of the task, to the executors. So when this list is huge, this causes tasks to be "huge" (specially when the task number is big). I'm coming from https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list (I'm the creator of the post, and also the 2nd answer). I know this is not common, but can be useful for people reading from "servers" which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 minute "calculating plan" to send the BIG tasks to the executors) versus 30 minutes doing a full-scan (even if semi-filtered, most of the data has the same nature and thus is hard to filter, unless I use this "200.000" list) on my table. More or less I have a clear view (explained in stackoverflow) on what to modify If I wanted to create my own "BroadcastIn" in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala, however with my limited "idea", anyone who implemented In pushdown would have to re-adapt it to the new "BroadcastIn" I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working for a corporation which uses that version). 
Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend current case class "In" while keeping the compatibility with PushDown implementations (aka creating a new Predicate not allowed I think), but not having only Seq[Expression] in the case class, but also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I made a test-run with a predicate which receives and Broadcast variable and uses the value inside, and it works much better. was: *I would like some help to explore if the following feature makes sense and what's the best way to implement it. Allow users to use "isin" (or similar) predicates with huge Lists (which might be broadcasted).* As of now (AFAIK), users can only provide a list/sequence of elements which will be sent as part of the "In" predicate, which is part of the task, to the executors. So when this list is huge, this causes tasks to be "huge" (specially when the task number is big). I'm coming from https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list (I'm the creator of the post, and also the 2nd answer). I know this is not common, but can be useful for people reading from "servers" which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 minute "calculating plan" to send the BIG tasks to the executors) versus 30 minutes doing a full-scan (even if semi-filtered, most of the data has the same nature and thus is hard to filter, unless I use this "200.000" list) on my table. 
More or less I have a clear view (explained in stackoverflow) on what to modify If I wanted to create my own "BroadcastIn" in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala, however with my limited "idea", anyone who implemented In pushdown would have to re-adapt it to the new "BroadcastIn" I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working for a corporation which uses that version). Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an expert in Scala syntax/syntactic sugar, without creating a new predicate *anyone has any idea on how to extend current case class "In" while keeping the compatibility with PushDown implementations, but not having only Seq[Expression] in the case class, but also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I made a test-run with a predicate which receives and Broadcast variable and uses the value inside, and it works much better. > Allow broadcast variables when using isin code > -- > > Key: SPARK-31417 > URL: https://issues.apache.org/jira/browse/SPARK-31417 > Project: Spark > Issue Type: Improvement >
[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code
[ https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-31417: - Description: *I would like some help to explore if the following feature makes sense and what's the best way to implement it. Allow users to use "isin" (or similar) predicates with huge Lists (which might be broadcasted).* As of now (AFAIK), users can only provide a list/sequence of elements which will be sent as part of the "In" predicate, which is part of the task, to the executors. So when this list is huge, this causes tasks to be "huge" (specially when the task number is big). I'm coming from https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list (I'm the creator of the post, and also the 2nd answer). I know this is not common, but can be useful for people reading from "servers" which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 minute "calculating plan" to send the BIG tasks to the executors) versus 30 minutes doing a full-scan (even if semi-filtered, most of the data has the same nature and thus is hard to filter, unless I use this "200.000" list) on my table. More or less I have a clear view (explained in stackoverflow) on what to modify If I wanted to create my own "BroadcastIn" in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala, however with my limited "idea", anyone who implemented In pushdown would have to re-adapt it to the new "BroadcastIn" I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working for a corporation which uses that version). 
Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend current case class "In" while keeping the compatibility with PushDown implementations, but not having only Seq[Expression] in the case class, but also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I made a test-run with a predicate which receives and Broadcast variable and uses the value inside, and it works much better. was: *I would like some help to explore if the following feature makes sense and what's the best way to implement it. Allow users to use "isin" (or similar) predicates with huge Lists (which might be broadcasted).* As of now (AFAIK), users can only provide a list/sequence of elements which will be sent as part of the "In" predicate, which is part of the task, to the executors. So when this list is huge, this causes tasks to be "huge" (specially when the task number is big). I'm coming from https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list (I'm the creator of the post, and also the 2nd answer). I know this is not common, but can be useful for people reading from "servers" which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 minute "calculating plan" to send the BIG tasks to the executors) versus 30 minutes doing a full-scan (even if semi-filtered, most of the data has the same nature and thus is hard to filter, unless if I use this "200.000" list) on my table. 
More or less I have a clear view (explained in stackoverflow) on what to modify If I wanted to create my own "BroadcastIn" in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala, however with my limited "idea", anyone who implemented In pushdown would have to re-adapt it to the new "BroadcastIn" I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working for a corporation which uses that version). Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an expert in Scala syntax/syntactic sugar, *anyone has any idea on how to extend current case class "In" while keeping the compatibility with PushDown implementations, but not having only Seq[Expression] in the case class, but also allow Broadcast[Seq[Expression]]?* This is what makes the tasks huge, I made a test-run with a predicate which receives and Broadcast variable and uses the value inside, and it works much better. > Allow broadcast variables when using isin code > -- > > Key: SPARK-31417 > URL: https://issues.apache.org/jira/browse/SPARK-31417 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Florenti
[jira] [Updated] (SPARK-31417) Allow broadcast variables when using isin code
[ https://issues.apache.org/jira/browse/SPARK-31417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Florentino Sainz updated SPARK-31417: - Description: *I would like some help to explore if the following feature makes sense and what's the best way to implement it. Allow users to use "isin" (or similar) predicates with huge Lists (which might be broadcasted).* As of now (AFAIK), users can only provide a list/sequence of elements which will be sent as part of the "In" predicate, which is part of the task, to the executors. So when this list is huge, this causes tasks to be "huge" (specially when the task number is big). I'm coming from https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list (I'm the creator of the post, and also the 2nd answer). I know this is not common, but can be useful for people reading from "servers" which are not in the same nodes than Spark (maybe S3/cloud?). In my concrete case, querying Kudu with 200.000 elements in the isin takes ~2minutes (+1 minute "calculating plan" to send the BIG tasks to the executors) versus 30 minutes doing a full-scan (even if semi-filtered, most of the data has the same nature and thus is hard to filter, unless if I use this "200.000" list) on my table. More or less I have a clear view (explained in stackoverflow) on what to modify If I wanted to create my own "BroadcastIn" in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala, however with my limited "idea", anyone who implemented In pushdown would have to re-adapt it to the new "BroadcastIn" I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 100% sure if there's an alternative there (I'm stuck in 2.4 anyways, working for a corporation which uses that version). 
Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an expert in Scala syntax/syntactic sugar, *does anyone have any idea how to extend the current case class "In" while keeping the compatibility with PushDown implementations, but not having only Seq[Expression] in the case class, but also allowing Broadcast[Seq[Expression]]?* This is what makes the tasks huge. I made a test-run with a predicate which receives a Broadcast variable and uses the value inside, and it works much better. was: *I would like some help to explore if the following feature makes sense and what's the best way to implement it. Allow users to use "isin" (or similar) predicates with huge Lists (which might be broadcasted).* As of now (AFAIK), users can only provide a list/sequence of elements which will be sent as part of the "In" predicate, which is part of the task, to the executors. So when this list is huge, this causes tasks to be "huge" (especially when the number of tasks is big). I'm coming from https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list (I'm the creator of the post, and also the 2nd answer). I know this is not common, but can be useful for people reading from "servers" which are not on the same nodes as Spark (maybe S3/cloud?). In my concrete case, querying Kudu with 200,000 elements in the isin takes ~2 minutes (+1 minute "calculating plan" to send the BIG tasks to the executors) versus 30 minutes doing a full-scan (even if semi-filtered) on my table. 
More or less I have a clear view (explained in stackoverflow) on what to modify if I wanted to create my own "BroadcastIn" in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala, however with my limited "idea", anyone who implemented In pushdown would have to re-adapt it to the new "BroadcastIn". I've looked at Spark 3.0 sources and nothing has changed there, but I'm not 100% sure if there's an alternative there (I'm stuck on 2.4 anyway, working for a corporation which uses that version). Maybe I can go ahead with a PR (slowly, in my free time), however I'm not an expert in Scala syntax/syntactic sugar, *does anyone have any idea how to extend the current case class "In" while keeping the compatibility with PushDown implementations, but not having only Seq[Expression] in the case class, but also allowing Broadcast[Seq[Expression]]?* This is what makes the tasks huge. I made a test-run with a predicate which receives a Broadcast variable and uses the value inside, and it works much better. > Allow broadcast variables when using isin code > -- > > Key: SPARK-31417 > URL: https://issues.apache.org/jira/browse/SPARK-31417 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Florentino Sainz >Priority: Minor > > *I would like some help to explore if the following feature
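The test-run described above (a predicate that receives a Broadcast variable and reads the value inside, rather than inlining the whole list into the `In` expression of every task) can be sketched roughly as follows. This is only a minimal sketch of the workaround, not the proposed "BroadcastIn": the local session setup, the column name `id`, and the generated 200,000-element list are illustrative assumptions, and a UDF like this is opaque to Catalyst, so it gives up the source pushdown (e.g. to Kudu) that a first-class "BroadcastIn" would keep.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object BroadcastIsinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-isin-sketch")
      .getOrCreate()
    import spark.implicits._

    // Broadcast the big membership list once; each task then carries only the
    // lightweight Broadcast handle instead of ~200,000 serialized literals.
    val bigList: Set[String] = (1 to 200000).map(i => s"id_$i").toSet
    val bigListBc = spark.sparkContext.broadcast(bigList)

    // The UDF closure captures the handle; executors fetch the Set once per
    // JVM via bigListBc.value, not once per task.
    val inBigList = udf((id: String) => bigListBc.value.contains(id))

    val df = Seq("id_1", "id_42", "other").toDF("id")
    df.filter(inBigList($"id")).show()  // keeps the rows id_1 and id_42

    spark.stop()
  }
}
```

The trade-off is exactly the one the ticket is about: the plan stays small and planning is fast, but the filter can no longer be pushed down to the data source.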
[jira] [Commented] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080942#comment-17080942 ] Venkata krishnan Sowrirajan commented on SPARK-31418: - Currently, I'm working on this issue. > Blacklisting feature aborts Spark job without retrying for max num retries in > case of Dynamic allocation > > > Key: SPARK-31418 > URL: https://issues.apache.org/jira/browse/SPARK-31418 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.5 >Reporter: Venkata krishnan Sowrirajan >Priority: Major > > With Spark blacklisting, if a task fails on an executor, the executor gets > blacklisted for the task. In order to retry the task, it checks if there are > idle blacklisted executors which can be killed and replaced to retry the task; > if not, it aborts the job without doing max retries. > In the context of dynamic allocation this can be better: instead of killing > the blacklisted idle executor (it's possible there are no idle blacklisted > executors), request an additional executor and retry the task. > This can be easily reproduced with a simple job like below, although this > example should fail eventually, just to show that it's not retried > spark.task.maxFailures times: > {code:java} > def test(a: Int) = { a.asInstanceOf[String] } > sc.parallelize(1 to 10, 10).map(x => test(x)).collect > {code} > with dynamic allocation enabled and min executors set to 1. But there are > various other cases where this can fail as well.
[jira] [Updated] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
[ https://issues.apache.org/jira/browse/SPARK-31418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Venkata krishnan Sowrirajan updated SPARK-31418: Description: With Spark blacklisting, if a task fails on an executor, the executor gets blacklisted for the task. In order to retry the task, it checks if there are idle blacklisted executors which can be killed and replaced to retry the task; if not, it aborts the job without doing max retries. In the context of dynamic allocation this can be better: instead of killing the blacklisted idle executor (it's possible there are no idle blacklisted executors), request an additional executor and retry the task. This can be easily reproduced with a simple job like below, although this example should fail eventually, just to show that it's not retried spark.task.maxFailures times: {code:java} def test(a: Int) = { a.asInstanceOf[String] } sc.parallelize(1 to 10, 10).map(x => test(x)).collect {code} with dynamic allocation enabled and min executors set to 1. But there are various other cases where this can fail as well. was: With Spark blacklisting, if a task fails on an executor, the executor gets blacklisted for the task. In order to retry the task, it checks if there are idle blacklisted executors which can be killed and replaced to retry the task; if not, it aborts the job without doing max retries. In the context of dynamic allocation this can be better: instead of killing the blacklisted idle executor (it's possible there are no idle blacklisted executors), request an additional executor and retry the task. This can be easily reproduced with a simple job like below: {code:java} def test(a: Int) = { a.asInstanceOf[String] } sc.parallelize(1 to 10, 10).map(x => test(x)).collect {code} with dynamic allocation enabled and min executors set to 1. But there are various other cases where this can fail as well. 
> Blacklisting feature aborts Spark job without retrying for max num retries in > case of Dynamic allocation > > > Key: SPARK-31418 > URL: https://issues.apache.org/jira/browse/SPARK-31418 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0, 2.4.5 >Reporter: Venkata krishnan Sowrirajan >Priority: Major > > With Spark blacklisting, if a task fails on an executor, the executor gets > blacklisted for the task. In order to retry the task, it checks if there are > idle blacklisted executors which can be killed and replaced to retry the task; > if not, it aborts the job without doing max retries. > In the context of dynamic allocation this can be better: instead of killing > the blacklisted idle executor (it's possible there are no idle blacklisted > executors), request an additional executor and retry the task. > This can be easily reproduced with a simple job like below, although this > example should fail eventually, just to show that it's not retried > spark.task.maxFailures times: > {code:java} > def test(a: Int) = { a.asInstanceOf[String] } > sc.parallelize(1 to 10, 10).map(x => test(x)).collect > {code} > with dynamic allocation enabled and min executors set to 1. But there are > various other cases where this can fail as well.
[jira] [Created] (SPARK-31418) Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation
Venkata krishnan Sowrirajan created SPARK-31418: --- Summary: Blacklisting feature aborts Spark job without retrying for max num retries in case of Dynamic allocation Key: SPARK-31418 URL: https://issues.apache.org/jira/browse/SPARK-31418 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.5, 2.3.0 Reporter: Venkata krishnan Sowrirajan With Spark blacklisting, if a task fails on an executor, the executor gets blacklisted for that task. In order to retry the task, the scheduler checks whether there are idle blacklisted executors which can be killed and replaced; if not, it aborts the job without attempting the maximum number of retries. In the context of dynamic allocation this could be handled better: instead of killing an idle blacklisted executor (it's possible there are no idle blacklisted executors), request an additional executor and retry the task. This can be easily reproduced with a simple job like the one below:
{code:java}
def test(a: Int) = {
  a.asInstanceOf[String]
}
sc.parallelize(1 to 10, 10).map(x => test(x)).collect
{code}
with dynamic allocation enabled and min executors set to 1. But there are various other cases where this can fail as well. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
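The decision logic described above can be sketched as a toy model. This is an illustrative stand-in, not Spark's actual `TaskSetManager` code; all names below are hypothetical:

```python
# Toy model of the abort decision described in the issue. Names are
# hypothetical and do not correspond to Spark's real scheduler internals.
def handle_failed_task(idle_blacklisted: bool,
                       dynamic_allocation: bool,
                       retries: int,
                       max_retries: int) -> str:
    """Return the scheduler's action for a task that failed on a blacklisted executor."""
    if retries >= max_retries:
        return "abort"
    if idle_blacklisted:
        # Current behavior: recycle an idle blacklisted executor.
        return "kill_and_replace"
    if dynamic_allocation:
        # Proposed behavior: acquire a fresh executor instead of aborting.
        return "request_new_executor"
    # Current behavior when no idle blacklisted executor exists:
    # the job is aborted before max retries are exhausted.
    return "abort"

print(handle_failed_task(idle_blacklisted=False, dynamic_allocation=True,
                         retries=1, max_retries=4))
```

Under the proposal, the third branch replaces the premature abort when dynamic allocation is enabled.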
[jira] [Created] (SPARK-31417) Allow broadcast variables when using isin code
Florentino Sainz created SPARK-31417: Summary: Allow broadcast variables when using isin code Key: SPARK-31417 URL: https://issues.apache.org/jira/browse/SPARK-31417 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: Florentino Sainz *I would like someone to explore whether the following feature makes sense: allow users to use "isin" (or similar) predicates with Lists which are broadcast.* I'm coming from https://stackoverflow.com/questions/6172/scala-spark-isin-broadcast-list (I'm the creator of the post, and also the 2nd answer). I know this is not common, but it can be useful for people reading from "servers" which are not on the same nodes as Spark (maybe S3/cloud?). In my concrete case, querying Kudu with 200,000 elements in the isin takes ~2 minutes (+1 minute "calculating plan" to send the BIG tasks to the executors) versus 30 minutes doing a full scan (even if semi-filtered) on my table. More or less I have a clear view (explained in Stack Overflow) of what to modify if I wanted to create my own "BroadcastIn" in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/predicates.scala; however, with my limited "idea", anyone who implemented In pushdown would have to re-adapt it to the new "BroadcastIn". I've looked at the Spark 3.0 sources and nothing has changed there, but I'm not 100% sure if there's an alternative there (I'm stuck on 2.4 anyway, working for a corporation which uses that version). Maybe I can go ahead with a PR (slowly, in my free time); however, I'm not an expert in Scala syntax/syntactic sugar. *Does anyone have a clue whether it's possible to extend the current case class "In" while keeping compatibility with PushDown implementations, but without having the List as a val in the case class?* That val is what makes the tasks huge; I made a test run with a predicate which receives a Broadcast variable and uses the value inside, and it works much better. 
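The "tasks become huge" observation can be illustrated outside Spark. The sketch below uses plain `pickle` as a stand-in for Spark's task serializer; the classes are invented for illustration and are not Catalyst's `In` or a real `Broadcast` implementation. A predicate that captures its whole value list serializes at roughly the size of the list, while one that captures only a handle stays tiny:

```python
import pickle

class InFilter:
    """Illustrative predicate that captures its whole value list,
    analogous to an In predicate carrying the list inside each task."""
    def __init__(self, values):
        self.values = set(values)
    def __call__(self, x):
        return x in self.values

class BroadcastInFilter:
    """Illustrative predicate that captures only a key; the list lives in a
    shared registry, mimicking a Broadcast variable resolved on the executor."""
    registry = {}  # class-level, so it is NOT pickled with instances
    def __init__(self, key):
        self.key = key
    def __call__(self, x):
        return x in BroadcastInFilter.registry[self.key]

big_list = list(range(200_000))
BroadcastInFilter.registry["ids"] = set(big_list)

task_with_list = pickle.dumps(InFilter(big_list))      # carries all 200,000 values
task_with_handle = pickle.dumps(BroadcastInFilter("ids"))  # carries only the key
print(len(task_with_list), len(task_with_handle))
```

The first payload is on the order of a megabyte; the second is a few dozen bytes, which is the effect the reporter measured when switching to a Broadcast variable.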
[jira] [Created] (SPARK-31416) Check more strictly that a field name can be used as a valid Java identifier for codegen
Kousuke Saruta created SPARK-31416: -- Summary: Check more strictly that a field name can be used as a valid Java identifier for codegen Key: SPARK-31416 URL: https://issues.apache.org/jira/browse/SPARK-31416 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta ScalaReflection checks that a field name can be used as a valid Java identifier by checking that the field name is not a reserved keyword. But in the current implementation, enum is missing from the keyword list. Further, some characters (including leading numeric literals) cannot appear in valid identifiers but are not checked.
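The stricter check the issue asks for can be sketched as follows. This is an illustrative model of the rule, not ScalaReflection's actual implementation: a name is usable in generated Java code only if it is not a reserved word (`enum` included) and every character, including the first, is a valid Java identifier character:

```python
# Java reserved words per JLS §3.9; note that 'enum' is included,
# which is the word the current ScalaReflection check misses.
JAVA_RESERVED = {
    "abstract", "assert", "boolean", "break", "byte", "case", "catch", "char",
    "class", "const", "continue", "default", "do", "double", "else", "enum",
    "extends", "final", "finally", "float", "for", "goto", "if", "implements",
    "import", "instanceof", "int", "interface", "long", "native", "new",
    "package", "private", "protected", "public", "return", "short", "static",
    "strictfp", "super", "switch", "synchronized", "this", "throw", "throws",
    "transient", "try", "void", "volatile", "while",
}

def is_valid_java_identifier(name: str) -> bool:
    """Simplified model of a strict codegen field-name check."""
    if not name or name in JAVA_RESERVED:
        return False
    # Identifiers must not start with a digit; letters, '$', and '_' are fine.
    if not (name[0].isalpha() or name[0] in "$_"):
        return False
    return all(c.isalnum() or c in "$_" for c in name[1:])

print(is_valid_java_identifier("enum"),
      is_valid_java_identifier("1abc"),
      is_valid_java_identifier("value_1"))
```

Both `enum` and names beginning with numeric literals are rejected, covering the two gaps the issue describes.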
[jira] [Updated] (SPARK-31404) backward compatibility issues after switching to Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31404: -- Priority: Blocker (was: Major) > backward compatibility issues after switching to Proleptic Gregorian calendar > - > > Key: SPARK-31404 > URL: https://issues.apache.org/jira/browse/SPARK-31404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker > > In Spark 3.0, we switch to the Proleptic Gregorian calendar by using the Java > 8 datetime APIs. This makes Spark follow the ISO and SQL standard, but > introduces some backward compatibility problems: > 1. may read wrong data from the data files written by Spark 2.4 > 2. may have perf regression
[jira] [Updated] (SPARK-31404) backward compatibility issues after switching to Proleptic Gregorian calendar
[ https://issues.apache.org/jira/browse/SPARK-31404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31404: -- Target Version/s: 3.0.0 > backward compatibility issues after switching to Proleptic Gregorian calendar > - > > Key: SPARK-31404 > URL: https://issues.apache.org/jira/browse/SPARK-31404 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker > > In Spark 3.0, we switch to the Proleptic Gregorian calendar by using the Java > 8 datetime APIs. This makes Spark follow the ISO and SQL standard, but > introduces some backward compatibility problems: > 1. may read wrong data from the data files written by Spark 2.4 > 2. may have perf regression
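The calendar switch can be illustrated with Python's `datetime`, which, like the Java 8 `java.time` APIs Spark 3.0 adopted, uses the proleptic Gregorian calendar. The comparison with the hybrid Julian/Gregorian calendar below is an assumption of this sketch (stated from the historical reform dates), not Spark code:

```python
from datetime import date

# Python's datetime is proleptic Gregorian: Gregorian leap-year rules are
# extended backwards before the 1582 reform. The hybrid calendar used by
# Spark 2.4 (via java.util.Calendar) instead jumps from 1582-10-04 (Julian)
# straight to 1582-10-15 (Gregorian).
d1 = date(1582, 10, 4)
d2 = date(1582, 10, 15)
print((d2 - d1).days)  # 11 days apart here; only 1 day elapsed historically

# A date that simply does not exist in the hybrid calendar is perfectly
# valid in the proleptic one -- one way old files can be reinterpreted.
print(date(1582, 10, 10))
```

Because the two calendars disagree on day counts before the reform, the same stored day number decodes to different dates, which is the "may read wrong data from files written by Spark 2.4" problem above.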
[jira] [Assigned] (SPARK-31415) builtin date-time functions/operations improvement
[ https://issues.apache.org/jira/browse/SPARK-31415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31415: - Assignee: Maxim Gekk > builtin date-time functions/operations improvement > -- > > Key: SPARK-31415 > URL: https://issues.apache.org/jira/browse/SPARK-31415 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > >
[jira] [Assigned] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0
[ https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-31306: Assignee: Bryan Cutler > rand() function documentation suggests an inclusive upper bound of 1.0 > -- > > Key: SPARK-31306 > URL: https://issues.apache.org/jira/browse/SPARK-31306 > Project: Spark > Issue Type: Documentation > Components: PySpark, R, Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Ben >Assignee: Bryan Cutler >Priority: Major > > The rand() function in PySpark, Spark, and R is documented as drawing from > U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing > (i.e for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing > `U[0.0, 1.0]` suggests the value returned could include 1.0). The function > itself uses Rand(), which is [documented |#L71] as having a result in the > range [0, 1).
[jira] [Assigned] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0
[ https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-31306: Assignee: (was: Bryan Cutler) > rand() function documentation suggests an inclusive upper bound of 1.0 > -- > > Key: SPARK-31306 > URL: https://issues.apache.org/jira/browse/SPARK-31306 > Project: Spark > Issue Type: Documentation > Components: PySpark, R, Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Ben >Priority: Major > > The rand() function in PySpark, Spark, and R is documented as drawing from > U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing > (i.e for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing > `U[0.0, 1.0]` suggests the value returned could include 1.0). The function > itself uses Rand(), which is [documented |#L71] as having a result in the > range [0, 1).
[jira] [Resolved] (SPARK-31306) rand() function documentation suggests an inclusive upper bound of 1.0
[ https://issues.apache.org/jira/browse/SPARK-31306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-31306. -- Resolution: Fixed Issue resolved by pull request 28071 https://github.com/apache/spark/pull/28071 > rand() function documentation suggests an inclusive upper bound of 1.0 > -- > > Key: SPARK-31306 > URL: https://issues.apache.org/jira/browse/SPARK-31306 > Project: Spark > Issue Type: Documentation > Components: PySpark, R, Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Ben >Priority: Major > > The rand() function in PySpark, Spark, and R is documented as drawing from > U[0.0, 1.0]. This suggests an inclusive upper bound, and can be confusing > (i.e for a distribution written as `X ~ U(a, b)`, x can be a or b, so writing > `U[0.0, 1.0]` suggests the value returned could include 1.0). The function > itself uses Rand(), which is [documented |#L71] as having a result in the > range [0, 1).
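The half-open interval the issue wants documented is the same convention Python's stdlib uses; `random.random()` draws from [0.0, 1.0), so 0.0 is a possible return value and 1.0 is not. A quick empirical check (this uses Python's generator as an analogue, not Spark's `Rand` expression itself):

```python
import random

# random.random() returns a float in the half-open interval [0.0, 1.0):
# the lower bound is inclusive, the upper bound is exclusive -- the
# "[0.0, 1.0)" notation the issue proposes instead of "U[0.0, 1.0]".
samples = [random.random() for _ in range(100_000)]
print(all(0.0 <= x < 1.0 for x in samples))  # True: 1.0 never occurs
```

Sampling of course cannot prove exclusivity, but the generator's contract guarantees it; the documentation fix is simply to write the interval with a round bracket on the right.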
[jira] [Commented] (SPARK-31399) closure cleaner is broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080627#comment-17080627 ] Dongjoon Hyun commented on SPARK-31399: --- According to the above information, - I added the failure result in Apache Spark 2.4.5 with Scala 2.12 additionally into the JIRA description. - Update the title from `in Spark 3.0` to `in Scala 2.12`. - Add `2.4.5` as `Affected Version` (but not a target version) > closure cleaner is broken in Scala 2.12 > --- > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker > > The `ClosureCleaner` only support Scala functions and it uses the following > check to catch closures > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala > functions become Java lambdas. > As an example, the following code works well in Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But fails in 3.0 > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. 
> org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} > **Apache Spark 2.4.5 with Scala 2.12** > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5 > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureS
[jira] [Updated] (SPARK-31399) closure cleaner is broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31399: -- Affects Version/s: (was: 2.4.6) 2.4.5 > closure cleaner is broken in Scala 2.12 > --- > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker > > The `ClosureCleaner` only support Scala functions and it uses the following > check to catch closures > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala > functions become Java lambdas. > As an example, the following code works well in Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But fails in 3.0 > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. 
> org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} > **Apache Spark 2.4.5 with Scala 2.12** > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5 > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2
[jira] [Updated] (SPARK-31399) closure cleaner is broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31399: -- Affects Version/s: 2.4.6 > closure cleaner is broken in Scala 2.12 > --- > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0, 2.4.6 >Reporter: Wenchen Fan >Priority: Blocker > > The `ClosureCleaner` only support Scala functions and it uses the following > check to catch closures > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala > functions become Java lambdas. > As an example, the following code works well in Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But fails in 3.0 > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. 
> org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} > **Apache Spark 2.4.5 with Scala 2.12** > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5 > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2326) > at org.apache.spark.rdd.RDD.$an
[jira] [Updated] (SPARK-31399) closure cleaner is broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31399: -- Description: The `ClosureCleaner` only support Scala functions and it uses the following check to catch closures {code} // Check whether a class represents a Scala closure private def isClosure(cls: Class[_]): Boolean = { cls.getName.contains("$anonfun$") } {code} This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala functions become Java lambdas. As an example, the following code works well in Spark 2.4 Spark Shell: {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. import org.apache.spark.sql.functions.lit defined class Foo col: org.apache.spark.sql.Column = 123 df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 {code} But fails in 3.0 {code} scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) at org.apache.spark.rdd.RDD.map(RDD.scala:421) ... 
39 elided Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column Serialization stack: - object not serializable (class: org.apache.spark.sql.Column, value: 123) - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) - object (class $iw, $iw@2d87ac2b) - element of array (index: 0) - array (class [Ljava.lang.Object;, size 1) - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;) - object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class $iw, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) - writeReplace data (class: java.lang.invoke.SerializedLambda) - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) ... 47 more {code} **Apache Spark 2.4.5 with Scala 2.12** {code} Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.5 /_/ Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) Type in expressions to have them evaluated. Type :help for more information. scala> :pa // Entering paste mode (ctrl-D to finish) import org.apache.spark.sql.functions.lit case class Foo(id: String) val col = lit("123") val df = sc.range(0,10,1,1).map { _ => Foo("") } // Exiting paste mode, now interpreting. 
org.apache.spark.SparkException: Task not serializable at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393) at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) at org.apache.spark.SparkContext.clean(SparkContext.scala:2326) at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:393) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) at org.apache.spark.rdd.RDD.withScope(RDD.scala:385) at org.apache.spark.rdd.RDD.map(RDD.scala:392) ... 45 elided Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column Serialization stack: - object not serializable (class: org.apache.spark.sql.Column, value: 123) - field (class: $iw, name: col, type: class org.apache.spark.
[jira] [Updated] (SPARK-31399) closure cleaner is broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31399: -- Summary: closure cleaner is broken in Scala 2.12 (was: closure cleaner is broken in Spark 3.0) > closure cleaner is broken in Scala 2.12 > --- > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Blocker > > The `ClosureCleaner` only support Scala functions and it uses the following > check to catch closures > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala > functions become Java lambdas. > As an example, the following code works well in Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But fails in 3.0 > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. 
> org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
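The mechanics behind this "Task not serializable" failure can be sketched outside the JVM. Below is a rough Python analogue (hypothetical names, not Spark APIs): a dict stands in for the lambda's captured environment, an unpicklable value plays the role of the `Column`, and a toy "closure cleaner" drops the captured fields the function body never uses.

```python
import pickle

# The captured environment of the user's lambda. "col" plays the role of
# org.apache.spark.sql.Column: a Python lambda is not picklable, just as
# Column is not Java-serializable.
captured = {"col": lambda: None,   # unused by the function, but still captured
            "other": 123}

def clean(env, used):
    # Toy "closure cleaner": keep only the fields the function body references.
    # Spark's real cleaner nulls out unused captured fields instead.
    return {k: v for k, v in env.items() if k in used}

# Serializing the raw environment fails, like the Scala example above.
try:
    pickle.dumps(captured)
    serializable = True
except Exception:
    serializable = False

# After cleaning, only the used field remains and serialization succeeds.
cleaned = clean(captured, used={"other"})
roundtrip = pickle.loads(pickle.dumps(cleaned))
```

The sketch shows why detection matters: if the cleaner does not recognize the function as a closure (the `$anonfun$` name check above), the unused `col` is never dropped and serialization fails.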
[jira] [Updated] (SPARK-31377) Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite
[ https://issues.apache.org/jira/browse/SPARK-31377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivas Rishindra Pothireddi updated SPARK-31377: -- Description: For some combinations of join algorithm and join types there are no unit tests for the "number of output rows" metric. A list of missing unit tests includes the following. * SortMergeJoin: ExistenceJoin * ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin * BroadcastNestedLoopJoin: RightOuter, InnerJoin, ExistenceJoin * BroadcastHashJoin: LeftAnti, ExistenceJoin was: For some combinations of join algorithm and join types there are no unit tests for the "number of output rows" metric. A list of missing unit tests includes the following. * SortMergeJoin: ExistenceJoin * ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin * BroadcastHashJoin: LeftAnti, ExistenceJoin > Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite > -- > > Key: SPARK-31377 > URL: https://issues.apache.org/jira/browse/SPARK-31377 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Minor > > For some combinations of join algorithm and join types there are no unit > tests for the "number of output rows" metric. > A list of missing unit tests includes the following. > * SortMergeJoin: ExistenceJoin > * ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin > * BroadcastNestedLoopJoin: RightOuter, InnerJoin, ExistenceJoin > * BroadcastHashJoin: LeftAnti, ExistenceJoin -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31377) Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite
[ https://issues.apache.org/jira/browse/SPARK-31377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivas Rishindra Pothireddi updated SPARK-31377: -- Description: For some combinations of join algorithm and join types there are no unit tests for the "number of output rows" metric. A list of missing unit tests includes the following. * SortMergeJoin: ExistenceJoin * ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin * BroadcastHashJoin: LeftAnti, ExistenceJoin was: For some combinations of join algorithm and join types there are no unit tests for the "number of output rows" metric. A list of missing unit tests includes the following. * SortMergeJoin: ExistenceJoin * ShuffledHashJoin: OuterJoin, LeftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin * BroadcastHashJoin: LeftAnti, ExistenceJoin > Add unit tests for "number of output rows" metric for joins in SQLMetricsSuite > -- > > Key: SPARK-31377 > URL: https://issues.apache.org/jira/browse/SPARK-31377 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.1.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Minor > > For some combinations of join algorithm and join types there are no unit > tests for the "number of output rows" metric. > A list of missing unit tests includes the following. > * SortMergeJoin: ExistenceJoin > * ShuffledHashJoin: LeftOuter, RightOuter, LeftAnti, LeftSemi, ExistenceJoin > * BroadcastNestedLoopJoin: RightOuter, ExistenceJoin, InnerJoin > * BroadcastHashJoin: LeftAnti, ExistenceJoin -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
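For reference, the expected "number of output rows" for the less common join types listed above can be pinned down on toy data. A minimal sketch in plain Python (not Spark's test APIs), assuming standard semi/anti/existence join semantics:

```python
left = [1, 2, 3, 4]
right = [3, 4, 5]

def left_semi(l, r):
    # left rows that have at least one match on the right
    rs = set(r)
    return [x for x in l if x in rs]

def left_anti(l, r):
    # left rows that have no match on the right
    rs = set(r)
    return [x for x in l if x not in rs]

def existence(l, r):
    # every left row, tagged with an "exists" flag: output rows == input rows
    rs = set(r)
    return [(x, x in rs) for x in l]

semi_rows = len(left_semi(left, right))       # 2 (rows 3 and 4 match)
anti_rows = len(left_anti(left, right))       # 2 (rows 1 and 2 do not)
exists_rows = len(existence(left, right))     # 4 (one output row per input row)
```

A metrics test for any of the missing combinations would assert these counts against the physical plan's `numOutputRows` metric.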
[jira] [Updated] (SPARK-30953) InsertAdaptiveSparkPlan should apply AQE on child plan of write commands
[ https://issues.apache.org/jira/browse/SPARK-30953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-30953: - Summary: InsertAdaptiveSparkPlan should apply AQE on child plan of write commands (was: InsertAdaptiveSparkPlan should also skip v2 command) > InsertAdaptiveSparkPlan should apply AQE on child plan of write commands > > > Key: SPARK-30953 > URL: https://issues.apache.org/jira/browse/SPARK-30953 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > InsertAdaptiveSparkPlan should also skip v2 commands as we did for v1 commands. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30953) InsertAdaptiveSparkPlan should apply AQE on child plan of write commands
[ https://issues.apache.org/jira/browse/SPARK-30953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-30953: - Description: Applying AQE to the child plan of write commands will expose {{LogicalQueryStage}} to the {{Analyzer}}, while it should stay hidden under {{AdaptiveSparkPlanExec}} only, to avoid unexpected breakage. (was: InsertAdaptiveSparkPlan should also skip v2 commands as we did for v1 commands.) > InsertAdaptiveSparkPlan should apply AQE on child plan of write commands > > > Key: SPARK-30953 > URL: https://issues.apache.org/jira/browse/SPARK-30953 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > Applying AQE to the child plan of write commands will expose {{LogicalQueryStage}} > to the {{Analyzer}}, while it should stay hidden under {{AdaptiveSparkPlanExec}} only, to > avoid unexpected breakage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31415) builtin date-time functions/operations improvement
[ https://issues.apache.org/jira/browse/SPARK-31415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31415. - Fix Version/s: 3.0.0 Resolution: Fixed > builtin date-time functions/operations improvement > -- > > Key: SPARK-31415 > URL: https://issues.apache.org/jira/browse/SPARK-31415 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27898) Support 4 date operators(date + integer, integer + date, date - integer and date - date)
[ https://issues.apache.org/jira/browse/SPARK-27898?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-27898: Parent Issue: SPARK-31415 (was: SPARK-27764) > Support 4 date operators(date + integer, integer + date, date - integer and > date - date) > > > Key: SPARK-27898 > URL: https://issues.apache.org/jira/browse/SPARK-27898 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.0 > > > Support 4 date operators(date + integer, integer + date, date - integer and > date - date): > ||Operator||Example||Result|| > |+|date '2001-09-28' + integer '7'|date '2001-10-05'| > |-|date '2001-10-01' - integer '7'|date '2001-09-24'| > |-|date '2001-10-01' - date '2001-09-28'|integer '3' (days)| > [https://www.postgresql.org/docs/12/functions-datetime.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
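The three operators in the table above map directly onto ordinary date arithmetic. A small sketch using Python's `datetime` (an illustration of the semantics, not Spark's implementation) reproduces the table's examples:

```python
from datetime import date, timedelta

# date + integer: the integer is a number of days
assert date(2001, 9, 28) + timedelta(days=7) == date(2001, 10, 5)

# date - integer
assert date(2001, 10, 1) - timedelta(days=7) == date(2001, 9, 24)

# date - date: an integer number of days
assert (date(2001, 10, 1) - date(2001, 9, 28)).days == 3
```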
[jira] [Updated] (SPARK-29387) Support `*` and `/` operators for intervals
[ https://issues.apache.org/jira/browse/SPARK-29387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29387: Parent Issue: SPARK-31415 (was: SPARK-27764) > Support `*` and `/` operators for intervals > --- > > Key: SPARK-29387 > URL: https://issues.apache.org/jira/browse/SPARK-29387 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Support `*` by numeric, `/` by numeric. See > [https://www.postgresql.org/docs/12/functions-datetime.html] > ||Operator||Example||Result|| > |*|900 * interval '1 second'|interval '00:15:00'| > |*|21 * interval '1 day'|interval '21 days'| > |/|interval '1 hour' / double precision '1.5'|interval '00:40:00'| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
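The `*` and `/` interval semantics in the table correspond to scaling a duration. Python's `timedelta` (used here purely to illustrate the expected results, not Spark's `CalendarInterval`) gives the same answers:

```python
from datetime import timedelta

# 900 * interval '1 second' == interval '00:15:00'
assert 900 * timedelta(seconds=1) == timedelta(minutes=15)

# 21 * interval '1 day' == interval '21 days'
assert 21 * timedelta(days=1) == timedelta(days=21)

# interval '1 hour' / 1.5 == interval '00:40:00'
assert timedelta(hours=1) / 1.5 == timedelta(minutes=40)
```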
[jira] [Updated] (SPARK-29365) Support date and timestamp subtraction
[ https://issues.apache.org/jira/browse/SPARK-29365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29365: Parent Issue: SPARK-31415 (was: SPARK-27764) > Support date and timestamp subtraction > --- > > Key: SPARK-29365 > URL: https://issues.apache.org/jira/browse/SPARK-29365 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > PostgreSQL supports subtraction dates and timestamps: > {code} > maxim=# select date'2020-01-01' - timestamp'2019-10-06 10:11:12.345678'; > ?column? > - > 86 days 13:48:47.654322 > (1 row) > maxim=# select timestamp'2019-10-06 10:11:12.345678' - date'2020-01-01'; > ?column? > --- > -86 days -13:48:47.654322 > (1 row) > {code} > Need to provide the same feature in Spark SQL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
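The PostgreSQL results quoted above can be reproduced with plain timestamp arithmetic, promoting the date to midnight. A sketch of the expected semantics (not Spark's implementation):

```python
from datetime import datetime, timedelta

# date'2020-01-01' promotes to midnight
d = datetime(2020, 1, 1)
ts = datetime(2019, 10, 6, 10, 11, 12, 345678)

# date - timestamp: 86 days 13:48:47.654322
diff = d - ts
assert diff == timedelta(days=86, hours=13, minutes=48,
                         seconds=47, microseconds=654322)

# timestamp - date: the negated interval
assert ts - d == -diff
```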
[jira] [Updated] (SPARK-31415) builtin date-time functions/operations improvement
[ https://issues.apache.org/jira/browse/SPARK-31415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31415: Summary: builtin date-time functions/operations improvement (was: builtin date-time functions improvement) > builtin date-time functions/operations improvement > -- > > Key: SPARK-31415 > URL: https://issues.apache.org/jira/browse/SPARK-31415 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29774) Date and Timestamp type +/- null should be null as Postgres
[ https://issues.apache.org/jira/browse/SPARK-29774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29774: Parent Issue: SPARK-31415 (was: SPARK-27764) > Date and Timestamp type +/- null should be null as Postgres > --- > > Key: SPARK-29774 > URL: https://issues.apache.org/jira/browse/SPARK-29774 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.0 > > > {code:sql} > postgres=# select timestamp '1999-12-31' - null; > ?column? > -- > (1 row) > postgres=# select date '1999-12-31' - null; > ?column? > -- > (1 row) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29486) CalendarInterval should have 3 fields: months, days and microseconds
[ https://issues.apache.org/jira/browse/SPARK-29486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29486: Parent: SPARK-31415 Issue Type: Sub-task (was: Bug) > CalendarInterval should have 3 fields: months, days and microseconds > > > Key: SPARK-29486 > URL: https://issues.apache.org/jira/browse/SPARK-29486 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.4 >Reporter: Liu, Linhong >Assignee: Liu, Linhong >Priority: Major > Fix For: 3.0.0 > > > The current CalendarInterval has 2 fields: months and microseconds. This PR tries > to change it > to 3 fields: months, days and microseconds. This is because one logical day > interval may > have a different number of microseconds (daylight saving). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
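The motivation, that one logical day is not always 86,400 seconds, is easy to demonstrate. A sketch using Python's `zoneinfo` (assuming the IANA `America/Los_Angeles` zone data is available): the 2019 fall-back day lasts 25 hours, which is why a separate `days` field cannot be folded into microseconds.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo

tz = ZoneInfo("America/Los_Angeles")

# 2019-11-03 is the DST fall-back day: clocks go back one hour at 2am,
# so the "one day" from local midnight to local midnight spans 25 hours.
start = datetime(2019, 11, 3, tzinfo=tz).astimezone(timezone.utc)
end = datetime(2019, 11, 4, tzinfo=tz).astimezone(timezone.utc)

assert end - start == timedelta(hours=25)
```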
[jira] [Updated] (SPARK-29364) Return intervals from date subtracts
[ https://issues.apache.org/jira/browse/SPARK-29364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29364: Parent: SPARK-31415 Issue Type: Sub-task (was: Improvement) > Return intervals from date subtracts > > > Key: SPARK-29364 > URL: https://issues.apache.org/jira/browse/SPARK-29364 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.4 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > According to the SQL standard, date1 - date2 is an interval. See > http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt at +_4.5.3 > Operations involving datetimes and intervals_+. The ticket aims to modify the > DateDiff expression to produce another expression of the INTERVAL type when > `spark.sql.ansi.enabled` is set to *true* and `spark.sql.dialect` is *Spark*. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28420) Date/Time Functions: date_part for intervals
[ https://issues.apache.org/jira/browse/SPARK-28420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28420: Parent Issue: SPARK-31415 (was: SPARK-30375) > Date/Time Functions: date_part for intervals > > > Key: SPARK-28420 > URL: https://issues.apache.org/jira/browse/SPARK-28420 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > ||Function||Return Type||Description||Example||Result|| > |{{date_part(}}{{text}}{{, }}{{interval}}{{)}}|{{double precision}}|Get > subfield (equivalent to {{extract}}); see [Section > 9.9.1|https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT]|{{date_part('month', > interval '2 years 3 months')}}|{{3}}| > We can replace it with {{extract(field from timestamp)}}. > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29355) Support timestamps subtraction
[ https://issues.apache.org/jira/browse/SPARK-29355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29355: Parent Issue: SPARK-31415 (was: SPARK-27764) > Support timestamps subtraction > -- > > Key: SPARK-29355 > URL: https://issues.apache.org/jira/browse/SPARK-29355 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > ||Operator||Example||Result|| > |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 > 12:00'}}|{{interval '1 day 15:00:00'}}| > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
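The example in the table above corresponds to plain timestamp arithmetic. A one-line check of the expected result (Python `datetime`, standing in for the SQL semantics):

```python
from datetime import datetime, timedelta

# timestamp '2001-09-29 03:00' - timestamp '2001-09-27 12:00'
# == interval '1 day 15:00:00'
diff = datetime(2001, 9, 29, 3, 0) - datetime(2001, 9, 27, 12, 0)
assert diff == timedelta(days=1, hours=15)
```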
[jira] [Updated] (SPARK-29187) Return null from `date_part()` for the null `field`
[ https://issues.apache.org/jira/browse/SPARK-29187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29187: Parent Issue: SPARK-31415 (was: SPARK-30375) > Return null from `date_part()` for the null `field` > --- > > Key: SPARK-29187 > URL: https://issues.apache.org/jira/browse/SPARK-29187 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > PostgreSQL returns NULL for a NULL field in the date_part() function: > {code} > maxim=# select date_part(null, date'2019-09-20'); > date_part > --- > > (1 row) > {code} > but Spark fails with the error: > {code} > spark-sql> select date_part(null, date'2019-09-20'); > Error in query: null; line 1 pos 7 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28141) Date type can not accept special values
[ https://issues.apache.org/jira/browse/SPARK-28141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28141: Parent Issue: SPARK-31415 (was: SPARK-27764) > Date type can not accept special values > --- > > Key: SPARK-28141 > URL: https://issues.apache.org/jira/browse/SPARK-28141 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > ||Input String||Valid Types||Description|| > |{{epoch}}|{{date}}|1970-01-01 00:00:00+00 (Unix system time zero)| > |{{now}}|{{date}}|current transaction's start time| > |{{today}}|{{date}}|midnight today| > |{{tomorrow}}|{{date}}|midnight tomorrow| > |{{yesterday}}|{{date}}|midnight yesterday| > https://www.postgresql.org/docs/12/datatype-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29012) Timestamp type can not accept special values
[ https://issues.apache.org/jira/browse/SPARK-29012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29012: Parent Issue: SPARK-31415 (was: SPARK-27764) > Timestamp type can not accept special values > > > Key: SPARK-29012 > URL: https://issues.apache.org/jira/browse/SPARK-29012 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > ||Input String||Valid Types||Description|| > |{{epoch}}|{{timestamp}}|1970-01-01 00:00:00+00 (Unix system time zero)| > |{{now}}|{{timestamp}}|current transaction's start time| > |{{today}}|{{timestamp}}|midnight today| > |{{tomorrow}}|{{timestamp}}|midnight tomorrow| > |{{yesterday}}|{{timestamp}}|midnight yesterday| > https://www.postgresql.org/docs/12/datatype-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
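Resolving the special input strings in the two tables above (for both `date` and `timestamp` in SPARK-28141/SPARK-29012) is a small lookup plus day arithmetic. A sketch with a hypothetical `special_date` helper (illustrative only, not Spark's parser):

```python
from datetime import date, timedelta

def special_date(s, today=None):
    """Hypothetical resolver for the special date strings listed above."""
    today = today or date.today()  # "current transaction's start time"
    table = {
        "epoch": date(1970, 1, 1),         # Unix system time zero
        "now": today,
        "today": today,
        "tomorrow": today + timedelta(days=1),
        "yesterday": today - timedelta(days=1),
    }
    return table[s.strip().lower()]

ref = date(2020, 4, 10)
assert special_date("epoch") == date(1970, 1, 1)
assert special_date("tomorrow", ref) == date(2020, 4, 11)
assert special_date("yesterday", ref) == date(2020, 4, 9)
```

The timestamp variant is the same mapping with midnight (or the transaction start time) attached.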
[jira] [Updated] (SPARK-28690) Date/Time Functions: date_part for timestamps
[ https://issues.apache.org/jira/browse/SPARK-28690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28690: Parent Issue: SPARK-31415 (was: SPARK-30375) > Date/Time Functions: date_part for timestamps > - > > Key: SPARK-28690 > URL: https://issues.apache.org/jira/browse/SPARK-28690 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > ||Function||Return Type||Description||Example||Result|| > |{{date_part(}}{{text}}{{, }}{{timestamp}}{{)}}|{{double precision}}|Get > subfield (equivalent to {{extract}}); see [Section > 9.9.1|https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT]|{{date_part('hour', > timestamp '2001-02-16 20:38:40')}}|{{20}}| > We can replace it with {{extract(field from timestamp)}}. > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28687) Support `epoch`, `isoyear`, `milliseconds` and `microseconds` at `extract()`
[ https://issues.apache.org/jira/browse/SPARK-28687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28687: Parent Issue: SPARK-31415 (was: SPARK-30375) > Support `epoch`, `isoyear`, `milliseconds` and `microseconds` at `extract()` > > > Key: SPARK-28687 > URL: https://issues.apache.org/jira/browse/SPARK-28687 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Currently, we support these fields for EXTRACT: CENTURY, MILLENNIUM, DECADE, > YEAR, QUARTER, MONTH, WEEK, DAY, DAYOFWEEK, HOUR, MINUTE, SECOND, DOW, > ISODOW, DOY. > We also need to support: EPOCH, MICROSECONDS, MILLISECONDS, TIMEZONE, > TIMEZONE_M, TIMEZONE_H, ISOYEAR. > https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28700) make_timestamp loses seconds fractions
[ https://issues.apache.org/jira/browse/SPARK-28700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28700: Parent: SPARK-31415 Issue Type: Sub-task (was: Bug) > make_timestamp loses seconds fractions > -- > > Key: SPARK-28700 > URL: https://issues.apache.org/jira/browse/SPARK-28700 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > The make_timestamp() function accepts seconds with a fractional part of up to > microsecond precision. In some cases, it can lose precision in the fractional part. > For example: > {code} > spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 58.01); > 2019-08-12 00:00:58 > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
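The precision loss comes from how the fractional seconds are split off. One way to keep the microseconds exact is to do the split with decimal rather than binary floating-point arithmetic; a hedged sketch (hypothetical `make_timestamp` helper, not Spark's implementation):

```python
from datetime import datetime
from decimal import Decimal

def make_timestamp(year, month, day, hour, minute, sec):
    """Split fractional seconds exactly so 58.01 keeps its microseconds."""
    s = Decimal(str(sec))                    # decimal, not binary float
    whole = int(s)
    micros = int((s - whole) * 1_000_000)    # exact: 0.01 -> 10000 us
    return datetime(year, month, day, hour, minute, whole, micros)

ts = make_timestamp(2019, 8, 12, 0, 0, 58.01)
assert ts.microsecond == 10_000
assert str(ts) == "2019-08-12 00:00:58.010000"
```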
[jira] [Updated] (SPARK-28656) Support `millennium`, `century` and `decade` at `extract()`
[ https://issues.apache.org/jira/browse/SPARK-28656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28656: Parent Issue: SPARK-31415 (was: SPARK-30375) > Support `millennium`, `century` and `decade` at `extract()` > --- > > Key: SPARK-28656 > URL: https://issues.apache.org/jira/browse/SPARK-28656 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Currently, we support these fields for EXTRACT: YEAR, QUARTER, MONTH, WEEK, > DAY, DAYOFWEEK, HOUR, MINUTE, SECOND. > We also need to support: EPOCH, CENTURY, MILLENNIUM, DECADE, MICROSECONDS, > MILLISECONDS, DOW, ISODOW, DOY, TIMEZONE, TIMEZONE_M, TIMEZONE_H, JULIAN, > ISOYEAR. > https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
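For the `millennium`, `century` and `decade` fields, PostgreSQL's rules are short arithmetic: decades are plain division, while centuries and millennia have no year 0, so year 2000 is still the 20th century. A sketch of those rules for positive years (illustrative, not Spark's implementation):

```python
def extract(field, year):
    """PostgreSQL-style rules for the year-derived fields named above."""
    if field == "decade":
        return year // 10                 # 1994 -> 199
    if field == "century":
        return (year - 1) // 100 + 1      # 2000 -> 20, 2001 -> 21
    if field == "millennium":
        return (year - 1) // 1000 + 1     # 2001 -> 3
    raise ValueError(f"unsupported field: {field}")

assert extract("decade", 1994) == 199
assert extract("century", 2000) == 20
assert extract("century", 2001) == 21
assert extract("millennium", 2001) == 3
```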
[jira] [Updated] (SPARK-28017) Enhance DATE_TRUNC
[ https://issues.apache.org/jira/browse/SPARK-28017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28017: Parent Issue: SPARK-31415 (was: SPARK-30375) > Enhance DATE_TRUNC > -- > > Key: SPARK-28017 > URL: https://issues.apache.org/jira/browse/SPARK-28017 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > For DATE_TRUNC, we need support: microseconds, milliseconds, decade, century, > millennium. > https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28459) Date/Time Functions: make_timestamp
[ https://issues.apache.org/jira/browse/SPARK-28459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28459: Parent Issue: SPARK-31415 (was: SPARK-30375) > Date/Time Functions: make_timestamp > --- > > Key: SPARK-28459 > URL: https://issues.apache.org/jira/browse/SPARK-28459 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > ||Function||Return Type||Description||Example||Result|| > |{{make_timestamp(_year_ }}{{int}}{{, _month_ }}{{int}}{{, _day_ }}{{int}}{{, > _hour_ }}{{int}}{{, _min_ }}{{int}}{{, _sec_}}{{double > precision}}{{)}}|{{timestamp}}|Create timestamp from year, month, day, hour, > minute and seconds fields|{{make_timestamp(2013, 7, 15, 8, 15, > 23.5)}}|{{2013-07-15 08:15:23.5}}| > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28432) Add `make_date` function
[ https://issues.apache.org/jira/browse/SPARK-28432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28432: Parent Issue: SPARK-31415 (was: SPARK-30375) > Add `make_date` function > > > Key: SPARK-28432 > URL: https://issues.apache.org/jira/browse/SPARK-28432 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > ||Function||Return Type||Description||Example||Result|| > |{{make_date(_year_ }}{{int}}{{, _month_ }}{{int}}{{, _day_ > }}{{int}}{{)}}|{{date}}|Create date from year, month and day > fields|{{make_date(2013, 7, 15)}}|{{2013-07-15}}| > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31415) builtin date-time functions improvement
Wenchen Fan created SPARK-31415: --- Summary: builtin date-time functions improvement Key: SPARK-31415 URL: https://issues.apache.org/jira/browse/SPARK-31415 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31121) catalog plugin API
[ https://issues.apache.org/jira/browse/SPARK-31121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31121: Summary: catalog plugin API (was: Spark API for Table Metadata) > catalog plugin API > -- > > Key: SPARK-31121 > URL: https://issues.apache.org/jira/browse/SPARK-31121 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > > Details please see the SPIP doc: > https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d > This will bring multi-catalog support to Spark and allow external catalog > implementations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27576) table capability to skip the output column resolution
[ https://issues.apache.org/jira/browse/SPARK-27576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-27576: Parent: SPARK-25390 Issue Type: Sub-task (was: Improvement) > table capability to skip the output column resolution > > > Key: SPARK-27576 > URL: https://issues.apache.org/jira/browse/SPARK-27576 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27521) move data source v2 API to catalyst module
[ https://issues.apache.org/jira/browse/SPARK-27521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-27521: Parent: SPARK-25390 Issue Type: Sub-task (was: Improvement) > move data source v2 API to catalyst module > -- > > Key: SPARK-27521 > URL: https://issues.apache.org/jira/browse/SPARK-27521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26811) Add DataSourceV2 capabilities to check support for batch append, overwrite, truncate during analysis.
[ https://issues.apache.org/jira/browse/SPARK-26811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-26811: Parent: SPARK-25390 Issue Type: Sub-task (was: Bug) > Add DataSourceV2 capabilities to check support for batch append, overwrite, > truncate during analysis. > - > > Key: SPARK-26811 > URL: https://issues.apache.org/jira/browse/SPARK-26811 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Ryan Blue >Assignee: Ryan Blue >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27190) Add DataSourceV2 capabilities for streaming
[ https://issues.apache.org/jira/browse/SPARK-27190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-27190: Parent: SPARK-25390 Issue Type: Sub-task (was: Improvement) > Add DataSourceV2 capabilities for streaming > --- > > Key: SPARK-27190 > URL: https://issues.apache.org/jira/browse/SPARK-27190 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24882) data source v2 API improvement
[ https://issues.apache.org/jira/browse/SPARK-24882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-24882: Parent Issue: SPARK-25390 (was: SPARK-22386) > data source v2 API improvement > -- > > Key: SPARK-24882 > URL: https://issues.apache.org/jira/browse/SPARK-24882 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > Data source V2 has been out for a while; see the SPIP > [here|https://docs.google.com/document/d/1n_vUVbF4KD3gxTmkNEon5qdQ-Z8qU5Frf6WMQZ6jJVM/edit?usp=sharing]. > We have already migrated most of the built-in streaming data sources to the > V2 API, and the file source migration is in progress. During the migration, > we found several problems and want to address them before we stabilize the V2 > API. > To solve these problems, we need to separate responsibilities in the data > source v2 API, isolate the stateful part of the API, and think about better > names for some interfaces. For details, please see the attached Google doc: > https://docs.google.com/document/d/1DDXCTCrup4bKWByTalkXWgavcPdvur8a4eEu8x1BzPM/edit?usp=sharing -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31406) ThriftServerQueryTestSuite: Sharing test data and test tables among multiple test cases.
[ https://issues.apache.org/jira/browse/SPARK-31406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31406. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28180 [https://github.com/apache/spark/pull/28180] > ThriftServerQueryTestSuite: Sharing test data and test tables among multiple > test cases. > > > Key: SPARK-31406 > URL: https://issues.apache.org/jira/browse/SPARK-31406 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Minor > Fix For: 3.0.0 > > > ThriftServerQueryTestSuite takes 17 minutes to run. > I checked the code and found that ThriftServerQueryTestSuite loads test data > repeatedly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31406) ThriftServerQueryTestSuite: Sharing test data and test tables among multiple test cases.
[ https://issues.apache.org/jira/browse/SPARK-31406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31406: --- Assignee: jiaan.geng > ThriftServerQueryTestSuite: Sharing test data and test tables among multiple > test cases. > > > Key: SPARK-31406 > URL: https://issues.apache.org/jira/browse/SPARK-31406 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Minor > > ThriftServerQueryTestSuite takes 17 minutes to run. > I checked the code and found that ThriftServerQueryTestSuite loads test data > repeatedly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31124) change the default value of minPartitionNum in AQE
[ https://issues.apache.org/jira/browse/SPARK-31124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31124: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > change the default value of minPartitionNum in AQE > -- > > Key: SPARK-31124 > URL: https://issues.apache.org/jira/browse/SPARK-31124 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31414) Performance regression with new TimestampFormatter for json and csv
Kent Yao created SPARK-31414: Summary: Performance regression with new TimestampFormatter for json and csv Key: SPARK-31414 URL: https://issues.apache.org/jira/browse/SPARK-31414 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Kent Yao

With the original benchmark, where the timestamp values are valid for the new parser, the result is
{code:java}
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5781 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 44764 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 93764 ms
[info] Running case: from_json(timestamp)
[info] Stopped after 3 iterations, 59021 ms
{code}
When we modify the benchmark to
{code:java}
def timestampStr: Dataset[String] = {
  spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
    iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
  }.select($"value".as("timestamp")).as[String]
}

readBench.addCase("timestamp strings", numIters) { _ =>
  timestampStr.noop()
}

readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ =>
  spark.read.schema(tsSchema).json(timestampStr).noop()
}

readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ =>
  spark.read.json(timestampStr).noop()
}
{code}
so that the timestamp values are invalid for the new parser, which causes a fallback to the legacy parser, the result is
{code:java}
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5623 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 506637 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 509076 ms
{code}
This is roughly a 10x performance regression. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
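The cost of the fallback path can be sketched outside Spark with plain JDK parsers. The formatter patterns below are illustrative stand-ins, not Spark's actual ones; the point is that every record that the strict parser rejects pays for a thrown-and-caught exception before the legacy parser runs, which is consistent with a large slowdown at scale:

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

public class FallbackSketch {
    // Stand-in for the new strict parser: the "SSS" pattern demands exactly
    // three fraction digits, so "1970-01-01T01:02:03.5" fails to parse.
    static final DateTimeFormatter STRICT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSS");
    // Stand-in for the legacy parser: SimpleDateFormat is lenient about widths.
    static final SimpleDateFormat LEGACY =
        new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.S");

    static long parseWithFallback(String s) {
        try {
            return LocalDateTime.parse(s, STRICT)
                .atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
        } catch (DateTimeParseException notStrict) {
            try {
                // Per-record throw/catch on this path is the hidden cost.
                return LEGACY.parse(s).getTime();
            } catch (ParseException bad) {
                throw new IllegalArgumentException(s, bad);
            }
        }
    }

    // True if the strict stand-in rejects the value (forcing the fallback).
    static boolean strictRejects(String s) {
        try {
            LocalDateTime.parse(s, STRICT);
            return false;
        } catch (DateTimeParseException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        long viaFallback = parseWithFallback("1970-01-01T01:02:03.5");   // legacy path
        long viaStrict   = parseWithFallback("1970-01-01T01:02:03.005"); // strict path
        System.out.println(viaFallback == viaStrict); // true: same instant either way
    }
}
```

Both parsers agree on the instant; the difference is purely the exception-driven control flow on the fallback path.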
[jira] [Updated] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter
[ https://issues.apache.org/jira/browse/SPARK-31413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] varun senthilnathan updated SPARK-31413: Description: We are using spark 2.4.3 in java. We would like to log the partition key and the sequence number of every event. The overloaded create stream function of the kinesis utils always throws a compilation error. Function printSeq = s -> s; KinesisUtils.createStream( jssc, appName, streamName, endPointUrl, regionName, InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class); *{color:#172b4d}Record is of type com.amazonaws.services.kinesis.model.Record{color}* The exception is as follows: {quote}no suitable method found for createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class) {quote} JAVA DOCS : [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-] Is there a way out? was: We are using spark 2.4.3 in java. We would like to log the partition key and the sequence number of every event. The overloaded create stream function of the kinesis utils always throws a compilation error. 
Function printSeq = s -> s; KinesisUtils.createStream( jssc, appName, streamName, endPointUrl, regionName, InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class); Record is of type com.amazonaws.services.kinesis.model.Record The exception is as follows: {quote}no suitable method found for createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class) {quote} JAVA DOCS : [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-] Is there a way out? > Accessing the sequence number and partition id for records in Kinesis adapter > - > > Key: SPARK-31413 > URL: https://issues.apache.org/jira/browse/SPARK-31413 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 2.4.3 >Reporter: varun senthilnathan >Priority: Critical > > We are using spark 2.4.3 in java. We would like to log the partition key and > the sequence number of every event. The overloaded create stream function of > the kinesis utils always throws a compilation error. 
> > Function printSeq = s -> s; > KinesisUtils.createStream( > jssc, > appName, > streamName, > endPointUrl, > regionName, > InitialPositionInStream.TRIM_HORIZON, > kinesisCheckpointInterval, > StorageLevel.MEMORY_AND_DISK_SER(), > printSeq, > Record.class); > *{color:#172b4d}Record is of type > com.amazonaws.services.kinesis.model.Record{color}* > > The exception is as follows: > {quote}no suitable method found for > createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class) > {quote} > JAVA DOCS : > [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apach
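Judging from the error text and the linked Javadoc, the likely cause is that `printSeq` was declared as `java.util.function.Function`, while `createStream` is declared with `org.apache.spark.api.java.function.Function`; two functional interfaces with the same shape are still unrelated types in Java. A self-contained sketch of that failure mode, where `SparkFunction` and `createStream` are hypothetical stand-ins for the real Spark types:

```java
import java.nio.charset.StandardCharsets;

public class SamMismatchSketch {
    // Hypothetical stand-in for org.apache.spark.api.java.function.Function.
    interface SparkFunction<T, R> {
        R call(T t) throws Exception;
    }

    // Hypothetical stand-in for the createStream overload that takes a
    // message handler; the real one hands each Kinesis Record to the handler.
    static <T> T createStream(SparkFunction<byte[], T> handler) {
        try {
            return handler.call("record-bytes".getBytes(StandardCharsets.UTF_8));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Compiles: the lambda targets the expected SAM type directly.
        String ok = createStream(bytes -> new String(bytes, StandardCharsets.UTF_8));
        System.out.println(ok); // prints "record-bytes"

        // Does NOT compile: a variable declared as java.util.function.Function
        // is an unrelated interface, even though its shape (one abstract
        // method T -> R) is identical:
        //   java.util.function.Function<byte[], String> printSeq =
        //       bytes -> new String(bytes, StandardCharsets.UTF_8);
        //   createStream(printSeq); // error: no suitable method found
    }
}
```

If this diagnosis holds, declaring the handler with Spark's `Function` type (or passing the lambda directly) should compile, and the handler can then read the sequence number and partition key off each `com.amazonaws.services.kinesis.model.Record` it receives.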
[jira] [Resolved] (SPARK-31412) New Adaptive Query Execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-31412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31412. - Fix Version/s: 3.0.0 Resolution: Fixed > New Adaptive Query Execution in Spark SQL > - > > Key: SPARK-31412 > URL: https://issues.apache.org/jira/browse/SPARK-31412 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However, there are > some limitations. First, it may cause additional shuffles that can decrease > performance; we can see this in the EnsureRequirements rule when it adds > ExchangeCoordinator. Second, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges, because we don't have a > global picture of all shuffle dependencies of a post-shuffle stage. For > example, for a three-table join in a single stage, the same > ExchangeCoordinator should be used across the three Exchanges, but currently > two separate ExchangeCoordinators will be added. Third, with the current > framework it is not easy to flexibly implement other adaptive-execution > features, such as changing the execution plan or handling skewed joins at > runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. 
The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter
[ https://issues.apache.org/jira/browse/SPARK-31413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] varun senthilnathan updated SPARK-31413: Description: We are using spark 2.4.3 in java. We would like to log the partition key and the sequence number of every event. The overloaded create stream function of the kinesis utils always throws a compilation error. Function printSeq = s -> s; KinesisUtils.createStream( jssc, appName, streamName, endPointUrl, regionName, InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class); Record is of type com.amazonaws.services.kinesis.model.Record The exception is as follows: {quote}no suitable method found for createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class) {quote} JAVA DOCS : [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-] Is there a way out? was: We are using spark 2.4.3 in java. We would like to log the partition key and the sequence number of every event. The overloaded create stream function of the kinesis utils always throws a compilation error. 
Function printSeq = s -> s; KinesisUtils.createStream( jssc, appName, streamName, endPointUrl, regionName, InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class); The exception is as follows: {quote}no suitable method found for createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class) {quote} JAVA DOCS : [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-] Is there a way out? > Accessing the sequence number and partition id for records in Kinesis adapter > - > > Key: SPARK-31413 > URL: https://issues.apache.org/jira/browse/SPARK-31413 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 2.4.3 >Reporter: varun senthilnathan >Priority: Critical > > We are using spark 2.4.3 in java. We would like to log the partition key and > the sequence number of every event. The overloaded create stream function of > the kinesis utils always throws a compilation error. 
> > Function printSeq = s -> s; > KinesisUtils.createStream( > jssc, > appName, > streamName, > endPointUrl, > regionName, > InitialPositionInStream.TRIM_HORIZON, > kinesisCheckpointInterval, > StorageLevel.MEMORY_AND_DISK_SER(), > printSeq, > Record.class); > Record is of type com.amazonaws.services.kinesis.model.Record > > The exception is as follows: > {quote}no suitable method found for > createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class) > {quote} > JAVA DOCS : > [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-] > Is there a way out? -- This message was sent by Atlassian Ji
[jira] [Updated] (SPARK-30864) Add the user guide for Adaptive Query Execution
[ https://issues.apache.org/jira/browse/SPARK-30864?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30864: Parent: SPARK-31412 Issue Type: Sub-task (was: Bug) > Add the user guide for Adaptive Query Execution > --- > > Key: SPARK-30864 > URL: https://issues.apache.org/jira/browse/SPARK-30864 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > > This Jira will add a detailed user guide describing how to enable AQE and > its three main features. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
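As background for the guide's scope: the features referred to above are driven by configuration switches. A sketch of the relevant settings (option names as of Spark 3.0; they should be checked against the released guide):

```properties
# Master switch for the new Adaptive Query Execution framework
spark.sql.adaptive.enabled=true
# Coalesce small post-shuffle partitions at runtime
spark.sql.adaptive.coalescePartitions.enabled=true
# Detect and split skewed partitions in sort-merge joins
spark.sql.adaptive.skewJoin.enabled=true
```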
[jira] [Updated] (SPARK-30918) improve the splitting of skewed partitions
[ https://issues.apache.org/jira/browse/SPARK-30918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30918: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > improve the splitting of skewed partitions > -- > > Key: SPARK-30918 > URL: https://issues.apache.org/jira/browse/SPARK-30918 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28560) Optimize shuffle reader to local shuffle reader when smj converted to bhj in adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-28560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28560: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > Optimize shuffle reader to local shuffle reader when smj converted to bhj in > adaptive execution > --- > > Key: SPARK-28560 > URL: https://issues.apache.org/jira/browse/SPARK-28560 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > > Implement a rule in the new adaptive execution framework introduced in > SPARK-23128. This rule is used to optimize the shuffle reader to local > shuffle reader when smj is converted to bhj in adaptive execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29893) Improve the local reader performance by changing the task number from 1 to multi
[ https://issues.apache.org/jira/browse/SPARK-29893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29893: Parent: (was: SPARK-28560) Issue Type: Improvement (was: Sub-task) > Improve the local reader performance by changing the task number from 1 to > multi > > > Key: SPARK-29893 > URL: https://issues.apache.org/jira/browse/SPARK-29893 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > > Currently the local reader reads all partitions of the map stage using a > single task, which may cause performance degradation. This PR improves > performance by using multiple tasks instead of one. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter
[ https://issues.apache.org/jira/browse/SPARK-31413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] varun senthilnathan updated SPARK-31413: Environment: (was: We are using spark 2.4.3 in java. We would like to log the partition key and the sequence number of every event. The overloaded create stream function of the kinesis utils always throws a compilation error. {{Function printSeq = s -> s;}} {{KinesisUtils.createStream(}}jssc, appName, streamName, endPointUrl, regionName, InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class); The exception is as follows: {quote}no suitable method found for createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class) {quote} JAVA DOCS: [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-] Is there a way out?) > Accessing the sequence number and partition id for records in Kinesis adapter > - > > Key: SPARK-31413 > URL: https://issues.apache.org/jira/browse/SPARK-31413 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 2.4.3 >Reporter: varun senthilnathan >Priority: Critical > > We are using spark 2.4.3 in java. We would like to log the partition key and > the sequence number of every event. 
The overloaded create stream function of > the kinesis utils always throws a compilation error. > > Function printSeq = s -> s; > KinesisUtils.createStream( > jssc, > appName, > streamName, > endPointUrl, > regionName, > InitialPositionInStream.TRIM_HORIZON, > kinesisCheckpointInterval, > StorageLevel.MEMORY_AND_DISK_SER(), > printSeq, > Record.class); > The exception is as follows: > {quote}no suitable method found for > createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class) > {quote} > JAVA DOCS : > [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-] > Is there a way out? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter
[ https://issues.apache.org/jira/browse/SPARK-31413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] varun senthilnathan updated SPARK-31413: Description: We are using spark 2.4.3 in java. We would like to log the partition key and the sequence number of every event. The overloaded create stream function of the kinesis utils always throws a compilation error. Function printSeq = s -> s; KinesisUtils.createStream( jssc, appName, streamName, endPointUrl, regionName, InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class); The exception is as follows: {quote}no suitable method found for createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class) {quote} JAVA DOCS : [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-] Is there a way out? was: We are using spark 2.4.3 in java. We would like to log the partition key and the sequence number of every event. The overloaded create stream function of the kinesis utils always throws a compilation error. 
{{Function printSeq = s -> s; KinesisUtils.createStream( jssc, appName, streamName, endPointUrl, regionName, InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class);}} The exception is as follows: {quote}no suitable method found for createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class) {quote} JAVA DOCS : [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-] Is there a way out? > Accessing the sequence number and partition id for records in Kinesis adapter > - > > Key: SPARK-31413 > URL: https://issues.apache.org/jira/browse/SPARK-31413 > Project: Spark > Issue Type: Question > Components: DStreams >Affects Versions: 2.4.3 > Environment: We are using spark 2.4.3 in java. We would like to log > the partition key and the sequence number of every event. The overloaded > create stream function of the kinesis utils always throws a compilation error. 
> > {{Function printSeq = s -> s;}} > {{KinesisUtils.createStream(}}jssc, appName, streamName, endPointUrl, > regionName, InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval, > StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class); > The exception is as follows: > {quote}no suitable method found for > createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class) > {quote} > JAVA DOCS: > [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-] > Is there a way out? >Reporter: varun senthilnathan >Priority: Critical > > We are using spark 2.4.3 in java. We would like to log the partition key and > the sequence number of every event. The overloaded create stream function of >
[jira] [Updated] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter
[ https://issues.apache.org/jira/browse/SPARK-31413?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] varun senthilnathan updated SPARK-31413:

Description: We are using Spark 2.4.3 in Java. We would like to log the partition key and the sequence number of every event. The overloaded createStream function of KinesisUtils always throws a compilation error.

{{Function printSeq = s -> s;
KinesisUtils.createStream(
    jssc, appName, streamName, endPointUrl, regionName,
    InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval,
    StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class);}}

The exception is as follows:
{quote}no suitable method found for createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class){quote}

Java docs: [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]

Is there a way out?

was: We

> Accessing the sequence number and partition id for records in Kinesis adapter
> -----------------------------------------------------------------------------
>
> Key: SPARK-31413
> URL: https://issues.apache.org/jira/browse/SPARK-31413
> Project: Spark
> Issue Type: Question
> Components: DStreams
> Affects Versions: 2.4.3
> Reporter: varun senthilnathan
> Priority: Critical

-- This message was sent by Atlassian Jira (v8.3.4#803005) -
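The compile error above comes from the argument types, not the stream setup: the Java `createStream` overload expects Spark's own functional interface (`org.apache.spark.api.java.function.Function`, as the linked Java docs show), while `printSeq` is declared as `java.util.function.Function`. A minimal, Spark-free sketch of the mismatch, where `SparkFunction` is a hypothetical stand-in for Spark's interface:

```java
import java.util.function.Function;

public class FunctionMismatch {
    // Hypothetical stand-in for org.apache.spark.api.java.function.Function:
    // same single-abstract-method shape, but a distinct, unrelated type.
    interface SparkFunction<T, R> {
        R call(T t);
    }

    // Stand-in for the createStream overload that expects Spark's Function.
    static <T, R> R createStream(SparkFunction<T, R> handler, T record) {
        return handler.call(record);
    }

    public static void main(String[] args) {
        Function<String, String> printSeq = s -> s;
        // createStream(printSeq, "rec");  // would NOT compile: a
        // java.util.function.Function is not a SparkFunction, hence
        // javac's "no suitable method found" error

        // Passing the lambda inline (or typing the variable as the expected
        // interface) lets the compiler target the right functional interface:
        String out = createStream(s -> s, "seq-42");
        System.out.println(out);  // prints "seq-42"
    }
}
```

With the real API the same fix should apply: declare `printSeq` as `org.apache.spark.api.java.function.Function` (or pass the lambda inline) so the compiler targets the overload Spark actually provides.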
[jira] [Updated] (SPARK-28739) Add a simple cost check for Adaptive Query Execution
[ https://issues.apache.org/jira/browse/SPARK-28739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28739: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > Add a simple cost check for Adaptive Query Execution > > > Key: SPARK-28739 > URL: https://issues.apache.org/jira/browse/SPARK-28739 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Wei Xue >Priority: Major > Fix For: 3.0.0 > > > Add a mechanism to compare the costs of the before and after plans of > re-optimization in Adaptive Query Execution. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
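The "simple cost check" above can be pictured as comparing the before and after plans by a cheap proxy metric. The sketch below is illustrative only (the `Plan` type is hypothetical, not Spark's internals): it uses the number of shuffle exchanges as the cost and keeps the re-optimized plan only when it is not more expensive.

```java
public class CostCheckSketch {
    // Hypothetical plan node: a tree where some nodes are shuffle exchanges.
    static class Plan {
        final boolean isShuffle;
        final Plan[] children;
        Plan(boolean isShuffle, Plan... children) {
            this.isShuffle = isShuffle;
            this.children = children;
        }
    }

    // A deliberately simple cost: count the shuffle exchanges in the plan.
    static int cost(Plan p) {
        int c = p.isShuffle ? 1 : 0;
        for (Plan child : p.children) c += cost(child);
        return c;
    }

    // Keep the re-optimized plan only if it is not more expensive.
    static Plan choose(Plan before, Plan after) {
        return cost(after) <= cost(before) ? after : before;
    }

    public static void main(String[] args) {
        Plan before = new Plan(false, new Plan(true), new Plan(true)); // 2 shuffles
        Plan after  = new Plan(false, new Plan(true));                 // 1 shuffle
        System.out.println(choose(before, after) == after);            // prints "true"
    }
}
```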
[jira] [Created] (SPARK-31413) Accessing the sequence number and partition id for records in Kinesis adapter
varun senthilnathan created SPARK-31413: --- Summary: Accessing the sequence number and partition id for records in Kinesis adapter Key: SPARK-31413 URL: https://issues.apache.org/jira/browse/SPARK-31413 Project: Spark Issue Type: Question Components: DStreams Affects Versions: 2.4.3

Environment: We are using Spark 2.4.3 in Java. We would like to log the partition key and the sequence number of every event. The overloaded createStream function of KinesisUtils always throws a compilation error.

{{Function printSeq = s -> s;
KinesisUtils.createStream(
    jssc, appName, streamName, endPointUrl, regionName,
    InitialPositionInStream.TRIM_HORIZON, kinesisCheckpointInterval,
    StorageLevel.MEMORY_AND_DISK_SER(), printSeq, Record.class);}}

The exception is as follows:
{quote}no suitable method found for createStream(org.apache.spark.streaming.api.java.JavaStreamingContext,java.lang.String,java.lang.String,java.lang.String,java.lang.String,com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream,org.apache.spark.streaming.Duration,org.apache.spark.storage.StorageLevel,java.util.function.Function,java.lang.Class){quote}

Java docs: [https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kinesis/KinesisUtils.html#createStream-org.apache.spark.streaming.api.java.JavaStreamingContext-java.lang.String-java.lang.String-java.lang.String-java.lang.String-com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream-org.apache.spark.streaming.Duration-org.apache.spark.storage.StorageLevel-org.apache.spark.api.java.function.Function-java.lang.Class-]

Is there a way out?

Reporter: varun senthilnathan

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28753) Dynamically reuse subqueries in AQE
[ https://issues.apache.org/jira/browse/SPARK-28753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28753: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > Dynamically reuse subqueries in AQE > --- > > Key: SPARK-28753 > URL: https://issues.apache.org/jira/browse/SPARK-28753 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wei Xue >Assignee: Wei Xue >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31037) refine AQE config names
[ https://issues.apache.org/jira/browse/SPARK-31037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31037: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > refine AQE config names > --- > > Key: SPARK-31037 > URL: https://issues.apache.org/jira/browse/SPARK-31037 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31070) make skew join split skewed partitions more evenly
[ https://issues.apache.org/jira/browse/SPARK-31070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31070: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > make skew join split skewed partitions more evenly > -- > > Key: SPARK-31070 > URL: https://issues.apache.org/jira/browse/SPARK-31070 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
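Splitting a skewed partition "more evenly" amounts to packing its map-output blocks into groups of roughly a target byte size, rather than into a fixed number of chunks. An illustrative sketch of that idea (not Spark's actual implementation):

```java
import java.util.ArrayList;
import java.util.List;

public class SkewSplitSketch {
    /**
     * Greedily packs consecutive map-output block sizes into groups whose
     * total stays near targetSize, so sub-partitions come out roughly even.
     * Returns the start index of each group.
     */
    static List<Integer> splitBySize(long[] blockSizes, long targetSize) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0);
        long current = 0;
        for (int i = 0; i < blockSizes.length; i++) {
            // Start a new group when adding this block would overshoot the
            // target and the current group is non-empty.
            if (current > 0 && current + blockSizes[i] > targetSize) {
                starts.add(i);
                current = 0;
            }
            current += blockSizes[i];
        }
        return starts;
    }

    public static void main(String[] args) {
        long[] sizes = {30, 30, 30, 30, 30, 30};  // 180 bytes total
        // target 64 -> three even groups of two blocks, starting at 0, 2, 4
        System.out.println(splitBySize(sizes, 64));  // prints "[0, 2, 4]"
    }
}
```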
[jira] [Updated] (SPARK-31134) optimize skew join after shuffle partitions are coalesced
[ https://issues.apache.org/jira/browse/SPARK-31134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31134: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > optimize skew join after shuffle partitions are coalesced > - > > Key: SPARK-31134 > URL: https://issues.apache.org/jira/browse/SPARK-31134 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31253) add metrics to shuffle reader
[ https://issues.apache.org/jira/browse/SPARK-31253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31253: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > add metrics to shuffle reader > - > > Key: SPARK-31253 > URL: https://issues.apache.org/jira/browse/SPARK-31253 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31201) add an individual config for skewed partition threshold
[ https://issues.apache.org/jira/browse/SPARK-31201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31201: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > add an individual config for skewed partition threshold > --- > > Key: SPARK-31201 > URL: https://issues.apache.org/jira/browse/SPARK-31201 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31253) add metrics to AQE shuffle reader
[ https://issues.apache.org/jira/browse/SPARK-31253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31253: Summary: add metrics to AQE shuffle reader (was: add metrics to shuffle reader) > add metrics to AQE shuffle reader > - > > Key: SPARK-31253 > URL: https://issues.apache.org/jira/browse/SPARK-31253 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29544) Optimize skewed join at runtime with new Adaptive Execution
[ https://issues.apache.org/jira/browse/SPARK-29544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29544: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > Optimize skewed join at runtime with new Adaptive Execution > --- > > Key: SPARK-29544 > URL: https://issues.apache.org/jira/browse/SPARK-29544 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > Attachments: Skewed Join Optimization Design Doc.docx > > > Implement a rule in the new adaptive execution framework introduced in > [SPARK-23128|https://issues.apache.org/jira/browse/SPARK-23128]. This rule is > used to handle the skew join optimization based on the runtime statistics > (data size and row count). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30524) Disable OptimizeSkewJoin rule if introducing additional shuffle.
[ https://issues.apache.org/jira/browse/SPARK-30524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30524: Parent Issue: SPARK-31412 (was: SPARK-29544) > Disable OptimizeSkewJoin rule if introducing additional shuffle. > > > Key: SPARK-30524 > URL: https://issues.apache.org/jira/browse/SPARK-30524 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Ke Jia >Assignee: Ke Jia >Priority: Major > Fix For: 3.0.0 > > > The OptimizeSkewedJoin rule breaks the outputPartitioning of the original SMJ and may introduce an additional shuffle after it is applied. This PR disables the OptimizeSkewedJoin rule if it would introduce an additional shuffle. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29954) collect the runtime statistics of row count in map stage
[ https://issues.apache.org/jira/browse/SPARK-29954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29954: Parent Issue: SPARK-31412 (was: SPARK-29544) > collect the runtime statistics of row count in map stage > > > Key: SPARK-29954 > URL: https://issues.apache.org/jira/browse/SPARK-29954 > Project: Spark > Issue Type: Sub-task > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Ke Jia >Priority: Major > > We need the row count info to estimate the data skew situation more accurately when there is a lot of duplicated data. This PR collects the row count info in the map stage. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28177) Adjust post shuffle partition number in adaptive execution
[ https://issues.apache.org/jira/browse/SPARK-28177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28177: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > Adjust post shuffle partition number in adaptive execution > -- > > Key: SPARK-28177 > URL: https://issues.apache.org/jira/browse/SPARK-28177 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.3 >Reporter: Carson Wang >Assignee: Carson Wang >Priority: Major > Fix For: 3.0.0 > > > Implement a rule in the new adaptive execution framework introduced in > SPARK-23128. This rule is used to adjust the post shuffle partitions based on > the map output statistics. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
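"Adjusting the post-shuffle partition number" conceptually means merging consecutive small shuffle partitions, using the sizes reported by the map output statistics, until each merged partition reaches a target size. A simplified sketch of that idea (not the actual Spark rule):

```java
import java.util.ArrayList;
import java.util.List;

public class CoalesceSketch {
    /**
     * Merges consecutive post-shuffle partitions until each coalesced
     * partition reaches roughly targetSize bytes. partitionSizes would come
     * from map output statistics in the real system.
     */
    static List<Long> coalesce(long[] partitionSizes, long targetSize) {
        List<Long> merged = new ArrayList<>();
        long current = 0;
        for (long size : partitionSizes) {
            current += size;
            if (current >= targetSize) {   // emit once the target is reached
                merged.add(current);
                current = 0;
            }
        }
        if (current > 0) merged.add(current);  // remainder partition
        return merged;
    }

    public static void main(String[] args) {
        // Five small shuffle partitions, 64-byte target -> two coalesced partitions
        long[] sizes = {20, 20, 30, 10, 60};
        System.out.println(coalesce(sizes, 64));  // prints "[70, 70]"
    }
}
```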
[jira] [Updated] (SPARK-31412) New Adaptive Query Execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-31412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31412: Summary: New Adaptive Query Execution in Spark SQL (was: Adaptive Query Execution in Spark SQL) > New Adaptive Query Execution in Spark SQL > - > > Key: SPARK-31412 > URL: https://issues.apache.org/jira/browse/SPARK-31412 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. 
The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31412) Adaptive Query Execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-31412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31412:

Description: SPARK-9850 proposed the basic idea of adaptive execution in Spark. In DAGScheduler, a new API is added to support submitting a single map stage. The current implementation of adaptive execution in Spark SQL supports changing the reducer number at runtime. An Exchange coordinator is used to determine the number of post-shuffle partitions for a stage that needs to fetch shuffle data from one or multiple stages. The current implementation adds ExchangeCoordinator while we are adding Exchanges. However there are some limitations. First, it may cause additional shuffles that may decrease the performance. We can see this from EnsureRequirements rule when it adds ExchangeCoordinator. Secondly, it is not a good idea to add ExchangeCoordinators while we are adding Exchanges because we don’t have a global picture of all shuffle dependencies of a post-shuffle stage. I.e. for 3 tables’ join in a single stage, the same ExchangeCoordinator should be used in three Exchanges but currently two separated ExchangeCoordinator will be added. Thirdly, with the current framework it is not easy to implement other features in adaptive execution flexibly like changing the execution plan and handling skewed join at runtime. We'd like to introduce a new way to do adaptive execution in Spark SQL and address the limitations. The idea is described at [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing]

> Adaptive Query Execution in Spark SQL
> -------------------------------------
>
> Key: SPARK-31412
> URL: https://issues.apache.org/jira/browse/SPARK-31412
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Wenchen Fan
> Priority: Major

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
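In the released Spark 3.0 form of this work, the new AQE framework is driven by runtime configs. A sketch of the relevant settings follows; the names are believed correct for Spark 3.0, but verify them against the SQL configuration docs for your version.

```properties
# spark-defaults.conf sketch: enable the new adaptive query execution (Spark 3.0+)
spark.sql.adaptive.enabled                                   true
# Coalesce small post-shuffle partitions at runtime
spark.sql.adaptive.coalescePartitions.enabled                true
# Split and replicate skewed partitions in sort-merge joins
spark.sql.adaptive.skewJoin.enabled                          true
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes  256MB
```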
[jira] [Updated] (SPARK-23128) The basic framework for the new Adaptive Query Execution
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23128: Summary: The basic framework for the new Adaptive Query Execution (was: A new approach to do adaptive execution in Spark SQL) > The basic framework for the new Adaptive Query Execution > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Assignee: Carson Wang >Priority: Major > Fix For: 3.0.0 > > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. 
The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23128) A new approach to do adaptive execution in Spark SQL
[ https://issues.apache.org/jira/browse/SPARK-23128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23128: Parent: SPARK-31412 Issue Type: Sub-task (was: Improvement) > A new approach to do adaptive execution in Spark SQL > > > Key: SPARK-23128 > URL: https://issues.apache.org/jira/browse/SPARK-23128 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Carson Wang >Assignee: Carson Wang >Priority: Major > Fix For: 3.0.0 > > Attachments: AdaptiveExecutioninBaidu.pdf > > > SPARK-9850 proposed the basic idea of adaptive execution in Spark. In > DAGScheduler, a new API is added to support submitting a single map stage. > The current implementation of adaptive execution in Spark SQL supports > changing the reducer number at runtime. An Exchange coordinator is used to > determine the number of post-shuffle partitions for a stage that needs to > fetch shuffle data from one or multiple stages. The current implementation > adds ExchangeCoordinator while we are adding Exchanges. However there are > some limitations. First, it may cause additional shuffles that may decrease > the performance. We can see this from EnsureRequirements rule when it adds > ExchangeCoordinator. Secondly, it is not a good idea to add > ExchangeCoordinators while we are adding Exchanges because we don’t have a > global picture of all shuffle dependencies of a post-shuffle stage. I.e. for > 3 tables’ join in a single stage, the same ExchangeCoordinator should be used > in three Exchanges but currently two separated ExchangeCoordinator will be > added. Thirdly, with the current framework it is not easy to implement other > features in adaptive execution flexibly like changing the execution plan and > handling skewed join at runtime. > We'd like to introduce a new way to do adaptive execution in Spark SQL and > address the limitations. 
The idea is described at > [https://docs.google.com/document/d/1mpVjvQZRAkD-Ggy6-hcjXtBPiQoVbZGe3dLnAKgtJ4k/edit?usp=sharing] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31412) Adaptive Query Execution in Spark SQL
Wenchen Fan created SPARK-31412: --- Summary: Adaptive Query Execution in Spark SQL Key: SPARK-31412 URL: https://issues.apache.org/jira/browse/SPARK-31412 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org