[jira] [Resolved] (SPARK-46486) DataSourceV2: Restore createOrReplaceView
[ https://issues.apache.org/jira/browse/SPARK-46486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge resolved SPARK-46486. Resolution: Not A Problem > DataSourceV2: Restore createOrReplaceView > - > > Key: SPARK-46486 > URL: https://issues.apache.org/jira/browse/SPARK-46486 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.5.1 >Reporter: John Zhuge >Priority: Trivial > > [https://github.com/apache/spark/pull/44330] for SPARK-45807 accidentally > removed createOrReplaceView. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-46486) DataSourceV2: Restore createOrReplaceView
John Zhuge created SPARK-46486: -- Summary: DataSourceV2: Restore createOrReplaceView Key: SPARK-46486 URL: https://issues.apache.org/jira/browse/SPARK-46486 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.5.1 Reporter: John Zhuge [https://github.com/apache/spark/pull/44330] for SPARK-45807 accidentally removed createOrReplaceView. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39911) Optimize global Sort to RepartitionByExpression
[ https://issues.apache.org/jira/browse/SPARK-39911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-39911: --- Fix Version/s: 3.3.1 > Optimize global Sort to RepartitionByExpression > --- > > Key: SPARK-39911 > URL: https://issues.apache.org/jira/browse/SPARK-39911 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: XiDuo You >Priority: Major > Labels: pull-request-available > Fix For: 3.3.1, 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge resolved SPARK-45657. Fix Version/s: 3.5.0 Resolution: Fixed The issue is fixed in 3.5.0 > Caching SQL UNION of different column data types does not work inside > Dataset.union > --- > > Key: SPARK-45657 > URL: https://issues.apache.org/jira/browse/SPARK-45657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.4.1 >Reporter: John Zhuge >Priority: Major > Fix For: 3.5.0 > > > > Cache SQL UNION of 2 sides with different column data types > {code:java} > scala> spark.sql("select 1 id union select 's2' id").cache() {code} > Dataset.union does not leverage the cache > {code:java} > scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select > 's3'")).queryExecution.optimizedPlan > res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Union false, false > :- Aggregate [id#109], [id#109] > : +- Union false, false > : :- Project [1 AS id#109] > : : +- OneRowRelation > : +- Project [s2 AS id#108] > : +- OneRowRelation > +- Project [s3 AS s3#111] > +- OneRowRelation {code} > SQL UNION of the cached SQL UNION does use the cache! Please note > `InMemoryRelation` used. > {code:java} > scala> spark.sql("(select 1 id union select 's2' id) union select > 's3'").queryExecution.optimizedPlan > res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Aggregate [id#117], [id#117] > +- Union false, false > :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 > replicas) > : +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100]) > : +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, > [plan_id=241] > : +- *(3) HashAggregate(keys=[id#100], functions=[], > output=[id#100]) > : +- Union > : :- *(1) Project [1 AS id#100] > : : +- *(1) Scan OneRowRelation[] > : +- *(2) Project [s2 AS id#99] > : +- *(2) Scan OneRowRelation[] > +- Project [s3 AS s3#116] > +- OneRowRelation {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-45657: --- Affects Version/s: 3.4.1 3.4.0 > Caching SQL UNION of different column data types does not work inside > Dataset.union > --- > > Key: SPARK-45657 > URL: https://issues.apache.org/jira/browse/SPARK-45657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0, 3.4.1 >Reporter: John Zhuge >Priority: Major > > > Cache SQL UNION of 2 sides with different column data types > {code:java} > scala> spark.sql("select 1 id union select 's2' id").cache() {code} > Dataset.union does not leverage the cache > {code:java} > scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select > 's3'")).queryExecution.optimizedPlan > res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Union false, false > :- Aggregate [id#109], [id#109] > : +- Union false, false > : :- Project [1 AS id#109] > : : +- OneRowRelation > : +- Project [s2 AS id#108] > : +- OneRowRelation > +- Project [s3 AS s3#111] > +- OneRowRelation {code} > SQL UNION of the cached SQL UNION does use the cache! Please note > `InMemoryRelation` used. > {code:java} > scala> spark.sql("(select 1 id union select 's2' id) union select > 's3'").queryExecution.optimizedPlan > res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Aggregate [id#117], [id#117] > +- Union false, false > :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 > replicas) > : +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100]) > : +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, > [plan_id=241] > : +- *(3) HashAggregate(keys=[id#100], functions=[], > output=[id#100]) > : +- Union > : :- *(1) Project [1 AS id#100] > : : +- *(1) Scan OneRowRelation[] > : +- *(2) Project [s2 AS id#99] > : +- *(2) Scan OneRowRelation[] > +- Project [s3 AS s3#116] > +- OneRowRelation {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779281#comment-17779281 ] John Zhuge edited comment on SPARK-45657 at 10/25/23 4:55 AM: -- Root cause: # SQL UNION of 2 sides with different data types produce a Project of Project on 1 side to cast the type. When this is cached, the Project of Project is preserved. {noformat} Distinct +- Union false, false :- Project [cast(id#153 as string) AS id#155] : +- Project [1 AS id#153] : +- OneRowRelation +- Project [s2 AS id#154] +- OneRowRelation{noformat} # Dataset.union applies `CombineUnions` which applies to all unions in the tree. CombineUnions collapses the 2 Projects into 1, thus Dataset.union of the above plan with any plan will not be able to find a matching cached plan. {code:java} object CombineUnions extends Rule[LogicalPlan] { ... private def flattenUnion(union: Union, flattenDistinct: Boolean): ... case p1 @ Project(_, p2: Project) if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = false) && !p1.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) && !p2.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) => val newProjectList = buildCleanedProjectList(p1.projectList, p2.projectList) stack.pushAll(Seq(p2.copy(projectList = newProjectList))){code} was (Author: jzhuge): Root cause: # SQL UNION of 2 sides with different data types produce a Project of Project on 1 side to cast the type. When this is cached, the Project of Project is preserved. {noformat} Distinct +- Union false, false :- Project [cast(id#153 as string) AS id#155] : +- Project [1 AS id#153] : +- OneRowRelation +- Project [s2 AS id#154] +- OneRowRelation{noformat} # Dataset.union applies `CombineUnions` which applies to all unions in the tree. CombineUnions collapses the 2 Projects into 1, thus Dataset.union of the above plan with any plan will not be find a matching cached plan. {code:java} object CombineUnions extends Rule[LogicalPlan] { ... private def flattenUnion(union: Union, flattenDistinct: Boolean): ... case p1 @ Project(_, p2: Project) if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = false) && !p1.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) && !p2.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) => val newProjectList = buildCleanedProjectList(p1.projectList, p2.projectList) stack.pushAll(Seq(p2.copy(projectList = newProjectList))){code} > Caching SQL UNION of different column data types does not work inside > Dataset.union > --- > > Key: SPARK-45657 > URL: https://issues.apache.org/jira/browse/SPARK-45657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2 >Reporter: John Zhuge >Priority: Major > > > Cache SQL UNION of 2 sides with different column data types > {code:java} > scala> spark.sql("select 1 id union select 's2' id").cache() {code} > Dataset.union does not leverage the cache > {code:java} > scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select > 's3'")).queryExecution.optimizedPlan > res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Union false, false > :- Aggregate [id#109], [id#109] > : +- Union false, false > : :- Project [1 AS id#109] > : : +- OneRowRelation > : +- Project [s2 AS id#108] > : +- OneRowRelation > +- Project [s3 AS s3#111] > +- OneRowRelation {code} > SQL UNION of the cached SQL UNION does use the cache! Please note > `InMemoryRelation` used. 
> {code:java} > scala> spark.sql("(select 1 id union select 's2' id) union select > 's3'").queryExecution.optimizedPlan > res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Aggregate [id#117], [id#117] > +- Union false, false > :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 > replicas) > : +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100]) > : +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, > [plan_id=241] > : +- *(3) HashAggregate(keys=[id#100], functions=[], > output=[id#100]) > : +- Union > : :- *(1) Project [1 AS id#100] > : : +- *(1) Scan OneRowRelation[] > : +- *(2) Project [s2 AS id#99] > : +- *(2) Scan OneRowRelation[] > +- Project [s3 AS s3#116] > +- OneRowRelation {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mai
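A workaround that follows from this root cause (a sketch based on the analysis above, not something proposed in the ticket): if both sides of the UNION are written with the same column type, the analyzer never inserts the extra cast Project, so there is nothing for CombineUnions to collapse and the plan built by Dataset.union can still match the cached one.

{code:java}
// Sketch under the assumption above: cast explicitly so no implicit-coercion
// Project-over-Project is created on either side of the UNION.
val cached = spark.sql("select cast(1 as string) id union select 's2' id")
cached.cache()

// With no double Project to collapse, the cached subtree should survive the
// eager CombineUnions call inside Dataset.union and be replaced by an
// InMemoryRelation during planning.
cached.union(spark.sql("select 's3'")).queryExecution.optimizedPlan
{code}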
[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779326#comment-17779326 ]

John Zhuge commented on SPARK-45657:

It is fixed in the main branch:
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
      /_/

Using Scala version 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.7)
Type in expressions to have them evaluated.
Type :help for more information.
23/10/24 21:30:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.86.29:4040
Spark context available as 'sc' (master = local[*], app id = local-1698208231783).
Spark session available as 'spark'.

scala> spark.sql("select 1 id union select 's2' id").cache()
val res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string]

scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")).queryExecution.optimizedPlan
val res1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Union false, false
:- InMemoryRelation [id#11], StorageLevel(disk, memory, deserialized, 1 replicas)
:     +- AdaptiveSparkPlan isFinalPlan=false
:        +- HashAggregate(keys=[id#2], functions=[], output=[id#2])
:           +- Exchange hashpartitioning(id#2, 200), ENSURE_REQUIREMENTS, [plan_id=30]
:              +- HashAggregate(keys=[id#2], functions=[], output=[id#2])
:                 +- Union
:                    :- Project [1 AS id#2]
:                    :  +- Scan OneRowRelation[]
:                    +- Project [s2 AS id#1]
:                       +- Scan OneRowRelation[]
+- Project [s3 AS s3#13]
   +- OneRowRelation
{code}

> Caching SQL UNION of different column data types does not work inside
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.2
> Reporter: John Zhuge
> Priority: Major
>
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache() {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
> {code:java} > scala> spark.sql("(select 1 id union select 's2' id) union select > 's3'").queryExecution.optimizedPlan > res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Aggregate [id#117], [id#117] > +- Union false, false > :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 > replicas) > : +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100]) > : +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, > [plan_id=241] > : +- *(3) HashAggregate(keys=[id#100], functions=[], > output=[id#100]) > : +- Union > : :- *(1) Project [1 AS id#100] > : : +- *(1) Scan OneRowRelation[] > : +- *(2) Project [s2 AS id#99] > : +- *(2) Scan OneRowRelation[] > +- Project [s3 AS s3#116] > +- OneRowRelation {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779283#comment-17779283 ]

John Zhuge commented on SPARK-45657:

Interesting, there is a warning in Dataset.union:
{code:java}
  def union(other: Dataset[T]): Dataset[T] = withSetOperator {
    // This breaks caching, but it's usually ok because it addresses a very specific use case:
    // using union to union many files or partitions.
    CombineUnions(Union(logicalPlan, other.logicalPlan))
  }
{code}

> Caching SQL UNION of different column data types does not work inside
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.3.2
> Reporter: John Zhuge
> Priority: Major
>
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache() {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
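The report also points at a cache-friendly alternative: expressing the outer union in SQL avoids the eager CombineUnions call quoted above, so the cached subtree still matches. A minimal sketch taken directly from the behavior shown in the issue description:

{code:java}
// Cache the inner UNION once.
spark.sql("select 1 id union select 's2' id").cache()

// The SQL-side union goes through normal analysis and optimization, and its
// optimized plan contains an InMemoryRelation, per the description above.
spark.sql("(select 1 id union select 's2' id) union select 's3'")
  .queryExecution.optimizedPlan
{code}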
[jira] [Updated] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-45657: --- Affects Version/s: 3.3.2 (was: 3.4.1) > Caching SQL UNION of different column data types does not work inside > Dataset.union > --- > > Key: SPARK-45657 > URL: https://issues.apache.org/jira/browse/SPARK-45657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2 >Reporter: John Zhuge >Priority: Major > > > Cache SQL UNION of 2 sides with different column data types > {code:java} > scala> spark.sql("select 1 id union select 's2' id").cache() {code} > Dataset.union does not leverage the cache > {code:java} > scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select > 's3'")).queryExecution.optimizedPlan > res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Union false, false > :- Aggregate [id#109], [id#109] > : +- Union false, false > : :- Project [1 AS id#109] > : : +- OneRowRelation > : +- Project [s2 AS id#108] > : +- OneRowRelation > +- Project [s3 AS s3#111] > +- OneRowRelation {code} > SQL UNION of the cached SQL UNION does use the cache! Please note > `InMemoryRelation` used. > {code:java} > scala> spark.sql("(select 1 id union select 's2' id) union select > 's3'").queryExecution.optimizedPlan > res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Aggregate [id#117], [id#117] > +- Union false, false > :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 > replicas) > : +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100]) > : +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, > [plan_id=241] > : +- *(3) HashAggregate(keys=[id#100], functions=[], > output=[id#100]) > : +- Union > : :- *(1) Project [1 AS id#100] > : : +- *(1) Scan OneRowRelation[] > : +- *(2) Project [s2 AS id#99] > : +- *(2) Scan OneRowRelation[] > +- Project [s3 AS s3#116] > +- OneRowRelation {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779282#comment-17779282 ] John Zhuge commented on SPARK-45657: Checking whether this is still an issue in main branch. > Caching SQL UNION of different column data types does not work inside > Dataset.union > --- > > Key: SPARK-45657 > URL: https://issues.apache.org/jira/browse/SPARK-45657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2 >Reporter: John Zhuge >Priority: Major > > > Cache SQL UNION of 2 sides with different column data types > {code:java} > scala> spark.sql("select 1 id union select 's2' id").cache() {code} > Dataset.union does not leverage the cache > {code:java} > scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select > 's3'")).queryExecution.optimizedPlan > res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Union false, false > :- Aggregate [id#109], [id#109] > : +- Union false, false > : :- Project [1 AS id#109] > : : +- OneRowRelation > : +- Project [s2 AS id#108] > : +- OneRowRelation > +- Project [s3 AS s3#111] > +- OneRowRelation {code} > SQL UNION of the cached SQL UNION does use the cache! Please note > `InMemoryRelation` used. > {code:java} > scala> spark.sql("(select 1 id union select 's2' id) union select > 's3'").queryExecution.optimizedPlan > res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Aggregate [id#117], [id#117] > +- Union false, false > :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 > replicas) > : +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100]) > : +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, > [plan_id=241] > : +- *(3) HashAggregate(keys=[id#100], functions=[], > output=[id#100]) > : +- Union > : :- *(1) Project [1 AS id#100] > : : +- *(1) Scan OneRowRelation[] > : +- *(2) Project [s2 AS id#99] > : +- *(2) Scan OneRowRelation[] > +- Project [s3 AS s3#116] > +- OneRowRelation {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779281#comment-17779281 ] John Zhuge edited comment on SPARK-45657 at 10/25/23 12:38 AM: --- Root cause: # SQL UNION of 2 sides with different data types produce a Project of Project on 1 side to cast the type. When this is cached, the Project of Project is preserved. {noformat} Distinct +- Union false, false :- Project [cast(id#153 as string) AS id#155] : +- Project [1 AS id#153] : +- OneRowRelation +- Project [s2 AS id#154] +- OneRowRelation{noformat} # Dataset.union applies `CombineUnions` which applies to all unions in the tree. CombineUnions collapses the 2 Projects into 1, thus Dataset.union of the above plan with any plan will not be find a matching cached plan. {code:java} object CombineUnions extends Rule[LogicalPlan] { ... private def flattenUnion(union: Union, flattenDistinct: Boolean): ... case p1 @ Project(_, p2: Project) if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = false) && !p1.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) && !p2.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) => val newProjectList = buildCleanedProjectList(p1.projectList, p2.projectList) stack.pushAll(Seq(p2.copy(projectList = newProjectList))){code} was (Author: jzhuge): Root cause: # SQL UNION of 2 sides with different data types produce a Project of Project on 1 side to cast the type. When this is cached, the Project of Project is preserved. {noformat} Distinct +- Union false, false :- Project [cast(id#153 as string) AS id#155] : +- Project [1 AS id#153] : +- OneRowRelation +- Project [s2 AS id#154] +- OneRowRelation{noformat} # Dataset.union applies `CombineUnions` which applies to all unions in the tree. CombineUnions collapses the 2 Projects into 1, thus Dataset.union of the above plan with any plan will not be find a matching cached plan. > Caching SQL UNION of different column data types does not work inside > Dataset.union > --- > > Key: SPARK-45657 > URL: https://issues.apache.org/jira/browse/SPARK-45657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: John Zhuge >Priority: Major > > > Cache SQL UNION of 2 sides with different column data types > {code:java} > scala> spark.sql("select 1 id union select 's2' id").cache() {code} > Dataset.union does not leverage the cache > {code:java} > scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select > 's3'")).queryExecution.optimizedPlan > res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Union false, false > :- Aggregate [id#109], [id#109] > : +- Union false, false > : :- Project [1 AS id#109] > : : +- OneRowRelation > : +- Project [s2 AS id#108] > : +- OneRowRelation > +- Project [s3 AS s3#111] > +- OneRowRelation {code} > SQL UNION of the cached SQL UNION does use the cache! Please note > `InMemoryRelation` used. 
> {code:java} > scala> spark.sql("(select 1 id union select 's2' id) union select > 's3'").queryExecution.optimizedPlan > res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Aggregate [id#117], [id#117] > +- Union false, false > :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 > replicas) > : +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100]) > : +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, > [plan_id=241] > : +- *(3) HashAggregate(keys=[id#100], functions=[], > output=[id#100]) > : +- Union > : :- *(1) Project [1 AS id#100] > : : +- *(1) Scan OneRowRelation[] > : +- *(2) Project [s2 AS id#99] > : +- *(2) Scan OneRowRelation[] > +- Project [s3 AS s3#116] > +- OneRowRelation {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779281#comment-17779281 ] John Zhuge edited comment on SPARK-45657 at 10/25/23 12:36 AM: --- Root cause: # SQL UNION of 2 sides with different data types produce a Project of Project on 1 side to cast the type. When this is cached, the Project of Project is preserved. {noformat} Distinct +- Union false, false :- Project [cast(id#153 as string) AS id#155] : +- Project [1 AS id#153] : +- OneRowRelation +- Project [s2 AS id#154] +- OneRowRelation{noformat} # Dataset.union applies `CombineUnions` which applies to all unions in the tree. CombineUnions collapses the 2 Projects into 1, thus Dataset.union of the above plan with any plan will not be find a matching cached plan. was (Author: jzhuge): Root cause: # SQL UNION of 2 sides with different data types produce a Project of Project on 1 side to cast the type. When this is cached, the Project of Project is preserved. {noformat} Distinct +- Union false, false :- Project [cast(id#153 as string) AS id#155] : +- Project [1 AS id#153] : +- OneRowRelation +- Project [s2 AS id#154] +- OneRowRelation{noformat} # Dataset.union applies `CombineUnions` which applies to all unions in the tree. CombineUnions collapses the 2 Projects into 1. Thus Dataset.union of the above plan with any plan will not be find a matching cached plan. > Caching SQL UNION of different column data types does not work inside > Dataset.union > --- > > Key: SPARK-45657 > URL: https://issues.apache.org/jira/browse/SPARK-45657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: John Zhuge >Priority: Major > > > Cache SQL UNION of 2 sides with different column data types > {code:java} > scala> spark.sql("select 1 id union select 's2' id").cache() {code} > Dataset.union does not leverage the cache > {code:java} > scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select > 's3'")).queryExecution.optimizedPlan > res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Union false, false > :- Aggregate [id#109], [id#109] > : +- Union false, false > : :- Project [1 AS id#109] > : : +- OneRowRelation > : +- Project [s2 AS id#108] > : +- OneRowRelation > +- Project [s3 AS s3#111] > +- OneRowRelation {code} > SQL UNION of the cached SQL UNION does use the cache! Please note > `InMemoryRelation` used. > {code:java} > scala> spark.sql("(select 1 id union select 's2' id) union select > 's3'").queryExecution.optimizedPlan > res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Aggregate [id#117], [id#117] > +- Union false, false > :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 > replicas) > : +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100]) > : +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, > [plan_id=241] > : +- *(3) HashAggregate(keys=[id#100], functions=[], > output=[id#100]) > : +- Union > : :- *(1) Project [1 AS id#100] > : : +- *(1) Scan OneRowRelation[] > : +- *(2) Project [s2 AS id#99] > : +- *(2) Scan OneRowRelation[] > +- Project [s3 AS s3#116] > +- OneRowRelation {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
[ https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779281#comment-17779281 ] John Zhuge commented on SPARK-45657: Root cause: # SQL UNION of 2 sides with different data types produce a Project of Project on 1 side to cast the type. When this is cached, the Project of Project is preserved. {noformat} Distinct +- Union false, false :- Project [cast(id#153 as string) AS id#155] : +- Project [1 AS id#153] : +- OneRowRelation +- Project [s2 AS id#154] +- OneRowRelation{noformat} # Dataset.union applies `CombineUnions` which applies to all unions in the tree. CombineUnions collapses the 2 Projects into 1. Thus Dataset.union of the above plan with any plan will not be find a matching cached plan. > Caching SQL UNION of different column data types does not work inside > Dataset.union > --- > > Key: SPARK-45657 > URL: https://issues.apache.org/jira/browse/SPARK-45657 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1 >Reporter: John Zhuge >Priority: Major > > > Cache SQL UNION of 2 sides with different column data types > {code:java} > scala> spark.sql("select 1 id union select 's2' id").cache() {code} > Dataset.union does not leverage the cache > {code:java} > scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select > 's3'")).queryExecution.optimizedPlan > res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Union false, false > :- Aggregate [id#109], [id#109] > : +- Union false, false > : :- Project [1 AS id#109] > : : +- OneRowRelation > : +- Project [s2 AS id#108] > : +- OneRowRelation > +- Project [s3 AS s3#111] > +- OneRowRelation {code} > SQL UNION of the cached SQL UNION does use the cache! Please note > `InMemoryRelation` used. > {code:java} > scala> spark.sql("(select 1 id union select 's2' id) union select > 's3'").queryExecution.optimizedPlan > res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan = > Aggregate [id#117], [id#117] > +- Union false, false > :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 > replicas) > : +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100]) > : +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, > [plan_id=241] > : +- *(3) HashAggregate(keys=[id#100], functions=[], > output=[id#100]) > : +- Union > : :- *(1) Project [1 AS id#100] > : : +- *(1) Scan OneRowRelation[] > : +- *(2) Project [s2 AS id#99] > : +- *(2) Scan OneRowRelation[] > +- Project [s3 AS s3#116] > +- OneRowRelation {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union
John Zhuge created SPARK-45657:
--

Summary: Caching SQL UNION of different column data types does not work inside Dataset.union
Key: SPARK-45657
URL: https://issues.apache.org/jira/browse/SPARK-45657
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.4.1
Reporter: John Zhuge

Cache SQL UNION of 2 sides with different column data types
{code:java}
scala> spark.sql("select 1 id union select 's2' id").cache() {code}
Dataset.union does not leverage the cache
{code:java}
scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")).queryExecution.optimizedPlan
res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Union false, false
:- Aggregate [id#109], [id#109]
:  +- Union false, false
:     :- Project [1 AS id#109]
:     :  +- OneRowRelation
:     +- Project [s2 AS id#108]
:        +- OneRowRelation
+- Project [s3 AS s3#111]
   +- OneRowRelation {code}
SQL UNION of the cached SQL UNION does use the cache! Please note `InMemoryRelation` used.
{code:java}
scala> spark.sql("(select 1 id union select 's2' id) union select 's3'").queryExecution.optimizedPlan
res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Aggregate [id#117], [id#117]
+- Union false, false
   :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 replicas)
   :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
   :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, [plan_id=241]
   :           +- *(3) HashAggregate(keys=[id#100], functions=[], output=[id#100])
   :              +- Union
   :                 :- *(1) Project [1 AS id#100]
   :                 :  +- *(1) Scan OneRowRelation[]
   :                 +- *(2) Project [s2 AS id#99]
   :                    +- *(2) Scan OneRowRelation[]
   +- Project [s3 AS s3#116]
      +- OneRowRelation {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE
[ https://issues.apache.org/jira/browse/SPARK-43288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719925#comment-17719925 ] John Zhuge commented on SPARK-43288: TODO: fix a few unit tests > DataSourceV2: CREATE TABLE LIKE > --- > > Key: SPARK-43288 > URL: https://issues.apache.org/jira/browse/SPARK-43288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: John Zhuge >Priority: Major > > Support CREATE TABLE LIKE in DSv2. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE
[ https://issues.apache.org/jira/browse/SPARK-43288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719924#comment-17719924 ] John Zhuge commented on SPARK-43288: Please review the WIP PR: https://github.com/apache/spark/pull/40963 > DataSourceV2: CREATE TABLE LIKE > --- > > Key: SPARK-43288 > URL: https://issues.apache.org/jira/browse/SPARK-43288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: John Zhuge >Priority: Major > > Support CREATE TABLE LIKE in DSv2. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE
[ https://issues.apache.org/jira/browse/SPARK-43288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-43288: --- Issue Type: Improvement (was: New Feature) > DataSourceV2: CREATE TABLE LIKE > --- > > Key: SPARK-43288 > URL: https://issues.apache.org/jira/browse/SPARK-43288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: John Zhuge >Priority: Minor > > Support CREATE TABLE LIKE in DSv2. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE
[ https://issues.apache.org/jira/browse/SPARK-43288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-43288: --- Priority: Major (was: Minor) > DataSourceV2: CREATE TABLE LIKE > --- > > Key: SPARK-43288 > URL: https://issues.apache.org/jira/browse/SPARK-43288 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: John Zhuge >Priority: Major > > Support CREATE TABLE LIKE in DSv2. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE
John Zhuge created SPARK-43288: -- Summary: DataSourceV2: CREATE TABLE LIKE Key: SPARK-43288 URL: https://issues.apache.org/jira/browse/SPARK-43288 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.4.0 Reporter: John Zhuge Support CREATE TABLE LIKE in DSv2. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
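For context, `CREATE TABLE ... LIKE` already exists in Spark SQL for v1 tables; the ticket asks for the same statement to work against DataSourceV2 catalogs. A minimal sketch of the statement shape (the catalog and table names below are hypothetical):

{code:java}
// Hypothetical identifiers; today this statement is supported for v1 tables,
// and the ticket proposes supporting it when `cat` is a v2 catalog.
spark.sql("CREATE TABLE cat.db.new_tbl LIKE cat.db.src_tbl")
{code}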
[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42613: --- Description: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores` as described in [PR #38699|https://github.com/apache/spark/pull/38699]. was: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still have issues when executer core is set to a very large number but task cpus is 1. > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores` as described in [PR > #38699|https://github.com/apache/spark/pull/38699]. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
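A minimal sketch of the proposed default (hypothetical code, assuming a SparkConf named `conf` in scope; this is not the actual PythonRunner patch): honor an explicit setting, otherwise fall back to `spark.task.cpus` rather than `spark.executor.cores`.

{code:java}
// Hypothetical sketch of the intended default, not the code from the PR:
// prefer a user-provided value, else default to the per-task CPU count.
val ompNumThreads: String = conf.getOption("spark.executorEnv.OMP_NUM_THREADS")
  .getOrElse(conf.get("spark.task.cpus", "1"))
{code}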
[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42613: --- Summary: PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default (was: PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by default) > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus x spark.executor.cores`. Otherwise, we will still have > issues when executer core is set to a very large number but task cpus is 1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42613: --- Description: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still have issues when executer core is set to a very large number but task cpus is 1. was: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus x spark.executor.cores`. Otherwise, we will still have issues when executer core is set to a very large number but task cpus is 1. > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still > have issues when executer core is set to a very large number but task cpus is > 1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42613: --- Description: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus x spark.executor.cores`. Otherwise, we will still have issues when executer core is set to a very large number but task cpus is 1. was: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still have issues when executer core is set to a very large number but task cpus is 1. > PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by > default > > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus x spark.executor.cores`. Otherwise, we will still have > issues when executer core is set to a very large number but task cpus is 1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42613: --- Summary: PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by default (was: PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default) > PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by > default > > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still > have issues when executer core is set to a very large number but task cpus is > 1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42613: --- Description: Follow up from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still have issues when executer core is set to a very large number but task cpus is 1. was: Coming from [https://github.com/apache/spark/pull/40199#discussion_r1119453996] If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still have issues when executer core is set to a very large number but task cpus is 1. > PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor > cores by default > - > > Key: SPARK-42613 > URL: https://issues.apache.org/jira/browse/SPARK-42613 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Follow up from > [https://github.com/apache/spark/pull/40199#discussion_r1119453996] > > If OMP_NUM_THREADS is not set explicitly, we should set it to > `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still > have issues when executer core is set to a very large number but task cpus is > 1. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
John Zhuge created SPARK-42613:
--

Summary: PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default
Key: SPARK-42613
URL: https://issues.apache.org/jira/browse/SPARK-42613
Project: Spark
Issue Type: Bug
Components: PySpark, YARN
Affects Versions: 3.3.0
Reporter: John Zhuge

Coming from [https://github.com/apache/spark/pull/40199#discussion_r1119453996]

If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still have issues when executor cores is set to a very large number but task cpus is 1.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42607) [MESOS] OMP_NUM_THREADS not set to number of executor cores by default
John Zhuge created SPARK-42607:
--

Summary: [MESOS] OMP_NUM_THREADS not set to number of executor cores by default
Key: SPARK-42607
URL: https://issues.apache.org/jira/browse/SPARK-42607
Project: Spark
Issue Type: Bug
Components: Mesos
Affects Versions: 3.3.2
Reporter: John Zhuge

We could have a similar issue to SPARK-42596 (YARN) in Mesos. Could someone verify? Unfortunately, I am not able to due to lack of infra.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42596) [YARN] OMP_NUM_THREADS not set to number of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-42596: --- Description: Run this PySpark script with `spark.executor.cores=1` {code:python} import os from pyspark.sql import SparkSession from pyspark.sql.functions import udf spark = SparkSession.builder.getOrCreate() var_name = 'OMP_NUM_THREADS' def get_env_var(): return os.getenv(var_name) udf_get_env_var = udf(get_env_var) spark.range(1).toDF("id").withColumn(f"env_{var_name}", udf_get_env_var()).show(truncate=False) {code} Output with release `3.3.2`: {noformat} +---+---+ |id |env_OMP_NUM_THREADS| +---+---+ |0 |null | +---+---+ {noformat} Output with release `3.3.0`: {noformat} +---+---+ |id |env_OMP_NUM_THREADS| +---+---+ |0 |1 | +---+---+ {noformat} was: Run this PySpark script with `spark.executor.cores=1` {code:python} import os from pyspark.sql import SparkSession from pyspark.sql.functions import udf spark = SparkSession.builder.getOrCreate() var_name = 'OMP_NUM_THREADS' def get_env_var(): return os.getenv(var_name) udf_get_env_var = udf(get_env_var) spark.range(1).toDF("id").withColumn(f"env_{var_name}", udf_get_env_var()).show(truncate=False) {code} Output with release `3.3.2`: {noformat} +---+---+ |id |env_OMP_NUM_THREADS| +---+---+ |0 |null | +---+---+ {noformat} Output with release `3.3.0`: {noformat} +---+---+ |id |env_OMP_NUM_THREADS| +---+---+ |0 |1 | +---+---+ {noformat} > [YARN] OMP_NUM_THREADS not set to number of executor cores by default > - > > Key: SPARK-42596 > URL: https://issues.apache.org/jira/browse/SPARK-42596 > Project: Spark > Issue Type: Bug > Components: PySpark, YARN >Affects Versions: 3.3.2 >Reporter: John Zhuge >Priority: Major > > Run this PySpark script with `spark.executor.cores=1` > {code:python} > import os > from pyspark.sql import SparkSession > from pyspark.sql.functions import udf > spark = SparkSession.builder.getOrCreate() > var_name = 'OMP_NUM_THREADS' > def get_env_var(): > return os.getenv(var_name) > udf_get_env_var = udf(get_env_var) > spark.range(1).toDF("id").withColumn(f"env_{var_name}", > udf_get_env_var()).show(truncate=False) > {code} > Output with release `3.3.2`: > {noformat} > +---+---+ > |id |env_OMP_NUM_THREADS| > +---+---+ > |0 |null | > +---+---+ > {noformat} > Output with release `3.3.0`: > {noformat} > +---+---+ > |id |env_OMP_NUM_THREADS| > +---+---+ > |0 |1 | > +---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42596) [YARN] OMP_NUM_THREADS not set to number of executor cores by default
[ https://issues.apache.org/jira/browse/SPARK-42596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693837#comment-17693837 ]

John Zhuge commented on SPARK-42596:

Looks like a regression from SPARK-41188, which removed the code in PythonRunner that set the default OMP_NUM_THREADS. Its PR assumed the code could be moved to SparkContext; unfortunately, `SparkContext#executorEnvs` is only used by StandaloneSchedulerBackend for Spark's standalone cluster manager, so the PR broke YARN, as shown in the test case above, and probably Mesos as well, though I don't have a way to test that.

> [YARN] OMP_NUM_THREADS not set to number of executor cores by default
> -
>
> Key: SPARK-42596
> URL: https://issues.apache.org/jira/browse/SPARK-42596
> Project: Spark
> Issue Type: Bug
> Components: PySpark, YARN
> Affects Versions: 3.3.2
> Reporter: John Zhuge
> Priority: Major
>
> Run this PySpark script with `spark.executor.cores=1`
> {code:python}
> import os
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import udf
> spark = SparkSession.builder.getOrCreate()
> var_name = 'OMP_NUM_THREADS'
> def get_env_var():
>     return os.getenv(var_name)
> udf_get_env_var = udf(get_env_var)
> spark.range(1).toDF("id").withColumn(f"env_{var_name}", udf_get_env_var()).show(truncate=False)
> {code}
> Output with release `3.3.2`:
> {noformat}
> +---+-------------------+
> |id |env_OMP_NUM_THREADS|
> +---+-------------------+
> |0  |null               |
> +---+-------------------+
> {noformat}
> Output with release `3.3.0`:
> {noformat}
> +---+-------------------+
> |id |env_OMP_NUM_THREADS|
> +---+-------------------+
> |0  |1                  |
> +---+-------------------+
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
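Until the default is restored, setting the variable explicitly should sidestep the regression, since the removed code only supplied a default when OMP_NUM_THREADS was unset. A sketch using the documented `spark.executorEnv.[Name]` mechanism; the value 1 matches the `spark.executor.cores=1` case in the repro above:

{code:java}
// Workaround sketch: export OMP_NUM_THREADS to executors explicitly so the
// missing default no longer matters.
val spark = org.apache.spark.sql.SparkSession.builder()
  .config("spark.executorEnv.OMP_NUM_THREADS", "1")
  .getOrCreate()
{code}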
[jira] [Created] (SPARK-42596) [YARN] OMP_NUM_THREADS not set to number of executor cores by default
John Zhuge created SPARK-42596:
--

Summary: [YARN] OMP_NUM_THREADS not set to number of executor cores by default
Key: SPARK-42596
URL: https://issues.apache.org/jira/browse/SPARK-42596
Project: Spark
Issue Type: Bug
Components: PySpark, YARN
Affects Versions: 3.3.2
Reporter: John Zhuge

Run this PySpark script with `spark.executor.cores=1`

{code:python}
import os

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

var_name = 'OMP_NUM_THREADS'

def get_env_var():
    return os.getenv(var_name)

udf_get_env_var = udf(get_env_var)

spark.range(1).toDF("id").withColumn(f"env_{var_name}", udf_get_env_var()).show(truncate=False)
{code}

Output with release `3.3.2`:
{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |null               |
+---+-------------------+
{noformat}

Output with release `3.3.0`:
{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |1                  |
+---+-------------------+
{noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42036) Kryo ClassCastException getting task result when JDK versions mismatch
[ https://issues.apache.org/jira/browse/SPARK-42036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Zhuge updated SPARK-42036:
-------------------------------
    Description: 
{noformat}
22/12/21 01:27:12 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.nio.ByteBuffer
Serialization trace:
lowerBounds (org.apache.iceberg.GenericDataFile)
taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
writerCommitMessage (org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
{noformat}
Iceberg 1.1 `BaseFile.lowerBounds` is defined as
{code:java}
Map<Integer, ByteBuffer>
{code}
Driver JDK version: 1.8.0_352 (Azul Systems, Inc.)
Executor JDK version: openjdk version "17.0.5" 2022-10-18 LTS
Kryo version: 4.0.2

Same Spark job works when both driver and executors run the same JDK 8 or JDK 17.

  was:
{noformat}
22/12/21 01:27:12 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.nio.ByteBuffer
Serialization trace:
lowerBounds (org.apache.iceberg.GenericDataFile)
taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
writerCommitMessage (org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
{noformat}
Iceberg 1.1 `BaseFile.lowerBounds` is defined as
{code:java}
Map<Integer, ByteBuffer>
{code}
Driver JDK version: 1.8.0_352 (Azul Systems, Inc.)
Executor JDK version: openjdk version "17.0.5" 2022-10-18 LTS
Kryo version: 4.0.2


> Kryo ClassCastException getting task result when JDK versions mismatch
> -----------------------------------------------------------------------
>
>                 Key: SPARK-42036
>                 URL: https://issues.apache.org/jira/browse/SPARK-42036
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: John Zhuge
>            Priority: Major
>
> {noformat}
> 22/12/21 01:27:12 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.nio.ByteBuffer
> Serialization trace:
> lowerBounds (org.apache.iceberg.GenericDataFile)
> taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
> writerCommitMessage (org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
>     at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
> {noformat}
> Iceberg 1.1 `BaseFile.lowerBounds` is defined as
> {code:java}
> Map<Integer, ByteBuffer>
> {code}
> Driver JDK version: 1.8.0_352 (Azul Systems, Inc.)
> Executor JDK version: openjdk version "17.0.5" 2022-10-18 LTS
> Kryo version: 4.0.2
>
> Same Spark job works when both driver and executors run the same JDK 8 or JDK 17.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
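One possible mitigation while driver and executor JDKs are mismatched is to take ByteBuffer out of Kryo's reflective FieldSerializer path entirely, so its wire format no longer depends on JDK-internal field layout. The sketch below is an untested illustration using the public Kryo 4 and Spark registrator APIs; the class name `ByteBufferRegistrator` is made up.
{code:scala}
import java.nio.ByteBuffer

import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}
import org.apache.spark.serializer.KryoRegistrator

// Serialize a ByteBuffer as (length, bytes) instead of its internal fields.
class ByteBufferSerializer extends Serializer[ByteBuffer] {
  override def write(kryo: Kryo, out: Output, buf: ByteBuffer): Unit = {
    val bytes = new Array[Byte](buf.remaining())
    buf.duplicate().get(bytes) // copy without moving the buffer's position
    out.writeInt(bytes.length)
    out.writeBytes(bytes)
  }

  override def read(kryo: Kryo, in: Input, cls: Class[ByteBuffer]): ByteBuffer =
    ByteBuffer.wrap(in.readBytes(in.readInt()))
}

class ByteBufferRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit =
    // Covers HeapByteBuffer and the other concrete ByteBuffer subclasses.
    kryo.addDefaultSerializer(classOf[ByteBuffer], new ByteBufferSerializer)
}
{code}
It would be enabled with `--conf spark.kryo.registrator=ByteBufferRegistrator` (fully qualified class name); whether it resolves this particular Iceberg case is untested.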
[jira] [Updated] (SPARK-42036) Kryo ClassCastException getting task result when JDK versions mismatch
[ https://issues.apache.org/jira/browse/SPARK-42036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Zhuge updated SPARK-42036:
-------------------------------
    Summary: Kryo ClassCastException getting task result when JDK versions mismatch  (was: Kryo ClassCastException getting task result when JDK version mismatch)

> Kryo ClassCastException getting task result when JDK versions mismatch
> -----------------------------------------------------------------------
>
>                 Key: SPARK-42036
>                 URL: https://issues.apache.org/jira/browse/SPARK-42036
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: John Zhuge
>            Priority: Major
>
> {noformat}
> 22/12/21 01:27:12 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.nio.ByteBuffer
> Serialization trace:
> lowerBounds (org.apache.iceberg.GenericDataFile)
> taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
> writerCommitMessage (org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
>     at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
> {noformat}
> Iceberg 1.1 `BaseFile.lowerBounds` is defined as
> {code:java}
> Map<Integer, ByteBuffer>
> {code}
> Driver JDK version: 1.8.0_352 (Azul Systems, Inc.)
> Executor JDK version: openjdk version "17.0.5" 2022-10-18 LTS
> Kryo version: 4.0.2

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42036) Kryo ClassCastException getting task result when JDK version mismatch
John Zhuge created SPARK-42036:
----------------------------------

             Summary: Kryo ClassCastException getting task result when JDK version mismatch
                 Key: SPARK-42036
                 URL: https://issues.apache.org/jira/browse/SPARK-42036
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 3.3.0
            Reporter: John Zhuge


{noformat}
22/12/21 01:27:12 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: java.lang.Integer cannot be cast to java.nio.ByteBuffer
Serialization trace:
lowerBounds (org.apache.iceberg.GenericDataFile)
taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
writerCommitMessage (org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
    at com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
{noformat}
Iceberg 1.1 `BaseFile.lowerBounds` is defined as
{code:java}
Map<Integer, ByteBuffer>
{code}
Driver JDK version: 1.8.0_352 (Azul Systems, Inc.)
Executor JDK version: openjdk version "17.0.5" 2022-10-18 LTS
Kryo version: 4.0.2

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41519) Pin versions-maven-plugin version
[ https://issues.apache.org/jira/browse/SPARK-41519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647711#comment-17647711 ]

John Zhuge commented on SPARK-41519:
------------------------------------
{noformat}
[ERROR] java.nio.file.NoSuchFileException: /Users/jzhuge/Repos/upstream-spark/avro
    at sun.nio.fs.UnixException.translateToIOException (UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException (UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException (UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newByteChannel (UnixFileSystemProvider.java:214)
    at java.nio.file.Files.newByteChannel (Files.java:361)
    at java.nio.file.Files.newByteChannel (Files.java:407)
    at java.nio.file.spi.FileSystemProvider.newInputStream (FileSystemProvider.java:384)
    at java.nio.file.Files.newInputStream (Files.java:152)
    at org.codehaus.plexus.util.xml.XmlReader.<init> (XmlReader.java:129)
    at org.codehaus.plexus.util.xml.XmlStreamReader.<init> (XmlStreamReader.java:67)
    at org.codehaus.plexus.util.ReaderFactory.newXmlReader (ReaderFactory.java:122)
    at org.codehaus.mojo.versions.api.PomHelper.readXmlFile (PomHelper.java:1498)
    at org.codehaus.mojo.versions.AbstractVersionsUpdaterMojo.process (AbstractVersionsUpdaterMojo.java:326)
    at org.codehaus.mojo.versions.SetMojo.execute (SetMojo.java:381)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo (DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 (MojoExecutor.java:370)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute (MojoExecutor.java:351)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:171)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:163)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:294)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:960)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
{noformat}

> Pin versions-maven-plugin version
> ---------------------------------
>
>                 Key: SPARK-41519
>                 URL: https://issues.apache.org/jira/browse/SPARK-41519
>             Project: Spark
>          Issue Type: Task
>          Components: Project Infra
>    Affects Versions: 3.4.0
>            Reporter: John Zhuge
>            Priority: Minor
>
> `versions-maven-plugin` release `2.14.0` broke the following command in Spark:
> {noformat}
> build/mvn versions:set -DnewVersion=3.4.0-jz-0 -DgenerateBackupPoms=false
> {noformat}
> See [https://github.com/mojohaus/versions/issues/848.]
> The plugin will be fixed in 2.14.1. However, it may be desirable to pin the plugin version.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41519) Pin versions-maven-plugin version
John Zhuge created SPARK-41519:
----------------------------------

             Summary: Pin versions-maven-plugin version
                 Key: SPARK-41519
                 URL: https://issues.apache.org/jira/browse/SPARK-41519
             Project: Spark
          Issue Type: Task
          Components: Project Infra
    Affects Versions: 3.4.0
            Reporter: John Zhuge


`versions-maven-plugin` release `2.14.0` broke the following command in Spark:
{noformat}
build/mvn versions:set -DnewVersion=3.4.0-jz-0 -DgenerateBackupPoms=false
{noformat}
See [https://github.com/mojohaus/versions/issues/848.]

The plugin will be fixed in 2.14.1. However, it may be desirable to pin the plugin version.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
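Until the plugin version is pinned in the POM, one workaround is to invoke the goal by fully qualified coordinates, which Maven supports for any plugin goal. The `2.13.0` below is an assumption (the release preceding the regression), not a version the ticket prescribes:
{noformat}
build/mvn org.codehaus.mojo:versions-maven-plugin:2.13.0:set \
  -DnewVersion=3.4.0-jz-0 -DgenerateBackupPoms=false
{noformat}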
[jira] [Updated] (SPARK-39800) DataSourceV2: View support
[ https://issues.apache.org/jira/browse/SPARK-39800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-39800: --- Description: Support Data source V2 views. (was: Data source V2 view substitution and resolution.) > DataSourceV2: View support > -- > > Key: SPARK-39800 > URL: https://issues.apache.org/jira/browse/SPARK-39800 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Support Data source V2 views. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39800) DataSourceV2: View support
[ https://issues.apache.org/jira/browse/SPARK-39800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-39800: --- Summary: DataSourceV2: View support (was: DataSourceV2: View substitution and resolution) > DataSourceV2: View support > -- > > Key: SPARK-39800 > URL: https://issues.apache.org/jira/browse/SPARK-39800 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.3.0 >Reporter: John Zhuge >Priority: Major > > Data source V2 view substitution and resolution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39800) DataSourceV2: View substitution and resolution
John Zhuge created SPARK-39800: -- Summary: DataSourceV2: View substitution and resolution Key: SPARK-39800 URL: https://issues.apache.org/jira/browse/SPARK-39800 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: John Zhuge Data source V2 view substitution and resolution. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-39799) DataSourceV2: View catalog interface
John Zhuge created SPARK-39799: -- Summary: DataSourceV2: View catalog interface Key: SPARK-39799 URL: https://issues.apache.org/jira/browse/SPARK-39799 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.3.0 Reporter: John Zhuge The view catalog interfaces. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
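For context, a rough sketch of the shape such an interface might take, modeled loosely on the existing `TableCatalog`; the `View` trait and method names here are illustrative, not the SPIP's exact API:
{code:scala}
import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier}
import org.apache.spark.sql.types.StructType

// Illustrative only; the SPIP defines the authoritative Java interface.
trait View {
  def name: String
  def sql: String                     // the view query text
  def schema: StructType              // output schema captured at creation
  def currentCatalog: String          // resolution context captured at creation
  def currentNamespace: Array[String]
}

trait ViewCatalog extends CatalogPlugin {
  def listViews(namespace: Array[String]): Array[Identifier]
  def loadView(ident: Identifier): View
  def createView(
      ident: Identifier,
      sql: String,
      schema: StructType,
      properties: java.util.Map[String, String]): View
  def dropView(ident: Identifier): Boolean
}
{code}
One design point any such API has to capture is the creation-time context (current catalog and namespace), so that Spark can re-resolve the stored SQL text consistently later.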
[jira] [Updated] (SPARK-31357) DataSourceV2: Catalog API for view metadata
[ https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-31357: --- Target Version/s: 3.3.0 (was: 3.2.0) > DataSourceV2: Catalog API for view metadata > --- > > Key: SPARK-31357 > URL: https://issues.apache.org/jira/browse/SPARK-31357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: John Zhuge >Priority: Major > Labels: SPIP > > SPARK-24252 added a catalog plugin system and `TableCatalog` API that > provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view > metadata. > Details in [SPIP > document|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing]. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31357) DataSourceV2: Catalog API for view metadata
[ https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-31357: --- Summary: DataSourceV2: Catalog API for view metadata (was: SPIP: Catalog API for view metadata) > DataSourceV2: Catalog API for view metadata > --- > > Key: SPARK-31357 > URL: https://issues.apache.org/jira/browse/SPARK-31357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: John Zhuge >Priority: Major > Labels: SPIP > > SPARK-24252 added a catalog plugin system and `TableCatalog` API that > provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view > metadata. > Details in [SPIP > document|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing]. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36664) Log time spent waiting for cluster resources
[ https://issues.apache.org/jira/browse/SPARK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421824#comment-17421824 ]

John Zhuge commented on SPARK-36664:
------------------------------------
This can be very useful. For example, we'd like to track how long YARN jobs are stuck in the ACCEPTED state.

> Log time spent waiting for cluster resources
> --------------------------------------------
>
>                 Key: SPARK-36664
>                 URL: https://issues.apache.org/jira/browse/SPARK-36664
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.2.0, 3.3.0
>            Reporter: Holden Karau
>            Priority: Major
>
> To provide better visibility into why jobs might be running slowly, it would be useful to log when we are waiting for resources and how long we wait, so the user can be aware of any underlying cluster issue.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25929) Support metrics with tags
[ https://issues.apache.org/jira/browse/SPARK-25929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17098247#comment-17098247 ] John Zhuge commented on SPARK-25929: Has anyone looked at https://micrometer.io/? From the web site: {quote}Micrometer provides a simple facade over the instrumentation clients for the most popular monitoring systems, allowing you to instrument your JVM-based application code without vendor lock-in. Think SLF4J, but for metrics.{quote} > Support metrics with tags > - > > Key: SPARK-25929 > URL: https://issues.apache.org/jira/browse/SPARK-25929 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.3.2 >Reporter: John Zhuge >Priority: Major > > For better integration with DBs that support tags/labels, e.g., InfluxDB, > Prometheus, Atlas, etc. > We should continue to support the current Graphite-style metrics. > Dropwizard Metrics v5 supports tags. It has been in RC status since Feb. > Currently > `[5.0.0-rc2|https://github.com/dropwizard/metrics/releases/tag/v5.0.0-rc2]` > is in Maven. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
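For illustration, this is what the tagged (dimensional) model looks like through Micrometer's facade: a minimal sketch using the in-memory `SimpleMeterRegistry`, with made-up metric and tag names.
{code:scala}
import io.micrometer.core.instrument.simple.SimpleMeterRegistry

val registry = new SimpleMeterRegistry()

// One logical metric; dimensions travel as key/value tags, not name segments.
registry
  .counter("tasks.completed", "app_id", "application_123", "executor_id", "7")
  .increment()

// A Graphite-style exporter instead flattens everything into the metric name,
// e.g. application_123.7.tasks.completed, which downstream templates must
// then reverse-engineer.
{code}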
[jira] [Updated] (SPARK-31357) SPIP: Catalog API for view metadata
[ https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-31357: --- Description: SPARK-24252 added a catalog plugin system and `TableCatalog` API that provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view metadata. Details in [SPIP document|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing]. was: SPARK-24252 added a catalog plugin system and `TableCatalog` API that provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view metadata. Details in [SPIP docment|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing]. > SPIP: Catalog API for view metadata > --- > > Key: SPARK-31357 > URL: https://issues.apache.org/jira/browse/SPARK-31357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: John Zhuge >Priority: Major > Labels: SPIP > > SPARK-24252 added a catalog plugin system and `TableCatalog` API that > provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view > metadata. > Details in [SPIP > document|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31357) SPIP: Catalog API for view metadata
[ https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-31357: --- Description: SPARK-24252 added a catalog plugin system and `TableCatalog` API that provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view metadata. Details in [SPIP docment|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing]. was: SPARK-24252 added a catalog plugin system and `TableCatalog` API that provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view metadata. Details in [SPIP|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing]. > SPIP: Catalog API for view metadata > --- > > Key: SPARK-31357 > URL: https://issues.apache.org/jira/browse/SPARK-31357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: John Zhuge >Priority: Major > Labels: SPIP > > SPARK-24252 added a catalog plugin system and `TableCatalog` API that > provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view > metadata. > Details in [SPIP > docment|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31357) SPIP: Catalog API for view metadata
[ https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-31357: --- Summary: SPIP: Catalog API for view metadata (was: Catalog API for View Metadata) > SPIP: Catalog API for view metadata > --- > > Key: SPARK-31357 > URL: https://issues.apache.org/jira/browse/SPARK-31357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: John Zhuge >Priority: Major > Labels: SPIP > > SPARK-24252 added a catalog plugin system and `TableCatalog` API that > provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view > metadata. > Details in > [SPIP|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31357) Catalog API for View Metadata
[ https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-31357: --- Labels: SPIP (was: ) > Catalog API for View Metadata > - > > Key: SPARK-31357 > URL: https://issues.apache.org/jira/browse/SPARK-31357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: John Zhuge >Priority: Major > Labels: SPIP > > SPARK-24252 added a catalog plugin system and `TableCatalog` API that > provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view > metadata. > Details in > [SPIP|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31357) Catalog API for View Metadata
John Zhuge created SPARK-31357: -- Summary: Catalog API for View Metadata Key: SPARK-31357 URL: https://issues.apache.org/jira/browse/SPARK-31357 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: John Zhuge SPARK-24252 added a catalog plugin system and `TableCatalog` API that provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view metadata. Details in [SPIP|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25929) Support metrics with tags
[ https://issues.apache.org/jira/browse/SPARK-25929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035074#comment-17035074 ]

John Zhuge commented on SPARK-25929:
------------------------------------
Yeah, I can feel the pain. When I ingest into InfluxDB, I have to use its [Graphite templates|https://github.com/influxdata/influxdb/tree/v1.7.10/services/graphite#templates], e.g.,
{noformat}
"*.*.*.DAGScheduler.*.* application.app_id.executor_id.measurement.type.qty name=DAGScheduler",
"*.*.*.ExecutorAllocationManager.*.* application.app_id.executor_id.measurement.type.qty name=ExecutorAllocationManager",
"*.*.*.ExternalShuffle.*.* application.app_id.executor_id.measurement.type.qty name=ExternalShuffle",
{noformat}
Hard to get right. Easily obsolete. Doesn't support multiple versions.

> Support metrics with tags
> -------------------------
>
>                 Key: SPARK-25929
>                 URL: https://issues.apache.org/jira/browse/SPARK-25929
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.3.2
>            Reporter: John Zhuge
>            Priority: Major
>
> For better integration with DBs that support tags/labels, e.g., InfluxDB, Prometheus, Atlas, etc.
> We should continue to support the current Graphite-style metrics.
> Dropwizard Metrics v5 supports tags. It has been in RC status since Feb. Currently `[5.0.0-rc2|https://github.com/dropwizard/metrics/releases/tag/v5.0.0-rc2]` is in Maven.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30118) ALTER VIEW QUERY does not work
[ https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988078#comment-16988078 ] John Zhuge commented on SPARK-30118: [~cltlfcjin] Thanks for the comment. Do you know which commit fixed the issue? > ALTER VIEW QUERY does not work > -- > > Key: SPARK-30118 > URL: https://issues.apache.org/jira/browse/SPARK-30118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted > state. > {code:sql} > spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1; > spark-sql> SHOW CREATE TABLE jzhuge.v1; > CREATE VIEW `jzhuge`.`v1`(foo1) AS > SELECT 'foo' foo1 > spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2; > spark-sql> SHOW CREATE TABLE jzhuge.v1; > CREATE VIEW `jzhuge`.`v1`(foo1) AS > SELECT 'foo' foo1 > spark-sql> TABLE jzhuge.v1; > Error in query: Attribute with name 'foo2' is not found in '(foo1)';; > SubqueryAlias `jzhuge`.`v1` > +- View (`jzhuge`.`v1`, [foo1#33]) >+- Project [foo AS foo1#34] > +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30118) ALTER VIEW QUERY does not work
[ https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987429#comment-16987429 ] John Zhuge commented on SPARK-30118: I am running Spark master with Hive 1.2.1. Same issue in Spark 2.3. > ALTER VIEW QUERY does not work > -- > > Key: SPARK-30118 > URL: https://issues.apache.org/jira/browse/SPARK-30118 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted > state. > {code:sql} > spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1; > spark-sql> SHOW CREATE TABLE jzhuge.v1; > CREATE VIEW `jzhuge`.`v1`(foo1) AS > SELECT 'foo' foo1 > spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2; > spark-sql> SHOW CREATE TABLE jzhuge.v1; > CREATE VIEW `jzhuge`.`v1`(foo1) AS > SELECT 'foo' foo1 > spark-sql> TABLE jzhuge.v1; > Error in query: Attribute with name 'foo2' is not found in '(foo1)';; > SubqueryAlias `jzhuge`.`v1` > +- View (`jzhuge`.`v1`, [foo1#33]) >+- Project [foo AS foo1#34] > +- OneRowRelation > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30118) ALTER VIEW QUERY does not work
[ https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987428#comment-16987428 ]

John Zhuge commented on SPARK-30118:
------------------------------------
{code:java}
spark-sql> DESC FORMATTED jzhuge.v1;
foo1                          string    NULL

# Detailed Table Information
Database                      jzhuge
Table                         v1
Owner                         jzhuge
Created Time                  Tue Dec 03 17:53:59 PST 2019
Last Access                   UNKNOWN
Created By                    Spark 3.0.0-SNAPSHOT
Type                          VIEW
View Text                     SELECT 'foo' foo1
View Original Text            SELECT 'foo' foo1
View Default Database         default
View Query Output Columns     [foo2]
Table Properties              [transient_lastDdlTime=1575424439, view.query.out.col.0=foo2, view.query.out.numCols=1, view.default.database=default]
Location                      file://tmp/jzhuge/v1
Serde Library                 org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat                   org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat                  org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties            [serialization.format=1]
{code}

> ALTER VIEW QUERY does not work
> ------------------------------
>
>                 Key: SPARK-30118
>                 URL: https://issues.apache.org/jira/browse/SPARK-30118
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: John Zhuge
>            Priority: Major
>
> `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted state.
> {code:sql}
> spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> TABLE jzhuge.v1;
> Error in query: Attribute with name 'foo2' is not found in '(foo1)';;
> SubqueryAlias `jzhuge`.`v1`
> +- View (`jzhuge`.`v1`, [foo1#33])
>    +- Project [foo AS foo1#34]
>       +- OneRowRelation
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30118) ALTER VIEW QUERY does not work
John Zhuge created SPARK-30118: -- Summary: ALTER VIEW QUERY does not work Key: SPARK-30118 URL: https://issues.apache.org/jira/browse/SPARK-30118 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted state. {code:sql} spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1; spark-sql> SHOW CREATE TABLE jzhuge.v1; CREATE VIEW `jzhuge`.`v1`(foo1) AS SELECT 'foo' foo1 spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2; spark-sql> SHOW CREATE TABLE jzhuge.v1; CREATE VIEW `jzhuge`.`v1`(foo1) AS SELECT 'foo' foo1 spark-sql> TABLE jzhuge.v1; Error in query: Attribute with name 'foo2' is not found in '(foo1)';; SubqueryAlias `jzhuge`.`v1` +- View (`jzhuge`.`v1`, [foo1#33]) +- Project [foo AS foo1#34] +- OneRowRelation {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
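Until the bug above is fixed, a possible workaround (untested against the affected build) is to replace the view wholesale rather than altering its query in place, e.g. from the Scala shell:
{code:scala}
// Untested workaround sketch: CREATE OR REPLACE rewrites the view metadata
// as a whole, avoiding the mismatch between the stored view text and the
// view.query.out.col.* properties that ALTER VIEW ... AS leaves behind.
spark.sql("CREATE OR REPLACE VIEW jzhuge.v1 AS SELECT 'foo' foo2")
spark.sql("TABLE jzhuge.v1").show()
{code}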
[jira] [Updated] (SPARK-29030) Simplify lookupV2Relation
[ https://issues.apache.org/jira/browse/SPARK-29030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-29030: --- Description: Simplify the return type for {{lookupV2Relation}} which makes the 3 callers more straightforward. (was: Simplify the return type for {{lookupV2Relation}} which makes the 3 callers straightforward as well.) > Simplify lookupV2Relation > - > > Key: SPARK-29030 > URL: https://issues.apache.org/jira/browse/SPARK-29030 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > Simplify the return type for {{lookupV2Relation}} which makes the 3 callers > more straightforward. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29030) Simplify lookupV2Relation
John Zhuge created SPARK-29030: -- Summary: Simplify lookupV2Relation Key: SPARK-29030 URL: https://issues.apache.org/jira/browse/SPARK-29030 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge Simplify the return type for {{lookupV2Relation}} which makes the 3 callers straightforward as well. -- This message was sent by Atlassian Jira (v8.3.2#803003) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28640) Only give warning when session catalog is not defined
John Zhuge created SPARK-28640: -- Summary: Only give warning when session catalog is not defined Key: SPARK-28640 URL: https://issues.apache.org/jira/browse/SPARK-28640 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge LookupCatalog.sessionCatalog logs an error message and the exception stack upon any nonfatal exception. When session catalog is not defined, this may alarm the user unnecessarily. It should be enough to give a warning and return None. -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
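A hypothetical sketch of the suggested change, not the actual LookupCatalog code; it assumes a `catalogManager` in scope, Spark's internal `Logging` trait, and package names as in later Spark 3.x:
{code:scala}
import scala.util.control.NonFatal

import org.apache.spark.internal.Logging
import org.apache.spark.sql.connector.catalog.{CatalogManager, CatalogPlugin}

trait SessionCatalogLookup extends Logging {
  def catalogManager: CatalogManager

  // Warn instead of logging an error with a full stack trace, and
  // return None when the session catalog cannot be loaded.
  def sessionCatalog: Option[CatalogPlugin] =
    try {
      Some(catalogManager.catalog(CatalogManager.SESSION_CATALOG_NAME))
    } catch {
      case NonFatal(e) =>
        logWarning(s"Session catalog is not defined: ${e.getMessage}")
        None
    }
}
{code}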
[jira] [Created] (SPARK-28565) DataSourceV2: DataFrameWriter.saveAsTable
John Zhuge created SPARK-28565: -- Summary: DataSourceV2: DataFrameWriter.saveAsTable Key: SPARK-28565 URL: https://issues.apache.org/jira/browse/SPARK-28565 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge Assignee: John Zhuge Fix For: 3.0.0 Support multiple catalogs in the following InsertInto use cases: * DataFrameWriter.insertInto("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28565) DataSourceV2: DataFrameWriter.saveAsTable
[ https://issues.apache.org/jira/browse/SPARK-28565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-28565: --- Description: Support multiple catalogs in the following use cases: * DataFrameWriter.saveAsTable("catalog.db.tbl") was: Support multiple catalogs in the following InsertInto use cases: * DataFrameWriter.insertInto("catalog.db.tbl") > DataSourceV2: DataFrameWriter.saveAsTable > - > > Key: SPARK-28565 > URL: https://issues.apache.org/jira/browse/SPARK-28565 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Assignee: John Zhuge >Priority: Major > Fix For: 3.0.0 > > > Support multiple catalogs in the following use cases: > * DataFrameWriter.saveAsTable("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.14#76016) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28178) DataSourceV2: DataFrameWriter.insertInto
[ https://issues.apache.org/jira/browse/SPARK-28178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Zhuge updated SPARK-28178:
-------------------------------
    Description: 
Support multiple catalogs in the following InsertInto use cases:
 * DataFrameWriter.insertInto("catalog.db.tbl")

  was:
Support multiple catalogs in the following InsertInto use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl
 * DataFrameWriter.insertInto("catalog.db.tbl")


> DataSourceV2: DataFrameWriter.insertInto
> ----------------------------------------
>
>                 Key: SPARK-28178
>                 URL: https://issues.apache.org/jira/browse/SPARK-28178
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: John Zhuge
>            Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
> * DataFrameWriter.insertInto("catalog.db.tbl")

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28178) DataSourceV2: DataFrameWriter.insertInto
John Zhuge created SPARK-28178:
----------------------------------

             Summary: DataSourceV2: DataFrameWriter.insertInto
                 Key: SPARK-28178
                 URL: https://issues.apache.org/jira/browse/SPARK-28178
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: John Zhuge


Support multiple catalogs in the following InsertInto use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl
 * DataFrameWriter.insertInto("catalog.db.tbl")

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27845) DataSourceV2: InsertTable
[ https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27845: --- Description: Support multiple catalogs in the following InsertInto use cases: * INSERT INTO [TABLE] catalog.db.tbl * INSERT OVERWRITE TABLE catalog.db.tbl was: Support multiple catalogs in the following InsertInto use cases: * INSERT INTO [TABLE] catalog.db.tbl * INSERT OVERWRITE TABLE catalog.db.tbl * DataFrameWriter.insertInto("catalog.db.tbl") > DataSourceV2: InsertTable > - > > Key: SPARK-27845 > URL: https://issues.apache.org/jira/browse/SPARK-27845 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multiple catalogs in the following InsertInto use cases: > * INSERT INTO [TABLE] catalog.db.tbl > * INSERT OVERWRITE TABLE catalog.db.tbl -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27845) DataSourceV2: InsertTable
[ https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27845: --- Description: Support multiple catalogs in the following use cases: * INSERT INTO [TABLE] catalog.db.tbl * INSERT OVERWRITE TABLE catalog.db.tbl was: Support multiple catalogs in the following InsertInto use cases: * INSERT INTO [TABLE] catalog.db.tbl * INSERT OVERWRITE TABLE catalog.db.tbl > DataSourceV2: InsertTable > - > > Key: SPARK-27845 > URL: https://issues.apache.org/jira/browse/SPARK-27845 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multiple catalogs in the following use cases: > * INSERT INTO [TABLE] catalog.db.tbl > * INSERT OVERWRITE TABLE catalog.db.tbl -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27845) DataSourceV2: InsertTable
[ https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27845: --- Summary: DataSourceV2: InsertTable (was: DataSourceV2: Insert into tables in multiple catalogs) > DataSourceV2: InsertTable > - > > Key: SPARK-27845 > URL: https://issues.apache.org/jira/browse/SPARK-27845 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multiple catalogs in the following InsertInto use cases: > * INSERT INTO [TABLE] catalog.db.tbl > * INSERT OVERWRITE TABLE catalog.db.tbl > * DataFrameWriter.insertInto("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27322) DataSourceV2 table relation
[ https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27322: --- Summary: DataSourceV2 table relation (was: DataSourceV2: Select from multiple catalogs) > DataSourceV2 table relation > --- > > Key: SPARK-27322 > URL: https://issues.apache.org/jira/browse/SPARK-27322 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multi-catalog in the following SELECT code paths: > * SELECT * FROM catalog.db.tbl > * TABLE catalog.db.tbl > * JOIN or UNION tables from different catalogs > * SparkSession.table("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27845) DataSourceV2: Insert into tables in multiple catalogs
[ https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27845: --- Summary: DataSourceV2: Insert into tables in multiple catalogs (was: DataSourceV2: InsertInto multiple catalogs) > DataSourceV2: Insert into tables in multiple catalogs > - > > Key: SPARK-27845 > URL: https://issues.apache.org/jira/browse/SPARK-27845 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multiple catalogs in the following InsertInto use cases: > * INSERT INTO [TABLE] catalog.db.tbl > * INSERT OVERWRITE TABLE catalog.db.tbl > * DataFrameWriter.insertInto("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27961) DataSourceV2Relation should not have refresh method
[ https://issues.apache.org/jira/browse/SPARK-27961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27961: --- Description: The newly added `Refresh` method in [PR #24401|https://github.com/apache/spark/pull/24401] prevented me from moving DataSourceV2Relation into catalyst. It calls `case table: FileTable => table.fileIndex.refresh()` while `FileTable` belongs to sql/core. More importantly, [~rdblue] pointed out DataSourceV2Relation is immutable by design, it should not have refresh method. was: The newly added `Refresh` method in PR #24401 prevented me from moving DataSourceV2Relation into catalyst. It calls `case table: FileTable => table.fileIndex.refresh()` while `FileTable` belongs to sql/core. More importantly, [~rdblue] pointed out DataSourceV2Relation is immutable by design, it should not have refresh method. > DataSourceV2Relation should not have refresh method > --- > > Key: SPARK-27961 > URL: https://issues.apache.org/jira/browse/SPARK-27961 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > The newly added `Refresh` method in [PR > #24401|https://github.com/apache/spark/pull/24401] prevented me from moving > DataSourceV2Relation into catalyst. It calls `case table: FileTable => > table.fileIndex.refresh()` while `FileTable` belongs to sql/core. > More importantly, [~rdblue] pointed out DataSourceV2Relation is immutable by > design, it should not have refresh method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27961) DataSourceV2Relation should not have refresh method
[ https://issues.apache.org/jira/browse/SPARK-27961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856982#comment-16856982 ] John Zhuge commented on SPARK-27961: [~Gengliang.Wang] [~cloud_fan] Could you help? > DataSourceV2Relation should not have refresh method > --- > > Key: SPARK-27961 > URL: https://issues.apache.org/jira/browse/SPARK-27961 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > The newly added `Refresh` method in PR #24401 prevented me from moving > DataSourceV2Relation into catalyst. It calls `case table: FileTable => > table.fileIndex.refresh()` while `FileTable` belongs to sql/core. > More importantly, [~rdblue] pointed out DataSourceV2Relation is immutable by > design, it should not have refresh method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27961) DataSourceV2Relation should not have refresh method
John Zhuge created SPARK-27961: -- Summary: DataSourceV2Relation should not have refresh method Key: SPARK-27961 URL: https://issues.apache.org/jira/browse/SPARK-27961 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge The newly added `Refresh` method in PR #24401 prevented me from moving DataSourceV2Relation into catalyst. It calls `case table: FileTable => table.fileIndex.refresh()` while `FileTable` belongs to sql/core. More importantly, [~rdblue] pointed out DataSourceV2Relation is immutable by design, it should not have refresh method. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27947: --- Description: In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives this warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} was: In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} > ParsedStatement subclass toString may throw ClassCastException > -- > > Key: SPARK-27947 > URL: https://issues.apache.org/jira/browse/SPARK-27947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any > Map type, thus causing `asInstanceOf[Map[String, String]]` to throw > ClassCastException. > The following test reproduces the issue: > {code:java} > case class TestStatement(p: Map[String, Int]) extends ParsedStatement { > override def output: Seq[Attribute] = Nil > override def children: Seq[LogicalPlan] = Nil > } > TestStatement(Map("abc" -> 1)).toString{code} > Changing the code to `case mapArg: Map[String, String]` will not work due to > type erasure. As a matter of fact, compiler gives this warning: > {noformat} > Warning:(41, 18) non-variable type argument String in type pattern > scala.collection.immutable.Map[String,String] (the underlying of > Map[String,String]) is unchecked since it is eliminated by erasure > case mapArg: Map[String, String] =>{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
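A self-contained demonstration of the erasure hazard described above, plus an erasure-safe way to format the entries; this is illustrative, not the exact Spark fix.
{code:scala}
import scala.util.Try

// Really a Map[String, Int], statically typed as Any (like a Product field).
val arg: Any = Map("abc" -> 1)

// The pattern cannot check Map's type parameters at runtime, so the cast
// itself succeeds; the failure surfaces only where a value is used as String.
val crashed = Try {
  arg match {
    case mapArg: Map[_, _] =>
      val m = mapArg.asInstanceOf[Map[String, String]] // no error yet
      m.head._2.length // ClassCastException: Integer cannot be cast to String
  }
}
println(crashed) // Failure(java.lang.ClassCastException: ...)

// Erasure-safe alternative: never commit to the value type at all.
arg match {
  case mapArg: Map[_, _] =>
    println(mapArg.map { case (k, v) => s"$k=$v" }.mkString("[", ", ", "]")) // [abc=1]
}
{code}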
[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27947: --- Description: In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} was: In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} > ParsedStatement subclass toString may throw ClassCastException > -- > > Key: SPARK-27947 > URL: https://issues.apache.org/jira/browse/SPARK-27947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any > Map type, thus causing `asInstanceOf[Map[String, String]]` to throw > ClassCastException. > The following test reproduces the issue: > {code:java} > case class TestStatement(p: Map[String, Int]) extends ParsedStatement { > override def output: Seq[Attribute] = Nil > override def children: Seq[LogicalPlan] = Nil > } > TestStatement(Map("abc" -> 1)).toString{code} > Changing the code to `case mapArg: Map[String, String]` will not work due to > type erasure. As a matter of fact, compiler gives the warning: > {noformat} > Warning:(41, 18) non-variable type argument String in type pattern > scala.collection.immutable.Map[String,String] (the underlying of > Map[String,String]) is unchecked since it is eliminated by erasure > case mapArg: Map[String, String] =>{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27947: --- Description: In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} was: In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} > ParsedStatement subclass toString may throw ClassCastException > -- > > Key: SPARK-27947 > URL: https://issues.apache.org/jira/browse/SPARK-27947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any > Map type, thus causing `asInstanceOf[Map[String, String]]` to throw > ClassCastException. > The following test reproduces the issue: > {code:java} > case class TestStatement(p: Map[String, Int]) extends ParsedStatement { > override def output: Seq[Attribute] = Nil > override def children: Seq[LogicalPlan] = Nil > } > TestStatement(Map("abc" -> 1)).toString{code} > Changing the code to `case mapArg: Map[String, String]` will not work due to > type erasure. As a matter of fact, compiler gives the warning: > {noformat} > Warning:(41, 18) non-variable type argument String in type pattern > scala.collection.immutable.Map[String,String] (the underlying of > Map[String,String]) is unchecked since it is eliminated by erasure > case mapArg: Map[String, String] =>{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27947: --- Description: In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} was: In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} {code:java} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} > ParsedStatement subclass toString may throw ClassCastException > -- > > Key: SPARK-27947 > URL: https://issues.apache.org/jira/browse/SPARK-27947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any > Map type, thus causing `asInstanceOf[Map[String, String]]` to throw > ClassCastException. > The following test reproduces the issue: > {code:java} > case class TestStatement(p: Map[String, Int]) extends ParsedStatement { > override def output: Seq[Attribute] = Nil > override def children: Seq[LogicalPlan] = Nil > } > TestStatement(Map("abc" -> 1)).toString{code} > Changing the code to `case mapArg: Map[String, String]` will not work due to > type erasure. As a matter of fact, compiler gives the warning: > {noformat} > Warning:(41, 18) non-variable type argument String in type pattern > scala.collection.immutable.Map[String,String] (the underlying of > Map[String,String]) is unchecked since it is eliminated by erasure > case mapArg: Map[String, String] =>{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
John Zhuge created SPARK-27947: -- Summary: ParsedStatement subclass toString may throw ClassCastException Key: SPARK-27947 URL: https://issues.apache.org/jira/browse/SPARK-27947 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, the compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException
[ https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27947: --- Description: In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} {code:java} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} was: In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map type, thus causing `asInstanceOf[Map[String, String]]` to throw ClassCastException. The following test reproduces the issue: {code:java} case class TestStatement(p: Map[String, Int]) extends ParsedStatement { override def output: Seq[Attribute] = Nil override def children: Seq[LogicalPlan] = Nil } TestStatement(Map("abc" -> 1)).toString{code} {code:java} {code} Changing the code to `case mapArg: Map[String, String]` will not work due to type erasure. As a matter of fact, compiler gives the warning: {noformat} Warning:(41, 18) non-variable type argument String in type pattern scala.collection.immutable.Map[String,String] (the underlying of Map[String,String]) is unchecked since it is eliminated by erasure case mapArg: Map[String, String] =>{noformat} > ParsedStatement subclass toString may throw ClassCastException > -- > > Key: SPARK-27947 > URL: https://issues.apache.org/jira/browse/SPARK-27947 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any > Map type, thus causing `asInstanceOf[Map[String, String]]` to throw > ClassCastException. > The following test reproduces the issue: > {code:java} > case class TestStatement(p: Map[String, Int]) extends ParsedStatement { > override def output: Seq[Attribute] = Nil > override def children: Seq[LogicalPlan] = Nil > } > TestStatement(Map("abc" -> 1)).toString{code} > {code:java} > Changing the code to `case mapArg: Map[String, String]` will not work due to > type erasure. As a matter of fact, compiler gives the warning: > {noformat} > Warning:(41, 18) non-variable type argument String in type pattern > scala.collection.immutable.Map[String,String] (the underlying of > Map[String,String]) is unchecked since it is eliminated by erasure > case mapArg: Map[String, String] =>{noformat} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
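A minimal standalone sketch (not Spark source; all names are illustrative) of why the `Map[String, String]` pattern above cannot be checked at runtime:

{code:java}
// Generic type arguments are erased on the JVM, so this pattern matches ANY
// Map at runtime; compiling it reproduces the "unchecked since it is
// eliminated by erasure" warning quoted above.
object ErasureDemo {
  def render(arg: Any): String = arg match {
    case mapArg: Map[String, String] =>
      // The runtime type is just Map; the values may really be Ints, and
      // treating them as Strings throws ClassCastException, mirroring the
      // ParsedStatement.toString bug.
      mapArg.values.map(_.toUpperCase).mkString(",")
    case _ => arg.toString
  }

  def main(args: Array[String]): Unit = {
    println(render(Map("a" -> "x")))  // fine: prints "X"
    println(render(Map("abc" -> 1)))  // throws ClassCastException
  }
}
{code}

The robust alternative is to match on `Map[_, _]` and check the element types explicitly before casting.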
[jira] [Created] (SPARK-27845) DataSourceV2: InsertInto multiple catalogs
John Zhuge created SPARK-27845: -- Summary: DataSourceV2: InsertInto multiple catalogs Key: SPARK-27845 URL: https://issues.apache.org/jira/browse/SPARK-27845 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge Support multiple catalogs in the following InsertInto use cases: * INSERT INTO [TABLE] catalog.db.tbl * INSERT OVERWRITE TABLE catalog.db.tbl * DataFrameWriter.insertInto("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
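A hedged sketch of the intended usage once this support lands; `testcat` is an assumed catalog name, `com.example.TestCatalog` a hypothetical implementation class, and `src` a placeholder source table:

{code:java}
// Register an assumed V2 catalog (implementation class is illustrative).
spark.conf.set("spark.sql.catalog.testcat", "com.example.TestCatalog")

// The three InsertInto paths listed above, against a catalog-qualified name.
spark.sql("INSERT INTO TABLE testcat.db.tbl SELECT * FROM src")
spark.sql("INSERT OVERWRITE TABLE testcat.db.tbl SELECT * FROM src")
spark.table("src").write.insertInto("testcat.db.tbl")
{code}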
[jira] [Updated] (SPARK-27322) DataSourceV2: Select from multiple catalogs
[ https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27322: --- Summary: DataSourceV2: Select from multiple catalogs (was: DataSourceV2: Logical relation in multiple catalogs) > DataSourceV2: Select from multiple catalogs > --- > > Key: SPARK-27322 > URL: https://issues.apache.org/jira/browse/SPARK-27322 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multi-catalog in the following SELECT code paths: > * SELECT * FROM catalog.db.tbl > * TABLE catalog.db.tbl > * JOIN or UNION tables from different catalogs > * SparkSession.table("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27322) DataSourceV2: Logical relation in multiple catalogs
[ https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27322: --- Summary: DataSourceV2: Logical relation in multiple catalogs (was: Select from multiple catalogs) > DataSourceV2: Logical relation in multiple catalogs > --- > > Key: SPARK-27322 > URL: https://issues.apache.org/jira/browse/SPARK-27322 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multi-catalog in the following SELECT code paths: > * SELECT * FROM catalog.db.tbl > * TABLE catalog.db.tbl > * JOIN or UNION tables from different catalogs > * SparkSession.table("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27813) DataSourceV2: Add DropTable logical operation
John Zhuge created SPARK-27813: -- Summary: DataSourceV2: Add DropTable logical operation Key: SPARK-27813 URL: https://issues.apache.org/jira/browse/SPARK-27813 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.3 Reporter: John Zhuge Support DROP TABLE from V2 catalog, e.g., "DROP TABLE testcat.ns1.ns2.tbl" -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
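A short sketch of the use case; only the DROP TABLE statement itself comes from the issue, while the catalog name and implementation class are assumptions:

{code:java}
// Register an assumed V2 catalog, then drop a table through it.
spark.conf.set("spark.sql.catalog.testcat", "com.example.TestCatalog")
spark.sql("DROP TABLE testcat.ns1.ns2.tbl")
{code}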
[jira] [Updated] (SPARK-27322) Select from multiple catalogs
[ https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27322: --- Description: Support multi-catalog in the following SELECT code paths: * SELECT * FROM catalog.db.tbl * TABLE catalog.db.tbl * JOIN or UNION tables from different catalogs * SparkSession.table("catalog.db.tbl") was: Support multi-catalog in the following SELECT code paths: * `SELECT * FROM catalog.db.tbl` * `TABLE catalog.db.tbl` * SELECT with JOIN or UNION * SparkSession.table("catalog.db.tbl") > Select from multiple catalogs > - > > Key: SPARK-27322 > URL: https://issues.apache.org/jira/browse/SPARK-27322 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multi-catalog in the following SELECT code paths: > * SELECT * FROM catalog.db.tbl > * TABLE catalog.db.tbl > * JOIN or UNION tables from different catalogs > * SparkSession.table("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27322) Select from multiple catalogs
[ https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27322: --- Summary: Select from multiple catalogs (was: SELECT from multiple catalogs) > Select from multiple catalogs > - > > Key: SPARK-27322 > URL: https://issues.apache.org/jira/browse/SPARK-27322 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multi-catalog in the following SELECT code paths: > * `SELECT * FROM catalog.db.tbl` > * `TABLE catalog.db.tbl` > * SELECT with JOIN or UNION > * SparkSession.table("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27739) df.persist should save stats from optimized plan
[ https://issues.apache.org/jira/browse/SPARK-27739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27739: --- Summary: df.persist should save stats from optimized plan (was: Persist should use stats from optimized plan) > df.persist should save stats from optimized plan > > > Key: SPARK-27739 > URL: https://issues.apache.org/jira/browse/SPARK-27739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: John Zhuge >Priority: Minor > > CacheManager.cacheQuery passes the stats for `planToCache` to > InMemoryRelation. Since the plan has not been optimized, the stats is > inaccurate because project and filter have not been applied. I'd suggest > passing the stats from the optimized plan. > {code:java} > class CacheManager extends Logging { > ... > def cacheQuery( > query: Dataset[_], > tableName: Option[String] = None, > storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = { > val planToCache = query.logicalPlan > if (lookupCachedData(planToCache).nonEmpty) { > logWarning("Asked to cache already cached data.") > } else { > val sparkSession = query.sparkSession > val inMemoryRelation = InMemoryRelation( > sparkSession.sessionState.conf.useCompression, > sparkSession.sessionState.conf.columnBatchSize, storageLevel, > sparkSession.sessionState.executePlan(planToCache).executedPlan, > tableName, > planToCache) <== > ... > } > object InMemoryRelation { > def apply( > useCompression: Boolean, > batchSize: Int, > storageLevel: StorageLevel, > child: SparkPlan, > tableName: Option[String], > logicalPlan: LogicalPlan): InMemoryRelation = { > val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, > storageLevel, child, tableName) > val relation = new InMemoryRelation(child.output, cacheBuilder, > logicalPlan.outputOrdering) > relation.statsOfPlanToCache = logicalPlan.stats <== > relation > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27739) Persist should use stats from optimized plan
[ https://issues.apache.org/jira/browse/SPARK-27739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27739: --- Summary: Persist should use stats from optimized plan (was: CacheManager.cacheQuery should copy stats from optimized plan) > Persist should use stats from optimized plan > > > Key: SPARK-27739 > URL: https://issues.apache.org/jira/browse/SPARK-27739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: John Zhuge >Priority: Minor > > CacheManager.cacheQuery passes the stats for `planToCache` to > InMemoryRelation. Since the plan has not been optimized, the stats is > inaccurate because project and filter have not been applied. > {code:java} > class CacheManager extends Logging { > ... > def cacheQuery( > query: Dataset[_], > tableName: Option[String] = None, > storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = { > val planToCache = query.logicalPlan > if (lookupCachedData(planToCache).nonEmpty) { > logWarning("Asked to cache already cached data.") > } else { > val sparkSession = query.sparkSession > val inMemoryRelation = InMemoryRelation( > sparkSession.sessionState.conf.useCompression, > sparkSession.sessionState.conf.columnBatchSize, storageLevel, > sparkSession.sessionState.executePlan(planToCache).executedPlan, > tableName, > planToCache) <== > ... > } > object InMemoryRelation { > def apply( > useCompression: Boolean, > batchSize: Int, > storageLevel: StorageLevel, > child: SparkPlan, > tableName: Option[String], > logicalPlan: LogicalPlan): InMemoryRelation = { > val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, > storageLevel, child, tableName) > val relation = new InMemoryRelation(child.output, cacheBuilder, > logicalPlan.outputOrdering) > relation.statsOfPlanToCache = logicalPlan.stats <== > relation > } > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27739) Persist should use stats from optimized plan
[ https://issues.apache.org/jira/browse/SPARK-27739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27739: --- Description: CacheManager.cacheQuery passes the stats for `planToCache` to InMemoryRelation. Since the plan has not been optimized, the stats is inaccurate because project and filter have not been applied. I'd suggest passing the stats from the optimized plan. {code:java} class CacheManager extends Logging { ... def cacheQuery( query: Dataset[_], tableName: Option[String] = None, storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = { val planToCache = query.logicalPlan if (lookupCachedData(planToCache).nonEmpty) { logWarning("Asked to cache already cached data.") } else { val sparkSession = query.sparkSession val inMemoryRelation = InMemoryRelation( sparkSession.sessionState.conf.useCompression, sparkSession.sessionState.conf.columnBatchSize, storageLevel, sparkSession.sessionState.executePlan(planToCache).executedPlan, tableName, planToCache) <== ... } object InMemoryRelation { def apply( useCompression: Boolean, batchSize: Int, storageLevel: StorageLevel, child: SparkPlan, tableName: Option[String], logicalPlan: LogicalPlan): InMemoryRelation = { val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, storageLevel, child, tableName) val relation = new InMemoryRelation(child.output, cacheBuilder, logicalPlan.outputOrdering) relation.statsOfPlanToCache = logicalPlan.stats <== relation } {code} was: CacheManager.cacheQuery passes the stats for `planToCache` to InMemoryRelation. Since the plan has not been optimized, the stats is inaccurate because project and filter have not been applied. {code:java} class CacheManager extends Logging { ... def cacheQuery( query: Dataset[_], tableName: Option[String] = None, storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = { val planToCache = query.logicalPlan if (lookupCachedData(planToCache).nonEmpty) { logWarning("Asked to cache already cached data.") } else { val sparkSession = query.sparkSession val inMemoryRelation = InMemoryRelation( sparkSession.sessionState.conf.useCompression, sparkSession.sessionState.conf.columnBatchSize, storageLevel, sparkSession.sessionState.executePlan(planToCache).executedPlan, tableName, planToCache) <== ... } object InMemoryRelation { def apply( useCompression: Boolean, batchSize: Int, storageLevel: StorageLevel, child: SparkPlan, tableName: Option[String], logicalPlan: LogicalPlan): InMemoryRelation = { val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, storageLevel, child, tableName) val relation = new InMemoryRelation(child.output, cacheBuilder, logicalPlan.outputOrdering) relation.statsOfPlanToCache = logicalPlan.stats <== relation } {code} > Persist should use stats from optimized plan > > > Key: SPARK-27739 > URL: https://issues.apache.org/jira/browse/SPARK-27739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0, 2.4.0 >Reporter: John Zhuge >Priority: Minor > > CacheManager.cacheQuery passes the stats for `planToCache` to > InMemoryRelation. Since the plan has not been optimized, the stats is > inaccurate because project and filter have not been applied. I'd suggest > passing the stats from the optimized plan. > {code:java} > class CacheManager extends Logging { > ... 
> def cacheQuery( > query: Dataset[_], > tableName: Option[String] = None, > storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = { > val planToCache = query.logicalPlan > if (lookupCachedData(planToCache).nonEmpty) { > logWarning("Asked to cache already cached data.") > } else { > val sparkSession = query.sparkSession > val inMemoryRelation = InMemoryRelation( > sparkSession.sessionState.conf.useCompression, > sparkSession.sessionState.conf.columnBatchSize, storageLevel, > sparkSession.sessionState.executePlan(planToCache).executedPlan, > tableName, > planToCache) <== > ... > } > object InMemoryRelation { > def apply( > useCompression: Boolean, > batchSize: Int, > storageLevel: StorageLevel, > child: SparkPlan, > tableName: Option[String], > logicalPlan: LogicalPlan): InMemoryRelation = { > val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, > storageLevel, child, tableName) > val relation = new InMemoryRelation(child.output, cacheBuild
[jira] [Created] (SPARK-27739) CacheManager.cacheQuery should copy stats from optimized plan
John Zhuge created SPARK-27739: -- Summary: CacheManager.cacheQuery should copy stats from optimized plan Key: SPARK-27739 URL: https://issues.apache.org/jira/browse/SPARK-27739 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0, 2.3.0 Reporter: John Zhuge CacheManager.cacheQuery passes the stats for `planToCache` to InMemoryRelation. Since the plan has not been optimized, the stats is inaccurate because project and filter have not been applied. {code:java} class CacheManager extends Logging { ... def cacheQuery( query: Dataset[_], tableName: Option[String] = None, storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = { val planToCache = query.logicalPlan if (lookupCachedData(planToCache).nonEmpty) { logWarning("Asked to cache already cached data.") } else { val sparkSession = query.sparkSession val inMemoryRelation = InMemoryRelation( sparkSession.sessionState.conf.useCompression, sparkSession.sessionState.conf.columnBatchSize, storageLevel, sparkSession.sessionState.executePlan(planToCache).executedPlan, tableName, planToCache) <== ... } object InMemoryRelation { def apply( useCompression: Boolean, batchSize: Int, storageLevel: StorageLevel, child: SparkPlan, tableName: Option[String], logicalPlan: LogicalPlan): InMemoryRelation = { val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, storageLevel, child, tableName) val relation = new InMemoryRelation(child.output, cacheBuilder, logicalPlan.outputOrdering) relation.statsOfPlanToCache = logicalPlan.stats <== relation } {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
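To compare the two estimates involved, one can inspect `sizeInBytes` on the analyzed plan (the stats cacheQuery effectively uses) and on the optimized plan; a minimal sketch, assuming a default SparkSession, where the size of the gap depends on how much the optimizer prunes:

{code:java}
// Build a query where the optimizer can push the filter and prune a column.
val df = spark.range(0L, 1000000L)
  .selectExpr("id", "id AS payload")
  .filter("id < 10")
  .select("id")

// Size estimate before optimization vs. after project/filter pushdown.
println(df.queryExecution.analyzed.stats.sizeInBytes)
println(df.queryExecution.optimizedPlan.stats.sizeInBytes)
{code}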
[jira] [Updated] (SPARK-27322) SELECT from multiple catalogs
[ https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27322: --- Summary: SELECT from multiple catalogs (was: Support SELECT from multiple catalogs) > SELECT from multiple catalogs > - > > Key: SPARK-27322 > URL: https://issues.apache.org/jira/browse/SPARK-27322 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multi-catalog in the following SELECT code paths: > * `SELECT * FROM catalog.db.tbl` > * `TABLE catalog.db.tbl` > * SELECT with JOIN or UNION > * SparkSession.table("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27322) Support SELECT from multiple catalogs
[ https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27322: --- Summary: Support SELECT from multiple catalogs (was: Support multi-catalog in SELECT) > Support SELECT from multiple catalogs > - > > Key: SPARK-27322 > URL: https://issues.apache.org/jira/browse/SPARK-27322 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Major > > Support multi-catalog in the following SELECT code paths: > * `SELECT * FROM catalog.db.tbl` > * `TABLE catalog.db.tbl` > * SELECT with JOIN or UNION > * SparkSession.table("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27322) Support multi-catalog in SELECT
John Zhuge created SPARK-27322: -- Summary: Support multi-catalog in SELECT Key: SPARK-27322 URL: https://issues.apache.org/jira/browse/SPARK-27322 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge Support multi-catalog in the following SELECT code paths: * `SELECT * FROM catalog.db.tbl` * `TABLE catalog.db.tbl` * SELECT with JOIN or UNION * SparkSession.table("catalog.db.tbl") -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
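A hedged sketch of the target behavior; `testcat` and `othercat` are assumed catalogs with illustrative implementation classes:

{code:java}
spark.conf.set("spark.sql.catalog.testcat", "com.example.TestCatalog")
spark.conf.set("spark.sql.catalog.othercat", "com.example.OtherCatalog")

// The SELECT code paths listed above, with catalog-qualified identifiers.
spark.sql("SELECT * FROM testcat.db.tbl").show()
spark.sql("TABLE testcat.db.tbl").show()
spark.table("testcat.db.tbl")
  .join(spark.table("othercat.db.tbl"), "id")  // join across two catalogs
  .show()
{code}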
[jira] [Updated] (SPARK-27250) Scala 2.11 maven compile should target Java 1.8
[ https://issues.apache.org/jira/browse/SPARK-27250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27250: --- Description: Discovered by https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/509/console: {noformat} [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40: Static methods in interface require -target:jvm-1.8 [error] (None, Identifier.of(Array.empty, name)) [error] ^ [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44: Static methods in interface require -target:jvm-1.8 [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last)) [error] ^ [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47: Static methods in interface require -target:jvm-1.8 [error] (None, Identifier.of(parts.init.toArray, parts.last)) [error] ^ [error] three errors found [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s] {noformat} was: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/507/consoleFull {noformat} [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40: Static methods in interface require -target:jvm-1.8 [error] (None, Identifier.of(Array.empty, name)) [error] ^ [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44: Static methods in interface require -target:jvm-1.8 [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last)) [error] ^ [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47: Static methods in interface require -target:jvm-1.8 [error] (None, Identifier.of(parts.init.toArray, parts.last)) [error] ^ [error] three errors found [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s] {noformat} > Scala 2.11 maven compile should target Java 1.8 > --- > > Key: SPARK-27250 > URL: https://issues.apache.org/jira/browse/SPARK-27250 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > Discovered by > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/509/console: > {noformat} > [error] > /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40: > Static methods in interface require -target:jvm-1.8 > [error] (None, Identifier.of(Array.empty, name)) > [error] ^ > [error] > /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44: > Static methods in interface require -target:jvm-1.8 > [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last)) > [error] ^ > [error] > 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47: > Static methods in interface require -target:jvm-1.8 > [error] (None, Identifier.of(parts.init.toArray, parts.last)) > [error] ^ > [error] three errors found > [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27251) SPARK-25196 Scala 2.11 maven build failure
[ https://issues.apache.org/jira/browse/SPARK-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27251: --- Description: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/507/consoleFull {noformat} [info] Compiling 371 Scala sources and 102 Java sources to /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/target/scala-2.11/classes... [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:162: values cannot be volatile [error] @volatile var statsOfPlanToCache: Statistics) {noformat} was: {noformat} [info] Compiling 371 Scala sources and 102 Java sources to /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/target/scala-2.11/classes... [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:162: values cannot be volatile [error] @volatile var statsOfPlanToCache: Statistics) {noformat} > SPARK-25196 Scala 2.11 maven build failure > --- > > Key: SPARK-27251 > URL: https://issues.apache.org/jira/browse/SPARK-27251 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/507/consoleFull > {noformat} > [info] Compiling 371 Scala sources and 102 Java sources to > /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/target/scala-2.11/classes... > [error] > /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:162: > values cannot be volatile > [error] @volatile var statsOfPlanToCache: Statistics) > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27251) SPARK-25196 Scala 2.11 maven build failure
John Zhuge created SPARK-27251: -- Summary: SPARK-25196 Scala 2.11 maven build failure Key: SPARK-27251 URL: https://issues.apache.org/jira/browse/SPARK-27251 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: John Zhuge {noformat} [info] Compiling 371 Scala sources and 102 Java sources to /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/target/scala-2.11/classes... [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:162: values cannot be volatile [error] @volatile var statsOfPlanToCache: Statistics) {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
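The error comes from a general scalac rule: only a `var` may be marked `@volatile`. A minimal sketch of what the compiler accepts and rejects (the Spark failure appears to hit the same rule through a case-class constructor parameter under Scala 2.11):

{code:java}
class VolatileDemo {
  @volatile var mutableStats: String = ""    // compiles: vars may be volatile
  // @volatile val frozenStats: String = ""  // error: values cannot be volatile
}
{code}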
[jira] [Updated] (SPARK-27250) Scala 2.11 maven compile should target Java 1.8
[ https://issues.apache.org/jira/browse/SPARK-27250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-27250: --- Description: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/507/consoleFull {noformat} [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40: Static methods in interface require -target:jvm-1.8 [error] (None, Identifier.of(Array.empty, name)) [error] ^ [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44: Static methods in interface require -target:jvm-1.8 [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last)) [error] ^ [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47: Static methods in interface require -target:jvm-1.8 [error] (None, Identifier.of(parts.init.toArray, parts.last)) [error] ^ [error] three errors found [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s] {noformat} was: {noformat} [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40: Static methods in interface require -target:jvm-1.8 [error] (None, Identifier.of(Array.empty, name)) [error] ^ [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44: Static methods in interface require -target:jvm-1.8 [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last)) [error] ^ [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47: Static methods in interface require -target:jvm-1.8 [error] (None, Identifier.of(parts.init.toArray, parts.last)) [error] ^ [error] three errors found [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s] {noformat} > Scala 2.11 maven compile should target Java 1.8 > --- > > Key: SPARK-27250 > URL: https://issues.apache.org/jira/browse/SPARK-27250 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.0.0 >Reporter: John Zhuge >Priority: Minor > > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/507/consoleFull > {noformat} > [error] > /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40: > Static methods in interface require -target:jvm-1.8 > [error] (None, Identifier.of(Array.empty, name)) > [error] ^ > [error] > /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44: > Static methods in interface require -target:jvm-1.8 > [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last)) > [error] ^ > [error] > /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47: > Static methods in interface require -target:jvm-1.8 > [error] (None, 
Identifier.of(parts.init.toArray, parts.last)) > [error] ^ > [error] three errors found > [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s] > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-27250) Scala 2.11 maven compile should target Java 1.8
John Zhuge created SPARK-27250: -- Summary: Scala 2.11 maven compile should target Java 1.8 Key: SPARK-27250 URL: https://issues.apache.org/jira/browse/SPARK-27250 Project: Spark Issue Type: Bug Components: Build Affects Versions: 3.0.0 Reporter: John Zhuge {noformat} [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40: Static methods in interface require -target:jvm-1.8 [error] (None, Identifier.of(Array.empty, name)) [error] ^ [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44: Static methods in interface require -target:jvm-1.8 [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last)) [error] ^ [error] /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47: Static methods in interface require -target:jvm-1.8 [error] (None, Identifier.of(parts.init.toArray, parts.last)) [error] ^ [error] three errors found [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s] {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
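One plausible fix, sketched against the scala-maven-plugin that Spark's Maven build uses, is to pass `-target:jvm-1.8` to scalac so it can emit calls to static interface methods such as `Identifier.of`. The fragment below is illustrative only; the exact placement in Spark's pom.xml may differ:

{code:xml}
<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <configuration>
    <args>
      <!-- Emit Java 8 bytecode under Scala 2.11 -->
      <arg>-target:jvm-1.8</arg>
    </args>
  </configuration>
</plugin>
{code}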
[jira] [Resolved] (SPARK-20314) Inconsistent error handling in JSON parsing SQL functions
[ https://issues.apache.org/jira/browse/SPARK-20314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge resolved SPARK-20314. Resolution: Duplicate Resolve it as duplicate. > Inconsistent error handling in JSON parsing SQL functions > - > > Key: SPARK-20314 > URL: https://issues.apache.org/jira/browse/SPARK-20314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Eric Wasserman >Priority: Major > > Most parse errors in the JSON parsing SQL functions (e.g. json_tuple, > get_json_object) will return a null(s) if the JSON is badly formed. However, > if Jackson determines that the string includes invalid characters it will > throw an exception (java.io.CharConversionException: Invalid UTF-32 > character) that Spark does not catch. This creates a robustness problem in > that these functions cannot be used at all when there may be dirty data as > these exceptions will kill the jobs. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
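A sketch of the inconsistency described above: ordinarily malformed JSON degrades to null, while character-level encoding errors surface as uncaught exceptions (the latter is not reproducible with a plain string literal, so it is only noted in a comment):

{code:java}
// Badly formed JSON (unterminated object) yields null, not an error.
spark.sql("""SELECT get_json_object('{"a": 1', '$.a')""").show()
spark.sql("""SELECT json_tuple('{"a": 1', 'a')""").show()

// Input that Jackson rejects at the character level (e.g. bytes it reads as
// an invalid UTF-32 character) can instead throw
// java.io.CharConversionException and fail the job.
{code}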
[jira] [Updated] (SPARK-26946) Identifiers for multi-catalog Spark
[ https://issues.apache.org/jira/browse/SPARK-26946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-26946: --- Component/s: (was: Spark Core) > Identifiers for multi-catalog Spark > --- > > Key: SPARK-26946 > URL: https://issues.apache.org/jira/browse/SPARK-26946 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.2 >Reporter: John Zhuge >Priority: Major > > Propose semantics for identifiers and a listing API to support multiple > catalogs. > [~rdblue]'s SPIP: > [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26946) Identifiers for multi-catalog Spark
[ https://issues.apache.org/jira/browse/SPARK-26946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-26946: --- Description: Propose semantics for identifiers and a listing API to support multiple catalogs. [~rdblue]'s SPIP: [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing] was: Propose semantics for identifiers and a listing API to support multiple catalogs. Ryan's SPIP: [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing] > Identifiers for multi-catalog Spark > --- > > Key: SPARK-26946 > URL: https://issues.apache.org/jira/browse/SPARK-26946 > Project: Spark > Issue Type: New Feature > Components: Spark Core, SQL >Affects Versions: 2.3.2 >Reporter: John Zhuge >Priority: Major > > Propose semantics for identifiers and a listing API to support multiple > catalogs. > [~rdblue]'s SPIP: > [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26946) Identifiers for multi-catalog Spark
John Zhuge created SPARK-26946: -- Summary: Identifiers for multi-catalog Spark Key: SPARK-26946 URL: https://issues.apache.org/jira/browse/SPARK-26946 Project: Spark Issue Type: New Feature Components: Spark Core, SQL Affects Versions: 2.3.2 Reporter: John Zhuge Propose semantics for identifiers and a listing API to support multiple catalogs. Ryan's SPIP: [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing] -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26576) Broadcast hint not applied to partitioned table
[ https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-26576: --- Summary: Broadcast hint not applied to partitioned table (was: Broadcast hint not applied to partitioned Parquet table) > Broadcast hint not applied to partitioned table > --- > > Key: SPARK-26576 > URL: https://issues.apache.org/jira/browse/SPARK-26576 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.2, 2.4.0 >Reporter: John Zhuge >Priority: Major > > Broadcast hint is not applied to partitioned Parquet table. Below > "SortMergeJoin" is chosen incorrectly and "ResolvedHit(broadcast)" is removed > in Optimized Plan. > {noformat} > scala> spark.sql("CREATE TABLE jzhuge.parquet_with_part (val STRING) > PARTITIONED BY (dateint INT) STORED AS parquet") > scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") > scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => > df.join(broadcast(df), "dateint").explain(true)) > == Parsed Logical Plan == > 'Join UsingJoin(Inner,List(dateint)) > :- SubqueryAlias `jzhuge`.`parquet_with_part` > : +- Relation[val#28,dateint#29] parquet > +- ResolvedHint (broadcast) >+- SubqueryAlias `jzhuge`.`parquet_with_part` > +- Relation[val#32,dateint#33] parquet > == Analyzed Logical Plan == > dateint: int, val: string, val: string > Project [dateint#29, val#28, val#32] > +- Join Inner, (dateint#29 = dateint#33) >:- SubqueryAlias `jzhuge`.`parquet_with_part` >: +- Relation[val#28,dateint#29] parquet >+- ResolvedHint (broadcast) > +- SubqueryAlias `jzhuge`.`parquet_with_part` > +- Relation[val#32,dateint#33] parquet > == Optimized Logical Plan == > Project [dateint#29, val#28, val#32] > +- Join Inner, (dateint#29 = dateint#33) >:- Project [val#28, dateint#29] >: +- Filter isnotnull(dateint#29) >: +- Relation[val#28,dateint#29] parquet >+- Project [val#32, dateint#33] > +- Filter isnotnull(dateint#33) > +- Relation[val#32,dateint#33] parquet > == Physical Plan == > *(5) Project [dateint#29, val#28, val#32] > +- *(5) SortMergeJoin [dateint#29], [dateint#33], Inner >:- *(2) Sort [dateint#29 ASC NULLS FIRST], false, 0 >: +- Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, > 500), coordinator[target post-shuffle partition size: 67108864] >: +- *(1) FileScan parquet jzhuge.parquet_with_part[val#28,dateint#29] > Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], > PartitionCount: 0, PartitionFilters: [isnotnull(dateint#29)], PushedFilters: > [], ReadSchema: struct >+- *(4) Sort [dateint#33 ASC NULLS FIRST], false, 0 > +- ReusedExchange [val#32, dateint#33], Exchange(coordinator id: > 55629191) hashpartitioning(dateint#29, 500), coordinator[target post-shuffle > partition size: 67108864] > {noformat} > Broadcast hint is applied to Parquet table without partition. Below > "BroadcastHashJoin" is chosen as expected. 
> {noformat} > scala> spark.sql("CREATE TABLE jzhuge.parquet_no_part (val STRING, dateint > INT) STORED AS parquet") > scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") > scala> Seq(spark.table("jzhuge.parquet_no_part")).map(df => > df.join(broadcast(df), "dateint").explain(true)) > == Parsed Logical Plan == > 'Join UsingJoin(Inner,List(dateint)) > :- SubqueryAlias `jzhuge`.`parquet_no_part` > : +- Relation[val#44,dateint#45] parquet > +- ResolvedHint (broadcast) >+- SubqueryAlias `jzhuge`.`parquet_no_part` > +- Relation[val#50,dateint#51] parquet > == Analyzed Logical Plan == > dateint: int, val: string, val: string > Project [dateint#45, val#44, val#50] > +- Join Inner, (dateint#45 = dateint#51) >:- SubqueryAlias `jzhuge`.`parquet_no_part` >: +- Relation[val#44,dateint#45] parquet >+- ResolvedHint (broadcast) > +- SubqueryAlias `jzhuge`.`parquet_no_part` > +- Relation[val#50,dateint#51] parquet > == Optimized Logical Plan == > Project [dateint#45, val#44, val#50] > +- Join Inner, (dateint#45 = dateint#51) >:- Filter isnotnull(dateint#45) >: +- Relation[val#44,dateint#45] parquet >+- ResolvedHint (broadcast) > +- Filter isnotnull(dateint#51) > +- Relation[val#50,dateint#51] parquet > == Physical Plan == > *(2) Project [dateint#45, val#44, val#50] > +- *(2) BroadcastHashJoin [dateint#45], [dateint#51], Inner, BuildRight >:- *(2) Project [val#44, dateint#45] >: +- *(2) Filter isnotnull(dateint#45) >: +- *(2) FileScan parquet jzhuge.parquet_no_part[val#44,dateint#45] > Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], > PartitionFilters: [], PushedFilters: [IsNotNull(dateint)], ReadSchema: > struct >+- BroadcastExchange HashedRelationBroadc
[jira] [Updated] (SPARK-26576) Broadcast hint not applied to partitioned Parquet table
[ https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Zhuge updated SPARK-26576: --- Affects Version/s: 2.2.2 > Broadcast hint not applied to partitioned Parquet table > --- > > Key: SPARK-26576 > URL: https://issues.apache.org/jira/browse/SPARK-26576 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.2, 2.3.2, 2.4.0 >Reporter: John Zhuge >Priority: Major > > Broadcast hint is not applied to partitioned Parquet table. Below > "SortMergeJoin" is chosen incorrectly and "ResolvedHit(broadcast)" is removed > in Optimized Plan. > {noformat} > scala> spark.sql("CREATE TABLE jzhuge.parquet_with_part (val STRING) > PARTITIONED BY (dateint INT) STORED AS parquet") > scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") > scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => > df.join(broadcast(df), "dateint").explain(true)) > == Parsed Logical Plan == > 'Join UsingJoin(Inner,List(dateint)) > :- SubqueryAlias `jzhuge`.`parquet_with_part` > : +- Relation[val#28,dateint#29] parquet > +- ResolvedHint (broadcast) >+- SubqueryAlias `jzhuge`.`parquet_with_part` > +- Relation[val#32,dateint#33] parquet > == Analyzed Logical Plan == > dateint: int, val: string, val: string > Project [dateint#29, val#28, val#32] > +- Join Inner, (dateint#29 = dateint#33) >:- SubqueryAlias `jzhuge`.`parquet_with_part` >: +- Relation[val#28,dateint#29] parquet >+- ResolvedHint (broadcast) > +- SubqueryAlias `jzhuge`.`parquet_with_part` > +- Relation[val#32,dateint#33] parquet > == Optimized Logical Plan == > Project [dateint#29, val#28, val#32] > +- Join Inner, (dateint#29 = dateint#33) >:- Project [val#28, dateint#29] >: +- Filter isnotnull(dateint#29) >: +- Relation[val#28,dateint#29] parquet >+- Project [val#32, dateint#33] > +- Filter isnotnull(dateint#33) > +- Relation[val#32,dateint#33] parquet > == Physical Plan == > *(5) Project [dateint#29, val#28, val#32] > +- *(5) SortMergeJoin [dateint#29], [dateint#33], Inner >:- *(2) Sort [dateint#29 ASC NULLS FIRST], false, 0 >: +- Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, > 500), coordinator[target post-shuffle partition size: 67108864] >: +- *(1) FileScan parquet jzhuge.parquet_with_part[val#28,dateint#29] > Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], > PartitionCount: 0, PartitionFilters: [isnotnull(dateint#29)], PushedFilters: > [], ReadSchema: struct >+- *(4) Sort [dateint#33 ASC NULLS FIRST], false, 0 > +- ReusedExchange [val#32, dateint#33], Exchange(coordinator id: > 55629191) hashpartitioning(dateint#29, 500), coordinator[target post-shuffle > partition size: 67108864] > {noformat} > Broadcast hint is applied to Parquet table without partition. Below > "BroadcastHashJoin" is chosen as expected. 
> {noformat} > scala> spark.sql("CREATE TABLE jzhuge.parquet_no_part (val STRING, dateint > INT) STORED AS parquet") > scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") > scala> Seq(spark.table("jzhuge.parquet_no_part")).map(df => > df.join(broadcast(df), "dateint").explain(true)) > == Parsed Logical Plan == > 'Join UsingJoin(Inner,List(dateint)) > :- SubqueryAlias `jzhuge`.`parquet_no_part` > : +- Relation[val#44,dateint#45] parquet > +- ResolvedHint (broadcast) >+- SubqueryAlias `jzhuge`.`parquet_no_part` > +- Relation[val#50,dateint#51] parquet > == Analyzed Logical Plan == > dateint: int, val: string, val: string > Project [dateint#45, val#44, val#50] > +- Join Inner, (dateint#45 = dateint#51) >:- SubqueryAlias `jzhuge`.`parquet_no_part` >: +- Relation[val#44,dateint#45] parquet >+- ResolvedHint (broadcast) > +- SubqueryAlias `jzhuge`.`parquet_no_part` > +- Relation[val#50,dateint#51] parquet > == Optimized Logical Plan == > Project [dateint#45, val#44, val#50] > +- Join Inner, (dateint#45 = dateint#51) >:- Filter isnotnull(dateint#45) >: +- Relation[val#44,dateint#45] parquet >+- ResolvedHint (broadcast) > +- Filter isnotnull(dateint#51) > +- Relation[val#50,dateint#51] parquet > == Physical Plan == > *(2) Project [dateint#45, val#44, val#50] > +- *(2) BroadcastHashJoin [dateint#45], [dateint#51], Inner, BuildRight >:- *(2) Project [val#44, dateint#45] >: +- *(2) Filter isnotnull(dateint#45) >: +- *(2) FileScan parquet jzhuge.parquet_no_part[val#44,dateint#45] > Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], > PartitionFilters: [], PushedFilters: [IsNotNull(dateint)], ReadSchema: > struct >+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, > true] as bigint))) > +- *(1) Project [va
[jira] [Comment Edited] (SPARK-26576) Broadcast hint not applied to partitioned Parquet table
[ https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737964#comment-16737964 ] John Zhuge edited comment on SPARK-26576 at 1/9/19 5:12 PM: No issue on the master branch. Please note "rightHint=(broadcast)" for the Join in Optimized Plan. {noformat} scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => df.join(broadcast(df), "dateint").explain(true)) == Parsed Logical Plan == 'Join UsingJoin(Inner,List(dateint)) :- SubqueryAlias `jzhuge`.`parquet_with_part` : +- Relation[val#34,dateint#35] parquet +- ResolvedHint (broadcast) +- SubqueryAlias `jzhuge`.`parquet_with_part` +- Relation[val#40,dateint#41] parquet == Analyzed Logical Plan == dateint: int, val: string, val: string Project [dateint#35, val#34, val#40] +- Join Inner, (dateint#35 = dateint#41) :- SubqueryAlias `jzhuge`.`parquet_with_part` : +- Relation[val#34,dateint#35] parquet +- ResolvedHint (broadcast) +- SubqueryAlias `jzhuge`.`parquet_with_part` +- Relation[val#40,dateint#41] parquet == Optimized Logical Plan == Project [dateint#35, val#34, val#40] +- Join Inner, (dateint#35 = dateint#41), rightHint=(broadcast) :- Project [val#34, dateint#35] : +- Filter isnotnull(dateint#35) : +- Relation[val#34,dateint#35] parquet +- Project [val#40, dateint#41] +- Filter isnotnull(dateint#41) +- Relation[val#40,dateint#41] parquet == Physical Plan == *(2) Project [dateint#35, val#34, val#40] +- *(2) BroadcastHashJoin [dateint#35], [dateint#41], Inner, BuildRight :- *(2) FileScan parquet jzhuge.parquet_with_part[val#34,dateint#35] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dateint#35)], PushedFilters: [], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, true] as bigint))) +- *(1) FileScan parquet jzhuge.parquet_with_part[val#40,dateint#41] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dateint#41)], PushedFilters: [], ReadSchema: struct {noformat} >From a quick look at the source, EliminateResolvedHint pulls broadcast hint >into Join and eliminates the ResolvedHint node. It is called before >PruneFileSourcePartitions so the above code in >PhysicalOperation.collectProjectsAndFilters is never called on master branch >for the few cases I tried. was (Author: jzhuge): No issue on the master branch. Please note "rightHint=(broadcast)" for the Join in Optimized Plan. 
{noformat} scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => df.join(broadcast(df), "dateint").explain(true)) == Parsed Logical Plan == 'Join UsingJoin(Inner,List(dateint)) :- SubqueryAlias `jzhuge`.`parquet_with_part` : +- Relation[val#34,dateint#35] parquet +- ResolvedHint (broadcast) +- SubqueryAlias `jzhuge`.`parquet_with_part` +- Relation[val#40,dateint#41] parquet == Analyzed Logical Plan == dateint: int, val: string, val: string Project [dateint#35, val#34, val#40] +- Join Inner, (dateint#35 = dateint#41) :- SubqueryAlias `jzhuge`.`parquet_with_part` : +- Relation[val#34,dateint#35] parquet +- ResolvedHint (broadcast) +- SubqueryAlias `jzhuge`.`parquet_with_part` +- Relation[val#40,dateint#41] parquet == Optimized Logical Plan == Project [dateint#35, val#34, val#40] +- Join Inner, (dateint#35 = dateint#41), rightHint=(broadcast) :- Project [val#34, dateint#35] : +- Filter isnotnull(dateint#35) : +- Relation[val#34,dateint#35] parquet +- Project [val#40, dateint#41] +- Filter isnotnull(dateint#41) +- Relation[val#40,dateint#41] parquet == Physical Plan == *(2) Project [dateint#35, val#34, val#40] +- *(2) BroadcastHashJoin [dateint#35], [dateint#41], Inner, BuildRight :- *(2) FileScan parquet jzhuge.parquet_with_part[val#34,dateint#35] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dateint#35)], PushedFilters: [], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, true] as bigint))) +- *(1) FileScan parquet jzhuge.parquet_with_part[val#40,dateint#41] Batched: true, DataFilters: [], Format: Parquet, Location: PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: [isnotnull(dateint#41)], PushedFilters: [], ReadSchema: struct {noformat} >From a quick look at the source, EliminateResolvedHint pulls broadcast hint >into Join and eliminates the ResolvedHint node. It is called before >PruneFileSourcePartitions so the above code in >PhysicalOperation.collectProjectsAndFilters is never called on master branch. > Broadcast hint not applied to partitioned Parquet table > ---