[jira] [Resolved] (SPARK-46486) DataSourceV2: Restore createOrReplaceView

2023-12-22 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-46486?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge resolved SPARK-46486.

Resolution: Not A Problem

> DataSourceV2: Restore createOrReplaceView
> -
>
> Key: SPARK-46486
> URL: https://issues.apache.org/jira/browse/SPARK-46486
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.5.1
>Reporter: John Zhuge
>Priority: Trivial
>
> [https://github.com/apache/spark/pull/44330] for SPARK-45807 accidentally 
> removed createOrReplaceView.






[jira] [Created] (SPARK-46486) DataSourceV2: Restore createOrReplaceView

2023-12-21 Thread John Zhuge (Jira)
John Zhuge created SPARK-46486:
--

 Summary: DataSourceV2: Restore createOrReplaceView
 Key: SPARK-46486
 URL: https://issues.apache.org/jira/browse/SPARK-46486
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.5.1
Reporter: John Zhuge


[https://github.com/apache/spark/pull/44330] for SPARK-45807 accidentally 
removed createOrReplaceView.






[jira] [Updated] (SPARK-39911) Optimize global Sort to RepartitionByExpression

2023-12-01 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-39911:
---
Fix Version/s: 3.3.1

> Optimize global Sort to RepartitionByExpression
> ---
>
> Key: SPARK-39911
> URL: https://issues.apache.org/jira/browse/SPARK-39911
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: XiDuo You
>Assignee: XiDuo You
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.3.1, 3.4.0
>
>







[jira] [Resolved] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge resolved SPARK-45657.

Fix Version/s: 3.5.0
   Resolution: Fixed

The issue is fixed in 3.5.0.

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.4.1
>Reporter: John Zhuge
>Priority: Major
> Fix For: 3.5.0
>
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  






[jira] [Updated] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-45657:
---
Affects Version/s: 3.4.1
   3.4.0

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0, 3.4.1
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  






[jira] [Comment Edited] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779281#comment-17779281
 ] 

John Zhuge edited comment on SPARK-45657 at 10/25/23 4:55 AM:
--

Root cause:
 # A SQL UNION of two sides with different data types produces a Project of a
Project on one side to cast the type. When this is cached, the
Project-of-Project structure is preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions`, which processes all unions in the
tree. CombineUnions collapses the two Projects into one, so Dataset.union of
the above plan with any plan will not be able to find a matching cached plan.
{code:java}
object CombineUnions extends Rule[LogicalPlan] {
...
  private def flattenUnion(union: Union, flattenDistinct: Boolean):
...
    case p1 @ Project(_, p2: Project)
  if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = 
false) &&
!p1.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) &&
!p2.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) =>
  val newProjectList = buildCleanedProjectList(p1.projectList, 
p2.projectList)
  stack.pushAll(Seq(p2.copy(projectList = newProjectList))){code}
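A quick way to see the effect described above (an illustrative sketch, not part of the original analysis; the helper name `usesCache` is made up) is to check whether a Dataset's optimized plan contains an `InMemoryRelation` node:
{code:java}
// Returns true when the optimizer substituted a cached plan (InMemoryRelation).
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.execution.columnar.InMemoryRelation

def usesCache(ds: Dataset[_]): Boolean =
  ds.queryExecution.optimizedPlan.collectFirst { case r: InMemoryRelation => r }.isDefined

// Per this report: false for the Dataset.union variant, true for the pure SQL UNION variant.
{code}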

 


was (Author: jzhuge):
Root cause:
 # SQL UNION of 2 sides with different data types produce a Project of Project 
on 1 side to cast the type. When this is cached, the Project of Project is 
preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions` which applies to all unions in the 
tree. CombineUnions collapses the 2 Projects into 1, thus Dataset.union of the 
above plan with any plan will not be find a matching cached plan.
{code:java}
object CombineUnions extends Rule[LogicalPlan] {
...
  private def flattenUnion(union: Union, flattenDistinct: Boolean):
...
    case p1 @ Project(_, p2: Project)
  if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = 
false) &&
!p1.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) &&
!p2.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) =>
  val newProjectList = buildCleanedProjectList(p1.projectList, 
p2.projectList)
  stack.pushAll(Seq(p2.copy(projectList = newProjectList))){code}

 

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  




[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779326#comment-17779326
 ] 

John Zhuge commented on SPARK-45657:


It is fixed in the main branch:
{code:java}
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 4.0.0-SNAPSHOT
      /_/

Using Scala version 2.13.12 (OpenJDK 64-Bit Server VM, Java 17.0.7)
Type in expressions to have them evaluated.
Type :help for more information.
23/10/24 21:30:31 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Spark context Web UI available at http://192.168.86.29:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1698208231783).
Spark session available as 'spark'.

scala> spark.sql("select 1 id union select 's2' id").cache()
val res0: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [id: string]

scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 's3'")).queryExecution.optimizedPlan
val res1: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Union false, false
:- InMemoryRelation [id#11], StorageLevel(disk, memory, deserialized, 1 
replicas)
:     +- AdaptiveSparkPlan isFinalPlan=false
:        +- HashAggregate(keys=[id#2], functions=[], output=[id#2])
:           +- Exchange hashpartitioning(id#2, 200), ENSURE_REQUIREMENTS, 
[plan_id=30]
:              +- HashAggregate(keys=[id#2], functions=[], output=[id#2])
:                 +- Union
:                    :- Project [1 AS id#2]
:                    :  +- Scan OneRowRelation[]
:                    +- Project [s2 AS id#1]
:                       +- Scan OneRowRelation[]
+- Project [s3 AS s3#13]
   +- OneRowRelation {code}

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  






[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779283#comment-17779283
 ] 

John Zhuge commented on SPARK-45657:


Interesting, there is a warning in Dataset.union:
{code:java}
def union(other: Dataset[T]): Dataset[T] = withSetOperator {
  // This breaks caching, but it's usually ok because it addresses a very 
specific use case:
  // using union to union many files or partitions.
  CombineUnions(Union(logicalPlan, other.logicalPlan))
} {code}
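Given that, a workaround sketch based on the observation in this ticket (not part of the original comment): express the outer UNION in SQL rather than through Dataset.union, so the cached plan is still matched:
{code:java}
// Cache the mixed-type UNION as before.
spark.sql("select 1 id union select 's2' id").cache()

// Dataset.union: optimized plan contains no InMemoryRelation (cache is not used).
spark.sql("select 1 id union select 's2' id")
  .union(spark.sql("select 's3'"))
  .queryExecution.optimizedPlan

// SQL-level union: optimized plan contains InMemoryRelation (cache is used).
spark.sql("(select 1 id union select 's2' id) union select 's3'")
  .queryExecution.optimizedPlan
{code}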

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  






[jira] [Updated] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-45657:
---
Affects Version/s: 3.3.2
   (was: 3.4.1)

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  






[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779282#comment-17779282
 ] 

John Zhuge commented on SPARK-45657:


Checking whether this is still an issue in the main branch.

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  






[jira] [Comment Edited] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779281#comment-17779281
 ] 

John Zhuge edited comment on SPARK-45657 at 10/25/23 12:38 AM:
---

Root cause:
 # A SQL UNION of two sides with different data types produces a Project of a
Project on one side to cast the type. When this is cached, the
Project-of-Project structure is preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions`, which processes all unions in the
tree. CombineUnions collapses the two Projects into one, so Dataset.union of
the above plan with any plan will not be able to find a matching cached plan.
{code:java}
object CombineUnions extends Rule[LogicalPlan] {
...
  private def flattenUnion(union: Union, flattenDistinct: Boolean):
...
    case p1 @ Project(_, p2: Project)
  if canCollapseExpressions(p1.projectList, p2.projectList, alwaysInline = 
false) &&
!p1.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) &&
!p2.projectList.exists(SubqueryExpression.hasCorrelatedSubquery) =>
  val newProjectList = buildCleanedProjectList(p1.projectList, 
p2.projectList)
  stack.pushAll(Seq(p2.copy(projectList = newProjectList))){code}

 


was (Author: jzhuge):
Root cause:
 # SQL UNION of 2 sides with different data types produce a Project of Project 
on 1 side to cast the type. When this is cached, the Project of Project is 
preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions` which applies to all unions in the 
tree. CombineUnions collapses the 2 Projects into 1, thus Dataset.union of the 
above plan with any plan will not be find a matching cached plan.

 

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  






[jira] [Comment Edited] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779281#comment-17779281
 ] 

John Zhuge edited comment on SPARK-45657 at 10/25/23 12:36 AM:
---

Root cause:
 # A SQL UNION of two sides with different data types produces a Project of a
Project on one side to cast the type. When this is cached, the
Project-of-Project structure is preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions`, which processes all unions in the
tree. CombineUnions collapses the two Projects into one, so Dataset.union of
the above plan with any plan will not be able to find a matching cached plan.

 


was (Author: jzhuge):
Root cause:
 # SQL UNION of 2 sides with different data types produce a Project of Project 
on 1 side to cast the type. When this is cached, the Project of Project is 
preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions` which applies to all unions in the 
tree. CombineUnions collapses the 2 Projects into 1. Thus Dataset.union of the 
above plan with any plan will not be find a matching cached plan.

 

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  






[jira] [Commented] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-45657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17779281#comment-17779281
 ] 

John Zhuge commented on SPARK-45657:


Root cause:
 # A SQL UNION of two sides with different data types produces a Project of a
Project on one side to cast the type. When this is cached, the
Project-of-Project structure is preserved.
{noformat}
Distinct
+- Union false, false
   :- Project [cast(id#153 as string) AS id#155]
   :  +- Project [1 AS id#153]
   :     +- OneRowRelation
   +- Project [s2 AS id#154]
      +- OneRowRelation{noformat}

 # Dataset.union applies `CombineUnions`, which processes all unions in the
tree. CombineUnions collapses the two Projects into one. Thus Dataset.union of
the above plan with any plan will not be able to find a matching cached plan.

 

> Caching SQL UNION of different column data types does not work inside 
> Dataset.union
> ---
>
> Key: SPARK-45657
> URL: https://issues.apache.org/jira/browse/SPARK-45657
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1
>Reporter: John Zhuge
>Priority: Major
>
>  
> Cache SQL UNION of 2 sides with different column data types
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
> Dataset.union does not leverage the cache
> {code:java}
> scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
> 's3'")).queryExecution.optimizedPlan
> res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Union false, false
> :- Aggregate [id#109], [id#109]
> :  +- Union false, false
> :     :- Project [1 AS id#109]
> :     :  +- OneRowRelation
> :     +- Project [s2 AS id#108]
> :        +- OneRowRelation
> +- Project [s3 AS s3#111]
>    +- OneRowRelation {code}
> SQL UNION of the cached SQL UNION does use the cache! Please note 
> `InMemoryRelation` used.
> {code:java}
> scala> spark.sql("(select 1 id union select 's2' id) union select 
> 's3'").queryExecution.optimizedPlan
> res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
> Aggregate [id#117], [id#117]
> +- Union false, false
>    :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
> replicas)
>    :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
>    :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
> [plan_id=241]
>    :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
> output=[id#100])
>    :              +- Union
>    :                 :- *(1) Project [1 AS id#100]
>    :                 :  +- *(1) Scan OneRowRelation[]
>    :                 +- *(2) Project [s2 AS id#99]
>    :                    +- *(2) Scan OneRowRelation[]
>    +- Project [s3 AS s3#116]
>       +- OneRowRelation {code}
>  






[jira] [Created] (SPARK-45657) Caching SQL UNION of different column data types does not work inside Dataset.union

2023-10-24 Thread John Zhuge (Jira)
John Zhuge created SPARK-45657:
--

 Summary: Caching SQL UNION of different column data types does not 
work inside Dataset.union
 Key: SPARK-45657
 URL: https://issues.apache.org/jira/browse/SPARK-45657
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1
Reporter: John Zhuge


 

Cache SQL UNION of 2 sides with different column data types
{code:java}
scala> spark.sql("select 1 id union select 's2' id").cache()  {code}
Dataset.union does not leverage the cache
{code:java}
scala> spark.sql("select 1 id union select 's2' id").union(spark.sql("select 
's3'")).queryExecution.optimizedPlan
res15: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Union false, false
:- Aggregate [id#109], [id#109]
:  +- Union false, false
:     :- Project [1 AS id#109]
:     :  +- OneRowRelation
:     +- Project [s2 AS id#108]
:        +- OneRowRelation
+- Project [s3 AS s3#111]
   +- OneRowRelation {code}
SQL UNION of the cached SQL UNION does use the cache! Please note 
`InMemoryRelation` used.
{code:java}
scala> spark.sql("(select 1 id union select 's2' id) union select 
's3'").queryExecution.optimizedPlan
res16: org.apache.spark.sql.catalyst.plans.logical.LogicalPlan =
Aggregate [id#117], [id#117]
+- Union false, false
   :- InMemoryRelation [id#117], StorageLevel(disk, memory, deserialized, 1 
replicas)
   :     +- *(4) HashAggregate(keys=[id#100], functions=[], output=[id#100])
   :        +- Exchange hashpartitioning(id#100, 500), ENSURE_REQUIREMENTS, 
[plan_id=241]
   :           +- *(3) HashAggregate(keys=[id#100], functions=[], 
output=[id#100])
   :              +- Union
   :                 :- *(1) Project [1 AS id#100]
   :                 :  +- *(1) Scan OneRowRelation[]
   :                 +- *(2) Project [s2 AS id#99]
   :                    +- *(2) Scan OneRowRelation[]
   +- Project [s3 AS s3#116]
      +- OneRowRelation {code}
 






[jira] [Commented] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE

2023-05-05 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719925#comment-17719925
 ] 

John Zhuge commented on SPARK-43288:


TODO: fix a few unit tests

> DataSourceV2: CREATE TABLE LIKE
> ---
>
> Key: SPARK-43288
> URL: https://issues.apache.org/jira/browse/SPARK-43288
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: John Zhuge
>Priority: Major
>
> Support CREATE TABLE LIKE in DSv2.






[jira] [Commented] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE

2023-05-05 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719924#comment-17719924
 ] 

John Zhuge commented on SPARK-43288:


Please review the WIP PR: https://github.com/apache/spark/pull/40963

> DataSourceV2: CREATE TABLE LIKE
> ---
>
> Key: SPARK-43288
> URL: https://issues.apache.org/jira/browse/SPARK-43288
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: John Zhuge
>Priority: Major
>
> Support CREATE TABLE LIKE in DSv2.






[jira] [Updated] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE

2023-04-26 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-43288:
---
Issue Type: Improvement  (was: New Feature)

> DataSourceV2: CREATE TABLE LIKE
> ---
>
> Key: SPARK-43288
> URL: https://issues.apache.org/jira/browse/SPARK-43288
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: John Zhuge
>Priority: Minor
>
> Support CREATE TABLE LIKE in DSv2.






[jira] [Updated] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE

2023-04-26 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-43288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-43288:
---
Priority: Major  (was: Minor)

> DataSourceV2: CREATE TABLE LIKE
> ---
>
> Key: SPARK-43288
> URL: https://issues.apache.org/jira/browse/SPARK-43288
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: John Zhuge
>Priority: Major
>
> Support CREATE TABLE LIKE in DSv2.






[jira] [Created] (SPARK-43288) DataSourceV2: CREATE TABLE LIKE

2023-04-25 Thread John Zhuge (Jira)
John Zhuge created SPARK-43288:
--

 Summary: DataSourceV2: CREATE TABLE LIKE
 Key: SPARK-43288
 URL: https://issues.apache.org/jira/browse/SPARK-43288
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.4.0
Reporter: John Zhuge


Support CREATE TABLE LIKE in DSv2.
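For context, a minimal example of the statement this ticket targets (an illustrative sketch; the catalog and table names below are placeholders, and support for DSv2 catalogs is what would be added):
{code:java}
// CREATE TABLE ... LIKE creates an empty table with the schema of an existing table.
spark.sql("CREATE TABLE my_catalog.db.new_tbl LIKE my_catalog.db.src_tbl")
{code}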






[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default

2023-02-27 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-42613:
---
Description: 
Follow up from 
[https://github.com/apache/spark/pull/40199#discussion_r1119453996]

If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` 
instead of `spark.executor.cores` as described in [PR 
#38699|https://github.com/apache/spark/pull/38699].

  was:
Follow up from 
[https://github.com/apache/spark/pull/40199#discussion_r1119453996]

 

If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` 
instead of `spark.executor.cores`. Otherwise, we will still have issues when 
executer core is set to a very large number but task cpus is 1.


> PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor 
> cores by default
> -
>
> Key: SPARK-42613
> URL: https://issues.apache.org/jira/browse/SPARK-42613
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> Follow up from 
> [https://github.com/apache/spark/pull/40199#discussion_r1119453996]
> If OMP_NUM_THREADS is not set explicitly, we should set it to 
> `spark.task.cpus` instead of `spark.executor.cores` as described in [PR 
> #38699|https://github.com/apache/spark/pull/38699].






[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default

2023-02-27 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-42613:
---
Summary: PythonRunner should set OMP_NUM_THREADS to task cpus instead of 
executor cores by default  (was: PythonRunner should set OMP_NUM_THREADS to 
task cpus times executor cores by default)

> PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor 
> cores by default
> -
>
> Key: SPARK-42613
> URL: https://issues.apache.org/jira/browse/SPARK-42613
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> Follow up from 
> [https://github.com/apache/spark/pull/40199#discussion_r1119453996]
>  
> If OMP_NUM_THREADS is not set explicitly, we should set it to 
> `spark.task.cpus x spark.executor.cores`. Otherwise, we will still have 
> issues when the executor core count is set to a very large number but task cpus is 1.






[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default

2023-02-27 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-42613:
---
Description: 
Follow up from 
[https://github.com/apache/spark/pull/40199#discussion_r1119453996]

 

If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` 
instead of `spark.executor.cores`. Otherwise, we will still have issues when 
the executor core count is set to a very large number but task cpus is 1.

  was:
Follow up from 
[https://github.com/apache/spark/pull/40199#discussion_r1119453996]

 

If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus 
x spark.executor.cores`. Otherwise, we will still have issues when executer 
core is set to a very large number but task cpus is 1.


> PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor 
> cores by default
> -
>
> Key: SPARK-42613
> URL: https://issues.apache.org/jira/browse/SPARK-42613
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> Follow up from 
> [https://github.com/apache/spark/pull/40199#discussion_r1119453996]
>  
> If OMP_NUM_THREADS is not set explicitly, we should set it to 
> `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still 
> have issues when the executor core count is set to a very large number but 
> task cpus is 1.






[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by default

2023-02-27 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-42613:
---
Description: 
Follow up from 
[https://github.com/apache/spark/pull/40199#discussion_r1119453996]

 

If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus 
x spark.executor.cores`. Otherwise, we will still have issues when the 
executor core count is set to a very large number but task cpus is 1.

  was:
Follow up from 
[https://github.com/apache/spark/pull/40199#discussion_r1119453996]

 

If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` 
instead of `spark.executor.cores`. Otherwise, we will still have issues when 
executer core is set to a very large number but task cpus is 1.


> PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by 
> default
> 
>
> Key: SPARK-42613
> URL: https://issues.apache.org/jira/browse/SPARK-42613
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> Follow up from 
> [https://github.com/apache/spark/pull/40199#discussion_r1119453996]
>  
> If OMP_NUM_THREADS is not set explicitly, we should set it to 
> `spark.task.cpus x spark.executor.cores`. Otherwise, we will still have 
> issues when the executor core count is set to a very large number but task cpus is 1.






[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by default

2023-02-27 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-42613:
---
Summary: PythonRunner should set OMP_NUM_THREADS to task cpus times 
executor cores by default  (was: PythonRunner should set OMP_NUM_THREADS to 
task cpus instead of executor cores by default)

> PythonRunner should set OMP_NUM_THREADS to task cpus times executor cores by 
> default
> 
>
> Key: SPARK-42613
> URL: https://issues.apache.org/jira/browse/SPARK-42613
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> Follow up from 
> [https://github.com/apache/spark/pull/40199#discussion_r1119453996]
>  
> If OMP_NUM_THREADS is not set explicitly, we should set it to 
> `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still 
> have issues when the executor core count is set to a very large number but 
> task cpus is 1.






[jira] [Updated] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default

2023-02-27 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-42613:
---
Description: 
Follow up from 
[https://github.com/apache/spark/pull/40199#discussion_r1119453996]

 

If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` 
instead of `spark.executor.cores`. Otherwise, we will still have issues when 
the executor core count is set to a very large number but task cpus is 1.

  was:
Coming from [https://github.com/apache/spark/pull/40199#discussion_r1119453996]

 

If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` 
instead of `spark.executor.cores`. Otherwise, we will still have issues when 
executer core is set to a very large number but task cpus is 1.


> PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor 
> cores by default
> -
>
> Key: SPARK-42613
> URL: https://issues.apache.org/jira/browse/SPARK-42613
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> Follow up from 
> [https://github.com/apache/spark/pull/40199#discussion_r1119453996]
>  
> If OMP_NUM_THREADS is not set explicitly, we should set it to 
> `spark.task.cpus` instead of `spark.executor.cores`. Otherwise, we will still 
> have issues when the executor core count is set to a very large number but 
> task cpus is 1.






[jira] [Created] (SPARK-42613) PythonRunner should set OMP_NUM_THREADS to task cpus instead of executor cores by default

2023-02-27 Thread John Zhuge (Jira)
John Zhuge created SPARK-42613:
--

 Summary: PythonRunner should set OMP_NUM_THREADS to task cpus 
instead of executor cores by default
 Key: SPARK-42613
 URL: https://issues.apache.org/jira/browse/SPARK-42613
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 3.3.0
Reporter: John Zhuge


Coming from [https://github.com/apache/spark/pull/40199#discussion_r1119453996]

 

If OMP_NUM_THREADS is not set explicitly, we should set it to `spark.task.cpus` 
instead of `spark.executor.cores`. Otherwise, we will still have issues when 
the executor core count is set to a very large number but task cpus is 1.
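A rough sketch of the proposed defaulting behaviour (illustrative only, not the actual PythonRunner change; the helper name is hypothetical):
{code:java}
import org.apache.spark.SparkConf

// Only compute a default when the user has not set OMP_NUM_THREADS explicitly,
// and base that default on task cpus rather than executor cores.
def defaultOmpNumThreads(conf: SparkConf): Option[(String, String)] =
  if (conf.getOption("spark.executorEnv.OMP_NUM_THREADS").isDefined) {
    None // respect the explicit user setting
  } else {
    Some("OMP_NUM_THREADS" -> conf.get("spark.task.cpus", "1"))
  }
{code}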






[jira] [Created] (SPARK-42607) [MESOS] OMP_NUM_THREADS not set to number of executor cores by default

2023-02-27 Thread John Zhuge (Jira)
John Zhuge created SPARK-42607:
--

 Summary: [MESOS] OMP_NUM_THREADS not set to number of executor 
cores by default
 Key: SPARK-42607
 URL: https://issues.apache.org/jira/browse/SPARK-42607
 Project: Spark
  Issue Type: Bug
  Components: Mesos
Affects Versions: 3.3.2
Reporter: John Zhuge


We could have a similar issue to SPARK-42596 (YARN) in Mesos.

Could someone verify? Unfortunately, I am not able to due to lack of infrastructure.






[jira] [Updated] (SPARK-42596) [YARN] OMP_NUM_THREADS not set to number of executor cores by default

2023-02-26 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-42596:
---
Description: 
Run this PySpark script with `spark.executor.cores=1`
{code:python}
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

var_name = 'OMP_NUM_THREADS'

def get_env_var():
  return os.getenv(var_name)

udf_get_env_var = udf(get_env_var)
spark.range(1).toDF("id").withColumn(f"env_{var_name}", 
udf_get_env_var()).show(truncate=False)
{code}
Output with release `3.3.2`:
{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |null               |
+---+-------------------+
{noformat}
Output with release `3.3.0`:
{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |1                  |
+---+-------------------+
{noformat}

  was:
Run this PySpark script with `spark.executor.cores=1`
{code:python}
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

var_name = 'OMP_NUM_THREADS'

def get_env_var():
  return os.getenv(var_name)

udf_get_env_var = udf(get_env_var)
spark.range(1).toDF("id").withColumn(f"env_{var_name}", 
udf_get_env_var()).show(truncate=False)
{code}
Output with release `3.3.2`:
{noformat}
+---+---+
|id |env_OMP_NUM_THREADS|
+---+---+
|0  |null   |
+---+---+
{noformat}
Output with release `3.3.0`:
{noformat}
+---+---+
|id |env_OMP_NUM_THREADS|
+---+---+
|0  |1   |
+---+---+
{noformat}


> [YARN] OMP_NUM_THREADS not set to number of executor cores by default
> -
>
> Key: SPARK-42596
> URL: https://issues.apache.org/jira/browse/SPARK-42596
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
> Run this PySpark script with `spark.executor.cores=1`
> {code:python}
> import os
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import udf
> spark = SparkSession.builder.getOrCreate()
> var_name = 'OMP_NUM_THREADS'
> def get_env_var():
>   return os.getenv(var_name)
> udf_get_env_var = udf(get_env_var)
> spark.range(1).toDF("id").withColumn(f"env_{var_name}", 
> udf_get_env_var()).show(truncate=False)
> {code}
> Output with release `3.3.2`:
> {noformat}
> +---+-------------------+
> |id |env_OMP_NUM_THREADS|
> +---+-------------------+
> |0  |null               |
> +---+-------------------+
> {noformat}
> Output with release `3.3.0`:
> {noformat}
> +---+-------------------+
> |id |env_OMP_NUM_THREADS|
> +---+-------------------+
> |0  |1                  |
> +---+-------------------+
> {noformat}






[jira] [Updated] (SPARK-42596) [YARN] OMP_NUM_THREADS not set to number of executor cores by default

2023-02-26 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-42596:
---
Description: 
Run this PySpark script with `spark.executor.cores=1`
{code:python}
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

var_name = 'OMP_NUM_THREADS'

def get_env_var():
  return os.getenv(var_name)

udf_get_env_var = udf(get_env_var)
spark.range(1).toDF("id").withColumn(f"env_{var_name}", 
udf_get_env_var()).show(truncate=False)
{code}
Output with release `3.3.2`:
{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |null               |
+---+-------------------+
{noformat}
Output with release `3.3.0`:
{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |1                  |
+---+-------------------+
{noformat}

  was:
Run this PySpark script with `spark.executor.cores=1`
{code:python}
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

var_name = 'OMP_NUM_THREADS'

def get_env_var():
  return os.getenv(var_name)

udf_get_env_var = udf(get_env_var)
spark.range(1).toDF("id").withColumn(f"env_{var_name}", 
udf_get_env_var()).show(truncate=False)
{code}
Output with release `3.3.2`:
{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |null               |
+---+-------------------+
{noformat}
Output with release `3.3.0`:
{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |1                  |
+---+-------------------+
{noformat}


> [YARN] OMP_NUM_THREADS not set to number of executor cores by default
> -
>
> Key: SPARK-42596
> URL: https://issues.apache.org/jira/browse/SPARK-42596
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
> Run this PySpark script with `spark.executor.cores=1`
> {code:python}
> import os
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import udf
> spark = SparkSession.builder.getOrCreate()
> var_name = 'OMP_NUM_THREADS'
> def get_env_var():
>   return os.getenv(var_name)
> udf_get_env_var = udf(get_env_var)
> spark.range(1).toDF("id").withColumn(f"env_{var_name}", 
> udf_get_env_var()).show(truncate=False)
> {code}
> Output with release `3.3.2`:
> {noformat}
> +---+-------------------+
> |id |env_OMP_NUM_THREADS|
> +---+-------------------+
> |0  |null               |
> +---+-------------------+
> {noformat}
> Output with release `3.3.0`:
> {noformat}
> +---+-------------------+
> |id |env_OMP_NUM_THREADS|
> +---+-------------------+
> |0  |1                  |
> +---+-------------------+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-42596) [YARN] OMP_NUM_THREADS not set to number of executor cores by default

2023-02-26 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-42596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17693837#comment-17693837
 ] 

John Zhuge commented on SPARK-42596:


Looks like a regression from SPARK-41188, which removed the code in PythonRunner that 
sets the default OMP_NUM_THREADS.

Its PR assumes the code can be moved to SparkContext. Unfortunately, 
`SparkContext#executorEnvs` is only consumed by StandaloneSchedulerBackend for Spark's 
standalone cluster manager, so the PR broke YARN as shown in the test case above, and 
probably Mesos as well, though I don't have a way to test that.
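
Until the regression is fixed, one possible workaround (a sketch, assuming one core per 
executor as in the repro above) is to set the variable explicitly through Spark's 
documented per-executor environment config:
{noformat}
spark-submit \
  --conf spark.executor.cores=1 \
  --conf spark.executorEnv.OMP_NUM_THREADS=1 \
  repro.py
{noformat}
Here `repro.py` stands in for the script in the description; `spark.executorEnv.[Name]` 
is forwarded to the executor processes, so the UDF should see the value again.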

> [YARN] OMP_NUM_THREADS not set to number of executor cores by default
> -
>
> Key: SPARK-42596
> URL: https://issues.apache.org/jira/browse/SPARK-42596
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, YARN
>Affects Versions: 3.3.2
>Reporter: John Zhuge
>Priority: Major
>
> Run this PySpark script with `spark.executor.cores=1`
> {code:python}
> import os
> from pyspark.sql import SparkSession
> from pyspark.sql.functions import udf
> spark = SparkSession.builder.getOrCreate()
> var_name = 'OMP_NUM_THREADS'
> def get_env_var():
>   return os.getenv(var_name)
> udf_get_env_var = udf(get_env_var)
> spark.range(1).toDF("id").withColumn(f"env_{var_name}", 
> udf_get_env_var()).show(truncate=False)
> {code}
> Output with release `3.3.2`:
> {noformat}
> +---+-------------------+
> |id |env_OMP_NUM_THREADS|
> +---+-------------------+
> |0  |null               |
> +---+-------------------+
> {noformat}
> Output with release `3.3.0`:
> {noformat}
> +---+-------------------+
> |id |env_OMP_NUM_THREADS|
> +---+-------------------+
> |0  |1                  |
> +---+-------------------+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42596) [YARN] OMP_NUM_THREADS not set to number of executor cores by default

2023-02-26 Thread John Zhuge (Jira)
John Zhuge created SPARK-42596:
--

 Summary: [YARN] OMP_NUM_THREADS not set to number of executor 
cores by default
 Key: SPARK-42596
 URL: https://issues.apache.org/jira/browse/SPARK-42596
 Project: Spark
  Issue Type: Bug
  Components: PySpark, YARN
Affects Versions: 3.3.2
Reporter: John Zhuge


Run this PySpark script with `spark.executor.cores=1`
{code:python}
import os
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

spark = SparkSession.builder.getOrCreate()

var_name = 'OMP_NUM_THREADS'

def get_env_var():
  return os.getenv(var_name)

udf_get_env_var = udf(get_env_var)
spark.range(1).toDF("id").withColumn(f"env_{var_name}", 
udf_get_env_var()).show(truncate=False)
{code}
Output with release `3.3.2`:
{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |null               |
+---+-------------------+
{noformat}
Output with release `3.3.0`:
{noformat}
+---+-------------------+
|id |env_OMP_NUM_THREADS|
+---+-------------------+
|0  |1                  |
+---+-------------------+
{noformat}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42036) Kryo ClassCastException getting task result when JDK versions mismatch

2023-01-12 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-42036:
---
Description: 
{noformat}
22/12/21 01:27:12 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: 
java.lang.Integer cannot be cast to java.nio.ByteBuffer
Serialization trace:
lowerBounds (org.apache.iceberg.GenericDataFile)
taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
writerCommitMessage 
(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
at 
com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
{noformat}
Iceberg 1.1 `BaseFile.lowerBounds` is defined as
{code:java}
Map<Integer, ByteBuffer> {code}
Driver JDK version: 1.8.0_352 (Azul Systems, Inc.)
Executor JDK version: openjdk version "17.0.5" 2022-10-18 LTS

Kryo version: 4.0.2

 

Same Spark job works when both driver and executors run the same JDK 8 or JDK 
17.

  was:
{noformat}
22/12/21 01:27:12 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: 
java.lang.Integer cannot be cast to java.nio.ByteBuffer
Serialization trace:
lowerBounds (org.apache.iceberg.GenericDataFile)
taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
writerCommitMessage 
(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
at 
com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
{noformat}
Iceberg 1.1 `BaseFile.lowerBounds` is defined as
{code:java}
Map<Integer, ByteBuffer> {code}
Driver JDK version: 1.8.0_352 (Azul Systems, Inc.)
Executor JDK version: openjdk version "17.0.5" 2022-10-18 LTS

Kryo version: 4.0.2

 


> Kryo ClassCastException getting task result when JDK versions mismatch
> --
>
> Key: SPARK-42036
> URL: https://issues.apache.org/jira/browse/SPARK-42036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> {noformat}
> 22/12/21 01:27:12 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: 
> java.lang.Integer cannot be cast to java.nio.ByteBuffer
> Serialization trace:
> lowerBounds (org.apache.iceberg.GenericDataFile)
> taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
> writerCommitMessage 
> (org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
> {noformat}
> Iceberg 1.1 `BaseFile.lowerBounds` is defined as
> {code:java}
> Map<Integer, ByteBuffer> {code}
> Driver JDK version: 1.8.0_352 (Azul Systems, Inc.)
> Executor JDK version: openjdk version "17.0.5" 2022-10-18 LTS
> Kryo version: 4.0.2
>  
> Same Spark job works when both driver and executors run the same JDK 8 or JDK 
> 17.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42036) Kryo ClassCastException getting task result when JDK versions mismatch

2023-01-12 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-42036:
---
Summary: Kryo ClassCastException getting task result when JDK versions 
mismatch  (was: Kryo ClassCastException getting task result when JDK version 
mismatch)

> Kryo ClassCastException getting task result when JDK versions mismatch
> --
>
> Key: SPARK-42036
> URL: https://issues.apache.org/jira/browse/SPARK-42036
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> {noformat}
> 22/12/21 01:27:12 ERROR TaskResultGetter: Exception while getting task result
> com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: 
> java.lang.Integer cannot be cast to java.nio.ByteBuffer
> Serialization trace:
> lowerBounds (org.apache.iceberg.GenericDataFile)
> taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
> writerCommitMessage 
> (org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
>   at 
> com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
> {noformat}
> Iceberg 1.1 `BaseFile.lowerBounds` is defined as
> {code:java}
> Map<Integer, ByteBuffer> {code}
> Driver JDK version: 1.8.0_352 (Azul Systems, Inc.)
> Executor JDK version: openjdk version "17.0.5" 2022-10-18 LTS
> Kryo version: 4.0.2
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-42036) Kryo ClassCastException getting task result when JDK version mismatch

2023-01-12 Thread John Zhuge (Jira)
John Zhuge created SPARK-42036:
--

 Summary: Kryo ClassCastException getting task result when JDK 
version mismatch
 Key: SPARK-42036
 URL: https://issues.apache.org/jira/browse/SPARK-42036
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: John Zhuge


{noformat}
22/12/21 01:27:12 ERROR TaskResultGetter: Exception while getting task result
com.esotericsoftware.kryo.KryoException: java.lang.ClassCastException: 
java.lang.Integer cannot be cast to java.nio.ByteBuffer
Serialization trace:
lowerBounds (org.apache.iceberg.GenericDataFile)
taskFiles (org.apache.iceberg.spark.source.SparkWrite$TaskCommit)
writerCommitMessage 
(org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTaskResult)
at 
com.esotericsoftware.kryo.serializers.ObjectField.read(ObjectField.java:144)
{noformat}
Iceberg 1.1 `BaseFile.lowerBounds` is defined as
{code:java}
Map<Integer, ByteBuffer> {code}
Driver JDK version: 1.8.0_352 (Azul Systems, Inc.)
Executor JDK version: openjdk version "17.0.5" 2022-10-18 LTS

Kryo version: 4.0.2
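
A quick spark-shell sketch (not part of the original report) that can confirm the 
driver and executor JVMs differ:
{code:scala}
// Prints the JDK version seen by the driver and by a single executor task.
println(s"driver JDK:   ${System.getProperty("java.version")}")
val executorJdk = spark.range(1).rdd
  .map(_ => System.getProperty("java.version"))
  .first()
println(s"executor JDK: $executorJdk")
{code}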

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41519) Pin versions-maven-plugin version

2022-12-14 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647711#comment-17647711
 ] 

John Zhuge commented on SPARK-41519:


{noformat}
[ERROR]
java.nio.file.NoSuchFileException: /Users/jzhuge/Repos/upstream-spark/avro
    at sun.nio.fs.UnixException.translateToIOException (UnixException.java:86)
    at sun.nio.fs.UnixException.rethrowAsIOException (UnixException.java:102)
    at sun.nio.fs.UnixException.rethrowAsIOException (UnixException.java:107)
    at sun.nio.fs.UnixFileSystemProvider.newByteChannel 
(UnixFileSystemProvider.java:214)
    at java.nio.file.Files.newByteChannel (Files.java:361)
    at java.nio.file.Files.newByteChannel (Files.java:407)
    at java.nio.file.spi.FileSystemProvider.newInputStream 
(FileSystemProvider.java:384)
    at java.nio.file.Files.newInputStream (Files.java:152)
    at org.codehaus.plexus.util.xml.XmlReader.<init> (XmlReader.java:129)
    at org.codehaus.plexus.util.xml.XmlStreamReader.<init> 
(XmlStreamReader.java:67)
    at org.codehaus.plexus.util.ReaderFactory.newXmlReader 
(ReaderFactory.java:122)
    at org.codehaus.mojo.versions.api.PomHelper.readXmlFile 
(PomHelper.java:1498)
    at org.codehaus.mojo.versions.AbstractVersionsUpdaterMojo.process 
(AbstractVersionsUpdaterMojo.java:326)
    at org.codehaus.mojo.versions.SetMojo.execute (SetMojo.java:381)
    at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo 
(DefaultBuildPluginManager.java:137)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 
(MojoExecutor.java:370)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute 
(MojoExecutor.java:351)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
(MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
(MojoExecutor.java:171)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute 
(MojoExecutor.java:163)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
(LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject 
(LifecycleModuleBuilder.java:81)
    at 
org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build
 (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute 
(LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:294)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:960)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:293)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:196)
    at sun.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke 
(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke 
(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:498)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced 
(Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch 
(Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode 
(Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main 
(Launcher.java:347){noformat}

> Pin versions-maven-plugin version
> -
>
> Key: SPARK-41519
> URL: https://issues.apache.org/jira/browse/SPARK-41519
> Project: Spark
>  Issue Type: Task
>  Components: Project Infra
>Affects Versions: 3.4.0
>Reporter: John Zhuge
>Priority: Minor
>
> `versions-maven-plugin` release `2.14.0` broke the following command in 
> Spark:
> {noformat}
> build/mvn versions:set -DnewVersion=3.4.0-jz-0 -DgenerateBackupPoms=false
> {noformat}
> See [https://github.com/mojohaus/versions/issues/848.]
> The plugin will be fixed in 2.14.1. However, it may be desirable to pin the 
> plugin version.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41519) Pin versions-maven-plugin version

2022-12-14 Thread John Zhuge (Jira)
John Zhuge created SPARK-41519:
--

 Summary: Pin versions-maven-plugin version
 Key: SPARK-41519
 URL: https://issues.apache.org/jira/browse/SPARK-41519
 Project: Spark
  Issue Type: Task
  Components: Project Infra
Affects Versions: 3.4.0
Reporter: John Zhuge


`versions-maven-plugin` release `2.14.0` broke the following command in 
Spark:
{noformat}
build/mvn versions:set -DnewVersion=3.4.0-jz-0 -DgenerateBackupPoms=false
{noformat}
See [https://github.com/mojohaus/versions/issues/848.]

The plugin will be fixed in 2.14.1. However, it may be desirable to pin the plugin 
version.
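
Until then, one possible way to pin the version from the command line is to invoke the 
goal with an explicit plugin coordinate (2.13.0 below is just an example of a 
pre-2.14.0 release):
{noformat}
build/mvn org.codehaus.mojo:versions-maven-plugin:2.13.0:set \
  -DnewVersion=3.4.0-jz-0 -DgenerateBackupPoms=false
{noformat}
Alternatively, the plugin version could be pinned under pluginManagement in the parent 
pom.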



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39800) DataSourceV2: View support

2022-07-17 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-39800:
---
Description: Support Data source V2 views.  (was: Data source V2 view 
substitution and resolution.)

> DataSourceV2: View support
> --
>
> Key: SPARK-39800
> URL: https://issues.apache.org/jira/browse/SPARK-39800
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> Support Data source V2 views.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-39800) DataSourceV2: View support

2022-07-17 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-39800:
---
Summary: DataSourceV2: View support  (was: DataSourceV2: View substitution 
and resolution)

> DataSourceV2: View support
> --
>
> Key: SPARK-39800
> URL: https://issues.apache.org/jira/browse/SPARK-39800
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: John Zhuge
>Priority: Major
>
> Data source V2 view substitution and resolution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39800) DataSourceV2: View substitution and resolution

2022-07-17 Thread John Zhuge (Jira)
John Zhuge created SPARK-39800:
--

 Summary: DataSourceV2: View substitution and resolution
 Key: SPARK-39800
 URL: https://issues.apache.org/jira/browse/SPARK-39800
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: John Zhuge


Data source V2 view substitution and resolution.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-39799) DataSourceV2: View catalog interface

2022-07-17 Thread John Zhuge (Jira)
John Zhuge created SPARK-39799:
--

 Summary: DataSourceV2: View catalog interface
 Key: SPARK-39799
 URL: https://issues.apache.org/jira/browse/SPARK-39799
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: John Zhuge


The view catalog interfaces.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31357) DataSourceV2: Catalog API for view metadata

2022-02-10 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-31357:
---
Target Version/s: 3.3.0  (was: 3.2.0)

> DataSourceV2: Catalog API for view metadata
> ---
>
> Key: SPARK-31357
> URL: https://issues.apache.org/jira/browse/SPARK-31357
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: John Zhuge
>Priority: Major
>  Labels: SPIP
>
> SPARK-24252 added a catalog plugin system and `TableCatalog` API that 
> provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view 
> metadata.
> Details in [SPIP 
> document|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31357) DataSourceV2: Catalog API for view metadata

2022-02-10 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-31357:
---
Summary: DataSourceV2: Catalog API for view metadata  (was: SPIP: Catalog 
API for view metadata)

> DataSourceV2: Catalog API for view metadata
> ---
>
> Key: SPARK-31357
> URL: https://issues.apache.org/jira/browse/SPARK-31357
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: John Zhuge
>Priority: Major
>  Labels: SPIP
>
> SPARK-24252 added a catalog plugin system and `TableCatalog` API that 
> provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view 
> metadata.
> Details in [SPIP 
> document|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36664) Log time spent waiting for cluster resources

2021-09-28 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421824#comment-17421824
 ] 

John Zhuge commented on SPARK-36664:


This can be very useful. For example, we'd like to track how long YARN jobs are 
stuck in the ACCEPTED state.

> Log time spent waiting for cluster resources
> 
>
> Key: SPARK-36664
> URL: https://issues.apache.org/jira/browse/SPARK-36664
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Holden Karau
>Priority: Major
>
> To provide better visibility into why jobs might be running slow it would be 
> useful to log when we are waiting for resources and how long we are waiting 
> for resources so if there is an underlying cluster issue the user can be 
> aware.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25929) Support metrics with tags

2020-05-02 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17098247#comment-17098247
 ] 

John Zhuge commented on SPARK-25929:


Has anyone looked at https://micrometer.io/? From the web site:
{quote}Micrometer provides a simple facade over the instrumentation clients for 
the most popular monitoring systems, allowing you to instrument your JVM-based 
application code without vendor lock-in. Think SLF4J, but for metrics.{quote}
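
For illustration, a tagged counter through the Micrometer facade looks roughly like 
this (the metric and tag names below are made up for the example):
{code:scala}
import io.micrometer.core.instrument.Metrics

// Tags become labels/dimensions in InfluxDB, Prometheus, Atlas, etc.,
// instead of being encoded as dot-separated segments of the metric name.
val recordsRead = Metrics.counter("executor.records.read",
  "app_id", "application_123", "executor_id", "7")
recordsRead.increment()
{code}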

> Support metrics with tags
> -
>
> Key: SPARK-25929
> URL: https://issues.apache.org/jira/browse/SPARK-25929
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: John Zhuge
>Priority: Major
>
> For better integration with DBs that support tags/labels, e.g., InfluxDB, 
> Prometheus, Atlas, etc.
> We should continue to support the current Graphite-style metrics.
> Dropwizard Metrics v5 supports tags. It has been in RC status since Feb. 
> Currently 
> `[5.0.0-rc2|https://github.com/dropwizard/metrics/releases/tag/v5.0.0-rc2]` 
> is in Maven.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31357) SPIP: Catalog API for view metadata

2020-04-06 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-31357:
---
Description: 
SPARK-24252 added a catalog plugin system and `TableCatalog` API that provided 
table metadata to Spark. This JIRA adds `ViewCatalog` API for view metadata.

Details in [SPIP 
document|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].

  was:
SPARK-24252 added a catalog plugin system and `TableCatalog` API that provided 
table metadata to Spark. This JIRA adds `ViewCatalog` API for view metadata.

Details in [SPIP 
docment|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].


> SPIP: Catalog API for view metadata
> ---
>
> Key: SPARK-31357
> URL: https://issues.apache.org/jira/browse/SPARK-31357
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: John Zhuge
>Priority: Major
>  Labels: SPIP
>
> SPARK-24252 added a catalog plugin system and `TableCatalog` API that 
> provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view 
> metadata.
> Details in [SPIP 
> document|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31357) SPIP: Catalog API for view metadata

2020-04-06 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-31357:
---
Description: 
SPARK-24252 added a catalog plugin system and `TableCatalog` API that provided 
table metadata to Spark. This JIRA adds `ViewCatalog` API for view metadata.

Details in [SPIP 
docment|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].

  was:
SPARK-24252 added a catalog plugin system and `TableCatalog` API that provided 
table metadata to Spark. This JIRA adds `ViewCatalog` API for view metadata.

Details in 
[SPIP|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].


> SPIP: Catalog API for view metadata
> ---
>
> Key: SPARK-31357
> URL: https://issues.apache.org/jira/browse/SPARK-31357
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: John Zhuge
>Priority: Major
>  Labels: SPIP
>
> SPARK-24252 added a catalog plugin system and `TableCatalog` API that 
> provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view 
> metadata.
> Details in [SPIP 
> docment|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31357) SPIP: Catalog API for view metadata

2020-04-06 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-31357:
---
Summary: SPIP: Catalog API for view metadata  (was: Catalog API for View 
Metadata)

> SPIP: Catalog API for view metadata
> ---
>
> Key: SPARK-31357
> URL: https://issues.apache.org/jira/browse/SPARK-31357
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: John Zhuge
>Priority: Major
>  Labels: SPIP
>
> SPARK-24252 added a catalog plugin system and `TableCatalog` API that 
> provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view 
> metadata.
> Details in 
> [SPIP|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31357) Catalog API for View Metadata

2020-04-06 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-31357:
---
Labels: SPIP  (was: )

> Catalog API for View Metadata
> -
>
> Key: SPARK-31357
> URL: https://issues.apache.org/jira/browse/SPARK-31357
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: John Zhuge
>Priority: Major
>  Labels: SPIP
>
> SPARK-24252 added a catalog plugin system and `TableCatalog` API that 
> provided table metadata to Spark. This JIRA adds `ViewCatalog` API for view 
> metadata.
> Details in 
> [SPIP|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31357) Catalog API for View Metadata

2020-04-05 Thread John Zhuge (Jira)
John Zhuge created SPARK-31357:
--

 Summary: Catalog API for View Metadata
 Key: SPARK-31357
 URL: https://issues.apache.org/jira/browse/SPARK-31357
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: John Zhuge


SPARK-24252 added a catalog plugin system and `TableCatalog` API that provided 
table metadata to Spark. This JIRA adds `ViewCatalog` API for view metadata.

Details in 
[SPIP document|https://docs.google.com/document/d/1XOxFtloiMuW24iqJ-zJnDzHl2KMxipTjJoxleJFz66A/edit?usp=sharing].
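
As a rough illustration only (the actual API shape is what the SPIP proposes, not what 
is sketched here), a view catalog plugin could look along these lines:
{code:scala}
import org.apache.spark.sql.connector.catalog.{CatalogPlugin, Identifier}
import org.apache.spark.sql.types.StructType

// Hypothetical sketch: names and signatures are illustrative, not the SPIP's.
case class ViewDescription(sql: String, schema: StructType, properties: Map[String, String])

trait ViewCatalog extends CatalogPlugin {
  def listViews(namespace: Array[String]): Array[Identifier]
  def loadView(ident: Identifier): ViewDescription
  def createView(ident: Identifier, view: ViewDescription): Unit
  def dropView(ident: Identifier): Boolean
}
{code}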



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25929) Support metrics with tags

2020-02-11 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17035074#comment-17035074
 ] 

John Zhuge commented on SPARK-25929:


Yeah, I can feel the pain.

When I ingest into InfluxDB, I have to use its [Graphite 
templates|https://github.com/influxdata/influxdb/tree/v1.7.10/services/graphite#templates],
 e.g.,

{noformat}
"*.*.*.DAGScheduler.*.* application.app_id.executor_id.measurement.type.qty 
name=DAGScheduler",
"*.*.*.ExecutorAllocationManager.*.* 
application.app_id.executor_id.measurement.type.qty 
name=ExecutorAllocationManager",
"*.*.*.ExternalShuffle.*.* 
application.app_id.executor_id.measurement.type.qty name=ExternalShuffle",
{noformat}

These templates are hard to get right, easily become obsolete, and don't support multiple versions.

> Support metrics with tags
> -
>
> Key: SPARK-25929
> URL: https://issues.apache.org/jira/browse/SPARK-25929
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.2
>Reporter: John Zhuge
>Priority: Major
>
> For better integration with DBs that support tags/labels, e.g., InfluxDB, 
> Prometheus, Atlas, etc.
> We should continue to support the current Graphite-style metrics.
> Dropwizard Metrics v5 supports tags. It has been in RC status since Feb. 
> Currently 
> `[5.0.0-rc2|https://github.com/dropwizard/metrics/releases/tag/v5.0.0-rc2]` 
> is in Maven.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30118) ALTER VIEW QUERY does not work

2019-12-04 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16988078#comment-16988078
 ] 

John Zhuge commented on SPARK-30118:


[~cltlfcjin] Thanks for the comment. Do you know which commit fixed the issue?

> ALTER VIEW QUERY does not work
> --
>
> Key: SPARK-30118
> URL: https://issues.apache.org/jira/browse/SPARK-30118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted 
> state.
> {code:sql}
> spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> TABLE jzhuge.v1;
> Error in query: Attribute with name 'foo2' is not found in '(foo1)';;
> SubqueryAlias `jzhuge`.`v1`
> +- View (`jzhuge`.`v1`, [foo1#33])
>+- Project [foo AS foo1#34]
>   +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30118) ALTER VIEW QUERY does not work

2019-12-03 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987429#comment-16987429
 ] 

John Zhuge commented on SPARK-30118:


I am running Spark master with Hive 1.2.1. Same issue in Spark 2.3.

> ALTER VIEW QUERY does not work
> --
>
> Key: SPARK-30118
> URL: https://issues.apache.org/jira/browse/SPARK-30118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted 
> state.
> {code:sql}
> spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> TABLE jzhuge.v1;
> Error in query: Attribute with name 'foo2' is not found in '(foo1)';;
> SubqueryAlias `jzhuge`.`v1`
> +- View (`jzhuge`.`v1`, [foo1#33])
>+- Project [foo AS foo1#34]
>   +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30118) ALTER VIEW QUERY does not work

2019-12-03 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16987428#comment-16987428
 ] 

John Zhuge commented on SPARK-30118:


{code:java}
spark-sql> DESC FORMATTED jzhuge.v1;
foo1                         string    NULL

# Detailed Table Information
Database                     jzhuge
Table                        v1
Owner                        jzhuge
Created Time                 Tue Dec 03 17:53:59 PST 2019
Last Access                  UNKNOWN
Created By                   Spark 3.0.0-SNAPSHOT
Type                         VIEW
View Text                    SELECT 'foo' foo1
View Original Text           SELECT 'foo' foo1
View Default Database        default
View Query Output Columns    [foo2]
Table Properties             [transient_lastDdlTime=1575424439, view.query.out.col.0=foo2, view.query.out.numCols=1, view.default.database=default]
Location                     file://tmp/jzhuge/v1
Serde Library                org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat                  org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat                 org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties           [serialization.format=1]
{code}


> ALTER VIEW QUERY does not work
> --
>
> Key: SPARK-30118
> URL: https://issues.apache.org/jira/browse/SPARK-30118
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> `ALTER VIEW AS` does not change view query. It leaves the view in a corrupted 
> state.
> {code:sql}
> spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2;
> spark-sql> SHOW CREATE TABLE jzhuge.v1;
> CREATE VIEW `jzhuge`.`v1`(foo1) AS
> SELECT 'foo' foo1
> spark-sql> TABLE jzhuge.v1;
> Error in query: Attribute with name 'foo2' is not found in '(foo1)';;
> SubqueryAlias `jzhuge`.`v1`
> +- View (`jzhuge`.`v1`, [foo1#33])
>+- Project [foo AS foo1#34]
>   +- OneRowRelation
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30118) ALTER VIEW QUERY does not work

2019-12-03 Thread John Zhuge (Jira)
John Zhuge created SPARK-30118:
--

 Summary: ALTER VIEW QUERY does not work
 Key: SPARK-30118
 URL: https://issues.apache.org/jira/browse/SPARK-30118
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge


`ALTER VIEW AS` does not change view query. It leaves the view in a corrupted 
state.
{code:sql}
spark-sql> CREATE VIEW jzhuge.v1 AS SELECT 'foo' foo1;
spark-sql> SHOW CREATE TABLE jzhuge.v1;
CREATE VIEW `jzhuge`.`v1`(foo1) AS
SELECT 'foo' foo1

spark-sql> ALTER VIEW jzhuge.v1 AS SELECT 'foo' foo2;
spark-sql> SHOW CREATE TABLE jzhuge.v1;
CREATE VIEW `jzhuge`.`v1`(foo1) AS
SELECT 'foo' foo1

spark-sql> TABLE jzhuge.v1;
Error in query: Attribute with name 'foo2' is not found in '(foo1)';;
SubqueryAlias `jzhuge`.`v1`
+- View (`jzhuge`.`v1`, [foo1#33])
   +- Project [foo AS foo1#34]
  +- OneRowRelation
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29030) Simplify lookupV2Relation

2019-09-09 Thread John Zhuge (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-29030:
---
Description: Simplify the return type for {{lookupV2Relation}} which makes 
the 3 callers more straightforward.  (was: Simplify the return type for 
{{lookupV2Relation}} which makes the 3 callers straightforward as well.)

> Simplify lookupV2Relation
> -
>
> Key: SPARK-29030
> URL: https://issues.apache.org/jira/browse/SPARK-29030
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> Simplify the return type for {{lookupV2Relation}} which makes the 3 callers 
> more straightforward.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29030) Simplify lookupV2Relation

2019-09-09 Thread John Zhuge (Jira)
John Zhuge created SPARK-29030:
--

 Summary: Simplify lookupV2Relation
 Key: SPARK-29030
 URL: https://issues.apache.org/jira/browse/SPARK-29030
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge


Simplify the return type for {{lookupV2Relation}} which makes the 3 callers 
straightforward as well.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28640) Only give warning when session catalog is not defined

2019-08-06 Thread John Zhuge (JIRA)
John Zhuge created SPARK-28640:
--

 Summary: Only give warning when session catalog is not defined
 Key: SPARK-28640
 URL: https://issues.apache.org/jira/browse/SPARK-28640
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge


LookupCatalog.sessionCatalog logs an error message and the exception stack upon 
any nonfatal exception. When the session catalog is not defined, this may alarm the 
user unnecessarily. It should be enough to log a warning and return None.
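
A minimal standalone sketch of the suggested behavior (illustrative only, not Spark's 
actual LookupCatalog code; `lookup` stands in for whatever resolution sessionCatalog 
performs today):
{code:scala}
import scala.util.control.NonFatal

// Warn and return None instead of logging an error with a full stack trace.
def sessionCatalogOption(lookup: () => String): Option[String] =
  try Some(lookup())
  catch {
    case NonFatal(e) =>
      println(s"WARN: session catalog is not defined: ${e.getMessage}") // logWarning in Spark
      None
  }

sessionCatalogOption(() => sys.error("no session catalog"))  // None, with a single warning line
{code}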



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28565) DataSourceV2: DataFrameWriter.saveAsTable

2019-07-30 Thread John Zhuge (JIRA)
John Zhuge created SPARK-28565:
--

 Summary: DataSourceV2: DataFrameWriter.saveAsTable
 Key: SPARK-28565
 URL: https://issues.apache.org/jira/browse/SPARK-28565
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge
Assignee: John Zhuge
 Fix For: 3.0.0


Support multiple catalogs in the following InsertInto use cases:
 * DataFrameWriter.insertInto("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28565) DataSourceV2: DataFrameWriter.saveAsTable

2019-07-30 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28565?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-28565:
---
Description: 
Support multiple catalogs in the following use cases:
 * DataFrameWriter.saveAsTable("catalog.db.tbl")

  was:
Support multiple catalogs in the following InsertInto use cases:
 * DataFrameWriter.insertInto("catalog.db.tbl")


> DataSourceV2: DataFrameWriter.saveAsTable
> -
>
> Key: SPARK-28565
> URL: https://issues.apache.org/jira/browse/SPARK-28565
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Assignee: John Zhuge
>Priority: Major
> Fix For: 3.0.0
>
>
> Support multiple catalogs in the following use cases:
>  * DataFrameWriter.saveAsTable("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28178) DataSourceV2: DataFrameWriter.insertInto

2019-06-26 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28178?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-28178:
---
Description: 
Support multiple catalogs in the following InsertInto use cases:
 * DataFrameWriter.insertInto("catalog.db.tbl")

  was:
Support multiple catalogs in the following InsertInto use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl
 * DataFrameWriter.insertInto("catalog.db.tbl")


> DataSourceV2: DataFrameWriter.insertInto
> 
>
> Key: SPARK-28178
> URL: https://issues.apache.org/jira/browse/SPARK-28178
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * DataFrameWriter.insertInto("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28178) DataSourceV2: DataFrameWriter.insertInto

2019-06-26 Thread John Zhuge (JIRA)
John Zhuge created SPARK-28178:
--

 Summary: DataSourceV2: DataFrameWriter.insertInto
 Key: SPARK-28178
 URL: https://issues.apache.org/jira/browse/SPARK-28178
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge


Support multiple catalogs in the following InsertInto use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl
 * DataFrameWriter.insertInto("catalog.db.tbl")
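
Illustrative only, this is the kind of call these use cases describe. It assumes a 
catalog named `testcat` has been registered via `spark.sql.catalog.testcat=<plugin 
class>` and that `df` is an existing DataFrame matching the target table's schema:
{code:scala}
spark.sql("INSERT INTO testcat.db.tbl SELECT * FROM testcat.db.src")
spark.sql("INSERT OVERWRITE TABLE testcat.db.tbl SELECT * FROM testcat.db.src")
df.write.insertInto("testcat.db.tbl")
{code}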



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27845) DataSourceV2: InsertTable

2019-06-26 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27845:
---
Description: 
Support multiple catalogs in the following InsertInto use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl

  was:
Support multiple catalogs in the following InsertInto use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl
 * DataFrameWriter.insertInto("catalog.db.tbl")


> DataSourceV2: InsertTable
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27845) DataSourceV2: InsertTable

2019-06-26 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27845:
---
Description: 
Support multiple catalogs in the following use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl

  was:
Support multiple catalogs in the following InsertInto use cases:
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl


> DataSourceV2: InsertTable
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27845) DataSourceV2: InsertTable

2019-06-10 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27845:
---
Summary: DataSourceV2: InsertTable  (was: DataSourceV2: Insert into tables 
in multiple catalogs)

> DataSourceV2: InsertTable
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl
>  * DataFrameWriter.insertInto("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27322) DataSourceV2 table relation

2019-06-10 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27322:
---
Summary: DataSourceV2 table relation  (was: DataSourceV2: Select from 
multiple catalogs)

> DataSourceV2 table relation
> ---
>
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multi-catalog in the following SELECT code paths:
>  * SELECT * FROM catalog.db.tbl
>  * TABLE catalog.db.tbl
>  * JOIN or UNION tables from different catalogs
>  * SparkSession.table("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27845) DataSourceV2: Insert into tables in multiple catalogs

2019-06-10 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27845?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27845:
---
Summary: DataSourceV2: Insert into tables in multiple catalogs  (was: 
DataSourceV2: InsertInto multiple catalogs)

> DataSourceV2: Insert into tables in multiple catalogs
> -
>
> Key: SPARK-27845
> URL: https://issues.apache.org/jira/browse/SPARK-27845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multiple catalogs in the following InsertInto use cases:
>  * INSERT INTO [TABLE] catalog.db.tbl
>  * INSERT OVERWRITE TABLE catalog.db.tbl
>  * DataFrameWriter.insertInto("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27961) DataSourceV2Relation should not have refresh method

2019-06-05 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27961:
---
Description: 
The newly added `Refresh` method in [PR 
#24401|https://github.com/apache/spark/pull/24401] prevented me from moving 
DataSourceV2Relation into catalyst. It calls `case table: FileTable => 
table.fileIndex.refresh()` while `FileTable` belongs to sql/core.

More importantly, [~rdblue] pointed out that DataSourceV2Relation is immutable by 
design, so it should not have a refresh method.

  was:
The newly added `Refresh` method in PR #24401 prevented me from moving 
DataSourceV2Relation into catalyst. It calls `case table: FileTable => 
table.fileIndex.refresh()` while `FileTable` belongs to sql/core.

More importantly, [~rdblue] pointed out that DataSourceV2Relation is immutable by 
design, so it should not have a refresh method.


> DataSourceV2Relation should not have refresh method
> ---
>
> Key: SPARK-27961
> URL: https://issues.apache.org/jira/browse/SPARK-27961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> The newly added `Refresh` method in [PR 
> #24401|https://github.com/apache/spark/pull/24401] prevented me from moving 
> DataSourceV2Relation into catalyst. It calls `case table: FileTable => 
> table.fileIndex.refresh()` while `FileTable` belongs to sql/core.
> More importantly, [~rdblue] pointed out that DataSourceV2Relation is immutable by 
> design, so it should not have a refresh method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27961) DataSourceV2Relation should not have refresh method

2019-06-05 Thread John Zhuge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16856982#comment-16856982
 ] 

John Zhuge commented on SPARK-27961:


[~Gengliang.Wang] [~cloud_fan] Could you help?

> DataSourceV2Relation should not have refresh method
> ---
>
> Key: SPARK-27961
> URL: https://issues.apache.org/jira/browse/SPARK-27961
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> The newly added `Refresh` method in PR #24401 prevented me from moving 
> DataSourceV2Relation into catalyst. It calls `case table: FileTable => 
> table.fileIndex.refresh()` while `FileTable` belongs to sql/core.
> More importantly, [~rdblue] pointed out that DataSourceV2Relation is immutable by 
> design, so it should not have a refresh method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27961) DataSourceV2Relation should not have refresh method

2019-06-05 Thread John Zhuge (JIRA)
John Zhuge created SPARK-27961:
--

 Summary: DataSourceV2Relation should not have refresh method
 Key: SPARK-27961
 URL: https://issues.apache.org/jira/browse/SPARK-27961
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge


The newly added `Refresh` method in PR #24401 prevented me from moving 
DataSourceV2Relation into catalyst. It calls `case table: FileTable => 
table.fileIndex.refresh()` while `FileTable` belongs to sql/core.

More importantly, [~rdblue] pointed out that DataSourceV2Relation is immutable by 
design, so it should not have a refresh method.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27947:
---
Description: 
In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

TestStatement(Map("abc" -> 1)).toString{code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives this warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 

  was:
In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

TestStatement(Map("abc" -> 1)).toString{code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 


> ParsedStatement subclass toString may throw ClassCastException
> --
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
> TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives this warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27947:
---
Description: 
In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

TestStatement(Map("abc" -> 1)).toString{code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 

  was:
In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 


> ParsedStatement subclass toString may throw ClassCastException
> --
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
> TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives the warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27947:
---
Description: 
In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 

  was:
In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}

Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 


> ParsedStatement subclass toString may throw ClassCastException
> --
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, `case mapArg: Map[_, _]` may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
>  TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives the warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27947:
---
Description: 
In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}

Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 

  was:
In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}
{code:java}

Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 


> ParsedStatement subclass toString may throw ClassCastException
> --
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
>  TestStatement(Map("abc" -> 1)).toString{code}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives the warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread John Zhuge (JIRA)
John Zhuge created SPARK-27947:
--

 Summary: ParsedStatement subclass toString may throw 
ClassCastException
 Key: SPARK-27947
 URL: https://issues.apache.org/jira/browse/SPARK-27947
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge


In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}
{code:java}
 {code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27947) ParsedStatement subclass toString may throw ClassCastException

2019-06-04 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27947:
---
Description: 
In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}
{code:java}

Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 

  was:
In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any Map 
type, thus causing `asInstanceOf[Map[String, String]]` to throw 
ClassCastException.

The following test reproduces the issue:
{code:java}
case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
 override def output: Seq[Attribute] = Nil
 override def children: Seq[LogicalPlan] = Nil
}

 TestStatement(Map("abc" -> 1)).toString{code}
{code:java}
 {code}
Changing the code to `case mapArg: Map[String, String]` will not work due to 
type erasure. As a matter of fact, compiler gives the warning:
{noformat}
Warning:(41, 18) non-variable type argument String in type pattern 
scala.collection.immutable.Map[String,String] (the underlying of 
Map[String,String]) is unchecked since it is eliminated by erasure
case mapArg: Map[String, String] =>{noformat}
 


> ParsedStatement subclass toString may throw ClassCastException
> --
>
> Key: SPARK-27947
> URL: https://issues.apache.org/jira/browse/SPARK-27947
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> In ParsedStatement.productIterator, "case mapArg: Map[_, _]" may match any 
> Map type, thus causing `asInstanceOf[Map[String, String]]` to throw 
> ClassCastException.
> The following test reproduces the issue:
> {code:java}
> case class TestStatement(p: Map[String, Int]) extends ParsedStatement {
>  override def output: Seq[Attribute] = Nil
>  override def children: Seq[LogicalPlan] = Nil
> }
>  TestStatement(Map("abc" -> 1)).toString{code}
> {code:java}
> Changing the code to `case mapArg: Map[String, String]` will not work due to 
> type erasure. As a matter of fact, compiler gives the warning:
> {noformat}
> Warning:(41, 18) non-variable type argument String in type pattern 
> scala.collection.immutable.Map[String,String] (the underlying of 
> Map[String,String]) is unchecked since it is eliminated by erasure
> case mapArg: Map[String, String] =>{noformat}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27845) DataSourceV2: InsertInto multiple catalogs

2019-05-26 Thread John Zhuge (JIRA)
John Zhuge created SPARK-27845:
--

 Summary: DataSourceV2: InsertInto multiple catalogs
 Key: SPARK-27845
 URL: https://issues.apache.org/jira/browse/SPARK-27845
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge


Support multiple catalogs in the following InsertInto use cases (a usage sketch 
follows the list):
 * INSERT INTO [TABLE] catalog.db.tbl
 * INSERT OVERWRITE TABLE catalog.db.tbl
 * DataFrameWriter.insertInto("catalog.db.tbl")
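
For illustration, the target usage in spark-shell (the catalog name `testcat`, 
namespace, and table names are placeholders and assume a V2 catalog is configured):
{code:java}
// Placeholder identifiers; assumes spark.sql.catalog.testcat points at a V2 catalog
// and that a source table `src` with a matching schema exists.
spark.sql("INSERT INTO testcat.db.tbl SELECT * FROM src")
spark.sql("INSERT OVERWRITE TABLE testcat.db.tbl SELECT * FROM src")
spark.table("src").write.insertInto("testcat.db.tbl")
{code}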



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27322) DataSourceV2: Select from multiple catalogs

2019-05-25 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27322:
---
Summary: DataSourceV2: Select from multiple catalogs  (was: DataSourceV2: 
Logical relation in multiple catalogs)

> DataSourceV2: Select from multiple catalogs
> ---
>
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multi-catalog in the following SELECT code paths:
>  * SELECT * FROM catalog.db.tbl
>  * TABLE catalog.db.tbl
>  * JOIN or UNION tables from different catalogs
>  * SparkSession.table("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27322) DataSourceV2: Logical relation in multiple catalogs

2019-05-23 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27322:
---
Summary: DataSourceV2: Logical relation in multiple catalogs  (was: Select 
from multiple catalogs)

> DataSourceV2: Logical relation in multiple catalogs
> ---
>
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multi-catalog in the following SELECT code paths:
>  * SELECT * FROM catalog.db.tbl
>  * TABLE catalog.db.tbl
>  * JOIN or UNION tables from different catalogs
>  * SparkSession.table("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27813) DataSourceV2: Add DropTable logical operation

2019-05-23 Thread John Zhuge (JIRA)
John Zhuge created SPARK-27813:
--

 Summary: DataSourceV2: Add DropTable logical operation
 Key: SPARK-27813
 URL: https://issues.apache.org/jira/browse/SPARK-27813
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.3
Reporter: John Zhuge


Support DROP TABLE from V2 catalog, e.g., "DROP TABLE testcat.ns1.ns2.tbl"
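
For illustration, the spark-shell form of the example above (assumes `testcat` is a 
configured V2 catalog):
{code:java}
spark.sql("DROP TABLE testcat.ns1.ns2.tbl")
// The usual companion form, listed only as an illustration, not as extra scope.
spark.sql("DROP TABLE IF EXISTS testcat.ns1.ns2.tbl")
{code}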



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27322) Select from multiple catalogs

2019-05-22 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27322:
---
Description: 
Support multi-catalog in the following SELECT code paths (a usage sketch follows 
the list):
 * SELECT * FROM catalog.db.tbl
 * TABLE catalog.db.tbl
 * JOIN or UNION tables from different catalogs
 * SparkSession.table("catalog.db.tbl")
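
For illustration, the target usage in spark-shell (catalog, database, and table 
names are placeholders and assume the catalogs are configured):
{code:java}
// Placeholder identifiers; assumes testcat and othercat are configured catalogs.
spark.sql("SELECT * FROM testcat.db.tbl")
spark.sql("TABLE testcat.db.tbl")
spark.sql("SELECT * FROM testcat.db.t1 JOIN othercat.db.t2 USING (id)")
spark.table("testcat.db.tbl")
{code}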

  was:
Support multi-catalog in the following SELECT code paths:
* `SELECT * FROM catalog.db.tbl`
* `TABLE catalog.db.tbl`
* SELECT with JOIN or UNION
* SparkSession.table("catalog.db.tbl")


> Select from multiple catalogs
> -
>
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multi-catalog in the following SELECT code paths:
>  * SELECT * FROM catalog.db.tbl
>  * TABLE catalog.db.tbl
>  * JOIN or UNION tables from different catalogs
>  * SparkSession.table("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27322) Select from multiple catalogs

2019-05-19 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27322:
---
Summary: Select from multiple catalogs  (was: SELECT from multiple catalogs)

> Select from multiple catalogs
> -
>
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multi-catalog in the following SELECT code paths:
> * `SELECT * FROM catalog.db.tbl`
> * `TABLE catalog.db.tbl`
> * SELECT with JOIN or UNION
> * SparkSession.table("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27739) df.persist should save stats from optimized plan

2019-05-15 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27739:
---
Summary: df.persist should save stats from optimized plan  (was: Persist 
should use stats from optimized plan)

> df.persist should save stats from optimized plan
> 
>
> Key: SPARK-27739
> URL: https://issues.apache.org/jira/browse/SPARK-27739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: John Zhuge
>Priority: Minor
>
> CacheManager.cacheQuery passes the stats for `planToCache` to 
> InMemoryRelation. Since the plan has not been optimized, the stats are 
> inaccurate because project and filter have not been applied. I'd suggest 
> passing the stats from the optimized plan.
> {code:java}
> class CacheManager extends Logging {
> ...
>   def cacheQuery(
>   query: Dataset[_],
>   tableName: Option[String] = None,
>   storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = {
> val planToCache = query.logicalPlan
> if (lookupCachedData(planToCache).nonEmpty) {
>   logWarning("Asked to cache already cached data.")
> } else {
>   val sparkSession = query.sparkSession
>   val inMemoryRelation = InMemoryRelation(
> sparkSession.sessionState.conf.useCompression,
> sparkSession.sessionState.conf.columnBatchSize, storageLevel,
> sparkSession.sessionState.executePlan(planToCache).executedPlan,
> tableName,
> planToCache)  <==
> ...
> }
> object InMemoryRelation {
>   def apply(
>   useCompression: Boolean,
>   batchSize: Int,
>   storageLevel: StorageLevel,
>   child: SparkPlan,
>   tableName: Option[String],
>   logicalPlan: LogicalPlan): InMemoryRelation = {
> val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, 
> storageLevel, child, tableName)
> val relation = new InMemoryRelation(child.output, cacheBuilder, 
> logicalPlan.outputOrdering)
> relation.statsOfPlanToCache = logicalPlan.stats   <==
> relation
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27739) Persist should use stats from optimized plan

2019-05-15 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27739:
---
Summary: Persist should use stats from optimized plan  (was: 
CacheManager.cacheQuery should copy stats from optimized plan)

> Persist should use stats from optimized plan
> 
>
> Key: SPARK-27739
> URL: https://issues.apache.org/jira/browse/SPARK-27739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: John Zhuge
>Priority: Minor
>
> CacheManager.cacheQuery passes the stats for `planToCache` to 
> InMemoryRelation. Since the plan has not been optimized, the stats are 
> inaccurate because project and filter have not been applied.
> {code:java}
> class CacheManager extends Logging {
> ...
>   def cacheQuery(
>   query: Dataset[_],
>   tableName: Option[String] = None,
>   storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = {
> val planToCache = query.logicalPlan
> if (lookupCachedData(planToCache).nonEmpty) {
>   logWarning("Asked to cache already cached data.")
> } else {
>   val sparkSession = query.sparkSession
>   val inMemoryRelation = InMemoryRelation(
> sparkSession.sessionState.conf.useCompression,
> sparkSession.sessionState.conf.columnBatchSize, storageLevel,
> sparkSession.sessionState.executePlan(planToCache).executedPlan,
> tableName,
> planToCache)  <==
> ...
> }
> object InMemoryRelation {
>   def apply(
>   useCompression: Boolean,
>   batchSize: Int,
>   storageLevel: StorageLevel,
>   child: SparkPlan,
>   tableName: Option[String],
>   logicalPlan: LogicalPlan): InMemoryRelation = {
> val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, 
> storageLevel, child, tableName)
> val relation = new InMemoryRelation(child.output, cacheBuilder, 
> logicalPlan.outputOrdering)
> relation.statsOfPlanToCache = logicalPlan.stats   <==
> relation
>   }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27739) Persist should use stats from optimized plan

2019-05-15 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27739:
---
Description: 
CacheManager.cacheQuery passes the stats for `planToCache` to InMemoryRelation. 
Since the plan has not been optimized, the stats are inaccurate because project 
and filter have not been applied. I'd suggest passing the stats from the 
optimized plan.
{code:java}
class CacheManager extends Logging {
...
  def cacheQuery(
  query: Dataset[_],
  tableName: Option[String] = None,
  storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = {
val planToCache = query.logicalPlan
if (lookupCachedData(planToCache).nonEmpty) {
  logWarning("Asked to cache already cached data.")
} else {
  val sparkSession = query.sparkSession
  val inMemoryRelation = InMemoryRelation(
sparkSession.sessionState.conf.useCompression,
sparkSession.sessionState.conf.columnBatchSize, storageLevel,
sparkSession.sessionState.executePlan(planToCache).executedPlan,
tableName,
planToCache)  <==
...
}

object InMemoryRelation {

  def apply(
  useCompression: Boolean,
  batchSize: Int,
  storageLevel: StorageLevel,
  child: SparkPlan,
  tableName: Option[String],
  logicalPlan: LogicalPlan): InMemoryRelation = {
val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, 
storageLevel, child, tableName)
val relation = new InMemoryRelation(child.output, cacheBuilder, 
logicalPlan.outputOrdering)
relation.statsOfPlanToCache = logicalPlan.stats   <==
relation
  }
{code}

  was:
CacheManager.cacheQuery passes the stats for `planToCache` to InMemoryRelation. 
Since the plan has not been optimized, the stats is inaccurate because project 
and filter have not been applied.

{code:java}
class CacheManager extends Logging {
...
  def cacheQuery(
  query: Dataset[_],
  tableName: Option[String] = None,
  storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = {
val planToCache = query.logicalPlan
if (lookupCachedData(planToCache).nonEmpty) {
  logWarning("Asked to cache already cached data.")
} else {
  val sparkSession = query.sparkSession
  val inMemoryRelation = InMemoryRelation(
sparkSession.sessionState.conf.useCompression,
sparkSession.sessionState.conf.columnBatchSize, storageLevel,
sparkSession.sessionState.executePlan(planToCache).executedPlan,
tableName,
planToCache)  <==
...
}

object InMemoryRelation {

  def apply(
  useCompression: Boolean,
  batchSize: Int,
  storageLevel: StorageLevel,
  child: SparkPlan,
  tableName: Option[String],
  logicalPlan: LogicalPlan): InMemoryRelation = {
val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, 
storageLevel, child, tableName)
val relation = new InMemoryRelation(child.output, cacheBuilder, 
logicalPlan.outputOrdering)
relation.statsOfPlanToCache = logicalPlan.stats   <==
relation
  }
{code}



> Persist should use stats from optimized plan
> 
>
> Key: SPARK-27739
> URL: https://issues.apache.org/jira/browse/SPARK-27739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0, 2.4.0
>Reporter: John Zhuge
>Priority: Minor
>
> CacheManager.cacheQuery passes the stats for `planToCache` to 
> InMemoryRelation. Since the plan has not been optimized, the stats are 
> inaccurate because project and filter have not been applied. I'd suggest 
> passing the stats from the optimized plan.
> {code:java}
> class CacheManager extends Logging {
> ...
>   def cacheQuery(
>   query: Dataset[_],
>   tableName: Option[String] = None,
>   storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = {
> val planToCache = query.logicalPlan
> if (lookupCachedData(planToCache).nonEmpty) {
>   logWarning("Asked to cache already cached data.")
> } else {
>   val sparkSession = query.sparkSession
>   val inMemoryRelation = InMemoryRelation(
> sparkSession.sessionState.conf.useCompression,
> sparkSession.sessionState.conf.columnBatchSize, storageLevel,
> sparkSession.sessionState.executePlan(planToCache).executedPlan,
> tableName,
> planToCache)  <==
> ...
> }
> object InMemoryRelation {
>   def apply(
>   useCompression: Boolean,
>   batchSize: Int,
>   storageLevel: StorageLevel,
>   child: SparkPlan,
>   tableName: Option[String],
>   logicalPlan: LogicalPlan): InMemoryRelation = {
> val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, 
> storageLevel, child, tableName)
> val relation = new InMemoryRelation(child.output, cacheBuild

[jira] [Created] (SPARK-27739) CacheManager.cacheQuery should copy stats from optimized plan

2019-05-15 Thread John Zhuge (JIRA)
John Zhuge created SPARK-27739:
--

 Summary: CacheManager.cacheQuery should copy stats from optimized 
plan
 Key: SPARK-27739
 URL: https://issues.apache.org/jira/browse/SPARK-27739
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0, 2.3.0
Reporter: John Zhuge


CacheManager.cacheQuery passes the stats for `planToCache` to InMemoryRelation. 
Since the plan has not been optimized, the stats are inaccurate because project 
and filter have not been applied.

{code:java}
class CacheManager extends Logging {
...
  def cacheQuery(
  query: Dataset[_],
  tableName: Option[String] = None,
  storageLevel: StorageLevel = MEMORY_AND_DISK): Unit = {
val planToCache = query.logicalPlan
if (lookupCachedData(planToCache).nonEmpty) {
  logWarning("Asked to cache already cached data.")
} else {
  val sparkSession = query.sparkSession
  val inMemoryRelation = InMemoryRelation(
sparkSession.sessionState.conf.useCompression,
sparkSession.sessionState.conf.columnBatchSize, storageLevel,
sparkSession.sessionState.executePlan(planToCache).executedPlan,
tableName,
planToCache)  <==
...
}

object InMemoryRelation {

  def apply(
  useCompression: Boolean,
  batchSize: Int,
  storageLevel: StorageLevel,
  child: SparkPlan,
  tableName: Option[String],
  logicalPlan: LogicalPlan): InMemoryRelation = {
val cacheBuilder = CachedRDDBuilder(useCompression, batchSize, 
storageLevel, child, tableName)
val relation = new InMemoryRelation(child.output, cacheBuilder, 
logicalPlan.outputOrdering)
relation.statsOfPlanToCache = logicalPlan.stats   <==
relation
  }
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27322) SELECT from multiple catalogs

2019-03-29 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27322:
---
Summary: SELECT from multiple catalogs  (was: Support SELECT from multiple 
catalogs)

> SELECT from multiple catalogs
> -
>
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multi-catalog in the following SELECT code paths:
> * `SELECT * FROM catalog.db.tbl`
> * `TABLE catalog.db.tbl`
> * SELECT with JOIN or UNION
> * SparkSession.table("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27322) Support SELECT from multiple catalogs

2019-03-29 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27322:
---
Summary: Support SELECT from multiple catalogs  (was: Support multi-catalog 
in SELECT)

> Support SELECT from multiple catalogs
> -
>
> Key: SPARK-27322
> URL: https://issues.apache.org/jira/browse/SPARK-27322
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Major
>
> Support multi-catalog in the following SELECT code paths:
> * `SELECT * FROM catalog.db.tbl`
> * `TABLE catalog.db.tbl`
> * SELECT with JOIN or UNION
> * SparkSession.table("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27322) Support multi-catalog in SELECT

2019-03-29 Thread John Zhuge (JIRA)
John Zhuge created SPARK-27322:
--

 Summary: Support multi-catalog in SELECT
 Key: SPARK-27322
 URL: https://issues.apache.org/jira/browse/SPARK-27322
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge


Support multi-catalog in the following SELECT code paths:
* `SELECT * FROM catalog.db.tbl`
* `TABLE catalog.db.tbl`
* SELECT with JOIN or UNION
* SparkSession.table("catalog.db.tbl")



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27250) Scala 2.11 maven compile should target Java 1.8

2019-03-22 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27250:
---
Description: 
Discovered by 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/509/console:
{noformat}
[error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40:
 Static methods in interface require -target:jvm-1.8
 [error] (None, Identifier.of(Array.empty, name))
 [error] ^
 [error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44:
 Static methods in interface require -target:jvm-1.8
 [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last))
 [error] ^
 [error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47:
 Static methods in interface require -target:jvm-1.8
 [error] (None, Identifier.of(parts.init.toArray, parts.last))
 [error] ^
 [error] three errors found
 [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s]
{noformat}
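
For context, a minimal Scala sketch of the construct that requires the 1.8 target 
under Scala 2.11: invoking a static method declared on a Java interface 
(java.util.Comparator stands in for Identifier here):
{code:java}
// Comparator.naturalOrder is a static method on a Java interface (since Java 8),
// so compiling this call with Scala 2.11 needs scalac's -target:jvm-1.8.
val byName: java.util.Comparator[String] = java.util.Comparator.naturalOrder[String]()
{code}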

  was:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/507/consoleFull

{noformat}
[error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40:
 Static methods in interface require -target:jvm-1.8
 [error] (None, Identifier.of(Array.empty, name))
 [error] ^
 [error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44:
 Static methods in interface require -target:jvm-1.8
 [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last))
 [error] ^
 [error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47:
 Static methods in interface require -target:jvm-1.8
 [error] (None, Identifier.of(parts.init.toArray, parts.last))
 [error] ^
 [error] three errors found
 [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s]
{noformat}


> Scala 2.11 maven compile should target Java 1.8
> ---
>
> Key: SPARK-27250
> URL: https://issues.apache.org/jira/browse/SPARK-27250
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> Discovered by 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/509/console:
> {noformat}
> [error] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40:
>  Static methods in interface require -target:jvm-1.8
>  [error] (None, Identifier.of(Array.empty, name))
>  [error] ^
>  [error] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44:
>  Static methods in interface require -target:jvm-1.8
>  [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last))
>  [error] ^
>  [error] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47:
>  Static methods in interface require -target:jvm-1.8
>  [error] (None, Identifier.of(parts.init.toArray, parts.last))
>  [error] ^
>  [error] three errors found
>  [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27251) SPARK-25196 Scala 2.1.1 maven build failure

2019-03-22 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27251?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27251:
---
Description: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/507/consoleFull

{noformat}
[info] Compiling 371 Scala sources and 102 Java sources to 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/target/scala-2.11/classes...
[error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:162:
 values cannot be volatile
[error] @volatile var statsOfPlanToCache: Statistics)
{noformat}


  was:

{noformat}
[info] Compiling 371 Scala sources and 102 Java sources to 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/target/scala-2.11/classes...
[error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:162:
 values cannot be volatile
[error] @volatile var statsOfPlanToCache: Statistics)
{noformat}



> SPARK-25196 Scala 2.1.1 maven build failure
> ---
>
> Key: SPARK-27251
> URL: https://issues.apache.org/jira/browse/SPARK-27251
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/507/consoleFull
> {noformat}
> [info] Compiling 371 Scala sources and 102 Java sources to 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/target/scala-2.11/classes...
> [error] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:162:
>  values cannot be volatile
> [error] @volatile var statsOfPlanToCache: Statistics)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27251) SPARK-25196 Scala 2.1.1 maven build failure

2019-03-22 Thread John Zhuge (JIRA)
John Zhuge created SPARK-27251:
--

 Summary: SPARK-25196 Scala 2.1.1 maven build failure
 Key: SPARK-27251
 URL: https://issues.apache.org/jira/browse/SPARK-27251
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: John Zhuge



{noformat}
[info] Compiling 371 Scala sources and 102 Java sources to 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/target/scala-2.11/classes...
[error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala:162:
 values cannot be volatile
[error] @volatile var statsOfPlanToCache: Statistics)
{noformat}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27250) Scala 2.11 maven compile should target Java 1.8

2019-03-22 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-27250:
---
Description: 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/507/consoleFull

{noformat}
[error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40:
 Static methods in interface require -target:jvm-1.8
 [error] (None, Identifier.of(Array.empty, name))
 [error] ^
 [error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44:
 Static methods in interface require -target:jvm-1.8
 [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last))
 [error] ^
 [error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47:
 Static methods in interface require -target:jvm-1.8
 [error] (None, Identifier.of(parts.init.toArray, parts.last))
 [error] ^
 [error] three errors found
 [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s]
{noformat}

  was:
{noformat}
[error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40:
 Static methods in interface require -target:jvm-1.8
 [error] (None, Identifier.of(Array.empty, name))
 [error] ^
 [error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44:
 Static methods in interface require -target:jvm-1.8
 [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last))
 [error] ^
 [error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47:
 Static methods in interface require -target:jvm-1.8
 [error] (None, Identifier.of(parts.init.toArray, parts.last))
 [error] ^
 [error] three errors found
 [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s]
{noformat}


> Scala 2.11 maven compile should target Java 1.8
> ---
>
> Key: SPARK-27250
> URL: https://issues.apache.org/jira/browse/SPARK-27250
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: John Zhuge
>Priority: Minor
>
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/507/consoleFull
> {noformat}
> [error] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40:
>  Static methods in interface require -target:jvm-1.8
>  [error] (None, Identifier.of(Array.empty, name))
>  [error] ^
>  [error] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44:
>  Static methods in interface require -target:jvm-1.8
>  [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last))
>  [error] ^
>  [error] 
> /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47:
>  Static methods in interface require -target:jvm-1.8
>  [error] (None, Identifier.of(parts.init.toArray, parts.last))
>  [error] ^
>  [error] three errors found
>  [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27250) Scala 2.11 maven compile should target Java 1.8

2019-03-22 Thread John Zhuge (JIRA)
John Zhuge created SPARK-27250:
--

 Summary: Scala 2.11 maven compile should target Java 1.8
 Key: SPARK-27250
 URL: https://issues.apache.org/jira/browse/SPARK-27250
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 3.0.0
Reporter: John Zhuge


{noformat}
[error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:40:
 Static methods in interface require -target:jvm-1.8
 [error] (None, Identifier.of(Array.empty, name))
 [error] ^
 [error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:44:
 Static methods in interface require -target:jvm-1.8
 [error] (Some(catalog), Identifier.of(tail.init.toArray, tail.last))
 [error] ^
 [error] 
/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-ubuntu-scala-2.11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalog/v2/LookupCatalog.scala:47:
 Static methods in interface require -target:jvm-1.8
 [error] (None, Identifier.of(parts.init.toArray, parts.last))
 [error] ^
 [error] three errors found
 [error] Compile failed at Mar 22, 2019 2:10:52 AM [42.688s]
{noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20314) Inconsistent error handling in JSON parsing SQL functions

2019-02-21 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-20314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge resolved SPARK-20314.

Resolution: Duplicate

Resolving it as a duplicate.

> Inconsistent error handling in JSON parsing SQL functions
> -
>
> Key: SPARK-20314
> URL: https://issues.apache.org/jira/browse/SPARK-20314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Wasserman
>Priority: Major
>
> Most parse errors in the JSON parsing SQL functions (e.g. json_tuple, 
> get_json_object) will return a null(s) if the JSON is badly formed. However, 
> if Jackson determines that the string includes invalid characters it will 
> throw an exception (java.io.CharConversionException: Invalid UTF-32 
> character) that Spark does not catch. This creates a robustness problem in 
> that these functions cannot be used at all when there may be dirty data as 
> these exceptions will kill the jobs.
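
For illustration, a spark-shell sketch of the failure mode described above (the 
data and column name are made up; only the first two rows behave as documented, 
the third merely stands in for input that trips Jackson's character decoding):
{code:java}
import spark.implicits._
import org.apache.spark.sql.functions.get_json_object

// Row 1 is valid JSON, row 2 is malformed JSON (returns null), and row 3 is a
// placeholder for a string whose bytes make Jackson raise CharConversionException.
val df = Seq("""{"a": 1}""", """{"a": """, "\u0000\uFFFF").toDF("json")
df.select(get_json_object($"json", "$.a")).show()
{code}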



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26946) Identifiers for multi-catalog Spark

2019-02-20 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-26946:
---
Component/s: (was: Spark Core)

> Identifiers for multi-catalog Spark
> ---
>
> Key: SPARK-26946
> URL: https://issues.apache.org/jira/browse/SPARK-26946
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.2
>Reporter: John Zhuge
>Priority: Major
>
> Propose semantics for identifiers and a listing API to support multiple 
> catalogs.
> [~rdblue]'s SPIP: 
> [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26946) Identifiers for multi-catalog Spark

2019-02-20 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-26946:
---
Description: 
Propose semantics for identifiers and a listing API to support multiple 
catalogs.

[~rdblue]'s SPIP: 
[https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]

  was:
Propose semantics for identifiers and a listing API to support multiple 
catalogs.

Ryan's SPIP: 
[https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]


> Identifiers for multi-catalog Spark
> ---
>
> Key: SPARK-26946
> URL: https://issues.apache.org/jira/browse/SPARK-26946
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core, SQL
>Affects Versions: 2.3.2
>Reporter: John Zhuge
>Priority: Major
>
> Propose semantics for identifiers and a listing API to support multiple 
> catalogs.
> [~rdblue]'s SPIP: 
> [https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-26946) Identifiers for multi-catalog Spark

2019-02-20 Thread John Zhuge (JIRA)
John Zhuge created SPARK-26946:
--

 Summary: Identifiers for multi-catalog Spark
 Key: SPARK-26946
 URL: https://issues.apache.org/jira/browse/SPARK-26946
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core, SQL
Affects Versions: 2.3.2
Reporter: John Zhuge


Propose semantics for identifiers and a listing API to support multiple 
catalogs.

Ryan's SPIP: 
[https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26576) Broadcast hint not applied to partitioned table

2019-01-09 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-26576:
---
Summary: Broadcast hint not applied to partitioned table  (was: Broadcast 
hint not applied to partitioned Parquet table)

> Broadcast hint not applied to partitioned table
> ---
>
> Key: SPARK-26576
> URL: https://issues.apache.org/jira/browse/SPARK-26576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: John Zhuge
>Priority: Major
>
> Broadcast hint is not applied to a partitioned Parquet table. Below, 
> "SortMergeJoin" is chosen incorrectly and "ResolvedHint(broadcast)" is removed 
> in the Optimized Plan.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_with_part (val STRING) 
> PARTITIONED BY (dateint INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_with_part`
> :  +- Relation[val#28,dateint#29] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_with_part`
>   +- Relation[val#32,dateint#33] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- SubqueryAlias `jzhuge`.`parquet_with_part`
>:  +- Relation[val#28,dateint#29] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_with_part`
>  +- Relation[val#32,dateint#33] parquet
> == Optimized Logical Plan ==
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- Project [val#28, dateint#29]
>:  +- Filter isnotnull(dateint#29)
>: +- Relation[val#28,dateint#29] parquet
>+- Project [val#32, dateint#33]
>   +- Filter isnotnull(dateint#33)
>  +- Relation[val#32,dateint#33] parquet
> == Physical Plan ==
> *(5) Project [dateint#29, val#28, val#32]
> +- *(5) SortMergeJoin [dateint#29], [dateint#33], Inner
>:- *(2) Sort [dateint#29 ASC NULLS FIRST], false, 0
>:  +- Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, 
> 500), coordinator[target post-shuffle partition size: 67108864]
>: +- *(1) FileScan parquet jzhuge.parquet_with_part[val#28,dateint#29] 
> Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], 
> PartitionCount: 0, PartitionFilters: [isnotnull(dateint#29)], PushedFilters: 
> [], ReadSchema: struct
>+- *(4) Sort [dateint#33 ASC NULLS FIRST], false, 0
>   +- ReusedExchange [val#32, dateint#33], Exchange(coordinator id: 
> 55629191) hashpartitioning(dateint#29, 500), coordinator[target post-shuffle 
> partition size: 67108864]
> {noformat}
> Broadcast hint is applied to a Parquet table without partitions. Below, 
> "BroadcastHashJoin" is chosen as expected.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_no_part (val STRING, dateint 
> INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_no_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_no_part`
> :  +- Relation[val#44,dateint#45] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_no_part`
>   +- Relation[val#50,dateint#51] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>:- SubqueryAlias `jzhuge`.`parquet_no_part`
>:  +- Relation[val#44,dateint#45] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_no_part`
>  +- Relation[val#50,dateint#51] parquet
> == Optimized Logical Plan ==
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>:- Filter isnotnull(dateint#45)
>:  +- Relation[val#44,dateint#45] parquet
>+- ResolvedHint (broadcast)
>   +- Filter isnotnull(dateint#51)
>  +- Relation[val#50,dateint#51] parquet
> == Physical Plan ==
> *(2) Project [dateint#45, val#44, val#50]
> +- *(2) BroadcastHashJoin [dateint#45], [dateint#51], Inner, BuildRight
>:- *(2) Project [val#44, dateint#45]
>:  +- *(2) Filter isnotnull(dateint#45)
>: +- *(2) FileScan parquet jzhuge.parquet_no_part[val#44,dateint#45] 
> Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], 
> PartitionFilters: [], PushedFilters: [IsNotNull(dateint)], ReadSchema: 
> struct
>+- BroadcastExchange HashedRelationBroadc

[jira] [Updated] (SPARK-26576) Broadcast hint not applied to partitioned Parquet table

2019-01-09 Thread John Zhuge (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Zhuge updated SPARK-26576:
---
Affects Version/s: 2.2.2

> Broadcast hint not applied to partitioned Parquet table
> ---
>
> Key: SPARK-26576
> URL: https://issues.apache.org/jira/browse/SPARK-26576
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.2, 2.3.2, 2.4.0
>Reporter: John Zhuge
>Priority: Major
>
> The broadcast hint is not applied to a partitioned Parquet table. Below, 
> "SortMergeJoin" is chosen incorrectly and "ResolvedHint(broadcast)" is removed 
> from the Optimized Plan.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_with_part (val STRING) 
> PARTITIONED BY (dateint INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_with_part`
> :  +- Relation[val#28,dateint#29] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_with_part`
>   +- Relation[val#32,dateint#33] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- SubqueryAlias `jzhuge`.`parquet_with_part`
>:  +- Relation[val#28,dateint#29] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_with_part`
>  +- Relation[val#32,dateint#33] parquet
> == Optimized Logical Plan ==
> Project [dateint#29, val#28, val#32]
> +- Join Inner, (dateint#29 = dateint#33)
>:- Project [val#28, dateint#29]
>:  +- Filter isnotnull(dateint#29)
>: +- Relation[val#28,dateint#29] parquet
>+- Project [val#32, dateint#33]
>   +- Filter isnotnull(dateint#33)
>  +- Relation[val#32,dateint#33] parquet
> == Physical Plan ==
> *(5) Project [dateint#29, val#28, val#32]
> +- *(5) SortMergeJoin [dateint#29], [dateint#33], Inner
>:- *(2) Sort [dateint#29 ASC NULLS FIRST], false, 0
>:  +- Exchange(coordinator id: 55629191) hashpartitioning(dateint#29, 
> 500), coordinator[target post-shuffle partition size: 67108864]
>: +- *(1) FileScan parquet jzhuge.parquet_with_part[val#28,dateint#29] 
> Batched: true, Format: Parquet, Location: PrunedInMemoryFileIndex[], 
> PartitionCount: 0, PartitionFilters: [isnotnull(dateint#29)], PushedFilters: 
> [], ReadSchema: struct
>+- *(4) Sort [dateint#33 ASC NULLS FIRST], false, 0
>   +- ReusedExchange [val#32, dateint#33], Exchange(coordinator id: 
> 55629191) hashpartitioning(dateint#29, 500), coordinator[target post-shuffle 
> partition size: 67108864]
> {noformat}
> The broadcast hint is applied to a Parquet table without partitions. Below, 
> "BroadcastHashJoin" is chosen as expected.
> {noformat}
> scala> spark.sql("CREATE TABLE jzhuge.parquet_no_part (val STRING, dateint 
> INT) STORED AS parquet")
> scala> spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
> scala> Seq(spark.table("jzhuge.parquet_no_part")).map(df => 
> df.join(broadcast(df), "dateint").explain(true))
> == Parsed Logical Plan ==
> 'Join UsingJoin(Inner,List(dateint))
> :- SubqueryAlias `jzhuge`.`parquet_no_part`
> :  +- Relation[val#44,dateint#45] parquet
> +- ResolvedHint (broadcast)
>+- SubqueryAlias `jzhuge`.`parquet_no_part`
>   +- Relation[val#50,dateint#51] parquet
> == Analyzed Logical Plan ==
> dateint: int, val: string, val: string
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>:- SubqueryAlias `jzhuge`.`parquet_no_part`
>:  +- Relation[val#44,dateint#45] parquet
>+- ResolvedHint (broadcast)
>   +- SubqueryAlias `jzhuge`.`parquet_no_part`
>  +- Relation[val#50,dateint#51] parquet
> == Optimized Logical Plan ==
> Project [dateint#45, val#44, val#50]
> +- Join Inner, (dateint#45 = dateint#51)
>:- Filter isnotnull(dateint#45)
>:  +- Relation[val#44,dateint#45] parquet
>+- ResolvedHint (broadcast)
>   +- Filter isnotnull(dateint#51)
>  +- Relation[val#50,dateint#51] parquet
> == Physical Plan ==
> *(2) Project [dateint#45, val#44, val#50]
> +- *(2) BroadcastHashJoin [dateint#45], [dateint#51], Inner, BuildRight
>:- *(2) Project [val#44, dateint#45]
>:  +- *(2) Filter isnotnull(dateint#45)
>: +- *(2) FileScan parquet jzhuge.parquet_no_part[val#44,dateint#45] 
> Batched: true, Format: Parquet, Location: InMemoryFileIndex[...], 
> PartitionFilters: [], PushedFilters: [IsNotNull(dateint)], ReadSchema: 
> struct
>+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, 
> true] as bigint)))
>   +- *(1) Project [va

[jira] [Comment Edited] (SPARK-26576) Broadcast hint not applied to partitioned Parquet table

2019-01-09 Thread John Zhuge (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16737964#comment-16737964
 ] 

John Zhuge edited comment on SPARK-26576 at 1/9/19 5:12 PM:


No issue on the master branch. Please note "rightHint=(broadcast)" for the Join 
in the Optimized Plan.
{noformat}
scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => 
df.join(broadcast(df), "dateint").explain(true))

== Parsed Logical Plan ==
'Join UsingJoin(Inner,List(dateint))
:- SubqueryAlias `jzhuge`.`parquet_with_part`
:  +- Relation[val#34,dateint#35] parquet
+- ResolvedHint (broadcast)
   +- SubqueryAlias `jzhuge`.`parquet_with_part`
  +- Relation[val#40,dateint#41] parquet

== Analyzed Logical Plan ==
dateint: int, val: string, val: string
Project [dateint#35, val#34, val#40]
+- Join Inner, (dateint#35 = dateint#41)
   :- SubqueryAlias `jzhuge`.`parquet_with_part`
   :  +- Relation[val#34,dateint#35] parquet
   +- ResolvedHint (broadcast)
  +- SubqueryAlias `jzhuge`.`parquet_with_part`
 +- Relation[val#40,dateint#41] parquet

== Optimized Logical Plan ==
Project [dateint#35, val#34, val#40]
+- Join Inner, (dateint#35 = dateint#41), rightHint=(broadcast)
   :- Project [val#34, dateint#35]
   :  +- Filter isnotnull(dateint#35)
   : +- Relation[val#34,dateint#35] parquet
   +- Project [val#40, dateint#41]
  +- Filter isnotnull(dateint#41)
 +- Relation[val#40,dateint#41] parquet

== Physical Plan ==
*(2) Project [dateint#35, val#34, val#40]
+- *(2) BroadcastHashJoin [dateint#35], [dateint#41], Inner, BuildRight
   :- *(2) FileScan parquet jzhuge.parquet_with_part[val#34,dateint#35] 
Batched: true, DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: 
[isnotnull(dateint#35)], PushedFilters: [], ReadSchema: struct
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, 
true] as bigint)))
  +- *(1) FileScan parquet jzhuge.parquet_with_part[val#40,dateint#41] 
Batched: true, DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: 
[isnotnull(dateint#41)], PushedFilters: [], ReadSchema: struct
{noformat}
From a quick look at the source, EliminateResolvedHint pulls the broadcast hint 
into the Join and eliminates the ResolvedHint node. It runs before 
PruneFileSourcePartitions, so the above code in 
PhysicalOperation.collectProjectsAndFilters is never reached on the master 
branch for the few cases I tried.
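
For anyone who wants to double-check an affected 2.x build, a quick way to see 
whether the hint survives optimization is to compare the optimized and physical 
plans directly. A minimal sketch, assuming the jzhuge.parquet_with_part table 
from the description exists:
{noformat}
import org.apache.spark.sql.functions.broadcast

// Rely on the hint alone, as in the description (disable size-based broadcast).
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val df = spark.table("jzhuge.parquet_with_part")
val joined = df.join(broadcast(df), "dateint")

// Affected builds drop the hint during optimization and fall back to
// SortMergeJoin; on master the Join keeps rightHint=(broadcast) and a
// BroadcastHashJoin is planned.
println(joined.queryExecution.optimizedPlan)
println(joined.queryExecution.executedPlan)
{noformat}
If relying on the hint is not an option on an affected release, keeping 
spark.sql.autoBroadcastJoinThreshold at a positive value should still let a 
sufficiently small partitioned table be broadcast based on its size estimate, 
independently of the hint.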


was (Author: jzhuge):
No issue on the master branch. Please note "rightHint=(broadcast)" for the Join 
in the Optimized Plan.
{noformat}
scala> Seq(spark.table("jzhuge.parquet_with_part")).map(df => 
df.join(broadcast(df), "dateint").explain(true))

== Parsed Logical Plan ==
'Join UsingJoin(Inner,List(dateint))
:- SubqueryAlias `jzhuge`.`parquet_with_part`
:  +- Relation[val#34,dateint#35] parquet
+- ResolvedHint (broadcast)
   +- SubqueryAlias `jzhuge`.`parquet_with_part`
  +- Relation[val#40,dateint#41] parquet

== Analyzed Logical Plan ==
dateint: int, val: string, val: string
Project [dateint#35, val#34, val#40]
+- Join Inner, (dateint#35 = dateint#41)
   :- SubqueryAlias `jzhuge`.`parquet_with_part`
   :  +- Relation[val#34,dateint#35] parquet
   +- ResolvedHint (broadcast)
  +- SubqueryAlias `jzhuge`.`parquet_with_part`
 +- Relation[val#40,dateint#41] parquet

== Optimized Logical Plan ==
Project [dateint#35, val#34, val#40]
+- Join Inner, (dateint#35 = dateint#41), rightHint=(broadcast)
   :- Project [val#34, dateint#35]
   :  +- Filter isnotnull(dateint#35)
   : +- Relation[val#34,dateint#35] parquet
   +- Project [val#40, dateint#41]
  +- Filter isnotnull(dateint#41)
 +- Relation[val#40,dateint#41] parquet

== Physical Plan ==
*(2) Project [dateint#35, val#34, val#40]
+- *(2) BroadcastHashJoin [dateint#35], [dateint#41], Inner, BuildRight
   :- *(2) FileScan parquet jzhuge.parquet_with_part[val#34,dateint#35] 
Batched: true, DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: 
[isnotnull(dateint#35)], PushedFilters: [], ReadSchema: struct
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[1, int, 
true] as bigint)))
  +- *(1) FileScan parquet jzhuge.parquet_with_part[val#40,dateint#41] 
Batched: true, DataFilters: [], Format: Parquet, Location: 
PrunedInMemoryFileIndex[], PartitionCount: 0, PartitionFilters: 
[isnotnull(dateint#41)], PushedFilters: [], ReadSchema: struct
{noformat}
From a quick look at the source, EliminateResolvedHint pulls the broadcast hint 
into the Join and eliminates the ResolvedHint node. It runs before 
PruneFileSourcePartitions, so the above code in 
PhysicalOperation.collectProjectsAndFilters is never reached on the master 
branch.

> Broadcast hint not applied to partitioned Parquet table
> ---
