[jira] [Resolved] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join

2021-09-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36794.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34034
[https://github.com/apache/spark/pull/34034]

> Ignore duplicated join keys when building relation for SEMI/ANTI hash join
> --
>
> Key: SPARK-36794
> URL: https://issues.apache.org/jira/browse/SPARK-36794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
>
> For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we 
> only need to keep one row per unique join key(s) inside the hash table 
> (`HashedRelation`) when building it. This can help reduce the 
> size of the join's hash table.
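A minimal sketch (not part of the ticket) of why duplicate build-side keys are redundant here: a LEFT SEMI join emits each probe-side row at most once, no matter how many build-side rows share the key. Assuming a spark-shell session with implicits in scope:

{code:scala}
// Build (right) side contains the same key three times.
val left  = Seq((1, "a"), (2, "b")).toDF("k", "v")
val right = Seq(1, 1, 1).toDF("k")

// The result is identical whether the hash table keeps one or three entries
// for key 1, so a single row per unique key suffices for LEFT SEMI/ANTI.
left.join(right, Seq("k"), "left_semi").show()
// +---+---+
// |  k|  v|
// +---+---+
// |  1|  a|
// +---+---+
{code}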



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join

2021-09-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36794:
---

Assignee: Cheng Su

> Ignore duplicated join keys when building relation for SEMI/ANTI hash join
> --
>
> Key: SPARK-36794
> URL: https://issues.apache.org/jira/browse/SPARK-36794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
>
> For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we 
> only need to keep one row per unique join key(s) inside the hash table 
> (`HashedRelation`) when building it. This can help reduce the 
> size of the join's hash table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-32855) Improve DPP for some join type do not support broadcast filtering side

2021-09-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-32855:
-
  Assignee: (was: Yuming Wang)

> Improve DPP for some join type do not support broadcast filtering side
> --
>
> Key: SPARK-32855
> URL: https://issues.apache.org/jira/browse/SPARK-32855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> When the filtering side cannot be broadcast because of the join type but is 
> small enough to be broadcast by size, we should not only consider reusing an 
> existing broadcast exchange. For example: a left outer join where the left 
> side is very small.
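A hedged sketch of that scenario (table names are hypothetical): in a LEFT OUTER join the left side cannot be the broadcast build side, so a tiny left (filtering) side cannot simply reuse a broadcast exchange for DPP, even though it is far below the auto-broadcast threshold.

{code:scala}
// Hypothetical tables: a small dimension and a large fact table assumed to be
// partitioned by store_id. The small left side is the DPP filtering side.
val smallDim  = spark.range(10).withColumnRenamed("id", "store_id")
val largeFact = spark.table("sales_partitioned_by_store_id")

smallDim
  .join(largeFact, Seq("store_id"), "left_outer")
  .explain()  // inspect PartitionFilters for a dynamicpruning expression
{code}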



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32855) Improve DPP for some join type do not support broadcast filtering side

2021-09-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32855.
-
Resolution: Won't Do

> Improve DPP for some join type do not support broadcast filtering side
> --
>
> Key: SPARK-32855
> URL: https://issues.apache.org/jira/browse/SPARK-32855
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> When the filtering side cannot be broadcast because of the join type but is 
> small enough to be broadcast by size, we should not only consider reusing an 
> existing broadcast exchange. For example: a left outer join where the left 
> side is very small.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36849) Migrate UseStatement to v2 command framework

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36849:


Assignee: (was: Apache Spark)

> Migrate UseStatement to v2 command framework
> 
>
> Key: SPARK-36849
> URL: https://issues.apache.org/jira/browse/SPARK-36849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36849) Migrate UseStatement to v2 command framework

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36849:


Assignee: Apache Spark

> Migrate UseStatement to v2 command framework
> 
>
> Key: SPARK-36849
> URL: https://issues.apache.org/jira/browse/SPARK-36849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36849) Migrate UseStatement to v2 command framework

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421236#comment-17421236
 ] 

Apache Spark commented on SPARK-36849:
--

User 'dohongdayi' has created a pull request for this issue:
https://github.com/apache/spark/pull/34126

> Migrate UseStatement to v2 command framework
> 
>
> Key: SPARK-36849
> URL: https://issues.apache.org/jira/browse/SPARK-36849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

2021-09-28 Thread Vincent Doba (Jira)
Vincent Doba created SPARK-36874:


 Summary: Ambiguous Self-Join detected only on right dataframe
 Key: SPARK-36874
 URL: https://issues.apache.org/jira/browse/SPARK-36874
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2
Reporter: Vincent Doba


When joining two dataframes, if they share the same lineage and one dataframe 
is a transformation of the other, Ambiguous Self Join detection only works when 
the transformed dataframe is the right dataframe.

For instance, with {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, Ambiguous 
Self Join detection only works when {{df2}} is the right dataframe:

- {{df1.join(df2, ...)}} correctly fails with an Ambiguous Self Join error
- {{df2.join(df1, ...)}} succeeds and returns a dataframe

h1. Minimum Reproducible example
h2. Code
{code:scala}
import sparkSession.implicits._

val df1 = Seq((1, 2, "A1"),(3,4, "A2")).toDF("key1", "key2", "value")

val df2 = df1.filter($"value" === "A2")

df2.join(df1, df1("key1") === df2("key2")).show()
{code}
h2. Expected Result

Throw the following exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
key2#11 are ambiguous. It's probably because you joined several Datasets 
together, and some of these Datasets are the same. This column points to one of 
the Datasets but Spark is unable to figure out which one. Please alias the 
Datasets with different names via `Dataset.as` before joining them, and specify 
the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > 
$"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false 
to disable this check.
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at 
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
{code}
h2. Actual result

Empty dataframe:
{code:java}
+----+----+-----+----+----+-----+
|key1|key2|value|key1|key2|value|
+----+----+-----+----+----+-----+
+----+----+-----+----+----+-----+
{code}
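Independently of the detection bug, here is a hedged sketch (continuing from the repro above) of the workaround that the expected error message itself suggests: alias both Datasets and use qualified column names so the join condition is unambiguous.

{code:scala}
// Alias both sides and qualify the columns in the join condition.
val a = df2.as("a")
val b = df1.as("b")
a.join(b, $"b.key1" === $"a.key2").show()
{code}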



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

2021-09-28 Thread Vincent Doba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Doba updated SPARK-36874:
-
Description: 
When joining two dataframes, if they share the same lineage and one dataframe 
is a transformation of the other, Ambiguous Self Join detection only works when 
transformed dataframe is the right dataframe. 

For instance {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, Ambiguous 
Self Join detection only works when {{df2}} is the right dataframe:

- {{df1.join(df2, ...)}} correctly fails with Ambiguous Self Join error
- {{df2.join(df1, ...)}} returns a valid dataframe

h1. Minimum Reproducible example
h2. Code
{code:scala}
import sparkSession.implicits._

val df1 = Seq((1, 2, "A1"),(3,4, "A2")).toDF("key1", "key2", "value")

val df2 = df1.filter($"value" === "A2")

df2.join(df1, df1("key1") === df2("key2")).show()
{code}
h2. Expected Result

Throw the following exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
key2#11 are ambiguous. It's probably because you joined several Datasets 
together, and some of these Datasets are the same. This column points to one of 
the Datasets but Spark is unable to figure out which one. Please alias the 
Datasets with different names via `Dataset.as` before joining them, and specify 
the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > 
$"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false 
to disable this check.
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at 
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
{code}
h2. Actual result

Empty dataframe:
{code:java}
+----+----+-----+----+----+-----+
|key1|key2|value|key1|key2|value|
+----+----+-----+----+----+-----+
+----+----+-----+----+----+-----+
{code}

h2. Related issue

https://issues.apache.org/jira/browse/SPARK-28344

  was:
When joining two dataframes, if they share the same lineage and one dataframe 
is a transformation of the other, Ambiguous Self Join detection only works when 
transformed dataframe is the right dataframe. 

For instance {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, Ambiguous 
Self Join detection only works when {{df2}} is the right dataframe:

- {{df1.join(df2, ...)}} correctly fails with Ambiguous Self Join error
- {{df2.join(df1, ...)}} returns a valid dataframe

h1. Minimum Reproducible example
h2. Code
{code:scala}
import sparkSession.implicits._

val df1 = Seq((1, 2, "A1"),(3,4, "A2")).toDF("key1", "key2", "value")

val df2 = df1.filter($"value" === "A2")

df2.join(df1, df1("key1") === df2("key2")).show()
{code}
h2. Expected Result

Throw the following exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
key2#11 are ambiguous. It's probably because you joined several Datasets 
together, and some of these Datasets are the same. This column points to one of 
the Datasets but Spark is unable to figure out which one. Please alias the 
Datasets with different names via `Dataset.as` before joining them, and specify 
the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > 
$"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false 
to disable this check.
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at 
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
{code}
h2. Actual result

Empty dataframe:
{code:java}
+----+----+-----+----+----+-----+
|key1|key2|value|key1|key2|value|
+----+----+-----+----+----+-----+
+----+----+-----+----+----+-----+
{code}


> Ambiguous Self-Join detected only on right dataframe
> 
>
> Key: SPARK-36874
> URL: https://issues.apache.org/jira/browse/SPARK-36874
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Vincent Doba
>Priority: Major
>  Lab

[jira] [Updated] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

2021-09-28 Thread Vincent Doba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Doba updated SPARK-36874:
-
Description: 
When joining two dataframes, if they share the same lineage and one dataframe 
is a transformation of the other, Ambiguous Self Join detection only works when 
transformed dataframe is the right dataframe. 

For instance {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, Ambiguous 
Self Join detection only works when {{df2}} is the right dataframe:

- {{df1.join(df2, ...)}} correctly fails with Ambiguous Self Join error
- {{df2.join(df1, ...)}} returns a valid dataframe

h1. Minimum Reproducible example
h2. Code
{code:scala}
import sparkSession.implicits._

val df1 = Seq((1, 2, "A1"),(3,4, "A2")).toDF("key1", "key2", "value")

val df2 = df1.filter($"value" === "A2")

df2.join(df1, df1("key1") === df2("key2")).show()
{code}
h2. Expected Result

Throw the following exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
key2#11 are ambiguous. It's probably because you joined several Datasets 
together, and some of these Datasets are the same. This column points to one of 
the Datasets but Spark is unable to figure out which one. Please alias the 
Datasets with different names via `Dataset.as` before joining them, and specify 
the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > 
$"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false 
to disable this check.
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at 
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
{code}
h2. Actual result

Empty dataframe:
{code:java}
+----+----+-----+----+----+-----+
|key1|key2|value|key1|key2|value|
+----+----+-----+----+----+-----+
+----+----+-----+----+----+-----+
{code}

h1. Related issue

https://issues.apache.org/jira/browse/SPARK-28344

  was:
When joining two dataframes, if they share the same lineage and one dataframe 
is a transformation of the other, Ambiguous Self Join detection only works when 
transformed dataframe is the right dataframe. 

For instance {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, Ambiguous 
Self Join detection only works when {{df2}} is the right dataframe:

- {{df1.join(df2, ...)}} correctly fails with Ambiguous Self Join error
- {{df2.join(df1, ...)}} returns a valid dataframe

h1. Minimum Reproducible example
h2. Code
{code:scala}
import sparkSession.implicits._

val df1 = Seq((1, 2, "A1"),(3,4, "A2")).toDF("key1", "key2", "value")

val df2 = df1.filter($"value" === "A2")

df2.join(df1, df1("key1") === df2("key2")).show()
{code}
h2. Expected Result

Throw the following exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
key2#11 are ambiguous. It's probably because you joined several Datasets 
together, and some of these Datasets are the same. This column points to one of 
the Datasets but Spark is unable to figure out which one. Please alias the 
Datasets with different names via `Dataset.as` before joining them, and specify 
the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > 
$"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false 
to disable this check.
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at 
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
{code}
h2. Actual result

Empty dataframe:
{code:java}
+----+----+-----+----+----+-----+
|key1|key2|value|key1|key2|value|
+----+----+-----+----+----+-----+
+----+----+-----+----+----+-----+
{code}

h2. Related issue

https://issues.apache.org/jira/browse/SPARK-28344


> Ambiguous Self-Join detected only on right dataframe
> 
>
> Key: SPARK-36874
> URL: https://issues.apache.org/jira/browse/SPARK-36874
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>

[jira] [Updated] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

2021-09-28 Thread Vincent Doba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Doba updated SPARK-36874:
-
Description: 
When joining two dataframes, if they share the same lineage and one dataframe 
is a transformation of the other, Ambiguous Self Join detection only works when 
transformed dataframe is the right dataframe. 

For instance {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, Ambiguous 
Self Join detection only works when {{df2}} is the right dataframe:

- {{df1.join(df2, ...)}} correctly fails with Ambiguous Self Join error
- {{df2.join(df1, ...)}} returns a valid dataframe

h1. Minimum Reproducible example
h2. Code
{code:scala}
import sparkSession.implicits._

val df1 = Seq((1, 2, "A1"),(3,4, "A2")).toDF("key1", "key2", "value")

val df2 = df1.filter($"value" === "A2")

df2.join(df1, df1("key1") === df2("key2")).show()
{code}
h2. Expected Result

Throw the following exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
key2#11 are ambiguous. It's probably because you joined several Datasets 
together, and some of these Datasets are the same. This column points to one of 
the Datasets but Spark is unable to figure out which one. Please alias the 
Datasets with different names via `Dataset.as` before joining them, and specify 
the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > 
$"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false 
to disable this check.
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at 
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
at scala.collection.immutable.List.foreach(List.scala:431)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:174)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
at 
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:143)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:143)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:3715)
at org.apache.spark.sql.Dataset.join(Dataset.scala:1079)
at org.apache.spark.sql.Dataset.join(Dataset.scala:1041)
 ...
{code}
h2. Actual result

Empty dataframe:
{code:java}
+----+----+-----+----+----+-----+
|key1|key2|value|key1|key2|value|
+----+----+-----+----+----+-----+
+----+----+-----+----+----+-----+
{code}

h1. Related issue

https://issues.apache.org/jira/browse/SPARK-28344

  was:
When join

[jira] [Updated] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

2021-09-28 Thread Vincent Doba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Doba updated SPARK-36874:
-
Description: 
When joining two dataframes, if they share the same lineage and one dataframe 
is a transformation of the other, Ambiguous Self Join detection only works when 
transformed dataframe is the right dataframe. 

For instance {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, Ambiguous 
Self Join detection only works when {{df2}} is the right dataframe:

- {{df1.join(df2, ...)}} correctly fails with Ambiguous Self Join error
- {{df2.join(df1, ...)}} returns a valid dataframe

h1. Minimum Reproducible example
h2. Code
{code:scala}
import sparkSession.implicits._

val df1 = Seq((1, 2, "A1"),(3,4, "A2")).toDF("key1", "key2", "value")

val df2 = df1.filter($"value" === "A2")

df2.join(df1, df1("key1") === df2("key2")).show()
{code}
h2. Expected Result

Throw the following exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
key2#11 are ambiguous. It's probably because you joined several Datasets 
together, and some of these Datasets are the same. This column points to one of 
the Datasets but Spark is unable to figure out which one. Please alias the 
Datasets with different names via `Dataset.as` before joining them, and specify 
the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > 
$"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false 
to disable this check.
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at 
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
at scala.collection.immutable.List.foreach(List.scala:431)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:174)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
at 
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:143)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:143)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:3715)
at org.apache.spark.sql.Dataset.join(Dataset.scala:1079)
at org.apache.spark.sql.Dataset.join(Dataset.scala:1041)
 ...
{code}
h2. Actual result

Empty dataframe:
{code:java}
+----+----+-----+----+----+-----+
|key1|key2|value|key1|key2|value|
+----+----+-----+----+----+-----+
+----+----+-----+----+----+-----+
{code}

  was:
When joining two dataframes, if they share the same lineage and one dataframe 

[jira] [Updated] (SPARK-36874) Ambiguous Self-Join detected only on right dataframe

2021-09-28 Thread Vincent Doba (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vincent Doba updated SPARK-36874:
-
Description: 
When joining two dataframes, if they share the same lineage and one dataframe 
is a transformation of the other, Ambiguous Self Join detection only works when 
transformed dataframe is the right dataframe. 

For instance {{df1}} and {{df2}} where {{df2}} is a filtered {{df1}}, Ambiguous 
Self Join detection only works when {{df2}} is the right dataframe:

- {{df1.join(df2, ...)}} correctly fails with Ambiguous Self Join error
- {{df2.join(df1, ...)}} returns a valid dataframe

h1. Minimum Reproducible example
h2. Code
{code:scala}
import sparkSession.implicits._

val df1 = Seq((1, 2, "A1"),(2, 1, "A2")).toDF("key1", "key2", "value")

val df2 = df1.filter($"value" === "A2")

df2.join(df1, df1("key1") === df2("key2")).show()
{code}
h2. Expected Result

Throw the following exception:

{code}
Exception in thread "main" org.apache.spark.sql.AnalysisException: Column 
key2#11 are ambiguous. It's probably because you joined several Datasets 
together, and some of these Datasets are the same. This column points to one of 
the Datasets but Spark is unable to figure out which one. Please alias the 
Datasets with different names via `Dataset.as` before joining them, and specify 
the column using qualified name, e.g. `df.as("a").join(df.as("b"), $"a.id" > 
$"b.id")`. You can also set spark.sql.analyzer.failAmbiguousSelfJoin to false 
to disable this check.
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:157)
at 
org.apache.spark.sql.execution.analysis.DetectAmbiguousSelfJoin$.apply(DetectAmbiguousSelfJoin.scala:43)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:216)
at 
scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
at 
scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
at scala.collection.immutable.List.foldLeft(List.scala:91)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:213)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:205)
at scala.collection.immutable.List.foreach(List.scala:431)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:205)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:196)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:190)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:155)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:183)
at 
org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:183)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:174)
at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:228)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:173)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73)
at 
org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at 
org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:143)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at 
org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:143)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73)
at 
org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71)
at 
org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88)
at org.apache.spark.sql.Dataset.withPlan(Dataset.scala:3715)
at org.apache.spark.sql.Dataset.join(Dataset.scala:1079)
at org.apache.spark.sql.Dataset.join(Dataset.scala:1041)
 ...
{code}
h2. Actual result

Empty dataframe:
{code:java}
+----+----+-----+----+----+-----+
|key1|key2|value|key1|key2|value|
+----+----+-----+----+----+-----+
+----+----+-----+----+----+-----+
{code}

  was:
When joining two dataframes, if they share the same lineage and one dataframe 

[jira] [Updated] (SPARK-36866) Pushdown filters with ANSI interval values to parquet

2021-09-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-36866:
-
Description: 
Spark doesn't push down filters with ANSI intervals to the Parquet datasource:

{code:scala}
scala> val df = Seq(Period.ofMonths(-1), Period.ofMonths(1)).toDF("i")
df: org.apache.spark.sql.DataFrame = [i: interval year to month]

scala> df.write.parquet("/Users/maximgekk/tmp/parquet_filter")

scala> val readback = spark.read.parquet("/Users/maximgekk/tmp/parquet_filter")
readback: org.apache.spark.sql.DataFrame = [i: interval year to month]

scala> readback.explain(true)
...
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [i#11] Batched: true, DataFilters: [], Format: Parquet, 
Location: InMemoryFileIndex(1 paths)[file:/Users/maximgekk/tmp/parquet_filter], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
{code}
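As a small extension of the repro above (a sketch, not from the ticket), a predicate on the interval column can be added to check pushdown; until this is supported, the predicate is expected to stay out of PushedFilters:

{code:scala}
// Filter on the ANSI year-month interval column; before the fix the
// predicate is expected to appear only in the Filter/DataFilters, while
// PushedFilters remains empty in the physical plan.
val filtered = readback.filter($"i" > java.time.Period.ofMonths(0))
filtered.explain(true)
{code}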


> Pushdown filters with ANSI interval values to parquet
> -
>
> Key: SPARK-36866
> URL: https://issues.apache.org/jira/browse/SPARK-36866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Spark doesn't push down filters with ANSI intervals to the Parquet datasource:
> {code:scala}
> scala> val df = Seq(Period.ofMonths(-1), Period.ofMonths(1)).toDF("i")
> df: org.apache.spark.sql.DataFrame = [i: interval year to month]
> scala> df.write.parquet("/Users/maximgekk/tmp/parquet_filter")
> scala> val readback = 
> spark.read.parquet("/Users/maximgekk/tmp/parquet_filter")
> readback: org.apache.spark.sql.DataFrame = [i: interval year to month]
> scala> readback.explain(true)
> ...
> == Physical Plan ==
> *(1) ColumnarToRow
> +- FileScan parquet [i#11] Batched: true, DataFilters: [], Format: Parquet, 
> Location: InMemoryFileIndex(1 
> paths)[file:/Users/maximgekk/tmp/parquet_filter], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36856) Building by "./build/mvn" may be stuck on MacOS

2021-09-28 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-36856.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 34111
[https://github.com/apache/spark/pull/34111]

> Building by "./build/mvn" may be stuck on MacOS
> ---
>
> Key: SPARK-36856
> URL: https://issues.apache.org/jira/browse/SPARK-36856
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0, 3.3.0
> Environment: MacOS 11.4
>Reporter: copperybean
>Priority: Major
> Fix For: 3.2.0
>
>
> The command "./build/mvn" gets stuck on my macOS 11.4 because it uses the 
> wrong Java home. On my Mac, "/usr/bin/java" is a real file instead of a 
> symbolic link, so the Java home is resolved to the path "/usr", which leaves 
> the launched Maven process stuck with this wrong Java home.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36856) Building by "./build/mvn" may be stuck on MacOS

2021-09-28 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-36856:
---
Affects Version/s: (was: 3.0.0)
   3.2.0

> Building by "./build/mvn" may be stuck on MacOS
> ---
>
> Key: SPARK-36856
> URL: https://issues.apache.org/jira/browse/SPARK-36856
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0, 3.3.0
> Environment: MacOS 11.4
>Reporter: copperybean
>Assignee: copperybean
>Priority: Major
> Fix For: 3.2.0
>
>
> The command "./build/mvn" gets stuck on my macOS 11.4 because it uses the 
> wrong Java home. On my Mac, "/usr/bin/java" is a real file instead of a 
> symbolic link, so the Java home is resolved to the path "/usr", which leaves 
> the launched Maven process stuck with this wrong Java home.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36856) Building by "./build/mvn" may be stuck on MacOS

2021-09-28 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang updated SPARK-36856:
---
Priority: Minor  (was: Major)

> Building by "./build/mvn" may be stuck on MacOS
> ---
>
> Key: SPARK-36856
> URL: https://issues.apache.org/jira/browse/SPARK-36856
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0, 3.3.0
> Environment: MacOS 11.4
>Reporter: copperybean
>Assignee: copperybean
>Priority: Minor
> Fix For: 3.2.0
>
>
> The command "./build/mvn" gets stuck on my macOS 11.4 because it uses the 
> wrong Java home. On my Mac, "/usr/bin/java" is a real file instead of a 
> symbolic link, so the Java home is resolved to the path "/usr", which leaves 
> the launched Maven process stuck with this wrong Java home.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36856) Building by "./build/mvn" may be stuck on MacOS

2021-09-28 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-36856:
--

Assignee: copperybean

> Building by "./build/mvn" may be stuck on MacOS
> ---
>
> Key: SPARK-36856
> URL: https://issues.apache.org/jira/browse/SPARK-36856
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.0.0, 3.3.0
> Environment: MacOS 11.4
>Reporter: copperybean
>Assignee: copperybean
>Priority: Major
> Fix For: 3.2.0
>
>
> The command "./build/mvn" gets stuck on my macOS 11.4 because it uses the 
> wrong Java home. On my Mac, "/usr/bin/java" is a real file instead of a 
> symbolic link, so the Java home is resolved to the path "/usr", which leaves 
> the launched Maven process stuck with this wrong Java home.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36866) Pushdown filters with ANSI interval values to parquet

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36866:


Assignee: Max Gekk  (was: Apache Spark)

> Pushdown filters with ANSI interval values to parquet
> -
>
> Key: SPARK-36866
> URL: https://issues.apache.org/jira/browse/SPARK-36866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Spark doesn't push down filters with ANSI intervals to the Parquet datasource:
> {code:scala}
> scala> val df = Seq(Period.ofMonths(-1), Period.ofMonths(1)).toDF("i")
> df: org.apache.spark.sql.DataFrame = [i: interval year to month]
> scala> df.write.parquet("/Users/maximgekk/tmp/parquet_filter")
> scala> val readback = 
> spark.read.parquet("/Users/maximgekk/tmp/parquet_filter")
> readback: org.apache.spark.sql.DataFrame = [i: interval year to month]
> scala> readback.explain(true)
> ...
> == Physical Plan ==
> *(1) ColumnarToRow
> +- FileScan parquet [i#11] Batched: true, DataFilters: [], Format: Parquet, 
> Location: InMemoryFileIndex(1 
> paths)[file:/Users/maximgekk/tmp/parquet_filter], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36866) Pushdown filters with ANSI interval values to parquet

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421277#comment-17421277
 ] 

Apache Spark commented on SPARK-36866:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/34115

> Pushdown filters with ANSI interval values to parquet
> -
>
> Key: SPARK-36866
> URL: https://issues.apache.org/jira/browse/SPARK-36866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Spark doesn't push down filters with ANSI intervals to the Parquet datasource:
> {code:scala}
> scala> val df = Seq(Period.ofMonths(-1), Period.ofMonths(1)).toDF("i")
> df: org.apache.spark.sql.DataFrame = [i: interval year to month]
> scala> df.write.parquet("/Users/maximgekk/tmp/parquet_filter")
> scala> val readback = 
> spark.read.parquet("/Users/maximgekk/tmp/parquet_filter")
> readback: org.apache.spark.sql.DataFrame = [i: interval year to month]
> scala> readback.explain(true)
> ...
> == Physical Plan ==
> *(1) ColumnarToRow
> +- FileScan parquet [i#11] Batched: true, DataFilters: [], Format: Parquet, 
> Location: InMemoryFileIndex(1 
> paths)[file:/Users/maximgekk/tmp/parquet_filter], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36866) Pushdown filters with ANSI interval values to parquet

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36866:


Assignee: Apache Spark  (was: Max Gekk)

> Pushdown filters with ANSI interval values to parquet
> -
>
> Key: SPARK-36866
> URL: https://issues.apache.org/jira/browse/SPARK-36866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Spark doesn't push down filters with ANSI intervals to the Parquet datasource:
> {code:scala}
> scala> val df = Seq(Period.ofMonths(-1), Period.ofMonths(1)).toDF("i")
> df: org.apache.spark.sql.DataFrame = [i: interval year to month]
> scala> df.write.parquet("/Users/maximgekk/tmp/parquet_filter")
> scala> val readback = 
> spark.read.parquet("/Users/maximgekk/tmp/parquet_filter")
> readback: org.apache.spark.sql.DataFrame = [i: interval year to month]
> scala> readback.explain(true)
> ...
> == Physical Plan ==
> *(1) ColumnarToRow
> +- FileScan parquet [i#11] Batched: true, DataFilters: [], Format: Parquet, 
> Location: InMemoryFileIndex(1 
> paths)[file:/Users/maximgekk/tmp/parquet_filter], PartitionFilters: [], 
> PushedFilters: [], ReadSchema: struct
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36809) Remove broadcast for InSubqueryExec used in DPP

2021-09-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36809.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34051
[https://github.com/apache/spark/pull/34051]

> Remove broadcast for InSubqueryExec used in DPP
> ---
>
> Key: SPARK-36809
> URL: https://issues.apache.org/jira/browse/SPARK-36809
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently we include a broadcast variable in InSubqueryExec. We use it to 
> hold the filtering-side query result of DPP. It looks weird because we don't use 
> the result in executors; we only need it in the driver during query 
> planning. We already hold the original result, so basically we hold two 
> copies of the query result at this moment.
> A related point: in pruningHasBenefit we estimate whether DPP pruning has a 
> benefit when the join type does not support broadcast. Due to the broadcast 
> variable above, we also check the filtering side against the config 
> autoBroadcastJoinThreshold. The config is not meant for this purpose and it is not a 
> broadcast join. As the broadcast variable is unnecessary, we can remove this 
> check and leave benefit estimation to the overhead and pruning-size checks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36781) The log could not get the correct line number

2021-09-28 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421281#comment-17421281
 ] 

Senthil Kumar commented on SPARK-36781:
---

[~chenxusheng] Could you please share the sample code to simulate this issue?

> The log could not get the correct line number
> -
>
> Key: SPARK-36781
> URL: https://issues.apache.org/jira/browse/SPARK-36781
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.3, 3.1.2
>Reporter: chenxusheng
>Priority: Major
>
> INFO 18:16:46 [Thread-1] 
> org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color})
>  MemoryStore cleared
>  INFO 18:16:46 [Thread-1] 
> org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color})
>  BlockManager stopped
>  INFO 18:16:46 [Thread-1] 
> org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color})
>  BlockManagerMaster stopped
>  INFO 18:16:46 [dispatcher-event-loop-0] 
> org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color})
>  OutputCommitCoordinator stopped!
>  INFO 18:16:46 [Thread-1] 
> org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color})
>  Successfully stopped SparkContext
>  INFO 18:16:46 [Thread-1] 
> org.apache.spark.internal.Logging$class.logInfo({color:#FF}Logging.scala:54{color})
>  Shutdown hook called
> all are : {color:#FF}Logging.scala:54{color}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36866) Pushdown filters with ANSI interval values to parquet

2021-09-28 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk updated SPARK-36866:
-
Description: 
Spark doesn't push down filters with ANSI intervals to the Parquet datasource. 
The ticket aims to support them.


  was:
Spark doesn't push down filters with ANSI intervals to the Parquet datasource:

{code:scala}
scala> val df = Seq(Period.ofMonths(-1), Period.ofMonths(1)).toDF("i")
df: org.apache.spark.sql.DataFrame = [i: interval year to month]

scala> df.write.parquet("/Users/maximgekk/tmp/parquet_filter")

scala> val readback = spark.read.parquet("/Users/maximgekk/tmp/parquet_filter")
readback: org.apache.spark.sql.DataFrame = [i: interval year to month]

scala> readback.explain(true)
...
== Physical Plan ==
*(1) ColumnarToRow
+- FileScan parquet [i#11] Batched: true, DataFilters: [], Format: Parquet, 
Location: InMemoryFileIndex(1 paths)[file:/Users/maximgekk/tmp/parquet_filter], 
PartitionFilters: [], PushedFilters: [], ReadSchema: struct
{code}



> Pushdown filters with ANSI interval values to parquet
> -
>
> Key: SPARK-36866
> URL: https://issues.apache.org/jira/browse/SPARK-36866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Spark doesn't push down filters with ANSI intervals to the Parquet 
> datasource. The ticket aims to support them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36875) 21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resourc

2021-09-28 Thread Davide Benedetto (Jira)
Davide Benedetto created SPARK-36875:


 Summary: 21/09/28 11:18:51 WARN YarnScheduler: Initial job has not 
accepted any resources; check your cluster UI to ensure that workers are 
registered and have sufficient resources
 Key: SPARK-36875
 URL: https://issues.apache.org/jira/browse/SPARK-36875
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, Spark Submit
Affects Versions: 3.1.2
 Environment: Eclipse
Hadoop 3.3

Spark3.1.2-hadoop3.2

Dependencies

 
{noformat}
org.apache.spark : spark-core_2.12 : 3.1.2
org.apache.spark : spark-sql_2.12 : 3.1.2 (excluding org.codehaus.janino : janino)
org.codehaus.janino : janino : 3.0.8
org.apache.spark : spark-yarn_2.12 : 3.1.2 (provided)
org.scala-lang : scala-library : 2.12.13

Environment variables set in Eclipse: SPARK_HOME=path/to/my/sparkfolder

OS: Linux with Ubuntu 20
The test is launched as my first user, davben.
The Spark folder and Hadoop are on my second user, hadoop.{noformat}
 

 
Reporter: Davide Benedetto
 Fix For: 3.1.2


Hi, I am running a Spark job on YARN programmatically using the Eclipse IDE.
Here I:

1: open the Spark session, passing a SparkConf as the input parameter,

 
{quote} 
{code:java}
System.setProperty("hadoop.home.dir", "/home/hadoop/hadoop");
System.setProperty("SPARK_YARN_MODE", "yarn");
System.setProperty("HADOOP_USER_NAME", "hadoop");

SparkConf sparkConf = new SparkConf().setAppName("simpleTest2").setMaster("yarn")
    .set("spark.executor.memory", "1g")
    .set("deploy.mode", "cluster")
    .set("spark.yarn.stagingDir", "hdfs://localhost:9000/user/hadoop/")
    .set("spark.yarn.am.memory", "512m")
    .set("spark.dynamicAllocation.minExecutors", "1")
    .set("spark.dynamicAllocation.maxExecutors", "40")
    .set("spark.dynamicAllocation.initialExecutors", "2")
    .set("spark.shuffle.service.enabled", "true")
    .set("spark.dynamicAllocation.enabled", "false")
    .set("spark.cores.max", "1")
    .set("spark.yarn.executor.memoryOverhead", "500m")
    .set("spark.executor.instances", "2")
    .set("spark.executor.memory", "500m")
    .set("spark.num.executors", "2")
    .set("spark.executor.cores", "1")
    .set("spark.worker.instances", "1")
    .set("spark.worker.memory", "512m")
    .set("spark.worker.max.heapsize", "512m")
    .set("spark.worker.cores", "1")
    .set("maximizeResourceAllocation", "true")
    .set("spark.yarn.nodemanager.resource.cpu-vcores", "4")
    .set("spark.yarn.submit.file.replication", "1");

SparkSession spark = SparkSession.builder().config(sparkConf).getOrCreate();
{code}
 
{quote}
2: create a dataset of Rows and show them,
{code:java}
List<Row> rows = new ArrayList<>();
rows.add(RowFactory.create("a", "b"));
rows.add(RowFactory.create("b", "c"));
rows.add(RowFactory.create("a", "a"));

StructType structType = new StructType();
structType = structType.add("edge_1", DataTypes.StringType, false);
structType = structType.add("edge_2", DataTypes.StringType, false);
ExpressionEncoder<Row> edgeEncoder = RowEncoder.apply(structType);

Dataset<Row> edge = spark.createDataset(rows, edgeEncoder);
edge.show();
{code}
{{From here everything is OK: the job is submitted on Hadoop and the rows are shown correctly.}}

{{3: I perform a map that upper-cases the elements in each row}}

 
{quote}
{code:java}
Dataset<Row> edge2 = edge.map(new MyFunction2(), edgeEncoder);{code}
{quote}
 
{quote}
{code:java}
public static class MyFunction2 implements MapFunction<Row, Row> {

    private static final long serialVersionUID = 1L;

    @Override
    public Row call(Row v1) throws Exception {
        String el1 = v1.get(0).toString().toUpperCase();
        String el2 = v1.get(1).toString().toUpperCase();
        return RowFactory.create(el1, el2);
    }
}
{code}
{quote}
{{4: Then I show the dataset after the map is performed}}
{quote}
{code:java}
edge2.show();{code}
{quote}
{{And precisely here the log loops, saying:}}

{{21/09/28 11:18:51 WARN YarnScheduler: Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources}}

 
{quote}{{Here is the full log}}

log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/09/28 03:05:16 WARN Utils: Your hostname, davben-lubuntu resolves to a loopback address: 127.0.1.1; using 192.168.1.36 instead (on interface wlo1)
21/09/28 03:05:16 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to ano

[jira] [Resolved] (SPARK-36863) Update dependency manifests for all released artifacts

2021-09-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36863.
--
Fix Version/s: 3.3.0
 Assignee: Chao Sun
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/34119

> Update dependency manifests for all released artifacts
> --
>
> Key: SPARK-36863
> URL: https://issues.apache.org/jira/browse/SPARK-36863
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Minor
> Fix For: 3.3.0
>
>
> We should update dependency manifests for all released artifacts. Currently 
> we don't do this for modules such as {{hadoop-cloud}}, {{kinesis-asl}}, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36873) Add provided Guava dependency for network-yarn module

2021-09-28 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-36873.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 34125
[https://github.com/apache/spark/pull/34125]

> Add provided Guava dependency for network-yarn module
> -
>
> Key: SPARK-36873
> URL: https://issues.apache.org/jira/browse/SPARK-36873
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.2.0
>
>
> In Spark 3.1 and earlier, the network-yarn module implicitly relied on Guava 
> from the hadoop-client dependency. This was changed by SPARK-33212, where we 
> moved to the shaded Hadoop client, which no longer exposes the transitive Guava 
> dependency. That was fine for a while since we were not using 
> {{createDependencyReducedPom}}, so the module picked up the transitive 
> dependency from {{spark-network-common}}. However, this changed with 
> SPARK-36835, when we restored {{createDependencyReducedPom}}, and now the module 
> is no longer able to find Guava classes:
> {code}
> mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver 
> -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 
> -Pspark-ganglia-lgpl -Pyarn
> ...
> [INFO] Compiling 1 Java source to 
> /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
> [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
>  package com.google.common.annotations does not exist
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
>  package com.google.common.base does not exist
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
>  package com.google.common.collect does not exist
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118:
>  cannot find symbol
>   symbol:   class VisibleForTesting
>   location: class org.apache.spark.network.yarn.YarnShuffleService
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36836) "sha2" expression with bit_length of 224 returns incorrect results

2021-09-28 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-36836.

Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 34086
[https://github.com/apache/spark/pull/34086]

> "sha2" expression with bit_length of 224 returns incorrect results
> --
>
> Key: SPARK-36836
> URL: https://issues.apache.org/jira/browse/SPARK-36836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: Richard Chen
>Priority: Major
> Fix For: 3.2.0
>
>
> {{sha2(input, bit_length)}} returns incorrect results when {{bit_length == 
> 224}}.
>  
> This bug seems to have been present since the {{sha2}} expression was 
> introduced in 1.5.0.
>  
> Repro in spark shell:
> {{spark.sql("SELECT sha2('abc', 224)").show()}}
>  
> Spark currently returns a garbled string, consisting of invalid UTF:
>  {{#\t}"4�"�B�w��U�*��你���l��}}
> The expected return value is: 
> {{23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7}}
>  
> This appears to happen because the {{MessageDigest.digest()}} function 
> returns bytes intended to be interpreted as a {{BigInt}} rather than a 
> string. Thus, the output of {{MessageDigest.digest()}} must first be 
> interpreted as a {{BigInt}} and then transformed into a hex string. 
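
A minimal sketch of the described conversion (not the actual Spark patch; the helper name {{sha224Hex}} is made up for illustration):
{code:scala}
import java.math.BigInteger
import java.security.MessageDigest

// Interpret the SHA-224 digest bytes as an unsigned BigInteger and render them as hex.
// "%056x" pads to 56 hex chars (28 bytes), so leading zero bytes are not dropped.
def sha224Hex(input: Array[Byte]): String = {
  val digest = MessageDigest.getInstance("SHA-224").digest(input)
  String.format("%056x", new BigInteger(1, digest))
}

// sha224Hex("abc".getBytes("UTF-8"))
// => 23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7
{code}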



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36836) "sha2" expression with bit_length of 224 returns incorrect results

2021-09-28 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-36836:
--

Assignee: Richard Chen

> "sha2" expression with bit_length of 224 returns incorrect results
> --
>
> Key: SPARK-36836
> URL: https://issues.apache.org/jira/browse/SPARK-36836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0
>Reporter: Richard Chen
>Assignee: Richard Chen
>Priority: Major
> Fix For: 3.2.0
>
>
> {{sha2(input, bit_length)}} returns incorrect results when {{bit_length == 
> 224}}.
>  
> This bug seems to have been present since the {{sha2}} expression was 
> introduced in 1.5.0.
>  
> Repro in spark shell:
> {{spark.sql("SELECT sha2('abc', 224)").show()}}
>  
> Spark currently returns a garbled string, consisting of invalid UTF:
>  {{#\t}"4�"�B�w��U�*��你���l��}}
> The expected return value is: 
> {{23097d223405d8228642a477bda255b32aadbce4bda0b3f7e36c9da7}}
>  
> This appears to happen because the {{MessageDigest.digest()}} function 
> returns bytes intended to be interpreted as a {{BigInt}} rather than a 
> string. Thus, the output of {{MessageDigest.digest()}} must first be 
> interpreted as a {{BigInt}} and then transformed into a hex string. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36876) Support Dynamic Partition pruning for HiveTableScanExec

2021-09-28 Thread angerszhu (Jira)
angerszhu created SPARK-36876:
-

 Summary: Support Dynamic Partition pruning for HiveTableScanExec
 Key: SPARK-36876
 URL: https://issues.apache.org/jira/browse/SPARK-36876
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.2, 3.0.3, 3.2.0
Reporter: angerszhu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36876) Support Dynamic Partition pruning for HiveTableScanExec

2021-09-28 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-36876:
--
Description: 
Support dynamic partition pruning for hive serde scan


> Support Dynamic Partition pruning for HiveTableScanExec
> ---
>
> Key: SPARK-36876
> URL: https://issues.apache.org/jira/browse/SPARK-36876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Support dynamic partition pruning for hive serde scan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36861) Partition columns are overly eagerly parsed as dates

2021-09-28 Thread Senthil Kumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421328#comment-17421328
 ] 

Senthil Kumar commented on SPARK-36861:
---

[~tanelk] This issue is not reproducible even in 3.1.2

 

root
 |-- i: integer (nullable = true)
 |-- hour: string (nullable = true)

> Partition columns are overly eagerly parsed as dates
> 
>
> Key: SPARK-36861
> URL: https://issues.apache.org/jira/browse/SPARK-36861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Blocker
>
> I have an input directory with subdirs:
> * hour=2021-01-01T00
> * hour=2021-01-01T01
> * hour=2021-01-01T02
> * ...
> in spark 3.1 the 'hour' column is parsed as a string type, but in 3.2 RC it 
> is parsed as date type and the hour part is lost.
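
For reference, a minimal repro sketch of the reported behaviour (the scratch path and the workaround config are assumptions for illustration, not something stated in the ticket):
{code:scala}
// Assumes a SparkSession named `spark` (e.g. in spark-shell).
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions.lit
import spark.implicits._

val base = "/tmp/spark36861_demo"  // hypothetical scratch path

Seq(1, 2, 3).toDF("i")
  .withColumn("hour", lit("2021-01-01T00"))
  .write.mode(SaveMode.Overwrite).partitionBy("hour").parquet(base)

// On 3.1 this prints `hour: string`; the report is that on the 3.2 RC the partition
// value is inferred as a date, so the "T00" hour suffix is lost.
spark.read.parquet(base).printSchema()

// Possible workaround (assumption, not verified against the RC): disable partition
// column type inference so the column stays a string.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
spark.read.parquet(base).printSchema()
{code}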



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36861) Partition columns are overly eagerly parsed as dates

2021-09-28 Thread Tanel Kiis (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421334#comment-17421334
 ] 

Tanel Kiis commented on SPARK-36861:


Yes, in 3.1 it is parsed as string. In 3.3 (master) it is parsed as date.

> Partition columns are overly eagerly parsed as dates
> 
>
> Key: SPARK-36861
> URL: https://issues.apache.org/jira/browse/SPARK-36861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Blocker
>
> I have an input directory with subdirs:
> * hour=2021-01-01T00
> * hour=2021-01-01T01
> * hour=2021-01-01T02
> * ...
> in spark 3.1 the 'hour' column is parsed as a string type, but in 3.2 RC it 
> is parsed as date type and the hour part is lost.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36866) Pushdown filters with ANSI interval values to parquet

2021-09-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36866.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34115
[https://github.com/apache/spark/pull/34115]

> Pushdown filters with ANSI interval values to parquet
> -
>
> Key: SPARK-36866
> URL: https://issues.apache.org/jira/browse/SPARK-36866
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> Spark doesn't push down filters with ANSI intervals to the Parquet 
> datasource. The ticket aims to support them.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36849) Migrate UseStatement to v2 command framework

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421358#comment-17421358
 ] 

Apache Spark commented on SPARK-36849:
--

User 'dohongdayi' has created a pull request for this issue:
https://github.com/apache/spark/pull/34127

> Migrate UseStatement to v2 command framework
> 
>
> Key: SPARK-36849
> URL: https://issues.apache.org/jira/browse/SPARK-36849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36848) Migrate ShowCurrentNamespaceStatement to v2 command framework

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421369#comment-17421369
 ] 

Apache Spark commented on SPARK-36848:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/34128

> Migrate ShowCurrentNamespaceStatement to v2 command framework
> -
>
> Key: SPARK-36848
> URL: https://issues.apache.org/jira/browse/SPARK-36848
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36862) ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2021-09-28 Thread Magdalena Pilawska (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421374#comment-17421374
 ] 

Magdalena Pilawska commented on SPARK-36862:


I get the physical execution plan as part of the output log, but I cannot share 
it publicly, if that is what you mean.

Any thoughts on why the same works on 3.0.0? 

> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-36862
> URL: https://issues.apache.org/jira/browse/SPARK-36862
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 3.1.1, 3.1.2
> Environment: Spark 3.1.1 and Spark 3.1.2
> hadoop 3.2.1
>Reporter: Magdalena Pilawska
>Priority: Major
>
> Hi,
> I am getting the following error running spark-submit command:
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 321, Column 103: ')' expected instead of '['
>  
> It fails running the spark sql command on delta lake: 
> spark.sql(sqlTransformation)
> The template of sqlTransformation is as follows:
> MERGE INTO target_table AS d
>  USING source_table AS s 
>  on s.id = d.id
>  WHEN MATCHED AND d.hash_value <> s.hash_value
>  THEN UPDATE SET d.name =s.name, d.address = s.address
>  
> It is a permanent error for both *spark 3.1.1* and *3.1.2* versions.
>  
> The same works fine with spark 3.0.0.
>  
> Here is the full log:
> 2021-09-22 16:43:22,110 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 55, Column 103: ')' expected instead of '['2021-09-22 16:43:22,110 ERROR 
> CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 55, Column 103: ')' expected instead of 
> '['org.codehaus.commons.compiler.CompileException: File 'generated.java', 
> Line 55, Column 103: ')' expected instead of '[' at 
> org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
>  at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:150) at 
> org.codehaus.janino.Parser.read(Parser.java:3703) at 
> org.codehaus.janino.Parser.parseFormalParameters(Parser.java:1622) at 
> org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:1518) at 
> org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:1028) at 
> org.codehaus.janino.Parser.parseClassBody(Parser.java:841) at 
> org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:736) at 
> org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:941) at 
> org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:234) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:205) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1427)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1524)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1521)
>  at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
> at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1375)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181) at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:1

[jira] [Comment Edited] (SPARK-36862) ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2021-09-28 Thread Magdalena Pilawska (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421374#comment-17421374
 ] 

Magdalena Pilawska edited comment on SPARK-36862 at 9/28/21, 1:11 PM:
--

I get the physical execution plan as part of the output log, but I cannot share 
it publicly, if that is what you mean.

Any thoughts on why the same works on spark 3.0.0? 


was (Author: mpilaw):
I get the physical execution plan as part of the output log, but I cannot share 
it publicly, if that is what you mean.

Any thoughts on why the same works on 3.0.0? 

> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-36862
> URL: https://issues.apache.org/jira/browse/SPARK-36862
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 3.1.1, 3.1.2
> Environment: Spark 3.1.1 and Spark 3.1.2
> hadoop 3.2.1
>Reporter: Magdalena Pilawska
>Priority: Major
>
> Hi,
> I am getting the following error running spark-submit command:
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 321, Column 103: ')' expected instead of '['
>  
> It fails running the spark sql command on delta lake: 
> spark.sql(sqlTransformation)
> The template of sqlTransformation is as follows:
> MERGE INTO target_table AS d
>  USING source_table AS s 
>  on s.id = d.id
>  WHEN MATCHED AND d.hash_value <> s.hash_value
>  THEN UPDATE SET d.name =s.name, d.address = s.address
>  
> It is a permanent error for both *spark 3.1.1* and *3.1.2* versions.
>  
> The same works fine with spark 3.0.0.
>  
> Here is the full log:
> 2021-09-22 16:43:22,110 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 55, Column 103: ')' expected instead of '['2021-09-22 16:43:22,110 ERROR 
> CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 55, Column 103: ')' expected instead of 
> '['org.codehaus.commons.compiler.CompileException: File 'generated.java', 
> Line 55, Column 103: ')' expected instead of '[' at 
> org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
>  at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:150) at 
> org.codehaus.janino.Parser.read(Parser.java:3703) at 
> org.codehaus.janino.Parser.parseFormalParameters(Parser.java:1622) at 
> org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:1518) at 
> org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:1028) at 
> org.codehaus.janino.Parser.parseClassBody(Parser.java:841) at 
> org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:736) at 
> org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:941) at 
> org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:234) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:205) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1427)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1524)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1521)
>  at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
> at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1375)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution

[jira] [Updated] (SPARK-36862) ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2021-09-28 Thread Magdalena Pilawska (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Magdalena Pilawska updated SPARK-36862:
---
Affects Version/s: (was: 3.1.2)
  Description: 
Hi,

I am getting the following error running spark-submit command:

ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
321, Column 103: ')' expected instead of '['

 

It fails running the spark sql command on delta lake: 
spark.sql(sqlTransformation)

The template of sqlTransformation is as follows:

MERGE INTO target_table AS d
 USING source_table AS s 
 on s.id = d.id
 WHEN MATCHED AND d.hash_value <> s.hash_value
 THEN UPDATE SET d.name =s.name, d.address = s.address

 

It is a permanent error for the *spark 3.1.1* version.

 

The same works fine with spark 3.0.0.

 

Here is the full log:

2021-09-22 16:43:22,110 ERROR CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 55, 
Column 103: ')' expected instead of '['2021-09-22 16:43:22,110 ERROR 
CodeGenerator: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 55, 
Column 103: ')' expected instead of 
'['org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
55, Column 103: ')' expected instead of '[' at 
org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362) 
at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:150) at 
org.codehaus.janino.Parser.read(Parser.java:3703) at 
org.codehaus.janino.Parser.parseFormalParameters(Parser.java:1622) at 
org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:1518) at 
org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:1028) at 
org.codehaus.janino.Parser.parseClassBody(Parser.java:841) at 
org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:736) at 
org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:941) at 
org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:234) at 
org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:205) at 
org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1427)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1524)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1521)
 at 
org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
 at 
org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379) 
at 
org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
 at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) at 
org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
 at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1375)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181) at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:160)
 at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD(ShuffleExchangeExec.scala:160)
 at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture$lzycompute(ShuffleExchangeExec.scala:164)
 at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.mapOutputStatisticsFuture(ShuffleExchangeExec.scala:163)
 at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeLike.$anonfun$materializeFuture$2(ShuffleExchangeExec.scala:100)
 at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) 
at 
org.apache.spark.sql.execution.exchange.ShuffleExchangeLike.$anonfun$materializeFuture$1(ShuffleExchangeExec.scala:100)
 at org.apache.spark.sql.util.LazyValue.getOrInit(LazyValue.scala:41) at 
org.apache.spark.sql.execution.exchange.Exchange.getOrInitMaterializeFuture(Exchange.scala:68)
 at 
org.apache.spark.sql.execution.ex

[jira] [Assigned] (SPARK-36873) Add provided Guava dependency for network-yarn module

2021-09-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-36873:
-

Assignee: Chao Sun  (was: Apache Spark)

> Add provided Guava dependency for network-yarn module
> -
>
> Key: SPARK-36873
> URL: https://issues.apache.org/jira/browse/SPARK-36873
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.2.0
>
>
> In Spark 3.1 and earlier the network-yarn module implicitly relies on guava 
> from hadoop-client dependency, which was changed by SPARK-33212 where we 
> moved to shaded Hadoop client which no longer expose the transitive guava 
> dependency. This was fine for a while since we were not using 
> {{createDependencyReducedPom}} so the module picks up the transitive 
> dependency from {{spark-network-common}}. However, this got changed by 
> SPARK-36835 when we restored {{createDependencyReducedPom}} and now it is no 
> longer able to find guava classes:
> {code}
> mvn test -pl common/network-yarn -Phadoop-3.2 -Phive-thriftserver 
> -Pkinesis-asl -Pkubernetes -Pmesos -Pnetlib-lgpl -Pscala-2.12 
> -Pspark-ganglia-lgpl -Pyarn
> ...
> [INFO] Compiling 1 Java source to 
> /Users/sunchao/git/spark/common/network-yarn/target/scala-2.12/classes ...
> [WARNING] [Warn] : bootstrap class path not set in conjunction with -source 8
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:32:
>  package com.google.common.annotations does not exist
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:33:
>  package com.google.common.base does not exist
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:34:
>  package com.google.common.collect does not exist
> [ERROR] [Error] 
> /Users/sunchao/git/spark/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:118:
>  cannot find symbol
>   symbol:   class VisibleForTesting
>   location: class org.apache.spark.network.yarn.YarnShuffleService
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36877) Calling ds.rdd with AQE enabled leads to being jobs being run, eventually causing reruns

2021-09-28 Thread Shardul Mahadik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shardul Mahadik updated SPARK-36877:

Attachment: Screen Shot 2021-09-28 at 09.32.20.png

> Calling ds.rdd with AQE enabled leads to being jobs being run, eventually 
> causing reruns
> 
>
> Key: SPARK-36877
> URL: https://issues.apache.org/jira/browse/SPARK-36877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: Screen Shot 2021-09-28 at 09.32.20.png
>
>
> In one of our jobs we perform the following operation:
> {code:scala}
> val df = /* some expensive multi-table/multi-stage join */
> val numPartitions = df.rdd.getNumPartitions
> df.repartition(x).write.
> {code}
> With AQE enabled, we found that the expensive stages were being run twice 
> causing significant performance regression after enabling AQE; once when 
> calling {{df.rdd}} and again when calling {{df.write}}.
> A more concrete example:
> {code:scala}
> scala> sql("SET spark.sql.adaptive.enabled=true")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val df1 = spark.range(10).withColumn("id2", $"id")
> df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
> "id").join(spark.range(10), "id")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df3 = df2.groupBy("id2").count()
> df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]
> scala> df3.rdd.getNumPartitions
> res2: Int = 10(0 + 16) / 
> 16]
> scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
> {code}
> In the screenshot below, you can see that the first 3 stages (0 to 4) were 
> rerun again (5 to 9).
> I have two questions:
> 1) Should calling df.rdd trigger actual job execution when AQE is enabled?
> 2) Should calling df.write later cause rerun of the stages? If df.rdd has 
> already partially executed the stages, shouldn't it reuse the result from 
> previous stages?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36877) Calling ds.rdd with AQE enabled leads to being jobs being run, eventually causing reruns

2021-09-28 Thread Shardul Mahadik (Jira)
Shardul Mahadik created SPARK-36877:
---

 Summary: Calling ds.rdd with AQE enabled leads to being jobs being 
run, eventually causing reruns
 Key: SPARK-36877
 URL: https://issues.apache.org/jira/browse/SPARK-36877
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.2, 3.2.1
Reporter: Shardul Mahadik
 Attachments: Screen Shot 2021-09-28 at 09.32.20.png

In one of our jobs we perform the following operation:
{code:scala}
val df = /* some expensive multi-table/multi-stage join */
val numPartitions = df.rdd.getNumPartitions
df.repartition(x).write.
{code}

With AQE enabled, we found that the expensive stages were being run twice 
causing significant performance regression after enabling AQE; once when 
calling {{df.rdd}} and again when calling {{df.write}}.

A more concrete example:
{code:scala}
scala> sql("SET spark.sql.adaptive.enabled=true")
res0: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
res1: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> val df1 = spark.range(10).withColumn("id2", $"id")
df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]

scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
"id").join(spark.range(10), "id")
df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]

scala> val df3 = df2.groupBy("id2").count()
df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]

scala> df3.rdd.getNumPartitions
res2: Int = 10(0 + 16) / 16]

scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
{code}

In the screenshot below, you can see that the first 3 stages (0 to 4) were 
rerun again (5 to 9).

I have two questions:
1) Should calling df.rdd trigger actual job execution when AQE is enabled?
2) Should calling df.write later cause rerun of the stages? If df.rdd has 
already partially executed the stages, shouldn't it reuse the result from 
previous stages?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36877) Calling ds.rdd with AQE enabled leads to being jobs being run, eventually causing reruns

2021-09-28 Thread Shardul Mahadik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421510#comment-17421510
 ] 

Shardul Mahadik commented on SPARK-36877:
-

cc: [~cloud_fan] [~mridulm80]

> Calling ds.rdd with AQE enabled leads to being jobs being run, eventually 
> causing reruns
> 
>
> Key: SPARK-36877
> URL: https://issues.apache.org/jira/browse/SPARK-36877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: Screen Shot 2021-09-28 at 09.32.20.png
>
>
> In one of our jobs we perform the following operation:
> {code:scala}
> val df = /* some expensive multi-table/multi-stage join */
> val numPartitions = df.rdd.getNumPartitions
> df.repartition(x).write.
> {code}
> With AQE enabled, we found that the expensive stages were being run twice 
> causing significant performance regression after enabling AQE; once when 
> calling {{df.rdd}} and again when calling {{df.write}}.
> A more concrete example:
> {code:scala}
> scala> sql("SET spark.sql.adaptive.enabled=true")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val df1 = spark.range(10).withColumn("id2", $"id")
> df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
> "id").join(spark.range(10), "id")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df3 = df2.groupBy("id2").count()
> df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]
> scala> df3.rdd.getNumPartitions
> res2: Int = 10(0 + 16) / 
> 16]
> scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
> {code}
> In the screenshot below, you can see that the first 3 stages (0 to 4) were 
> rerun again (5 to 9).
> I have two questions:
> 1) Should calling df.rdd trigger actual job execution when AQE is enabled?
> 2) Should calling df.write later cause rerun of the stages? If df.rdd has 
> already partially executed the stages, shouldn't it reuse the result from 
> previous stages?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36878) Optimization in PushDownPredicates to push all filters in a single iteration has broken some optimizations in PruneFilter rule

2021-09-28 Thread Asif (Jira)
Asif created SPARK-36878:


 Summary: Optimization in PushDownPredicates to push all filters in 
a single iteration has broken  some optimizations in PruneFilter rule
 Key: SPARK-36878
 URL: https://issues.apache.org/jira/browse/SPARK-36878
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Asif


It appears that the optimization in the PushDownPredicates rule, which pushes all 
filters in a single pass to reduce iterations, has broken the PruneFilters rule's 
substitution of an EmptyRelation when the filter condition is composite and 
statically evaluates to false, either because one of the non-redundant predicates 
is Literal(false) or because all the non-redundant predicates are null.

The new PushDownPredicates rule is created by chaining CombineFilters, 
PushPredicateThroughNonJoin and PushPredicateThroughJoin, so individual filters 
will get combined into a single filter while being pushed.

But the PruneFilters rule does not substitute an empty relation if the filter is 
composite; it is coded to handle single predicates.

The test is falsely passing as it is testing PushPredicateThroughNonJoin, which 
does not combine filters, while the actual rule in action includes the effect 
produced by CombineFilters.

In fact I believe all the other tests which individually test 
PushPredicateThroughNonJoin or PushPredicateThroughJoin should be corrected 
(maybe with the PushDownPredicates rule) and retested.

I will add a bug test & open PR.
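
For illustration, a minimal sketch (made-up column names and values, not from the ticket) of the kind of composite, statically-false filter condition being described:
{code:scala}
// Two stacked filters that CombineFilters (inside PushDownPredicates) will merge
// into a single composite condition containing Literal(false).
import org.apache.spark.sql.functions._

val df = spark.range(10).toDF("id")
val pruned = df.filter(col("id") > 1).filter(lit(false))

// The expectation is that the optimized plan collapses to an empty LocalRelation,
// since the combined condition `(id > 1) AND false` is statically false.
pruned.explain(true)
{code}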

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36786) SPIP: Improving the compile time performance, by improving a couple of rules, from 24 hrs to under 8 minutes

2021-09-28 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-36786:
-
Description: 
h2. Q1. What are you trying to do? Articulate your objectives using absolutely 
no jargon.

The aim is to improve the compile time performance of query which in WorkDay's 
use case takes > 24 hrs ( & eventually fails) , to  < 8 min.

To explain the problem, I will provide the context.

The query plan in our production system, is huge, with nested *case when* 
expressions ( level of nesting could be >  8) , where each *case when* can have 
branches sometimes > 1000.

The plan could look like
{quote}Project1
    |
   Filter 1
    |

Project2
    |
 Filter2
    |
 Project3
    |
 Filter3

  |

Join
{quote}
Now the optimizer has a Batch of Rules, intended to run at most 100 times.

*Also note that the batch will continue to run till one of the conditions is 
satisfied,*

*i.e. either numIter == 100 || inputPlan == outputPlan (idempotency is 
achieved).*

One of the early rules is *PushDownPredicate*, followed by *CollapseProject*.

 

The first issue is *PushDownPredicate* rule.

It picks one filter at a time & pushes it to the lowest level (I understand that 
in 3.1 it pushes through joins, while in 2.4 it stops at a Join), but in either 
case it picks 1 filter at a time starting from the top, in each iteration.

*The above comment is no longer true in the 3.1 release as it now combines filters, 
so it now pushes all the encountered filters in a single pass. But it still 
materializes the filter on each push by re-aliasing.*

*Also it seems that with the change, the PruneFilters rule's functionality of 
replacement with an empty relation appears to be broken by the addition of the 
CombineFilters rule in the PushDownPredicates rule. More on it later.*

*I have filed a new bug ticket ([PruneFilter optimization 
broken|https://issues.apache.org/jira/browse/SPARK-36789])*

So if there are, say, 50 projects interspersed with filters, idempotency is 
guaranteed not to be achieved till around 49 iterations. Moreover, 
CollapseProject will also be modifying the tree on each iteration as a filter 
will get removed within a Project.

Moreover, on each movement of a filter through the project tree, the filter is 
re-aliased using the transformUp rule. transformUp is very expensive compared to 
transformDown. As the filter keeps getting pushed down, its size increases.
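
To make the plan shape concrete, here is an illustrative sketch (not from our production query) of many Projects interspersed with Filters:
{code:scala}
// Assumes a SparkSession named `spark`.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

val base: DataFrame = spark.range(1000).toDF("c0")

// Each step adds a Project (withColumn) followed by a Filter, so pushing roughly one
// filter per optimizer iteration needs about one iteration per layer.
val layered = (1 to 50).foldLeft(base) { (df, i) =>
  df.withColumn(s"c$i", col(s"c${i - 1}") + 1).filter(col(s"c$i") > i)
}
layered.explain(true)
{code}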

To optimize this rule , 2 things are needed
 # Instead of pushing one filter at a time,  collect all the filters as we 
traverse the tree in that iteration itself.
 # Do not re-alias the filters on each push. Collect the sequence of projects 
it has passed through, and  when the filters have reached their resting place, 
do the re-alias by processing the projects collected in down to up manner.

This will result in achieving idempotency in a couple of iterations. 

*How reducing the number of iterations help in performance*

There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals ( 
... there are around 6 more such rules)*  which traverse the tree using 
transformUp, and they run unnecessarily in each iteration , even when the 
expressions in an operator have not changed since the previous runs.

*I have a different proposal which I will share later, as to how to avoid the 
above rules from running unnecessarily, if it can be guaranteed that the 
expression is not going to mutate in the operator.* 

The cause of our huge compilation time has been identified as the above.
  
h2. Q2. What problem is this proposal NOT designed to solve?

It is not going to change any runtime profile.
h2. Q3. How is it done today, and what are the limits of current practice?

As mentioned above, currently PushDownPredicate pushes one filter at a time & at 
each Project it materializes the re-aliased filter. This results in a large 
number of iterations to achieve idempotency, and the immediate materialization of 
the filter after each Project pass results in unnecessary tree traversals of the 
filter expression, using the expensive transformUp at that; the expression tree 
of the filter is bound to keep increasing as it is pushed down.
h2. Q4. What is new in your approach and why do you think it will be successful?

In the new approach we push all the filters down in a single pass, and do not 
materialize filters as they pass through Projects. Instead we keep collecting 
projects in sequential order and materialize the final filter once its final 
position is reached (above a join in the case of 2.4, or above the base 
relation, etc.).

This approach when coupled with the logic of identifying those Project operator 
whose expressions will not mutate ( which I will share later) , so that rules 
like 

NullPropagation,
 OptimizeIn.,
 LikeSimplification.,
 BooleanSimplification.,
 SimplifyConditionals.,
 RemoveDispensableExpressions.,
 SimplifyBinaryComparison.,
 SimplifyCaseConversionExpr

[jira] [Updated] (SPARK-36786) SPIP: Improving the compile time performance, by improving a couple of rules, from 24 hrs to under 8 minutes

2021-09-28 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif updated SPARK-36786:
-
Description: 
h2. Q1. What are you trying to do? Articulate your objectives using absolutely 
no jargon.

The aim is to improve the compile time performance of query which in WorkDay's 
use case takes > 24 hrs ( & eventually fails) , to  < 8 min.

To explain the problem, I will provide the context.

The query plan in our production system, is huge, with nested *case when* 
expressions ( level of nesting could be >  8) , where each *case when* can have 
branches sometimes > 1000.

The plan could look like
{quote}Project1
    |
   Filter 1
    |

Project2
    |
 Filter2
    |
 Project3
    |
 Filter3

  |

Join
{quote}
Now the optimizer has a Batch of Rules, intended to run at most 100 times.

*Also note that the batch will continue to run till one of the conditions is 
satisfied,*

*i.e. either numIter == 100 || inputPlan == outputPlan (idempotency is 
achieved).*

One of the early rules is *PushDownPredicate*, followed by *CollapseProject*.

 

The first issue is *PushDownPredicate* rule.

It picks one filter at a time & pushes it to the lowest level (I understand that 
in 3.1 it pushes through joins, while in 2.4 it stops at a Join), but in either 
case it picks 1 filter at a time starting from the top, in each iteration.

*The above comment is no longer true in the 3.1 release as it now combines filters, 
so it now pushes all the encountered filters in a single pass. But it still 
materializes the filter on each push by re-aliasing.*

*Also it seems that with the change, the PruneFilters rule's functionality of 
replacement with an empty relation appears to be broken by the addition of the 
CombineFilters rule in the PushDownPredicates rule. More on it later.*

*I have filed a new bug ticket ([PruneFilter optimization 
broken|https://issues.apache.org/jira/browse/SPARK-36878])*

So if there are, say, 50 projects interspersed with filters, idempotency is 
guaranteed not to be achieved till around 49 iterations. Moreover, 
CollapseProject will also be modifying the tree on each iteration as a filter 
will get removed within a Project.

Moreover, on each movement of a filter through the project tree, the filter is 
re-aliased using the transformUp rule. transformUp is very expensive compared to 
transformDown. As the filter keeps getting pushed down, its size increases.

To optimize this rule , 2 things are needed
 # Instead of pushing one filter at a time,  collect all the filters as we 
traverse the tree in that iteration itself.
 # Do not re-alias the filters on each push. Collect the sequence of projects 
it has passed through, and  when the filters have reached their resting place, 
do the re-alias by processing the projects collected in down to up manner.

This will result in achieving idempotency in a couple of iterations. 

*How reducing the number of iterations help in performance*

There are many rules like *NullPropagation, OptimizeIn, SimplifyConditionals ( 
... there are around 6 more such rules)*  which traverse the tree using 
transformUp, and they run unnecessarily in each iteration , even when the 
expressions in an operator have not changed since the previous runs.

*I have a different proposal which I will share later, as to how to avoid the 
above rules from running unnecessarily, if it can be guaranteed that the 
expression is not going to mutate in the operator.* 

The cause of our huge compilation time has been identified as the above.
  
h2. Q2. What problem is this proposal NOT designed to solve?

It is not going to change any runtime profile.
h2. Q3. How is it done today, and what are the limits of current practice?

As mentioned above, currently PushDownPredicate pushes one filter at a time & at 
each Project it materializes the re-aliased filter. This results in a large 
number of iterations to achieve idempotency, and the immediate materialization of 
the filter after each Project pass results in unnecessary tree traversals of the 
filter expression, using the expensive transformUp at that; the expression tree 
of the filter is bound to keep increasing as it is pushed down.
h2. Q4. What is new in your approach and why do you think it will be successful?

In the new approach we push all the filters down in a single pass, and do not 
materialize filters as they pass through Projects. Instead we keep collecting 
projects in sequential order and materialize the final filter once its final 
position is reached (above a join in the case of 2.4, or above the base 
relation, etc.).

This approach when coupled with the logic of identifying those Project operator 
whose expressions will not mutate ( which I will share later) , so that rules 
like 

NullPropagation,
 OptimizeIn.,
 LikeSimplification.,
 BooleanSimplification.,
 SimplifyConditionals.,
 RemoveDispensableExpressions.,
 SimplifyBinaryComparison.,
 SimplifyCaseConversionExpr

[jira] [Created] (SPARK-36879) Support Parquet v2 data page encodings for the vectorized path

2021-09-28 Thread Chao Sun (Jira)
Chao Sun created SPARK-36879:


 Summary: Support Parquet v2 data page encodings for the vectorized 
path
 Key: SPARK-36879
 URL: https://issues.apache.org/jira/browse/SPARK-36879
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.3.0
Reporter: Chao Sun


Currently Spark only supports Parquet V1 encodings (i.e., PLAIN/DICTIONARY/RLE) 
in the vectorized path, and throws an exception otherwise:
{code}
java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
{code}

It will be good to support v2 encodings too, including DELTA_BINARY_PACKED, 
DELTA_LENGTH_BYTE_ARRAY, DELTA_BYTE_ARRAY as well as BYTE_STREAM_SPLIT as 
listed in https://github.com/apache/parquet-format/blob/master/Encodings.md
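
A sketch of how the unsupported encodings are typically hit (the {{parquet.writer.version}} Hadoop property is an assumption based on parquet-mr behaviour, not something this ticket specifies):
{code:scala}
// Assumes a SparkSession named `spark`.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
spark.sparkContext.hadoopConfiguration.set("parquet.writer.version", "v2")

// String columns typically get DELTA_BYTE_ARRAY data pages with the v2 writer.
spark.range(100).selectExpr("cast(id as string) as s")
  .write.mode("overwrite").parquet("/tmp/parquet_v2_demo")

// Reading the v2 file back through the vectorized path currently fails with
// java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
spark.read.parquet("/tmp/parquet_v2_demo").show()
{code}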



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36877) Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing reruns

2021-09-28 Thread Shardul Mahadik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shardul Mahadik updated SPARK-36877:

Summary: Calling ds.rdd with AQE enabled leads to jobs being run, 
eventually causing reruns  (was: Calling ds.rdd with AQE enabled leads to being 
jobs being run, eventually causing reruns)

> Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing 
> reruns
> --
>
> Key: SPARK-36877
> URL: https://issues.apache.org/jira/browse/SPARK-36877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: Screen Shot 2021-09-28 at 09.32.20.png
>
>
> In one of our jobs we perform the following operation:
> {code:scala}
> val df = /* some expensive multi-table/multi-stage join */
> val numPartitions = df.rdd.getNumPartitions
> df.repartition(x).write.
> {code}
> With AQE enabled, we found that the expensive stages were being run twice 
> causing significant performance regression after enabling AQE; once when 
> calling {{df.rdd}} and again when calling {{df.write}}.
> A more concrete example:
> {code:scala}
> scala> sql("SET spark.sql.adaptive.enabled=true")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val df1 = spark.range(10).withColumn("id2", $"id")
> df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
> "id").join(spark.range(10), "id")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df3 = df2.groupBy("id2").count()
> df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]
> scala> df3.rdd.getNumPartitions
> res2: Int = 10(0 + 16) / 
> 16]
> scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
> {code}
> In the screenshot below, you can see that the first 3 stages (0 to 4) were 
> rerun again (5 to 9).
> I have two questions:
> 1) Should calling df.rdd trigger actual job execution when AQE is enabled?
> 2) Should calling df.write later cause rerun of the stages? If df.rdd has 
> already partially executed the stages, shouldn't it reuse the result from 
> previous stages?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36880) Inline type hints for python/pyspark/sql/functions.py

2021-09-28 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36880:


 Summary: Inline type hints for python/pyspark/sql/functions.py
 Key: SPARK-36880
 URL: https://issues.apache.org/jira/browse/SPARK-36880
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Inline type hints from python/pyspark/sql/functions.pyi to 
python/pyspark/sql/functions.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36880) Inline type hints for python/pyspark/sql/functions.py

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36880:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/sql/functions.py
> -
>
> Key: SPARK-36880
> URL: https://issues.apache.org/jira/browse/SPARK-36880
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints from python/pyspark/sql/functions.pyi to 
> python/pyspark/sql/functions.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36880) Inline type hints for python/pyspark/sql/functions.py

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36880:


Assignee: Apache Spark

> Inline type hints for python/pyspark/sql/functions.py
> -
>
> Key: SPARK-36880
> URL: https://issues.apache.org/jira/browse/SPARK-36880
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/sql/functions.pyi to 
> python/pyspark/sql/functions.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36880) Inline type hints for python/pyspark/sql/functions.py

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421603#comment-17421603
 ] 

Apache Spark commented on SPARK-36880:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34130

> Inline type hints for python/pyspark/sql/functions.py
> -
>
> Key: SPARK-36880
> URL: https://issues.apache.org/jira/browse/SPARK-36880
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints from python/pyspark/sql/functions.pyi to 
> python/pyspark/sql/functions.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36880) Inline type hints for python/pyspark/sql/functions.py

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421602#comment-17421602
 ] 

Apache Spark commented on SPARK-36880:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34130

> Inline type hints for python/pyspark/sql/functions.py
> -
>
> Key: SPARK-36880
> URL: https://issues.apache.org/jira/browse/SPARK-36880
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints from python/pyspark/sql/functions.pyi to 
> python/pyspark/sql/functions.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12

2021-09-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34276.
---
  Assignee: Chao Sun
Resolution: Done

This is superseded by SPARK-36726 by [~csun].
I'm resolving this blocker issue because I don't see any open Parquet issues to 
block us.

cc [~Gengliang.Wang]

> Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12
> --
>
> Key: SPARK-34276
> URL: https://issues.apache.org/jira/browse/SPARK-34276
> Project: Spark
>  Issue Type: Task
>  Components: Build, SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Chao Sun
>Priority: Blocker
>
> Before the release, we need to double check the unreleased/unresolved 
> JIRAs/PRs of Parquet 1.11/1.12 and then decide whether we should 
> upgrade/revert Parquet. At the same time, we should encourage the whole 
> community to do the compatibility and performance tests for their production 
> workloads, including both read and write code paths.
> More details: 
> [https://github.com/apache/spark/pull/26804#issuecomment-768790620]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36869) Spark job fails due to java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; local class incompatible

2021-09-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-36869.
---
Resolution: Duplicate

This is superseded by SPARK-36759.

> Spark job fails due to java.io.InvalidClassException: 
> scala.collection.mutable.WrappedArray$ofRef; local class incompatible
> ---
>
> Key: SPARK-36869
> URL: https://issues.apache.org/jira/browse/SPARK-36869
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.2
> Environment: * RHEL 8.4
>  * Java 11.0.12
>  * Spark 3.1.2 (only prebuilt with *2.12.10)*
>  * Scala *2.12.14* for the application code
>Reporter: Hamid EL MAAZOUZ
>Priority: Blocker
>  Labels: scala, serialization, spark
>
> This is a Scala problem. It has already been reported here 
> [https://github.com/scala/bug/issues/5046] and a fix has been merged here 
> [https://github.com/scala/scala/pull/9166|https://github.com/scala/scala/pull/9166].
> According to 
> [https://github.com/scala/bug/issues/5046#issuecomment-928108088], the *fix* 
> is available on *Scala 2.12.14*, but *Spark 3.0+* is only pre-built with 
> Scala *2.12.10*.
>  
>  * Stacktrace of the failure: (Taken from stderr of a worker process)
> {code:java}
> Spark Executor Command: "/usr/java/jdk-11.0.12/bin/java" "-cp" 
> "/opt/apache/spark-3.1.2-bin-hadoop3.2/conf/:/opt/apache/spark-3.1.2-bin-hadoop3.2/jars/*"
>  "-Xmx1024M" "-Dspark.driver.port=45887" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.191:45887" "--executor-id" "0" 
> "--hostname" "192.168.0.191" "--cores" "12" "--app-id" 
> "app-20210927231035-" "--worker-url" "spark://Worker@192.168.0.191:35261"
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 21/09/27 23:10:36 INFO CoarseGrainedExecutorBackend: Started daemon with 
> process name: 18957@localhost
> 21/09/27 23:10:36 INFO SignalUtils: Registering signal handler for TERM
> 21/09/27 23:10:36 INFO SignalUtils: Registering signal handler for HUP
> 21/09/27 23:10:36 INFO SignalUtils: Registering signal handler for INT
> 21/09/27 23:10:36 WARN Utils: Your hostname, localhost resolves to a loopback 
> address: 127.0.0.1; using 192.168.0.191 instead (on interface wlp82s0)
> 21/09/27 23:10:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/opt/apache/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) 
> to constructor java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 21/09/27 23:10:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 21/09/27 23:10:36 INFO SecurityManager: Changing view acls to: hamidelmaazouz
> 21/09/27 23:10:36 INFO SecurityManager: Changing modify acls to: 
> hamidelmaazouz
> 21/09/27 23:10:36 INFO SecurityManager: Changing view acls groups to: 
> 21/09/27 23:10:36 INFO SecurityManager: Changing modify acls groups to: 
> 21/09/27 23:10:36 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: 
> Set(hamidelmaazouz); groups with view permissions: Set(); users  with modify 
> permissions: Set(hamidelmaazouz); groups with modify permissions: Set()
> 21/09/27 23:10:37 INFO TransportClientFactory: Successfully created 
> connection to /192.168.0.191:45887 after 44 ms (0 ms spent in bootstraps)
> 21/09/27 23:10:37 WARN TransportChannelHandler: Exception in connection from 
> /192.168.0.191:45887
> java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; 
> local class incompatible: stream classdesc serialVersionUID = 
> 3456489343829468865, local class serialVersionUID = 1028182004549731694
>   at 
> java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
>   at 
> java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2012)
>   at 
> java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1862)
>   at 
> java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2169)
>   at 
> java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
>   at 
> java.base/java.io.Obje

[jira] [Closed] (SPARK-36869) Spark job fails due to java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; local class incompatible

2021-09-28 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-36869.
-

> Spark job fails due to java.io.InvalidClassException: 
> scala.collection.mutable.WrappedArray$ofRef; local class incompatible
> ---
>
> Key: SPARK-36869
> URL: https://issues.apache.org/jira/browse/SPARK-36869
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.1.2
> Environment: * RHEL 8.4
>  * Java 11.0.12
>  * Spark 3.1.2 (only prebuilt with *2.12.10)*
>  * Scala *2.12.14* for the application code
>Reporter: Hamid EL MAAZOUZ
>Priority: Blocker
>  Labels: scala, serialization, spark
>
> This is a Scala problem. It has already been reported here 
> [https://github.com/scala/bug/issues/5046] and a fix has been merged here 
> [https://github.com/scala/scala/pull/9166|https://github.com/scala/scala/pull/9166].
> According to 
> [https://github.com/scala/bug/issues/5046#issuecomment-928108088], the *fix* 
> is available on *Scala 2.12.14*, but *Spark 3.0+* is only pre-built with 
> Scala *2.12.10*.
>  
>  * Stacktrace of the failure: (Taken from stderr of a worker process)
> {code:java}
> Spark Executor Command: "/usr/java/jdk-11.0.12/bin/java" "-cp" 
> "/opt/apache/spark-3.1.2-bin-hadoop3.2/conf/:/opt/apache/spark-3.1.2-bin-hadoop3.2/jars/*"
>  "-Xmx1024M" "-Dspark.driver.port=45887" 
> "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" 
> "spark://CoarseGrainedScheduler@192.168.0.191:45887" "--executor-id" "0" 
> "--hostname" "192.168.0.191" "--cores" "12" "--app-id" 
> "app-20210927231035-" "--worker-url" "spark://Worker@192.168.0.191:35261"
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> 21/09/27 23:10:36 INFO CoarseGrainedExecutorBackend: Started daemon with 
> process name: 18957@localhost
> 21/09/27 23:10:36 INFO SignalUtils: Registering signal handler for TERM
> 21/09/27 23:10:36 INFO SignalUtils: Registering signal handler for HUP
> 21/09/27 23:10:36 INFO SignalUtils: Registering signal handler for INT
> 21/09/27 23:10:36 WARN Utils: Your hostname, localhost resolves to a loopback 
> address: 127.0.0.1; using 192.168.0.191 instead (on interface wlp82s0)
> 21/09/27 23:10:36 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to 
> another address
> WARNING: An illegal reflective access operation has occurred
> WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform 
> (file:/opt/apache/spark-3.1.2-bin-hadoop3.2/jars/spark-unsafe_2.12-3.1.2.jar) 
> to constructor java.nio.DirectByteBuffer(long,int)
> WARNING: Please consider reporting this to the maintainers of 
> org.apache.spark.unsafe.Platform
> WARNING: Use --illegal-access=warn to enable warnings of further illegal 
> reflective access operations
> WARNING: All illegal access operations will be denied in a future release
> 21/09/27 23:10:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 21/09/27 23:10:36 INFO SecurityManager: Changing view acls to: hamidelmaazouz
> 21/09/27 23:10:36 INFO SecurityManager: Changing modify acls to: 
> hamidelmaazouz
> 21/09/27 23:10:36 INFO SecurityManager: Changing view acls groups to: 
> 21/09/27 23:10:36 INFO SecurityManager: Changing modify acls groups to: 
> 21/09/27 23:10:36 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users  with view permissions: 
> Set(hamidelmaazouz); groups with view permissions: Set(); users  with modify 
> permissions: Set(hamidelmaazouz); groups with modify permissions: Set()
> 21/09/27 23:10:37 INFO TransportClientFactory: Successfully created 
> connection to /192.168.0.191:45887 after 44 ms (0 ms spent in bootstraps)
> 21/09/27 23:10:37 WARN TransportChannelHandler: Exception in connection from 
> /192.168.0.191:45887
> java.io.InvalidClassException: scala.collection.mutable.WrappedArray$ofRef; 
> local class incompatible: stream classdesc serialVersionUID = 
> 3456489343829468865, local class serialVersionUID = 1028182004549731694
>   at 
> java.base/java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:689)
>   at 
> java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2012)
>   at 
> java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1862)
>   at 
> java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2169)
>   at 
> java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1679)
>   at 
> java.base/java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2464)
> 

[jira] [Assigned] (SPARK-36871) Migrate CreateViewStatement to v2 command

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36871:


Assignee: (was: Apache Spark)

> Migrate CreateViewStatement to v2 command
> -
>
> Key: SPARK-36871
> URL: https://issues.apache.org/jira/browse/SPARK-36871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36871) Migrate CreateViewStatement to v2 command

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421778#comment-17421778
 ] 

Apache Spark commented on SPARK-36871:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34131

> Migrate CreateViewStatement to v2 command
> -
>
> Key: SPARK-36871
> URL: https://issues.apache.org/jira/browse/SPARK-36871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36871) Migrate CreateViewStatement to v2 command

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36871:


Assignee: Apache Spark

> Migrate CreateViewStatement to v2 command
> -
>
> Key: SPARK-36871
> URL: https://issues.apache.org/jira/browse/SPARK-36871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36871) Migrate CreateViewStatement to v2 command

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421779#comment-17421779
 ] 

Apache Spark commented on SPARK-36871:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34131

> Migrate CreateViewStatement to v2 command
> -
>
> Key: SPARK-36871
> URL: https://issues.apache.org/jira/browse/SPARK-36871
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-35874) AQE Shuffle should wait for its subqueries to finish before materializing

2021-09-28 Thread Shardul Mahadik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-35874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421780#comment-17421780
 ] 

Shardul Mahadik commented on SPARK-35874:
-

[~dongjoon] Should this be linked in SPARK-33828?

> AQE Shuffle should wait for its subqueries to finish before materializing
> -
>
> Key: SPARK-35874
> URL: https://issues.apache.org/jira/browse/SPARK-35874
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36881) Inline type hints for python/pyspark/sql/catalog.py

2021-09-28 Thread Xinrong Meng (Jira)
Xinrong Meng created SPARK-36881:


 Summary: Inline type hints for python/pyspark/sql/catalog.py
 Key: SPARK-36881
 URL: https://issues.apache.org/jira/browse/SPARK-36881
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Xinrong Meng


Inline type hints for python/pyspark/sql/catalog.py from 
python/pyspark/sql/catalog.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36881) Inline type hints for python/pyspark/sql/catalog.py

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36881:


Assignee: Apache Spark

> Inline type hints for python/pyspark/sql/catalog.py
> ---
>
> Key: SPARK-36881
> URL: https://issues.apache.org/jira/browse/SPARK-36881
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints for python/pyspark/sql/catalog.py from 
> python/pyspark/sql/catalog.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36881) Inline type hints for python/pyspark/sql/catalog.py

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36881:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/sql/catalog.py
> ---
>
> Key: SPARK-36881
> URL: https://issues.apache.org/jira/browse/SPARK-36881
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints for python/pyspark/sql/catalog.py from 
> python/pyspark/sql/catalog.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36881) Inline type hints for python/pyspark/sql/catalog.py

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421816#comment-17421816
 ] 

Apache Spark commented on SPARK-36881:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34133

> Inline type hints for python/pyspark/sql/catalog.py
> ---
>
> Key: SPARK-36881
> URL: https://issues.apache.org/jira/browse/SPARK-36881
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints for python/pyspark/sql/catalog.py from 
> python/pyspark/sql/catalog.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36881) Inline type hints for python/pyspark/sql/catalog.py

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421817#comment-17421817
 ] 

Apache Spark commented on SPARK-36881:
--

User 'xinrong-databricks' has created a pull request for this issue:
https://github.com/apache/spark/pull/34133

> Inline type hints for python/pyspark/sql/catalog.py
> ---
>
> Key: SPARK-36881
> URL: https://issues.apache.org/jira/browse/SPARK-36881
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Priority: Major
>
> Inline type hints for python/pyspark/sql/catalog.py from 
> python/pyspark/sql/catalog.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36878) Optimization in PushDownPredicates to push all filters in a single iteration has broken some optimizations in PruneFilter rule

2021-09-28 Thread Asif (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Asif resolved SPARK-36878.
--
Resolution: Not A Bug

Further input from my colleague, who pointed out that some other rule might 
already be handling this case: it turns out that the BooleanSimplification 
rule, which runs before PruneFilters, reduces the composite condition to a 
single condition (by pruning unnecessary conditions). As a result, the current 
PruneFilters code is sufficient; it does not have to handle composite 
conditions containing nulls, false, or true.
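
A hedged PySpark sketch of the interaction described above (illustrative only,
not code from this ticket): a composite filter containing a literal false
should be collapsed by BooleanSimplification and then pruned to an empty
relation, which can be checked in the optimized logical plan.

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[1]").appName("prune-filters-demo").getOrCreate()

df = spark.range(10)

# Composite predicate with a Literal(false): BooleanSimplification should
# reduce the whole condition to false, and PruneFilters should then replace
# the scan with an empty LocalRelation in the optimized logical plan.
df.filter((F.col("id") > 1) & F.lit(False)).explain(extended=True)

spark.stop()
{code}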


> Optimization in PushDownPredicates to push all filters in a single iteration 
> has broken  some optimizations in PruneFilter rule
> ---
>
> Key: SPARK-36878
> URL: https://issues.apache.org/jira/browse/SPARK-36878
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Asif
>Priority: Major
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> It appears that the optimization in the PushDownPredicates rule, which pushes 
> all filters in a single pass to reduce iterations, has broken the PruneFilters 
> rule's ability to substitute an empty relation when the filter condition is 
> composite and statically evaluates to false, either because one of the 
> non-redundant predicates is Literal(false) or because all of the non-redundant 
> predicates are null.
> The new PushDownPredicates rule is created by chaining CombineFilters, 
> PushPredicateThroughNonJoin and PushPredicateThroughJoin, so individual 
> filters get combined into a single filter while being pushed.
> But the PruneFilters rule does not substitute an empty relation if the filter 
> is composite; it is coded to handle single predicates.
> The test passes falsely because it exercises PushPredicateThroughNonJoin, 
> which does not combine filters, while the rule actually in effect includes 
> the behavior of CombineFilters.
> In fact, I believe all the other tests that exercise 
> PushPredicateThroughNonJoin or PushPredicateThroughJoin individually should 
> be corrected (perhaps using the PushDownPredicates rule) and re-tested.
> I will add a bug test and open a PR.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36862) ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java'

2021-09-28 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421823#comment-17421823
 ] 

Jungtaek Lim commented on SPARK-36862:
--

No, I meant you're encouraged to paste the "generated" code from the log. It's 
less of a concern, as it's quite hard to infer the actual business logic from 
generated code.

> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java'
> -
>
> Key: SPARK-36862
> URL: https://issues.apache.org/jira/browse/SPARK-36862
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit, SQL
>Affects Versions: 3.1.1
> Environment: Spark 3.1.1 and Spark 3.1.2
> hadoop 3.2.1
>Reporter: Magdalena Pilawska
>Priority: Major
>
> Hi,
> I am getting the following error running spark-submit command:
> ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 321, Column 103: ')' expected instead of '['
>  
> It fails running the spark sql command on delta lake: 
> spark.sql(sqlTransformation)
> The template of sqlTransformation is as follows:
> MERGE INTO target_table AS d
>  USING source_table AS s 
>  on s.id = d.id
>  WHEN MATCHED AND d.hash_value <> s.hash_value
>  THEN UPDATE SET d.name =s.name, d.address = s.address
>  
> It is a permanent error for both *Spark 3.1.1* and *3.1.2*.
>  
> The same works fine with spark 3.0.0.
>  
> Here is the full log:
> 2021-09-22 16:43:22,110 ERROR CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 55, Column 103: ')' expected instead of '['2021-09-22 16:43:22,110 ERROR 
> CodeGenerator: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 55, Column 103: ')' expected instead of 
> '['org.codehaus.commons.compiler.CompileException: File 'generated.java', 
> Line 55, Column 103: ')' expected instead of '[' at 
> org.codehaus.janino.TokenStreamImpl.compileException(TokenStreamImpl.java:362)
>  at org.codehaus.janino.TokenStreamImpl.read(TokenStreamImpl.java:150) at 
> org.codehaus.janino.Parser.read(Parser.java:3703) at 
> org.codehaus.janino.Parser.parseFormalParameters(Parser.java:1622) at 
> org.codehaus.janino.Parser.parseMethodDeclarationRest(Parser.java:1518) at 
> org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:1028) at 
> org.codehaus.janino.Parser.parseClassBody(Parser.java:841) at 
> org.codehaus.janino.Parser.parseClassDeclarationRest(Parser.java:736) at 
> org.codehaus.janino.Parser.parseClassBodyDeclaration(Parser.java:941) at 
> org.codehaus.janino.ClassBodyEvaluator.cook(ClassBodyEvaluator.java:234) at 
> org.codehaus.janino.SimpleCompiler.cook(SimpleCompiler.java:205) at 
> org.codehaus.commons.compiler.Cookable.cook(Cookable.java:80) at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1427)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1524)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1521)
>  at 
> org.sparkproject.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>  at 
> org.sparkproject.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>  at 
> org.sparkproject.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>  at org.sparkproject.guava.cache.LocalCache$Segment.get(LocalCache.java:2257) 
> at org.sparkproject.guava.cache.LocalCache.get(LocalCache.java:4000) at 
> org.sparkproject.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004) at 
> org.sparkproject.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>  at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1375)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.liftedTree1$1(WholeStageCodegenExec.scala:721)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:720)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:185)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:223)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:220) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:181) at 
> org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.inputRDD$lzycompute(ShuffleExchangeExec.scala:160)

[jira] [Commented] (SPARK-36664) Log time spent waiting for cluster resources

2021-09-28 Thread John Zhuge (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421824#comment-17421824
 ] 

John Zhuge commented on SPARK-36664:


This can be very useful. For example, we'd like to track how long YARN jobs are 
stuck in the ACCEPTED state.

> Log time spent waiting for cluster resources
> 
>
> Key: SPARK-36664
> URL: https://issues.apache.org/jira/browse/SPARK-36664
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Holden Karau
>Priority: Major
>
> To provide better visibility into why jobs might be running slow it would be 
> useful to log when we are waiting for resources and how long we are waiting 
> for resources so if there is an underlying cluster issue the user can be 
> aware.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36846) Inline most of type hint files under pyspark/sql/pandas folder

2021-09-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36846.
--
Fix Version/s: 3.3.0
 Assignee: Takuya Ueshin
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/34101

> Inline most of type hint files under pyspark/sql/pandas folder
> --
>
> Key: SPARK-36846
> URL: https://issues.apache.org/jira/browse/SPARK-36846
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hint files under {{pyspark/sql/pandas}} folder, except for 
> {{pyspark/sql/pandas/functions.pyi}} and files under 
> {{pyspark/sql/pandas/_typing}}.
>  * Since the file contains a lot of overloads, we should revisit and manage 
> it separately.
>  * We can't inline files under {{pyspark/sql/pandas/_typing}} because it 
> includes new syntax for type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36813) Implement ps.merge_asof

2021-09-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36813:


Assignee: Takuya Ueshin

> Implement ps.merge_asof
> ---
>
> Key: SPARK-36813
> URL: https://issues.apache.org/jira/browse/SPARK-36813
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36813) Implement ps.merge_asof

2021-09-28 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36813.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34053
[https://github.com/apache/spark/pull/34053]

> Implement ps.merge_asof
> ---
>
> Key: SPARK-36813
> URL: https://issues.apache.org/jira/browse/SPARK-36813
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36849) Migrate UseStatement to v2 command framework

2021-09-28 Thread dohongdayi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421843#comment-17421843
 ] 

dohongdayi commented on SPARK-36849:


Hi [~huaxingao], I'm not sure why SparkQA didn't run automatically. Could you 
advise, please?

Thanks

> Migrate UseStatement to v2 command framework
> 
>
> Key: SPARK-36849
> URL: https://issues.apache.org/jira/browse/SPARK-36849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36849) Migrate UseStatement to v2 command framework

2021-09-28 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421845#comment-17421845
 ] 

Huaxin Gao commented on SPARK-36849:


I just started the test

> Migrate UseStatement to v2 command framework
> 
>
> Key: SPARK-36849
> URL: https://issues.apache.org/jira/browse/SPARK-36849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36752) Support ILIKE by Scala/Java, PySpark and R APIs

2021-09-28 Thread Leona Yoda (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421847#comment-17421847
 ] 

Leona Yoda commented on SPARK-36752:


[~hyukjin.kwon] I'm sorry for worrying you (I took a vacation). I will create 
a PR for Python today and one for R soon.

> Support ILIKE by Scala/Java, PySpark and R APIs
> ---
>
> Key: SPARK-36752
> URL: https://issues.apache.org/jira/browse/SPARK-36752
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Add the ilike function to Scala/Java, Python and R APIs, update docs and 
> examples.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36752) Support ILIKE by Scala/Java, PySpark and R APIs

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421852#comment-17421852
 ] 

Hyukjin Kwon commented on SPARK-36752:
--

it's totally no problem. I was just checking :-).

> Support ILIKE by Scala/Java, PySpark and R APIs
> ---
>
> Key: SPARK-36752
> URL: https://issues.apache.org/jira/browse/SPARK-36752
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> Add the ilike function to Scala/Java, Python and R APIs, update docs and 
> examples.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36882) Support ILIKE API on Python

2021-09-28 Thread Leona Yoda (Jira)
Leona Yoda created SPARK-36882:
--

 Summary: Support ILIKE API on Python
 Key: SPARK-36882
 URL: https://issues.apache.org/jira/browse/SPARK-36882
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Leona Yoda


Support ILIKE (case-insensitive LIKE) API on Python
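
A hedged usage sketch of the proposed API (assuming the new Python method
mirrors the existing Column.like; the data and method call below are
illustrative, not taken from this ticket):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("ilike-demo").getOrCreate()

df = spark.createDataFrame([("Alice",), ("BOB",), ("carol",)], ["name"])

# Case-insensitive pattern match: 'Alice' matches '%ali%' even though the
# cases differ, which a plain like('%ali%') would not match.
df.filter(df.name.ilike("%ali%")).show()

spark.stop()
{code}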



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-36883:


 Summary: Upgrade R version to 4.1.1 in CI images
 Key: SPARK-36883
 URL: https://issues.apache.org/jira/browse/SPARK-36883
 Project: Spark
  Issue Type: Test
  Components: Tests
Affects Versions: 3.3.0
Reporter: Hyukjin Kwon


https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.

R 4.1.1 has been released. We should test the latest version of R with SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421854#comment-17421854
 ] 

Hyukjin Kwon commented on SPARK-36883:
--

[~dongjoon], FYI. BTW, I think this could fix the linter issue currently 
happening in CI, e.g. https://github.com/apache/spark/runs/3738920366

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36882) Support ILIKE API on Python

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36882:


Assignee: Apache Spark

> Support ILIKE API on Python
> ---
>
> Key: SPARK-36882
> URL: https://issues.apache.org/jira/browse/SPARK-36882
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Assignee: Apache Spark
>Priority: Major
>
> Support ILIKE (case-insensitive LIKE) API on Python



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36882) Support ILIKE API on Python

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36882:


Assignee: (was: Apache Spark)

> Support ILIKE API on Python
> ---
>
> Key: SPARK-36882
> URL: https://issues.apache.org/jira/browse/SPARK-36882
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Priority: Major
>
> Support ILIKE (case-insensitive LIKE) API on Python



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36882) Support ILIKE API on Python

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421858#comment-17421858
 ] 

Apache Spark commented on SPARK-36882:
--

User 'yoda-mon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34135

> Support ILIKE API on Python
> ---
>
> Key: SPARK-36882
> URL: https://issues.apache.org/jira/browse/SPARK-36882
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Priority: Major
>
> Support ILIKE (case-insensitive LIKE) API on Python



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36845) Inline type hint files

2021-09-28 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421859#comment-17421859
 ] 

dgd_contributor commented on SPARK-36845:
-

Can I work on this? :D

> Inline type hint files
> --
>
> Key: SPARK-36845
> URL: https://issues.apache.org/jira/browse/SPARK-36845
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Currently there are type hint stub files ({{*.pyi}}) to show the expected 
> types for functions, but we can also take advantage of static type checking 
> within the functions by inlining the type hints.
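
As a hedged illustration of the benefit mentioned in the quoted description
(toy code, not Spark source): with a separate .pyi stub, a checker such as
mypy validates callers against the stub but treats the unannotated function
body as untyped; once the hints are inlined, mistakes inside the body can be
flagged as well.

{code:python}
# Toy module with inlined hints; mypy now checks the body, not only callers.
def count_chars(text: str) -> int:
    # With inline hints, a wrong return type would be flagged here, e.g.:
    #   return text   # error: Incompatible return value type (got "str", expected "int")
    return len(text)

print(count_chars("spark"))  # 5
{code}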



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36884) Inline type hints for python/pyspark/sql/session.py

2021-09-28 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36884:
-

 Summary: Inline type hints for python/pyspark/sql/session.py
 Key: SPARK-36884
 URL: https://issues.apache.org/jira/browse/SPARK-36884
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Takuya Ueshin


Inline type hints for python/pyspark/sql/session.py from 
python/pyspark/sql/session.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36884) Inline type hints for python/pyspark/sql/session.py

2021-09-28 Thread Takuya Ueshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421861#comment-17421861
 ] 

Takuya Ueshin commented on SPARK-36884:
---

I'm working on this.

> Inline type hints for python/pyspark/sql/session.py
> ---
>
> Key: SPARK-36884
> URL: https://issues.apache.org/jira/browse/SPARK-36884
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hints for python/pyspark/sql/session.py from 
> python/pyspark/sql/session.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36885) Inline type hints for python/pyspark/sql/dataframe.py

2021-09-28 Thread Takuya Ueshin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421863#comment-17421863
 ] 

Takuya Ueshin commented on SPARK-36885:
---

I'm working on this.

> Inline type hints for python/pyspark/sql/dataframe.py
> -
>
> Key: SPARK-36885
> URL: https://issues.apache.org/jira/browse/SPARK-36885
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hints for python/pyspark/sql/dataframe.py from 
> python/pyspark/sql/dataframe.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36885) Inline type hints for python/pyspark/sql/dataframe.py

2021-09-28 Thread Takuya Ueshin (Jira)
Takuya Ueshin created SPARK-36885:
-

 Summary: Inline type hints for python/pyspark/sql/dataframe.py
 Key: SPARK-36885
 URL: https://issues.apache.org/jira/browse/SPARK-36885
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Takuya Ueshin


Inline type hints for python/pyspark/sql/dataframe.py from 
python/pyspark/sql/dataframe.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36845) Inline type hint files

2021-09-28 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421865#comment-17421865
 ] 

Hyukjin Kwon commented on SPARK-36845:
--

please go ahead.

> Inline type hint files
> --
>
> Key: SPARK-36845
> URL: https://issues.apache.org/jira/browse/SPARK-36845
> Project: Spark
>  Issue Type: Umbrella
>  Components: PySpark, SQL
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Currently there are type hint stub files ({{*.pyi}}) to show the expected 
> types for functions, but we can also take advantage of static type checking 
> within the functions by inlining the type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36886) Inline type hints for python/pyspark/sql/column.py

2021-09-28 Thread dgd_contributor (Jira)
dgd_contributor created SPARK-36886:
---

 Summary: Inline type hints for python/pyspark/sql/column.py
 Key: SPARK-36886
 URL: https://issues.apache.org/jira/browse/SPARK-36886
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: dgd_contributor


Inline type hints for python/pyspark/sql/column.py from 
python/pyspark/sql/column.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36886) Inline type hints for python/pyspark/sql/column.py

2021-09-28 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421878#comment-17421878
 ] 

dgd_contributor commented on SPARK-36886:
-

Working on this.

> Inline type hints for python/pyspark/sql/column.py
> --
>
> Key: SPARK-36886
> URL: https://issues.apache.org/jira/browse/SPARK-36886
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dgd_contributor
>Priority: Major
>
> Inline type hints for python/pyspark/sql/column.py from 
> python/pyspark/sql/column.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36887) Inline type hints for python/pyspark/sql/conf.py

2021-09-28 Thread dgd_contributor (Jira)
dgd_contributor created SPARK-36887:
---

 Summary: Inline type hints for python/pyspark/sql/conf.py
 Key: SPARK-36887
 URL: https://issues.apache.org/jira/browse/SPARK-36887
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 3.3.0
Reporter: dgd_contributor


Inline type hints for python/pyspark/sql/conf.py from 
python/pyspark/sql/conf.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36887) Inline type hints for python/pyspark/sql/conf.py

2021-09-28 Thread dgd_contributor (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421880#comment-17421880
 ] 

dgd_contributor commented on SPARK-36887:
-

I'm working on this.

> Inline type hints for python/pyspark/sql/conf.py
> 
>
> Key: SPARK-36887
> URL: https://issues.apache.org/jira/browse/SPARK-36887
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dgd_contributor
>Priority: Major
>
> Inline type hints for python/pyspark/sql/conf.py from 
> python/pyspark/sql/conf.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36849) Migrate UseStatement to v2 command framework

2021-09-28 Thread dohongdayi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421888#comment-17421888
 ] 

dohongdayi commented on SPARK-36849:


[~huaxingao] Thank you a lot!

> Migrate UseStatement to v2 command framework
> 
>
> Key: SPARK-36849
> URL: https://issues.apache.org/jira/browse/SPARK-36849
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36884) Inline type hints for python/pyspark/sql/session.py

2021-09-28 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421898#comment-17421898
 ] 

Apache Spark commented on SPARK-36884:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/34136

> Inline type hints for python/pyspark/sql/session.py
> ---
>
> Key: SPARK-36884
> URL: https://issues.apache.org/jira/browse/SPARK-36884
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hints for python/pyspark/sql/session.py from 
> python/pyspark/sql/session.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36884) Inline type hints for python/pyspark/sql/session.py

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36884:


Assignee: Apache Spark

> Inline type hints for python/pyspark/sql/session.py
> ---
>
> Key: SPARK-36884
> URL: https://issues.apache.org/jira/browse/SPARK-36884
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
>
> Inline type hints for python/pyspark/sql/session.py from 
> python/pyspark/sql/session.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36884) Inline type hints for python/pyspark/sql/session.py

2021-09-28 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36884:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/sql/session.py
> ---
>
> Key: SPARK-36884
> URL: https://issues.apache.org/jira/browse/SPARK-36884
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Priority: Major
>
> Inline type hints for python/pyspark/sql/session.py from 
> python/pyspark/sql/session.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36526) Add supportsIndex interface

2021-09-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36526:
---

Assignee: Huaxin Gao

> Add supportsIndex interface
> ---
>
> Key: SPARK-36526
> URL: https://issues.apache.org/jira/browse/SPARK-36526
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
>
> Add supportsIndex interface with the following APIs:
> * createIndex
> * deleteIndex
> * indexExists
> * listIndexes
> * dropIndex
> * restoreIndex
> * refreshIndex
> * alterIndex



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36526) Add supportsIndex interface

2021-09-28 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36526.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 33754
[https://github.com/apache/spark/pull/33754]

> Add supportsIndex interface
> ---
>
> Key: SPARK-36526
> URL: https://issues.apache.org/jira/browse/SPARK-36526
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.3.0
>
>
> Add supportsIndex interface with the following APIs:
> * createIndex
> * deleteIndex
> * indexExists
> * listIndexes
> * dropIndex
> * restoreIndex
> * refreshIndex
> * alterIndex



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421907#comment-17421907
 ] 

Dongjoon Hyun commented on SPARK-36883:
---

Thank you, [~hyukjin.kwon].

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36883) Upgrade R version to 4.1.1 in CI images

2021-09-28 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17421908#comment-17421908
 ] 

Dongjoon Hyun commented on SPARK-36883:
---

Let me update the image.

> Upgrade R version to 4.1.1 in CI images
> ---
>
> Key: SPARK-36883
> URL: https://issues.apache.org/jira/browse/SPARK-36883
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> https://developer.r-project.org/#:~:text=Release%20plans,on%202021%2D08%2D10.
> R 4.1.1 has been released. We should test the latest version of R with 
> SparkR.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


