[jira] [Assigned] (SPARK-36876) Support Dynamic Partition pruning for HiveTableScanExec

2021-10-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36876:
---

Assignee: angerszhu

> Support Dynamic Partition pruning for HiveTableScanExec
> ---
>
> Key: SPARK-36876
> URL: https://issues.apache.org/jira/browse/SPARK-36876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> Support dynamic partition pruning for hive serde scan



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36876) Support Dynamic Partition pruning for HiveTableScanExec

2021-10-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36876.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34139
[https://github.com/apache/spark/pull/34139]

> Support Dynamic Partition pruning for HiveTableScanExec
> ---
>
> Key: SPARK-36876
> URL: https://issues.apache.org/jira/browse/SPARK-36876
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.3, 3.1.2, 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> Support dynamic partition pruning for hive serde scan
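
As a rough illustration of the query shape this feature targets (a sketch only; the table, column, and filter names below are made up, not from this ticket):

{code:python}
from pyspark.sql import SparkSession

# Assumes a Hive serde table `sales` partitioned by `sale_date` and a small
# dimension table `dim_dates`. With dynamic partition pruning, the runtime
# result of the dimension-side filter can prune `sales` partitions instead of
# the scan reading every partition.
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    SELECT s.item, s.amount
    FROM sales s
    JOIN dim_dates d ON s.sale_date = d.sale_date
    WHERE d.is_holiday = true
""").explain()
# When pruning applies, the plan contains a dynamic pruning subquery on s.sale_date.
{code}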



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36969) Inline type hints for python/pyspark/context.py

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426980#comment-17426980
 ] 

Apache Spark commented on SPARK-36969:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34238

> Inline type hints for python/pyspark/context.py
> ---
>
> Key: SPARK-36969
> URL: https://issues.apache.org/jira/browse/SPARK-36969
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>
> Many files can remove 
> {code:java}
> # type: ignore[attr-defined]
> {code}
> once this file has inline type hints.
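
For illustration, a minimal sketch of what "inlining" means here (the class and method below are simplified placeholders, not the real pyspark/context.py signatures):

{code:python}
# Before: annotations for this module live in a separate .pyi stub, so code that
# imports it under a type checker may need suppressions such as:
#   from pyspark import SparkContext  # type: ignore[attr-defined]
#
# After inlining, the annotations are written directly in the .py source:
from typing import Optional


class SparkContext:
    def __init__(self, master: Optional[str] = None, appName: Optional[str] = None) -> None:
        self.master = master
        self.appName = appName

    def defaultParallelism(self) -> int:
        # Placeholder body; the real method asks the scheduler backend.
        return 1
{code}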



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36969) Inline type hints for python/pyspark/context.py

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36969:


Assignee: Apache Spark

> Inline type hints for python/pyspark/context.py
> ---
>
> Key: SPARK-36969
> URL: https://issues.apache.org/jira/browse/SPARK-36969
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: Apache Spark
>Priority: Major
>
> Many files can remove 
> {code:java}
> # type: ignore[attr-defined]
> {code}
> once this file has inline type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36969) Inline type hints for python/pyspark/context.py

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36969:


Assignee: (was: Apache Spark)

> Inline type hints for python/pyspark/context.py
> ---
>
> Key: SPARK-36969
> URL: https://issues.apache.org/jira/browse/SPARK-36969
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>
> Many files can remove 
> {code:java}
> # type: ignore[attr-defined]
> {code}
> once this file has inline type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36969) Inline type hints for SparkContext

2021-10-11 Thread dch nguyen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36969?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dch nguyen updated SPARK-36969:
---
Summary: Inline type hints for SparkContext  (was: Inline type hints for 
python/pyspark/context.py)

> Inline type hints for SparkContext
> --
>
> Key: SPARK-36969
> URL: https://issues.apache.org/jira/browse/SPARK-36969
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Major
>
> Many files can remove 
> {code:java}
> # type: ignore[attr-defined]
> {code}
> once this file has inline type hints.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36059) Add the ability to specify a scheduler & queue

2021-10-11 Thread Yikun Jiang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426984#comment-17426984
 ] 

Yikun Jiang commented on SPARK-36059:
-

We already have the ability to specify a scheduler on the executor side 
(https://github.com/apache/spark/pull/26088); we just need to add the same ability on the 
driver side.
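
As a rough sketch of what the submission-side configuration could look like once both sides are supported (the driver-side and queue keys below are illustrative assumptions, not confirmed config names; the actual keys are whatever the linked PRs define):

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Executor-side scheduler name (the capability referenced above; the key shown
    # here is an assumed form, not verified against the PR).
    .config("spark.kubernetes.executor.scheduler.name", "volcano")
    # Hypothetical driver-side counterpart proposed by this ticket.
    .config("spark.kubernetes.driver.scheduler.name", "volcano")
    # Hypothetical queue selection, e.g. via a scheduler-specific pod annotation.
    .config("spark.kubernetes.driver.annotation.scheduling.volcano.sh/queue-name", "default")
    .getOrCreate()
)
{code}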

> Add the ability to specify a scheduler & queue
> --
>
> Key: SPARK-36059
> URL: https://issues.apache.org/jira/browse/SPARK-36059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36059) Add the ability to specify a scheduler & queue

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36059:


Assignee: Apache Spark

> Add the ability to specify a scheduler & queue
> --
>
> Key: SPARK-36059
> URL: https://issues.apache.org/jira/browse/SPARK-36059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36059) Add the ability to specify a scheduler & queue

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17426985#comment-17426985
 ] 

Apache Spark commented on SPARK-36059:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/34239

> Add the ability to specify a scheduler & queue
> --
>
> Key: SPARK-36059
> URL: https://issues.apache.org/jira/browse/SPARK-36059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36972) Add max_by/min_by API to PySpark

2021-10-11 Thread Leona Yoda (Jira)
Leona Yoda created SPARK-36972:
--

 Summary: Add max_by/min_by API to PySpark
 Key: SPARK-36972
 URL: https://issues.apache.org/jira/browse/SPARK-36972
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: Leona Yoda






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36059) Add the ability to specify a scheduler & queue

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36059:


Assignee: (was: Apache Spark)

> Add the ability to specify a scheduler & queue
> --
>
> Key: SPARK-36059
> URL: https://issues.apache.org/jira/browse/SPARK-36059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36059) Add the ability to specify a scheduler & queue

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36059:


Assignee: Apache Spark

> Add the ability to specify a scheduler & queue
> --
>
> Key: SPARK-36059
> URL: https://issues.apache.org/jira/browse/SPARK-36059
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Holden Karau
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36972) Add max_by/min_by API to PySpark

2021-10-11 Thread Leona Yoda (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leona Yoda updated SPARK-36972:
---
Description: 
Related issues

- https://issues.apache.org/jira/browse/SPARK-27653

- https://issues.apache.org/jira/browse/SPARK-36963

> Add max_by/min_by API to PySpark
> 
>
> Key: SPARK-36972
> URL: https://issues.apache.org/jira/browse/SPARK-36972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Priority: Major
>
> Related issues
> - https://issues.apache.org/jira/browse/SPARK-27653
> - https://issues.apache.org/jira/browse/SPARK-36963
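
For illustration, a sketch of how the new API could be used. The exact Python signature is an assumption until the implementation lands, but the SQL functions max_by/min_by already exist (SPARK-27653), so the same aggregation can be expressed via expr() today:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("a", 1, 10), ("a", 2, 5), ("b", 3, 7)],
    ["key", "value", "weight"],
)

# Works today through the existing SQL function:
df.groupBy("key").agg(
    F.expr("max_by(value, weight)").alias("value_at_max_weight")
).show()

# Assumed shape of the proposed PySpark function (hypothetical until merged):
# df.groupBy("key").agg(F.max_by("value", "weight"))
{code}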



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36972) Add max_by/min_by API to PySpark

2021-10-11 Thread Leona Yoda (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leona Yoda updated SPARK-36972:
---
Description: 
Related issues
 - https://issues.apache.org/jira/browse/SPARK-27653

 * https://issues.apache.org/jira/browse/SPARK-36963

  was:
Related issues

- https://issues.apache.org/jira/browse/SPARK-27653

- https://issues.apache.org/jira/browse/SPARK-36963


> Add max_by/min_by API to PySpark
> 
>
> Key: SPARK-36972
> URL: https://issues.apache.org/jira/browse/SPARK-36972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Priority: Major
>
> Related issues
>  - https://issues.apache.org/jira/browse/SPARK-27653
>  * https://issues.apache.org/jira/browse/SPARK-36963



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36932) Misuse "merge schema" when mapGroups

2021-10-11 Thread chendihao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427005#comment-17427005
 ] 

chendihao commented on SPARK-36932:
---

It is not related to the join operation. I have a simpler case that reproduces the 
issue with a single DataFrame:

{code:scala}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.{IntegerType, LongType, StructField, StructType}
import org.apache.spark.sql.{Row, SparkSession}

object SchemaPruningIssue {

  def main(args: Array[String]): Unit = {

    val spark = SparkSession.builder()
      .master("local")
      .getOrCreate()

    // Two fields intentionally share the name "col1" but have different types,
    // which is what triggers the schema-merge failure.
    val data1 = Seq(
      Row(1, 1L),
      Row(2, 2L))
    val schema1 = StructType(List(
      StructField("col1", IntegerType),
      StructField("col1", LongType)))
    val df = spark.createDataFrame(spark.sparkContext.makeRDD(data1), schema1)
    df.show()

    import spark.implicits._

    // groupByKey + mapGroups uses RowEncoder(df.schema); the optimizer's schema
    // pruning then tries to merge the two "col1" fields and fails.
    val distinct = df
      .groupByKey {
        row => row.getInt(0)
      }
      .mapGroups {
        case (_, iter) =>
          iter.maxBy(row => {
            row.getInt(0)
          })
      }(RowEncoder(df.schema))

    distinct.show()
  }

}
{code}

It seems that the Catalyst optimizer applies SchemaPruning, which raises an exception 
when the schema contains fields with the same name but different types.

> Misuse "merge schema" when mapGroups
> 
>
> Key: SPARK-36932
> URL: https://issues.apache.org/jira/browse/SPARK-36932
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 3.0.0
>Reporter: Wang Zekai
>Priority: Major
>
> {code:java}
> // Test case for this bug
> val spark = SparkSession.builder().master("local[*]").getOrCreate()
> val data1 = Seq(
>   Row("0", 1),
>   Row("0", 2))
> val schema1 = StructType(List(
>   StructField("col0", StringType),
>   StructField("col1", IntegerType))
> )
> val data2 = Seq(
>   Row("0", 1),
>   Row("0", 2))
> val schema2 = StructType(List(
>   StructField("str0", StringType),
>   StructField("col0", IntegerType))
> )
> val df1 = spark.createDataFrame(spark.sparkContext.makeRDD(data1), schema1)
> val df2 = spark.createDataFrame(spark.sparkContext.makeRDD(data2), schema2)
> val joined = df1.join(df2, df1("col0") === df2("str0"), "left")
> import spark.implicits._
> val distinct = joined
>   .groupByKey {
> row => row.getInt(1)
>   }
>   .mapGroups {
> case (_, iter) =>
> iter.maxBy(row => {
>   row.getInt(3)
> })
>   }(RowEncoder(joined.schema))
> distinct.show(){code}
> {code:java}
>  // A part of errors
> Exception in thread "main" org.apache.spark.SparkException: Failed to merge 
> fields 'col0' and 'col0'. Failed to merge incompatible data types string and 
> int 
> at 
> org.apache.spark.sql.types.StructType$.$anonfun$merge$2(StructType.scala:593)
> at scala.Option.map(Option.scala:163)
> at 
> org.apache.spark.sql.types.StructType$.$anonfun$merge$1(StructType.scala:585)
> at org.apache.spark.sql.types.StructType$.$anonfun$merge$1$adapted
> (StructType.scala:582)
> at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
> at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) 
> at org.apache.spark.sql.types.StructType$.merge(StructType.scala:582) 
> at org.apache.spark.sql.types.StructType.merge(StructType.scala:492)
> at org.apache.spark.sql.catalyst.expressions.SchemaPruning$.$
> anonfun$pruneDataSchema$2(SchemaPruning.scala:36)
> at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
> at 
> scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122) 
> at scala.collection.immutable.List.foldLeft(List.scala:89) 
> at 
> scala.collection.LinearSeqOptimized.reduceLeft(LinearSeqOptimized.scala:140) 
> at 
> scala.collection.LinearSeqOptimized.reduceLeft$(LinearSeqOptimized.scala:138) 
> at scala.collection.immutable.List.reduceLeft(List.scala:89)
> {code}
> After left-joining two DataFrames whose schemas have columns with the same name but 
> different types, we use groupByKey and mapGroups to get the result, but it fails. 
> Is this a mistake on my side? If not, I think it may be related to the schema merge 
> in StructType.scala:593. How can I turn off schema merging?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36972) Add max_by/min_by API to PySpark

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427029#comment-17427029
 ] 

Apache Spark commented on SPARK-36972:
--

User 'yoda-mon' has created a pull request for this issue:
https://github.com/apache/spark/pull/34240

> Add max_by/min_by API to PySpark
> 
>
> Key: SPARK-36972
> URL: https://issues.apache.org/jira/browse/SPARK-36972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Priority: Major
>
> Related issues
>  - https://issues.apache.org/jira/browse/SPARK-27653
>  * https://issues.apache.org/jira/browse/SPARK-36963



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36972) Add max_by/min_by API to PySpark

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36972:


Assignee: Apache Spark

> Add max_by/min_by API to PySpark
> 
>
> Key: SPARK-36972
> URL: https://issues.apache.org/jira/browse/SPARK-36972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Assignee: Apache Spark
>Priority: Major
>
> Related issues
>  - https://issues.apache.org/jira/browse/SPARK-27653
>  * https://issues.apache.org/jira/browse/SPARK-36963



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36972) Add max_by/min_by API to PySpark

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36972:


Assignee: (was: Apache Spark)

> Add max_by/min_by API to PySpark
> 
>
> Key: SPARK-36972
> URL: https://issues.apache.org/jira/browse/SPARK-36972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Priority: Major
>
> Related issues
>  - https://issues.apache.org/jira/browse/SPARK-27653
>  * https://issues.apache.org/jira/browse/SPARK-36963



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join

2021-10-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-36794:
-

> Ignore duplicated join keys when building relation for SEMI/ANTI hash join
> --
>
> Key: SPARK-36794
> URL: https://issues.apache.org/jira/browse/SPARK-36794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
>
> For LEFT SEMI and LEFT ANTI hash equi-joins without an extra join condition, we 
> only need to keep one row per unique join key(s) inside the hash table 
> (`HashedRelation`) when building it. This can help reduce the size of the join's 
> hash table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema

2021-10-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-35531:
---

Assignee: angerszhu

> Can not insert into hive bucket table if create table with upper case schema
> 
>
> Key: SPARK-35531
> URL: https://issues.apache.org/jira/browse/SPARK-35531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.1, 3.2.0
>Reporter: Hongyi Zhang
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
>  
>  
> create table TEST1(
>  V1 BIGINT,
>  S1 INT)
>  partitioned by (PK BIGINT)
>  clustered by (V1)
>  sorted by (S1)
>  into 200 buckets
>  STORED AS PARQUET;
>  
> insert into test1
>  select
>  * from values(1,1,1);
>  
>  
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36973) Deduplicate prepare data method for HistogramPlotBase and KdePlotBase

2021-10-11 Thread dch nguyen (Jira)
dch nguyen created SPARK-36973:
--

 Summary: Deduplicate prepare data method for HistogramPlotBase and 
KdePlotBase
 Key: SPARK-36973
 URL: https://issues.apache.org/jira/browse/SPARK-36973
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.3.0
Reporter: dch nguyen






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36974) Try to raise memory and parallelism again for GA

2021-10-11 Thread angerszhu (Jira)
angerszhu created SPARK-36974:
-

 Summary: Try to raise memory and parallelism again for GA
 Key: SPARK-36974
 URL: https://issues.apache.org/jira/browse/SPARK-36974
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.2.1, 3.3.0
Reporter: angerszhu


We see many OOM issues in GA:
https://github.com/AngersZh/spark/runs/3855782221?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-35531) Can not insert into hive bucket table if create table with upper case schema

2021-10-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-35531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-35531.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34218
[https://github.com/apache/spark/pull/34218]

> Can not insert into hive bucket table if create table with upper case schema
> 
>
> Key: SPARK-35531
> URL: https://issues.apache.org/jira/browse/SPARK-35531
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.1, 3.2.0
>Reporter: Hongyi Zhang
>Priority: Major
> Fix For: 3.3.0
>
>
>  
>  
> create table TEST1(
>  V1 BIGINT,
>  S1 INT)
>  partitioned by (PK BIGINT)
>  clustered by (V1)
>  sorted by (S1)
>  into 200 buckets
>  STORED AS PARQUET;
>  
> insert into test1
>  select
>  * from values(1,1,1);
>  
>  
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Bucket columns V1 is not 
> part of the table columns ([FieldSchema(name:v1, type:bigint, comment:null), 
> FieldSchema(name:s1, type:int, comment:null)]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36974) Try to raise memory and parallelism again for GA

2021-10-11 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu resolved SPARK-36974.
---
Resolution: Duplicate

> Try to raise memory and parallelism again for GA
> 
>
> Key: SPARK-36974
> URL: https://issues.apache.org/jira/browse/SPARK-36974
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.2.1, 3.3.0
>Reporter: angerszhu
>Priority: Major
>
> We see many OOM issues in GA:
> https://github.com/AngersZh/spark/runs/3855782221?check_suite_focus=true



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36975) Refactor HiveClientImpl collect hive client call logic

2021-10-11 Thread angerszhu (Jira)
angerszhu created SPARK-36975:
-

 Summary: Refactor HiveClientImpl collect hive client call logic
 Key: SPARK-36975
 URL: https://issues.apache.org/jira/browse/SPARK-36975
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.1.2, 3.2.0
Reporter: angerszhu


Currently, we treat one call to withHiveState as one Hive client call, which is odd. 
This needs to be refactored.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36975) Refactor HiveClientImpl collect hive client call logic

2021-10-11 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427053#comment-17427053
 ] 

angerszhu commented on SPARK-36975:
---

Will raise a PR soon.

> Refactor HiveClientImpl collect hive client call logic
> --
>
> Key: SPARK-36975
> URL: https://issues.apache.org/jira/browse/SPARK-36975
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Currently, we treat one call to withHiveState as one Hive client call, which is odd. 
> This needs to be refactored.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36975) Refactor HiveClientImpl collect hive client call logic

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36975:


Assignee: Apache Spark

> Refactor HiveClientImpl collect hive client call logic
> --
>
> Key: SPARK-36975
> URL: https://issues.apache.org/jira/browse/SPARK-36975
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> Currently, we treat one call to withHiveState as one Hive client call, which is odd. 
> This needs to be refactored.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36975) Refactor HiveClientImpl collect hive client call logic

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36975:


Assignee: (was: Apache Spark)

> Refactor HiveClientImpl collect hive client call logic
> --
>
> Key: SPARK-36975
> URL: https://issues.apache.org/jira/browse/SPARK-36975
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Currently, we treat one call to withHiveState as one Hive client call, which is odd. 
> This needs to be refactored.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36975) Refactor HiveClientImpl collect hive client call logic

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427070#comment-17427070
 ] 

Apache Spark commented on SPARK-36975:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34241

> Refactor HiveClientImpl collect hive client call logic
> --
>
> Key: SPARK-36975
> URL: https://issues.apache.org/jira/browse/SPARK-36975
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Currently, we treat one call to withHiveState as one Hive client call, which is odd. 
> This needs to be refactored.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36975) Refactor HiveClientImpl collect hive client call logic

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427071#comment-17427071
 ] 

Apache Spark commented on SPARK-36975:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34241

> Refactor HiveClientImpl collect hive client call logic
> --
>
> Key: SPARK-36975
> URL: https://issues.apache.org/jira/browse/SPARK-36975
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Currently, we treat one call to withHiveState as one Hive client call, which is odd. 
> This needs to be refactored.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36588) Use v2 commands by default

2021-10-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-36588:
---

Assignee: Terry Kim

> Use v2 commands by default
> --
>
> Key: SPARK-36588
> URL: https://issues.apache.org/jira/browse/SPARK-36588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Terry Kim
>Priority: Major
>
> It's been a while since we introduced the v2 commands, and I think it's time to use 
> v2 commands by default even for the session catalog, with a legacy config to fall 
> back to the v1 commands.
> We can do this one command at a time, with tests for both the v1 and v2 versions. 
> The tests should help us understand the behavior differences between v1 and v2 
> commands, so that we can:
>  # fix the v2 commands to match the v1 behavior
>  # or accept the behavior difference and write a migration guide
> We can reuse the test framework built in 
> https://issues.apache.org/jira/browse/SPARK-33381



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36588) Use v2 commands by default

2021-10-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36588.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34137
[https://github.com/apache/spark/pull/34137]

> Use v2 commands by default
> --
>
> Key: SPARK-36588
> URL: https://issues.apache.org/jira/browse/SPARK-36588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> It's been a while since we introduced the v2 commands, and I think it's time to use 
> v2 commands by default even for the session catalog, with a legacy config to fall 
> back to the v1 commands.
> We can do this one command at a time, with tests for both the v1 and v2 versions. 
> The tests should help us understand the behavior differences between v1 and v2 
> commands, so that we can:
>  # fix the v2 commands to match the v1 behavior
>  # or accept the behavior difference and write a migration guide
> We can reuse the test framework built in 
> https://issues.apache.org/jira/browse/SPARK-33381



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-36588) Use v2 commands by default

2021-10-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-36588:
-

> Use v2 commands by default
> --
>
> Key: SPARK-36588
> URL: https://issues.apache.org/jira/browse/SPARK-36588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> It's been a while since we introduced the v2 commands, and I think it's time to use 
> v2 commands by default even for the session catalog, with a legacy config to fall 
> back to the v1 commands.
> We can do this one command at a time, with tests for both the v1 and v2 versions. 
> The tests should help us understand the behavior differences between v1 and v2 
> commands, so that we can:
>  # fix the v2 commands to match the v1 behavior
>  # or accept the behavior difference and write a migration guide
> We can reuse the test framework built in 
> https://issues.apache.org/jira/browse/SPARK-33381



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-36588) Use v2 commands by default

2021-10-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-36588:

Comment: was deleted

(was: Issue resolved by pull request 34137
[https://github.com/apache/spark/pull/34137])

> Use v2 commands by default
> --
>
> Key: SPARK-36588
> URL: https://issues.apache.org/jira/browse/SPARK-36588
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> It's been a while since we introduced the v2 commands, and I think it's time to use 
> v2 commands by default even for the session catalog, with a legacy config to fall 
> back to the v1 commands.
> We can do this one command at a time, with tests for both the v1 and v2 versions. 
> The tests should help us understand the behavior differences between v1 and v2 
> commands, so that we can:
>  # fix the v2 commands to match the v1 behavior
>  # or accept the behavior difference and write a migration guide
> We can reuse the test framework built in 
> https://issues.apache.org/jira/browse/SPARK-33381



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36678) Migrate SHOW TABLES to use V2 command by default

2021-10-11 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427077#comment-17427077
 ] 

Wenchen Fan commented on SPARK-36678:
-

resolved by https://github.com/apache/spark/pull/34137

> Migrate SHOW TABLES to use V2 command by default
> 
>
> Key: SPARK-36678
> URL: https://issues.apache.org/jira/browse/SPARK-36678
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Priority: Major
>
> Migrate SHOW TABLES to use V2 command by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36678) Migrate SHOW TABLES to use V2 command by default

2021-10-11 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-36678.
-
Fix Version/s: 3.3.0
 Assignee: Terry Kim
   Resolution: Fixed

> Migrate SHOW TABLES to use V2 command by default
> 
>
> Key: SPARK-36678
> URL: https://issues.apache.org/jira/browse/SPARK-36678
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> Migrate SHOW TABLES to use V2 command by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36678) Migrate SHOW TABLES to use V2 command by default

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427079#comment-17427079
 ] 

Apache Spark commented on SPARK-36678:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/34137

> Migrate SHOW TABLES to use V2 command by default
> 
>
> Key: SPARK-36678
> URL: https://issues.apache.org/jira/browse/SPARK-36678
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.3.0
>
>
> Migrate SHOW TABLES to use V2 command by default.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36861) Partition columns are overly eagerly parsed as dates

2021-10-11 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427085#comment-17427085
 ] 

Wenchen Fan commented on SPARK-36861:
-

I think partition value parsing needs to be stricter. cc [~maxgekk]

> Partition columns are overly eagerly parsed as dates
> 
>
> Key: SPARK-36861
> URL: https://issues.apache.org/jira/browse/SPARK-36861
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Tanel Kiis
>Priority: Blocker
>
> I have an input directory with subdirs:
> * hour=2021-01-01T00
> * hour=2021-01-01T01
> * hour=2021-01-01T02
> * ...
> In Spark 3.1 the 'hour' column is parsed as a string type, but in the 3.2 RC it 
> is parsed as a date type and the hour part is lost.
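
One way to keep the old behavior while this is being sorted out (a sketch; the directory path is illustrative, and the second option simply pins the column type through a user-supplied schema):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# Option 1: disable partition column type inference so 'hour' stays a string.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
df = spark.read.parquet("/data/events")  # contains hour=2021-01-01T00/... subdirs

# Option 2: supply an explicit schema that declares the partition column as a string.
schema = StructType([
    StructField("id", LongType()),
    StructField("hour", StringType()),
])
df2 = spark.read.schema(schema).parquet("/data/events")
{code}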



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36877) Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing reruns

2021-10-11 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427087#comment-17427087
 ] 

Wenchen Fan commented on SPARK-36877:
-

> Should calling df.rdd trigger actual job execution when AQE is enabled?

Yes, it should. Getting the RDD means the physical plan is finalized. With AQE, 
finalizing the physical plan means running all the query stages except for the 
last stage.
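
Given that, one commonly suggested way to avoid paying for the expensive stages twice is to persist the DataFrame before touching .rdd, so later actions reuse the materialized data (a sketch, not an official recommendation; how much it helps depends on the job):

{code:python}
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")
    .getOrCreate()
)

df = spark.range(10).join(spark.range(10), "id").groupBy("id").count()

# Cache the result so the upstream computation is materialized at most once and
# reused across the .rdd access and the later write.
df.persist()
num_partitions = df.rdd.getNumPartitions()
df.repartition(5).write.mode("overwrite").parquet("/tmp/out")  # illustrative path
df.unpersist()
{code}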

> Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing 
> reruns
> --
>
> Key: SPARK-36877
> URL: https://issues.apache.org/jira/browse/SPARK-36877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: Screen Shot 2021-09-28 at 09.32.20.png
>
>
> In one of our jobs we perform the following operation:
> {code:scala}
> val df = /* some expensive multi-table/multi-stage join */
> val numPartitions = df.rdd.getNumPartitions
> df.repartition(x).write.
> {code}
> With AQE enabled, we found that the expensive stages were being run twice 
> causing significant performance regression after enabling AQE; once when 
> calling {{df.rdd}} and again when calling {{df.write}}.
> A more concrete example:
> {code:scala}
> scala> sql("SET spark.sql.adaptive.enabled=true")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val df1 = spark.range(10).withColumn("id2", $"id")
> df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
> "id").join(spark.range(10), "id")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df3 = df2.groupBy("id2").count()
> df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]
> scala> df3.rdd.getNumPartitions
> res2: Int = 10(0 + 16) / 
> 16]
> scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
> {code}
> In the screenshot below, you can see that the first 3 stages (0 to 4) were 
> rerun again (5 to 9).
> I have two questions:
> 1) Should calling df.rdd trigger actual job execution when AQE is enabled?
> 2) Should calling df.write later cause rerun of the stages? If df.rdd has 
> already partially executed the stages, shouldn't it reuse the result from 
> previous stages?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36877) Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing reruns

2021-10-11 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36877?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427089#comment-17427089
 ] 

Wenchen Fan commented on SPARK-36877:
-

> shouldn't it reuse the result from previous stages?

One DataFrame means one query, and today Spark can't reuse 
shuffle/broadcast/subquery across queries.

> Calling ds.rdd with AQE enabled leads to jobs being run, eventually causing 
> reruns
> --
>
> Key: SPARK-36877
> URL: https://issues.apache.org/jira/browse/SPARK-36877
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.1
>Reporter: Shardul Mahadik
>Priority: Major
> Attachments: Screen Shot 2021-09-28 at 09.32.20.png
>
>
> In one of our jobs we perform the following operation:
> {code:scala}
> val df = /* some expensive multi-table/multi-stage join */
> val numPartitions = df.rdd.getNumPartitions
> df.repartition(x).write.
> {code}
> With AQE enabled, we found that the expensive stages were being run twice 
> causing significant performance regression after enabling AQE; once when 
> calling {{df.rdd}} and again when calling {{df.write}}.
> A more concrete example:
> {code:scala}
> scala> sql("SET spark.sql.adaptive.enabled=true")
> res0: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> sql("SET spark.sql.autoBroadcastJoinThreshold=-1")
> res1: org.apache.spark.sql.DataFrame = [key: string, value: string]
> scala> val df1 = spark.range(10).withColumn("id2", $"id")
> df1: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df2 = df1.join(spark.range(10), "id").join(spark.range(10), 
> "id").join(spark.range(10), "id")
> df2: org.apache.spark.sql.DataFrame = [id: bigint, id2: bigint]
> scala> val df3 = df2.groupBy("id2").count()
> df3: org.apache.spark.sql.DataFrame = [id2: bigint, count: bigint]
> scala> df3.rdd.getNumPartitions
> res2: Int = 10(0 + 16) / 
> 16]
> scala> df3.repartition(5).write.mode("overwrite").orc("/tmp/orc1")
> {code}
> In the screenshot below, you can see that the first 3 stages (0 to 4) were 
> rerun again (5 to 9).
> I have two questions:
> 1) Should calling df.rdd trigger actual job execution when AQE is enabled?
> 2) Should calling df.write later cause rerun of the stages? If df.rdd has 
> already partially executed the stages, shouldn't it reuse the result from 
> previous stages?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36972) Add max_by/min_by API to PySpark

2021-10-11 Thread Leona Yoda (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leona Yoda updated SPARK-36972:
---
Priority: Minor  (was: Major)

> Add max_by/min_by API to PySpark
> 
>
> Key: SPARK-36972
> URL: https://issues.apache.org/jira/browse/SPARK-36972
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Leona Yoda
>Priority: Minor
>
> Related issues
>  - https://issues.apache.org/jira/browse/SPARK-27653
>  * https://issues.apache.org/jira/browse/SPARK-36963



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36976) Add max_by/min_by API to SparkR

2021-10-11 Thread Leona Yoda (Jira)
Leona Yoda created SPARK-36976:
--

 Summary: Add max_by/min_by API to SparkR
 Key: SPARK-36976
 URL: https://issues.apache.org/jira/browse/SPARK-36976
 Project: Spark
  Issue Type: Improvement
  Components: R
Affects Versions: 3.3.0
Reporter: Leona Yoda


Related issues
 - https://issues.apache.org/jira/browse/SPARK-27653

 * https://issues.apache.org/jira/browse/SPARK-36963



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36540) AM should not just finish with Success when disconnected

2021-10-11 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-36540.
---
Fix Version/s: 3.3.0
 Assignee: angerszhu
   Resolution: Fixed

> AM should not just finish with Success when disconnected
> -
>
> Key: SPARK-36540
> URL: https://issues.apache.org/jira/browse/SPARK-36540
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, YARN
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.3.0
>
>
> We hit a case where the AM loses its connection:
> {code}
> 21/08/18 02:14:15 ERROR TransportRequestHandler: Error sending result 
> RpcResponse{requestId=5675952834716124039, 
> body=NioManagedBuffer{buf=java.nio.HeapByteBuffer[pos=0 lim=47 cap=64]}} to 
> xx.xx.xx.xx:41420; closing connection
> java.nio.channels.ClosedChannelException
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
> at 
> io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
> at 
> io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
> at 
> io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
> at 
> io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1104)
> at 
> io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
> at 
> io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> at 
> io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> Checking the client code: when the AMEndpoint is disconnected, it finishes the 
> application with a SUCCESS final status:
> {code}
> override def onDisconnected(remoteAddress: RpcAddress): Unit = {
>   // In cluster mode or unmanaged am case, do not rely on the 
> disassociated event to exit
>   // This avoids potentially reporting incorrect exit codes if the driver 
> fails
>   if (!(isClusterMode || sparkConf.get(YARN_UNMANAGED_AM))) {
> logInfo(s"Driver terminated or disconnected! Shutting down. 
> $remoteAddress")
> finish(FinalApplicationStatus.SUCCEEDED, 
> ApplicationMaster.EXIT_SUCCESS)
>   }
> }
> {code}
> Normally in client mode, when the application succeeds, the driver stops and the AM 
> loses the connection, so exiting with SUCCESS is fine. But if a network problem causes 
> the disconnect, still finishing with a SUCCESS final status is not correct.
> Then YarnClientSchedulerBackend will receive an application report with a SUCCESS final 
> status and stop the SparkContext, so the application has actually failed but is marked 
> as a normal stop.
> {code}
>   private class MonitorThread extends Thread {
> private var allowInterrupt = true
> override def run() {
>   try {
> val YarnAppReport(_, state, diags) =
>   client.monitorApplication(appId.get, logApplicationReport = false)
> logError(s"YARN application has exited unexpectedly with state 
> $state! " +
>   "Check the YARN application logs for more details.")
> diags.foreach { err =>
>   logError(s"Diagnostics message: $err")
> }
> allowInterrupt = false
> sc.stop()
>   } catch {
> case e: InterruptedException => logInfo("Interrupting monitor thread")
>   }
> }
> def stopMonitor(): Unit = {
>   if (allowInterrupt) {
> this.interrupt()
>   }
> }
>   }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36977) Update docs to reflect that Python 3.6 is no longer supported

2021-10-11 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-36977:
--

 Summary: Update docs to reflect that Python 3.6 is no longer 
supported
 Key: SPARK-36977
 URL: https://issues.apache.org/jira/browse/SPARK-36977
 Project: Spark
  Issue Type: Sub-task
  Components: docs, PySpark
Affects Versions: 3.3.0
Reporter: Maciej Szymkiewicz






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36977) Update docs to reflect that Python 3.6 is no longer supported

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427150#comment-17427150
 ] 

Apache Spark commented on SPARK-36977:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/34242

> Update docs to reflect that Python 3.6 is no longer supported
> -
>
> Key: SPARK-36977
> URL: https://issues.apache.org/jira/browse/SPARK-36977
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36977) Update docs to reflect that Python 3.6 is no longer supported

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36977:


Assignee: (was: Apache Spark)

> Update docs to reflect that Python 3.6 is no longer supported
> -
>
> Key: SPARK-36977
> URL: https://issues.apache.org/jira/browse/SPARK-36977
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36977) Update docs to reflect that Python 3.6 is no longer supported

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36977:


Assignee: Apache Spark

> Update docs to reflect that Python 3.6 is no longer supported
> -
>
> Key: SPARK-36977
> URL: https://issues.apache.org/jira/browse/SPARK-36977
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36949) Fix CREATE TABLE AS SELECT of ANSI intervals

2021-10-11 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427171#comment-17427171
 ] 

Wenchen Fan commented on SPARK-36949:
-

Interesting. Does this work in Hive natively? CTAS with interval type.

> Fix CREATE TABLE AS SELECT of ANSI intervals
> 
>
> Key: SPARK-36949
> URL: https://issues.apache.org/jira/browse/SPARK-36949
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> The given SQL should work:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 STORED AS PARQUET AS SELECT INTERVAL '1-1' YEAR 
> TO MONTH AS YM;
> 21/10/07 21:35:59 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.IllegalArgumentException: Error: type expected at the position 0 of 
> 'interval year to month' but 'interval year to month' is found
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36962) Make HiveSerDe.serdeMap extensible

2021-10-11 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427172#comment-17427172
 ] 

Wenchen Fan commented on SPARK-36962:
-

Seems fine to make it extensible, but we need to carefully design the API.
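For illustration only, one possible shape for such an API; the {{registerSerde}} entry point and the example class names below are hypothetical, not an existing Spark API:

{code:java}
// Hypothetical sketch only: allow plugging an extra format -> serde mapping next to
// the built-in json/orc/parquet/avro entries of HiveSerDe.serdeMap.
import org.apache.spark.sql.internal.HiveSerDe

HiveSerDe.registerSerde(          // hypothetical API, does not exist in Spark today
  "myformat",
  HiveSerDe(
    inputFormat = Some("com.example.MyInputFormat"),    // assumed user-provided classes
    outputFormat = Some("com.example.MyOutputFormat"),
    serde = Some("com.example.MySerDe")))

// A table created with this provider could then be persisted in a Hive-compatible way.
spark.sql("CREATE TABLE t (id INT) USING myformat")
{code}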

> Make HiveSerDe.serdeMap extensible
> --
>
> Key: SPARK-36962
> URL: https://issues.apache.org/jira/browse/SPARK-36962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.9, 3.1.3
>Reporter: Yann Byron
>Priority: Major
>
> Currently, only tables of certain formats (e.g. json, orc, parquet, avro) that 
> are defined in HiveSerDe.serdeMap can be created correctly in the Hive 
> metastore.
> Other tables won't be persisted in a Hive-compatible way; they are only 
> persisted into the Hive metastore in a Spark SQL specific format, as follows:
> {code:java}
> col    array    from deserializer  
> {code}
> So, can we make this extensible, so that the needed serde info can be passed 
> to Spark and a Hive-compatible table can be created via spark-sql?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36949) Fix CREATE TABLE AS SELECT of ANSI intervals

2021-10-11 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427195#comment-17427195
 ] 

Max Gekk commented on SPARK-36949:
--

I wasn't able to create a table with intervals:

{code:sql}
0: jdbc:hive2://localhost:1/default> CREATE TABLE tbl1 (ym INTERVAL YEAR TO 
MONTH) USING PARQUET;
Error: Error while compiling statement: FAILED: ParseException line 1:22 cannot 
recognize input near 'INTERVAL' 'YEAR' 'TO' in column type 
(state=42000,code=4)
0: jdbc:hive2://localhost:1/default> CREATE TABLE tbl1 (ym INTERVAL YEAR TO 
MONTH);
Error: Error while compiling statement: FAILED: ParseException line 1:22 cannot 
recognize input near 'INTERVAL' 'YEAR' 'TO' in column type 
(state=42000,code=4)
0: jdbc:hive2://localhost:1/default> CREATE TABLE tbl1 (ym 
interval_year_month);
Error: Error while compiling statement: FAILED: ParseException line 1:22 cannot 
recognize input near 'interval_year_month' ')' '' in column type 
(state=42000,code=4)
0: jdbc:hive2://localhost:1/default> CREATE TABLE tbl1 (ym 
INTERVAL_YEAR_MONTH) USING PARQUET;
Error: Error while compiling statement: FAILED: ParseException line 1:22 cannot 
recognize input near 'INTERVAL_YEAR_MONTH' ')' 'USING' in column type 
(state=42000,code=4)
{code}
but Hive accepts intervals as literals:
{code:sql}
0: jdbc:hive2://localhost:1/default> SELECT INTERVAL '1-1' YEAR TO MONTH;
+--+
| _c0  |
+--+
| 1-1  |
+--+
{code}
 Maybe Hive just doesn't support storing intervals in tables, only intervals in 
expressions.
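If that is the case, a possible workaround for the CTAS above (just a sketch, untested here) is to materialize the value as a string column instead of an interval-typed column:

{code:java}
// Sketch of a workaround: avoid an interval-typed column in the Hive table by
// storing the value's string form (relies on Spark's interval-to-string cast).
spark.sql("""
  CREATE TABLE tbl1 STORED AS PARQUET AS
  SELECT CAST(INTERVAL '1-1' YEAR TO MONTH AS STRING) AS ym
""")
{code}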

> Fix CREATE TABLE AS SELECT of ANSI intervals
> 
>
> Key: SPARK-36949
> URL: https://issues.apache.org/jira/browse/SPARK-36949
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Priority: Major
>
> The given SQL should work:
> {code:sql}
> spark-sql> CREATE TABLE tbl1 STORED AS PARQUET AS SELECT INTERVAL '1-1' YEAR 
> TO MONTH AS YM;
> 21/10/07 21:35:59 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, 
> since hive.security.authorization.manager is set to instance of 
> HiveAuthorizerFactory.
> Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.lang.IllegalArgumentException: Error: type expected at the position 0 of 
> 'interval year to month' but 'interval year to month' is found
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36853) Code failing on checkstyle

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36853:


Assignee: Apache Spark

> Code failing on checkstyle
> --
>
> Key: SPARK-36853
> URL: https://issues.apache.org/jira/browse/SPARK-36853
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Abhinav Kumar
>Assignee: Apache Spark
>Priority: Trivial
>
> There are more - just pasting sample 
>  
> [INFO] There are 32 errors reported by Checkstyle 8.43 with 
> dev/checkstyle.xml ruleset.
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF11.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 107).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF12.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 116).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 104).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 125).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 109).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 134).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 114).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 143).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 119).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 152).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 124).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 161).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 129).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 170).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 134).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 179).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 139).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 188).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 144).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 197).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 149).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 206).
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[44,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[60,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[75,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[88,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[100,25] 
> (naming) MethodName: Method name 'Once' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[110,25] 
> (naming) MethodName: Method name 'AvailableNow' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\stream

[jira] [Assigned] (SPARK-36853) Code failing on checkstyle

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36853:


Assignee: (was: Apache Spark)

> Code failing on checkstyle
> --
>
> Key: SPARK-36853
> URL: https://issues.apache.org/jira/browse/SPARK-36853
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Abhinav Kumar
>Priority: Trivial
>
> There are more - just pasting sample 
>  
> [INFO] There are 32 errors reported by Checkstyle 8.43 with 
> dev/checkstyle.xml ruleset.
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF11.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 107).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF12.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 116).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 104).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 125).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 109).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 134).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 114).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 143).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 119).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 152).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 124).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 161).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 129).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 170).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 134).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 179).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 139).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 188).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 144).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 197).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 149).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 206).
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[44,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[60,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[75,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[88,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[100,25] 
> (naming) MethodName: Method name 'Once' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[110,25] 
> (naming) MethodName: Method name 'AvailableNow' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[120,25]

[jira] [Commented] (SPARK-36853) Code failing on checkstyle

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427207#comment-17427207
 ] 

Apache Spark commented on SPARK-36853:
--

User 'Shockang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34243

> Code failing on checkstyle
> --
>
> Key: SPARK-36853
> URL: https://issues.apache.org/jira/browse/SPARK-36853
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Abhinav Kumar
>Priority: Trivial
>
> There are more - just pasting sample 
>  
> [INFO] There are 32 errors reported by Checkstyle 8.43 with 
> dev/checkstyle.xml ruleset.
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF11.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 107).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF12.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 116).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 104).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 125).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 109).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 134).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 114).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 143).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 119).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 152).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 124).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 161).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 129).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 170).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 134).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 179).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 139).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 188).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 144).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 197).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 149).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 206).
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[44,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[60,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[75,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[88,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[100,25] 
> (naming) MethodName: Method name 'Once' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[110,25] 
> (naming) MethodName: Method name 'AvailableNow' must match patt

[jira] [Commented] (SPARK-36853) Code failing on checkstyle

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427208#comment-17427208
 ] 

Apache Spark commented on SPARK-36853:
--

User 'Shockang' has created a pull request for this issue:
https://github.com/apache/spark/pull/34243

> Code failing on checkstyle
> --
>
> Key: SPARK-36853
> URL: https://issues.apache.org/jira/browse/SPARK-36853
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Abhinav Kumar
>Priority: Trivial
>
> There are more - just pasting sample 
>  
> [INFO] There are 32 errors reported by Checkstyle 8.43 with 
> dev/checkstyle.xml ruleset.
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF11.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 107).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF12.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 116).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 104).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF13.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 125).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 109).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF14.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 134).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 114).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF15.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 143).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 119).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF16.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 152).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 124).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF17.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 161).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 129).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF18.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 170).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 134).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF19.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 179).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 139).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF20.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 188).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 144).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF21.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 197).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[28] (sizes) 
> LineLength: Line is longer than 100 characters (found 149).
> [ERROR] src\main\java\org\apache\spark\sql\api\java\UDF22.java:[29] (sizes) 
> LineLength: Line is longer than 100 characters (found 206).
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[44,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[60,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[75,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[88,25] 
> (naming) MethodName: Method name 'ProcessingTime' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[100,25] 
> (naming) MethodName: Method name 'Once' must match pattern 
> '^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
> [ERROR] src\main\java\org\apache\spark\sql\streaming\Trigger.java:[110,25] 
> (naming) MethodName: Method name 'AvailableNow' must match patt

[jira] [Commented] (SPARK-36867) Misleading Error Message with Invalid Column and Group By

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427212#comment-17427212
 ] 

Apache Spark commented on SPARK-36867:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/34244

> Misleading Error Message with Invalid Column and Group By
> -
>
> Key: SPARK-36867
> URL: https://issues.apache.org/jira/browse/SPARK-36867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Alan Jackoway
>Priority: Major
>
> When you run a query with an invalid column that also does a group by on a 
> constructed column, the error message you get back references a missing 
> column for the group by rather than the invalid column.
> You can reproduce this in pyspark in 3.1.2 with the following code:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName("Group By Issue").getOrCreate()
> data = spark.createDataFrame(
> [("2021-09-15", 1), ("2021-09-16", 2), ("2021-09-17", 10), ("2021-09-18", 
> 25), ("2021-09-19", 500), ("2021-09-20", 50), ("2021-09-21", 100)],
> schema=["d", "v"]
> )
> data.createOrReplaceTempView("data")
> # This is valid
> spark.sql("select sum(v) as value, date(date_trunc('week', d)) as week from 
> data group by week").show()
> # This is invalid because val is the wrong variable
> spark.sql("select sum(val) as value, date(date_trunc('week', d)) as week from 
> data group by week").show()
> {code}
> The error message for the second spark.sql line is
> {quote}
> pyspark.sql.utils.AnalysisException: cannot resolve '`week`' given input 
> columns: [data.d, data.v]; line 1 pos 81;
> 'Aggregate ['week], ['sum('val) AS value#21, cast(date_trunc(week, cast(d#0 
> as timestamp), Some(America/New_York)) as date) AS week#22]
> +- SubqueryAlias data
>+- LogicalRDD [d#0, v#1L], false
> {quote}
> but the actual problem is that I used the wrong variable name in a different 
> part of the query. Nothing is wrong with {{week}} in this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36867) Misleading Error Message with Invalid Column and Group By

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36867:


Assignee: Apache Spark

> Misleading Error Message with Invalid Column and Group By
> -
>
> Key: SPARK-36867
> URL: https://issues.apache.org/jira/browse/SPARK-36867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Alan Jackoway
>Assignee: Apache Spark
>Priority: Major
>
> When you run a query with an invalid column that also does a group by on a 
> constructed column, the error message you get back references a missing 
> column for the group by rather than the invalid column.
> You can reproduce this in pyspark in 3.1.2 with the following code:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName("Group By Issue").getOrCreate()
> data = spark.createDataFrame(
> [("2021-09-15", 1), ("2021-09-16", 2), ("2021-09-17", 10), ("2021-09-18", 
> 25), ("2021-09-19", 500), ("2021-09-20", 50), ("2021-09-21", 100)],
> schema=["d", "v"]
> )
> data.createOrReplaceTempView("data")
> # This is valid
> spark.sql("select sum(v) as value, date(date_trunc('week', d)) as week from 
> data group by week").show()
> # This is invalid because val is the wrong variable
> spark.sql("select sum(val) as value, date(date_trunc('week', d)) as week from 
> data group by week").show()
> {code}
> The error message for the second spark.sql line is
> {quote}
> pyspark.sql.utils.AnalysisException: cannot resolve '`week`' given input 
> columns: [data.d, data.v]; line 1 pos 81;
> 'Aggregate ['week], ['sum('val) AS value#21, cast(date_trunc(week, cast(d#0 
> as timestamp), Some(America/New_York)) as date) AS week#22]
> +- SubqueryAlias data
>+- LogicalRDD [d#0, v#1L], false
> {quote}
> but the actual problem is that I used the wrong variable name in a different 
> part of the query. Nothing is wrong with {{week}} in this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36867) Misleading Error Message with Invalid Column and Group By

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36867?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36867:


Assignee: (was: Apache Spark)

> Misleading Error Message with Invalid Column and Group By
> -
>
> Key: SPARK-36867
> URL: https://issues.apache.org/jira/browse/SPARK-36867
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Alan Jackoway
>Priority: Major
>
> When you run a query with an invalid column that also does a group by on a 
> constructed column, the error message you get back references a missing 
> column for the group by rather than the invalid column.
> You can reproduce this in pyspark in 3.1.2 with the following code:
> {code:python}
> from pyspark.sql import SparkSession
> spark = SparkSession.builder.appName("Group By Issue").getOrCreate()
> data = spark.createDataFrame(
> [("2021-09-15", 1), ("2021-09-16", 2), ("2021-09-17", 10), ("2021-09-18", 
> 25), ("2021-09-19", 500), ("2021-09-20", 50), ("2021-09-21", 100)],
> schema=["d", "v"]
> )
> data.createOrReplaceTempView("data")
> # This is valid
> spark.sql("select sum(v) as value, date(date_trunc('week', d)) as week from 
> data group by week").show()
> # This is invalid because val is the wrong variable
> spark.sql("select sum(val) as value, date(date_trunc('week', d)) as week from 
> data group by week").show()
> {code}
> The error message for the second spark.sql line is
> {quote}
> pyspark.sql.utils.AnalysisException: cannot resolve '`week`' given input 
> columns: [data.d, data.v]; line 1 pos 81;
> 'Aggregate ['week], ['sum('val) AS value#21, cast(date_trunc(week, cast(d#0 
> as timestamp), Some(America/New_York)) as date) AS week#22]
> +- SubqueryAlias data
>+- LogicalRDD [d#0, v#1L], false
> {quote}
> but the actual problem is that I used the wrong variable name in a different 
> part of the query. Nothing is wrong with {{week}} in this case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36794:


Assignee: Cheng Su  (was: Apache Spark)

> Ignore duplicated join keys when building relation for SEMI/ANTI hash join
> --
>
> Key: SPARK-36794
> URL: https://issues.apache.org/jira/browse/SPARK-36794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
>
> For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we 
> only need to keep one row per unique join key(s) inside the hash table 
> (`HashedRelation`) when building it. This can help reduce the 
> size of the join hash table.
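> For illustration, a minimal query shape this applies to (table and column names 
> here are made up):
> {code:java}
> import spark.implicits._
> // The build side (items) has many duplicate join keys; with LEFT SEMI and no
> // extra join condition, only one row per distinct key has to be kept in the
> // HashedRelation.
> val orders = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "name")
> val items  = Seq(1, 1, 1, 2, 2).toDF("order_id")
> orders.join(items, orders("id") === items("order_id"), "left_semi").show()
> {code}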



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36794:


Assignee: Apache Spark  (was: Cheng Su)

> Ignore duplicated join keys when building relation for SEMI/ANTI hash join
> --
>
> Key: SPARK-36794
> URL: https://issues.apache.org/jira/browse/SPARK-36794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.3.0
>
>
> For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we 
> only need to keep one row per unique join key(s) inside the hash table 
> (`HashedRelation`) when building it. This can help reduce the 
> size of the join hash table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427293#comment-17427293
 ] 

Apache Spark commented on SPARK-33277:
--

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/34245

> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> 
>
> Key: SPARK-33277
> URL: https://issues.apache.org/jira/browse/SPARK-33277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> E.g.,:
> {code:java}
> spark.range(0, 10, 1, 1).write.parquet(path)
> spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
> def f(x):
> return 0
> fUdf = udf(f, LongType())
> spark.read.parquet(path).select(fUdf('id')).head()
> {code}
> This is because the Python evaluation consumes the parent iterator in a 
> separate thread, and it keeps consuming data from the parent even after the task 
> ends and the parent is closed. If an off-heap column vector exists in the 
> parent iterator, it could cause a segmentation fault which crashes the executor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33277) Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427295#comment-17427295
 ] 

Apache Spark commented on SPARK-33277:
--

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/34245

> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> 
>
> Key: SPARK-33277
> URL: https://issues.apache.org/jira/browse/SPARK-33277
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.7, 3.0.1
>Reporter: Takuya Ueshin
>Assignee: Takuya Ueshin
>Priority: Major
> Fix For: 2.4.8, 3.0.2, 3.1.0
>
>
> Python/Pandas UDF right after off-heap vectorized reader could cause executor 
> crash.
> E.g.,:
> {code:java}
> spark.range(0, 10, 1, 1).write.parquet(path)
> spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
> def f(x):
> return 0
> fUdf = udf(f, LongType())
> spark.read.parquet(path).select(fUdf('id')).head()
> {code}
> This is because the Python evaluation consumes the parent iterator in a 
> separate thread, and it keeps consuming data from the parent even after the task 
> ends and the parent is closed. If an off-heap column vector exists in the 
> parent iterator, it could cause a segmentation fault which crashes the executor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36546) Make unionByName null-filling behavior work with array of struct columns

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427353#comment-17427353
 ] 

Apache Spark commented on SPARK-36546:
--

User 'Kimahriman' has created a pull request for this issue:
https://github.com/apache/spark/pull/34246

> Make unionByName null-filling behavior work with array of struct columns
> 
>
> Key: SPARK-36546
> URL: https://issues.apache.org/jira/browse/SPARK-36546
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Vishal Dhavale
>Priority: Major
>
> Currently, unionByName works with two DataFrames with slightly different 
> schemas. It would be good if it also worked with an array of struct columns.
>  
> unionByName fails if we try to merge DataFrames whose array-of-struct 
> columns have slightly different schemas.
> Below is an example.
> Step 1: dataframe arrayStructDf1 with column booksIntersted of type array of 
> struct
> {code:java}
> val arrayStructData = Seq(
>  Row("James",List(Row("Java","XX",120),Row("Scala","XA",300))),
>  Row("Lilly",List(Row("Java","XY",200),Row("Scala","XB",500
> val arrayStructSchema = new StructType().add("name",StringType)
>  .add("booksIntersted",ArrayType(new StructType()
>  .add("name",StringType)
>  .add("author",StringType)
>  .add("pages",IntegerType)))
> val arrayStructDf1 = 
> spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
> arrayStructDf1.printSchema() 
> scala> arrayStructDf1.printSchema()
> root
>  |-- name: string (nullable = true)
>  |-- booksIntersted: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- name: string (nullable = true)
>  |||-- author: string (nullable = true)
>  |||-- pages: integer (nullable = true)
> {code}
>  
> Step 2: Another dataframe arrayStructDf2 with column booksIntersted of type 
> array of a struct but struct contains an extra field called "new_column"
> {code:java}
> val arrayStructData2 = Seq(
>  
> Row("James",List(Row("Java","XX",120,"new_column_data"),Row("Scala","XA",300,"new_column_data"))),
>  
> Row("Lilly",List(Row("Java","XY",200,"new_column_data"),Row("Scala","XB",500,"new_column_data"
> val arrayStructSchemaNewClm = new StructType().add("name",StringType)
>  .add("booksIntersted",ArrayType(new StructType()
>  .add("name",StringType)
>  .add("author",StringType)
>  .add("pages",IntegerType)
>  .add("new_column",StringType)))
> val arrayStructDf2 = 
> spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData2),arrayStructSchemaNewClm)
> arrayStructDf2.printSchema()
> scala> arrayStructDf2.printSchema()
> root
>  |-- name: string (nullable = true)
>  |-- booksIntersted: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- name: string (nullable = true)
>  |||-- author: string (nullable = true)
>  |||-- pages: integer (nullable = true)
>  |||-- new_column: string (nullable = true){code}
>  
> Step3:  Merge arrayStructDf1 and arrayStructDf2 using unionByName
> We see the error org.apache.spark.sql.AnalysisException: Union can only be 
> performed on tables with the compatible column types. 
> {code:java}
> scala> arrayStructDf1.unionByName(arrayStructDf2,allowMissingColumns=true)
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. 
> array> <> 
> array> at the second column of 
> the second table;
> 'Union false, false
> :- LogicalRDD [name#183, booksIntersted#184], false
> +- Project [name#204, booksIntersted#205]
>  +- LogicalRDD [name#204, booksIntersted#205], false{code}
>  
> unionByName should fill the missing data with null, like it does for columns 
> with struct type.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36546) Make unionByName null-filling behavior work with array of struct columns

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36546:


Assignee: Apache Spark

> Make unionByName null-filling behavior work with array of struct columns
> 
>
> Key: SPARK-36546
> URL: https://issues.apache.org/jira/browse/SPARK-36546
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Vishal Dhavale
>Assignee: Apache Spark
>Priority: Major
>
> Currently, unionByName works with two DataFrames with slightly different 
> schemas. It would be good if it also worked with an array of struct columns.
>  
> unionByName fails if we try to merge DataFrames whose array-of-struct 
> columns have slightly different schemas.
> Below is an example.
> Step 1: dataframe arrayStructDf1 with column booksIntersted of type array of 
> struct
> {code:java}
> val arrayStructData = Seq(
>  Row("James",List(Row("Java","XX",120),Row("Scala","XA",300))),
>  Row("Lilly",List(Row("Java","XY",200),Row("Scala","XB",500
> val arrayStructSchema = new StructType().add("name",StringType)
>  .add("booksIntersted",ArrayType(new StructType()
>  .add("name",StringType)
>  .add("author",StringType)
>  .add("pages",IntegerType)))
> val arrayStructDf1 = 
> spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
> arrayStructDf1.printSchema() 
> scala> arrayStructDf1.printSchema()
> root
>  |-- name: string (nullable = true)
>  |-- booksIntersted: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- name: string (nullable = true)
>  |||-- author: string (nullable = true)
>  |||-- pages: integer (nullable = true)
> {code}
>  
> Step 2: Another dataframe arrayStructDf2 with column booksIntersted of type 
> array of a struct but struct contains an extra field called "new_column"
> {code:java}
> val arrayStructData2 = Seq(
>  
> Row("James",List(Row("Java","XX",120,"new_column_data"),Row("Scala","XA",300,"new_column_data"))),
>  
> Row("Lilly",List(Row("Java","XY",200,"new_column_data"),Row("Scala","XB",500,"new_column_data"
> val arrayStructSchemaNewClm = new StructType().add("name",StringType)
>  .add("booksIntersted",ArrayType(new StructType()
>  .add("name",StringType)
>  .add("author",StringType)
>  .add("pages",IntegerType)
>  .add("new_column",StringType)))
> val arrayStructDf2 = 
> spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData2),arrayStructSchemaNewClm)
> arrayStructDf2.printSchema()
> scala> arrayStructDf2.printSchema()
> root
>  |-- name: string (nullable = true)
>  |-- booksIntersted: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- name: string (nullable = true)
>  |||-- author: string (nullable = true)
>  |||-- pages: integer (nullable = true)
>  |||-- new_column: string (nullable = true){code}
>  
> Step3:  Merge arrayStructDf1 and arrayStructDf2 using unionByName
> We see the error org.apache.spark.sql.AnalysisException: Union can only be 
> performed on tables with the compatible column types. 
> {code:java}
> scala> arrayStructDf1.unionByName(arrayStructDf2,allowMissingColumns=true)
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. 
> array> <> 
> array> at the second column of 
> the second table;
> 'Union false, false
> :- LogicalRDD [name#183, booksIntersted#184], false
> +- Project [name#204, booksIntersted#205]
>  +- LogicalRDD [name#204, booksIntersted#205], false{code}
>  
> unionByName should fill the missing data with null, like it does for columns 
> with struct type.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36546) Make unionByName null-filling behavior work with array of struct columns

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36546:


Assignee: (was: Apache Spark)

> Make unionByName null-filling behavior work with array of struct columns
> 
>
> Key: SPARK-36546
> URL: https://issues.apache.org/jira/browse/SPARK-36546
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Vishal Dhavale
>Priority: Major
>
> Currently, unionByName works with two DataFrames with slightly different 
> schemas. It would be good if it also worked with an array of struct columns.
>  
> unionByName fails if we try to merge DataFrames whose array-of-struct 
> columns have slightly different schemas.
> Below is an example.
> Step 1: dataframe arrayStructDf1 with column booksIntersted of type array of 
> struct
> {code:java}
> val arrayStructData = Seq(
>  Row("James",List(Row("Java","XX",120),Row("Scala","XA",300))),
>  Row("Lilly",List(Row("Java","XY",200),Row("Scala","XB",500
> val arrayStructSchema = new StructType().add("name",StringType)
>  .add("booksIntersted",ArrayType(new StructType()
>  .add("name",StringType)
>  .add("author",StringType)
>  .add("pages",IntegerType)))
> val arrayStructDf1 = 
> spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
> arrayStructDf1.printSchema() 
> scala> arrayStructDf1.printSchema()
> root
>  |-- name: string (nullable = true)
>  |-- booksIntersted: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- name: string (nullable = true)
>  |||-- author: string (nullable = true)
>  |||-- pages: integer (nullable = true)
> {code}
>  
> Step 2: Another dataframe arrayStructDf2 with column booksIntersted of type 
> array of a struct but struct contains an extra field called "new_column"
> {code:java}
> val arrayStructData2 = Seq(
>  
> Row("James",List(Row("Java","XX",120,"new_column_data"),Row("Scala","XA",300,"new_column_data"))),
>  
> Row("Lilly",List(Row("Java","XY",200,"new_column_data"),Row("Scala","XB",500,"new_column_data"
> val arrayStructSchemaNewClm = new StructType().add("name",StringType)
>  .add("booksIntersted",ArrayType(new StructType()
>  .add("name",StringType)
>  .add("author",StringType)
>  .add("pages",IntegerType)
>  .add("new_column",StringType)))
> val arrayStructDf2 = 
> spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData2),arrayStructSchemaNewClm)
> arrayStructDf2.printSchema()
> scala> arrayStructDf2.printSchema()
> root
>  |-- name: string (nullable = true)
>  |-- booksIntersted: array (nullable = true)
>  ||-- element: struct (containsNull = true)
>  |||-- name: string (nullable = true)
>  |||-- author: string (nullable = true)
>  |||-- pages: integer (nullable = true)
>  |||-- new_column: string (nullable = true){code}
>  
> Step3:  Merge arrayStructDf1 and arrayStructDf2 using unionByName
> We see the error org.apache.spark.sql.AnalysisException: Union can only be 
> performed on tables with the compatible column types. 
> {code:java}
> scala> arrayStructDf1.unionByName(arrayStructDf2,allowMissingColumns=true)
> org.apache.spark.sql.AnalysisException: Union can only be performed on tables 
> with the compatible column types. 
> array> <> 
> array> at the second column of 
> the second table;
> 'Union false, false
> :- LogicalRDD [name#183, booksIntersted#184], false
> +- Project [name#204, booksIntersted#205]
>  +- LogicalRDD [name#204, booksIntersted#205], false{code}
>  
> unionByName should fill the missing data with null, like it does for columns 
> with struct type.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36794) Ignore duplicated join keys when building relation for SEMI/ANTI hash join

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427356#comment-17427356
 ] 

Apache Spark commented on SPARK-36794:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/34247

> Ignore duplicated join keys when building relation for SEMI/ANTI hash join
> --
>
> Key: SPARK-36794
> URL: https://issues.apache.org/jira/browse/SPARK-36794
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.3.0
>
>
> For a LEFT SEMI or LEFT ANTI hash equi-join without an extra join condition, we 
> only need to keep one row per unique join key(s) inside the hash table 
> (`HashedRelation`) when building it. This can help reduce the 
> size of the join hash table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36978) InferConstraints rule should create IsNotNull constraints on the nested field instead of the root nested type

2021-10-11 Thread Utkarsh Agarwal (Jira)
Utkarsh Agarwal created SPARK-36978:
---

 Summary: InferConstraints rule should create IsNotNull constraints 
on the nested field instead of the root nested type 
 Key: SPARK-36978
 URL: https://issues.apache.org/jira/browse/SPARK-36978
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0, 3.0.0, 3.2.0
Reporter: Utkarsh Agarwal


[InferFiltersFromConstraints|https://github.com/apache/spark/blob/05c0fa573881b49d8ead9a5e16071190e5841e1b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L1206]
 optimization rule generates {{IsNotNull}} constraints corresponding to null 
intolerant predicates. The {{IsNotNull}} constraints are generated on the 
attribute inside the corresponding predicate. 
e.g. a predicate {{a > 0}} on an integer column {{a}} will result in a 
constraint {{IsNotNull(a)}}. On the other hand, a predicate on a nested int 
column {{structCol.b}}, where {{structCol}} is a struct column, results in a 
constraint {{IsNotNull(structCol)}}.

This generation of constraints on the root-level nested type is extremely 
conservative, as it could lead to materialization of the entire struct. The 
constraint should instead be generated on the nested field referenced by 
the predicate. In the above example, the constraint should be 
{{IsNotNull(structCol.b)}} instead of {{IsNotNull(structCol)}}.
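A small example of the query shape in question (the column names below are made up 
for illustration):

{code:java}
import spark.implicits._
import org.apache.spark.sql.functions.struct

// structCol has two nested fields, but the filter only references structCol.b.
val df = spark.range(5)
  .select(struct($"id".as("a"), ($"id" * 2).as("b")).as("structCol"))

// Today the inferred constraint is IsNotNull(structCol); the proposal is to infer
// IsNotNull(structCol.b), so that only the referenced nested field is required.
df.filter($"structCol.b" > 0).explain(true)
{code}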




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36885) Inline type hints for python/pyspark/sql/dataframe.py

2021-10-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36885.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34225
[https://github.com/apache/spark/pull/34225]

> Inline type hints for python/pyspark/sql/dataframe.py
> -
>
> Key: SPARK-36885
> URL: https://issues.apache.org/jira/browse/SPARK-36885
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: Takuya Ueshin
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints for python/pyspark/sql/dataframe.py from 
> python/pyspark/sql/dataframe.pyi.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36977) Update docs to reflect that Python 3.6 is no longer supported

2021-10-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36977.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 34242
[https://github.com/apache/spark/pull/34242]

> Update docs to reflect that Python 3.6 is no longer supported
> -
>
> Key: SPARK-36977
> URL: https://issues.apache.org/jira/browse/SPARK-36977
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36977) Update docs to reflect that Python 3.6 is no longer supported

2021-10-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-36977:


Assignee: Maciej Szymkiewicz

> Update docs to reflect that Python 3.6 is no longer supported
> -
>
> Key: SPARK-36977
> URL: https://issues.apache.org/jira/browse/SPARK-36977
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36647) Push down filter by partition column for Aggregate (Min/Max/Count) for Parquet

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427400#comment-17427400
 ] 

Apache Spark commented on SPARK-36647:
--

User 'huaxingao' has created a pull request for this issue:
https://github.com/apache/spark/pull/34248

> Push down filter by partition column for Aggregate (Min/Max/Count) for Parquet
> --
>
> Key: SPARK-36647
> URL: https://issues.apache.org/jira/browse/SPARK-36647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Currently, Aggregate (Min/Max/Count) pushdown for Parquet is disabled if a filter 
> is involved. This will enable the pushdown if the filter is on a partition column.
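> For illustration, the targeted query shape (the table/column names and the config 
> name below are assumptions, not taken from this ticket):
> {code:java}
> // Aggregate pushdown for Parquet answers MAX/MIN/COUNT from footer metadata.
> // The filter below only touches the partition column dt, so with this change it
> // should no longer block the pushdown.
> spark.conf.set("spark.sql.parquet.aggregatePushDown", "true")  // assumed 3.3-era flag
> spark.sql("SELECT MAX(value), COUNT(*) FROM events WHERE dt = '2021-10-11'").show()
> {code}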



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36647) Push down filter by partition column for Aggregate (Min/Max/Count) for Parquet

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36647:


Assignee: Apache Spark

> Push down filter by partition column for Aggregate (Min/Max/Count) for Parquet
> --
>
> Key: SPARK-36647
> URL: https://issues.apache.org/jira/browse/SPARK-36647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, Aggregate (Min/Max/Count) pushdown for Parquet is disabled if a filter 
> is involved. This will enable the pushdown if the filter is on a partition column.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36647) Push down filter by partition column for Aggregate (Min/Max/Count) for Parquet

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36647:


Assignee: (was: Apache Spark)

> Push down filter by partition column for Aggregate (Min/Max/Count) for Parquet
> --
>
> Key: SPARK-36647
> URL: https://issues.apache.org/jira/browse/SPARK-36647
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> Currently, Aggregate (Min/Max/Count) pushdown for Parquet is disabled if a filter 
> is involved. This will enable the pushdown if the filter is on a partition column.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-36958) Reading of legacy timestamps from Parquet confusing in Spark 3, related config values don't seem working

2021-10-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-36958.
--
Resolution: Not A Problem

> Reading of legacy timestamps from Parquet confusing in Spark 3, related 
> config values don't seem working
> 
>
> Key: SPARK-36958
> URL: https://issues.apache.org/jira/browse/SPARK-36958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.2
> Environment: emr-6.4.0
> spark 3.1.2
>Reporter: Dmitry Goldenberg
>Priority: Major
>
> I'm having a major issue with trying to run in Spark 3, reading parquet data 
> that got generated with Spark 2.4.
> The full stack trace is below.
> The error message is very confusing:
>  # I do not have dates before 1582-10-15 or timestamps before 
> 1900-01-01T00:00:00Z
>  # The documentation does not state clearly how to work around/fix this 
> issue. What exactly is the difference between the LEGACY and CORRECTED values 
> of the config settings?
>  # Which of the following would I want to set and to what values? - 
> spark.sql.legacy.parquet.datetimeRebaseModeInWrite
> - spark.sql.legacy.parquet.datetimeRebaseModeInRead
> - spark.sql.legacy.parquet.int96RebaseModeInRead
> - spark.sql.legacy.parquet.int96RebaseModeInWrite
> - spark.sql.legacy.timeParserPolicy
>  # I've tried setting these to CORRECTED,CORRECTED,CORRECTED,CORRECTED, and 
> LEGACY, respectively, and got the same error (see the stack trace).
> The issues that I see with this:
>  # Lack of thorough clear documentation on what this is and how it's meant to 
> work.
>  # The confusing error message.
>  # The fact that the error still occurs even when you set the config values.
>  
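> For reference, a minimal sketch of setting these values per session in 
> spark-sql (the keys are the ones listed above; whether they resolve this 
> particular job is exactly what is in question):
> {code:sql}
> -- rebase mode for INT96 timestamps written by Spark 2.x or legacy Hive
> SET spark.sql.legacy.parquet.int96RebaseModeInRead=CORRECTED;
> -- rebase mode for dates and non-INT96 timestamps
> SET spark.sql.legacy.parquet.datetimeRebaseModeInRead=CORRECTED;
> {code}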
> {quote} py4j.protocol.Py4JJavaError: An error occurred while calling 
> o1134.count.py4j.protocol.Py4JJavaError: An error occurred while calling 
> o1134.count.: org.apache.spark.SparkException: Job aborted due to stage 
> failure: Task 8 in stage 36.0 failed 4 times, most recent failure: Lost task 
> 8.3 in stage 36.0 (TID 619) (ip-10-2-251-59.awsinternal.audiomack.com 
> executor 2): org.apache.spark.SparkUpgradeException: You may get a different 
> result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or 
> timestamps before 1900-01-01T00:00:00Z from Parquet INT96 files can be 
> ambiguous, as the files may be written by Spark 2.x or legacy versions of 
> Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s 
> Proleptic Gregorian calendar. See more details in SPARK-31404. You can set 
> spark.sql.legacy.parquet.int96RebaseModeInRead to 'LEGACY' to rebase the 
> datetime values w.r.t. the calendar difference during reading. Or set 
> spark.sql.legacy.parquet.int96RebaseModeInRead to 'CORRECTED' to read the 
> datetime values as it is. at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils$.newRebaseExceptionInRead(DataSourceUtils.scala:159)
>  at 
> org.apache.spark.sql.execution.datasources.DataSourceUtils.newRebaseExceptionInRead(DataSourceUtils.scala)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.rebaseTimestamp(VectorizedColumnReader.java:228)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.rebaseInt96(VectorizedColumnReader.java:242)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:662)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:300)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:295)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:193)
>  at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:37)
>  at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:159)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:614)
>  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source) at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source) at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:35)
>  at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:832)
>  at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at 
> org.apache.spark.shuffle

[jira] [Created] (SPARK-36979) Add RewriteLateralSubquery rule into nonExcludableRules

2021-10-11 Thread XiDuo You (Jira)
XiDuo You created SPARK-36979:
-

 Summary: Add RewriteLateralSubquery rule into nonExcludableRules
 Key: SPARK-36979
 URL: https://issues.apache.org/jira/browse/SPARK-36979
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: XiDuo You


Lateral Join has no meaning without rule `RewriteLateralSubquery`. So now if we 
set 
`spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery`,
 the lateral join query will fail with:
{code:java}
java.lang.AssertionError: assertion failed: No plan for LateralJoin 
lateral-subquery#218
{code}
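
A minimal reproduction sketch (table and column names are hypothetical; the 
config value is the rule name quoted above):
{code:sql}
SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery;
-- any lateral join now fails at planning time with the assertion above
SELECT * FROM t1, LATERAL (SELECT c2 FROM t2 WHERE t2.c1 = t1.c1);
{code}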




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36979) Add RewriteLateralSubquery rule into nonExcludableRules

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36979:


Assignee: Apache Spark

> Add RewriteLateralSubquery rule into nonExcludableRules
> ---
>
> Key: SPARK-36979
> URL: https://issues.apache.org/jira/browse/SPARK-36979
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Assignee: Apache Spark
>Priority: Minor
>
> Lateral Join has no meaning without rule `RewriteLateralSubquery`. So now if 
> we set 
> `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery`,
>  the lateral join query will fail with:
> {code:java}
> java.lang.AssertionError: assertion failed: No plan for LateralJoin 
> lateral-subquery#218
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36979) Add RewriteLateralSubquery rule into nonExcludableRules

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36979:


Assignee: (was: Apache Spark)

> Add RewriteLateralSubquery rule into nonExcludableRules
> ---
>
> Key: SPARK-36979
> URL: https://issues.apache.org/jira/browse/SPARK-36979
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Minor
>
> Lateral Join has no meaning without rule `RewriteLateralSubquery`. So now if 
> we set 
> `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery`,
>  the lateral join query will fail with:
> {code:java}
> java.lang.AssertionError: assertion failed: No plan for LateralJoin 
> lateral-subquery#218
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36979) Add RewriteLateralSubquery rule into nonExcludableRules

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427412#comment-17427412
 ] 

Apache Spark commented on SPARK-36979:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/34249

> Add RewriteLateralSubquery rule into nonExcludableRules
> ---
>
> Key: SPARK-36979
> URL: https://issues.apache.org/jira/browse/SPARK-36979
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: XiDuo You
>Priority: Minor
>
> Lateral Join has no meaning without rule `RewriteLateralSubquery`. So now if 
> we set 
> `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.RewriteLateralSubquery`,
>  the lateral join query will fail with:
> {code:java}
> java.lang.AssertionError: assertion failed: No plan for LateralJoin 
> lateral-subquery#218
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427415#comment-17427415
 ] 

Apache Spark commented on SPARK-36900:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/34250

> "SPARK-36464: size returns correct positive number even with over 2GB data" 
> will oom with JDK17 
> 
>
> Key: SPARK-36900
> URL: https://issues.apache.org/jira/browse/SPARK-36900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.2.1, 3.3.0
>
>
> Execute
>  
> {code:java}
> build/mvn clean install  -pl core -am -Dtest=none 
> -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite
> {code}
> with JDK 17,
> {code:java}
> ChunkedByteBufferOutputStreamSuite:
> - empty output
> - write a single byte
> - write a single near boundary
> - write a single at boundary
> - single chunk output
> - single chunk output at boundary size
> - multiple chunk output
> - multiple chunk output at boundary size
> *** RUN ABORTED ***
>   java.lang.OutOfMemoryError: Java heap space
>   at java.base/java.lang.Integer.valueOf(Integer.java:1081)
>   at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
>   at java.base/java.io.OutputStream.write(OutputStream.java:127)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown
>  Source)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3563) Shuffle data not always be cleaned

2021-10-11 Thread wangkang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427417#comment-17427417
 ] 

wangkang commented on SPARK-3563:
-

I am hitting the same issue. A user runs a Spark Streaming job, and the shuffle and 
RDD data are not cleaned by ContextCleaner until the application ends. I printed the 
driver's jstack info and found that broadcast objects can be cleaned, but the RDD and 
shuffle data are never cleaned. I also triggered a full GC on the driver (the GC 
behaved as expected), yet ContextCleaner does not log any cleaned RDD or cleaned 
shuffle. Is there any other reason? Thanks.

> Shuffle data not always be cleaned
> --
>
> Key: SPARK-3563
> URL: https://issues.apache.org/jira/browse/SPARK-3563
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 1.0.2
>Reporter: shenh062326
>Priority: Major
>
> In our cluster, when we run a Spark Streaming job, after it has been running for many 
> hours the shuffle data does not seem to be fully cleaned. Here is the shuffle data:
> -rw-r- 1 tdwadmin users 23948 Sep 17 13:21 shuffle_132_34_0
> -rw-r- 1 tdwadmin users 18237 Sep 17 13:32 shuffle_143_22_1
> -rw-r- 1 tdwadmin users 22934 Sep 17 13:35 shuffle_146_15_0
> -rw-r- 1 tdwadmin users 27666 Sep 17 13:35 shuffle_146_36_1
> -rw-r- 1 tdwadmin users 12864 Sep 17 14:05 shuffle_176_12_0
> -rw-r- 1 tdwadmin users 22115 Sep 17 14:05 shuffle_176_33_1
> -rw-r- 1 tdwadmin users 15666 Sep 17 14:21 shuffle_192_0_1
> -rw-r- 1 tdwadmin users 13916 Sep 17 14:38 shuffle_209_53_0
> -rw-r- 1 tdwadmin users 20031 Sep 17 14:41 shuffle_212_26_0
> -rw-r- 1 tdwadmin users 15158 Sep 17 14:41 shuffle_212_47_1
> -rw-r- 1 tdwadmin users 42880 Sep 17 12:12 shuffle_63_1_1
> -rw-r- 1 tdwadmin users 32030 Sep 17 12:14 shuffle_65_40_0
> -rw-r- 1 tdwadmin users 34477 Sep 17 12:33 shuffle_84_2_1
> The shuffle data of stages 63, 65, 84, 132... is not cleaned.
> ContextCleaner maintains a weak reference for each RDD, ShuffleDependency, and 
> Broadcast of interest, to be processed when the associated object goes out of 
> scope of the application. The actual cleanup is performed in a separate daemon 
> thread.
> There must be some lingering reference to the ShuffleDependency, and it is hard 
> to find.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36900) "SPARK-36464: size returns correct positive number even with over 2GB data" will oom with JDK17

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427416#comment-17427416
 ] 

Apache Spark commented on SPARK-36900:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/34250

> "SPARK-36464: size returns correct positive number even with over 2GB data" 
> will oom with JDK17 
> 
>
> Key: SPARK-36900
> URL: https://issues.apache.org/jira/browse/SPARK-36900
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Sean R. Owen
>Priority: Minor
> Fix For: 3.2.1, 3.3.0
>
>
> Execute
>  
> {code:java}
> build/mvn clean install  -pl core -am -Dtest=none 
> -DwildcardSuites=org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite
> {code}
> with JDK 17,
> {code:java}
> ChunkedByteBufferOutputStreamSuite:
> - empty output
> - write a single byte
> - write a single near boundary
> - write a single at boundary
> - single chunk output
> - single chunk output at boundary size
> - multiple chunk output
> - multiple chunk output at boundary size
> *** RUN ABORTED ***
>   java.lang.OutOfMemoryError: Java heap space
>   at java.base/java.lang.Integer.valueOf(Integer.java:1081)
>   at scala.runtime.BoxesRunTime.boxToInteger(BoxesRunTime.java:67)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.allocateNewChunkIfNeeded(ChunkedByteBufferOutputStream.scala:87)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStream.write(ChunkedByteBufferOutputStream.scala:75)
>   at java.base/java.io.OutputStream.write(OutputStream.java:127)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite.$anonfun$new$22(ChunkedByteBufferOutputStreamSuite.scala:127)
>   at 
> org.apache.spark.util.io.ChunkedByteBufferOutputStreamSuite$$Lambda$179/0x0008011a75d8.apply(Unknown
>  Source)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36971) Query files directly with SQL is broken (with Glue)

2021-10-11 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-36971:
-
Priority: Major  (was: Critical)

> Query files directly with SQL is broken (with Glue)
> ---
>
> Key: SPARK-36971
> URL: https://issues.apache.org/jira/browse/SPARK-36971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: Databricks Runtime 9.1 and 10.0 Beta
>Reporter: Lauri Koobas
>Priority: Major
>
> This is broken in DBR 9.1 (and 10.0 Beta):
> {{    select * from json.`filename`}}
> [https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-file.html]
> I have tried with JSON and Parquet files.
> The error:
> {{Error in SQL statement: SparkException: Unable to fetch tables of db json}}
> Down in the stack trace this also exists:
> {{Caused by: NoSuchObjectException(message:Database json not found. (Service: 
> AWSGlue; Status Code: 400; Error Code: EntityNotFoundException; ... ))}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36971) Query files directly with SQL is broken (with Glue)

2021-10-11 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427419#comment-17427419
 ] 

Hyukjin Kwon commented on SPARK-36971:
--

Is this an issue in Apache Spark itself? It does not look specific to Apache Spark, 
but rather to DBR or AWS Glue.

> Query files directly with SQL is broken (with Glue)
> ---
>
> Key: SPARK-36971
> URL: https://issues.apache.org/jira/browse/SPARK-36971
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
> Environment: Databricks Runtime 9.1 and 10.0 Beta
>Reporter: Lauri Koobas
>Priority: Major
>
> This is broken in DBR 9.1 (and 10.0 Beta):
> {{    select * from json.`filename`}}
> [https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-file.html]
> I have tried with JSON and Parquet files.
> The error:
> {{Error in SQL statement: SparkException: Unable to fetch tables of db json}}
> Down in the stack trace this also exists:
> {{Caused by: NoSuchObjectException(message:Database json not found. (Service: 
> AWSGlue; Status Code: 400; Error Code: EntityNotFoundException; ... ))}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36973) Deduplicate prepare data method for HistogramPlotBase and KdePlotBase

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427425#comment-17427425
 ] 

Apache Spark commented on SPARK-36973:
--

User 'dchvn' has created a pull request for this issue:
https://github.com/apache/spark/pull/34251

> Deduplicate prepare data method for HistogramPlotBase and KdePlotBase
> -
>
> Key: SPARK-36973
> URL: https://issues.apache.org/jira/browse/SPARK-36973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36973) Deduplicate prepare data method for HistogramPlotBase and KdePlotBase

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36973:


Assignee: (was: Apache Spark)

> Deduplicate prepare data method for HistogramPlotBase and KdePlotBase
> -
>
> Key: SPARK-36973
> URL: https://issues.apache.org/jira/browse/SPARK-36973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36973) Deduplicate prepare data method for HistogramPlotBase and KdePlotBase

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36973:


Assignee: Apache Spark

> Deduplicate prepare data method for HistogramPlotBase and KdePlotBase
> -
>
> Key: SPARK-36973
> URL: https://issues.apache.org/jira/browse/SPARK-36973
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.3.0
>Reporter: dch nguyen
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36980) Insert support query with CTE

2021-10-11 Thread angerszhu (Jira)
angerszhu created SPARK-36980:
-

 Summary: Insert support query with CTE
 Key: SPARK-36980
 URL: https://issues.apache.org/jira/browse/SPARK-36980
 Project: Spark
  Issue Type: Task
  Components: SQL
Affects Versions: 3.1.2, 3.2.0
Reporter: angerszhu


INSERT INTO t_delta (WITH v1(c1) as (values (1)) select 1, 2,3 from v1);  OK
INSERT INTO t_delta WITH v1(c1) as (values (1)) select 1, 2,3 from v1; FAIL




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36980) Insert support query with CTE

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36980:


Assignee: (was: Apache Spark)

> Insert support query with CTE
> -
>
> Key: SPARK-36980
> URL: https://issues.apache.org/jira/browse/SPARK-36980
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> INSERT INTO t_delta (WITH v1(c1) as (values (1)) select 1, 2,3 from v1);  OK
> INSERT INTO t_delta WITH v1(c1) as (values (1)) select 1, 2,3 from v1; FAIL



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36980) Insert support query with CTE

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427439#comment-17427439
 ] 

Apache Spark commented on SPARK-36980:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34252

> Insert support query with CTE
> -
>
> Key: SPARK-36980
> URL: https://issues.apache.org/jira/browse/SPARK-36980
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> INSERT INTO t_delta (WITH v1(c1) as (values (1)) select 1, 2,3 from v1);  OK
> INSERT INTO t_delta WITH v1(c1) as (values (1)) select 1, 2,3 from v1; FAIL



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36980) Insert support query with CTE

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36980:


Assignee: Apache Spark

> Insert support query with CTE
> -
>
> Key: SPARK-36980
> URL: https://issues.apache.org/jira/browse/SPARK-36980
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Major
>
> INSERT INTO t_delta (WITH v1(c1) as (values (1)) select 1, 2,3 from v1);  OK
> INSERT INTO t_delta WITH v1(c1) as (values (1)) select 1, 2,3 from v1; FAIL



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36980) Insert support query with CTE

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427440#comment-17427440
 ] 

Apache Spark commented on SPARK-36980:
--

User 'AngersZh' has created a pull request for this issue:
https://github.com/apache/spark/pull/34252

> Insert support query with CTE
> -
>
> Key: SPARK-36980
> URL: https://issues.apache.org/jira/browse/SPARK-36980
> Project: Spark
>  Issue Type: Task
>  Components: SQL
>Affects Versions: 3.1.2, 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> INSERT INTO t_delta (WITH v1(c1) as (values (1)) select 1, 2,3 from v1);  OK
> INSERT INTO t_delta WITH v1(c1) as (values (1)) select 1, 2,3 from v1; FAIL



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-36981) Upgrade joda-time to 2.10.12

2021-10-11 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-36981:
--

 Summary: Upgrade joda-time to 2.10.12
 Key: SPARK-36981
 URL: https://issues.apache.org/jira/browse/SPARK-36981
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.3.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


joda-time 2.10.12 seems to support the updated TZDB.
https://github.com/JodaOrg/joda-time/compare/v2.10.10...v2.10.12
https://github.com/JodaOrg/joda-time/issues/566#issuecomment-930207547



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-36981) Upgrade joda-time to 2.10.12

2021-10-11 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-36981:
---
Description: 
joda-time 2.10.12 seems to support an updated TZDB.
https://github.com/JodaOrg/joda-time/compare/v2.10.10...v2.10.12
https://github.com/JodaOrg/joda-time/issues/566#issuecomment-930207547

  was:
joda-time 2.10.12 seems to support the updated TZDB.
https://github.com/JodaOrg/joda-time/compare/v2.10.10...v2.10.12
https://github.com/JodaOrg/joda-time/issues/566#issuecomment-930207547


> Upgrade joda-time to 2.10.12
> 
>
> Key: SPARK-36981
> URL: https://issues.apache.org/jira/browse/SPARK-36981
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> joda-time 2.10.12 seems to support an updated TZDB.
> https://github.com/JodaOrg/joda-time/compare/v2.10.10...v2.10.12
> https://github.com/JodaOrg/joda-time/issues/566#issuecomment-930207547



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36981) Upgrade joda-time to 2.10.12

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36981:


Assignee: Apache Spark  (was: Kousuke Saruta)

> Upgrade joda-time to 2.10.12
> 
>
> Key: SPARK-36981
> URL: https://issues.apache.org/jira/browse/SPARK-36981
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Apache Spark
>Priority: Minor
>
> joda-time 2.10.12 seems to support an updated TZDB.
> https://github.com/JodaOrg/joda-time/compare/v2.10.10...v2.10.12
> https://github.com/JodaOrg/joda-time/issues/566#issuecomment-930207547



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-36981) Upgrade joda-time to 2.10.12

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36981:


Assignee: Kousuke Saruta  (was: Apache Spark)

> Upgrade joda-time to 2.10.12
> 
>
> Key: SPARK-36981
> URL: https://issues.apache.org/jira/browse/SPARK-36981
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> joda-time 2.10.12 seems to support an updated TZDB.
> https://github.com/JodaOrg/joda-time/compare/v2.10.10...v2.10.12
> https://github.com/JodaOrg/joda-time/issues/566#issuecomment-930207547



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36981) Upgrade joda-time to 2.10.12

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427448#comment-17427448
 ] 

Apache Spark commented on SPARK-36981:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34253

> Upgrade joda-time to 2.10.12
> 
>
> Key: SPARK-36981
> URL: https://issues.apache.org/jira/browse/SPARK-36981
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> joda-time 2.10.12 seems to support an updated TZDB.
> https://github.com/JodaOrg/joda-time/compare/v2.10.10...v2.10.12
> https://github.com/JodaOrg/joda-time/issues/566#issuecomment-930207547



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36981) Upgrade joda-time to 2.10.12

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427449#comment-17427449
 ] 

Apache Spark commented on SPARK-36981:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/34253

> Upgrade joda-time to 2.10.12
> 
>
> Key: SPARK-36981
> URL: https://issues.apache.org/jira/browse/SPARK-36981
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.3.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> joda-time 2.10.12 seems to support an updated TZDB.
> https://github.com/JodaOrg/joda-time/compare/v2.10.10...v2.10.12
> https://github.com/JodaOrg/joda-time/issues/566#issuecomment-930207547



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-36905) Reading Hive view without explicit column names fails in Spark

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427450#comment-17427450
 ] 

Apache Spark commented on SPARK-36905:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/34254

> Reading Hive view without explicit column names fails in Spark 
> ---
>
> Key: SPARK-36905
> URL: https://issues.apache.org/jira/browse/SPARK-36905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> Consider a Hive view in which some columns are not explicitly named
> {code:sql}
> CREATE VIEW test_view AS
> SELECT 1
> FROM some_table
> {code}
> Reading this view in Spark leads to an {{AnalysisException}}
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`_c0`' given input 
> columns: [1]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:188)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:185)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:340)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:340)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp$1(QueryPlan.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:116)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:116)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:127)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:132)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:132)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:137)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:137)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:185)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:182)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(

[jira] [Assigned] (SPARK-36905) Reading Hive view without explicit column names fails in Spark

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36905:


Assignee: Apache Spark

> Reading Hive view without explicit column names fails in Spark 
> ---
>
> Key: SPARK-36905
> URL: https://issues.apache.org/jira/browse/SPARK-36905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Assignee: Apache Spark
>Priority: Major
>
> Consider a Hive view in which some columns are not explicitly named
> {code:sql}
> CREATE VIEW test_view AS
> SELECT 1
> FROM some_table
> {code}
> Reading this view in Spark leads to an {{AnalysisException}}
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`_c0`' given input 
> columns: [1]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:188)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:185)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:340)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:340)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp$1(QueryPlan.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:116)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:116)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:127)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:132)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:132)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:137)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:137)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:185)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:182)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis

[jira] [Assigned] (SPARK-36905) Reading Hive view without explicit column names fails in Spark

2021-10-11 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-36905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-36905:


Assignee: (was: Apache Spark)

> Reading Hive view without explicit column names fails in Spark 
> ---
>
> Key: SPARK-36905
> URL: https://issues.apache.org/jira/browse/SPARK-36905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> Consider a Hive view in which some columns are not explicitly named
> {code:sql}
> CREATE VIEW test_view AS
> SELECT 1
> FROM some_table
> {code}
> Reading this view in Spark leads to an {{AnalysisException}}
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`_c0`' given input 
> columns: [1]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:188)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:185)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:340)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:340)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp$1(QueryPlan.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:116)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:116)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:127)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:132)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:132)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:137)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:137)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:185)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:182)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:91)
>   at 
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:155)
>   

[jira] [Commented] (SPARK-36905) Reading Hive view without explicit column names fails in Spark

2021-10-11 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17427451#comment-17427451
 ] 

Apache Spark commented on SPARK-36905:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/34254

> Reading Hive view without explicit column names fails in Spark 
> ---
>
> Key: SPARK-36905
> URL: https://issues.apache.org/jira/browse/SPARK-36905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> Consider a Hive view in which some columns are not explicitly named
> {code:sql}
> CREATE VIEW test_view AS
> SELECT 1
> FROM some_table
> {code}
> Reading this view in Spark leads to an {{AnalysisException}}
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`_c0`' given input 
> columns: [1]
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:188)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:185)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:340)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:340)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:406)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:357)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:337)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$transformExpressionsUp$1(QueryPlan.scala:104)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$1(QueryPlan.scala:116)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpression$1(QueryPlan.scala:116)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:127)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$3(QueryPlan.scala:132)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.recursiveTransform$1(QueryPlan.scala:132)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.$anonfun$mapExpressions$4(QueryPlan.scala:137)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:242)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:137)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsUp(QueryPlan.scala:104)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:185)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:182)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:94)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(
