[jira] [Resolved] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34128.
---
Fix Version/s: 3.1.2
   3.2.0
 Assignee: Kent Yao
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/31895

>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.2.0, 3.1.2
>
>
> Since Spark 3.0, `libthrift` has been bumped from 0.9.3 to 0.12.0.
> Due to THRIFT-4805, the Spark Thrift Server prints annoying TTransportExceptions.
> For example, the current thrift server module test in the GitHub Actions workflow 
> outputs more than 200MB of data for this error alone, while the total size of 
> the test log is only about 1GB.
>  
> I checked the latest `hive-service-rpc` module in Maven Central, 
> [https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2], 
> and it still uses the 0.9.3 version. 
>  
> Due to THRIFT-5274, it looks like we need to wait for thrift 0.14.0 to be 
> released, or downgrade to 0.9.3, whichever is appropriate, to fix this issue.
>  
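Until the fix is in place, one way an operator could mute this noise is to raise the level of the logger that emits the exceptions. A minimal sketch, assuming log4j 1.x (bundled with Spark 3.0/3.1); the logger name below is an assumption, not taken from the ticket, and should be checked against the actual log output:

{code:java}
// Hypothetical workaround: silence the logger assumed to emit the
// TTransportException noise; verify the logger name in your own logs first.
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.thrift.server.TThreadPoolServer").setLevel(Level.FATAL)
{code}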



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34128:
--
Issue Type: Bug  (was: Improvement)

>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> Since Spark 3.0, `libthrift` has been bumped from 0.9.3 to 0.12.0.
> Due to THRIFT-4805, the Spark Thrift Server prints annoying TTransportExceptions.
> For example, the current thrift server module test in the GitHub Actions workflow 
> outputs more than 200MB of data for this error alone, while the total size of 
> the test log is only about 1GB.
>  
> I checked the latest `hive-service-rpc` module in Maven Central, 
> [https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2], 
> and it still uses the 0.9.3 version. 
>  
> Due to THRIFT-5274, it looks like we need to wait for thrift 0.14.0 to be 
> released, or downgrade to 0.9.3, whichever is appropriate, to fix this issue.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32384) repartitionAndSortWithinPartitions avoid shuffle with same partitioner

2021-03-19 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi resolved SPARK-32384.
--
Fix Version/s: 3.2.0
 Assignee: zhengruifeng
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/31480

> repartitionAndSortWithinPartitions avoid shuffle with same partitioner
> --
>
> Key: SPARK-32384
> URL: https://issues.apache.org/jira/browse/SPARK-32384
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.2.0
>
>
> In {{combineByKeyWithClassTag}}, there is a check that skips the shuffle when the 
> partitioner is the same as the RDD's existing one:
> {code:java}
> if (self.partitioner == Some(partitioner)) {
>   self.mapPartitions(iter => {
>     val context = TaskContext.get()
>     new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
>   }, preservesPartitioning = true)
> } else {
>   new ShuffledRDD[K, V, C](self, partitioner)
>     .setSerializer(serializer)
>     .setAggregator(aggregator)
>     .setMapSideCombine(mapSideCombine)
> }
> {code}
>  
> In {{repartitionAndSortWithinPartitions}}, the shuffle could likewise be skipped 
> in this case, as sketched below.
>  
>  
>  
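For illustration, a rough sketch of what such a short-circuit in {{repartitionAndSortWithinPartitions}} could look like. This is not the actual change from the linked pull request; it assumes the {{self}}, {{ordering}}, {{K}} and {{V}} members of {{OrderedRDDFunctions}} and, for brevity, sorts each partition in memory:

{code:java}
def repartitionAndSortWithinPartitions(partitioner: Partitioner): RDD[(K, V)] = {
  if (self.partitioner == Some(partitioner)) {
    // Already partitioned as requested: skip the shuffle and only sort each
    // partition. A real implementation would use a spillable sorter rather
    // than materializing the whole partition in memory.
    self.mapPartitions(iter => iter.toArray.sortBy(_._1)(ordering).iterator,
      preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, V](self, partitioner).setKeyOrdering(ordering)
  }
}
{code}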



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)

2021-03-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34776:
-
Fix Version/s: 3.0.3
   2.4.8

> Catalyst error on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 2.4.8, 3.2.0, 3.1.2, 3.0.3
>
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
> {code}
> However when I get a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> 

[jira] [Resolved] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-34796.
--
Fix Version/s: 3.2.0
 Assignee: Cheng Su
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/31892

> Codegen compilation error for query with LIMIT operator and without AQE
> ---
>
> Key: SPARK-34796
> URL: https://issues.apache.org/jira/browse/SPARK-34796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Critical
> Fix For: 3.2.0
>
>
> Example (reproduced in unit test):
>  
> {code:java}
>   test("failed limit query") {
>     withTable("left_table", "empty_right_table", "output_table") {
>       spark.range(5).toDF("k").write.saveAsTable("left_table")
>       spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
> 
>       withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
>         spark.sql("CREATE TABLE output_table (k INT) USING parquet")
>         spark.sql(
>           s"""
>              |INSERT INTO TABLE output_table
>              |SELECT t1.k FROM left_table t1
>              |JOIN empty_right_table t2
>              |ON t1.k = t2.k
>              |LIMIT 3
>              |""".stripMargin)
>       }
>     }
>   }
> {code}
> Result:
>  
> https://gist.github.com/c21/ea760c75b546d903247582be656d9d66



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33122) Remove redundant aggregates in the Optimizer

2021-03-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-33122.
--
Fix Version/s: 3.2.0
 Assignee: Tanel Kiis
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/30018

> Remove redundant aggregates in the Optimizer
> 
>
> Key: SPARK-33122
> URL: https://issues.apache.org/jira/browse/SPARK-33122
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Tanel Kiis
>Assignee: Tanel Kiis
>Priority: Major
> Fix For: 3.2.0
>
>
> It is possible to have two or more consecutive aggregates whose sole purpose 
> is to keep only distinct values (for example TPCDS q87). We can remove all 
> but the last one to improve performance.
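A minimal illustration (not taken from the ticket) of the pattern this rule targets: the outer distinct below is redundant because the inner one already returns unique rows, so the optimizer can keep a single Aggregate node:

{code:java}
// Two stacked distinct aggregates over the same column.
val q = spark.range(10).selectExpr("id % 3 AS k").distinct().distinct()
// With the rule in place, the optimized plan should contain only one Aggregate.
q.explain(true)
{code}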



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34719) fail if the view query has duplicated column names

2021-03-19 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34719:
-
Fix Version/s: 3.0.3

> fail if the view query has duplicated column names
> --
>
> Key: SPARK-34719
> URL: https://issues.apache.org/jira/browse/SPARK-34719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.0, 3.1.0, 3.1.1
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.1.2, 3.0.3
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305262#comment-17305262
 ] 

Apache Spark commented on SPARK-34776:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31904

> Catalyst error on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
> {code}
> However when I get a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> 

[jira] [Commented] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305260#comment-17305260
 ] 

Apache Spark commented on SPARK-34776:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31904

> Catalyst error on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
> {code}
> However when I get a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> 

[jira] [Assigned] (SPARK-34700) SessionCatalog's createTempView/createGlobalTempView should accept TemporaryViewRelation

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34700:


Assignee: (was: Apache Spark)

> SessionCatalog's createTempView/createGlobalTempView should accept 
> TemporaryViewRelation 
> -
>
> Key: SPARK-34700
> URL: https://issues.apache.org/jira/browse/SPARK-34700
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> SessionCatalog's createTempView/createGlobalTempView currently accept 
> LogicalPlan to store temp views, but once all the commands are migrated to 
> store `TemporaryViewRelation`, they should accept TemporaryViewRelation instead.
> Once this is done, ViewHelper.needsToUncache can remove the following safely: 
> case p => !p.sameResult(aliasedPlan)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34700) SessionCatalog's createTempView/createGlobalTempView should accept TemporaryViewRelation

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34700:


Assignee: Apache Spark

> SessionCatalog's createTempView/createGlobalTempView should accept 
> TemporaryViewRelation 
> -
>
> Key: SPARK-34700
> URL: https://issues.apache.org/jira/browse/SPARK-34700
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Assignee: Apache Spark
>Priority: Major
>
> SessionCatalog's createTempView/createGlobalTempView currently accept 
> LogicalPlan to store temp views, but once all the commands are migrated to 
> store `TemporaryViewRelation`, they should accept TemporaryViewRelation instead.
> Once this is done, ViewHelper.needsToUncache can remove the following safely: 
> case p => !p.sameResult(aliasedPlan)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34700) SessionCatalog's createTempView/createGlobalTempView should accept TemporaryViewRelation

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305231#comment-17305231
 ] 

Apache Spark commented on SPARK-34700:
--

User 'imback82' has created a pull request for this issue:
https://github.com/apache/spark/pull/31906

> SessionCatalog's createTempView/createGlobalTempView should accept 
> TemporaryViewRelation 
> -
>
> Key: SPARK-34700
> URL: https://issues.apache.org/jira/browse/SPARK-34700
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Terry Kim
>Priority: Major
>
> SessionCatalog's createTempView/createGlobalTempView currently accept 
> LogicalPlan to store temp views, but once all the commands are migrated to 
> store `TemporaryViewRelation`, they should accept TemporaryViewRelation instead.
> Once this is done, ViewHelper.needsToUncache can remove the following safely: 
> case p => !p.sameResult(aliasedPlan)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34774) The `change-scala-version.sh` script does not replace the scala.version property correctly

2021-03-19 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34774?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-34774:
-
Fix Version/s: 2.4.8

> The `change-scala-version.sh` script does not replace the scala.version 
> property correctly
> ---
>
> Key: SPARK-34774
> URL: https://issues.apache.org/jira/browse/SPARK-34774
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 2.4.8, 3.2.0, 3.1.2, 3.0.3
>
>
> Execute the following commands in order:
>  # dev/change-scala-version.sh 2.13
>  # dev/change-scala-version.sh 2.12
>  # git status
> This will generate a git diff as follows:
> {code:java}
> diff --git a/pom.xml b/pom.xml
> index ddc4ce2f68..f43d8c8f78 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -162,7 +162,7 @@
>      3.4.1
>      
>      3.2.2
> -    <scala.version>2.12.10</scala.version>
> +    <scala.version>2.13.5</scala.version>
>      <scala.binary.version>2.12</scala.binary.version>
>      2.0.0
>      --test
> {code}
> It seems the 'scala.version' property was not replaced correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34806) Helper class for batch Dataset.observe()

2021-03-19 Thread Enrico Minack (Jira)
Enrico Minack created SPARK-34806:
-

 Summary: Helper class for batch Dataset.observe()
 Key: SPARK-34806
 URL: https://issues.apache.org/jira/browse/SPARK-34806
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.2.0
Reporter: Enrico Minack


The {{observe}} method has been added to the {{Dataset}} API in 3.0.0. It 
allows collecting aggregate metrics over the data of a Dataset while it is being 
processed during an action.

These metrics are collected in a separate thread after registering a 
{{QueryExecutionListener}} for batch datasets or a {{StreamingQueryListener}} 
for streaming datasets, respectively. While in a streaming context it makes 
perfect sense to process incremental metrics in an event-based fashion, for 
simple batch dataset processing a single result should be retrievable without 
the need to register listeners or handle threading.

Introducing an {{Observation}} helper class can hide that complexity for simple 
use-cases in batch processing.

Similar to {{AccumulatorV2}} provided by {{SparkContext}} (e.g. 
{{SparkContext.LongAccumulator}}), the {{SparkSession}} can provide a method to 
create a new {{Observation}} instance and register it with the session.

Alternatively, an {{Observation}} instance could be instantiated on its own; 
calling {{Observation.on(Dataset)}} then registers it with 
{{Dataset.sparkSession}}. This "registration" adds a listener to the 
session that retrieves the metrics.

The {{Observation}} class provides methods to retrieve the metrics. This 
retrieval has to wait for the listener to be called in a separate thread. So 
all methods will wait for this, optionally with a timeout:
 - {{Observation.get}} waits without timeout and returns the metric.
 - {{Observation.option(time, unit)}} waits at most {{time}}, returns the 
metric as an {{Option}}, or {{None}} when the timeout occurs.
 - {{Observation.waitCompleted(time, unit)}} waits for the metrics and 
indicates timeout by returning {{false}}.

Obviously, an action has to be called on the observed dataset before any of 
these methods are called, otherwise a timeout will occur.

With {{Observation.reset}}, another action can be observed. Finally, 
{{Observation.close}} unregisters the listener from the session.
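A hypothetical usage sketch, based only on the method names given above ({{Observation.on}}, {{waitCompleted}}, {{get}}, {{close}}); the constructor, how the metric expressions are attached, and the exact signatures are assumptions, and the final API may well differ:

{code:java}
import java.util.concurrent.TimeUnit

// df is an existing batch Dataset; where the aggregate expressions are supplied
// is left open by this proposal, so it is omitted here.
val observation = new Observation()                      // assumed constructor
val observed = observation.on(df)                        // registers a listener with df.sparkSession
observed.collect()                                       // an action must run first, otherwise the calls below time out
if (observation.waitCompleted(10, TimeUnit.SECONDS)) {   // false signals a timeout
  println(observation.get)                               // blocks until the listener has delivered the metrics
}
observation.close()                                      // unregisters the listener from the session
{code}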



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-19 Thread Michael Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305162#comment-17305162
 ] 

Michael Chen edited comment on SPARK-34780 at 3/19/21, 8:00 PM:


I used computeStats because it was a simple way to display the problem. But any 
config that is read through the relation's spark session would have the same 
problem where the cached config is read instead of config set through SQLConf 
(or other mechanisms). For example, when building the file readers in 
DataSourceScanExec, the fsRelation.sparkSession is passed along so configs in 
these readers could be wrong.


was (Author: mikechen):
I used computeStats because it was a simple way to display the problem. But any 
config that is read through the relation's spark session would have the same 
problem where the cached config is read instead of config set through SQLConf 
(or other mechanisms)

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the spark 
> session meaning the SQLConfs are stored. Then if a different dataframe can 
> replace parts of its logical plan with a cached logical plan, the cached 
> SQLConfs will be used for the evaluation of the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-19 Thread Michael Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305162#comment-17305162
 ] 

Michael Chen edited comment on SPARK-34780 at 3/19/21, 8:00 PM:


I used computeStats because it was a simple way to display the problem. But any 
config that is read through the relation's spark session would have the same 
problem where the cached config is read instead of config set through SQLConf 
(or other mechanisms). For example, when building the file readers in 
DataSourceScanExec, relation.sparkSession is passed along so configs in these 
readers could be wrong.


was (Author: mikechen):
I used computeStats because it was a simple way to display the problem. But any 
config that is read through the relation's spark session would have the same 
problem where the cached config is read instead of config set through SQLConf 
(or other mechanisms). For example, when building the file readers in 
DataSourceScanExec, the fsRelation.sparkSession is passed along so configs in 
these readers could be wrong.

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the spark 
> session meaning the SQLConfs are stored. Then if a different dataframe can 
> replace parts of its logical plan with a cached logical plan, the cached 
> SQLConfs will be used for the evaluation of the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-19 Thread Michael Chen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305162#comment-17305162
 ] 

Michael Chen commented on SPARK-34780:
--

I used computeStats because it was a simple way to display the problem. But any 
config that is read through the relation's spark session would have the same 
problem where the cached config is read instead of config set through SQLConf 
(or other mechanisms)

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the spark 
> session meaning the SQLConfs are stored. Then if a different dataframe can 
> replace parts of its logical plan with a cached logical plan, the cached 
> SQLConfs will be used for the evaluation of the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33534) Allow specifying a minimum number of bytes in a split of a file

2021-03-19 Thread Suhas Jaladi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305154#comment-17305154
 ] 

Suhas Jaladi commented on SPARK-33534:
--

[~nielsbasjes], just checking if you have any alternate solution until 
"spark.sql.files.minPartitionBytes" is developed.

> Allow specifying a minimum number of bytes in a split of a file
> ---
>
> Key: SPARK-33534
> URL: https://issues.apache.org/jira/browse/SPARK-33534
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: Niels Basjes
>Priority: Major
>
> *Background*
>  A long time ago I wrote a way of reading a (usually large) gzipped 
> file that allows better distribution of the load over an Apache 
> Hadoop cluster: [https://github.com/nielsbasjes/splittablegzip]
> It seems people still need this kind of functionality, and it turns out my 
> code works without modification in conjunction with Apache Spark.
>  See for example:
>  - SPARK-29102
>  - [https://stackoverflow.com/q/28127119/877069]
>  - [https://stackoverflow.com/q/27531816/877069]
> So [~nchammas] provided documentation to my project a while ago on how to use 
> it with Spark.
>  [https://github.com/nielsbasjes/splittablegzip/blob/master/README-Spark.md]
> *The problem*
>  Now some people have reported getting errors from this feature of mine.
> The fact is that this functionality cannot read a split if it is too small (the 
> number of bytes read from disk and the number of bytes coming out of the 
> compression are different). So my code uses the {{io.file.buffer.size}} 
> setting but also has a hard-coded lower limit on the split size of 4 KiB.
> Now the problem I found when looking into the reports I got is that Spark 
> does not have a minimum number of bytes in a split.
> In fact, when I created a test file and then set 
> {{spark.sql.files.maxPartitionBytes}} to exactly 1 byte less than the size of 
> my test file, my library gave the error:
> {{java.lang.IllegalArgumentException: The provided InputSplit (562686;562687] 
> is 1 bytes which is too small. (Minimum is 65536)}}
> I found the code that does this calculation here 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L74
> *Proposed enhancement*
> So what I propose is to have a new setting 
> ({{spark.sql.files.minPartitionBytes}}  ?) that will guarantee that no split 
> of a file is smaller than a configured number of bytes.
> I also propose to have this set to something like 64KiB as a default.
> Having some constraints on the values of 
> {{spark.sql.files.minPartitionBytes}}, possibly in relation to 
> {{spark.sql.files.maxPartitionBytes}}, would be fine.
> *Notes*
> Hadoop already has code that does this: 
> https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L456
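For illustration only, a sketch of how per-file splitting could honour such a minimum; this is not Spark's actual {{FilePartition}} code, and the parameter names are assumptions. A split that would otherwise end up smaller than the minimum is merged into the previous one:

{code:java}
// Split one file of `fileLen` bytes into (offset, length) pairs, never emitting
// a trailing split smaller than `minSplitBytes` (unless the whole file is smaller).
def splitFile(fileLen: Long, maxSplitBytes: Long, minSplitBytes: Long): Seq[(Long, Long)] = {
  val splits = scala.collection.mutable.ArrayBuffer.empty[(Long, Long)]
  var offset = 0L
  while (offset < fileLen) {
    val remaining = fileLen - offset
    // If splitting off maxSplitBytes would leave a tail below the minimum,
    // take the whole remainder as one (slightly larger) split instead.
    val size = if (remaining < maxSplitBytes + minSplitBytes) remaining else maxSplitBytes
    splits += ((offset, size))
    offset += size
  }
  splits.toSeq
}
{code}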



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34775) Push down limit through window when partitionSpec is not empty

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34775:


Assignee: Apache Spark

> Push down limit through window when partitionSpec is not empty
> --
>
> Key: SPARK-34775
> URL: https://issues.apache.org/jira/browse/SPARK-34775
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> For example:
> {code:sql}
> SELECT *, ROW_NUMBER() OVER(PARTITION BY a ORDER BY b) AS rn FROM t LIMIT 10 
> ==>
> SELECT *, ROW_NUMBER() OVER(PARTITION BY a ORDER BY b) AS rn FROM (SELECT * 
> FROM t ORDER BY a, b LIMIT 10) tmp
> {code}
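The same query written with the DataFrame API, for anyone who wants to inspect the plan; the table and column names are placeholders, and the comment describes the behaviour the proposal aims for rather than current behaviour:

{code:java}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

val w = Window.partitionBy(col("a")).orderBy(col("b"))
// With the proposed rule, the optimized plan should show the limit pushed
// below the Window, mirroring the ORDER BY a, b LIMIT 10 subquery above.
spark.table("t").withColumn("rn", row_number().over(w)).limit(10).explain(true)
{code}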



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34775) Push down limit through window when partitionSpec is not empty

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34775:


Assignee: (was: Apache Spark)

> Push down limit through window when partitionSpec is not empty
> --
>
> Key: SPARK-34775
> URL: https://issues.apache.org/jira/browse/SPARK-34775
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:sql}
> SELECT *, ROW_NUMBER() OVER(PARTITION BY a ORDER BY b) AS rn FROM t LIMIT 10 
> ==>
> SELECT *, ROW_NUMBER() OVER(PARTITION BY a ORDER BY b) AS rn FROM (SELECT * 
> FROM t ORDER BY a, b LIMIT 10) tmp
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34775) Push down limit through window when partitionSpec is not empty

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305151#comment-17305151
 ] 

Apache Spark commented on SPARK-34775:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31904

> Push down limit through window when partitionSpec is not empty
> --
>
> Key: SPARK-34775
> URL: https://issues.apache.org/jira/browse/SPARK-34775
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:sql}
> SELECT *, ROW_NUMBER() OVER(PARTITION BY a ORDER BY b) AS rn FROM t LIMIT 10 
> ==>
> SELECT *, ROW_NUMBER() OVER(PARTITION BY a ORDER BY b) AS rn FROM (SELECT * 
> FROM t ORDER BY a, b LIMIT 10) tmp
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)

2021-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34776:
-

Assignee: L. C. Hsieh

> Catalyst error on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Assignee: L. C. Hsieh
>Priority: Major
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
> {code}
> However when I get a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 

[jira] [Resolved] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)

2021-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34776.
---
Fix Version/s: 3.1.2
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 31897
[https://github.com/apache/spark/pull/31897]

> Catalyst error on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0, 3.1.2
>
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
> {code}
> However when I get a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> 

[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-19 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305109#comment-17305109
 ] 

Chao Sun commented on SPARK-34780:
--

Thanks for reporting this, [~mikechen]; the test case you provided is very 
useful.

I'm not sure, though, how severe the issue is, since it only affects 
{{computeStats}}, and once the cache is actually materialized (e.g., via 
{{df2.count()}} after {{df2.cache()}}), the value from {{computeStats}} will be 
different anyway. Could you give more details?

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the Spark 
> session, meaning the SQLConfs are stored with it. Then, if a different dataframe 
> replaces parts of its logical plan with a cached logical plan, the cached 
> SQLConfs will be used when evaluating the cached logical plan. This is 
> because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns

2021-03-19 Thread Mark Ressler (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Ressler updated SPARK-34805:
-
Attachment: jsonMetadataTest.py

> PySpark loses metadata in DataFrame fields when selecting nested columns
> 
>
> Key: SPARK-34805
> URL: https://issues.apache.org/jira/browse/SPARK-34805
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1, 3.1.1
>Reporter: Mark Ressler
>Priority: Major
> Attachments: jsonMetadataTest.py
>
>
> For a DataFrame schema with nested StructTypes, where metadata is set for 
> fields in the schema, that metadata is lost when a DataFrame selects nested 
> fields.  For example, suppose
> {code:java}
> df.schema.fields[0].dataType.fields[0].metadata
> {code}
> returns a non-empty dictionary, then
> {code:java}
> df.select('Field0.SubField0').schema.fields[0].metadata{code}
> returns an empty dictionary, where "Field0" is the name of the first field in 
> the DataFrame and "SubField0" is the name of the first nested field under 
> "Field0".
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34805) PySpark loses metadata in DataFrame fields when selecting nested columns

2021-03-19 Thread Mark Ressler (Jira)
Mark Ressler created SPARK-34805:


 Summary: PySpark loses metadata in DataFrame fields when selecting 
nested columns
 Key: SPARK-34805
 URL: https://issues.apache.org/jira/browse/SPARK-34805
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.1, 3.0.1
Reporter: Mark Ressler


For a DataFrame schema with nested StructTypes, where metadata is set for 
fields in the schema, that metadata is lost when a DataFrame selects nested 
fields.  For example, suppose
{code:java}
df.schema.fields[0].dataType.fields[0].metadata
{code}
returns a non-empty dictionary, then
{code:java}
df.select('Field0.SubField0').schema.fields[0].metadata{code}
returns an empty dictionary, where "Field0" is the name of the first field in 
the DataFrame and "SubField0" is the name of the first nested field under 
"Field0".

 
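For reference, a minimal self-contained repro sketch (the schema, field names, 
and metadata payload below are illustrative assumptions, not taken from the 
attached jsonMetadataTest.py):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Nested schema with metadata attached to the inner field.
inner = StructType([
    StructField("SubField0", StringType(), True, metadata={"comment": "inner field"})
])
schema = StructType([StructField("Field0", inner, True)])
df = spark.createDataFrame([(("abc",),)], schema)

# Metadata is present on the nested field of the original schema...
print(df.schema.fields[0].dataType.fields[0].metadata)          # {'comment': 'inner field'}

# ...but, per this report, selecting the nested column returns an empty dict here.
print(df.select("Field0.SubField0").schema.fields[0].metadata)
{code}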



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24374) SPIP: Support Barrier Execution Mode in Apache Spark

2021-03-19 Thread Xiangrui Meng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-24374.
---
   Fix Version/s: 2.4.0
Target Version/s: 2.4.0
  Resolution: Fixed

I'm marking this epic jira as done given the major feature was implemented in 
2.4.

> SPIP: Support Barrier Execution Mode in Apache Spark
> 
>
> Key: SPARK-24374
> URL: https://issues.apache.org/jira/browse/SPARK-24374
> Project: Spark
>  Issue Type: Epic
>  Components: ML, Spark Core
>Affects Versions: 2.4.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Major
>  Labels: Hydrogen, SPIP
> Fix For: 2.4.0
>
> Attachments: SPIP_ Support Barrier Scheduling in Apache Spark.pdf
>
>
> (See details in the linked/attached SPIP doc.)
> {quote}
> The proposal here is to add a new scheduling model to Apache Spark so users 
> can properly embed distributed DL training as a Spark stage to simplify the 
> distributed training workflow. For example, Horovod uses MPI to implement 
> all-reduce to accelerate distributed TensorFlow training. The computation 
> model is different from MapReduce used by Spark. In Spark, a task in a stage 
> doesn’t depend on any other tasks in the same stage, and hence it can be 
> scheduled independently. In MPI, all workers start at the same time and pass 
> messages around. To embed this workload in Spark, we need to introduce a new 
> scheduling model, tentatively named “barrier scheduling”, which launches 
> tasks at the same time and provides users enough information and tooling to 
> embed distributed DL training. Spark can also provide an extra layer of fault 
> tolerance in case some tasks failed in the middle, where Spark would abort 
> all tasks and restart the stage.
> {quote}
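For context, a minimal PySpark sketch of how a barrier stage is expressed with 
the API this SPIP introduced in 2.4 (the toy payload and master setting are 
illustrative; a barrier stage needs at least as many free task slots as it has 
partitions):

{code:python}
from pyspark import BarrierTaskContext
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").getOrCreate()

def worker(iterator):
    ctx = BarrierTaskContext.get()
    ctx.barrier()  # all tasks in the stage reach this point before any proceeds
    peers = [info.address for info in ctx.getTaskInfos()]  # e.g. to wire up all-reduce
    yield (ctx.partitionId(), peers)

rdd = spark.sparkContext.parallelize(range(4), numSlices=2)
print(rdd.barrier().mapPartitions(worker).collect())
{code}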



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34804) registerFunction shouldnt be logging warning message for same function being re-registered

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305067#comment-17305067
 ] 

Apache Spark commented on SPARK-34804:
--

User 'sumeetmi2' has created a pull request for this issue:
https://github.com/apache/spark/pull/31903

> registerFunction shouldnt be logging warning message for same function being 
> re-registered
> --
>
> Key: SPARK-34804
> URL: https://issues.apache.org/jira/browse/SPARK-34804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Sumeet Sharma
>Priority: Minor
>
> {code:java}
> test("function registry warning") {
>  implicit val ss = spark
>  import ss.implicits._
>  val dd = Seq((1),(2)).toDF("a")
>  val dd1 = udf((i: Int) => i*2)
>  (1 to 4).foreach {
>_ =>
>  dd.sparkSession.udf.register("function", dd1)
>  Thread.sleep(1000)
>  }
>  dd.withColumn("aa", expr("function(a)")).show(10)
> }{code}
> logs:
> 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
> some spark core configurations may not take effect.
> 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
> the static sql configurations will not take effect.
>  21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing 
> SparkSession; some spark core configurations may not take effect.
>  21/03/19 22:39:43 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
>  21/03/19 22:39:44 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
>  21/03/19 22:39:45 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
> +---+---+
> | a| aa|
> +---+---+
> | 1| 2|
> | 2| 4|
> +---+---+
> Basically in the FunctionRegistry implementation
> {code:java}
> override def registerFunction(
>  name: FunctionIdentifier,
>  info: ExpressionInfo,
>  builder: FunctionBuilder): Unit = synchronized {
>  val normalizedName = normalizeFuncName(name)
>  val newFunction = (info, builder)
>  functionBuilders.put(normalizedName, newFunction) match {
>  case Some(previousFunction) if previousFunction != newFunction =>
>  logWarning(s"The function $normalizedName replaced a previously registered 
> function.")
>  case _ =>
>  }
> }{code}
> The *previousFunction != newFunction* equality comparison is incorrect



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34804) registerFunction shouldnt be logging warning message for same function being re-registered

2021-03-19 Thread Sumeet Sharma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305066#comment-17305066
 ] 

Sumeet Sharma commented on SPARK-34804:
---

Proposed changes: [https://github.com/apache/spark/pull/31903] 

> registerFunction shouldnt be logging warning message for same function being 
> re-registered
> --
>
> Key: SPARK-34804
> URL: https://issues.apache.org/jira/browse/SPARK-34804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Sumeet Sharma
>Priority: Minor
>
> {code:java}
> test("function registry warning") {
>  implicit val ss = spark
>  import ss.implicits._
>  val dd = Seq((1),(2)).toDF("a")
>  val dd1 = udf((i: Int) => i*2)
>  (1 to 4).foreach {
>_ =>
>  dd.sparkSession.udf.register("function", dd1)
>  Thread.sleep(1000)
>  }
>  dd.withColumn("aa", expr("function(a)")).show(10)
> }{code}
> logs:
> 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
> some spark core configurations may not take effect.
> 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
> the static sql configurations will not take effect.
>  21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing 
> SparkSession; some spark core configurations may not take effect.
>  21/03/19 22:39:43 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
>  21/03/19 22:39:44 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
>  21/03/19 22:39:45 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
> +---+---+
> | a| aa|
> +---+---+
> | 1| 2|
> | 2| 4|
> +---+---+
> Basically in the FunctionRegistry implementation
> {code:java}
> override def registerFunction(
>  name: FunctionIdentifier,
>  info: ExpressionInfo,
>  builder: FunctionBuilder): Unit = synchronized {
>  val normalizedName = normalizeFuncName(name)
>  val newFunction = (info, builder)
>  functionBuilders.put(normalizedName, newFunction) match {
>  case Some(previousFunction) if previousFunction != newFunction =>
>  logWarning(s"The function $normalizedName replaced a previously registered 
> function.")
>  case _ =>
>  }
> }{code}
> The *previousFunction != newFunction* equality comparison is incorrect



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34804) registerFunction shouldnt be logging warning message for same function being re-registered

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305065#comment-17305065
 ] 

Apache Spark commented on SPARK-34804:
--

User 'sumeetmi2' has created a pull request for this issue:
https://github.com/apache/spark/pull/31903

> registerFunction shouldnt be logging warning message for same function being 
> re-registered
> --
>
> Key: SPARK-34804
> URL: https://issues.apache.org/jira/browse/SPARK-34804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Sumeet Sharma
>Priority: Minor
>
> {code:java}
> test("function registry warning") {
>  implicit val ss = spark
>  import ss.implicits._
>  val dd = Seq((1),(2)).toDF("a")
>  val dd1 = udf((i: Int) => i*2)
>  (1 to 4).foreach {
>_ =>
>  dd.sparkSession.udf.register("function", dd1)
>  Thread.sleep(1000)
>  }
>  dd.withColumn("aa", expr("function(a)")).show(10)
> }{code}
> logs:
> 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
> some spark core configurations may not take effect.
> 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
> the static sql configurations will not take effect.
>  21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing 
> SparkSession; some spark core configurations may not take effect.
>  21/03/19 22:39:43 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
>  21/03/19 22:39:44 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
>  21/03/19 22:39:45 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
> +---+---+
> | a| aa|
> +---+---+
> | 1| 2|
> | 2| 4|
> +---+---+
> Basically in the FunctionRegistry implementation
> {code:java}
> override def registerFunction(
>  name: FunctionIdentifier,
>  info: ExpressionInfo,
>  builder: FunctionBuilder): Unit = synchronized {
>  val normalizedName = normalizeFuncName(name)
>  val newFunction = (info, builder)
>  functionBuilders.put(normalizedName, newFunction) match {
>  case Some(previousFunction) if previousFunction != newFunction =>
>  logWarning(s"The function $normalizedName replaced a previously registered 
> function.")
>  case _ =>
>  }
> }{code}
> The *previousFunction != newFunction* equality comparison is incorrect



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34804) registerFunction shouldnt be logging warning message for same function being re-registered

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34804:


Assignee: Apache Spark

> registerFunction shouldnt be logging warning message for same function being 
> re-registered
> --
>
> Key: SPARK-34804
> URL: https://issues.apache.org/jira/browse/SPARK-34804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Sumeet Sharma
>Assignee: Apache Spark
>Priority: Minor
>
> {code:java}
> test("function registry warning") {
>  implicit val ss = spark
>  import ss.implicits._
>  val dd = Seq((1),(2)).toDF("a")
>  val dd1 = udf((i: Int) => i*2)
>  (1 to 4).foreach {
>_ =>
>  dd.sparkSession.udf.register("function", dd1)
>  Thread.sleep(1000)
>  }
>  dd.withColumn("aa", expr("function(a)")).show(10)
> }{code}
> logs:
> 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
> some spark core configurations may not take effect.
> 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
> the static sql configurations will not take effect.
>  21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing 
> SparkSession; some spark core configurations may not take effect.
>  21/03/19 22:39:43 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
>  21/03/19 22:39:44 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
>  21/03/19 22:39:45 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
> +---+---+
> | a| aa|
> +---+---+
> | 1| 2|
> | 2| 4|
> +---+---+
> Basically in the FunctionRegistry implementation
> {code:java}
> override def registerFunction(
>  name: FunctionIdentifier,
>  info: ExpressionInfo,
>  builder: FunctionBuilder): Unit = synchronized {
>  val normalizedName = normalizeFuncName(name)
>  val newFunction = (info, builder)
>  functionBuilders.put(normalizedName, newFunction) match {
>  case Some(previousFunction) if previousFunction != newFunction =>
>  logWarning(s"The function $normalizedName replaced a previously registered 
> function.")
>  case _ =>
>  }
> }{code}
> The *previousFunction != newFunction* equality comparison is incorrect



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34804) registerFunction shouldnt be logging warning message for same function being re-registered

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34804:


Assignee: (was: Apache Spark)

> registerFunction shouldnt be logging warning message for same function being 
> re-registered
> --
>
> Key: SPARK-34804
> URL: https://issues.apache.org/jira/browse/SPARK-34804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Sumeet Sharma
>Priority: Minor
>
> {code:java}
> test("function registry warning") {
>  implicit val ss = spark
>  import ss.implicits._
>  val dd = Seq((1),(2)).toDF("a")
>  val dd1 = udf((i: Int) => i*2)
>  (1 to 4).foreach {
>_ =>
>  dd.sparkSession.udf.register("function", dd1)
>  Thread.sleep(1000)
>  }
>  dd.withColumn("aa", expr("function(a)")).show(10)
> }{code}
> logs:
> 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
> some spark core configurations may not take effect.
> 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
> the static sql configurations will not take effect.
>  21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing 
> SparkSession; some spark core configurations may not take effect.
>  21/03/19 22:39:43 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
>  21/03/19 22:39:44 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
>  21/03/19 22:39:45 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
> +---+---+
> | a| aa|
> +---+---+
> | 1| 2|
> | 2| 4|
> +---+---+
> Basically in the FunctionRegistry implementation
> {code:java}
> override def registerFunction(
>  name: FunctionIdentifier,
>  info: ExpressionInfo,
>  builder: FunctionBuilder): Unit = synchronized {
>  val normalizedName = normalizeFuncName(name)
>  val newFunction = (info, builder)
>  functionBuilders.put(normalizedName, newFunction) match {
>  case Some(previousFunction) if previousFunction != newFunction =>
>  logWarning(s"The function $normalizedName replaced a previously registered 
> function.")
>  case _ =>
>  }
> }{code}
> The *previousFunction != newFunction* equality comparison is incorrect



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34600) Support user defined types in Pandas UDF

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305038#comment-17305038
 ] 

Apache Spark commented on SPARK-34600:
--

User 'eddyxu' has created a pull request for this issue:
https://github.com/apache/spark/pull/31735

> Support user defined types in Pandas UDF
> 
>
> Key: SPARK-34600
> URL: https://issues.apache.org/jira/browse/SPARK-34600
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Lei (Eddy) Xu
>Priority: Major
>  Labels: pandas, sql, udf
>
> This is an umbrella ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34803) Util methods requiring certain versions of Pandas & PyArrow don't pass through the raised ImportError

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34803:


Assignee: (was: Apache Spark)

> Util methods requiring certain versions of Pandas & PyArrow don't pass 
> through the raised ImportError
> -
>
> Key: SPARK-34803
> URL: https://issues.apache.org/jira/browse/SPARK-34803
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: John Hany
>Priority: Major
>
> When checking that we can import either {{pandas}} or {{pyarrow}}, we 
> catch any {{ImportError}} and raise an error declaring the minimum version 
> of the respective package that's required to be in the Python environment.
> We don't, however, pass through the {{ImportError}} that might have been thrown by the 
> package itself. Take {{pandas}} as an example: when we call {{import 
> pandas}}, pandas itself might be in the environment, but can throw an 
> {{ImportError}} 
> [https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438]
>  if another package it requires isn't there. This error wouldn't be passed 
> through and we'd end up getting a misleading error message that states that 
> {{pandas}} isn't in the environment, while in fact it is but something else 
> makes us unable to import it.
> I believe this can be improved by chaining the exceptions and am happy to 
> provide said contribution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34803) Util methods requiring certain versions of Pandas & PyArrow don't pass through the raised ImportError

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34803:


Assignee: Apache Spark

> Util methods requiring certain versions of Pandas & PyArrow don't pass 
> through the raised ImportError
> -
>
> Key: SPARK-34803
> URL: https://issues.apache.org/jira/browse/SPARK-34803
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: John Hany
>Assignee: Apache Spark
>Priority: Major
>
> When checking that we can import either {{pandas}} or {{pyarrow}}, we 
> catch any {{ImportError}} and raise an error declaring the minimum version 
> of the respective package that's required to be in the Python environment.
> We don't, however, pass through the {{ImportError}} that might have been thrown by the 
> package itself. Take {{pandas}} as an example: when we call {{import 
> pandas}}, pandas itself might be in the environment, but can throw an 
> {{ImportError}} 
> [https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438]
>  if another package it requires isn't there. This error wouldn't be passed 
> through and we'd end up getting a misleading error message that states that 
> {{pandas}} isn't in the environment, while in fact it is but something else 
> makes us unable to import it.
> I believe this can be improved by chaining the exceptions and am happy to 
> provide said contribution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34803) Util methods requiring certain versions of Pandas & PyArrow don't pass through the raised ImportError

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305004#comment-17305004
 ] 

Apache Spark commented on SPARK-34803:
--

User 'johnhany97' has created a pull request for this issue:
https://github.com/apache/spark/pull/31902

> Util methods requiring certain versions of Pandas & PyArrow don't pass 
> through the raised ImportError
> -
>
> Key: SPARK-34803
> URL: https://issues.apache.org/jira/browse/SPARK-34803
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: John Hany
>Priority: Major
>
> When checking that we can import either {{pandas}} or {{pyarrow}}, we 
> catch any {{ImportError}} and raise an error declaring the minimum version 
> of the respective package that's required to be in the Python environment.
> We don't, however, pass through the {{ImportError}} that might have been thrown by the 
> package itself. Take {{pandas}} as an example: when we call {{import 
> pandas}}, pandas itself might be in the environment, but can throw an 
> {{ImportError}} 
> [https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438]
>  if another package it requires isn't there. This error wouldn't be passed 
> through and we'd end up getting a misleading error message that states that 
> {{pandas}} isn't in the environment, while in fact it is but something else 
> makes us unable to import it.
> I believe this can be improved by chaining the exceptions and am happy to 
> provide said contribution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34803) Util methods requiring certain versions of Pandas & PyArrow don't pass through the raised ImportError

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305003#comment-17305003
 ] 

Apache Spark commented on SPARK-34803:
--

User 'johnhany97' has created a pull request for this issue:
https://github.com/apache/spark/pull/31902

> Util methods requiring certain versions of Pandas & PyArrow don't pass 
> through the raised ImportError
> -
>
> Key: SPARK-34803
> URL: https://issues.apache.org/jira/browse/SPARK-34803
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: John Hany
>Priority: Major
>
> When checking that we can import either {{pandas}} or {{pyarrow}}, we 
> catch any {{ImportError}} and raise an error declaring the minimum version 
> of the respective package that's required to be in the Python environment.
> We don't, however, pass through the {{ImportError}} that might have been thrown by the 
> package itself. Take {{pandas}} as an example: when we call {{import 
> pandas}}, pandas itself might be in the environment, but can throw an 
> {{ImportError}} 
> [https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438]
>  if another package it requires isn't there. This error wouldn't be passed 
> through and we'd end up getting a misleading error message that states that 
> {{pandas}} isn't in the environment, while in fact it is but something else 
> makes us unable to import it.
> I believe this can be improved by chaining the exceptions and am happy to 
> provide said contribution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34803) Util methods requiring certain versions of Pandas & PyArrow don't pass through the raised ImportError

2021-03-19 Thread John Hany (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17305001#comment-17305001
 ] 

John Hany commented on SPARK-34803:
---

I've gone ahead and opened https://github.com/apache/spark/pull/31902

> Util methods requiring certain versions of Pandas & PyArrow don't pass 
> through the raised ImportError
> -
>
> Key: SPARK-34803
> URL: https://issues.apache.org/jira/browse/SPARK-34803
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.1.1
>Reporter: John Hany
>Priority: Major
>
> When checking that we can import either {{pandas}} or {{pyarrow}}, we 
> catch any {{ImportError}} and raise an error declaring the minimum version 
> of the respective package that's required to be in the Python environment.
> We don't, however, pass through the {{ImportError}} that might have been thrown by the 
> package itself. Take {{pandas}} as an example: when we call {{import 
> pandas}}, pandas itself might be in the environment, but can throw an 
> {{ImportError}} 
> [https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438]
>  if another package it requires isn't there. This error wouldn't be passed 
> through and we'd end up getting a misleading error message that states that 
> {{pandas}} isn't in the environment, while in fact it is but something else 
> makes us unable to import it.
> I believe this can be improved by chaining the exceptions and am happy to 
> provide said contribution.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34804) registerFunction shouldnt be logging warning message for same function being re-registered

2021-03-19 Thread Sumeet Sharma (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34804?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sumeet Sharma updated SPARK-34804:
--
Description: 
{code:java}
test("function registry warning") {
 implicit val ss = spark
 import ss.implicits._
 val dd = Seq((1),(2)).toDF("a")
 val dd1 = udf((i: Int) => i*2)
 (1 to 4).foreach {
   _ =>
 dd.sparkSession.udf.register("function", dd1)
 Thread.sleep(1000)
 }
 dd.withColumn("aa", expr("function(a)")).show(10)
}{code}
logs:

21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
some spark core configurations may not take effect.

21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
the static sql configurations will not take effect.
 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
some spark core configurations may not take effect.
 21/03/19 22:39:43 WARN SimpleFunctionRegistry : The function function replaced 
a previously registered function.
 21/03/19 22:39:44 WARN SimpleFunctionRegistry : The function function replaced 
a previously registered function.
 21/03/19 22:39:45 WARN SimpleFunctionRegistry : The function function replaced 
a previously registered function.
+---+---+
| a| aa|
+---+---+
| 1| 2|
| 2| 4|
+---+---+


Basically in the FunctionRegistry implementation
{code:java}
override def registerFunction(
 name: FunctionIdentifier,
 info: ExpressionInfo,
 builder: FunctionBuilder): Unit = synchronized {
 val normalizedName = normalizeFuncName(name)
 val newFunction = (info, builder)
 functionBuilders.put(normalizedName, newFunction) match {
 case Some(previousFunction) if previousFunction != newFunction =>
 logWarning(s"The function $normalizedName replaced a previously registered 
function.")
 case _ =>
 }
}{code}
The *previousFunction != newFunction* equality comparison is incorrect: each 
call to {{register}} builds a fresh {{(ExpressionInfo, FunctionBuilder)}} pair, 
so the tuples never compare equal and the warning fires even when the same 
function is re-registered.

  was:
{code:java}
test("function registry warning") {
 implicit val ss = spark
 import ss.implicits._
 val dd = Seq((1),(2)).toDF("a")
 val dd1 = udf((i: Int) => i*2)
 (1 to 4).foreach {
   _ =>
 dd.sparkSession.udf.register("function", dd1)
 Thread.sleep(1000)
 }
 dd.withColumn("aa", expr("function(a)")).show(10)
}{code}

logs:

21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
some spark core configurations may not take effect.

21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
the static sql configurations will not take effect.
21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
some spark core configurations may not take effect.
21/03/19 22:39:43 WARN SimpleFunctionRegistry : The function function replaced 
a previously registered function.
21/03/19 22:39:44 WARN SimpleFunctionRegistry : The function function replaced 
a previously registered function.
21/03/19 22:39:45 WARN SimpleFunctionRegistry : The function function replaced 
a previously registered function.
+---+---+
| a| aa|
+---+---+
| 1| 2|
| 2| 4|
+---+---+


> registerFunction shouldnt be logging warning message for same function being 
> re-registered
> --
>
> Key: SPARK-34804
> URL: https://issues.apache.org/jira/browse/SPARK-34804
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Sumeet Sharma
>Priority: Minor
>
> {code:java}
> test("function registry warning") {
>  implicit val ss = spark
>  import ss.implicits._
>  val dd = Seq((1),(2)).toDF("a")
>  val dd1 = udf((i: Int) => i*2)
>  (1 to 4).foreach {
>_ =>
>  dd.sparkSession.udf.register("function", dd1)
>  Thread.sleep(1000)
>  }
>  dd.withColumn("aa", expr("function(a)")).show(10)
> }{code}
> logs:
> 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
> some spark core configurations may not take effect.
> 21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
> the static sql configurations will not take effect.
>  21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing 
> SparkSession; some spark core configurations may not take effect.
>  21/03/19 22:39:43 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
>  21/03/19 22:39:44 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
>  21/03/19 22:39:45 WARN SimpleFunctionRegistry : The function function 
> replaced a previously registered function.
> +---+---+
> | a| aa|
> +---+---+
> | 1| 2|
> | 2| 4|
> +---+---+
> Basically in the FunctionRegistry implementation
> {code:java}
> override def registerFunction(
>  name: FunctionIdentifier,
>  info: ExpressionInfo,
>  builder: FunctionBuilder): Unit = synchronized {
>  val normalizedName = normalizeFuncName(name)
>  val newFunction = 

[jira] [Commented] (SPARK-34802) Optimize Optimizer defaultBatches rule order

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304997#comment-17304997
 ] 

Apache Spark commented on SPARK-34802:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/31901

> Optimize Optimizer defaultBatches rule order
> 
>
> Key: SPARK-34802
> URL: https://issues.apache.org/jira/browse/SPARK-34802
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:scala}
> spark.sql("CREATE TABLE t1 using parquet AS SELECT id AS a FROM range(10)")
> spark.sql("CREATE TABLE t2 using parquet AS SELECT id AS b FROM range(10)")
> spark.sql("SELECT * FROM t1 INNER JOIN t2 ON (a = b AND true)").explain
> {code}
> {noformat}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicates 
> ===
> !Join Inner, ((a#4L = b#5L) AND true)   Join Inner, (a#4L = b#5L)
> !:- Relation default.t1[a#4L] parquet   :- Filter true
> !+- Relation default.t2[b#5L] parquet   :  +- Relation default.t1[a#4L] 
> parquet
> !   +- Relation default.t2[b#5L] parquet
> {noformat}
> BooleanSimplification should run before PushDownPredicates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34802) Optimize Optimizer defaultBatches rule order

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34802:


Assignee: (was: Apache Spark)

> Optimize Optimizer defaultBatches rule order
> 
>
> Key: SPARK-34802
> URL: https://issues.apache.org/jira/browse/SPARK-34802
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> For example:
> {code:scala}
> spark.sql("CREATE TABLE t1 using parquet AS SELECT id AS a FROM range(10)")
> spark.sql("CREATE TABLE t2 using parquet AS SELECT id AS b FROM range(10)")
> spark.sql("SELECT * FROM t1 INNER JOIN t2 ON (a = b AND true)").explain
> {code}
> {noformat}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicates 
> ===
> !Join Inner, ((a#4L = b#5L) AND true)   Join Inner, (a#4L = b#5L)
> !:- Relation default.t1[a#4L] parquet   :- Filter true
> !+- Relation default.t2[b#5L] parquet   :  +- Relation default.t1[a#4L] 
> parquet
> !   +- Relation default.t2[b#5L] parquet
> {noformat}
> BooleanSimplification should run before PushDownPredicates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34802) Optimize Optimizer defaultBatches rule order

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34802:


Assignee: Apache Spark

> Optimize Optimizer defaultBatches rule order
> 
>
> Key: SPARK-34802
> URL: https://issues.apache.org/jira/browse/SPARK-34802
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> For example:
> {code:scala}
> spark.sql("CREATE TABLE t1 using parquet AS SELECT id AS a FROM range(10)")
> spark.sql("CREATE TABLE t2 using parquet AS SELECT id AS b FROM range(10)")
> spark.sql("SELECT * FROM t1 INNER JOIN t2 ON (a = b AND true)").explain
> {code}
> {noformat}
> === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicates 
> ===
> !Join Inner, ((a#4L = b#5L) AND true)   Join Inner, (a#4L = b#5L)
> !:- Relation default.t1[a#4L] parquet   :- Filter true
> !+- Relation default.t2[b#5L] parquet   :  +- Relation default.t1[a#4L] 
> parquet
> !   +- Relation default.t2[b#5L] parquet
> {noformat}
> BooleanSimplification should run before PushDownPredicates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34804) registerFunction shouldnt be logging warning message for same function being re-registered

2021-03-19 Thread Sumeet Sharma (Jira)
Sumeet Sharma created SPARK-34804:
-

 Summary: registerFunction shouldnt be logging warning message for 
same function being re-registered
 Key: SPARK-34804
 URL: https://issues.apache.org/jira/browse/SPARK-34804
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: Sumeet Sharma


{code:java}
test("function registry warning") {
 implicit val ss = spark
 import ss.implicits._
 val dd = Seq((1),(2)).toDF("a")
 val dd1 = udf((i: Int) => i*2)
 (1 to 4).foreach {
   _ =>
 dd.sparkSession.udf.register("function", dd1)
 Thread.sleep(1000)
 }
 dd.withColumn("aa", expr("function(a)")).show(10)
}{code}

logs:

21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
some spark core configurations may not take effect.

21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
the static sql configurations will not take effect.
21/03/19 22:39:39 WARN SparkSession$Builder : Using an existing SparkSession; 
some spark core configurations may not take effect.
21/03/19 22:39:43 WARN SimpleFunctionRegistry : The function function replaced 
a previously registered function.
21/03/19 22:39:44 WARN SimpleFunctionRegistry : The function function replaced 
a previously registered function.
21/03/19 22:39:45 WARN SimpleFunctionRegistry : The function function replaced 
a previously registered function.
+---+---+
| a| aa|
+---+---+
| 1| 2|
| 2| 4|
+---+---+



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34803) Util methods requiring certain versions of Pandas & PyArrow don't pass through the raised ImportError

2021-03-19 Thread John Hany (Jira)
John Hany created SPARK-34803:
-

 Summary: Util methods requiring certain versions of Pandas & 
PyArrow don't pass through the raised ImportError
 Key: SPARK-34803
 URL: https://issues.apache.org/jira/browse/SPARK-34803
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.1.1
Reporter: John Hany


When checking that we can import either {{pandas}} or {{pyarrow}}, we 
catch any {{ImportError}} and raise an error declaring the minimum version of 
the respective package that's required to be in the Python environment.

We don't, however, pass through the {{ImportError}} that might have been thrown by the 
package itself. Take {{pandas}} as an example: when we call {{import pandas}}, 
pandas itself might be in the environment, but can throw an {{ImportError}} 
[https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438]
 if another package it requires isn't there. This error wouldn't be passed 
through and we'd end up getting a misleading error message that states that 
{{pandas}} isn't in the environment, while in fact it is but something else 
makes us unable to import it.

I believe this can be improved by chaining the exceptions and am happy to 
provide said contribution.
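A minimal sketch of the proposed chaining, using a simplified stand-in for 
PySpark's version-check helper (the function body and version string below are 
illustrative, not the actual util code):

{code:python}
def require_minimum_pandas_version():
    # Simplified stand-in; the real helper also verifies the installed version.
    minimum_pandas_version = "0.23.2"
    try:
        import pandas  # noqa: F401
    except ImportError as error:
        # `raise ... from error` keeps the original ImportError (e.g. a missing
        # transitive dependency) attached as __cause__ instead of masking it.
        raise ImportError(
            "Pandas >= %s must be installed; however, it was not found."
            % minimum_pandas_version
        ) from error
{code}

With the chaining in place, the "it was not found" message still appears, but 
the traceback also shows the underlying cause when pandas is present yet fails 
to import.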



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34783) Support remote template files via spark.jars

2021-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34783:
-

Assignee: Dongjoon Hyun

> Support remote template files via spark.jars
> 
>
> Key: SPARK-34783
> URL: https://issues.apache.org/jira/browse/SPARK-34783
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34783) Support remote template files

2021-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34783:
--
Summary: Support remote template files  (was: Support remote template files 
via spark.jars)

> Support remote template files
> -
>
> Key: SPARK-34783
> URL: https://issues.apache.org/jira/browse/SPARK-34783
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34783) Support remote template files via spark.jars

2021-03-19 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34783?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34783.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31877
[https://github.com/apache/spark/pull/31877]

> Support remote template files via spark.jars
> 
>
> Key: SPARK-34783
> URL: https://issues.apache.org/jira/browse/SPARK-34783
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34802) Optimize Optimizer defaultBatches rule order

2021-03-19 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34802:
---

 Summary: Optimize Optimizer defaultBatches rule order
 Key: SPARK-34802
 URL: https://issues.apache.org/jira/browse/SPARK-34802
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


For example:
{code:scala}
spark.sql("CREATE TABLE t1 using parquet AS SELECT id AS a FROM range(10)")
spark.sql("CREATE TABLE t2 using parquet AS SELECT id AS b FROM range(10)")
spark.sql("SELECT * FROM t1 INNER JOIN t2 ON (a = b AND true)").explain
{code}
{noformat}
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicates ===
!Join Inner, ((a#4L = b#5L) AND true)   Join Inner, (a#4L = b#5L)
!:- Relation default.t1[a#4L] parquet   :- Filter true
!+- Relation default.t2[b#5L] parquet   :  +- Relation default.t1[a#4L] parquet
!   +- Relation default.t2[b#5L] parquet
{noformat}


BooleanSimplification should run before PushDownPredicates, so that the constant 
{{AND true}} is simplified away before predicate pushdown instead of being 
pushed down as a redundant {{Filter true}} (as in the plan above).




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34793) Prohibit saving of day-time and year-month intervals

2021-03-19 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-34793.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31884
[https://github.com/apache/spark/pull/31884]

> Prohibit saving of day-time and year-month intervals
> 
>
> Key: SPARK-34793
> URL: https://issues.apache.org/jira/browse/SPARK-34793
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Temporarily prohibit saving of year-month and day-time intervals to data sources 
> until they are supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34798) Fix incorrect join condition

2021-03-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-34798.
-
Fix Version/s: 3.0.3
   3.1.2
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 31890
[https://github.com/apache/spark/pull/31890]

> Fix incorrect join condition
> 
>
> Key: SPARK-34798
> URL: https://issues.apache.org/jira/browse/SPARK-34798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Hongyi Zhang
>Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> [https://github.com/apache/spark/blob/5988d2846450f8ec72fefcd2067f12819463cc4b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala#L150]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L77]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L89]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L91]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34798) Fix incorrect join condition

2021-03-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-34798:
---

Assignee: Hongyi Zhang

> Fix incorrect join condition
> 
>
> Key: SPARK-34798
> URL: https://issues.apache.org/jira/browse/SPARK-34798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Hongyi Zhang
>Priority: Major
>
> [https://github.com/apache/spark/blob/5988d2846450f8ec72fefcd2067f12819463cc4b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala#L150]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L77]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L89]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L91]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27790) Support ANSI SQL INTERVAL types

2021-03-19 Thread Max Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304965#comment-17304965
 ] 

Max Gekk commented on SPARK-27790:
--

> how will DateTime addition work with the new intervals?  Something like this?

[~mrpowers] You are right. We should add such functions (constructors) for both 
new types. The arithmetic operations are already supported by 
https://github.com/apache/spark/pull/31832 and 
https://github.com/apache/spark/pull/31855. Regarding the existing function 
make_interval(), maybe we could add a rule (under a SQL flag) which splits it 
into make_year_month_interval() and make_date_time_interval().

> Does ANSI SQL allow operations on dates using the YEAR-MONTH interval type?

[~simeons] As far as I know, yes, it does. This is already supported by 
https://github.com/apache/spark/pull/31812

> Support ANSI SQL INTERVAL types
> ---
>
> Key: SPARK-27790
> URL: https://issues.apache.org/jira/browse/SPARK-27790
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Spark has an INTERVAL data type, but it is “broken”:
> # It cannot be persisted
> # It is not comparable because it crosses the month-day line. That is, there 
> is no telling whether “1 Month 1 Day” is equal to “1 Month 1 Day” since not 
> all months have the same number of days.
> I propose here to introduce the two flavours of INTERVAL as described in the 
> ANSI SQL Standard and deprecate Spark's interval type.
> * ANSI describes two non overlapping “classes”: 
> ** YEAR-MONTH, 
> ** DAY-SECOND ranges
> * Members within each class can be compared and sorted.
> * Supports datetime arithmetic
> * Can be persisted.
> The old and new flavors of INTERVAL can coexist until Spark INTERVAL is 
> eventually retired. Also any semantic “breakage” can be controlled via legacy 
> config settings. 
> *Milestone 1* -- Spark Interval equivalency (the new interval types meet 
> or exceed all functionality of the existing SQL Interval):
> * Add two new DataType implementations for interval year-month and 
> day-second. Includes the JSON format and DDL string.
> * Infra support: check the caller sides of DateType/TimestampType
> * Support the two new interval types in Dataset/UDF.
> * Interval literals (with a legacy config to still allow mixed year-month 
> day-seconds fields and return legacy interval values)
> * Interval arithmetic(interval * num, interval / num, interval +/- interval)
> * Datetime functions/operators: Datetime - Datetime (to days or day second), 
> Datetime +/- interval
> * Cast to and from the new two interval types, cast string to interval, cast 
> interval to string (pretty printing), with the SQL syntax to specify the types
> * Support sorting intervals.
> *Milestone 2* -- Persistence:
> * Ability to create tables of type interval
> * Ability to write to common file formats such as Parquet and JSON.
> * INSERT, SELECT, UPDATE, MERGE
> * Discovery
> *Milestone 3* --  Client support
> * JDBC support
> * Hive Thrift server
> *Milestone 4* -- PySpark and Spark R integration
> * Python UDF can take and return intervals
> * DataFrame support



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34525) Update Spark Create Table DDL Docs

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304937#comment-17304937
 ] 

Apache Spark commented on SPARK-34525:
--

User 'nikriek' has created a pull request for this issue:
https://github.com/apache/spark/pull/31899

> Update Spark Create Table DDL Docs
> --
>
> Key: SPARK-34525
> URL: https://issues.apache.org/jira/browse/SPARK-34525
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Documentation
>Affects Versions: 3.0.3
>Reporter: Miklos Christine
>Priority: Major
>  Labels: starter
>
> Within the `CREATE TABLE` docs, the `OPTIONS` and `TBLPROPERTIES`specify 
> `key=value` parameters with a `=` as the delimiter between the key value 
> pairs. 
> The `=` is optional and can be space delimited. We should document that both 
> methods are supported when defining these parameters.
>  
> One location within the current docs page that should be updated: 
> [https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html]
>  
> Code reference showing equal as an optional parameter:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L401
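To make the delimiter point above concrete, a short sketch of the two equivalent 
forms (the table names and option below are illustrative, not from the docs page):

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# With '=' between the key and the value.
spark.sql("CREATE TABLE t_eq (id INT) USING parquet OPTIONS ('compression' = 'snappy')")

# With whitespace only; the grammar treats '=' as optional.
spark.sql("CREATE TABLE t_ws (id INT) USING parquet OPTIONS ('compression' 'snappy')")
{code}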



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34525) Update Spark Create Table DDL Docs

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34525:


Assignee: Apache Spark

> Update Spark Create Table DDL Docs
> --
>
> Key: SPARK-34525
> URL: https://issues.apache.org/jira/browse/SPARK-34525
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Documentation
>Affects Versions: 3.0.3
>Reporter: Miklos Christine
>Assignee: Apache Spark
>Priority: Major
>  Labels: starter
>
> Within the `CREATE TABLE` docs, the `OPTIONS` and `TBLPROPERTIES` clauses 
> specify `key=value` parameters with `=` as the delimiter between the key and 
> the value. 
> The `=` is optional and the pair can also be space delimited. We should 
> document that both forms are supported when defining these parameters.
>  
> One location within the current docs page that should be updated: 
> [https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html]
>  
> Code reference showing equal as an optional parameter:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L401



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34525) Update Spark Create Table DDL Docs

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304936#comment-17304936
 ] 

Apache Spark commented on SPARK-34525:
--

User 'nikriek' has created a pull request for this issue:
https://github.com/apache/spark/pull/31899

> Update Spark Create Table DDL Docs
> --
>
> Key: SPARK-34525
> URL: https://issues.apache.org/jira/browse/SPARK-34525
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Documentation
>Affects Versions: 3.0.3
>Reporter: Miklos Christine
>Priority: Major
>  Labels: starter
>
> Within the `CREATE TABLE` docs, the `OPTIONS` and `TBLPROPERTIES` clauses 
> specify `key=value` parameters with `=` as the delimiter between the key and 
> the value. 
> The `=` is optional and the pair can also be space delimited. We should 
> document that both forms are supported when defining these parameters.
>  
> One location within the current docs page that should be updated: 
> [https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html]
>  
> Code reference showing equal as an optional parameter:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L401



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34525) Update Spark Create Table DDL Docs

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34525:


Assignee: (was: Apache Spark)

> Update Spark Create Table DDL Docs
> --
>
> Key: SPARK-34525
> URL: https://issues.apache.org/jira/browse/SPARK-34525
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Documentation
>Affects Versions: 3.0.3
>Reporter: Miklos Christine
>Priority: Major
>  Labels: starter
>
> Within the `CREATE TABLE` docs, the `OPTIONS` and `TBLPROPERTIES` clauses 
> specify `key=value` parameters with `=` as the delimiter between the key and 
> the value. 
> The `=` is optional and the pair can also be space delimited. We should 
> document that both forms are supported when defining these parameters.
>  
> One location within the current docs page that should be updated: 
> [https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html]
>  
> Code reference showing equal as an optional parameter:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L401



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304905#comment-17304905
 ] 

Apache Spark commented on SPARK-34790:
--

User 'hezuojiao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31898

> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, many test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.
> For example:
> After setting spark.io.encryption.enabled=true, run the following test, 
> which is in AdaptiveQueryExecSuite:
>  
> {code:java}
>   test("SPARK-33494: Do not use local shuffle reader for repartition") {
> withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
>   val df = spark.table("testData").repartition('key)
>   df.collect()
>   // local shuffle reader breaks partitioning and shouldn't be used for 
> repartition operation
>   // which is specified by users.
>   checkNumLocalShuffleReaders(df.queryExecution.executedPlan, 
> numShufflesWithoutLocalReader = 1)
> }
>   }
> {code}
>  
> I got the following error message:
> {code:java}
> 14:05:52.638 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 2.0 (TID 3) (11.240.37.88 executor driver): 
> FetchFailed(BlockManagerId(driver, 11.240.37.88, 63574, None), shuffleId=0, 
> mapIndex=0, mapId=0, reduceId=2, message=14:05:52.638 WARN 
> org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 3) 
> (11.240.37.88 executor driver): FetchFailed(BlockManagerId(driver, 
> 11.240.37.88, 63574, None), shuffleId=0, mapIndex=0, mapId=0, reduceId=2, 
> message=org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:772)
>  at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:845)
>  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at 
> java.io.BufferedInputStream.read(BufferedInputStream.java:265) at 
> java.io.DataInputStream.readInt(DataInputStream.java:387) at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
>  at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:129)
>  at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:494) at 
> scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40) 
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
>  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) 
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.run(Task.scala:131) at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)Caused by: java.io.IOException: 
> Stream is corrupted at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:200) at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:226) at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:841)
>  ... 25 more
> )
> {code}
>  
>  
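
For context, a minimal sketch of the configuration combination involved. This is
a sketch only, assuming the spark.sql.adaptive.fetchShuffleBlocksInBatch flag
present in Spark 3.x; the "Stream is corrupted" failure above is most likely the
result of batch fetching concatenating blocks that were encrypted individually,
and whether Spark should disable batch fetching automatically when encryption is
on is what this ticket is about:

{code:scala}
// Hedged sketch: the configs this ticket concerns. With I/O encryption on and
// batch fetching left at its default, AQE jobs can hit the FetchFailedException
// shown above; turning the assumed batch-fetch knob off keeps each encrypted
// block stream separate.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.adaptive.enabled", "true")
  .config("spark.io.encryption.enabled", "true")
  .config("spark.sql.adaptive.fetchShuffleBlocksInBatch", "false") // assumed knob
  .getOrCreate()

spark.range(0, 10000).repartition(10).count()
{code}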



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34790:


Assignee: Apache Spark

> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Assignee: Apache Spark
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, many test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.
> For example:
> After setting spark.io.encryption.enabled=true, run the following test, 
> which is in AdaptiveQueryExecSuite:
>  
> {code:java}
>   test("SPARK-33494: Do not use local shuffle reader for repartition") {
> withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
>   val df = spark.table("testData").repartition('key)
>   df.collect()
>   // local shuffle reader breaks partitioning and shouldn't be used for 
> repartition operation
>   // which is specified by users.
>   checkNumLocalShuffleReaders(df.queryExecution.executedPlan, 
> numShufflesWithoutLocalReader = 1)
> }
>   }
> {code}
>  
> I got the following error message:
> {code:java}
> 14:05:52.638 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 2.0 (TID 3) (11.240.37.88 executor driver): 
> FetchFailed(BlockManagerId(driver, 11.240.37.88, 63574, None), shuffleId=0, 
> mapIndex=0, mapId=0, reduceId=2, message=14:05:52.638 WARN 
> org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 3) 
> (11.240.37.88 executor driver): FetchFailed(BlockManagerId(driver, 
> 11.240.37.88, 63574, None), shuffleId=0, mapIndex=0, mapId=0, reduceId=2, 
> message=org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:772)
>  at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:845)
>  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at 
> java.io.BufferedInputStream.read(BufferedInputStream.java:265) at 
> java.io.DataInputStream.readInt(DataInputStream.java:387) at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
>  at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:129)
>  at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:494) at 
> scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40) 
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
>  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) 
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.run(Task.scala:131) at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)Caused by: java.io.IOException: 
> Stream is corrupted at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:200) at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:226) at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:841)
>  ... 25 more
> )
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304900#comment-17304900
 ] 

Apache Spark commented on SPARK-34790:
--

User 'hezuojiao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31898

> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, many test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.
> For example:
> After setting spark.io.encryption.enabled=true, run the following test, 
> which is in AdaptiveQueryExecSuite:
>  
> {code:java}
>   test("SPARK-33494: Do not use local shuffle reader for repartition") {
> withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
>   val df = spark.table("testData").repartition('key)
>   df.collect()
>   // local shuffle reader breaks partitioning and shouldn't be used for 
> repartition operation
>   // which is specified by users.
>   checkNumLocalShuffleReaders(df.queryExecution.executedPlan, 
> numShufflesWithoutLocalReader = 1)
> }
>   }
> {code}
>  
> I got the following error message:
> {code:java}
> 14:05:52.638 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 2.0 (TID 3) (11.240.37.88 executor driver): 
> FetchFailed(BlockManagerId(driver, 11.240.37.88, 63574, None), shuffleId=0, 
> mapIndex=0, mapId=0, reduceId=2, message=14:05:52.638 WARN 
> org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 3) 
> (11.240.37.88 executor driver): FetchFailed(BlockManagerId(driver, 
> 11.240.37.88, 63574, None), shuffleId=0, mapIndex=0, mapId=0, reduceId=2, 
> message=org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:772)
>  at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:845)
>  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at 
> java.io.BufferedInputStream.read(BufferedInputStream.java:265) at 
> java.io.DataInputStream.readInt(DataInputStream.java:387) at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
>  at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:129)
>  at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:494) at 
> scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40) 
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
>  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) 
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.run(Task.scala:131) at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)Caused by: java.io.IOException: 
> Stream is corrupted at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:200) at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:226) at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:841)
>  ... 25 more
> )
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34790:


Assignee: (was: Apache Spark)

> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, many test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.
> For example:
> After setting spark.io.encryption.enabled=true, run the following test, 
> which is in AdaptiveQueryExecSuite:
>  
> {code:java}
>   test("SPARK-33494: Do not use local shuffle reader for repartition") {
> withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
>   val df = spark.table("testData").repartition('key)
>   df.collect()
>   // local shuffle reader breaks partitioning and shouldn't be used for 
> repartition operation
>   // which is specified by users.
>   checkNumLocalShuffleReaders(df.queryExecution.executedPlan, 
> numShufflesWithoutLocalReader = 1)
> }
>   }
> {code}
>  
> I got the following error message:
> {code:java}
> 14:05:52.638 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 2.0 (TID 3) (11.240.37.88 executor driver): 
> FetchFailed(BlockManagerId(driver, 11.240.37.88, 63574, None), shuffleId=0, 
> mapIndex=0, mapId=0, reduceId=2, message=14:05:52.638 WARN 
> org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 3) 
> (11.240.37.88 executor driver): FetchFailed(BlockManagerId(driver, 
> 11.240.37.88, 63574, None), shuffleId=0, mapIndex=0, mapId=0, reduceId=2, 
> message=org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:772)
>  at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:845)
>  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at 
> java.io.BufferedInputStream.read(BufferedInputStream.java:265) at 
> java.io.DataInputStream.readInt(DataInputStream.java:387) at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
>  at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:129)
>  at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:494) at 
> scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40) 
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
>  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) 
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.run(Task.scala:131) at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)Caused by: java.io.IOException: 
> Stream is corrupted at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:200) at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:226) at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:841)
>  ... 25 more
> )
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Created] (SPARK-34801) java.lang.NoSuchMethodException: org.apache.hadoop.hive.ql.metadata.Hive.loadPartition

2021-03-19 Thread zhaojk (Jira)
zhaojk created SPARK-34801:
--

 Summary: java.lang.NoSuchMethodException: 
org.apache.hadoop.hive.ql.metadata.Hive.loadPartition
 Key: SPARK-34801
 URL: https://issues.apache.org/jira/browse/SPARK-34801
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.2
 Environment: HDP3.1.4.0-315  spark 3.0.2
Reporter: zhaojk


Running this SQL with spark-sql: insert overwrite table zry.zjk1 
partition(etl_dt=2) select * from zry.zry; fails with the exception below.

java.lang.NoSuchMethodException: 
org.apache.hadoop.hive.ql.metadata.Hive.loadPartition(org.apache.hadoop.fs.Path,
 org.apache.hadoop.hive.ql.metadata.Table, java.util.Map, 
org.apache.hadoop.hive.ql.plan.LoadTableDesc$LoadFileType, boolean, boolean, 
boolean, boolean, boolean, java.lang.Long, int, boolean)
 at java.lang.Class.getMethod(Class.java:1786)
 at org.apache.spark.sql.hive.client.Shim.findMethod(HiveShim.scala:177)
 at 
org.apache.spark.sql.hive.client.Shim_v3_0.loadPartitionMethod$lzycompute(HiveShim.scala:1289)
 at 
org.apache.spark.sql.hive.client.Shim_v3_0.loadPartitionMethod(HiveShim.scala:1274)
 at 
org.apache.spark.sql.hive.client.Shim_v3_0.loadPartition(HiveShim.scala:1337)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$loadPartition$1(HiveClientImpl.scala:881)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:295)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:228)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:227)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:277)
 at 
org.apache.spark.sql.hive.client.HiveClientImpl.loadPartition(HiveClientImpl.scala:871)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$loadPartition$1(HiveExternalCatalog.scala:915)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:103)
 at 
org.apache.spark.sql.hive.HiveExternalCatalog.loadPartition(HiveExternalCatalog.scala:894)
 at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.loadPartition(ExternalCatalogWithListener.scala:179)
 at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.processInsert(InsertIntoHiveTable.scala:318)
 at 
org.apache.spark.sql.hive.execution.InsertIntoHiveTable.run(InsertIntoHiveTable.scala:102)
 at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
 at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:106)
 at 
org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:120)
 at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
 at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
 at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
 at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
 at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
 at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
 at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
 at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:650)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLDriver.run(SparkSQLDriver.scala:63)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.processCmd(SparkSQLCLIDriver.scala:377)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1(SparkSQLCLIDriver.scala:496)
 at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.$anonfun$processLine$1$adapted(SparkSQLCLIDriver.scala:490)
 at scala.collection.Iterator.foreach(Iterator.scala:941)
 at scala.collection.Iterator.foreach$(Iterator.scala:941)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
 at scala.collection.IterableLike.foreach(IterableLike.scala:74)
 at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
 at 
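
For context, the NoSuchMethodException above comes out of a reflective lookup:
the trace shows Spark's Hive shim (Shim.findMethod via Class.getMethod) resolving
Hive.loadPartition by its expected signature at runtime, so a Hive build that
exposes a different loadPartition signature (as the HDP build here likely does)
surfaces as NoSuchMethodException. A minimal, self-contained sketch of that
failure mode; the class and signatures below are invented for illustration and
are not Spark's actual shim code:

{code:scala}
// Hedged sketch: a reflective lookup fails when the requested signature does
// not match what the class on the classpath actually provides.
object ShimLookupSketch {
  class HiveLike {
    // The class on the classpath provides a two-parameter method...
    def loadPartition(path: String, overwrite: Boolean): Unit = ()
  }

  def main(args: Array[String]): Unit = {
    try {
      // ...but the caller asks for a three-parameter variant, as a shim
      // compiled against a different Hive version would.
      classOf[HiveLike].getMethod(
        "loadPartition", classOf[String], classOf[Boolean], classOf[Long])
    } catch {
      case e: NoSuchMethodException =>
        println(s"lookup failed, as in the report above: $e")
    }
  }
}
{code}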

[jira] [Issue Comment Deleted] (SPARK-34673) Select queries fail with Error: java.lang.IllegalArgumentException: Error: name expected at the position 10 of 'decimal(2,-2)' but '-' is found. (state=,code=0)

2021-03-19 Thread Ankit Raj Boudh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Raj Boudh updated SPARK-34673:

Comment: was deleted

(was: By the way, this issue comes from the Hive serde: 
{code:java}
Caused by: java.lang.IllegalArgumentException: Error: name expected at the 
position 10 of 'decimal(2,-2)' but '-' is found.
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:354)
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseParams(TypeInfoUtils.java:379)
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parsePrimitiveParts(TypeInfoUtils.java:518)
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.parsePrimitiveParts(TypeInfoUtils.java:533)
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory.createPrimitiveTypeInfo(TypeInfoFactory.java:136)
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory.getPrimitiveTypeInfo(TypeInfoFactory.java:109)
 at org.apache.hive.service.cli.TypeDescriptor.<init>(TypeDescriptor.java:57)
 at 
org.apache.hive.service.cli.ColumnDescriptor.<init>(ColumnDescriptor.java:53)
 at org.apache.hive.service.cli.TableSchema.<init>(TableSchema.java:52)
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$.getTableSchema(SparkExecuteStatementOperation.scala:300)
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.resultSchema$lzycompute(SparkExecuteStatementOperation.scala:68)
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.resultSchema(SparkExecuteStatementOperation.scala:63)
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getResultSetSchema(SparkExecuteStatementOperation.scala:155)
 at 
org.apache.hive.service.cli.operation.OperationManager.getOperationResultSetSchema(OperationManager.java:209)
 at 
org.apache.hive.service.cli.session.HiveSessionImpl.getResultSetMetadata(HiveSessionImpl.java:773)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
 ... 18 more{code})

> Select queries fail  with Error: java.lang.IllegalArgumentException: Error: 
> name expected at the position 10 of 'decimal(2,-2)' but '-' is found. 
> (state=,code=0)
> -
>
> Key: SPARK-34673
> URL: https://issues.apache.org/jira/browse/SPARK-34673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
>Reporter: Chetan Bhat
>Priority: Minor
> Attachments: Screenshot 2021-03-10 at 8.47.00 PM.png, Screenshot 
> 2021-03-19 at 1.33.54 PM.png
>
>
> Temporary views are created
> Select filter queries are executed on the Temporary views.
>  
> [Actual Issue] : - Select queries fail with Error: 
> java.lang.IllegalArgumentException: Error: name expected at the position 10 
> of 'decimal(2,-2)' but '-' is found. (state=,code=0)
>  
> [Expected Result] :- Select queries should succeed on Temporary views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34673) Select queries fail with Error: java.lang.IllegalArgumentException: Error: name expected at the position 10 of 'decimal(2,-2)' but '-' is found. (state=,code=0)

2021-03-19 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304760#comment-17304760
 ] 

Ankit Raj Boudh commented on SPARK-34673:
-

By the way, this issue comes from the Hive serde: 
{code:java}
Caused by: java.lang.IllegalArgumentException: Error: name expected at the 
position 10 of 'decimal(2,-2)' but '-' is found.
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:354)
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseParams(TypeInfoUtils.java:379)
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parsePrimitiveParts(TypeInfoUtils.java:518)
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.parsePrimitiveParts(TypeInfoUtils.java:533)
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory.createPrimitiveTypeInfo(TypeInfoFactory.java:136)
 at 
org.apache.hadoop.hive.serde2.typeinfo.TypeInfoFactory.getPrimitiveTypeInfo(TypeInfoFactory.java:109)
 at org.apache.hive.service.cli.TypeDescriptor.<init>(TypeDescriptor.java:57)
 at 
org.apache.hive.service.cli.ColumnDescriptor.<init>(ColumnDescriptor.java:53)
 at org.apache.hive.service.cli.TableSchema.<init>(TableSchema.java:52)
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$.getTableSchema(SparkExecuteStatementOperation.scala:300)
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.resultSchema$lzycompute(SparkExecuteStatementOperation.scala:68)
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.resultSchema(SparkExecuteStatementOperation.scala:63)
 at 
org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getResultSetSchema(SparkExecuteStatementOperation.scala:155)
 at 
org.apache.hive.service.cli.operation.OperationManager.getOperationResultSetSchema(OperationManager.java:209)
 at 
org.apache.hive.service.cli.session.HiveSessionImpl.getResultSetMetadata(HiveSessionImpl.java:773)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:498)
 at 
org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
 ... 18 more{code}

> Select queries fail  with Error: java.lang.IllegalArgumentException: Error: 
> name expected at the position 10 of 'decimal(2,-2)' but '-' is found. 
> (state=,code=0)
> -
>
> Key: SPARK-34673
> URL: https://issues.apache.org/jira/browse/SPARK-34673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
>Reporter: Chetan Bhat
>Priority: Minor
> Attachments: Screenshot 2021-03-10 at 8.47.00 PM.png, Screenshot 
> 2021-03-19 at 1.33.54 PM.png
>
>
> Temporary views are created
> Select filter queries are executed on the Temporary views.
>  
> [Actual Issue] : - Select queries fail with Error: 
> java.lang.IllegalArgumentException: Error: name expected at the position 10 
> of 'decimal(2,-2)' but '-' is found. (state=,code=0)
>  
> [Expected Result] :- Select queries should succeed on Temporary views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34673) Select queries fail with Error: java.lang.IllegalArgumentException: Error: name expected at the position 10 of 'decimal(2,-2)' but '-' is found. (state=,code=0)

2021-03-19 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304736#comment-17304736
 ] 

Ankit Raj Boudh edited comment on SPARK-34673 at 3/19/21, 9:10 AM:
---

 [~dongjoon] and [~hyukjin.kwon], I have checked this case in spark-sql and 
found that we are allowed to create a temporary view with a negative scale 
*decimal(2, -2)* and to query the data from the temporary view.

!Screenshot 2021-03-19 at 1.33.54 PM.png!

I think we should not allow creating a view with a negative scale *decimal(2, -2)*.

I have checked the master branch behaviour and found that during creation of 
the temporary view the data_type is shown as *double*.


was (Author: ankitraj):
 [~dongjoon] and [~hyukjin.kwon], I have checked this case in spark-sql and 
found that we are allowed to create a temporary view with a negative scale 
*decimal(2, -2)* and to query the data from the temporary view.

!Screenshot 2021-03-19 at 1.33.54 PM.png!

I think it's wrong.

I have checked the master branch behaviour and found that during creation of 
the temporary view the data_type is shown as *double*.

> Select queries fail  with Error: java.lang.IllegalArgumentException: Error: 
> name expected at the position 10 of 'decimal(2,-2)' but '-' is found. 
> (state=,code=0)
> -
>
> Key: SPARK-34673
> URL: https://issues.apache.org/jira/browse/SPARK-34673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
>Reporter: Chetan Bhat
>Priority: Minor
> Attachments: Screenshot 2021-03-10 at 8.47.00 PM.png, Screenshot 
> 2021-03-19 at 1.33.54 PM.png
>
>
> Temporary views are created
> Select filter queries are executed on the Temporary views.
>  
> [Actual Issue] : - Select queries fail with Error: 
> java.lang.IllegalArgumentException: Error: name expected at the position 10 
> of 'decimal(2,-2)' but '-' is found. (state=,code=0)
>  
> [Expected Result] :- Select queries should succeed on Temporary views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34673) Select queries fail with Error: java.lang.IllegalArgumentException: Error: name expected at the position 10 of 'decimal(2,-2)' but '-' is found. (state=,code=0)

2021-03-19 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304736#comment-17304736
 ] 

Ankit Raj Boudh edited comment on SPARK-34673 at 3/19/21, 9:05 AM:
---

 [~dongjoon] and [~hyukjin.kwon], I have checked this case in spark-sql and 
found that we are allowed to create a temporary view with a negative scale 
*decimal(2, -2)* and to query the data from the temporary view.

!Screenshot 2021-03-19 at 1.33.54 PM.png!

I think it's wrong.

I have checked the master branch behaviour and found that during creation of 
the temporary view the data_type is shown as *double*.


was (Author: ankitraj):
[~hyukjin.kwon], I have checked this case in spark-sql and found that we are 
allowed to create a temporary view with a negative scale *decimal(2, -2)* and 
to query the data from the temporary view.

!Screenshot 2021-03-19 at 1.33.54 PM.png!

I think it's wrong.

I have checked the master branch behaviour and found that during creation of 
the temporary view the data_type is shown as *double*.

> Select queries fail  with Error: java.lang.IllegalArgumentException: Error: 
> name expected at the position 10 of 'decimal(2,-2)' but '-' is found. 
> (state=,code=0)
> -
>
> Key: SPARK-34673
> URL: https://issues.apache.org/jira/browse/SPARK-34673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
>Reporter: Chetan Bhat
>Priority: Minor
> Attachments: Screenshot 2021-03-10 at 8.47.00 PM.png, Screenshot 
> 2021-03-19 at 1.33.54 PM.png
>
>
> Temporary views are created
> Select filter queries are executed on the Temporary views.
>  
> [Actual Issue] : - Select queries fail with Error: 
> java.lang.IllegalArgumentException: Error: name expected at the position 10 
> of 'decimal(2,-2)' but '-' is found. (state=,code=0)
>  
> [Expected Result] :- Select queries should succeed on Temporary views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34673) Select queries fail with Error: java.lang.IllegalArgumentException: Error: name expected at the position 10 of 'decimal(2,-2)' but '-' is found. (state=,code=0)

2021-03-19 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304736#comment-17304736
 ] 

Ankit Raj Boudh edited comment on SPARK-34673 at 3/19/21, 9:03 AM:
---

[~hyukjin.kwon], I have checked this case in spark-sql and found that we are 
allowed to create a temporary view with a negative scale *decimal(2, -2)* and 
to query the data from the temporary view.

!Screenshot 2021-03-19 at 1.33.54 PM.png!

I think it's wrong.

I have checked the master branch behaviour and found that during creation of 
the temporary view the data_type is shown as *double*.


was (Author: ankitraj):
[~hyukjin.kwon], I have checked this case in spark-sql and found that we are 
allowed to create a temporary view with a negative scale *decimal(2, -2)* and 
to query the data from the temporary view.

!Screenshot 2021-03-19 at 1.33.54 PM.png!

I think it's wrong.

> Select queries fail  with Error: java.lang.IllegalArgumentException: Error: 
> name expected at the position 10 of 'decimal(2,-2)' but '-' is found. 
> (state=,code=0)
> -
>
> Key: SPARK-34673
> URL: https://issues.apache.org/jira/browse/SPARK-34673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
>Reporter: Chetan Bhat
>Priority: Minor
> Attachments: Screenshot 2021-03-10 at 8.47.00 PM.png, Screenshot 
> 2021-03-19 at 1.33.54 PM.png
>
>
> Temporary views are created
> Select filter queries are executed on the Temporary views.
>  
> [Actual Issue] : - Select queries fail with Error: 
> java.lang.IllegalArgumentException: Error: name expected at the position 10 
> of 'decimal(2,-2)' but '-' is found. (state=,code=0)
>  
> [Expected Result] :- Select queries should succeed on Temporary views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34673) Select queries fail with Error: java.lang.IllegalArgumentException: Error: name expected at the position 10 of 'decimal(2,-2)' but '-' is found. (state=,code=0)

2021-03-19 Thread Ankit Raj Boudh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304736#comment-17304736
 ] 

Ankit Raj Boudh commented on SPARK-34673:
-

[~hyukjin.kwon], I have checked this case in spark-sql and found that we are 
allowed to create a temporary view with a negative scale *decimal(2, -2)* and 
to query the data from the temporary view.

!Screenshot 2021-03-19 at 1.33.54 PM.png!

I think it's wrong.
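
For context, a minimal sketch of how such a negative-scale decimal can arise
from an exponent literal. This is a sketch under the assumption that Spark 2.4
parses exponent literals as decimals while master parses them as doubles
(neither detail is stated in this ticket), run in a spark-shell where `spark`
is the SparkSession:

{code:scala}
// Hedged sketch: on Spark 2.4 the literal 12E2 is assumed to be inferred as
// decimal(2,-2), the type the Hive serde later fails to parse, while on
// master it is inferred as double, matching the observation above.
spark.sql("CREATE TEMPORARY VIEW v AS SELECT 12E2 AS c")
spark.sql("DESCRIBE v").show()
{code}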

> Select queries fail  with Error: java.lang.IllegalArgumentException: Error: 
> name expected at the position 10 of 'decimal(2,-2)' but '-' is found. 
> (state=,code=0)
> -
>
> Key: SPARK-34673
> URL: https://issues.apache.org/jira/browse/SPARK-34673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
>Reporter: Chetan Bhat
>Priority: Minor
> Attachments: Screenshot 2021-03-10 at 8.47.00 PM.png, Screenshot 
> 2021-03-19 at 1.33.54 PM.png
>
>
> Temporary views are created
> Select filter queries are executed on the Temporary views.
>  
> [Actual Issue] : - Select queries fail with Error: 
> java.lang.IllegalArgumentException: Error: name expected at the position 10 
> of 'decimal(2,-2)' but '-' is found. (state=,code=0)
>  
> [Expected Result] :- Select queries should succeed on Temporary views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34673) Select queries fail with Error: java.lang.IllegalArgumentException: Error: name expected at the position 10 of 'decimal(2,-2)' but '-' is found. (state=,code=0)

2021-03-19 Thread Ankit Raj Boudh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34673?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ankit Raj Boudh updated SPARK-34673:

Attachment: Screenshot 2021-03-19 at 1.33.54 PM.png

> Select queries fail  with Error: java.lang.IllegalArgumentException: Error: 
> name expected at the position 10 of 'decimal(2,-2)' but '-' is found. 
> (state=,code=0)
> -
>
> Key: SPARK-34673
> URL: https://issues.apache.org/jira/browse/SPARK-34673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: Spark 2.4.5
>Reporter: Chetan Bhat
>Priority: Minor
> Attachments: Screenshot 2021-03-10 at 8.47.00 PM.png, Screenshot 
> 2021-03-19 at 1.33.54 PM.png
>
>
> Temporary views are created
> Select filter queries are executed on the Temporary views.
>  
> [Actual Issue] : - Select queries fail with Error: 
> java.lang.IllegalArgumentException: Error: name expected at the position 10 
> of 'decimal(2,-2)' but '-' is found. (state=,code=0)
>  
> [Expected Result] :- Select queries should succeed on Temporary views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304711#comment-17304711
 ] 

Apache Spark commented on SPARK-34776:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31897

> Catalyst error on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Priority: Major
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
> {code}
> However when I get a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> 

[jira] [Assigned] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34776:


Assignee: (was: Apache Spark)

> Catalyst error on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Priority: Major
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
> {code}
> However when I get a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> 

[jira] [Assigned] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34776:


Assignee: Apache Spark

> Catalyst error on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Assignee: Apache Spark
>Priority: Major
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  |    |-- count: long (nullable = false)
>  |    |-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  |    |-- count: long (nullable = false)
>  |    |-- name: string (nullable = true)
> {code}
> However when I get a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 

[jira] [Commented] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304709#comment-17304709
 ] 

Apache Spark commented on SPARK-34776:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/31897
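
Until a fix lands, a possible workaround for this particular repro is sketched below. It is only a sketch and an assumption based on the ConvertToLocalRelation rule that appears in the failing stack trace; it is not the fix tracked by this ticket.

{code:java}
// Sketch of a possible session-level workaround (assumption, not the fix):
// exclude the optimizer rule that shows up in the failing stack trace, then
// re-run the failing select from the repro above.
spark.conf.set(
  "spark.sql.optimizer.excludedRules",
  "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation")

test.select($"best_name.name").show(10, false)
{code}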

> Catalyst error on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Priority: Major
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  |    |-- count: long (nullable = false)
>  |    |-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  |    |-- count: long (nullable = false)
>  |    |-- name: string (nullable = true)
> {code}
> However when I get a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> 

[jira] [Updated] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)

2021-03-19 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-34776:

Affects Version/s: 3.2.0

> Catalyst error on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Priority: Major
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  |    |-- count: long (nullable = false)
>  |    |-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  |    |-- count: long (nullable = false)
>  |    |-- name: string (nullable = true)
> {code}
> However when I get a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.<init>(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> 

[jira] [Commented] (SPARK-32899) Support submit application with user-defined cluster manager

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304689#comment-17304689
 ] 

Apache Spark commented on SPARK-32899:
--

User 'ConeyLiu' has created a pull request for this issue:
https://github.com/apache/spark/pull/31896

> Support submit application with user-defined cluster manager
> 
>
> Key: SPARK-32899
> URL: https://issues.apache.org/jira/browse/SPARK-32899
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Xianyang Liu
>Priority: Major
>
> Users can already define a custom cluster manager via the
> `ExternalClusterManager` trait. However, such an application cannot be
> submitted with `SparkSubmit`. This patch adds support for submitting
> applications with a user-defined cluster manager.
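
For context, a rough skeleton of such a cluster manager is sketched below. The class and package names are made up for illustration; as far as I recall, the `ExternalClusterManager` trait is private[spark], so a real implementation has to be compiled into the org.apache.spark namespace and registered through a META-INF/services/org.apache.spark.scheduler.ExternalClusterManager resource file.

{code:java}
// Rough sketch of a user-defined cluster manager (names are hypothetical).
// The ServiceLoader registration file must contain the fully qualified name
// of this class.
package org.apache.spark.scheduler.custom

import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{ExternalClusterManager, SchedulerBackend, TaskScheduler, TaskSchedulerImpl}
import org.apache.spark.scheduler.local.LocalSchedulerBackend

class MyClusterManager extends ExternalClusterManager {
  // Claim master URLs with a made-up "my://" scheme.
  override def canCreate(masterURL: String): Boolean = masterURL.startsWith("my://")

  override def createTaskScheduler(sc: SparkContext, masterURL: String): TaskScheduler =
    new TaskSchedulerImpl(sc)

  // For the sketch, reuse the local backend with a single core; a real
  // manager would talk to its own resource manager here.
  override def createSchedulerBackend(
      sc: SparkContext,
      scheduler: TaskScheduler,
      masterURL: String): SchedulerBackend =
    new LocalSchedulerBackend(sc.getConf, scheduler.asInstanceOf[TaskSchedulerImpl], 1)

  override def initialize(scheduler: TaskScheduler, backend: SchedulerBackend): Unit =
    scheduler.asInstanceOf[TaskSchedulerImpl].initialize(backend)
}
{code}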



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34651) Improve ZSTD support

2021-03-19 Thread Stanislav Savulchik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304676#comment-17304676
 ] 

Stanislav Savulchik edited comment on SPARK-34651 at 3/19/21, 7:18 AM:
---

[~dongjoon] I noticed that "zstd" is not supported as a short compression codec 
name for text files in Spark 3.1.1, though I can use it via its full class name 
{{org.apache.hadoop.io.compress.ZStandardCodec}}:
{code:java}
scala> 
spark.read.textFile("hdfs://path/to/file.txt").write.option("compression", 
"zstd").text("hdfs://path/to/file.txt.zst")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Known codecs 
are bzip2, deflate, uncompressed, lz4, gzip, snappy, none.
  at 
org.apache.spark.sql.catalyst.util.CompressionCodecs$.getCodecClassName(CompressionCodecs.scala:53)
  at 
org.apache.spark.sql.execution.datasources.text.TextOptions.$anonfun$compressionCodec$1(TextOptions.scala:37)
  at scala.Option.map(Option.scala:230)
  at 
org.apache.spark.sql.execution.datasources.text.TextOptions.<init>(TextOptions.scala:37)
  at 
org.apache.spark.sql.execution.datasources.text.TextOptions.<init>(TextOptions.scala:32)
  at 
org.apache.spark.sql.execution.datasources.text.TextFileFormat.prepareWrite(TextFileFormat.scala:72)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:133)
...

scala> 
spark.read.textFile("hdfs://pato/to/file.txt").write.option("compression", 
"org.apache.hadoop.io.compress.ZStandardCodec").text("hdfs://path/to/file.txt.zst")

// no exceptions{code}
Source code for the stack frame above
 
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CompressionCodecs.scala#L29]
  

Should I create a Jira issue to add a short codec name for zstd to the list?


was (Author: savulchik):
[~dongjoon] I noticed that "zstd" is not supported as a short compression codec 
name for text files in Spark 3.1.1, though I can use it via its full class name 
{{org.apache.hadoop.io.compress.ZStandardCodec}}:
{code:java}
scala> 
spark.read.textFile("hdfs://path/to/file.txt").write.option("compression", 
"zstd").text("hdfs://path/to/file.txt.zst")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Known codecs 
are bzip2, deflate, uncompressed, lz4, gzip, snappy, none.
  at 
org.apache.spark.sql.catalyst.util.CompressionCodecs$.getCodecClassName(CompressionCodecs.scala:53)
...

scala> 
spark.read.textFile("hdfs://pato/to/file.txt").write.option("compression", 
"org.apache.hadoop.io.compress.ZStandardCodec").text("hdfs://path/to/file.txt.zst")

// no exceptions{code}
Source code for the stack frame above
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CompressionCodecs.scala#L29]
  

Should I create a Jira issue to add a short codec name for zstd to the list?

> Improve ZSTD support
> 
>
> Key: SPARK-34651
> URL: https://issues.apache.org/jira/browse/SPARK-34651
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34651) Improve ZSTD support

2021-03-19 Thread Stanislav Savulchik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304676#comment-17304676
 ] 

Stanislav Savulchik commented on SPARK-34651:
-

[~dongjoon] I noticed that "zstd" is not supported as a short compression codec 
name for text files in Spark 3.1.1, though I can use it via its full class name 
{{org.apache.hadoop.io.compress.ZStandardCodec}}:
{code:java}
scala> 
spark.read.textFile("hdfs://path/to/file.txt").write.option("compression", 
"zstd").text("hdfs://path/to/file.txt.zst")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Known codecs 
are bzip2, deflate, uncompressed, lz4, gzip, snappy, none.
  at 
org.apache.spark.sql.catalyst.util.CompressionCodecs$.getCodecClassName(CompressionCodecs.scala:53)
...

scala> 
spark.read.textFile("hdfs://pato/to/file.txt").write.option("compression", 
"org.apache.hadoop.io.compress.ZStandardCodec").text("hdfs://path/to/file.txt.zst")

// no exceptions{code}
Source code for the stack frame above
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/CompressionCodecs.scala#L29]
  

Should I create a Jira issue to add a short codec name for zstd to the list?
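
If such an issue is filed, the change would presumably be a small addition to the short-name table in CompressionCodecs.scala. The sketch below assumes the table keeps its current shape; the object name ShortCodecNamesSketch is made up for illustration.

{code:java}
// Sketch (assumed shape) of the short-name table in
// org.apache.spark.sql.catalyst.util.CompressionCodecs, with a "zstd" entry
// added next to the codecs reported in the error message above.
import org.apache.hadoop.io.compress.{BZip2Codec, DeflateCodec, GzipCodec, Lz4Codec, SnappyCodec, ZStandardCodec}

object ShortCodecNamesSketch {
  val shortCompressionCodecNames: Map[String, String] = Map(
    "none" -> null,
    "uncompressed" -> null,
    "bzip2" -> classOf[BZip2Codec].getName,
    "deflate" -> classOf[DeflateCodec].getName,
    "gzip" -> classOf[GzipCodec].getName,
    "lz4" -> classOf[Lz4Codec].getName,
    "snappy" -> classOf[SnappyCodec].getName,
    "zstd" -> classOf[ZStandardCodec].getName) // the proposed addition
}
{code}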

> Improve ZSTD support
> 
>
> Key: SPARK-34651
> URL: https://issues.apache.org/jira/browse/SPARK-34651
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, SQL
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304669#comment-17304669
 ] 

Apache Spark commented on SPARK-34128:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31895
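
While waiting for a proper fix, a local mitigation is sketched below. It assumes the noisy messages are emitted under the org.apache.thrift logger hierarchy and that log4j 1.x is in use, as in Spark 3.0/3.1; it only hides the noise and is not the fix tracked here.

{code:java}
// Sketch of a local mitigation (assumption: the TExceptions are logged by
// classes under org.apache.thrift): raise that logger's level from the driver.
import org.apache.log4j.{Level, Logger}

Logger.getLogger("org.apache.thrift").setLevel(Level.FATAL)
{code}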

>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> Since Spark 3.0, `libthrift` has been bumped from 0.9.3 to 0.12.0.
> Due to THRIFT-4805, the Spark Thrift Server prints annoying TExceptions.
> For example, the current thrift server module test in the GitHub Actions
> workflow outputs more than 200MB of data for this error alone, while the
> total size of the test log is only about 1GB.
>  
> I checked the latest `hive-service-rpc` module in Maven Central,
> [https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.]
> It still uses the 0.9.3 version.
>  
> Due to THRIFT-5274, it looks like we need to wait for Thrift 0.14.0 to be
> released, or downgrade to 0.9.3, whichever is more appropriate, to fix this issue.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34128:


Assignee: (was: Apache Spark)

>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> Since Spark 3.0, `libthrift` has been bumped from 0.9.3 to 0.12.0.
> Due to THRIFT-4805, the Spark Thrift Server prints annoying TExceptions.
> For example, the current thrift server module test in the GitHub Actions
> workflow outputs more than 200MB of data for this error alone, while the
> total size of the test log is only about 1GB.
>  
> I checked the latest `hive-service-rpc` module in Maven Central,
> [https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.]
> It still uses the 0.9.3 version.
>  
> Due to THRIFT-5274, it looks like we need to wait for Thrift 0.14.0 to be
> released, or downgrade to 0.9.3, whichever is more appropriate, to fix this issue.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-03-19 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34128:


Assignee: Apache Spark

>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Assignee: Apache Spark
>Priority: Minor
>
> Since Spark 3.0, `libthrift` has been bumped from 0.9.3 to 0.12.0.
> Due to THRIFT-4805, the Spark Thrift Server prints annoying TExceptions.
> For example, the current thrift server module test in the GitHub Actions
> workflow outputs more than 200MB of data for this error alone, while the
> total size of the test log is only about 1GB.
>  
> I checked the latest `hive-service-rpc` module in Maven Central,
> [https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.]
> It still uses the 0.9.3 version.
>  
> Due to THRIFT-5274, it looks like we need to wait for Thrift 0.14.0 to be
> released, or downgrade to 0.9.3, whichever is more appropriate, to fix this issue.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34128) Suppress excessive logging of TTransportExceptions in Spark ThriftServer

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304668#comment-17304668
 ] 

Apache Spark commented on SPARK-34128:
--

User 'yaooqinn' has created a pull request for this issue:
https://github.com/apache/spark/pull/31895

>  Suppress excessive logging of TTransportExceptions in Spark ThriftServer
> -
>
> Key: SPARK-34128
> URL: https://issues.apache.org/jira/browse/SPARK-34128
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kent Yao
>Priority: Minor
>
> Since Spark 3.0, `libthrift` has been bumped from 0.9.3 to 0.12.0.
> Due to THRIFT-4805, the Spark Thrift Server prints annoying TExceptions.
> For example, the current thrift server module test in the GitHub Actions
> workflow outputs more than 200MB of data for this error alone, while the
> total size of the test log is only about 1GB.
>  
> I checked the latest `hive-service-rpc` module in Maven Central,
> [https://mvnrepository.com/artifact/org.apache.hive/hive-service-rpc/3.1.2.]
> It still uses the 0.9.3 version.
>  
> Due to THRIFT-5274, it looks like we need to wait for Thrift 0.14.0 to be
> released, or downgrade to 0.9.3, whichever is more appropriate, to fix this issue.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34719) fail if the view query has duplicated column names

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304657#comment-17304657
 ] 

Apache Spark commented on SPARK-34719:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/31894

> fail if the view query has duplicated column names
> --
>
> Key: SPARK-34719
> URL: https://issues.apache.org/jira/browse/SPARK-34719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.0, 3.1.0, 3.1.1
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.1.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34774) The `change-scala-version.sh` script does not replace the scala.version property correctly

2021-03-19 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34774?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304656#comment-17304656
 ] 

Apache Spark commented on SPARK-34774:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/31893

> The `change-scala-version.sh` script does not replace the scala.version 
> property correctly
> ---
>
> Key: SPARK-34774
> URL: https://issues.apache.org/jira/browse/SPARK-34774
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> Execute the following commands in order
>  # dev/change-scala-version.sh 2.13
>  # dev/change-scala-version.sh 2.12
>  # git status
> This generates the following git diff:
> {code:java}
> diff --git a/pom.xml b/pom.xml
> index ddc4ce2f68..f43d8c8f78 100644
> --- a/pom.xml
> +++ b/pom.xml
> @@ -162,7 +162,7 @@
>      3.4.1
>      
>      3.2.2
> -    <scala.version>2.12.10</scala.version>
> +    <scala.version>2.13.5</scala.version>
>      2.12
>      2.0.0
>      --test
> {code}
> It seems the 'scala.version' property was not replaced correctly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3

2021-03-19 Thread kondziolka9ld (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304654#comment-17304654
 ] 

kondziolka9ld commented on SPARK-34792:
---

[~dongjoon]

> This doesn't look like a bug.

That is why I submitted it as a question, not a bug.

 

> Not only Spark's code, but also there is difference between Scala 2.11 and 
> Scala 2.12.

Not exactly; there was no such difference between Spark 2.3.2 with Scala 2.11 
and Spark 2.4.7 with Scala 2.12.

 

> BTW, please send your question to d...@spark.apache.org next time.

Done

> Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
> -
>
> Key: SPARK-34792
> URL: https://issues.apache.org/jira/browse/SPARK-34792
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 3.0.1
>Reporter: kondziolka9ld
>Priority: Major
>
> Hi,
> Please consider the following difference in the results of the `randomSplit`
> method, despite using the same seed.
>  
> {code:java}
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>   /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val Array(f, s) =  
> Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-----+
> |value|
> +-----+
> |    4|
> +-----+
> scala> s.show
> +-----+
> |value|
> +-----+
> |    1|
> |    2|
> |    3|
> |    5|
> |    6|
> |    7|
> |    8|
> |    9|
> |   10|
> +-----+
> {code}
> while on Spark 3:
> {code:java}
> scala> val Array(f, s) =  
> Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-----+
> |value|
> +-----+
> |    5|
> |   10|
> +-----+
> scala> s.show
> +-----+
> |value|
> +-----+
> |    1|
> |    2|
> |    3|
> |    4|
> |    6|
> |    7|
> |    8|
> |    9|
> +-----+
> {code}
> I guess the implementation of the `sample` method changed.
> Is it possible to restore the previous behaviour?
> Thanks in advance!
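
One option that avoids depending on the internals of `sample` is to derive the split deterministically from the row contents, so it stays stable across Spark versions. This is only a sketch using the column from the repro above, and it does not reproduce the exact 2.4.7 split.

{code:java}
// Sketch: a version-stable ~30/70 split driven by a hash of the row value
// instead of Spark's internal sampler. Assumes a spark-shell session with
// implicits in scope, as in the repro above.
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val df = Seq(1, 2, 3, 4, 5, 6, 7, 8, 9, 10).toDF("value")
val bucketed = df.withColumn("_bucket", pmod(hash(col("value")), lit(10)))
val f = bucketed.filter(col("_bucket") < 3).drop("_bucket")
val s = bucketed.filter(col("_bucket") >= 3).drop("_bucket")
{code}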



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-19 Thread hezuojiao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304651#comment-17304651
 ] 

hezuojiao commented on SPARK-34790:
---

Thanks for the reply. I posted the failure message above in more detail; please 
check it.
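
For anyone hitting this before it is fixed, a possible mitigation is sketched below. It assumes the batched fetch is controlled by spark.sql.adaptive.fetchShuffleBlocksInBatch and simply trades away that optimization; it is not the fix for the incompatibility itself.

{code:java}
// Sketch of a configuration-level mitigation (assumption, not the fix):
// keep I/O encryption but turn off batched shuffle block fetching.
// spark.io.encryption.enabled is a core setting, so it must be set before
// the SparkSession is created.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.io.encryption.enabled", "true")
  .config("spark.sql.adaptive.fetchShuffleBlocksInBatch", "false")
  .getOrCreate()
{code}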

> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, many test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.
> For example:
> After setting spark.io.encryption.enabled=true, run the following test case 
> from AdaptiveQueryExecSuite:
>  
> {code:java}
>   test("SPARK-33494: Do not use local shuffle reader for repartition") {
> withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
>   val df = spark.table("testData").repartition('key)
>   df.collect()
>   // local shuffle reader breaks partitioning and shouldn't be used for 
> repartition operation
>   // which is specified by users.
>   checkNumLocalShuffleReaders(df.queryExecution.executedPlan, 
> numShufflesWithoutLocalReader = 1)
> }
>   }
> {code}
>  
> I got the following error message:
> {code:java}
> 14:05:52.638 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
> stage 2.0 (TID 3) (11.240.37.88 executor driver): 
> FetchFailed(BlockManagerId(driver, 11.240.37.88, 63574, None), shuffleId=0, 
> mapIndex=0, mapId=0, reduceId=2, message=14:05:52.638 WARN 
> org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 3) 
> (11.240.37.88 executor driver): FetchFailed(BlockManagerId(driver, 
> 11.240.37.88, 63574, None), shuffleId=0, mapIndex=0, mapId=0, reduceId=2, 
> message=org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:772)
>  at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:845)
>  at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at 
> java.io.BufferedInputStream.read(BufferedInputStream.java:265) at 
> java.io.DataInputStream.readInt(DataInputStream.java:387) at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
>  at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:129)
>  at 
> org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
>  at scala.collection.Iterator$$anon$11.next(Iterator.scala:494) at 
> scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at 
> org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40) 
> at scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
>  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) 
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
> org.apache.spark.scheduler.Task.run(Task.scala:131) at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)Caused by: java.io.IOException: 
> Stream is corrupted at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:200) at 
> net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:226) at 
> net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
> org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:841)
>  ... 25 more
> )
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To 

[jira] [Updated] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-19 Thread hezuojiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hezuojiao updated SPARK-34790:
--
Description: 
When spark.io.encryption.enabled=true is set, many test cases in 
AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
incompatible with I/O encryption.


For example:
After setting spark.io.encryption.enabled=true, run the following test case 
from AdaptiveQueryExecSuite:

 
{code:java}
  test("SPARK-33494: Do not use local shuffle reader for repartition") {
withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "true") {
  val df = spark.table("testData").repartition('key)
  df.collect()
  // local shuffle reader breaks partitioning and shouldn't be used for 
repartition operation
  // which is specified by users.
  checkNumLocalShuffleReaders(df.queryExecution.executedPlan, 
numShufflesWithoutLocalReader = 1)
}
  }

{code}
 
I got the following error message:
{code:java}
14:05:52.638 WARN org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in 
stage 2.0 (TID 3) (11.240.37.88 executor driver): 
FetchFailed(BlockManagerId(driver, 11.240.37.88, 63574, None), shuffleId=0, 
mapIndex=0, mapId=0, reduceId=2, message=14:05:52.638 WARN 
org.apache.spark.scheduler.TaskSetManager: Lost task 1.0 in stage 2.0 (TID 3) 
(11.240.37.88 executor driver): FetchFailed(BlockManagerId(driver, 
11.240.37.88, 63574, None), shuffleId=0, mapIndex=0, mapId=0, reduceId=2, 
message=org.apache.spark.shuffle.FetchFailedException: Stream is corrupted at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:772)
 at 
org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:845)
 at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) at 
java.io.BufferedInputStream.read(BufferedInputStream.java:265) at 
java.io.DataInputStream.readInt(DataInputStream.java:387) at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.readSize(UnsafeRowSerializer.scala:113)
 at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:129)
 at 
org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:494) at 
scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) at 
org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40) at 
scala.collection.Iterator$$anon$10.next(Iterator.scala:459) at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.run(Task.scala:131) at 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)Caused by: java.io.IOException: Stream 
is corrupted at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:200) at 
net.jpountz.lz4.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:226) at 
net.jpountz.lz4.LZ4BlockInputStream.read(LZ4BlockInputStream.java:157) at 
org.apache.spark.storage.BufferReleasingInputStream.read(ShuffleBlockFetcherIterator.scala:841)
 ... 25 more
)
{code}
 

 

  was:
When spark.io.encryption.enabled=true is set, many test cases in 
AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
incompatible with I/O encryption.

 


> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, many test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.
> For example:
> After set