[jira] [Updated] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-18 Thread hezuojiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hezuojiao updated SPARK-34790:
--
Description: 
When spark.io.encryption.enabled=true is set, many test cases in 
AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
incompatible with I/O encryption.
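
For context, a minimal sketch of a session that combines the two features, plus 
the obvious workaround of turning batched fetch off until this is fixed. The 
builder code is illustrative and not taken from this ticket; the configuration 
keys are the standard ones.

{code:scala}
import org.apache.spark.sql.SparkSession

// Illustrative local session combining I/O encryption with adaptive execution.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("io-encryption-batch-fetch-sketch")
  .config("spark.io.encryption.enabled", "true")            // I/O encryption on
  .config("spark.sql.adaptive.enabled", "true")             // AQE on
  // Possible workaround until the incompatibility is fixed: fetch shuffle
  // blocks one by one instead of in contiguous batches.
  .config("spark.sql.adaptive.fetchShuffleBlocksInBatch", "false")
  .getOrCreate()
{code}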

 

  was:
When spark.io.encryption.enabled=true is set, many test cases in 
AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
incompatible with I/O encryption.

 

 

 


> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, many test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.
>  






[jira] [Updated] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-18 Thread hezuojiao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hezuojiao updated SPARK-34790:
--
Description: 
When spark.io.encryption.enabled=true is set, many test cases in 
AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
incompatible with I/O encryption.

 

 

 

  was:When spark.io.encryption.enabled=true is set, many test cases in 
AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
incompatible with I/O encryption.


> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, many test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.
>  
>  
>  






[jira] [Resolved] (SPARK-34772) RebaseDateTime loadRebaseRecords should use Spark classloader instead of context

2021-03-18 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34772?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-34772.
-
Fix Version/s: 3.0.3
   3.1.2
   3.2.0
 Assignee: ulysses you
   Resolution: Fixed

Issue resolved by pull request 31864
[https://github.com/apache/spark/pull/31864]

> RebaseDateTime loadRebaseRecords should use Spark classloader instead of 
> context
> 
>
> Key: SPARK-34772
> URL: https://issues.apache.org/jira/browse/SPARK-34772
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> With a custom `spark.sql.hive.metastore.version` and 
> `spark.sql.hive.metastore.jars`, Spark uses a date formatter in `HiveShim` 
> that converts `date` to `string`. If we set 
> `spark.sql.legacy.timeParserPolicy=LEGACY`, the `RebaseDateTime` code will be 
> invoked. At that moment, if `RebaseDateTime` is being initialized for the 
> first time, the context class loader is the `IsolatedClientLoader`, and the 
> following error is thrown:
> {code:java}
> java.lang.IllegalArgumentException: argument "src" is null
>   at 
> com.fasterxml.jackson.databind.ObjectMapper._assertNotNull(ObjectMapper.java:4413)
>   at 
> com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3157)
>   at 
> com.fasterxml.jackson.module.scala.ScalaObjectMapper.readValue(ScalaObjectMapper.scala:187)
>   at 
> com.fasterxml.jackson.module.scala.ScalaObjectMapper.readValue$(ScalaObjectMapper.scala:186)
>   at 
> org.apache.spark.sql.catalyst.util.RebaseDateTime$$anon$1.readValue(RebaseDateTime.scala:267)
>   at 
> org.apache.spark.sql.catalyst.util.RebaseDateTime$.loadRebaseRecords(RebaseDateTime.scala:269)
>   at 
> org.apache.spark.sql.catalyst.util.RebaseDateTime$.(RebaseDateTime.scala:291)
>   at 
> org.apache.spark.sql.catalyst.util.RebaseDateTime$.(RebaseDateTime.scala)
>   at 
> org.apache.spark.sql.catalyst.util.DateTimeUtils$.toJavaDate(DateTimeUtils.scala:109)
>   at 
> org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format(DateFormatter.scala:95)
>   at 
> org.apache.spark.sql.catalyst.util.LegacyDateFormatter.format$(DateFormatter.scala:94)
>   at 
> org.apache.spark.sql.catalyst.util.LegacySimpleDateFormatter.format(DateFormatter.scala:138)
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13$ExtractableLiteral$1$.unapply(HiveShim.scala:661)
>   at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:785)
>   at 
> org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
> {code}
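
The title points at the likely cause: `loadRebaseRecords` resolves its JSON 
resource through the thread context class loader, which at this point is Hive's 
`IsolatedClientLoader`, so the lookup can return null and Jackson then receives 
a null `src`. A hedged sketch of the distinction follows; the resource name and 
the helper object are illustrative assumptions, not the actual patch.

{code:scala}
// Illustrative comparison of the two class loaders involved.
object RebaseLoaderSketch {
  private val resource = "gregorian-julian-rebase-micros.json"  // assumed name

  // Context class loader: may be Hive's IsolatedClientLoader at this point,
  // and the lookup can come back null, which triggers the Jackson error above.
  def viaContextLoader(): java.io.InputStream =
    Thread.currentThread().getContextClassLoader.getResourceAsStream(resource)

  // Loader that loaded this (Spark-side) class: it can see catalyst's own
  // resources regardless of what the context class loader currently is.
  def viaSparkLoader(): java.io.InputStream =
    getClass.getClassLoader.getResourceAsStream(resource)
}
{code}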






[jira] [Commented] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304641#comment-17304641
 ] 

Apache Spark commented on SPARK-34796:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/31892

> Codegen compilation error for query with LIMIT operator and without AQE
> ---
>
> Key: SPARK-34796
> URL: https://issues.apache.org/jira/browse/SPARK-34796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Cheng Su
>Priority: Critical
>
> Example (reproduced in unit test):
>  
> {code:java}
>   test("failed limit query") {
> withTable("left_table", "empty_right_table", "output_table") {
>   spark.range(5).toDF("k").write.saveAsTable("left_table")
>   spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
>  
>   withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
> spark.sql("CREATE TABLE output_table (k INT) USING parquet")
> spark.sql(
>   s"""
>  |INSERT INTO TABLE output_table
>  |SELECT t1.k FROM left_table t1
>  |JOIN empty_right_table t2
>  |ON t1.k = t2.k
>  |LIMIT 3
>  |""".stripMargin)
>   }
> }
>   }
> {code}
> Result:
>  
> https://gist.github.com/c21/ea760c75b546d903247582be656d9d66






[jira] [Assigned] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34796:


Assignee: (was: Apache Spark)

> Codegen compilation error for query with LIMIT operator and without AQE
> ---
>
> Key: SPARK-34796
> URL: https://issues.apache.org/jira/browse/SPARK-34796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Cheng Su
>Priority: Critical
>
> Example (reproduced in unit test):
>  
> {code:java}
>   test("failed limit query") {
> withTable("left_table", "empty_right_table", "output_table") {
>   spark.range(5).toDF("k").write.saveAsTable("left_table")
>   spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
>  
>   withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
> spark.sql("CREATE TABLE output_table (k INT) USING parquet")
> spark.sql(
>   s"""
>  |INSERT INTO TABLE output_table
>  |SELECT t1.k FROM left_table t1
>  |JOIN empty_right_table t2
>  |ON t1.k = t2.k
>  |LIMIT 3
>  |""".stripMargin)
>   }
> }
>   }
> {code}
> Result:
>  
> https://gist.github.com/c21/ea760c75b546d903247582be656d9d66






[jira] [Assigned] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34796:


Assignee: Apache Spark

> Codegen compilation error for query with LIMIT operator and without AQE
> ---
>
> Key: SPARK-34796
> URL: https://issues.apache.org/jira/browse/SPARK-34796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Critical
>
> Example (reproduced in unit test):
>  
> {code:java}
>   test("failed limit query") {
> withTable("left_table", "empty_right_table", "output_table") {
>   spark.range(5).toDF("k").write.saveAsTable("left_table")
>   spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
>  
>   withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
> spark.sql("CREATE TABLE output_table (k INT) USING parquet")
> spark.sql(
>   s"""
>  |INSERT INTO TABLE output_table
>  |SELECT t1.k FROM left_table t1
>  |JOIN empty_right_table t2
>  |ON t1.k = t2.k
>  |LIMIT 3
>  |""".stripMargin)
>   }
> }
>   }
> {code}
> Result:
>  
> https://gist.github.com/c21/ea760c75b546d903247582be656d9d66






[jira] [Resolved] (SPARK-34761) Add a day-time interval to a timestamp

2021-03-18 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34761.
-
Resolution: Fixed

Issue resolved by pull request 31855
[https://github.com/apache/spark/pull/31855]

> Add a day-time interval to a timestamp
> --
>
> Key: SPARK-34761
> URL: https://issues.apache.org/jira/browse/SPARK-34761
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Support adding DayTimeIntervalType values to TIMESTAMP values.
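
A hedged example of the kind of expression this enables. The interval literal 
syntax is an assumption based on the ANSI day-time interval work in 3.2, and 
`spark` stands for an existing SparkSession; nothing here is copied from the 
ticket.

{code:scala}
// '1 02:03:04' = 1 day, 2 hours, 3 minutes and 4 seconds added to the timestamp.
spark.sql(
  "SELECT timestamp'2021-03-18 12:00:00' + INTERVAL '1 02:03:04' DAY TO SECOND AS ts"
).show(false)
{code}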






[jira] [Assigned] (SPARK-31897) Enable codegen for GenerateExec

2021-03-18 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-31897:
---

Assignee: Karuppayya

> Enable codegen for GenerateExec
> ---
>
> Key: SPARK-31897
> URL: https://issues.apache.org/jira/browse/SPARK-31897
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Karuppayya
>Assignee: Karuppayya
>Priority: Major
>
> Codegen had been disabled as part of 
> [https://github.com/apache/spark/commit/b5c5bd98ea5e8dbfebcf86c5459bdf765f5ceb53]
> There have been many improvements in the Generate code flow since it was 
> disabled (https://issues.apache.org/jira/browse/SPARK-21657).
> Related discussion:
>  
> [http://apache-spark-developers-list.1001551.n3.nabble.com/GenerateExec-CodegenSupport-and-supportCodegen-flag-off-td22984.html]
> I tried enabling codegen and found that queries with explode-like operations 
> performed much better (see the sketch below).
> Can we enable codegen by default for GenerateExec?
>  
>  
>  
>  
>  
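
To make the performance point concrete, a hedged sketch (illustrative only, not 
from the ticket) of an explode-heavy query whose physical plan can be inspected 
for whole-stage codegen participation:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("generate-codegen-sketch")
  .getOrCreate()
import spark.implicits._

// With codegen enabled for GenerateExec, the explode should appear inside a
// WholeStageCodegen stage in the generated-code explain output.
val df = Seq(Seq(1, 2, 3), Seq(4, 5)).toDF("xs")
df.select(explode($"xs").as("x")).explain("codegen")
{code}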






[jira] [Resolved] (SPARK-31897) Enable codegen for GenerateExec

2021-03-18 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh resolved SPARK-31897.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 28715
[https://github.com/apache/spark/pull/28715]

> Enable codegen for GenerateExec
> ---
>
> Key: SPARK-31897
> URL: https://issues.apache.org/jira/browse/SPARK-31897
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Karuppayya
>Assignee: Karuppayya
>Priority: Major
> Fix For: 3.2.0
>
>
> Codegen had been disabled as part of 
> [https://github.com/apache/spark/commit/b5c5bd98ea5e8dbfebcf86c5459bdf765f5ceb53]
> There have been many improvements in the Generate code flow since it was 
> disabled (https://issues.apache.org/jira/browse/SPARK-21657).
> Related discussion:
>  
> [http://apache-spark-developers-list.1001551.n3.nabble.com/GenerateExec-CodegenSupport-and-supportCodegen-flag-off-td22984.html]
> I tried enabling codegen and found that queries with explode-like operations 
> performed much better.
> Can we enable codegen by default for GenerateExec?
>  
>  
>  
>  
>  






[jira] [Commented] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)

2021-03-18 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304622#comment-17304622
 ] 

L. C. Hsieh commented on SPARK-34776:
-

Thanks for the ping. Let me take a look.

> Catalyst error on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Priority: Major
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
> {code}
> However, when I select a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 

[jira] [Updated] (SPARK-34800) Use fine-grained lock in SessionCatalog.tableExists

2021-03-18 Thread rongchuan.jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

rongchuan.jin updated SPARK-34800:
--
Description: 
We have modified the underlying Hive metastore so that each Hive database is 
placed in its own shard for performance. However, we found that the 
synchronized block limits concurrency, and we would like to fix it.

A related jstack trace looks like the following:
{code:java}
"http-nio-7070-exec-257" #19961734 daemon prio=5 os_prio=0 
tid=0x7f45f4ce1000 nid=0x1a85e6 waiting for monitor entry 
[0x7f45949df000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:415)
- waiting to lock <0x00011d983d90> (a 
org.apache.spark.sql.catalyst.catalog.SessionCatalog)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:185)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:430)
at org.apache.spark.sql.DdlOperation$.getTableDesc(SourceFile:123)
at org.apache.spark.sql.DdlOperation.getTableDesc(SourceFile)
...
{code}
We fixed it as discussed on the mailing list: 
[http://mail-archives.apache.org/mod_mbox/spark-dev/202103.mbox/browser]
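
As a hedged sketch of the general shape of such a change (illustrative types 
and method names, not the actual patch): hold the lock only for cheap, 
session-local state and move the slow metastore round trip outside of it.

{code:scala}
// Illustrative stand-in for the external Hive metastore client.
trait MetastoreClient {
  def tableExists(db: String, table: String): Boolean
}

class CatalogSketch(external: MetastoreClient) {
  // Coarse-grained: the metastore RPC runs while holding the catalog-wide
  // lock, so concurrent callers serialize on it.
  def tableExistsCoarse(db: String, table: String): Boolean = synchronized {
    external.tableExists(db.toLowerCase, table.toLowerCase)
  }

  // Fine-grained: only the cheap name resolution happens under the lock; the
  // metastore round trip does not block other callers.
  def tableExistsFine(db: String, table: String): Boolean = {
    val (d, t) = synchronized { (db.toLowerCase, table.toLowerCase) }
    external.tableExists(d, t)
  }
}
{code}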

 

  was:
We have modified the underlying Hive metastore so that each Hive database is 
placed in its own shard for performance. However, we found that the 
synchronized block limits concurrency, and we would like to fix it.

A related jstack trace looks like the following:
{code:java}
"http-nio-7070-exec-257" #19961734 daemon prio=5 os_prio=0 
tid=0x7f45f4ce1000 nid=0x1a85e6 waiting for monitor entry 
[0x7f45949df000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:415)
- waiting to lock <0x00011d983d90> (a 
org.apache.spark.sql.catalyst.catalog.SessionCatalog)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:185)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:430)
at org.apache.spark.sql.DdlOperation$.getTableDesc(SourceFile:123)
at org.apache.spark.sql.DdlOperation.getTableDesc(SourceFile)
...
{code}
We fixed it as discussed on the mailing list: 
[http://mail-archives.apache.org/mod_mbox/spark-dev/202103.mbox/browser]

 


> Use fine-grained lock in SessionCatalog.tableExists
> ---
>
> Key: SPARK-34800
> URL: https://issues.apache.org/jira/browse/SPARK-34800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: rongchuan.jin
>Priority: Major
>
> We have modified the underlying Hive metastore so that each Hive database is 
> placed in its own shard for performance. However, we found that the 
> synchronized block limits concurrency, and we would like to fix it.
> A related jstack trace looks like the following:
> {code:java}
> "http-nio-7070-exec-257" #19961734 daemon prio=5 os_prio=0 
> tid=0x7f45f4ce1000 nid=0x1a85e6 waiting for monitor entry 
> [0x7f45949df000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:415)
> - waiting to lock <0x00011d983d90> (a 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:185)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:430)
> at org.apache.spark.sql.DdlOperation$.getTableDesc(SourceFile:123)
> at org.apache.spark.sql.DdlOperation.getTableDesc(SourceFile)
> ...
> {code}
> We fixed it as discussed on the mailing list: 
> [http://mail-archives.apache.org/mod_mbox/spark-dev/202103.mbox/browser]
>  






[jira] [Commented] (SPARK-34800) Use fine-grained lock in SessionCatalog.tableExists

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304620#comment-17304620
 ] 

Apache Spark commented on SPARK-34800:
--

User 'woyumen4597' has created a pull request for this issue:
https://github.com/apache/spark/pull/31891

> Use fine-grained lock in SessionCatalog.tableExists
> ---
>
> Key: SPARK-34800
> URL: https://issues.apache.org/jira/browse/SPARK-34800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: rongchuan.jin
>Priority: Major
>
> We have modified the underlying Hive metastore so that each Hive database is 
> placed in its own shard for performance. However, we found that the 
> synchronized block limits concurrency, and we would like to fix it.
> A related jstack trace looks like the following:
> {code:java}
> "http-nio-7070-exec-257" #19961734 daemon prio=5 os_prio=0 
> tid=0x7f45f4ce1000 nid=0x1a85e6 waiting for monitor entry 
> [0x7f45949df000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:415)
> - waiting to lock <0x00011d983d90> (a 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:185)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:430)
> at org.apache.spark.sql.DdlOperation$.getTableDesc(SourceFile:123)
> at org.apache.spark.sql.DdlOperation.getTableDesc(SourceFile)
> ...
> {code}
> We fixed it as discussed on the mailing list: 
> [http://mail-archives.apache.org/mod_mbox/spark-dev/202103.mbox/browser]
>  






[jira] [Assigned] (SPARK-34800) Use fine-grained lock in SessionCatalog.tableExists

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34800:


Assignee: (was: Apache Spark)

> Use fine-grained lock in SessionCatalog.tableExists
> ---
>
> Key: SPARK-34800
> URL: https://issues.apache.org/jira/browse/SPARK-34800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: rongchuan.jin
>Priority: Major
>
> We have modified the underlying Hive metastore so that each Hive database is 
> placed in its own shard for performance. However, we found that the 
> synchronized block limits concurrency, and we would like to fix it.
> A related jstack trace looks like the following:
> {code:java}
> "http-nio-7070-exec-257" #19961734 daemon prio=5 os_prio=0 
> tid=0x7f45f4ce1000 nid=0x1a85e6 waiting for monitor entry 
> [0x7f45949df000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:415)
> - waiting to lock <0x00011d983d90> (a 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:185)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:430)
> at org.apache.spark.sql.DdlOperation$.getTableDesc(SourceFile:123)
> at org.apache.spark.sql.DdlOperation.getTableDesc(SourceFile)
> ...
> {code}
> We fixed it as discussed on the mailing list: 
> [http://mail-archives.apache.org/mod_mbox/spark-dev/202103.mbox/browser]
>  






[jira] [Assigned] (SPARK-34800) Use fine-grained lock in SessionCatalog.tableExists

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34800:


Assignee: Apache Spark

> Use fine-grained lock in SessionCatalog.tableExists
> ---
>
> Key: SPARK-34800
> URL: https://issues.apache.org/jira/browse/SPARK-34800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: rongchuan.jin
>Assignee: Apache Spark
>Priority: Major
>
> We have modified the underlying Hive metastore so that each Hive database is 
> placed in its own shard for performance. However, we found that the 
> synchronized block limits concurrency, and we would like to fix it.
> A related jstack trace looks like the following:
> {code:java}
> "http-nio-7070-exec-257" #19961734 daemon prio=5 os_prio=0 
> tid=0x7f45f4ce1000 nid=0x1a85e6 waiting for monitor entry 
> [0x7f45949df000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:415)
> - waiting to lock <0x00011d983d90> (a 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:185)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:430)
> at org.apache.spark.sql.DdlOperation$.getTableDesc(SourceFile:123)
> at org.apache.spark.sql.DdlOperation.getTableDesc(SourceFile)
> ...
> {code}
> We fixed it as discussed on the mailing list: 
> [http://mail-archives.apache.org/mod_mbox/spark-dev/202103.mbox/browser]
>  






[jira] [Commented] (SPARK-34800) Use fine-grained lock in SessionCatalog.tableExists

2021-03-18 Thread rongchuan.jin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304619#comment-17304619
 ] 

rongchuan.jin commented on SPARK-34800:
---

PR link: https://github.com/apache/spark/pull/31891

> Use fine-grained lock in SessionCatalog.tableExists
> ---
>
> Key: SPARK-34800
> URL: https://issues.apache.org/jira/browse/SPARK-34800
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: rongchuan.jin
>Priority: Major
>
> We have modified the underlying Hive metastore so that each Hive database is 
> placed in its own shard for performance. However, we found that the 
> synchronized block limits concurrency, and we would like to fix it.
> A related jstack trace looks like the following:
> {code:java}
> "http-nio-7070-exec-257" #19961734 daemon prio=5 os_prio=0 
> tid=0x7f45f4ce1000 nid=0x1a85e6 waiting for monitor entry 
> [0x7f45949df000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:415)
> - waiting to lock <0x00011d983d90> (a 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:185)
> at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:430)
> at org.apache.spark.sql.DdlOperation$.getTableDesc(SourceFile:123)
> at org.apache.spark.sql.DdlOperation.getTableDesc(SourceFile)
> ...
> {code}
> We fixed it as discussed on the mailing list: 
> [http://mail-archives.apache.org/mod_mbox/spark-dev/202103.mbox/browser]
>  






[jira] [Updated] (SPARK-34798) Fix incorrect join condition

2021-03-18 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-34798:

Summary: Fix incorrect join condition  (was: Replace `==` with `===` in 
join condition)

> Fix incorrect join condition
> 
>
> Key: SPARK-34798
> URL: https://issues.apache.org/jira/browse/SPARK-34798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> [https://github.com/apache/spark/blob/5988d2846450f8ec72fefcd2067f12819463cc4b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala#L150]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L77]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L89]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L91]
>  
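
For context, a hedged illustration (not taken from those suites) of why the 
distinction matters: Scala's `==` compares the two Column objects and yields a 
plain Boolean, while Catalyst's `===` builds the equality expression that a 
join condition actually needs.

{code:scala}
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

val left: Column  = col("t1.k")
val right: Column = col("t2.k")

val boolCond = left == right   // plain Boolean (false here): not a usable join condition
val colCond  = left === right  // Column representing the predicate t1.k = t2.k
{code}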






[jira] [Commented] (SPARK-34798) Replace `==` with `===` in join condition

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304617#comment-17304617
 ] 

Apache Spark commented on SPARK-34798:
--

User 'opensky142857' has created a pull request for this issue:
https://github.com/apache/spark/pull/31890

> Replace `==` with `===` in join condition
> -
>
> Key: SPARK-34798
> URL: https://issues.apache.org/jira/browse/SPARK-34798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> [https://github.com/apache/spark/blob/5988d2846450f8ec72fefcd2067f12819463cc4b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala#L150]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L77]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L89]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L91]
>  






[jira] [Assigned] (SPARK-34798) Replace `==` with `===` in join condition

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34798:


Assignee: Apache Spark

> Replace `==` with `===` in join condition
> -
>
> Key: SPARK-34798
> URL: https://issues.apache.org/jira/browse/SPARK-34798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> [https://github.com/apache/spark/blob/5988d2846450f8ec72fefcd2067f12819463cc4b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala#L150]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L77]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L89]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L91]
>  






[jira] [Commented] (SPARK-34798) Replace `==` with `===` in join condition

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304616#comment-17304616
 ] 

Apache Spark commented on SPARK-34798:
--

User 'opensky142857' has created a pull request for this issue:
https://github.com/apache/spark/pull/31890

> Replace `==` with `===` in join condition
> -
>
> Key: SPARK-34798
> URL: https://issues.apache.org/jira/browse/SPARK-34798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> [https://github.com/apache/spark/blob/5988d2846450f8ec72fefcd2067f12819463cc4b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala#L150]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L77]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L89]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L91]
>  






[jira] [Assigned] (SPARK-34798) Replace `==` with `===` in join condition

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34798:


Assignee: (was: Apache Spark)

> Replace `==` with `===` in join condition
> -
>
> Key: SPARK-34798
> URL: https://issues.apache.org/jira/browse/SPARK-34798
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> [https://github.com/apache/spark/blob/5988d2846450f8ec72fefcd2067f12819463cc4b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala#L150]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L77]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L89]
> [https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L91]
>  






[jira] [Created] (SPARK-34800) Use fine-grained lock in SessionCatalog.tableExists

2021-03-18 Thread rongchuan.jin (Jira)
rongchuan.jin created SPARK-34800:
-

 Summary: Use fine-grained lock in SessionCatalog.tableExists
 Key: SPARK-34800
 URL: https://issues.apache.org/jira/browse/SPARK-34800
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: rongchuan.jin


We have modified the underlying Hive metastore so that each Hive database is 
placed in its own shard for performance. However, we found that the 
synchronized block limits concurrency, and we would like to fix it.

A related jstack trace looks like the following:
{code:java}
"http-nio-7070-exec-257" #19961734 daemon prio=5 os_prio=0 
tid=0x7f45f4ce1000 nid=0x1a85e6 waiting for monitor entry 
[0x7f45949df000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:415)
- waiting to lock <0x00011d983d90> (a 
org.apache.spark.sql.catalyst.catalog.SessionCatalog)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.requireTableExists(SessionCatalog.scala:185)
at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.getTableMetadata(SessionCatalog.scala:430)
at org.apache.spark.sql.DdlOperation$.getTableDesc(SourceFile:123)
at org.apache.spark.sql.DdlOperation.getTableDesc(SourceFile)
...
{code}
We fixed it as discussed on the mailing list: 
[http://mail-archives.apache.org/mod_mbox/spark-dev/202103.mbox/browser]

 






[jira] [Comment Edited] (SPARK-34600) Support user defined types in Pandas UDF

2021-03-18 Thread Darcy Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304606#comment-17304606
 ] 

Darcy Shen edited comment on SPARK-34600 at 3/19/21, 3:14 AM:
--

[~hyukjin.kwon][~eddyxu]

Let me move the PR to the newly-created JIRA: 
https://issues.apache.org/jira/browse/SPARK-34799


was (Author: sadhen):
[~hyukjin.kwon][~eddyxu]

Let me move the PR to the newly-created JIRA.

> Support user defined types in Pandas UDF
> 
>
> Key: SPARK-34600
> URL: https://issues.apache.org/jira/browse/SPARK-34600
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Lei (Eddy) Xu
>Priority: Major
>  Labels: pandas, sql, udf
>
> This is an umbrella ticket.






[jira] [Commented] (SPARK-34600) Support user defined types in Pandas UDF

2021-03-18 Thread Darcy Shen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304606#comment-17304606
 ] 

Darcy Shen commented on SPARK-34600:


[~hyukjin.kwon][~eddyxu]

Let me move the PR to the newly-created JIRA.

> Support user defined types in Pandas UDF
> 
>
> Key: SPARK-34600
> URL: https://issues.apache.org/jira/browse/SPARK-34600
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Lei (Eddy) Xu
>Priority: Major
>  Labels: pandas, sql, udf
>
> This is an umbrella ticket.






[jira] [Updated] (SPARK-34600) Support user defined types in Pandas UDF

2021-03-18 Thread Darcy Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darcy Shen updated SPARK-34600:
---
Description: This is an umbrella ticket.  (was: Because Pandas UDFs use 
pyarrow to pass data, they do not currently support UserDefinedTypes, unlike 
normal Python UDFs.

For example:


{code:python}
class BoxType(UserDefinedType):
@classmethod
def sqlType(cls) -> StructType:
return StructType(
fields=[
StructField("xmin", DoubleType(), False),
StructField("ymin", DoubleType(), False),
StructField("xmax", DoubleType(), False),
StructField("ymax", DoubleType(), False),
]
)

@pandas_udf(
 returnType=StructType([StructField("boxes", ArrayType(Box()))])
)
def pandas_pf(s: pd.DataFrame) -> pd.DataFrame:
   yield s
{code}

The logs show
{code}
try:
to_arrow_type(self._returnType_placeholder)
except TypeError:
>   raise NotImplementedError(
"Invalid return type with scalar Pandas UDFs: %s is "
E   NotImplementedError: Invalid return type with scalar Pandas 
UDFs: StructType(List(StructField(boxes,ArrayType(Box,true),true))) is not 
supported
{code}
)

> Support user defined types in Pandas UDF
> 
>
> Key: SPARK-34600
> URL: https://issues.apache.org/jira/browse/SPARK-34600
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Lei (Eddy) Xu
>Priority: Major
>  Labels: pandas, sql, udf
>
> This is an umbrella ticket.






[jira] [Updated] (SPARK-34799) Return User-defined types from Pandas UDF

2021-03-18 Thread Darcy Shen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darcy Shen updated SPARK-34799:
---
Description: 
Because Pandas UDFs use pyarrow to pass data, they do not currently support 
UserDefinedTypes, unlike normal Python UDFs.

For example:


{code:python}
class BoxType(UserDefinedType):
@classmethod
def sqlType(cls) -> StructType:
return StructType(
fields=[
StructField("xmin", DoubleType(), False),
StructField("ymin", DoubleType(), False),
StructField("xmax", DoubleType(), False),
StructField("ymax", DoubleType(), False),
]
)

@pandas_udf(
 returnType=StructType([StructField("boxes", ArrayType(Box()))])
)
def pandas_pf(s: pd.DataFrame) -> pd.DataFrame:
   yield s
{code}

The logs show
{code}
try:
to_arrow_type(self._returnType_placeholder)
except TypeError:
>   raise NotImplementedError(
"Invalid return type with scalar Pandas UDFs: %s is "
E   NotImplementedError: Invalid return type with scalar Pandas 
UDFs: StructType(List(StructField(boxes,ArrayType(Box,true),true))) is not 
supported
{code}


> Return User-defined types from Pandas UDF
> -
>
> Key: SPARK-34799
> URL: https://issues.apache.org/jira/browse/SPARK-34799
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Darcy Shen
>Priority: Major
>
> Because Pandas UDFs use pyarrow to pass data, they do not currently 
> support UserDefinedTypes, unlike normal Python UDFs.
> For example:
> {code:python}
> class BoxType(UserDefinedType):
> @classmethod
> def sqlType(cls) -> StructType:
> return StructType(
> fields=[
> StructField("xmin", DoubleType(), False),
> StructField("ymin", DoubleType(), False),
> StructField("xmax", DoubleType(), False),
> StructField("ymax", DoubleType(), False),
> ]
> )
> @pandas_udf(
>  returnType=StructType([StructField("boxes", ArrayType(Box()))])
> )
> def pandas_pf(s: pd.DataFrame) -> pd.DataFrame:
>yield s
> {code}
> The logs show
> {code}
> try:
> to_arrow_type(self._returnType_placeholder)
> except TypeError:
> >   raise NotImplementedError(
> "Invalid return type with scalar Pandas UDFs: %s is "
> E   NotImplementedError: Invalid return type with scalar 
> Pandas UDFs: StructType(List(StructField(boxes,ArrayType(Box,true),true))) is 
> not supported
> {code}






[jira] [Created] (SPARK-34799) Return User-defined types from Pandas UDF

2021-03-18 Thread Darcy Shen (Jira)
Darcy Shen created SPARK-34799:
--

 Summary: Return User-defined types from Pandas UDF
 Key: SPARK-34799
 URL: https://issues.apache.org/jira/browse/SPARK-34799
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Affects Versions: 3.1.1, 3.0.2
Reporter: Darcy Shen









[jira] [Commented] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2021-03-18 Thread wuxiaofei (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304603#comment-17304603
 ] 

wuxiaofei commented on SPARK-34788:
---

Please assign it to me; I will try to work it out later.

> Spark throws FileNotFoundException instead of IOException when disk is full
> ---
>
> Key: SPARK-34788
> URL: https://issues.apache.org/jira/browse/SPARK-34788
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
> When the disk is full, Spark throws a FileNotFoundException instead of an 
> IOException with a hint about the full disk, which is quite confusing to 
> users. The reported (truncated) stack trace appears after the sketch below.
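
A hedged sketch (an illustrative wrapper, not Spark's actual code) of the kind 
of check that would surface a clearer IOException when a local shuffle file 
cannot be opened:

{code:scala}
import java.io.{File, FileInputStream, FileNotFoundException, IOException}

// Open a local shuffle segment, translating a bare FileNotFoundException into
// an IOException that mentions the likely cause.
def openShuffleSegment(file: File): FileInputStream = {
  try new FileInputStream(file)
  catch {
    case e: FileNotFoundException =>
      throw new IOException(
        s"Failed to open ${file.getAbsolutePath}. The file is missing or could " +
          "not be created, which can happen when the local disk is full.", e)
  }
}
{code}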
> {code:java}
> 9/03/26 09:03:45 ERROR ShuffleBlockFetcherIterator: Failed to create input 
> stream from local block
> java.io.IOException: Error in reading 
> FileSegmentManagedBuffer{file=/local_disk0/spark-c2f26f02-2572-4764-815a-cbba65ddb315/executor-b4b76a4c-788c-4cb6-b904-664a883be1aa/blockmgr-36804371-24fe-4131-a3dc-00b7f98f3a3e/11/shuffle_113_1029_0.data,
>  offset=110254956, length=1875458}
>   at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:111)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:442)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:622)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:98)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:95)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:839)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:839)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
>   at org.apache.spark.scheduler.Task.run(Task.scala:112)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: 
> /local_disk0/spark-c2f26f02-2572-4764-815a-cbba65ddb315/executor-b4b76a4c-788c-4cb6-b904-664a883be1aa/blockmgr-36804371-24fe-4131-a3dc-00b7f98f3a3e/11/shuffle_113_1029_0.data
>  (No such file or directory)
>   at java.io.FileInputStream.open0(Native Method)
>   at java.io.FileInputStream.open(FileInputStream.java:195)
>   at 

[jira] [Updated] (SPARK-30641) ML algs blockify input vectors

2021-03-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-30641:
-
Affects Version/s: 3.2.0

> ML algs blockify input vectors
> --
>
> Key: SPARK-30641
> URL: https://issues.apache.org/jira/browse/SPARK-30641
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.1.0, 3.2.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> Stacking input vectors into blocks will benefit ML algorithms:
> 1) less RAM is needed to persist datasets, since per-object header overhead 
> is reduced;
> 2) optimization potential for implementations, since high-level BLAS can be 
> used, as proven in ALS/MLP (see the sketch after this list);
> 3) possibly a way to perform efficient mini-batch sampling (to be confirmed)
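
A hedged sketch of the BLAS point above (illustration only, using the Breeze 
types Spark already depends on): once instances are stacked into a matrix 
block, a single matrix-vector multiply replaces one dot product per row.

{code:scala}
import breeze.linalg.{DenseMatrix, DenseVector}

// Three instances with two features each, stacked into one block.
val block = DenseMatrix(
  (1.0, 2.0),
  (3.0, 4.0),
  (5.0, 6.0))
val coefficients = DenseVector(0.5, -0.25)

// One level-2 BLAS operation (gemv) computes all three margins at once.
val margins = block * coefficients
{code}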






[jira] [Created] (SPARK-34798) Replace `==` with `===` in join condition

2021-03-18 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-34798:
---

 Summary: Replace `==` with `===` in join condition
 Key: SPARK-34798
 URL: https://issues.apache.org/jira/browse/SPARK-34798
 Project: Spark
  Issue Type: Bug
  Components: SQL, Tests
Affects Versions: 3.2.0
Reporter: Yuming Wang


[https://github.com/apache/spark/blob/5988d2846450f8ec72fefcd2067f12819463cc4b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/PropagateEmptyRelationSuite.scala#L150]

[https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L77]

[https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L89]

[https://github.com/apache/spark/blob/04e53d2e3c9dc993b8ba88c5ef0bcf0a8a9b06b2/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/OptimizeLimitZeroSuite.scala#L91]

 






[jira] [Updated] (SPARK-34797) Refactor Logistic Aggregator - support virtual centering

2021-03-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-34797:
-
Summary: Refactor Logistic Aggregator - support virtual centering  (was: 
Refactor Logistic Aggregator)

> Refactor Logistic Aggregator - support virtual centering
> 
>
> Key: SPARK-34797
> URL: https://issues.apache.org/jira/browse/SPARK-34797
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> 1) add BinaryLogisticBlockAggregator and MultinomialLogisticBlockAggregator 
> and related suites;
> 2) implement 'virtual centering' in standardization (see the sketch below)
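
A hedged sketch of what 'virtual centering' means here (an illustrative helper, 
not the actual aggregator): explicitly centering sparse features would densify 
them, but the centered dot product can be computed implicitly as x.w minus a 
precomputed mu.w.

{code:scala}
// Sparse instance as (index -> standardized value); `mu` holds the feature means.
def centeredMargin(x: Map[Int, Double], w: Array[Double], mu: Array[Double]): Double = {
  val xw  = x.iterator.map { case (i, v) => v * w(i) }.sum    // touches only non-zeros
  val muw = mu.indices.iterator.map(i => mu(i) * w(i)).sum    // constant, precomputable
  xw - muw                                                    // equals (x - mu) . w
}
{code}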






[jira] [Assigned] (SPARK-34797) Refactor Logistic Aggregator

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34797:


Assignee: (was: Apache Spark)

> Refactor Logistic Aggregator
> 
>
> Key: SPARK-34797
> URL: https://issues.apache.org/jira/browse/SPARK-34797
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> 1) add BinaryLogisticBlockAggregator and MultinomialLogisticBlockAggregator 
> and related suites;
> 2) implement 'virtual centering' in standardization






[jira] [Commented] (SPARK-34797) Refactor Logistic Aggregator

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304590#comment-17304590
 ] 

Apache Spark commented on SPARK-34797:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/31889

> Refactor Logistic Aggregator
> 
>
> Key: SPARK-34797
> URL: https://issues.apache.org/jira/browse/SPARK-34797
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> 1, add BinaryLogisticBlockAggregator and MultinomialLogisticBlockAggregator 
> and related suites;
> 2, impl 'virtual centering' in standardization



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34797) Refactor Logistic Aggregator

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34797:


Assignee: Apache Spark

> Refactor Logistic Aggregator
> 
>
> Key: SPARK-34797
> URL: https://issues.apache.org/jira/browse/SPARK-34797
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> 1, add BinaryLogisticBlockAggregator and MultinomialLogisticBlockAggregator 
> and related suites;
> 2, impl 'virtual centering' in standardization



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34797) Refactor Logistic Aggregator

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34797:


Assignee: Apache Spark

> Refactor Logistic Aggregator
> 
>
> Key: SPARK-34797
> URL: https://issues.apache.org/jira/browse/SPARK-34797
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Assignee: Apache Spark
>Priority: Major
>
> 1, add BinaryLogisticBlockAggregator and MultinomialLogisticBlockAggregator 
> and related suites;
> 2, impl 'virtual centering' in standardization



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34782) Not equal functions (!=, <>) are missing in SQL doc

2021-03-18 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34782?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi resolved SPARK-34782.
--
Resolution: Duplicate

> Not equal functions (!=, <>) are missing in SQL doc 
> 
>
> Key: SPARK-34782
> URL: https://issues.apache.org/jira/browse/SPARK-34782
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
> https://spark.apache.org/docs/latest/api/sql/search.html?q=%3C%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34782) Not equal functions (!=, <>) are missing in SQL doc

2021-03-18 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304585#comment-17304585
 ] 

wuyi commented on SPARK-34782:
--

Oh yea, thanks for catching it. I'll close this.

> Not equal functions (!=, <>) are missing in SQL doc 
> 
>
> Key: SPARK-34782
> URL: https://issues.apache.org/jira/browse/SPARK-34782
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
> https://spark.apache.org/docs/latest/api/sql/search.html?q=%3C%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34797) Refactor Logistic Aggregator

2021-03-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-34797:
-
Description: 
1, add BinaryLogisticBlockAggregator and MultinomialLogisticBlockAggregator and 
related suites;

2, impl 'virtual centering' in standardization

> Refactor Logistic Aggregator
> 
>
> Key: SPARK-34797
> URL: https://issues.apache.org/jira/browse/SPARK-34797
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0
>Reporter: zhengruifeng
>Priority: Major
>
> 1, add BinaryLogisticBlockAggregator and MultinomialLogisticBlockAggregator 
> and related suites;
> 2, impl 'virtual centering' in standardization



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34693) Support null in conversions to and from Arrow

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304583#comment-17304583
 ] 

Hyukjin Kwon commented on SPARK-34693:
--

Oh, okay. It seems like it's fixed in SPARK-33489 (Spark 3.2.0). It's the sort of 
improvement which isn't backported in general.

> Support null in conversions to and from Arrow
> -
>
> Key: SPARK-34693
> URL: https://issues.apache.org/jira/browse/SPARK-34693
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: Laurens
>Priority: Minor
>
> Looks like a regression from or related to SPARK-33489.
> I got the following running spark 3.1.1. with koalas 1.7.0
> {code:java}
> TypeError Traceback (most recent call last)
> /var/scratch/miniconda3/lib/python3.8/site-packages/pyspark/sql/udf.py in 
> returnType(self)
> 100 try:
> --> 101 to_arrow_type(self._returnType_placeholder)
> 102 except TypeError:
> /var/scratch/miniconda3/lib/python3.8/site-packages/pyspark/sql/pandas/types.py
>  in to_arrow_type(dt)
>  75 else:
> ---> 76 raise TypeError("Unsupported type in conversion to Arrow: " + 
> str(dt))
>  77 return arrow_type
> TypeError: Unsupported type in conversion to Arrow: NullType
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34797) Refactor Logistic Aggregator

2021-03-18 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-34797:


 Summary: Refactor Logistic Aggregator
 Key: SPARK-34797
 URL: https://issues.apache.org/jira/browse/SPARK-34797
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 3.2.0
Reporter: zhengruifeng






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2021-03-18 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-34788:
-
Description: 
When the disk is full, Spark throws a FileNotFoundException instead of an 
IOException with a disk-full hint. This is quite a confusing error to users:
{code:java}
9/03/26 09:03:45 ERROR ShuffleBlockFetcherIterator: Failed to create input 
stream from local block
java.io.IOException: Error in reading 
FileSegmentManagedBuffer{file=/local_disk0/spark-c2f26f02-2572-4764-815a-cbba65ddb315/executor-b4b76a4c-788c-4cb6-b904-664a883be1aa/blockmgr-36804371-24fe-4131-a3dc-00b7f98f3a3e/11/shuffle_113_1029_0.data,
 offset=110254956, length=1875458}
at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:111)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:622)
at 
org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:98)
at 
org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:95)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:839)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:839)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: 
/local_disk0/spark-c2f26f02-2572-4764-815a-cbba65ddb315/executor-b4b76a4c-788c-4cb6-b904-664a883be1aa/blockmgr-36804371-24fe-4131-a3dc-00b7f98f3a3e/11/shuffle_113_1029_0.data
 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.(FileInputStream.java:138)
at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:100)
... 35 more{code}
(The cause only says the file is not found, but after investigation we believe 
it is most likely due to the disk being full.)

There is probably a way to detect a full disk: when we catch a 
`FileNotFoundException`, we can apply the approach from 
[http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] and 
check whether a SyncFailedException is thrown. If it is, we throw an IOException 
with a disk-full hint instead (a rough sketch follows below).
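
A rough, hypothetical sketch of that check (the helper below is my own illustration 
and not part of Spark): after catching the FileNotFoundException, write and fsync a 
small probe file in the same directory, and treat a failure to sync as a strong hint 
that the disk is full.

{code:java}
import java.io.{File, FileOutputStream, IOException}

object DiskFullProbe {
  // Hypothetical helper: probe whether the disk backing `dir` is likely full by
  // writing a small temp file and forcing it to disk. SyncFailedException is a
  // subclass of IOException, so both cases are caught below.
  def diskLikelyFull(dir: File): Boolean = {
    try {
      val probe = File.createTempFile("disk-full-probe", ".tmp", dir)
      try {
        val out = new FileOutputStream(probe)
        try {
          out.write(new Array[Byte](4096)) // some bytes to flush
          out.getFD.sync()                 // fails if the data cannot be persisted
          false
        } finally {
          out.close()
        }
      } finally {
        probe.delete()
      }
    } catch {
      case _: IOException => true // creating, writing or syncing the probe failed
    }
  }
}
{code}

The fetch path could then wrap the original FileNotFoundException in an IOException 
with a disk-full hint whenever diskLikelyFull(file.getParentFile) returns true.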

[jira] [Updated] (SPARK-34765) Linear Models standardization optimization

2021-03-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-34765:
-
Parent: SPARK-30641
Issue Type: Sub-task  (was: Improvement)

> Linear Models standardization optimization
> --
>
> Key: SPARK-34765
> URL: https://issues.apache.org/jira/browse/SPARK-34765
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: zhengruifeng
>Priority: Major
>
> The existing implementation of standardization in linear models does *NOT* center 
> the vectors by removing the means, in order to keep the dataset sparse.
> However, this causes feature values with small variance to be scaled to large 
> values, and an underlying solver like LBFGS cannot handle this case 
> efficiently. See SPARK-34448 for details.
> If the internal vectors are centered (as in other well-known implementations, e.g. 
> GLMNET/Scikit-Learn), the convergence rate will be better. In the case in 
> SPARK-34448, the number of iterations to convergence is reduced from 93 
> to 6. Moreover, the final solution is much closer to the one from GLMNET.
> Luckily, we found a new way to 'virtually' center the vectors without 
> densifying the dataset, iff:
> 1, fitIntercept is true;
>  2, there is no penalty on the intercept; it seems this is always true in existing 
> implementations;
>  3, there are no bounds on the intercept;
>  
> We will also need to check whether this new method works as expected in all other 
> linear models (e.g. MLOR/SVC/LiR/AFT), and introduce it into 
> those models if possible.
>  
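
For readers following along, a minimal sketch of the 'virtual centering' identity under 
the assumptions above (fitIntercept enabled, no penalty and no bounds on the intercept); 
this is my reading of the description, not the actual implementation:

{code}
(x - \mu)^\top \beta + \tilde{\beta}_0 \;=\; x^\top \beta + \beta_0,
\qquad \text{where } \tilde{\beta}_0 = \beta_0 + \mu^\top \beta .
{code}

So the solver can optimize over the centered parametrization while every dot product is 
still evaluated against the original sparse x; the only extra cost is the dense 
correction term \mu^\top \beta.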



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34765) Linear Models standardization optimization

2021-03-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-34765:
-
Issue Type: Improvement  (was: Umbrella)

> Linear Models standardization optimization
> --
>
> Key: SPARK-34765
> URL: https://issues.apache.org/jira/browse/SPARK-34765
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.2.0, 3.1.1
>Reporter: zhengruifeng
>Priority: Major
>
> The existing implementation of standardization in linear models does *NOT* center 
> the vectors by removing the means, in order to keep the dataset sparse.
> However, this causes feature values with small variance to be scaled to large 
> values, and an underlying solver like LBFGS cannot handle this case 
> efficiently. See SPARK-34448 for details.
> If the internal vectors are centered (as in other well-known implementations, e.g. 
> GLMNET/Scikit-Learn), the convergence rate will be better. In the case in 
> SPARK-34448, the number of iterations to convergence is reduced from 93 
> to 6. Moreover, the final solution is much closer to the one from GLMNET.
> Luckily, we found a new way to 'virtually' center the vectors without 
> densifying the dataset, iff:
> 1, fitIntercept is true;
>  2, there is no penalty on the intercept; it seems this is always true in existing 
> implementations;
>  3, there are no bounds on the intercept;
>  
> We will also need to check whether this new method works as expected in all other 
> linear models (e.g. MLOR/SVC/LiR/AFT), and introduce it into 
> those models if possible.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34719) fail if the view query has duplicated column names

2021-03-18 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-34719.
--
Fix Version/s: 3.1.2
 Assignee: Wenchen Fan
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/31811

> fail if the view query has duplicated column names
> --
>
> Key: SPARK-34719
> URL: https://issues.apache.org/jira/browse/SPARK-34719
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.0, 3.1.0, 3.1.1
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Blocker
>  Labels: correctness
> Fix For: 3.1.2
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2021-03-18 Thread wuyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304581#comment-17304581
 ] 

wuyi commented on SPARK-34788:
--

[~dongjoon] I have pasted the error stack. It happened while we were reading the 
shuffle data. The cause only says the file is not found, but after investigation 
we believe it is most likely due to the disk being full.

> Spark throws FileNotFoundException instead of IOException when disk is full
> ---
>
> Key: SPARK-34788
> URL: https://issues.apache.org/jira/browse/SPARK-34788
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
> When the disk is full, Spark throws a FileNotFoundException instead of an 
> IOException with a disk-full hint. This is quite a confusing error to users:
> {code:java}
> 9/03/26 09:03:45 ERROR ShuffleBlockFetcherIterator: Failed to create input 
> stream from local block
> java.io.IOException: Error in reading 
> FileSegmentManagedBuffer{file=/local_disk0/spark-c2f26f02-2572-4764-815a-cbba65ddb315/executor-b4b76a4c-788c-4cb6-b904-664a883be1aa/blockmgr-36804371-24fe-4131-a3dc-00b7f98f3a3e/11/shuffle_113_1029_0.data,
>  offset=110254956, length=1875458}
>   at 
> org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:111)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:442)
>   at 
> org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
>   at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
>   at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
>   at 
> org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:622)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:98)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:95)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:839)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:839)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
>   at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
>   at org.apache.spark.scheduler.Task.run(Task.scala:112)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.FileNotFoundException: 
> /local_disk0/spark-c2f26f02-2572-4764-815a-cbba65ddb315/executor-b4b76a4c-788c-4cb6-b904-664a883be1aa/blockmgr-36804371-24fe-4131-a3dc-00b7f98f3a3e/11/shuffle_113_1029_0.data
>  (No such file 

[jira] [Updated] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2021-03-18 Thread wuyi (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuyi updated SPARK-34788:
-
Description: 
When the disk is full, Spark throws a FileNotFoundException instead of an 
IOException with a disk-full hint. This is quite a confusing error to users:
{code:java}
9/03/26 09:03:45 ERROR ShuffleBlockFetcherIterator: Failed to create input 
stream from local block
java.io.IOException: Error in reading 
FileSegmentManagedBuffer{file=/local_disk0/spark-c2f26f02-2572-4764-815a-cbba65ddb315/executor-b4b76a4c-788c-4cb6-b904-664a883be1aa/blockmgr-36804371-24fe-4131-a3dc-00b7f98f3a3e/11/shuffle_113_1029_0.data,
 offset=110254956, length=1875458}
at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:111)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:442)
at 
org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:64)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.sort_addToSorter_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:622)
at 
org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:98)
at 
org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:95)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:839)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:839)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:60)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:340)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:304)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1432)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.FileNotFoundException: 
/local_disk0/spark-c2f26f02-2572-4764-815a-cbba65ddb315/executor-b4b76a4c-788c-4cb6-b904-664a883be1aa/blockmgr-36804371-24fe-4131-a3dc-00b7f98f3a3e/11/shuffle_113_1029_0.data
 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:195)
at java.io.FileInputStream.(FileInputStream.java:138)
at 
org.apache.spark.network.buffer.FileSegmentManagedBuffer.createInputStream(FileSegmentManagedBuffer.java:100)
... 35 more{code}
There is probably a way to detect a full disk: when we catch a 
`FileNotFoundException`, we can apply the approach from 
[http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] and 
check whether a SyncFailedException is thrown. If it is, we throw an IOException 
with a disk-full hint instead.

  was:
When the disk is full, Spark throws FileNotFoundException instead of 
IOException with the hint. It's quite a confusing 

[jira] [Commented] (SPARK-34776) Catalyst error on certain struct operation (Couldn't find _gen_alias_)

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304571#comment-17304571
 ] 

Hyukjin Kwon commented on SPARK-34776:
--

cc [~viirya] FYI

> Catalyst error on certain struct operation (Couldn't find _gen_alias_)
> -
>
> Key: SPARK-34776
> URL: https://issues.apache.org/jira/browse/SPARK-34776
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: Spark 3.1.1
> Scala 2.12.10
>Reporter: Daniel Solow
>Priority: Major
>
> When I run the following:
> {code:java}
> import org.apache.spark.sql.{functions => f}
> import org.apache.spark.sql.expressions.Window
> val df = Seq(
> ("t1", "123", "bob"),
> ("t1", "456", "bob"),
> ("t2", "123", "sam")
> ).toDF("type", "value", "name")
> val test = df.select(
> $"*",
> f.struct(f.count($"*").over(Window.partitionBy($"type", $"value", 
> $"name")).as("count"), $"name").as("name_count")
> ).select(
>   $"*",
>   f.max($"name_count").over(Window.partitionBy($"type", 
> $"value")).as("best_name")
> )
> test.printSchema
> {code}
> I get the following schema, which is fine:
> {code:java}
> root
>  |-- type: string (nullable = true)
>  |-- value: string (nullable = true)
>  |-- name: string (nullable = true)
>  |-- name_count: struct (nullable = false)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
>  |-- best_name: struct (nullable = true)
>  ||-- count: long (nullable = false)
>  ||-- name: string (nullable = true)
> {code}
> However when I get a subfield of the best_name struct, I get an error:
> {code:java}
> test.select($"best_name.name").show(10, false)
> org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding 
> attribute, tree: _gen_alias_3458#3458
>   at 
> org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:75)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:407)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:243)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:405)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:358)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:322)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:306)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReference(BoundAttribute.scala:74)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.$anonfun$bindReferences$1(BoundAttribute.scala:96)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.BindReferences$.bindReferences(BoundAttribute.scala:96)
>   at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.(InterpretedMutableProjection.scala:35)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1589)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$19.applyOrElse(Optimizer.scala:1586)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:317)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> 

[jira] [Commented] (SPARK-34600) Support user defined types in Pandas UDF

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34600?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304569#comment-17304569
 ] 

Hyukjin Kwon commented on SPARK-34600:
--

[~eddyxu] if this ticket is an umbrella ticket, let's create another JIRA and 
move your PR (https://github.com/apache/spark/pull/31735) to that JIRA. An 
umbrella JIRA doesn't usually have a PR against it.

> Support user defined types in Pandas UDF
> 
>
> Key: SPARK-34600
> URL: https://issues.apache.org/jira/browse/SPARK-34600
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 3.0.2, 3.1.1
>Reporter: Lei (Eddy) Xu
>Priority: Major
>  Labels: pandas, sql, udf
>
> Because Pandas UDFs use pyarrow to pass data, they do not currently 
> support UserDefinedTypes, the way normal Python UDFs do.
> For example:
> {code:python}
> class BoxType(UserDefinedType):
> @classmethod
> def sqlType(cls) -> StructType:
> return StructType(
> fields=[
> StructField("xmin", DoubleType(), False),
> StructField("ymin", DoubleType(), False),
> StructField("xmax", DoubleType(), False),
> StructField("ymax", DoubleType(), False),
> ]
> )
> @pandas_udf(
>  returnType=StructType([StructField("boxes", ArrayType(Box()))]
> )
> def pandas_pf(s: pd.DataFrame) -> pd.DataFrame:
>yield s
> {code}
> The logs show
> {code}
> try:
> to_arrow_type(self._returnType_placeholder)
> except TypeError:
> >   raise NotImplementedError(
> "Invalid return type with scalar Pandas UDFs: %s is "
> E   NotImplementedError: Invalid return type with scalar 
> Pandas UDFs: StructType(List(StructField(boxes,ArrayType(Box,true),true))) is 
> not supported
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34754) sparksql 'add jar' not support hdfs ha mode in k8s

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304568#comment-17304568
 ] 

Hyukjin Kwon commented on SPARK-34754:
--

[~lithiumlee-_-] can you test whether it works in higher versions? K8S support 
just became GA in Spark 3.1.

> sparksql  'add jar' not support  hdfs ha mode in k8s  
> --
>
> Key: SPARK-34754
> URL: https://issues.apache.org/jira/browse/SPARK-34754
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.7
>Reporter: lithiumlee-_-
>Priority: Major
>
> When submitting the app to K8S, the executors hit the exception 
> "java.net.UnknownHostException: xx". 
> The UDF jar URI uses the HDFS HA style, but the exception stack shows 
> "...*createNonHAProxy*..."
>  
> hql: 
> {code:java}
> // code placeholder
> add jar hdfs://xx/test.jar;
> create temporary function test_udf as 'com.xxx.xxx';
> create table test.test_udf as 
> select test_udf('1') name_1;
>  {code}
>  
>  
> exception:
> {code:java}
> // code placeholder
>  TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 172.30.89.44, executor 
> 1): java.lang.IllegalArgumentException: java.net.UnknownHostException: xx
> at 
> org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:439)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:321)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:696)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:636)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:160)
> at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2796)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:99)
> at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2830)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2812)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:390)
> at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1866)
> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:721)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:496)
> at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:816)
> at 
> org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:808)
> at 
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at 
> scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
> at 
> scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
> at 
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
> at 
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:808)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:375)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException: xx
> ... 28 more
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34747) Add virtual operators to the built-in function document.

2021-03-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34747.
--
Fix Version/s: 3.0.3
   3.1.2
   3.2.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/31841

> Add virtual operators to the built-in function document.
> 
>
> Key: SPARK-34747
> URL: https://issues.apache.org/jira/browse/SPARK-34747
> Project: Spark
>  Issue Type: Bug
>  Components: docs, SQL
>Affects Versions: 2.4.7, 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> After SPARK-34697, DESCRIBE FUNCTION and SHOW FUNCTIONS can describe/show 
> built-in operators including the following virtual operators.
> * !=
> * <>
> * between
> * case
> * ||
> But they are still absent from the built-in functions document.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34782) Not equal functions (!=, <>) are missing in SQL doc

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304560#comment-17304560
 ] 

Hyukjin Kwon commented on SPARK-34782:
--

Duplicate of SPARK-34747?

> Not equal functions (!=, <>) are missing in SQL doc 
> 
>
> Key: SPARK-34782
> URL: https://issues.apache.org/jira/browse/SPARK-34782
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
> https://spark.apache.org/docs/latest/api/sql/search.html?q=%3C%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34780) Cached Table (parquet) with old Configs Used

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304561#comment-17304561
 ] 

Hyukjin Kwon commented on SPARK-34780:
--

cc [~sunchao] [~maxgekk] FYI

> Cached Table (parquet) with old Configs Used
> 
>
> Key: SPARK-34780
> URL: https://issues.apache.org/jira/browse/SPARK-34780
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.1
>Reporter: Michael Chen
>Priority: Major
>
> When a dataframe is cached, the logical plan can contain copies of the Spark 
> session, meaning the SQLConfs are stored along with it. Then, if a different 
> dataframe can replace parts of its logical plan with a cached logical plan, the 
> cached SQLConfs will be used for the evaluation of the cached logical plan. This 
> is because HadoopFsRelation ignores sparkSession for equality checks (introduced 
> in https://issues.apache.org/jira/browse/SPARK-17358).
> {code:java}
> test("cache uses old SQLConf") {
>   import testImplicits._
>   withTempDir { dir =>
> val tableDir = dir.getAbsoluteFile + "/table"
> val df = Seq("a").toDF("key")
> df.write.parquet(tableDir)
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1Stats = spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "10")
> val df2 = spark.read.parquet(tableDir).select("key")
> df2.cache()
> val compression10Stats = df2.queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> SQLConf.get.setConfString(SQLConf.FILE_COMPRESSION_FACTOR.key, "1")
> val compression1StatsWithCache = 
> spark.read.parquet(tableDir).select("key").
>   queryExecution.optimizedPlan.collect {
>   case l: LogicalRelation => l
>   case m: InMemoryRelation => m
> }.map(_.computeStats())
> // I expect these stats to be the same because file compression factor is 
> the same
> assert(compression1Stats == compression1StatsWithCache)
> // Instead, we can see the file compression factor is being cached and 
> used along with
> // the logical plan
> assert(compression10Stats == compression1StatsWithCache)
>   }
> }{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31976) use MemoryUsage to control the size of block

2021-03-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-31976:


Assignee: zhengruifeng

> use MemoryUsage to control the size of block
> 
>
> Key: SPARK-31976
> URL: https://issues.apache.org/jira/browse/SPARK-31976
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> According to the performance test in 
> https://issues.apache.org/jira/browse/SPARK-31783, the performance gain is 
> mainly related to the nnz of a block.
> So it may be reasonable to control the size of a block by memory usage, instead 
> of by number of rows.
>  
> note1: the param blockSize is already used in ALS and MLP to stack vectors 
> (expected to be dense);
> note2: we may refer to the {{Strategy.maxMemoryInMB}} in tree models;
>  
> There may be two ways to implement this:
> 1, compute the sparsity of the input vectors ahead of training (this can be done 
> together with other statistics computation, possibly with no extra pass), and 
> infer a reasonable number of vectors to stack;
> 2, stack the input vectors adaptively, by monitoring the memory usage in a 
> block (a rough sketch follows below);
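
A rough sketch of option 2 above, purely illustrative (the helper names and the byte 
estimates are assumptions, not Spark APIs): accumulate vectors into a block until an 
approximate memory estimate would exceed the budget, instead of counting rows.

{code:java}
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.ml.linalg.{DenseVector, SparseVector, Vector}

object MemoryBasedStacking {
  // Very rough per-vector footprint: 8 bytes per double, 4 bytes per sparse index
  // (object headers and alignment are deliberately ignored).
  def approxSizeInBytes(v: Vector): Long = v match {
    case d: DenseVector  => 8L * d.values.length
    case s: SparseVector => 12L * s.indices.length
  }

  // Group consecutive vectors into blocks whose estimated size stays under the budget.
  def stackByMemory(vectors: Iterator[Vector], maxBytesPerBlock: Long): Iterator[Array[Vector]] =
    new Iterator[Array[Vector]] {
      private val it = vectors.buffered
      override def hasNext: Boolean = it.hasNext
      override def next(): Array[Vector] = {
        val block = ArrayBuffer.empty[Vector]
        var bytes = 0L
        // Always take at least one vector so the iterator makes progress.
        while (it.hasNext &&
               (block.isEmpty || bytes + approxSizeInBytes(it.head) <= maxBytesPerBlock)) {
          val v = it.next()
          bytes += approxSizeInBytes(v)
          block += v
        }
        block.toArray
      }
    }
}
{code}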



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31661) Document usage of blockSize

2021-03-18 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-31661:


Assignee: zhengruifeng

> Document usage of blockSize
> ---
>
> Key: SPARK-31661
> URL: https://issues.apache.org/jira/browse/SPARK-31661
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304559#comment-17304559
 ] 

Cheng Su commented on SPARK-34796:
--

[~hyukjin.kwon] - yeah I agree with it. Thanks for correcting priority here.

> Codegen compilation error for query with LIMIT operator and without AQE
> ---
>
> Key: SPARK-34796
> URL: https://issues.apache.org/jira/browse/SPARK-34796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Cheng Su
>Priority: Critical
>
> Example (reproduced in unit test):
>  
> {code:java}
>   test("failed limit query") {
> withTable("left_table", "empty_right_table", "output_table") {
>   spark.range(5).toDF("k").write.saveAsTable("left_table")
>   spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
>  
>   withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
> spark.sql("CREATE TABLE output_table (k INT) USING parquet")
> spark.sql(
>   s"""
>  |INSERT INTO TABLE output_table
>  |SELECT t1.k FROM left_table t1
>  |JOIN empty_right_table t2
>  |ON t1.k = t2.k
>  |LIMIT 3
>  |""".stripMargin)
>   }
> }
>   }
> {code}
> Result:
>  
> https://gist.github.com/c21/ea760c75b546d903247582be656d9d66



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304556#comment-17304556
 ] 

Hyukjin Kwon commented on SPARK-34791:
--

See also https://github.com/apache/spark/pull/26429#discussion_r346103050

> SparkR throws node stack overflow
> -
>
> Key: SPARK-34791
> URL: https://issues.apache.org/jira/browse/SPARK-34791
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 3.0.1
>Reporter: obfuscated_dvlper
>Priority: Major
>
> SparkR throws "node stack overflow" error upon running code (sample below) on 
> R-4.0.2 with Spark 3.0.1.
> Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)
> {code:java}
> source('sample.R')
> myclsr = myclosure_func()
> myclsr$get_some_date('2021-01-01')
> ## spark.lapply throws node stack overflow
> result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
> source('sample.R')
> another_closure = myclosure_func()
> return(another_closure$get_some_date(rdate))
> })
> {code}
> Sample.R
> {code:java}
> ## util function, which calls itself
> getPreviousBusinessDate <- function(asofdate) {
> asdt <- asofdate;
> asdt <- as.Date(asofdate)-1;
> wd <- format(as.Date(asdt),"%A")
> if(wd == "Saturday" | wd == "Sunday") {
> return (getPreviousBusinessDate(asdt));
> }
> return (asdt);
> }
> ## closure which calls util function
> myclosure_func = function() {
> myclosure = list()
> get_some_date = function (random_date) {
> return (getPreviousBusinessDate(random_date))
> }
> myclosure$get_some_date = get_some_date
> return(myclosure)
> }
> {code}
> This seems to have been caused by sourcing sample.R twice: once before invoking 
> the Spark session and again within the Spark session.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304558#comment-17304558
 ] 

Hyukjin Kwon commented on SPARK-34791:
--

cc [~falaki] FYI

> SparkR throws node stack overflow
> -
>
> Key: SPARK-34791
> URL: https://issues.apache.org/jira/browse/SPARK-34791
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 3.0.1
>Reporter: obfuscated_dvlper
>Priority: Major
>
> SparkR throws "node stack overflow" error upon running code (sample below) on 
> R-4.0.2 with Spark 3.0.1.
> Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)
> {code:java}
> source('sample.R')
> myclsr = myclosure_func()
> myclsr$get_some_date('2021-01-01')
> ## spark.lapply throws node stack overflow
> result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
> source('sample.R')
> another_closure = myclosure_func()
> return(another_closure$get_some_date(rdate))
> })
> {code}
> Sample.R
> {code:java}
> ## util function, which calls itself
> getPreviousBusinessDate <- function(asofdate) {
> asdt <- asofdate;
> asdt <- as.Date(asofdate)-1;
> wd <- format(as.Date(asdt),"%A")
> if(wd == "Saturday" | wd == "Sunday") {
> return (getPreviousBusinessDate(asdt));
> }
> return (asdt);
> }
> ## closure which calls util function
> myclosure_func = function() {
> myclosure = list()
> get_some_date = function (random_date) {
> return (getPreviousBusinessDate(random_date))
> }
> myclosure$get_some_date = get_some_date
> return(myclosure)
> }
> {code}
> This seems to have been caused by sourcing sample.R twice: once before invoking 
> the Spark session and again within the Spark session.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304555#comment-17304555
 ] 

Hyukjin Kwon commented on SPARK-34791:
--

It's possibly a regression from SPARK-29777 ... yeah it would be great if we 
can confirm if it works with the latest versions.


> SparkR throws node stack overflow
> -
>
> Key: SPARK-34791
> URL: https://issues.apache.org/jira/browse/SPARK-34791
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 3.0.1
>Reporter: obfuscated_dvlper
>Priority: Major
>
> SparkR throws "node stack overflow" error upon running code (sample below) on 
> R-4.0.2 with Spark 3.0.1.
> Same piece of code works on R-3.3.3 with Spark 2.2.1 (& SparkR 2.4.5)
> {code:java}
> source('sample.R')
> myclsr = myclosure_func()
> myclsr$get_some_date('2021-01-01')
> ## spark.lapply throws node stack overflow
> result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
> source('sample.R')
> another_closure = myclosure_func()
> return(another_closure$get_some_date(rdate))
> })
> {code}
> Sample.R
> {code:java}
> ## util function, which calls itself
> getPreviousBusinessDate <- function(asofdate) {
> asdt <- asofdate;
> asdt <- as.Date(asofdate)-1;
> wd <- format(as.Date(asdt),"%A")
> if(wd == "Saturday" | wd == "Sunday") {
> return (getPreviousBusinessDate(asdt));
> }
> return (asdt);
> }
> ## closure which calls util function
> myclosure_func = function() {
> myclosure = list()
> get_some_date = function (random_date) {
> return (getPreviousBusinessDate(random_date))
> }
> myclosure$get_some_date = get_some_date
> return(myclosure)
> }
> {code}
> This seems to have been caused by sourcing sample.R twice: once before invoking 
> the Spark session and again within the Spark session.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-34796:
-
Priority: Critical  (was: Blocker)

> Codegen compilation error for query with LIMIT operator and without AQE
> ---
>
> Key: SPARK-34796
> URL: https://issues.apache.org/jira/browse/SPARK-34796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Cheng Su
>Priority: Critical
>
> Example (reproduced in unit test):
>  
> {code:java}
>   test("failed limit query") {
> withTable("left_table", "empty_right_table", "output_table") {
>   spark.range(5).toDF("k").write.saveAsTable("left_table")
>   spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
>  
>   withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
> spark.sql("CREATE TABLE output_table (k INT) USING parquet")
> spark.sql(
>   s"""
>  |INSERT INTO TABLE output_table
>  |SELECT t1.k FROM left_table t1
>  |JOIN empty_right_table t2
>  |ON t1.k = t2.k
>  |LIMIT 3
>  |""".stripMargin)
>   }
> }
>   }
> {code}
> Result:
>  
> https://gist.github.com/c21/ea760c75b546d903247582be656d9d66



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304554#comment-17304554
 ] 

Hyukjin Kwon commented on SPARK-34796:
--

[~chengsu] I am lowering the priority because AQE is enabled by default now, so 
fewer users are affected.

> Codegen compilation error for query with LIMIT operator and without AQE
> ---
>
> Key: SPARK-34796
> URL: https://issues.apache.org/jira/browse/SPARK-34796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Cheng Su
>Priority: Blocker
>
> Example (reproduced in unit test):
>  
> {code:java}
>   test("failed limit query") {
> withTable("left_table", "empty_right_table", "output_table") {
>   spark.range(5).toDF("k").write.saveAsTable("left_table")
>   spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
>  
>   withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
> spark.sql("CREATE TABLE output_table (k INT) USING parquet")
> spark.sql(
>   s"""
>  |INSERT INTO TABLE output_table
>  |SELECT t1.k FROM left_table t1
>  |JOIN empty_right_table t2
>  |ON t1.k = t2.k
>  |LIMIT 3
>  |""".stripMargin)
>   }
> }
>   }
> {code}
> Result:
>  
> https://gist.github.com/c21/ea760c75b546d903247582be656d9d66



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-34796:
-
Description: 
Example (reproduced in unit test):

 
{code:java}
  test("failed limit query") {
withTable("left_table", "empty_right_table", "output_table") {
  spark.range(5).toDF("k").write.saveAsTable("left_table")
  spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
 
  withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
spark.sql("CREATE TABLE output_table (k INT) USING parquet")
spark.sql(
  s"""
 |INSERT INTO TABLE output_table
 |SELECT t1.k FROM left_table t1
 |JOIN empty_right_table t2
 |ON t1.k = t2.k
 |LIMIT 3
 |""".stripMargin)
  }
}
  }
{code}
Result:

 

https://gist.github.com/c21/ea760c75b546d903247582be656d9d66

  was:
Example (reproduced in unit test):

 
{code:java}
  test("failed limit query") {
withTable("left_table", "empty_right_table", "output_table") {
  spark.range(5).toDF("k").write.saveAsTable("left_table")
  spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
 
  withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
spark.sql("CREATE TABLE output_table (k INT) USING parquet")
spark.sql(
  s"""
 |INSERT INTO TABLE output_table
 |SELECT t1.k FROM left_table t1
 |JOIN empty_right_table t2
 |ON t1.k = t2.k
 |LIMIT 3
 |""".stripMargin)
  }
}
  }
{code}
Result:
{code:java}
17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
17:46:01.540 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 54, Column 8: Expression "_limit_counter_1" is not an 
rvalueorg.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 54, Column 8: Expression "_limit_counter_1" is not an rvalue at 
org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) at 
org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7575)
 at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5766) 
at org.codehaus.janino.UnitCompiler.access$10700(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$18$1.visitAmbiguousName(UnitCompiler.java:5717)
 at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4429) at 
org.codehaus.janino.UnitCompiler$18.visitLvalue(UnitCompiler.java:5714) at 
org.codehaus.janino.Java$Lvalue.accept(Java.java:4353) at 
org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5710) at 
org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4161) at 
org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
 at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
 at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4133) at 
org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
 at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
 at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1854) at 
org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1501) 
at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1490) 
at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3245) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1490) at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1573) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3420) at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1362) 
at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1335) 
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:807) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:975) at 
org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:226) 

[jira] [Updated] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Cheng Su (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cheng Su updated SPARK-34796:
-
Description: 
Example (reproduced in unit test):

 
{code:java}
  test("failed limit query") {
withTable("left_table", "empty_right_table", "output_table") {
  spark.range(5).toDF("k").write.saveAsTable("left_table")
  spark.range(0).toDF("k").write.saveAsTable("empty_right_table")
 
  withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
spark.sql("CREATE TABLE output_table (k INT) USING parquet")
spark.sql(
  s"""
 |INSERT INTO TABLE output_table
 |SELECT t1.k FROM left_table t1
 |JOIN empty_right_table t2
 |ON t1.k = t2.k
 |LIMIT 3
 |""".stripMargin)
  }
}
  }
{code}
Result:
{code:java}
17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
17:46:01.540 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 54, Column 8: Expression "_limit_counter_1" is not an 
rvalueorg.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 54, Column 8: Expression "_limit_counter_1" is not an rvalue at 
org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) at 
org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7575)
 at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5766) 
at org.codehaus.janino.UnitCompiler.access$10700(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$18$1.visitAmbiguousName(UnitCompiler.java:5717)
 at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4429) at 
org.codehaus.janino.UnitCompiler$18.visitLvalue(UnitCompiler.java:5714) at 
org.codehaus.janino.Java$Lvalue.accept(Java.java:4353) at 
org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5710) at 
org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4161) at 
org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
 at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
 at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4133) at 
org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
 at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
 at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1854) at 
org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1501) 
at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1490) 
at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3245) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1490) at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1573) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3420) at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1362) 
at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1335) 
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:807) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:975) at 
org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:392)
 at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:384)
 at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1445) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:384) at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1312)
 at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:833) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:410) at 
org.codehaus.janino.UnitCompiler.access$400(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$2.visitPackageMemberClassDeclaration(UnitCompiler.java:389)
 at 

[jira] [Commented] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304550#comment-17304550
 ] 

Cheng Su commented on SPARK-34796:
--

FYI I am working on a PR to fix it now.

> Codegen compilation error for query with LIMIT operator and without AQE
> ---
>
> Key: SPARK-34796
> URL: https://issues.apache.org/jira/browse/SPARK-34796
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Cheng Su
>Priority: Blocker
>
> Example (reproduced in unit test):
>  
> {code:java}
>   test("failed limit query") {
> withTable("left_table", "empty_right_table", "output_table") {
>   spark.range(5).toDF("k").write.saveAsTable("left_table")
>   spark.range(0).toDF("k").write.saveAsTable("empty_right_table")  
> withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
> spark.sql("CREATE TABLE output_table (k INT) USING parquet")
> spark.sql(
>   s"""
>  |INSERT INTO TABLE output_table
>  |SELECT t1.k FROM left_table t1
>  |JOIN empty_right_table t2
>  |ON t1.k = t2.k
>  |LIMIT 3
>  |""".stripMargin)
>   }
> }
>   }
> {code}
> Result:
> {code:java}
> 17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable 
> to load native-hadoop library for your platform... using builtin-java classes 
> where applicable
> 17:46:01.540 ERROR 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
> compile: org.codehaus.commons.compiler.CompileException: File 
> 'generated.java', Line 54, Column 8: Expression "_limit_counter_1" is not an 
> rvalueorg.codehaus.commons.compiler.CompileException: File 'generated.java', 
> Line 54, Column 8: Expression "_limit_counter_1" is not an rvalue at 
> org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) at 
> org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7575)
>  at 
> org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5766) at 
> org.codehaus.janino.UnitCompiler.access$10700(UnitCompiler.java:226) at 
> org.codehaus.janino.UnitCompiler$18$1.visitAmbiguousName(UnitCompiler.java:5717)
>  at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4429) at 
> org.codehaus.janino.UnitCompiler$18.visitLvalue(UnitCompiler.java:5714) at 
> org.codehaus.janino.Java$Lvalue.accept(Java.java:4353) at 
> org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5710) at 
> org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4161) at 
> org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
> org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
>  at 
> org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
>  at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
> org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
> org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4133) at 
> org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
> org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
>  at 
> org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
>  at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
> org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
> org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1854) at 
> org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:226) at 
> org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1501)
>  at 
> org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1490)
>  at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3245) at 
> org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1490) at 
> org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1573) at 
> org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3420) at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1362)
>  at 
> org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1335)
>  at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:807) at 
> org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:975) at 
> org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:226) at 
> org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:392)
>  at 
> 

[jira] [Created] (SPARK-34796) Codegen compilation error for query with LIMIT operator and without AQE

2021-03-18 Thread Cheng Su (Jira)
Cheng Su created SPARK-34796:


 Summary: Codegen compilation error for query with LIMIT operator 
and without AQE
 Key: SPARK-34796
 URL: https://issues.apache.org/jira/browse/SPARK-34796
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.1.0, 3.2.0
Reporter: Cheng Su


Example (reproduced in unit test):

 
{code:java}
  test("failed limit query") {
withTable("left_table", "empty_right_table", "output_table") {
  spark.range(5).toDF("k").write.saveAsTable("left_table")
  spark.range(0).toDF("k").write.saveAsTable("empty_right_table")  
withSQLConf(SQLConf.ADAPTIVE_EXECUTION_ENABLED.key -> "false") {
spark.sql("CREATE TABLE output_table (k INT) USING parquet")
spark.sql(
  s"""
 |INSERT INTO TABLE output_table
 |SELECT t1.k FROM left_table t1
 |JOIN empty_right_table t2
 |ON t1.k = t2.k
 |LIMIT 3
 |""".stripMargin)
  }
}
  }
{code}
Result:
{code:java}
17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load 
native-hadoop library for your platform... using builtin-java classes where 
applicable17:45:52.720 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to 
load native-hadoop library for your platform... using builtin-java classes 
where applicable
17:46:01.540 ERROR 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: failed to 
compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 54, Column 8: Expression "_limit_counter_1" is not an 
rvalueorg.codehaus.commons.compiler.CompileException: File 'generated.java', 
Line 54, Column 8: Expression "_limit_counter_1" is not an rvalue at 
org.codehaus.janino.UnitCompiler.compileError(UnitCompiler.java:12021) at 
org.codehaus.janino.UnitCompiler.toRvalueOrCompileException(UnitCompiler.java:7575)
 at org.codehaus.janino.UnitCompiler.getConstantValue2(UnitCompiler.java:5766) 
at org.codehaus.janino.UnitCompiler.access$10700(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$18$1.visitAmbiguousName(UnitCompiler.java:5717)
 at org.codehaus.janino.Java$AmbiguousName.accept(Java.java:4429) at 
org.codehaus.janino.UnitCompiler$18.visitLvalue(UnitCompiler.java:5714) at 
org.codehaus.janino.Java$Lvalue.accept(Java.java:4353) at 
org.codehaus.janino.UnitCompiler.getConstantValue(UnitCompiler.java:5710) at 
org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4161) at 
org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
 at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
 at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
org.codehaus.janino.UnitCompiler.compileBoolean2(UnitCompiler.java:4133) at 
org.codehaus.janino.UnitCompiler.access$6600(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:4008)
 at 
org.codehaus.janino.UnitCompiler$14.visitBinaryOperation(UnitCompiler.java:3986)
 at org.codehaus.janino.Java$BinaryOperation.accept(Java.java:5077) at 
org.codehaus.janino.UnitCompiler.compileBoolean(UnitCompiler.java:3986) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:1854) at 
org.codehaus.janino.UnitCompiler.access$2200(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1501) 
at 
org.codehaus.janino.UnitCompiler$6.visitWhileStatement(UnitCompiler.java:1490) 
at org.codehaus.janino.Java$WhileStatement.accept(Java.java:3245) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:1490) at 
org.codehaus.janino.UnitCompiler.compileStatements(UnitCompiler.java:1573) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:3420) at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1362) 
at 
org.codehaus.janino.UnitCompiler.compileDeclaredMethods(UnitCompiler.java:1335) 
at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:807) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:975) at 
org.codehaus.janino.UnitCompiler.access$700(UnitCompiler.java:226) at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:392)
 at 
org.codehaus.janino.UnitCompiler$2.visitMemberClassDeclaration(UnitCompiler.java:384)
 at org.codehaus.janino.Java$MemberClassDeclaration.accept(Java.java:1445) at 
org.codehaus.janino.UnitCompiler.compile(UnitCompiler.java:384) at 
org.codehaus.janino.UnitCompiler.compileDeclaredMemberTypes(UnitCompiler.java:1312)
 at org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:833) at 
org.codehaus.janino.UnitCompiler.compile2(UnitCompiler.java:410) at 

[jira] [Resolved] (SPARK-34781) Eliminate LEFT SEMI/ANTI join to its left child side with AQE

2021-03-18 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-34781.
--
Fix Version/s: 3.2.0
 Assignee: Cheng Su
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/31873

> Eliminate LEFT SEMI/ANTI join to its left child side with AQE
> -
>
> Key: SPARK-34781
> URL: https://issues.apache.org/jira/browse/SPARK-34781
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Trivial
> Fix For: 3.2.0
>
>
> In `EliminateJoinToEmptyRelation.scala`, we can extend it to cover more cases 
> for LEFT SEMI and LEFT ANTI joins (a minimal sketch of the two rules follows below):
>  # Join is a left semi join, the join's right side is non-empty, and the join 
> condition is empty. Eliminate the join to its left child.
>  # Join is a left anti join and the join's right side is empty. Eliminate the 
> join to its left child.
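A minimal sketch of the two rules above, assuming AQE runtime statistics are available
for the right side (the `rightSideIsEmpty` / `rightSideIsNonEmpty` helpers below are
hypothetical stand-ins, not Spark's actual API for this):

{code:java}
import org.apache.spark.sql.catalyst.plans.{LeftAnti, LeftSemi}
import org.apache.spark.sql.catalyst.plans.logical.{Join, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative only: the real rule lives in EliminateJoinToEmptyRelation.scala and
// reads row counts from materialized AQE query stages.
object EliminateSemiAntiJoinSketch extends Rule[LogicalPlan] {
  // Hypothetical helpers standing in for "what did AQE observe about the right side?"
  private def rightSideIsEmpty(plan: LogicalPlan): Boolean = ???
  private def rightSideIsNonEmpty(plan: LogicalPlan): Boolean = ???

  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformDown {
    // Rule 1: left semi join, non-empty right side, no condition -> every left row matches.
    case Join(left, right, LeftSemi, None, _) if rightSideIsNonEmpty(right) => left
    // Rule 2: left anti join, empty right side -> no left row can ever be filtered out.
    case Join(left, right, LeftAnti, _, _) if rightSideIsEmpty(right) => left
  }
}
{code}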



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34596) NewInstance.doGenCode should not throw malformed class name error

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304539#comment-17304539
 ] 

Apache Spark commented on SPARK-34596:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/31888

> NewInstance.doGenCode should not throw malformed class name error
> -
>
> Key: SPARK-34596
> URL: https://issues.apache.org/jira/browse/SPARK-34596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.2, 3.1.0
>Reporter: Kris Mok
>Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> Similar to SPARK-32238 and SPARK-32999, the use of 
> {{java.lang.Class.getSimpleName}} in {{NewInstance.doGenCode}} is problematic 
> because Scala classes may trigger {{java.lang.InternalError: Malformed class 
> name}}.
> This happens more often when using nested classes in Scala (or declaring 
> classes in Scala REPL which implies class nesting).
> Note that on newer versions of JDK the underlying malformed class name no 
> longer reproduces (fixed in the JDK by 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057919), so it's less 
> of an issue there. But on JDK8u this problem still exists so we still have to 
> fix it.
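For context, a minimal sketch of the kind of fallback that avoids the JDK8 error
(illustrative only, not the exact helper Spark uses):

{code:java}
object SafeClassName {
  // Fall back to deriving a simple name from getName when getSimpleName blows up
  // with java.lang.InternalError("Malformed class name") on JDK 8.
  def getSimpleName(cls: Class[_]): String = {
    try {
      cls.getSimpleName
    } catch {
      case _: InternalError =>
        val fullName = cls.getName
        // Drop the package prefix, then keep the last non-empty '$'-separated segment.
        val afterPackage = fullName.substring(fullName.lastIndexOf('.') + 1)
        afterPackage.split('$').filter(_.nonEmpty).lastOption.getOrElse(afterPackage)
    }
  }
}
{code}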



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34596) NewInstance.doGenCode should not throw malformed class name error

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304538#comment-17304538
 ] 

Apache Spark commented on SPARK-34596:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/31888

> NewInstance.doGenCode should not throw malformed class name error
> -
>
> Key: SPARK-34596
> URL: https://issues.apache.org/jira/browse/SPARK-34596
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.2, 3.1.0
>Reporter: Kris Mok
>Priority: Major
> Fix For: 3.2.0, 3.1.2, 3.0.3
>
>
> Similar to SPARK-32238 and SPARK-32999, the use of 
> {{java.lang.Class.getSimpleName}} in {{NewInstance.doGenCode}} is problematic 
> because Scala classes may trigger {{java.lang.InternalError: Malformed class 
> name}}.
> This happens more often when using nested classes in Scala (or declaring 
> classes in Scala REPL which implies class nesting).
> Note that on newer versions of JDK the underlying malformed class name no 
> longer reproduces (fixed in the JDK by 
> https://bugs.java.com/bugdatabase/view_bug.do?bug_id=8057919), so it's less 
> of an issue there. But on JDK8u this problem still exists so we still have to 
> fix it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34794) Nested higher-order functions broken in DSL

2021-03-18 Thread Daniel Solow (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304533#comment-17304533
 ] 

Daniel Solow commented on SPARK-34794:
--

PR is at https://github.com/apache/spark/pull/31887

Thanks to [~rspitzer] for suggesting AtomicInteger
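For reference, a minimal sketch of the idea (an assumption on my part, not necessarily
the exact code in the PR): generate a fresh lambda variable name per call instead of
the hard-coded "x", so nested {{transform}} lambdas no longer collide.

{code:java}
import java.util.concurrent.atomic.AtomicInteger

object FreshLambdaNames {
  private val counter = new AtomicInteger(0)

  // e.g. "x_0", "x_1", ... -- each nested lambda gets its own variable name,
  // so the inner lambda parameter no longer shadows the outer one.
  def apply(prefix: String = "x"): String = s"${prefix}_${counter.getAndIncrement()}"
}
{code}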

> Nested higher-order functions broken in DSL
> ---
>
> Key: SPARK-34794
> URL: https://issues.apache.org/jira/browse/SPARK-34794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: 3.1.1
>Reporter: Daniel Solow
>Priority: Major
>
> In Spark 3, if I have:
> {code:java}
> val df = Seq(
> (Seq(1,2,3), Seq("a", "b", "c"))
> ).toDF("numbers", "letters")
> {code}
> and I want to take the cross product of these two arrays, I can do the 
> following in SQL:
> {code:java}
> df.selectExpr("""
> FLATTEN(
> TRANSFORM(
> numbers,
> number -> TRANSFORM(
> letters,
> letter -> (number AS number, letter AS letter)
> )
> )
> ) AS zipped
> """).show(false)
> ++
> |zipped  |
> ++
> |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
> ++
> {code}
> This works fine. But when I try the equivalent using the scala DSL, the 
> result is wrong:
> {code:java}
> df.select(
> f.flatten(
> f.transform(
> $"numbers",
> (number: Column) => { f.transform(
> $"letters",
> (letter: Column) => { f.struct(
> number.as("number"),
> letter.as("letter")
> ) }
> ) }
> )
> ).as("zipped")
> ).show(10, false)
> ++
> |zipped  |
> ++
> |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
> ++
> {code}
> Note that the numbers are not included in the output. The explain for this 
> second version is:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
> lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
> NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
> false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Analyzed Logical Plan ==
> zipped: array<struct<number:int,letter:string>>
> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
> lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
> x#446, false)), lambda x#445, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Optimized Logical Plan ==
> LocalRelation [zipped#444]
> == Physical Plan ==
> LocalTableScan [zipped#444]
> {code}
> Seems like variable name x is hardcoded. And sure enough: 
> https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34794) Nested higher-order functions broken in DSL

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304532#comment-17304532
 ] 

Apache Spark commented on SPARK-34794:
--

User 'dmsolow' has created a pull request for this issue:
https://github.com/apache/spark/pull/31887

> Nested higher-order functions broken in DSL
> ---
>
> Key: SPARK-34794
> URL: https://issues.apache.org/jira/browse/SPARK-34794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: 3.1.1
>Reporter: Daniel Solow
>Priority: Major
>
> In Spark 3, if I have:
> {code:java}
> val df = Seq(
> (Seq(1,2,3), Seq("a", "b", "c"))
> ).toDF("numbers", "letters")
> {code}
> and I want to take the cross product of these two arrays, I can do the 
> following in SQL:
> {code:java}
> df.selectExpr("""
> FLATTEN(
> TRANSFORM(
> numbers,
> number -> TRANSFORM(
> letters,
> letter -> (number AS number, letter AS letter)
> )
> )
> ) AS zipped
> """).show(false)
> ++
> |zipped  |
> ++
> |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
> ++
> {code}
> This works fine. But when I try the equivalent using the scala DSL, the 
> result is wrong:
> {code:java}
> df.select(
> f.flatten(
> f.transform(
> $"numbers",
> (number: Column) => { f.transform(
> $"letters",
> (letter: Column) => { f.struct(
> number.as("number"),
> letter.as("letter")
> ) }
> ) }
> )
> ).as("zipped")
> ).show(10, false)
> ++
> |zipped  |
> ++
> |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
> ++
> {code}
> Note that the numbers are not included in the output. The explain for this 
> second version is:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
> lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
> NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
> false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Analyzed Logical Plan ==
> zipped: array<struct<number:int,letter:string>>
> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
> lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
> x#446, false)), lambda x#445, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Optimized Logical Plan ==
> LocalRelation [zipped#444]
> == Physical Plan ==
> LocalTableScan [zipped#444]
> {code}
> Seems like variable name x is hardcoded. And sure enough: 
> https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34794) Nested higher-order functions broken in DSL

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34794:


Assignee: (was: Apache Spark)

> Nested higher-order functions broken in DSL
> ---
>
> Key: SPARK-34794
> URL: https://issues.apache.org/jira/browse/SPARK-34794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: 3.1.1
>Reporter: Daniel Solow
>Priority: Major
>
> In Spark 3, if I have:
> {code:java}
> val df = Seq(
> (Seq(1,2,3), Seq("a", "b", "c"))
> ).toDF("numbers", "letters")
> {code}
> and I want to take the cross product of these two arrays, I can do the 
> following in SQL:
> {code:java}
> df.selectExpr("""
> FLATTEN(
> TRANSFORM(
> numbers,
> number -> TRANSFORM(
> letters,
> letter -> (number AS number, letter AS letter)
> )
> )
> ) AS zipped
> """).show(false)
> ++
> |zipped  |
> ++
> |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
> ++
> {code}
> This works fine. But when I try the equivalent using the scala DSL, the 
> result is wrong:
> {code:java}
> df.select(
> f.flatten(
> f.transform(
> $"numbers",
> (number: Column) => { f.transform(
> $"letters",
> (letter: Column) => { f.struct(
> number.as("number"),
> letter.as("letter")
> ) }
> ) }
> )
> ).as("zipped")
> ).show(10, false)
> ++
> |zipped  |
> ++
> |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
> ++
> {code}
> Note that the numbers are not included in the output. The explain for this 
> second version is:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
> lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
> NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
> false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Analyzed Logical Plan ==
> zipped: array<struct<number:int,letter:string>>
> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
> lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
> x#446, false)), lambda x#445, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Optimized Logical Plan ==
> LocalRelation [zipped#444]
> == Physical Plan ==
> LocalTableScan [zipped#444]
> {code}
> Seems like variable name x is hardcoded. And sure enough: 
> https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34794) Nested higher-order functions broken in DSL

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34794:


Assignee: Apache Spark

> Nested higher-order functions broken in DSL
> ---
>
> Key: SPARK-34794
> URL: https://issues.apache.org/jira/browse/SPARK-34794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
> Environment: 3.1.1
>Reporter: Daniel Solow
>Assignee: Apache Spark
>Priority: Major
>
> In Spark 3, if I have:
> {code:java}
> val df = Seq(
> (Seq(1,2,3), Seq("a", "b", "c"))
> ).toDF("numbers", "letters")
> {code}
> and I want to take the cross product of these two arrays, I can do the 
> following in SQL:
> {code:java}
> df.selectExpr("""
> FLATTEN(
> TRANSFORM(
> numbers,
> number -> TRANSFORM(
> letters,
> letter -> (number AS number, letter AS letter)
> )
> )
> ) AS zipped
> """).show(false)
> ++
> |zipped  |
> ++
> |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
> ++
> {code}
> This works fine. But when I try the equivalent using the scala DSL, the 
> result is wrong:
> {code:java}
> df.select(
> f.flatten(
> f.transform(
> $"numbers",
> (number: Column) => { f.transform(
> $"letters",
> (letter: Column) => { f.struct(
> number.as("number"),
> letter.as("letter")
> ) }
> ) }
> )
> ).as("zipped")
> ).show(10, false)
> ++
> |zipped  |
> ++
> |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
> ++
> {code}
> Note that the numbers are not included in the output. The explain for this 
> second version is:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
> lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
> NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
> false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Analyzed Logical Plan ==
> zipped: array<struct<number:int,letter:string>>
> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
> lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
> x#446, false)), lambda x#445, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Optimized Logical Plan ==
> LocalRelation [zipped#444]
> == Physical Plan ==
> LocalTableScan [zipped#444]
> {code}
> Seems like variable name x is hardcoded. And sure enough: 
> https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27790) Support ANSI SQL INTERVAL types

2021-03-18 Thread Simeon Simeonov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304526#comment-17304526
 ] 

Simeon Simeonov commented on SPARK-27790:
-

Maxim, this is good stuff. 

Does ANSI SQL allow operations on dates using the YEAR-MONTH interval type? I 
didn't see that mentioned in Milestone 1.

> Support ANSI SQL INTERVAL types
> ---
>
> Key: SPARK-27790
> URL: https://issues.apache.org/jira/browse/SPARK-27790
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Spark has an INTERVAL data type, but it is “broken”:
> # It cannot be persisted
> # It is not comparable because it crosses the month/day line. That is, there 
> is no telling whether one “1 Month 1 Day” equals another “1 Month 1 Day”, since 
> not all months have the same number of days.
> I propose here to introduce the two flavours of INTERVAL as described in the 
> ANSI SQL Standard and deprecate Spark's interval type.
> * ANSI describes two non-overlapping “classes”: 
> ** YEAR-MONTH, 
> ** DAY-SECOND ranges
> * Members within each class can be compared and sorted.
> * Supports datetime arithmetic
> * Can be persisted.
> The old and new flavors of INTERVAL can coexist until Spark INTERVAL is 
> eventually retired. Also any semantic “breakage” can be controlled via legacy 
> config settings. 
> *Milestone 1* -- Spark Interval equivalency (the new interval types meet 
> or exceed all functionality of the existing SQL Interval):
> * Add two new DataType implementations for interval year-month and 
> day-second. Includes the JSON format and DDL string.
> * Infra support: check the caller sides of DateType/TimestampType
> * Support the two new interval types in Dataset/UDF.
> * Interval literals (with a legacy config to still allow mixed year-month 
> day-seconds fields and return legacy interval values)
> * Interval arithmetic (interval * num, interval / num, interval +/- interval)
> * Datetime functions/operators: Datetime - Datetime (to days or day second), 
> Datetime +/- interval
> * Cast to and from the two new interval types, cast string to interval, cast 
> interval to string (pretty printing), with the SQL syntax to specify the types
> * Support sorting intervals.
> *Milestone 2* -- Persistence:
> * Ability to create tables of type interval
> * Ability to write to common file formats such as Parquet and JSON.
> * INSERT, SELECT, UPDATE, MERGE
> * Discovery
> *Milestone 3* --  Client support
> * JDBC support
> * Hive Thrift server
> *Milestone 4* -- PySpark and Spark R integration
> * Python UDF can take and return intervals
> * DataFrame support



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34795:


Assignee: (was: Apache Spark)

> Adds a new job in GitHub Actions to check the output of TPC-DS queries
> --
>
> Key: SPARK-34795
> URL: https://issues.apache.org/jira/browse/SPARK-34795
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at adding a new job in GitHub Actions to check the output of 
> TPC-DS queries. There are some cases where we noticed runtime-related bugs 
> after merging commits (e.g. SPARK-33822). Therefore, I think it is worth 
> adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34795:


Assignee: Apache Spark

> Adds a new job in GitHub Actions to check the output of TPC-DS queries
> --
>
> Key: SPARK-34795
> URL: https://issues.apache.org/jira/browse/SPARK-34795
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Major
>
> This ticket aims at adding a new job in GitHub Actions to check the output of 
> TPC-DS queries. There are some cases where we noticed runtime-related bugs 
> after merging commits (e.g. SPARK-33822). Therefore, I think it is worth 
> adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304521#comment-17304521
 ] 

Apache Spark commented on SPARK-34795:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/31886

> Adds a new job in GitHub Actions to check the output of TPC-DS queries
> --
>
> Key: SPARK-34795
> URL: https://issues.apache.org/jira/browse/SPARK-34795
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at adding a new job in GitHub Actions to check the output of 
> TPC-DS queries. There are some cases where we noticed runtime-related bugs 
> after merging commits (e.g. SPARK-33822). Therefore, I think it is worth 
> adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-18 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34795:
-
Description: This ticket aims at adding a new job in GitHub Actions to 
check the output of TPC-DS queries. There are some cases where we noticed 
runtime-related bugs after merging commits (e.g. SPARK-33822). Therefore, I 
think it is worth adding a new job in GitHub Actions to check query output of 
TPC-DS (sf=1).  (was: This ticket aims at adding a new job in GitHub Actions to 
check the output of TPC-DS queries. There are some cases where we noticed 
runtime-related bugs far after merging commits (e.g. SPARK-33822). Therefore, 
I think it is worth adding a new job in GitHub Actions to check query output of 
TPC-DS (sf=1).)

> Adds a new job in GitHub Actions to check the output of TPC-DS queries
> --
>
> Key: SPARK-34795
> URL: https://issues.apache.org/jira/browse/SPARK-34795
> Project: Spark
>  Issue Type: Test
>  Components: SQL, Tests
>Affects Versions: 3.2.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> This ticket aims at adding a new job in GitHub Actions to check the output of 
> TPC-DS queries. There are some cases where we noticed runtime-related bugs 
> after merging commits (e.g. SPARK-33822). Therefore, I think it is worth 
> adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34795) Adds a new job in GitHub Actions to check the output of TPC-DS queries

2021-03-18 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-34795:


 Summary: Adds a new job in GitHub Actions to check the output of 
TPC-DS queries
 Key: SPARK-34795
 URL: https://issues.apache.org/jira/browse/SPARK-34795
 Project: Spark
  Issue Type: Test
  Components: SQL, Tests
Affects Versions: 3.2.0
Reporter: Takeshi Yamamuro


This ticket aims at adding a new job in GitHub Actions to check the output of 
TPC-DS queries. There are some cases where we noticed runtime-related bugs far 
after merging commits (e.g. SPARK-33822). Therefore, I think it is worth 
adding a new job in GitHub Actions to check query output of TPC-DS (sf=1).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34787) Option variable in Spark historyServer log should be displayed as actual value instead of Some(XX)

2021-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34787:
--
Affects Version/s: (was: 3.0.1)
   3.2.0

> Option variable in Spark historyServer log should be displayed as actual 
> value instead of Some(XX)
> --
>
> Key: SPARK-34787
> URL: https://issues.apache.org/jira/browse/SPARK-34787
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: akiyamaneko
>Priority: Minor
>
> Option variable in Spark historyServer log should be displayed as actual 
> value instead of Some(XX):
> {code:html}
> 21/02/25 10:10:45 INFO ApplicationCache: Failed to load application attempt 
> application_1613641231234_0421/Some(1)
> 21/02/25 10:10:52 INFO FsHistoryProvider: Parsing 
> hdfs://graph-product-001:8020/system/spark2-history/application_1613641231234_0421
> for listing data...
> {code}
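A minimal sketch of the intended change (hypothetical names, illustrative only): unwrap
the {{Option}} before interpolating it into the log message so "Some(1)" is not printed
verbatim.

{code:java}
val appId = "application_1613641231234_0421"
val attemptId: Option[String] = Some("1")

// Before: s"$appId/$attemptId" renders as "application_.../Some(1)".
// After: only append the attempt id when it is present.
val display = attemptId.map(id => s"$appId/$id").getOrElse(appId)
// logInfo(s"Failed to load application attempt $display")
{code}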



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304511#comment-17304511
 ] 

Dongjoon Hyun commented on SPARK-34788:
---

Could you be more specific, [~Ngone51]? When the disk is full, many parts are 
involved. In which code path do you see `FileNotFoundException` instead of 
`IOException`?

> Spark throws FileNotFoundException instead of IOException when disk is full
> ---
>
> Key: SPARK-34788
> URL: https://issues.apache.org/jira/browse/SPARK-34788
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.4.7, 3.0.0, 3.1.0
>Reporter: wuyi
>Priority: Major
>
> When the disk is full, Spark throws FileNotFoundException instead of an 
> IOException with a disk-full hint. This is quite a confusing error for users.
>  
> And there is probably a way to detect a full disk: when we get 
> `FileNotFoundException`, we try the approach in 
> [http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] 
> to see whether SyncFailedException is thrown. If it is, we throw an IOException 
> with the disk-full hint.
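A minimal sketch of the detection idea described above (illustrative only, not Spark's
actual code; the probe-file name is made up):

{code:java}
import java.io.{File, FileNotFoundException, FileOutputStream, IOException, SyncFailedException}

object DiskFullHint {
  // On FileNotFoundException, probe the parent directory: if we cannot write and
  // fsync a tiny file there, assume the disk is full and raise a clearer IOException.
  def openWithDiskFullHint(file: File): FileOutputStream = {
    try {
      new FileOutputStream(file)
    } catch {
      case fnf: FileNotFoundException if diskLooksFull(file) =>
        throw new IOException(s"Failed to create $file; the disk may be full", fnf)
    }
  }

  private def diskLooksFull(file: File): Boolean = {
    val probe = new File(Option(file.getParentFile).getOrElse(new File(".")), ".disk_full_probe")
    try {
      val out = new FileOutputStream(probe)
      try {
        out.write(0)
        out.getFD.sync() // SyncFailedException here is the disk-full signal
        false
      } finally {
        out.close()
        probe.delete()
      }
    } catch {
      case _: SyncFailedException => true
      case _: IOException => true // could not even create or write the probe file
    }
  }
}
{code}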



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34788) Spark throws FileNotFoundException instead of IOException when disk is full

2021-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34788:
--
Affects Version/s: (was: 2.4.7)
   (was: 3.1.0)
   (was: 3.0.0)
   3.2.0

> Spark throws FileNotFoundException instead of IOException when disk is full
> ---
>
> Key: SPARK-34788
> URL: https://issues.apache.org/jira/browse/SPARK-34788
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 3.2.0
>Reporter: wuyi
>Priority: Major
>
> When the disk is full, Spark throws FileNotFoundException instead of an 
> IOException with a disk-full hint. This is quite a confusing error for users.
>  
> And there is probably a way to detect a full disk: when we get 
> `FileNotFoundException`, we try the approach in 
> [http://weblog.janek.org/Archive/2004/12/20/ExceptionWhenWritingToAFu.html] 
> to see whether SyncFailedException is thrown. If it is, we throw an IOException 
> with the disk-full hint.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304509#comment-17304509
 ] 

Dongjoon Hyun commented on SPARK-34789:
---

Thank you for working on this, [~attilapiros]. This will be useful.

> Introduce Jetty based construct for integration tests where HTTP(S) is used
> ---
>
> Key: SPARK-34789
> URL: https://issues.apache.org/jira/browse/SPARK-34789
> Project: Spark
>  Issue Type: Task
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up during 
> https://github.com/apache/spark/pull/31877#discussion_r596831803.
> Short summary: we have some tests where HTTP(S) is used to access files. The 
> current solution uses GitHub URLs like 
> "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
> This ties two Spark versions together in an unhealthy way: the "master" branch, 
> which is a moving target, is coupled to the committed test code, which does not 
> move (and might even be released).
> So a test running against an earlier version of Spark expects something 
> (filename, content, path) from a later release, and, what is worse, when the 
> moving version changes, the earlier test will break. 
> The idea is to introduce a method like:
> {noformat}
> withHttpServer(files) {
> }
> {noformat}
> which uses a Jetty ResourceHandler to serve the listed files (or directories, 
> or just the root it is started from) and stops the server in a finally block 
> (see the sketch below).
>  
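A minimal sketch of such a helper (a sketch under the assumptions above, serving a
resource base directory with the Jetty 9 API, rather than an explicit file list):

{code:java}
import org.eclipse.jetty.server.Server
import org.eclipse.jetty.server.handler.ResourceHandler

object HttpTestServer {
  // Start an embedded Jetty server on a random free port, serve `resourceBase`,
  // run `body` against the server's base URI, and always stop the server afterwards.
  def withHttpServer[T](resourceBase: String)(body: java.net.URI => T): T = {
    val handler = new ResourceHandler()
    handler.setResourceBase(resourceBase)
    val server = new Server(0) // port 0 = pick any free port
    server.setHandler(handler)
    server.start()
    try {
      body(server.getURI)
    } finally {
      server.stop()
    }
  }
}

// Example usage in a test:
// HttpTestServer.withHttpServer("data/mllib") { base =>
//   val url = base.resolve("pagerank_data.txt").toString
//   // run the test against `url` instead of a hard-coded GitHub URL
// }
{code}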



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34789) Introduce Jetty based construct for integration tests where HTTP(S) is used

2021-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34789:
--
Issue Type: Improvement  (was: Task)

> Introduce Jetty based construct for integration tests where HTTP(S) is used
> ---
>
> Key: SPARK-34789
> URL: https://issues.apache.org/jira/browse/SPARK-34789
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.2.0
>Reporter: Attila Zsolt Piros
>Priority: Major
>
> This came up during 
> https://github.com/apache/spark/pull/31877#discussion_r596831803.
> Short summary: we have some tests where HTTP(S) is used to access files. The 
> current solution uses GitHub URLs like 
> "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
> This ties two Spark versions together in an unhealthy way: the "master" branch, 
> which is a moving target, is coupled to the committed test code, which does not 
> move (and might even be released).
> So a test running against an earlier version of Spark expects something 
> (filename, content, path) from a later release, and, what is worse, when the 
> moving version changes, the earlier test will break. 
> The idea is to introduce a method like:
> {noformat}
> withHttpServer(files) {
> }
> {noformat}
> which uses a Jetty ResourceHandler to serve the listed files (or directories, 
> or just the root it is started from) and stops the server in a finally block.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304508#comment-17304508
 ] 

Dongjoon Hyun commented on SPARK-34790:
---

According to the issue title, I am converting this into a subtask of SPARK-33828.

> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, lots of test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.
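One possible mitigation while the incompatibility exists (an assumption on my part, not
the eventual fix): keep I/O encryption on but turn off batched fetching of shuffle
blocks, e.g.:

{code:java}
import org.apache.spark.sql.SparkSession

// Sketch: disable AQE's batched shuffle-block fetching when I/O encryption is enabled.
val spark = SparkSession.builder()
  .appName("io-encryption-without-batch-fetch")
  .config("spark.io.encryption.enabled", "true")
  .config("spark.sql.adaptive.fetchShuffleBlocksInBatch", "false")
  .getOrCreate()
{code}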



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34790:
--
Parent: SPARK-33828
Issue Type: Sub-task  (was: Bug)

> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, lots of test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34790) Fail in fetch shuffle blocks in batch when i/o encryption is enabled.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304507#comment-17304507
 ] 

Dongjoon Hyun commented on SPARK-34790:
---

Thank you for reporting, [~hezuojiao]. However, it may not be a bug. Sometimes a 
test suite like AdaptiveQueryExecSuite makes certain assumptions. Could you be more 
specific by posting the failure output you saw?


> Fail in fetch shuffle blocks in batch when i/o encryption is enabled.
> -
>
> Key: SPARK-34790
> URL: https://issues.apache.org/jira/browse/SPARK-34790
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.1
>Reporter: hezuojiao
>Priority: Major
>
> When spark.io.encryption.enabled=true is set, lots of test cases in 
> AdaptiveQueryExecSuite fail. Fetching shuffle blocks in batch is 
> incompatible with I/O encryption.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34791) SparkR throws node stack overflow

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304504#comment-17304504
 ] 

Dongjoon Hyun commented on SPARK-34791:
---

Thank you for reporting, [~jeetendray]. Could you try to use the latest 
version, Apache Spark 3.1.1?
Our CI is using R 4.0.3.
{code}
root@e3c1c7d8cbf8:/# R

R version 4.0.3 (2020-10-10) -- "Bunny-Wunnies Freak Out"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
{code}

> SparkR throws node stack overflow
> -
>
> Key: SPARK-34791
> URL: https://issues.apache.org/jira/browse/SPARK-34791
> Project: Spark
>  Issue Type: Question
>  Components: SparkR
>Affects Versions: 3.0.1
>Reporter: obfuscated_dvlper
>Priority: Major
>
> SparkR throws a "node stack overflow" error when running the code (sample below) on 
> R 4.0.2 with Spark 3.0.1.
> The same piece of code works on R 3.3.3 with Spark 2.2.1 (& SparkR 2.4.5).
> {code:java}
> source('sample.R')
> myclsr = myclosure_func()
> myclsr$get_some_date('2021-01-01')
> ## spark.lapply throws node stack overflow
> result = spark.lapply(c('2021-01-01', '2021-01-02'), function (rdate) {
> source('sample.R')
> another_closure = myclosure_func()
> return(another_closure$get_some_date(rdate))
> })
> {code}
> Sample.R
> {code:java}
> ## util function, which calls itself
> getPreviousBusinessDate <- function(asofdate) {
> asdt <- asofdate;
> asdt <- as.Date(asofdate)-1;
> wd <- format(as.Date(asdt),"%A")
> if(wd == "Saturday" | wd == "Sunday") {
> return (getPreviousBusinessDate(asdt));
> }
> return (asdt);
> }
> ## closure which calls util function
> myclosure_func = function() {
> myclosure = list()
> get_some_date = function (random_date) {
> return (getPreviousBusinessDate(random_date))
> }
> myclosure$get_some_date = get_some_date
> return(myclosure)
> }
> {code}
> This seems to have been caused by sourcing sample.R twice: once before starting the 
> Spark session and again within the Spark session.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34792) Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3

2021-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34792.
---
Resolution: Not A Problem

Hi, [~kondziolka9ld]. 
This doesn't look like a bug. The results can differ for many reasons: not only has 
Spark's code changed, but there are also differences between Scala 2.11 and Scala 2.12. 
BTW, please send your question to d...@spark.apache.org next time.

> Restore previous behaviour of randomSplit from spark-2.4.7 in spark-3
> -
>
> Key: SPARK-34792
> URL: https://issues.apache.org/jira/browse/SPARK-34792
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core, SQL
>Affects Versions: 3.0.1
>Reporter: kondziolka9ld
>Priority: Major
>
> Hi, 
> Please consider the following difference in the behaviour of the `randomSplit` 
> method, even though the same seed is used.
>  
> {code:java}
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
>       /_/
>  
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_282)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> val Array(f, s) =  
> Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-----+
> |value|
> +-----+
> |    4|
> +-----+
> scala> s.show
> +-----+
> |value|
> +-----+
> |    1|
> |    2|
> |    3|
> |    5|
> |    6|
> |    7|
> |    8|
> |    9|
> |   10|
> +-----+
> {code}
> while as on spark-3
> {code:java}
> scala> val Array(f, s) =  
> Seq(1,2,3,4,5,6,7,8,9,10).toDF.randomSplit(Array(0.3, 0.7), 42)
> f: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> s: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [value: int]
> scala> f.show
> +-----+
> |value|
> +-----+
> |    5|
> |   10|
> +-----+
> scala> s.show
> +-----+
> |value|
> +-----+
> |    1|
> |    2|
> |    3|
> |    4|
> |    6|
> |    7|
> |    8|
> |    9|
> +-----+
> {code}
> I guess that the implementation of the `sample` method changed.
> Is it possible to restore the previous behaviour?
> Thanks in advance!
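
Where a split that is stable across Spark and Scala versions matters more than true 
randomness, one alternative (a sketch only, not from this ticket) is to derive the 
assignment from a deterministic hash instead of randomSplit's internal sampler:
{code:java}
// Minimal sketch: bucket rows by a hash of the value so the 30/70 assignment does
// not depend on Spark's or Scala's internal random number generation.
import org.apache.spark.sql.functions.{abs, hash}
import spark.implicits._   // assumes an active spark-shell / SparkSession named `spark`

val df = Seq(1,2,3,4,5,6,7,8,9,10).toDF("value")
val bucketed = df.withColumn("bucket", abs(hash($"value")) % 10)

val f = bucketed.filter($"bucket" < 3).drop("bucket")   // roughly 30%
val s = bucketed.filter($"bucket" >= 3).drop("bucket")  // roughly 70%
{code}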



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34794) Nested higher-order functions broken in DSL

2021-03-18 Thread Daniel Solow (Jira)
Daniel Solow created SPARK-34794:


 Summary: Nested higher-order functions broken in DSL
 Key: SPARK-34794
 URL: https://issues.apache.org/jira/browse/SPARK-34794
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
 Environment: 3.1.1
Reporter: Daniel Solow


In Spark 3, if I have:

{code:java}
val df = Seq(
(Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")
{code}
and I want to take the cross product of these two arrays, I can do the 
following in SQL:

{code:java}
df.selectExpr("""
FLATTEN(
TRANSFORM(
numbers,
number -> TRANSFORM(
letters,
letter -> (number AS number, letter AS letter)
)
)
) AS zipped
""").show(false)
++
|zipped  |
++
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
++
{code}

This works fine. But when I try the equivalent using the Scala DSL, the result 
is wrong:

{code:java}
df.select(
f.flatten(
f.transform(
$"numbers",
(number: Column) => { f.transform(
$"letters",
(letter: Column) => { f.struct(
number.as("number"),
letter.as("letter")
) }
) }
)
).as("zipped")
).show(10, false)
++
|zipped  |
++
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
++
{code}

Note that the numbers are not included in the output. The explain output for this 
second version is:

{code:java}
== Parsed Logical Plan ==
'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
false))) AS zipped#444]
+- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
   +- LocalRelation [_1#303, _2#304]

== Analyzed Logical Plan ==
zipped: array<struct<number:int,letter:string>>
Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
x#446, false)), lambda x#445, false))) AS zipped#444]
+- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
   +- LocalRelation [_1#303, _2#304]

== Optimized Logical Plan ==
LocalRelation [zipped#444]

== Physical Plan ==
LocalTableScan [zipped#444]
{code}

It seems the lambda variable name x is hardcoded. And sure enough: 
https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647
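
Until that is fixed, one possible workaround (a sketch only, reusing the df from the 
example above) is to route the nested lambdas through expr(), so the SQL parser 
assigns distinct lambda variables:
{code:java}
// Sketch: same cross product as the DSL version above, expressed as an expression
// string so each nested lambda keeps its own variable name.
import org.apache.spark.sql.functions.expr

df.select(
  expr("""
    FLATTEN(
      TRANSFORM(
        numbers,
        number -> TRANSFORM(
          letters,
          letter -> (number AS number, letter AS letter)
        )
      )
    )
  """).as("zipped")
).show(false)
{code}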






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304472#comment-17304472
 ] 

Dongjoon Hyun edited comment on SPARK-34674 at 3/18/21, 9:39 PM:
-

I am reopening this because the affected version is not the same, according to 
[~Kotlov]. The following is a copy of my comment on the other JIRA.

1. Did your example work with any earlier Spark version? If so, what is the 
latest Spark version in which it works?
2. BTW, `sparkContext.stop()` or `spark.stop()` should be called by the app. I 
don't think your use case is a legitimate Spark example, although it might be a 
behavior change across Spark versions.


was (Author: dongjoon):
I am reopening this because the affected version is not the same. The following is a 
copy of my comment on the other JIRA.

1. Did your example work with any earlier Spark version? If so, what is the 
latest Spark version in which it works?
2. BTW, `sparkContext.stop()` or `spark.stop()` should be called by the app. I 
don't think your use case is a legitimate Spark example, although it might be a 
behavior change across Spark versions.

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call the method 
> sparkContext.stop() explicitly, the Spark driver process doesn't terminate 
> even after its main method has completed. This behaviour is different 
> from Spark on YARN, where stopping the sparkContext manually is not required.
>  It looks like the problem is caused by non-daemon threads, which prevent the 
> driver JVM process from terminating.
>  At least I see two non-daemon threads, if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me if it is possible to solve this problem?
> The Docker image from the official release of spark-3.1.1 hadoop3.2 is used.
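
For completeness, a minimal sketch of the pattern recommended in the comments above 
(nothing beyond the public SparkSession API is assumed): stop the session explicitly, 
even when the job fails, so no non-daemon threads keep the driver alive.
{code:java}
// Sketch: always stop the SparkSession in a finally block so the driver JVM can
// exit even if the application body throws.
import org.apache.spark.sql.SparkSession

object Main {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("k8s-app").getOrCreate()
    try {
      spark.range(10).count() // application logic goes here
    } finally {
      spark.stop() // shuts down the scheduler backend, including its OkHttp threads
    }
  }
}
{code}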



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-34674) Spark app on k8s doesn't terminate without call to sparkContext.stop() method

2021-03-18 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34674?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-34674:
---

I am reopening this because the affected version is not the same. The following is a 
copy of my comment on the other JIRA.

1. Did your example work with any earlier Spark version? If so, what is the 
latest Spark version in which it works?
2. BTW, `sparkContext.stop()` or `spark.stop()` should be called by the app. I 
don't think your use case is a legitimate Spark example, although it might be a 
behavior change across Spark versions.

> Spark app on k8s doesn't terminate without call to sparkContext.stop() method
> -
>
> Key: SPARK-34674
> URL: https://issues.apache.org/jira/browse/SPARK-34674
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.1
>Reporter: Sergey
>Priority: Major
>
> Hello!
>  I have run into a problem: if I don't call the method 
> sparkContext.stop() explicitly, the Spark driver process doesn't terminate 
> even after its main method has completed. This behaviour is different 
> from Spark on YARN, where stopping the sparkContext manually is not required.
>  It looks like the problem is caused by non-daemon threads, which prevent the 
> driver JVM process from terminating.
>  At least I see two non-daemon threads, if I don't call sparkContext.stop():
> {code:java}
> Thread[OkHttp kubernetes.default.svc,5,main]
> Thread[OkHttp kubernetes.default.svc Writer,5,main]
> {code}
> Could you please tell me if it is possible to solve this problem?
> The Docker image from the official release of spark-3.1.1 hadoop3.2 is used.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304470#comment-17304470
 ] 

Dongjoon Hyun commented on SPARK-27812:
---

I found your JIRA, SPARK-34674. Please comment there, because this issue 
seems technically unrelated to yours.

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Henry Yu
>Assignee: Igor Calabria
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304459#comment-17304459
 ] 

Dongjoon Hyun commented on SPARK-27812:
---

Did your example work with any earlier Spark version? If so, what is the latest 
Spark version in which it works?

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Henry Yu
>Assignee: Igor Calabria
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27812) kubernetes client import non-daemon thread which block jvm exit.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304457#comment-17304457
 ] 

Dongjoon Hyun commented on SPARK-27812:
---

Thank you for reporting, [~Kotlov].
1. Could you file a new JIRA, because it might be related to the K8s client version?
2. BTW, `sparkContext.stop()` or `spark.stop()` should be called by the app. I 
don't think your use case is a legitimate Spark example, although it might be a 
behavior change across Spark versions.

> kubernetes client import non-daemon thread which block jvm exit.
> 
>
> Key: SPARK-27812
> URL: https://issues.apache.org/jira/browse/SPARK-27812
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes, Spark Core
>Affects Versions: 2.4.3, 2.4.4
>Reporter: Henry Yu
>Assignee: Igor Calabria
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> I tried spark-submit to k8s in cluster mode. The driver pod failed to exit 
> because of an OkHttp WebSocket non-daemon thread.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304454#comment-17304454
 ] 

Dongjoon Hyun edited comment on SPARK-34785 at 3/18/21, 9:14 PM:
-

FYI, each data source has different schema evolution capabilities, and 
Parquet is not the best built-in data source in this respect. We are tracking this 
with the following test suite:
- 
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala

The recommendation is to use the MR (non-vectorized) code path if you are in that situation.
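
Concretely, that means flipping the reader flag already mentioned in the report, e.g. 
per session (a sketch; the trade-off is the performance loss noted in the description):
{code:java}
// Sketch: fall back to the non-vectorized Parquet reader so the widened
// decimal(19,3) column can be read after the Hive schema change.
// SQL equivalent: SET spark.sql.parquet.enableVectorizedReader=false;
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.sql("SELECT * FROM test_decimal").show()
{code}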


was (Author: dongjoon):
FYI, each data source has different schema evolution capabilities, and 
Parquet is not the best built-in data source in this respect. We are tracking this 
with the following test suite:
- 
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala

> SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true 
> which is default value. Error Parquet column cannot be converted in file.
> --
>
> Key: SPARK-34785
> URL: https://issues.apache.org/jira/browse/SPARK-34785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: jobit mathew
>Priority: Major
>
> The SPARK-34212 issue is not fixed if spark.sql.parquet.enableVectorizedReader=true, 
> which is the default value.
> If spark.sql.parquet.enableVectorizedReader=false, the scenario below passes, but it 
> will reduce the performance.
> In Hive, 
> {code:java}
> create table test_decimal(amt decimal(18,2)) stored as parquet; 
> insert into test_decimal select 100;
> alter table test_decimal change amt amt decimal(19,3);
> {code}
> In Spark,
> {code:java}
> select * from test_decimal;
> {code}
> {code:java}
> +-------+
> |    amt|
> +-------+
> |100.000|
> +-------+
> {code}
> but if spark.sql.parquet.enableVectorizedReader=true, the error below occurs:
> {code:java}
> : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal;
> going to print operations logs
> printed operations logs
> going to print operations logs
> printed operations logs
> Getting log thread is interrupted, since query is done!
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4) (vm2 executor 2): 
> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot 
> be converted in file 
> hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. Column: [amt], 
> Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at 

[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304454#comment-17304454
 ] 

Dongjoon Hyun commented on SPARK-34785:
---

FYI, each data source has different schema evolution capabilities, and 
Parquet is not the best built-in data source in this respect. We are tracking this 
with the following test suite:
- 
https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/ReadSchemaTest.scala

> SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true 
> which is default value. Error Parquet column cannot be converted in file.
> --
>
> Key: SPARK-34785
> URL: https://issues.apache.org/jira/browse/SPARK-34785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: jobit mathew
>Priority: Major
>
> The SPARK-34212 issue is not fixed if spark.sql.parquet.enableVectorizedReader=true, 
> which is the default value.
> If spark.sql.parquet.enableVectorizedReader=false, the scenario below passes, but it 
> will reduce the performance.
> In Hive, 
> {code:java}
> create table test_decimal(amt decimal(18,2)) stored as parquet; 
> insert into test_decimal select 100;
> alter table test_decimal change amt amt decimal(19,3);
> {code}
> In Spark,
> {code:java}
> select * from test_decimal;
> {code}
> {code:java}
> +-------+
> |    amt|
> +-------+
> |100.000|
> +-------+
> {code}
> but if spark.sql.parquet.enableVectorizedReader=true, the error below occurs:
> {code:java}
> : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal;
> going to print operations logs
> printed operations logs
> going to print operations logs
> printed operations logs
> Getting log thread is interrupted, since query is done!
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4) (vm2 executor 2): 
> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot 
> be converted in file 
> hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. Column: [amt], 
> Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: 
> org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:735)
> at 
> 

[jira] [Commented] (SPARK-34785) SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true which is default value. Error Parquet column cannot be converted in file.

2021-03-18 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304453#comment-17304453
 ] 

Dongjoon Hyun commented on SPARK-34785:
---

SPARK-34212 is complete by itself because it's designed to fix the correctness 
issue. You will not get incorrect values.
For this specific `Schema evolution` requirement in the PR description, I don't 
have the bandwidth, [~jobitmathew].

> SPARK-34212 issue not fixed if spark.sql.parquet.enableVectorizedReader=true 
> which is default value. Error Parquet column cannot be converted in file.
> --
>
> Key: SPARK-34785
> URL: https://issues.apache.org/jira/browse/SPARK-34785
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: jobit mathew
>Priority: Major
>
> The SPARK-34212 issue is not fixed if spark.sql.parquet.enableVectorizedReader=true, 
> which is the default value.
> If spark.sql.parquet.enableVectorizedReader=false, the scenario below passes, but it 
> will reduce the performance.
> In Hive, 
> {code:java}
> create table test_decimal(amt decimal(18,2)) stored as parquet; 
> insert into test_decimal select 100;
> alter table test_decimal change amt amt decimal(19,3);
> {code}
> In Spark,
> {code:java}
> select * from test_decimal;
> {code}
> {code:java}
> +-------+
> |    amt|
> +-------+
> |100.000|
> +-------+
> {code}
> but if spark.sql.parquet.enableVectorizedReader=true, the error below occurs:
> {code:java}
> : jdbc:hive2://10.21.18.161:23040/> select * from test_decimal;
> going to print operations logs
> printed operations logs
> going to print operations logs
> printed operations logs
> Getting log thread is interrupted, since query is done!
> Error: org.apache.hive.service.cli.HiveSQLException: Error running query: 
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 
> (TID 4) (vm2 executor 2): 
> org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot 
> be converted in file 
> hdfs://hacluster/user/hive/warehouse/test_decimal/00_0. Column: [amt], 
> Expected: decimal(19,3), Found: FIXED_LEN_BYTE_ARRAY
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:179)
> at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
> at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:503)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> at org.apache.spark.scheduler.Task.run(Task.scala:131)
> at 
> org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
> at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1461)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: 
> org.apache.spark.sql.execution.datasources.SchemaColumnConvertNotSupportedException
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.constructConvertNotSupportedException(VectorizedColumnReader.java:339)
> at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readFixedLenByteArrayBatch(VectorizedColumnReader.java:735)
> at 
> 

[jira] [Commented] (SPARK-27790) Support ANSI SQL INTERVAL types

2021-03-18 Thread Matthew Powers (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304443#comment-17304443
 ] 

Matthew Powers commented on SPARK-27790:


[~maxgekk] - how will DateTime addition work with the new intervals?  Something 
like this?
 * Add three months: col("some_date") + make_year_month(???)
 * Add 2 years and 10 seconds: col("some_time") + make_year_month(???) + 
make_day_second(???)

Thanks for the great description of the problem in this JIRA ticket.  
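
For reference, the same arithmetic can already be written against the existing 
(legacy) interval type in Spark 3.1; the sketch below only shows that baseline and 
makes no assumption about the shape of the future year-month/day-second API:
{code:java}
// Baseline sketch with today's CalendarInterval literals (not the proposed API).
import org.apache.spark.sql.functions.{col, expr}
import spark.implicits._   // assumes an active SparkSession named `spark`

val df = Seq("2021-03-18 00:00:00").toDF("ts")
  .select(col("ts").cast("timestamp").as("some_time"))

df.select(
  (col("some_time") + expr("INTERVAL 3 MONTHS")).as("plus_three_months"),
  (col("some_time") + expr("INTERVAL 2 YEARS 10 SECONDS")).as("plus_2y_10s")
).show(false)
{code}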

> Support ANSI SQL INTERVAL types
> ---
>
> Key: SPARK-27790
> URL: https://issues.apache.org/jira/browse/SPARK-27790
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Spark has an INTERVAL data type, but it is “broken”:
> # It cannot be persisted
> # It is not comparable because it crosses the month/day line. That is, there 
> is no telling whether “1 Month 1 Day” is equal to, say, “31 Days”, since not 
> all months have the same number of days.
> I propose here to introduce the two flavours of INTERVAL as described in the 
> ANSI SQL Standard and deprecate Spark's interval type.
> * ANSI describes two non-overlapping “classes”: 
> ** YEAR-MONTH, 
> ** DAY-SECOND ranges
> * Members within each class can be compared and sorted.
> * Supports datetime arithmetic
> * Can be persisted.
> The old and new flavors of INTERVAL can coexist until Spark INTERVAL is 
> eventually retired. Also any semantic “breakage” can be controlled via legacy 
> config settings. 
> *Milestone 1* -- Spark interval equivalency (the new interval types meet 
> or exceed all functionality of the existing SQL INTERVAL):
> * Add two new DataType implementations for interval year-month and 
> day-second. Includes the JSON format and DDL string.
> * Infra support: check the caller sides of DateType/TimestampType
> * Support the two new interval types in Dataset/UDF.
> * Interval literals (with a legacy config to still allow mixed year-month 
> day-seconds fields and return legacy interval values)
> * Interval arithmetic(interval * num, interval / num, interval +/- interval)
> * Datetime functions/operators: Datetime - Datetime (to days or day second), 
> Datetime +/- interval
> * Cast to and from the two new interval types, cast string to interval, cast 
> interval to string (pretty printing), with the SQL syntax to specify the types
> * Support sorting intervals.
> *Milestone 2* -- Persistence:
> * Ability to create tables of type interval
> * Ability to write to common file formats such as Parquet and JSON.
> * INSERT, SELECT, UPDATE, MERGE
> * Discovery
> *Milestone 3* --  Client support
> * JDBC support
> * Hive Thrift server
> *Milestone 4* -- PySpark and Spark R integration
> * Python UDF can take and return intervals
> * DataFrame support



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34636) sql method in UnresolvedAttribute, AttributeReference and Alias don't quote qualified names properly.

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304434#comment-17304434
 ] 

Apache Spark commented on SPARK-34636:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31885

> sql method in UnresolvedAttribute, AttributeReference and Alias don't quote 
> qualified names properly.
> -
>
> Key: SPARK-34636
> URL: https://issues.apache.org/jira/browse/SPARK-34636
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> UnresolvedAttribute, AttributeReference and Alias, which take a qualifier, don't 
> quote qualified names properly.
> One instance is reported in SPARK-34626.
> Other instances are as follows.
> {code}
> UnresolvedAttribute("a`b"::"c.d"::Nil).sql
> a`b.`c.d` // expected: `a``b`.`c.d`
> AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql
> c.d.`a.b` // expected: `c.d`.`a.b`
> Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = 
> "d.e"::Nil).sql
> `a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c`
> {code}
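
For illustration only (a hypothetical helper, not Spark's internal implementation), 
the expected quoting in the examples above amounts to escaping backticks and wrapping 
each name part:
{code:java}
// Hypothetical sketch of the expected quoting rule shown above.
def quoteIdentifier(part: String): String = "`" + part.replace("`", "``") + "`"

val parts = Seq("a`b", "c.d")
println(parts.map(quoteIdentifier).mkString("."))  // `a``b`.`c.d`
{code}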



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34636) sql method in UnresolvedAttribute, AttributeReference and Alias don't quote qualified names properly.

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304433#comment-17304433
 ] 

Apache Spark commented on SPARK-34636:
--

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/31885

> sql method in UnresolvedAttribute, AttributeReference and Alias don't quote 
> qualified names properly.
> -
>
> Key: SPARK-34636
> URL: https://issues.apache.org/jira/browse/SPARK-34636
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.2.0
>
>
> UnresolvedAttribute, AttributeReference and Alias, which take a qualifier, don't 
> quote qualified names properly.
> One instance is reported in SPARK-34626.
> Other instances are as follows.
> {code}
> UnresolvedAttribute("a`b"::"c.d"::Nil).sql
> a`b.`c.d` // expected: `a``b`.`c.d`
> AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql
> c.d.`a.b` // expected: `c.d`.`a.b`
> Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = 
> "d.e"::Nil).sql
> `a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c`
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34793) Prohibit saving of day-time and year-month intervals

2021-03-18 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17304427#comment-17304427
 ] 

Apache Spark commented on SPARK-34793:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/31884

> Prohibit saving of day-time and year-month intervals
> 
>
> Key: SPARK-34793
> URL: https://issues.apache.org/jira/browse/SPARK-34793
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
>
> Temporarily prohibit saving of year-month and day-time intervals to data sources 
> until they are supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34793) Prohibit saving of day-time and year-month intervals

2021-03-18 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34793:


Assignee: Apache Spark  (was: Max Gekk)

> Prohibit saving of day-time and year-month intervals
> 
>
> Key: SPARK-34793
> URL: https://issues.apache.org/jira/browse/SPARK-34793
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Temporarily prohibit saving of year-month and day-time intervals to data sources 
> until they are supported.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


