[jira] [Commented] (SPARK-31345) Spark fails to write hive parquet table with empty array

2022-02-17 Thread Mr.黄 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494336#comment-17494336
 ] 

Mr.黄 commented on SPARK-31345:
--

Hello, I did not have this problem with Spark 2.4.0-CDH6.3.2, but the problem 
reappeared with Spark 2.4.3. I do not understand why the lower version succeeds 
while the higher version fails; could it be that the fix for this bug does not 
cover 2.4.3? The following is my version information and error message:
{code:java}
spark version: 2.4.0-cdh6.3.2
hive version: 2.1.1-cdh.6.3.2
scala> spark.sql("create table test STORED AS PARQUET as select map() as a")
scala> sql("select * from test").show
+---+                                                                           
|  a|
+---+
| []|
+---+

-
spark version: 2.4.3
hive version: 3.1.2
scala> spark.sql("create table test STORED AS PARQUET as select map() as a")

Caused by: org.apache.spark.SparkException: Task failed while writing rows.
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:257)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  at org.apache.spark.scheduler.Task.run(Task.scala:121)
  at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
  at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.RuntimeException: Parquet record is malformed: empty 
fields are illegal, the field should be ommited completely instead
  at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
  at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
  at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
  at 
parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:121)
  at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:123)
  at parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:42)
  at 
org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:111)
  at 
org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:124)
  at 
org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
  at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242)
  at 
org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
  at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:248)
  ... 10 more
Caused by: parquet.io.ParquetEncodingException: empty fields are illegal, the 
field should be ommited completely instead
  at 
parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:244)
  at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
  at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
  at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeGroupFields(DataWritableWriter.java:89)
  at 
org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:60)
  ... 23 more {code}

> Spark fails to write hive parquet table with empty array
> 
>
> Key: SPARK-31345
> URL: https://issues.apache.org/jira/browse/SPARK-31345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 

[jira] [Commented] (SPARK-25271) Creating parquet table with all the column null throws exception

2022-02-17 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494342#comment-17494342
 ] 

L. C. Hsieh commented on SPARK-25271:
-

Based on this JIRA, the fix is only available since 2.4.8.

I guess 2.4.0-cdh6.3.2 may have backported the fix, since it is a distribution 
maintained by the vendor; I don't know the details.
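
For versions without the fix, a minimal workaround sketch (an assumption on my part, not from this ticket) is to write the table through Spark's native Parquet data source instead of a Hive CTAS, so Hive's DataWritableWriter - the code path that throws the error above - is not involved:
{code:java}
// Hedged workaround sketch: saveAsTable with the "parquet" source uses
// Spark's own Parquet writer rather than Hive's DataWritableWriter.
spark.sql("select map() as a")
  .write
  .format("parquet")
  .saveAsTable("test")
{code}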

> Creating parquet table with all the column null throws exception
> 
>
> Key: SPARK-25271
> URL: https://issues.apache.org/jira/browse/SPARK-25271
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Shivu Sondur
>Assignee: L. C. Hsieh
>Priority: Critical
> Fix For: 2.4.8, 3.0.0
>
> Attachments: image-2018-09-07-09-12-34-944.png, 
> image-2018-09-07-09-29-33-370.png, image-2018-09-07-09-29-52-899.png, 
> image-2018-09-07-09-32-43-892.png, image-2018-09-07-09-33-03-095.png
>
>
> {code:java}
>  1)cat /data/parquet.dat
> 1$abc2$pqr:3$xyz
> null{code}
>  
> {code:java}
> 2)spark.sql("create table vp_reader_temp (projects map) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ':' 
> MAP KEYS TERMINATED BY '$'")
> {code}
> {code:java}
> 3)spark.sql("
> LOAD DATA LOCAL INPATH '/data/parquet.dat' INTO TABLE vp_reader_temp")
> {code}
> {code:java}
> 4)spark.sql("create table vp_reader STORED AS PARQUET as select * from 
> vp_reader_temp")
> {code}
> *Result:* Throws an exception (works fine with Spark 2.2.1)
> {code:java}
> java.lang.RuntimeException: Parquet record is malformed: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.write(DataWritableWriter.java:64)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:59)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriteSupport.write(DataWritableWriteSupport.java:31)
>   at 
> org.apache.parquet.hadoop.InternalParquetRecordWriter.write(InternalParquetRecordWriter.java:123)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:180)
>   at 
> org.apache.parquet.hadoop.ParquetRecordWriter.write(ParquetRecordWriter.java:46)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:112)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.ParquetRecordWriterWrapper.write(ParquetRecordWriterWrapper.java:125)
>   at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.write(HiveFileFormat.scala:149)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:406)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:283)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:281)
>   at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1438)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:286)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:211)
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:210)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:349)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.parquet.io.ParquetEncodingException: empty fields are 
> illegal, the field should be ommited completely instead
>   at 
> org.apache.parquet.io.MessageColumnIO$MessageColumnIORecordConsumer.endField(MessageColumnIO.java:320)
>   at 
> org.apache.parquet.io.RecordConsumerLoggingWrapper.endField(RecordConsumerLoggingWrapper.java:165)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeMap(DataWritableWriter.java:241)
>   at 
> org.apache.hadoop.hive.ql.io.parquet.write.DataWritableWriter.writeValue(DataWritableWriter.java:116)
>   at 
> org.apache.hadoop

[jira] [Created] (SPARK-38247) Unify the output of df.explain and "explain " if plan is command

2022-02-17 Thread yikf (Jira)
yikf created SPARK-38247:


 Summary: Unify the output of df.explain and "explain " if plan is 
command
 Key: SPARK-38247
 URL: https://issues.apache.org/jira/browse/SPARK-38247
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: yikf
 Fix For: 3.3.0


This PR aims to unify the output of df.explain and the "explain" SQL statement 
if the plan is a command:
 * unify the output of df.explain and the "explain" SQL statement if the plan is a command
 * make the output of explain unambiguous if the plan is a command

 

Let's say we have a query like "show tables" and we want to explain it.
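
As a minimal sketch (standard Spark APIs, not taken from the issue itself), these are the two entry points whose output is being unified; the plans below are from the issue as filed:
{code:java}
// df.explain path: explain the DataFrame wrapping the command
spark.sql("show tables").explain(extended = true)

// "explain" SQL path: run EXPLAIN as its own statement
spark.sql("explain extended show tables").show(truncate = false)
{code}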

Before this pr:
{code:java}
== Parsed Logical Plan ==
'ShowTables [namespace#62, tableName#63, isTemporary#64]
+- 'UnresolvedNamespace

== Analyzed Logical Plan ==
namespace: string, tableName: string, isTemporary: boolean
ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false

== Optimized Logical Plan ==
CommandResult [namespace#62, tableName#63, isTemporary#64], Execute ShowTablesCommand, [[default,lt,false], [default,people,false], [default,person,false], [default,rt,false], [default,t,false], [default,t1,false], [default,t10086,false], [default,txtsource,false]]
   +- ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false

== Physical Plan ==
CommandResult [namespace#62, tableName#63, isTemporary#64]
   +- Execute ShowTablesCommand
         +- ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false {code}
After this pr:
{code:java}
== Parsed Logical Plan ==
'ShowTables [namespace#0, tableName#1, isTemporary#2]
+- 'UnresolvedNamespace

== Analyzed Logical Plan ==
namespace: string, tableName: string, isTemporary: boolean
ShowTables [namespace#0, tableName#1, isTemporary#2]
+- ResolvedNamespace V2SessionCatalog(spark_catalog), [default]

== Optimized Logical Plan ==
ShowTables [namespace#0, tableName#1, isTemporary#2]
+- ResolvedNamespace V2SessionCatalog(spark_catalog), [default]

== Physical Plan ==
ShowTables [namespace#0, tableName#1, isTemporary#2], V2SessionCatalog(spark_catalog), [default] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38248) Add @nowarn to suppress warnings related to Serializable interface

2022-02-17 Thread Yang Jie (Jira)
Yang Jie created SPARK-38248:


 Summary: Add @nowarn to suppress warnings related to Serializable 
interface
 Key: SPARK-38248
 URL: https://issues.apache.org/jira/browse/SPARK-38248
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.3.0
Reporter: Yang Jie


Methods related to `java.io.Serializable` are flagged by the compiler as never 
used; we can use @nowarn to suppress these warnings.
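
A minimal sketch of the kind of suppression intended, on a hypothetical class (scala.annotation.nowarn with a "cat=unused" filter, available since Scala 2.13):
{code:java}
import java.io.{ObjectInputStream, ObjectOutputStream}
import scala.annotation.nowarn

class Example(var value: Int) extends Serializable {
  // These private hooks are only invoked reflectively by Java serialization,
  // so the compiler flags them as never used; @nowarn silences that warning.
  @nowarn("cat=unused")
  private def writeObject(out: ObjectOutputStream): Unit = out.defaultWriteObject()

  @nowarn("cat=unused")
  private def readObject(in: ObjectInputStream): Unit = in.defaultReadObject()
}
{code}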

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38247) Unify the output of df.explain and "explain " if plan is command

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494357#comment-17494357
 ] 

Apache Spark commented on SPARK-38247:
--

User 'Yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/35564

> Unify the output of df.explain and "explain " if plan is command
> 
>
> Key: SPARK-38247
> URL: https://issues.apache.org/jira/browse/SPARK-38247
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> This PR aims to unify the output of df.explain and the "explain" SQL statement 
> if the plan is a command:
>  * unify the output of df.explain and the "explain" SQL statement if the plan is a command
>  * make the output of explain unambiguous if the plan is a command
>  
> Let's say we have a query like "show tables" and we want to explain it.
> Before this pr:
> {code:java}
> == Parsed Logical Plan ==
> 'ShowTables [namespace#62, tableName#63, isTemporary#64]
> +- 'UnresolvedNamespace
>
> == Analyzed Logical Plan ==
> namespace: string, tableName: string, isTemporary: boolean
> ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false
>
> == Optimized Logical Plan ==
> CommandResult [namespace#62, tableName#63, isTemporary#64], Execute ShowTablesCommand, [[default,lt,false], [default,people,false], [default,person,false], [default,rt,false], [default,t,false], [default,t1,false], [default,t10086,false], [default,txtsource,false]]
>    +- ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false
>
> == Physical Plan ==
> CommandResult [namespace#62, tableName#63, isTemporary#64]
>    +- Execute ShowTablesCommand
>          +- ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false {code}
> After this pr:
> {code:java}
> == Parsed Logical Plan ==
> 'ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- 'UnresolvedNamespace
>
> == Analyzed Logical Plan ==
> namespace: string, tableName: string, isTemporary: boolean
> ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- ResolvedNamespace V2SessionCatalog(spark_catalog), [default]
>
> == Optimized Logical Plan ==
> ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- ResolvedNamespace V2SessionCatalog(spark_catalog), [default]
>
> == Physical Plan ==
> ShowTables [namespace#0, tableName#1, isTemporary#2], V2SessionCatalog(spark_catalog), [default] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38247) Unify the output of df.explain and "explain " if plan is command

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38247:


Assignee: Apache Spark

> Unify the output of df.explain and "explain " if plan is command
> 
>
> Key: SPARK-38247
> URL: https://issues.apache.org/jira/browse/SPARK-38247
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.3.0
>
>
> This PR aims to unify the output of df.explain and the "explain" SQL statement 
> if the plan is a command:
>  * unify the output of df.explain and the "explain" SQL statement if the plan is a command
>  * make the output of explain unambiguous if the plan is a command
>  
> Let's say we have a query like "show tables" and we want to explain it.
> Before this pr:
> {code:java}
> == Parsed Logical Plan ==
> 'ShowTables [namespace#62, tableName#63, isTemporary#64]
> +- 'UnresolvedNamespace
>
> == Analyzed Logical Plan ==
> namespace: string, tableName: string, isTemporary: boolean
> ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false
>
> == Optimized Logical Plan ==
> CommandResult [namespace#62, tableName#63, isTemporary#64], Execute ShowTablesCommand, [[default,lt,false], [default,people,false], [default,person,false], [default,rt,false], [default,t,false], [default,t1,false], [default,t10086,false], [default,txtsource,false]]
>    +- ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false
>
> == Physical Plan ==
> CommandResult [namespace#62, tableName#63, isTemporary#64]
>    +- Execute ShowTablesCommand
>          +- ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false {code}
> After this pr:
> {code:java}
> == Parsed Logical Plan ==
> 'ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- 'UnresolvedNamespace
>
> == Analyzed Logical Plan ==
> namespace: string, tableName: string, isTemporary: boolean
> ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- ResolvedNamespace V2SessionCatalog(spark_catalog), [default]
>
> == Optimized Logical Plan ==
> ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- ResolvedNamespace V2SessionCatalog(spark_catalog), [default]
>
> == Physical Plan ==
> ShowTables [namespace#0, tableName#1, isTemporary#2], V2SessionCatalog(spark_catalog), [default] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38247) Unify the output of df.explain and "explain " if plan is command

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38247:


Assignee: (was: Apache Spark)

> Unify the output of df.explain and "explain " if plan is command
> 
>
> Key: SPARK-38247
> URL: https://issues.apache.org/jira/browse/SPARK-38247
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> This PR aims to unify the output of df.explain and the "explain" SQL statement 
> if the plan is a command:
>  * unify the output of df.explain and the "explain" SQL statement if the plan is a command
>  * make the output of explain unambiguous if the plan is a command
>  
> Let's say we have a query like "show tables" and we want to explain it.
> Before this pr:
> {code:java}
> == Parsed Logical Plan ==
> 'ShowTables [namespace#62, tableName#63, isTemporary#64]
> +- 'UnresolvedNamespace
>
> == Analyzed Logical Plan ==
> namespace: string, tableName: string, isTemporary: boolean
> ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false
>
> == Optimized Logical Plan ==
> CommandResult [namespace#62, tableName#63, isTemporary#64], Execute ShowTablesCommand, [[default,lt,false], [default,people,false], [default,person,false], [default,rt,false], [default,t,false], [default,t1,false], [default,t10086,false], [default,txtsource,false]]
>    +- ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false
>
> == Physical Plan ==
> CommandResult [namespace#62, tableName#63, isTemporary#64]
>    +- Execute ShowTablesCommand
>          +- ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false {code}
> After this pr:
> {code:java}
> == Parsed Logical Plan ==
> 'ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- 'UnresolvedNamespace
>
> == Analyzed Logical Plan ==
> namespace: string, tableName: string, isTemporary: boolean
> ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- ResolvedNamespace V2SessionCatalog(spark_catalog), [default]
>
> == Optimized Logical Plan ==
> ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- ResolvedNamespace V2SessionCatalog(spark_catalog), [default]
>
> == Physical Plan ==
> ShowTables [namespace#0, tableName#1, isTemporary#2], V2SessionCatalog(spark_catalog), [default] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38247) Unify the output of df.explain and "explain " if plan is command

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494356#comment-17494356
 ] 

Apache Spark commented on SPARK-38247:
--

User 'Yikf' has created a pull request for this issue:
https://github.com/apache/spark/pull/35564

> Unify the output of df.explain and "explain " if plan is command
> 
>
> Key: SPARK-38247
> URL: https://issues.apache.org/jira/browse/SPARK-38247
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> This PR aims to unify the output of df.explain and the "explain" SQL statement 
> if the plan is a command:
>  * unify the output of df.explain and the "explain" SQL statement if the plan is a command
>  * make the output of explain unambiguous if the plan is a command
>  
> Let's say we have a query like "show tables" and we want to explain it.
> Before this pr:
> {code:java}
> == Parsed Logical Plan ==
> 'ShowTables [namespace#62, tableName#63, isTemporary#64]
> +- 'UnresolvedNamespace
>
> == Analyzed Logical Plan ==
> namespace: string, tableName: string, isTemporary: boolean
> ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false
>
> == Optimized Logical Plan ==
> CommandResult [namespace#62, tableName#63, isTemporary#64], Execute ShowTablesCommand, [[default,lt,false], [default,people,false], [default,person,false], [default,rt,false], [default,t,false], [default,t1,false], [default,t10086,false], [default,txtsource,false]]
>    +- ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false
>
> == Physical Plan ==
> CommandResult [namespace#62, tableName#63, isTemporary#64]
>    +- Execute ShowTablesCommand
>          +- ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false {code}
> After this pr:
> {code:java}
> == Parsed Logical Plan ==
> 'ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- 'UnresolvedNamespace
>
> == Analyzed Logical Plan ==
> namespace: string, tableName: string, isTemporary: boolean
> ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- ResolvedNamespace V2SessionCatalog(spark_catalog), [default]
>
> == Optimized Logical Plan ==
> ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- ResolvedNamespace V2SessionCatalog(spark_catalog), [default]
>
> == Physical Plan ==
> ShowTables [namespace#0, tableName#1, isTemporary#2], V2SessionCatalog(spark_catalog), [default] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38248) Add @nowarn to suppress warnings related to Serializable interface

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38248:


Assignee: Apache Spark

> Add @nowarn to suppress warnings related to Serializable interface
> --
>
> Key: SPARK-38248
> URL: https://issues.apache.org/jira/browse/SPARK-38248
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>
> Methods related to `java.io.Serializable` are flagged by the compiler as never 
> used; we can use @nowarn to suppress these warnings.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38248) Add @nowarn to suppress warnings related to Serializable interface

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494358#comment-17494358
 ] 

Apache Spark commented on SPARK-38248:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35565

> Add @nowarn to suppress warnings related to Serializable interface
> --
>
> Key: SPARK-38248
> URL: https://issues.apache.org/jira/browse/SPARK-38248
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> Methods related to `java.io.Serializable` are flagged by the compiler as never 
> used; we can use @nowarn to suppress these warnings.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38248) Add @nowarn to suppress warnings related to Serializable interface

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38248:


Assignee: (was: Apache Spark)

> Add @nowarn to suppress warnings related to Serializable interface
> --
>
> Key: SPARK-38248
> URL: https://issues.apache.org/jira/browse/SPARK-38248
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>
> Methods related to `java.io.Serializable` are flagged by the compiler as never 
> used; we can use @nowarn to suppress these warnings.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38249) Cleanup unused private methods and fields

2022-02-17 Thread Yang Jie (Jira)
Yang Jie created SPARK-38249:


 Summary: Cleanup unused private methods and fields
 Key: SPARK-38249
 URL: https://issues.apache.org/jira/browse/SPARK-38249
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: Yang Jie






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38249) Cleanup unused private methods and fields

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38249:


Assignee: (was: Apache Spark)

> Cleanup unused private methods and fields
> -
>
> Key: SPARK-38249
> URL: https://issues.apache.org/jira/browse/SPARK-38249
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38249) Cleanup unused private methods and fields

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38249:


Assignee: Apache Spark

> Cleanup unused private methods and fields
> -
>
> Key: SPARK-38249
> URL: https://issues.apache.org/jira/browse/SPARK-38249
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38249) Cleanup unused private methods and fields

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494367#comment-17494367
 ] 

Apache Spark commented on SPARK-38249:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35566

> Cleanup unused private methods and fields
> -
>
> Key: SPARK-38249
> URL: https://issues.apache.org/jira/browse/SPARK-38249
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38249) Cleanup unused private methods and fields

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494368#comment-17494368
 ] 

Apache Spark commented on SPARK-38249:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35566

> Cleanup unused private methods and fields
> -
>
> Key: SPARK-38249
> URL: https://issues.apache.org/jira/browse/SPARK-38249
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38245) Avro Complex Union Type return `member$I`

2022-02-17 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-38245:
-
Component/s: SQL
 (was: Spark Core)

> Avro Complex Union Type return `member$I`
> -
>
> Key: SPARK-38245
> URL: https://issues.apache.org/jira/browse/SPARK-38245
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.1
> Environment: +OS+
>  * Debian GNU/Linux 10 (Docker Container)
> +packages & others+
>  * spark-avro_2.12-3.2.1
>  * python 3.7.3
>  * pyspark 3.2.1
>  * spark-3.2.1-bin-hadoop3.2
>  * Docker version 20.10.12
>Reporter: Teddy Crepineau
>Priority: Major
>  Labels: avro, newbie
>
> *Short Description*
> When reading complex union types from Avro files, some information appears to 
> be lost, as the name of the record is omitted and {{member$i}} is returned instead.
> *Long Description*
> +Error+
> Given the Avro schema {{schema.avsc}}, I would expect the schema when reading 
> the Avro file using {{read_avro.py}} to be as in {{expected.txt}}. Instead, I 
> get the schema output in {{reality.txt}}, where {{RecordOne}} became {{member0}}, etc.
> This causes information loss and makes the DataFrame unusable.
> From my understanding, this behavior was implemented 
> [here|https://github.com/databricks/spark-avro/pull/117].
>  
> {code:java|title=read_avro.py}
> df = spark.read.format("avro").load("path/to/my/file.avro")
> df.printSchema()
>  {code}
> {code:java|title=schema.avsc}
>  {
>  "type": "record",
>  "name": "SomeData",
>  "namespace": "my.name.space",
>  "fields": [
>   {
>    "name": "ts",
>    "type": {
>     "type": "long",
>     "logicalType": "timestamp-millis"
>    }
>   },
>   {
>    "name": "field_id",
>    "type": [
>     "null",
>     "string"
>    ],
>    "default": null
>   },
>   {
>    "name": "values",
>    "type": [
>     {
>      "type": "record",
>      "name": "RecordOne",
>      "fields": [
>       {
>        "name": "field_a",
>        "type": "long"
>       },
>       {
>        "name": "field_b",
>        "type": {
>         "type": "enum",
>         "name": "FieldB",
>         "symbols": [
>             "..."
>         ],
>        }
>       },
>       {
>        "name": "field_C",
>        "type": {
>         "type": "array",
>         "items": "long"
>        }
>       }
>      ]
>     },
>     {
>      "type": "record",
>      "name": "RecordTwo",
>      "fields": [
>       {
>        "name": "field_a",
>        "type": "long"
>       }
>      ]
>     }
>    ]
>   }
>  ]
> }{code}
> {code:java|title=expected.txt}
> root
>  |-- ts: timestamp (nullable = true)
>  |-- field_id: string (nullable = true)
>  |-- values: struct (nullable = true)
>  |    |-- RecordOne: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
>  |    |    |-- field_b: string (nullable = true)
>  |    |    |-- field_c: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |-- RecordTwo: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
> {code}
> {code:java|title=reality.txt}
> root
>  |-- ts: timestamp (nullable = true)
>  |-- field_id: string (nullable = true)
>  |-- values: struct (nullable = true)
>  |    |-- member0: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
>  |    |    |-- field_b: string (nullable = true)
>  |    |    |-- field_c: array (nullable = true)
>  |    |    |    |-- element: long (containsNull = true)
>  |    |-- member1: struct (nullable = true)
>  |    |    |-- field_a: long (nullable = true)
> {code}
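
A possible post-read workaround, sketched under the assumption that the positional member names match the reality.txt output above (it only re-labels the fields; it cannot recover anything the writer schema no longer carries in the DataFrame):
{code:java}
// Hedged workaround sketch: re-alias the positional union members (member0,
// member1 above) back to the record names from the Avro writer schema.
import org.apache.spark.sql.functions.{col, struct}

val df = spark.read.format("avro").load("path/to/my/file.avro")
val renamed = df.withColumn(
  "values",
  struct(
    col("values.member0").as("RecordOne"),
    col("values.member1").as("RecordTwo")))
renamed.printSchema()
{code}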



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38175) Clean up unused parameters in private methods signature

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494372#comment-17494372
 ] 

Apache Spark commented on SPARK-38175:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/35567

> Clean up unused parameters in private methods signature
> ---
>
> Key: SPARK-38175
> URL: https://issues.apache.org/jira/browse/SPARK-38175
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.3.0
>
>
> The private method `castDecimalToIntegralTypeCode` in Cast.scala is as follows:
>  
> {code:java}
> private[this] def castDecimalToIntegralTypeCode(
>       ctx: CodegenContext,
>       integralType: String,
>       catalogType: String): CastFunction = {
>     if (ansiEnabled) {
>       (c, evPrim, evNull) => code"$evPrim = $c.roundTo${integralType.capitalize}();"
>     } else {
>       (c, evPrim, evNull) => code"$evPrim = $c.to${integralType.capitalize}();"
>     }
>   } {code}
> The `ctx` and `catalogType` parameters are unused.
>  
>  
>  
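
For reference, a sketch of the cleaned-up signature, assuming only the unused parameters are dropped (the actual change is in the pull request linked above):
{code:java}
// Hedged sketch: same body as the method quoted above, with the unused
// `ctx` and `catalogType` parameters removed from the signature.
private[this] def castDecimalToIntegralTypeCode(integralType: String): CastFunction = {
  if (ansiEnabled) {
    (c, evPrim, evNull) => code"$evPrim = $c.roundTo${integralType.capitalize}();"
  } else {
    (c, evPrim, evNull) => code"$evPrim = $c.to${integralType.capitalize}();"
  }
}
{code}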



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38237) Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494376#comment-17494376
 ] 

Apache Spark commented on SPARK-38237:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/35551

> Rename back StatefulOpClusteredDistribution to HashClusteredDistribution
> 
>
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We still find HashClusteredDistribution to be useful for batch queries as well. 
> For example, we had a case with lower parallelism than expected because 
> ClusteredDistribution is used for aggregation, and it can be satisfied by 
> HashPartitioning on sub-key groups (note that the effective parallelism also 
> depends on "cardinality" - picking sub-key groups means having less cardinality).
> We propose to rename it back to HashClusteredDistribution while retaining the 
> NOTE for stateful operators. The distribution itself should still not be 
> touched, due to the requirements of stateful operators, but it can be co-used 
> in batch cases if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38195) Add the TIMESTAMPADD() function

2022-02-17 Thread Max Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Gekk resolved SPARK-38195.
--
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35502
[https://github.com/apache/spark/pull/35502]

> Add the TIMESTAMPADD() function
> ---
>
> Key: SPARK-38195
> URL: https://issues.apache.org/jira/browse/SPARK-38195
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Max Gekk
>Assignee: Max Gekk
>Priority: Major
> Fix For: 3.3.0
>
>
> The function TIMESTAMPADD() is part of the ODBC API and is implemented by 
> virtually all other databases, but it is missing in Spark SQL.
> The first argument is a unary interval (HOUR, YEAR, DAY, etc.).
> {code:sql}
> TIMESTAMPADD( SECOND, 2 , timestamp '2021-12-12 12:00:00.00') returns 
> 2021-12-12 12:00:02.00
> {code}
> The ODBC syntax {fn } requires the interval value to have a sql_tsi_ prefix, 
> while plain SQL doesn't. See
> * [Time, Date, and Interval 
> Functions|https://docs.microsoft.com/en-us/sql/odbc/reference/appendixes/time-date-and-interval-functions?view=sql-server-ver15]
> * [TIMESTAMPADD function (ODBC 
> compatible)|https://docs.faircom.com/doc/sqlref/33476.htm]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38225) Adjust input `format` of function `to_binary`

2022-02-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38225:
---

Assignee: Xinrong Meng

> Adjust input `format` of function `to_binary`
> -
>
> Key: SPARK-38225
> URL: https://issues.apache.org/jira/browse/SPARK-38225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
>
> Currently, the function to_binary doesn't handle a non-string `format` 
> parameter properly.
> For example, `spark.sql("select to_binary('abc', 1)")` raises a casting error 
> rather than hinting that the encoding format is unsupported.
> In addition, the `base2` format is debatable, as discussed 
> [here](https://github.com/apache/spark/pull/35415#discussion_r805578036). We 
> may exclude it for now, following what Snowflake's 
> [to_binary](https://docs.snowflake.com/en/sql-reference/functions/to_binary.html)
>  does.
>  
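
A small sketch of the contrast described above (assuming to_binary behaves as stated in the issue; 'utf-8' is one of the string formats the function accepts):
{code:java}
// Hedged sketch of the two cases described in the issue text above.
spark.sql("select to_binary('abc', 'utf-8')").show()  // string format: supported
spark.sql("select to_binary('abc', 1)").show()        // non-string format: fails with a cast error instead of a clear "unsupported format" hint
{code}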



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38225) Adjust input `format` of function `to_binary`

2022-02-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38225.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35533
[https://github.com/apache/spark/pull/35533]

> Adjust input `format` of function `to_binary`
> -
>
> Key: SPARK-38225
> URL: https://issues.apache.org/jira/browse/SPARK-38225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Xinrong Meng
>Assignee: Xinrong Meng
>Priority: Major
> Fix For: 3.3.0
>
>
> Currently, the function to_binary doesn't handle a non-string `format` 
> parameter properly.
> For example, `spark.sql("select to_binary('abc', 1)")` raises a casting error 
> rather than hinting that the encoding format is unsupported.
> In addition, the `base2` format is debatable, as discussed 
> [here](https://github.com/apache/spark/pull/35415#discussion_r805578036). We 
> may exclude it for now, following what Snowflake's 
> [to_binary](https://docs.snowflake.com/en/sql-reference/functions/to_binary.html)
>  does.
>  



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38247) Unify the output of df.explain and "explain " if plan is command

2022-02-17 Thread yikf (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yikf resolved SPARK-38247.
--
Resolution: Invalid

> Unify the output of df.explain and "explain " if plan is command
> 
>
> Key: SPARK-38247
> URL: https://issues.apache.org/jira/browse/SPARK-38247
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> This PR aims to unify the output of df.explain and the "explain" SQL statement 
> if the plan is a command:
>  * unify the output of df.explain and the "explain" SQL statement if the plan is a command
>  * make the output of explain unambiguous if the plan is a command
>  
> Let's say we have a query like "show tables" and we want to explain it.
> Before this pr:
> {code:java}
> == Parsed Logical Plan ==
> 'ShowTables [namespace#62, tableName#63, isTemporary#64]
> +- 'UnresolvedNamespace
>
> == Analyzed Logical Plan ==
> namespace: string, tableName: string, isTemporary: boolean
> ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false
>
> == Optimized Logical Plan ==
> CommandResult [namespace#62, tableName#63, isTemporary#64], Execute ShowTablesCommand, [[default,lt,false], [default,people,false], [default,person,false], [default,rt,false], [default,t,false], [default,t1,false], [default,t10086,false], [default,txtsource,false]]
>    +- ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false
>
> == Physical Plan ==
> CommandResult [namespace#62, tableName#63, isTemporary#64]
>    +- Execute ShowTablesCommand
>          +- ShowTablesCommand default, [namespace#62, tableName#63, isTemporary#64], false {code}
> After this pr:
> {code:java}
> == Parsed Logical Plan ==
> 'ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- 'UnresolvedNamespace
>
> == Analyzed Logical Plan ==
> namespace: string, tableName: string, isTemporary: boolean
> ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- ResolvedNamespace V2SessionCatalog(spark_catalog), [default]
>
> == Optimized Logical Plan ==
> ShowTables [namespace#0, tableName#1, isTemporary#2]
> +- ResolvedNamespace V2SessionCatalog(spark_catalog), [default]
>
> == Physical Plan ==
> ShowTables [namespace#0, tableName#1, isTemporary#2], V2SessionCatalog(spark_catalog), [default] {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38138) Materialize QueryPlan subqueries

2022-02-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38138.
-
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35438
[https://github.com/apache/spark/pull/35438]

> Materialize QueryPlan subqueries
> 
>
> Key: SPARK-38138
> URL: https://issues.apache.org/jira/browse/SPARK-38138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38138) Materialize QueryPlan subqueries

2022-02-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38138:
---

Assignee: Cheng Pan

> Materialize QueryPlan subqueries
> 
>
> Key: SPARK-38138
> URL: https://issues.apache.org/jira/browse/SPARK-38138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Cheng Pan
>Assignee: Cheng Pan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32975) Add config for driver readiness timeout before executors start

2022-02-17 Thread Abhijeet Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494396#comment-17494396
 ] 

Abhijeet Singh commented on SPARK-32975:


Though the issue points to the driver and the fix is related to a driver config, 
I was getting the same error because sidecar injection was happening on the 
executor pod and the sidecar container was taking more time to initialize than 
the executor container.

I was getting a connection refused exception because the sidecar container was 
not ready while the executor was trying to communicate. I resolved it by adding 
a sleep/wait time in entrypoint.sh for the executor, but it would be neat to 
have a `spark.k8s` config which allows setting the wait time.

> Add config for driver readiness timeout before executors start
> --
>
> Key: SPARK-32975
> URL: https://issues.apache.org/jira/browse/SPARK-32975
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.4, 3.0.2, 3.1.2, 3.2.0
>Reporter: Shenson Joseph
>Assignee: Chris Wu
>Priority: Major
> Fix For: 3.2.0, 3.1.3
>
>
> We are using the v1beta2-1.1.2-2.4.5 version of the operator with spark-2.4.4.
> Spark executors keep getting killed with exit code 1, and we are seeing the 
> following exception in the executor, which goes to an error state. Once this 
> error happens, the driver doesn't restart the executor.
>  
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> ... 4 more
> Caused by: java.io.IOException: Failed to connect to 
> act-pipeline-app-1600187491917-driver-svc.default.svc:7078
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
> at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException: 
> act-pipeline-app-1600187491917-driver-svc.default.svc
> at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
> at java.net.InetAddress.getAllByName(InetAddress.java:1193)
> at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> at java.net.InetAddress.getByName(InetAddress.java:1077)
> at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
> at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
> at java.security.AccessController.doPrivileged(Native Method)
> at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
> at 
> io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
> at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
> at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
> at 
> io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
> at 
> io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
> at 
> io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
> at io.netty.bootstrap.

[jira] [Comment Edited] (SPARK-32975) Add config for driver readiness timeout before executors start

2022-02-17 Thread Abhijeet Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494396#comment-17494396
 ] 

Abhijeet Singh edited comment on SPARK-32975 at 2/18/22, 6:52 AM:
--

Though the issue points to the driver and the fix is related to a driver config, 
I was getting the same error because sidecar injection was happening on the 
executor pod and the sidecar container was taking more time to initialize than 
the executor container.

I resolved it by adding a sleep/wait time in entrypoint.sh for the executor, but 
it would be neat to have a spark.kubernetes.allocation.executor.readinessWait 
config which allows setting the wait time.


was (Author: singh-abhijeet):
Though the issue points to the driver and the fix is related to a driver config, 
I was getting the same error because sidecar injection was happening on the 
executor pod and the sidecar container was taking more time to initialize than 
the executor container.

I was getting a connection refused exception because the sidecar container was 
not ready while the executor was trying to communicate. I resolved it by adding 
a sleep/wait time in entrypoint.sh for the executor, but it would be neat to 
have a `spark.k8s` config which allows setting the wait time.

> Add config for driver readiness timeout before executors start
> --
>
> Key: SPARK-32975
> URL: https://issues.apache.org/jira/browse/SPARK-32975
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.4, 3.0.2, 3.1.2, 3.2.0
>Reporter: Shenson Joseph
>Assignee: Chris Wu
>Priority: Major
> Fix For: 3.2.0, 3.1.3
>
>
> We are using the v1beta2-1.1.2-2.4.5 version of the operator with spark-2.4.4.
> Spark executors keep getting killed with exit code 1, and we are seeing the 
> following exception in the executor, which goes to an error state. Once this 
> error happens, the driver doesn't restart the executor.
>  
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> ... 4 more
> Caused by: java.io.IOException: Failed to connect to 
> act-pipeline-app-1600187491917-driver-svc.default.svc:7078
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
> at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException: 
> act-pipeline-app-1600187491917-driver-svc.default.svc
> at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
> at java.net.InetAddress.getAllByName(InetAddress.java:1193)
> at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> at java.net.InetAddress.getByName(InetAddress.java:1077)
> at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
> at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
> at java.security.AccessController.doPrivileged(Native Method)
> at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
> at 
> io.netty.resol

[jira] [Comment Edited] (SPARK-32975) Add config for driver readiness timeout before executors start

2022-02-17 Thread Abhijeet Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494396#comment-17494396
 ] 

Abhijeet Singh edited comment on SPARK-32975 at 2/18/22, 6:52 AM:
--

Though the issue points to the driver and the fix is related to a driver config, 
I was getting the same error because sidecar injection was happening on the 
executor pod and the sidecar container was taking more time to initialize than 
the executor container.

I resolved it by adding a sleep/wait time in entrypoint.sh for the executor, but 
it would be neat to have a _spark.kubernetes.allocation.executor.readinessWait_ 
config which allows setting the wait time.


was (Author: singh-abhijeet):
Though the issue points to driver and the fix is related to a driver config, 
but I was getting the same error because sidecar injection was happening to 
executor pod and sidecar container was taking more time to initialize than the 
exec container.

 

I resolved it by adding a sleep/wait time in entrypoint.sh for exec, but it 
would be neat to have a spark.kubernetes.allocation.executor.readinessWait 
config which allows to set wait time.
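
For illustration only, here is a sketch of how the proposed setting might be 
passed to an application if it were ever added. The key below is just the name 
suggested in this comment; it does not exist in any Spark release, so setting 
it today has no effect:
{code:scala}
import org.apache.spark.sql.SparkSession

// Hypothetical: "spark.kubernetes.allocation.executor.readinessWait" is only a
// proposed name from this comment, not a real Spark config today.
val spark = SparkSession.builder()
  .appName("executor-readiness-wait-sketch")
  .config("spark.kubernetes.allocation.executor.readinessWait", "30s")
  .getOrCreate()
{code}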

> Add config for driver readiness timeout before executors start
> --
>
> Key: SPARK-32975
> URL: https://issues.apache.org/jira/browse/SPARK-32975
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.4.4, 3.0.2, 3.1.2, 3.2.0
>Reporter: Shenson Joseph
>Assignee: Chris Wu
>Priority: Major
> Fix For: 3.2.0, 3.1.3
>
>
> We are using version v1beta2-1.1.2-2.4.5 of the operator with spark-2.4.4.
> Spark executors keep getting killed with exit code 1, and we see the following 
> exception in the executor, which then goes into an error state. Once this 
> error happens, the driver doesn't restart the executor. 
>  
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
> at 
> org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
> at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
> at 
> org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> ... 4 more
> Caused by: java.io.IOException: Failed to connect to 
> act-pipeline-app-1600187491917-driver-svc.default.svc:7078
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
> at 
> org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
> at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException: 
> act-pipeline-app-1600187491917-driver-svc.default.svc
> at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
> at java.net.InetAddress.getAllByName(InetAddress.java:1193)
> at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> at java.net.InetAddress.getByName(InetAddress.java:1077)
> at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
> at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
> at java.security.AccessController.doPrivileged(Native Method)
> at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
> at 
> io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
> at io.netty.r

[jira] [Updated] (SPARK-38237) Introduce a new config to require all cluster keys on Aggregate

2022-02-17 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-38237:
-
Summary: Introduce a new config to require all cluster keys on Aggregate  
(was: Rename back StatefulOpClusteredDistribution to HashClusteredDistribution)

> Introduce a new config to require all cluster keys on Aggregate
> ---
>
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We still find HashClusteredDistribution be useful for batch query as well. 
> For example, we had a case with lower parallelism than expected due to the 
> fact ClusteredDistribution is used for aggregation which matches with 
> HashPartitioning with sub-key groups (note that the technical parallelism 
> also depends on "cardinality" - picking sub-key groups means having less 
> cardinality).
> We propose to rename back HashClusteredDistribution with retaining NOTE for 
> stateful operator. The distribution should not be still touched anyway due to 
> the requirement of stateful operator, but can be co-used with batch case if 
> needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38237) Introduce a new config to require all cluster keys on Aggregate

2022-02-17 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-38237:
-
Description: 
We still find HashClusteredDistribution to be useful for batch queries as well. 
For example, we had a case with lower parallelism than expected because 
ClusteredDistribution, which is used for aggregation, can be satisfied by a 
HashPartitioning on a subset of the grouping keys (note that the actual 
parallelism also depends on "cardinality" - picking a subset of the keys means 
lower cardinality).

We propose to introduce a new config to require all cluster keys on Aggregate, 
leveraging HashClusteredDistribution. That said, we propose to rename 
HashClusteredDistribution back, retaining the NOTE for stateful operators. The 
distribution should still not be touched for stateful operators, given their 
requirements, but it can be co-used for batch cases if needed.

  was:
We still find HashClusteredDistribution be useful for batch query as well. For 
example, we had a case with lower parallelism than expected due to the fact 
ClusteredDistribution is used for aggregation which matches with 
HashPartitioning with sub-key groups (note that the technical parallelism also 
depends on "cardinality" - picking sub-key groups means having less 
cardinality).

We propose to rename back HashClusteredDistribution with retaining NOTE for 
stateful operator. The distribution should not be still touched anyway due to 
the requirement of stateful operator, but can be co-used with batch case if 
needed.


> Introduce a new config to require all cluster keys on Aggregate
> ---
>
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We still find HashClusteredDistribution be useful for batch query as well. 
> For example, we had a case with lower parallelism than expected due to the 
> fact ClusteredDistribution is used for aggregation which matches with 
> HashPartitioning with sub-key groups (note that the technical parallelism 
> also depends on "cardinality" - picking sub-key groups means having less 
> cardinality).
> We propose to introduce a new config to require all cluster keys on 
> Aggregate, leveraging HashClusteredDistribution. That said, we propose to 
> rename back HashClusteredDistribution with retaining NOTE for stateful 
> operator. The distribution should not be still touched anyway due to the 
> requirement of stateful operator, but can be co-used with batch case if 
> needed.
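
As a rough illustration of the parallelism issue described above (a sketch with 
made-up table and column names, using a spark-shell style `spark` session): when 
a table is laid out on a strict subset of the grouping keys, that layout can 
already satisfy ClusteredDistribution for the aggregation, so no exchange is 
added and the effective parallelism may be bounded by that subset's cardinality.
{code:scala}
// Sketch only: names are illustrative.
spark.range(0L, 1000000L)
  .selectExpr("id % 10 AS a", "id AS b")
  .write.bucketBy(8, "a").saveAsTable("bucketed_t")

// Grouping by (a, b): the scan already provides a hash layout on `a` alone,
// which can satisfy ClusteredDistribution(a, b), so the planner may skip the
// shuffle and run the aggregation with parallelism limited by the buckets on `a`.
spark.table("bucketed_t").groupBy("a", "b").count().explain()
{code}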



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong

2022-02-17 Thread Attila Zsolt Piros (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493745#comment-17493745
 ] 

Attila Zsolt Piros commented on SPARK-33206:


I am working on this; a PR will be opened soon.

> Spark Shuffle Index Cache calculates memory usage wrong
> ---
>
> Key: SPARK-33206
> URL: https://issues.apache.org/jira/browse/SPARK-33206
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Lars Francke
>Priority: Major
> Attachments: image001(1).png
>
>
> SPARK-21501 changed the spark shuffle index service to be based on memory 
> instead of the number of files.
> Unfortunately, there's a problem with the calculation which is based on size 
> information provided by `ShuffleIndexInformation`.
> It is based purely on the file size of the cached file on disk.
> We're running into OOMs with very small index files (~16 bytes on disk), but 
> the overhead of the ShuffleIndexInformation around this is much larger (e.g. 
> 184 bytes, see screenshot). We need to take this into account and should 
> probably add a fixed overhead of somewhere between 152 and 180 bytes 
> according to my tests. I'm not 100% sure what the correct number is and it'll 
> also depend on the architecture etc. so we can't be exact anyway.
> If we do that we can maybe get rid of the size field in 
> ShuffleIndexInformation to save a few more bytes per entry.
> In effect this means that for small files we use up about 70-100 times as 
> much memory as we intend to. Our NodeManagers OOM with 4GB and more of 
> indexShuffleCache.
>  
>  
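
A minimal sketch of the accounting change described above (the overhead 
constant below is only an assumption based on the measurements in the report, 
not a final value):
{code:scala}
// Assumed fixed per-entry JVM object overhead; the exact constant still needs
// to be measured and will vary with the architecture.
val entryOverheadBytes: Long = 176L

// Today the cache weighs an entry by the index file size alone; the idea is to
// add a fixed per-entry overhead so tiny files are no longer under-counted.
def cacheWeight(indexFileSizeBytes: Long): Long =
  indexFileSizeBytes + entryOverheadBytes

// A ~16-byte index file would be accounted as 192 bytes instead of 16.
println(cacheWeight(16L))
{code}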



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38236) Absolute file paths specified in create/alter table are treated as relative

2022-02-17 Thread Bo Zhang (Jira)
Bo Zhang created SPARK-38236:


 Summary: Absolute file paths specified in create/alter table are 
treated as relative
 Key: SPARK-38236
 URL: https://issues.apache.org/jira/browse/SPARK-38236
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1
Reporter: Bo Zhang


After https://github.com/apache/spark/pull/28527 we changed to creating tables 
under the database location when the specified table location is relative. 
However, the criterion used to determine whether a table location is relative 
or absolute is URI.isAbsolute, which basically checks whether the table 
location URI has a scheme defined. So table URIs like /table/path are treated 
as relative, and the scheme and authority of the database location URI are used 
to create the table. For example, when the database location URI is 
s3a://bucket/db, the table will be created at s3a://bucket/table/path, while it 
should instead be created under the file system defined in 
SessionCatalog.hadoopConf.

This also applies to ALTER TABLE.
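
To make the criterion concrete, this is what java.net.URI#isAbsolute returns 
for the kinds of locations mentioned above (a quick sketch; the paths are 
examples only):
{code:scala}
import java.net.URI

// isAbsolute only checks whether the URI has a scheme, so an absolute
// filesystem path without a scheme is reported as "not absolute".
new URI("/table/path").isAbsolute             // false -> currently treated as a relative location
new URI("s3a://bucket/table/path").isAbsolute // true
new URI("hdfs:///table/path").isAbsolute      // true
{code}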



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38236) Absolute file paths specified in create/alter table are treated as relative

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493763#comment-17493763
 ] 

Apache Spark commented on SPARK-38236:
--

User 'bozhang2820' has created a pull request for this issue:
https://github.com/apache/spark/pull/35462

> Absolute file paths specified in create/alter table are treated as relative
> ---
>
> Key: SPARK-38236
> URL: https://issues.apache.org/jira/browse/SPARK-38236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.2.0, 3.2.1
>Reporter: Bo Zhang
>Priority: Major
>
> After https://github.com/apache/spark/pull/28527 we change to create table 
> under the database location when the table location specified is relative. 
> However the criteria to determine if a table location is relative/absolute is 
> URI.isAbsolute, which basically checks if the table location URI has a scheme 
> defined. So table URIs like /table/path are treated as relative and the 
> scheme and authority of the database location URI are used to create the 
> table. For example, when the database location URI is s3a://bucket/db, the 
> table will be created at s3a://bucket/table/path, while it should be created 
> under the file system defined in SessionCatalog.hadoopConf instead.
> This also applies to alter table.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38236) Absolute file paths specified in create/alter table are treated as relative

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38236:


Assignee: (was: Apache Spark)

> Absolute file paths specified in create/alter table are treated as relative
> ---
>
> Key: SPARK-38236
> URL: https://issues.apache.org/jira/browse/SPARK-38236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.2.0, 3.2.1
>Reporter: Bo Zhang
>Priority: Major
>
> After https://github.com/apache/spark/pull/28527 we change to create table 
> under the database location when the table location specified is relative. 
> However the criteria to determine if a table location is relative/absolute is 
> URI.isAbsolute, which basically checks if the table location URI has a scheme 
> defined. So table URIs like /table/path are treated as relative and the 
> scheme and authority of the database location URI are used to create the 
> table. For example, when the database location URI is s3a://bucket/db, the 
> table will be created at s3a://bucket/table/path, while it should be created 
> under the file system defined in SessionCatalog.hadoopConf instead.
> This also applies to alter table.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38236) Absolute file paths specified in create/alter table are treated as relative

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38236:


Assignee: Apache Spark

> Absolute file paths specified in create/alter table are treated as relative
> ---
>
> Key: SPARK-38236
> URL: https://issues.apache.org/jira/browse/SPARK-38236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.2.0, 3.2.1
>Reporter: Bo Zhang
>Assignee: Apache Spark
>Priority: Major
>
> After https://github.com/apache/spark/pull/28527 we change to create table 
> under the database location when the table location specified is relative. 
> However the criteria to determine if a table location is relative/absolute is 
> URI.isAbsolute, which basically checks if the table location URI has a scheme 
> defined. So table URIs like /table/path are treated as relative and the 
> scheme and authority of the database location URI are used to create the 
> table. For example, when the database location URI is s3a://bucket/db, the 
> table will be created at s3a://bucket/table/path, while it should be created 
> under the file system defined in SessionCatalog.hadoopConf instead.
> This also applies to alter table.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38236) Absolute file paths specified in create/alter table are treated as relative

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38236?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493765#comment-17493765
 ] 

Apache Spark commented on SPARK-38236:
--

User 'bozhang2820' has created a pull request for this issue:
https://github.com/apache/spark/pull/35462

> Absolute file paths specified in create/alter table are treated as relative
> ---
>
> Key: SPARK-38236
> URL: https://issues.apache.org/jira/browse/SPARK-38236
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.1.2, 3.2.0, 3.2.1
>Reporter: Bo Zhang
>Priority: Major
>
> After https://github.com/apache/spark/pull/28527 we change to create table 
> under the database location when the table location specified is relative. 
> However the criteria to determine if a table location is relative/absolute is 
> URI.isAbsolute, which basically checks if the table location URI has a scheme 
> defined. So table URIs like /table/path are treated as relative and the 
> scheme and authority of the database location URI are used to create the 
> table. For example, when the database location URI is s3a://bucket/db, the 
> table will be created at s3a://bucket/table/path, while it should be created 
> under the file system defined in SessionCatalog.hadoopConf instead.
> This also applies to alter table.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38230:


Assignee: Apache Spark

> InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions 
> in most cases
> ---
>
> Key: SPARK-38230
> URL: https://issues.apache.org/jira/browse/SPARK-38230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: Coal Chan
>Assignee: Apache Spark
>Priority: Major
>
> In 
> `org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand`,
>  `sparkSession.sessionState.catalog.listPartitions` calls the 
> `org.apache.hadoop.hive.metastore.listPartitionsPsWithAuth` method of the Hive 
> metastore client, and that method issues multiple queries per partition 
> against the Hive metastore DB. So when you insert into a table that has many 
> partitions (e.g. 10k), it produces a very large number of metastore queries 
> (e.g. n * 10k), which puts a lot of strain on the database.
> In fact, `listPartitions` is only called to get the partition locations and 
> compute `customPartitionLocations`. But in most cases there are no custom 
> partition locations, so fetching just the partition names is enough and we can 
> call `listPartitionNames` instead.
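
A rough sketch of the suggested direction (simplified: SessionCatalog is 
Spark-internal API, and the real command also passes a partial partition spec 
and compares locations against the default table path):
{code:scala}
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.SessionCatalog

// Simplified sketch of the two lookup paths inside the write command.
def partitionLookupSketch(catalog: SessionCatalog, table: TableIdentifier): Unit = {
  // Today: full partition metadata is fetched only to derive custom partition
  // locations, which can trigger many Hive metastore queries for large tables.
  val customLocations = catalog.listPartitions(table).flatMap(_.storage.locationUri)

  // Idea: when there are no custom partition locations, fetching only the
  // partition names is sufficient and far cheaper on the metastore.
  val partitionNames = catalog.listPartitionNames(table)
}
{code}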



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38230:


Assignee: (was: Apache Spark)

> InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions 
> in most cases
> ---
>
> Key: SPARK-38230
> URL: https://issues.apache.org/jira/browse/SPARK-38230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: Coal Chan
>Priority: Major
>
> In 
> `org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand`,
>  `sparkSession.sessionState.catalog.listPartitions` will call method 
> `org.apache.hadoop.hive.metastore.listPartitionsPsWithAuth` of hive metastore 
> client, this method will produce multiple queries per partition on hive 
> metastore db. So when you insert into a table which has too many 
> partitions(ie: 10k), it will produce too many queries on hive metastore 
> db(ie: n * 10k = 10nk), it puts a lot of strain on the database.
> In fact, it calls method `listPartitions` in order to get locations of 
> partitions and get `customPartitionLocations`. But in most cases, we do not 
> have custom partitions, we can just get partition names, so we can call 
> method listPartitionNames.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493790#comment-17493790
 ] 

Apache Spark commented on SPARK-38230:
--

User 'coalchan' has created a pull request for this issue:
https://github.com/apache/spark/pull/35549

> InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions 
> in most cases
> ---
>
> Key: SPARK-38230
> URL: https://issues.apache.org/jira/browse/SPARK-38230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: Coal Chan
>Priority: Major
>
> In 
> `org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand`,
>  `sparkSession.sessionState.catalog.listPartitions` will call method 
> `org.apache.hadoop.hive.metastore.listPartitionsPsWithAuth` of hive metastore 
> client, this method will produce multiple queries per partition on hive 
> metastore db. So when you insert into a table which has too many 
> partitions(ie: 10k), it will produce too many queries on hive metastore 
> db(ie: n * 10k = 10nk), it puts a lot of strain on the database.
> In fact, it calls method `listPartitions` in order to get locations of 
> partitions and get `customPartitionLocations`. But in most cases, we do not 
> have custom partitions, we can just get partition names, so we can call 
> method listPartitionNames.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38237) Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

2022-02-17 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-38237:


 Summary: Rename back StatefulOpClusteredDistribution to 
HashClusteredDistribution
 Key: SPARK-38237
 URL: https://issues.apache.org/jira/browse/SPARK-38237
 Project: Spark
  Issue Type: Task
  Components: SQL, Structured Streaming
Affects Versions: 3.3.0
Reporter: Jungtaek Lim


We still find HashClusteredDistribution to be useful for batch queries as well. 
For example, we had a case with lower parallelism than expected because 
ClusteredDistribution, which is used for aggregation, can be satisfied by a 
HashPartitioning on a subset of the grouping keys (where the parallelism also 
depends on cardinality).

We propose to rename HashClusteredDistribution back, retaining the NOTE for 
stateful operators. The distribution should still not be touched for stateful 
operators, given their requirements, but it can be co-used for batch cases if 
needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38230) InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions in most cases

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493791#comment-17493791
 ] 

Apache Spark commented on SPARK-38230:
--

User 'coalchan' has created a pull request for this issue:
https://github.com/apache/spark/pull/35549

> InsertIntoHadoopFsRelationCommand unnecessarily fetches details of partitions 
> in most cases
> ---
>
> Key: SPARK-38230
> URL: https://issues.apache.org/jira/browse/SPARK-38230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2
>Reporter: Coal Chan
>Priority: Major
>
> In 
> `org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand`,
>  `sparkSession.sessionState.catalog.listPartitions` will call method 
> `org.apache.hadoop.hive.metastore.listPartitionsPsWithAuth` of hive metastore 
> client, this method will produce multiple queries per partition on hive 
> metastore db. So when you insert into a table which has too many 
> partitions(ie: 10k), it will produce too many queries on hive metastore 
> db(ie: n * 10k = 10nk), it puts a lot of strain on the database.
> In fact, it calls method `listPartitions` in order to get locations of 
> partitions and get `customPartitionLocations`. But in most cases, we do not 
> have custom partitions, we can just get partition names, so we can call 
> method listPartitionNames.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38237) Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

2022-02-17 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493792#comment-17493792
 ] 

Jungtaek Lim commented on SPARK-38237:
--

Will submit a PR soon.

> Rename back StatefulOpClusteredDistribution to HashClusteredDistribution
> 
>
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We still find HashClusteredDistribution be useful for batch query as well. 
> For example, we had a case with lower parallelism than expected due to the 
> fact ClusteredDistribution is used for aggregation which matches with 
> HashPartitioning with sub-key groups (where the parallelism also depends on 
> cardinality).
> We propose to rename back HashClusteredDistribution with retaining NOTE for 
> stateful operator. The distribution should not be still touched anyway due to 
> the requirement of stateful operator, but can be co-used with batch case if 
> needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38237) Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

2022-02-17 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-38237:
-
Description: 
We still find HashClusteredDistribution to be useful for batch queries as well. 
For example, we had a case with lower parallelism than expected because 
ClusteredDistribution, which is used for aggregation, can be satisfied by a 
HashPartitioning on a subset of the grouping keys (note that the actual 
parallelism also depends on "cardinality" - picking a subset of the keys means 
lower cardinality).

We propose to rename HashClusteredDistribution back, retaining the NOTE for 
stateful operators. The distribution should still not be touched for stateful 
operators, given their requirements, but it can be co-used for batch cases if 
needed.

  was:
We still find HashClusteredDistribution be useful for batch query as well. For 
example, we had a case with lower parallelism than expected due to the fact 
ClusteredDistribution is used for aggregation which matches with 
HashPartitioning with sub-key groups (where the parallelism also depends on 
cardinality).

We propose to rename back HashClusteredDistribution with retaining NOTE for 
stateful operator. The distribution should not be still touched anyway due to 
the requirement of stateful operator, but can be co-used with batch case if 
needed.


> Rename back StatefulOpClusteredDistribution to HashClusteredDistribution
> 
>
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We still find HashClusteredDistribution be useful for batch query as well. 
> For example, we had a case with lower parallelism than expected due to the 
> fact ClusteredDistribution is used for aggregation which matches with 
> HashPartitioning with sub-key groups (note that the technical parallelism 
> also depends on "cardinality" - picking sub-key groups means having less 
> cardinality).
> We propose to rename back HashClusteredDistribution with retaining NOTE for 
> stateful operator. The distribution should not be still touched anyway due to 
> the requirement of stateful operator, but can be co-used with batch case if 
> needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38238) Contains Join for Spark SQL

2022-02-17 Thread Wan Kun (Jira)
Wan Kun created SPARK-38238:
---

 Summary: Contains Join for Spark SQL
 Key: SPARK-38238
 URL: https://issues.apache.org/jira/browse/SPARK-38238
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: Wan Kun


Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
the following string contains query:
{code:sql}
SELECT a.text, b.pattern
FROM emails a
JOIN patterns b
ON a.text like concat('%', b.pattern, '%');
{code}
OR
{code:sql}
SELECT a.text, b.pattern
FROM emails a
JOIN patterns b
ON position(b.pattern, a.text) > 0;
{code}

If there are many patterns to match against the left table, the query may 
execute for a long time.

Actually this kind of join is called *Multi-Pattern String Matching* or 
*Multi-Way String Matching*, and many algorithms try to improve this kind of 
matching. One of the best-known is the [*Aho–Corasick 
algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm].

The basic idea for optimizing this query is to transform all the patterns into 
a trie and broadcast it, so that each row from the left table only needs to 
match its content against the trie once. 
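
A minimal sketch of the broadcast idea (simplified: a plain broadcast pattern 
list with a per-pattern scan instead of a real Aho–Corasick trie, spark-shell 
style, and the table/column names follow the example above):
{code:scala}
import org.apache.spark.sql.functions.{col, explode, udf}

// Collect the (small) pattern table and broadcast it to all executors.
val patterns: Array[String] =
  spark.table("patterns").select("pattern").collect().map(_.getString(0))
val bcPatterns = spark.sparkContext.broadcast(patterns)

// Each text is matched locally against the broadcast patterns; an Aho-Corasick
// trie would turn this per-pattern scan into a single pass over the text.
val matchedPatterns = udf { text: String =>
  if (text == null) Array.empty[String]
  else bcPatterns.value.filter(p => text.contains(p))
}

spark.table("emails")
  .withColumn("pattern", explode(matchedPatterns(col("text"))))
  .select("text", "pattern")
  .show(false)
{code}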




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38238) Contains Join for Spark SQL

2022-02-17 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-38238:

Description: 
Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
the following string contains query:
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON a.text like concat('%', b.pattern, '%');
{code}
OR
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON position(b.pattern, a.text) > 0;
{code}

If there are many patterns to match against the left table, the query may 
execute for a long time.

Actually this kind of join is called *Multi-Pattern String Matching* or 
*Multi-Way String Matching*, and many algorithms try to improve this kind of 
matching. One of the best-known is the [*Aho–Corasick 
algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm].

The basic idea for optimizing this query is to transform all the patterns into 
a trie and broadcast it, so that each row from the left table only needs to 
match its content against the trie once. 


  was:
Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
the following string contains query:
{code:sql}
SELECT a.text, b.pattern
FROM emails a
JOIN patterns b
ON a.text like concat('%', b.pattern, '%');
{code}
OR
{code:sql}
SELECT a.text, b.pattern
FROM emails a
JOIN patterns b
ON position(b.pattern, a.text) > 0;
{code}

If there are many patterns to match in the left table, the query many execute 
for a long time.

Actually this join is called *Multi-Pattern String Matching* or *Multi-Way 
String Matching*, and many algorithm trying to improve this matching. One of  
the famous algorithm called [*Aho–Corasick 
algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]

The basic idea to optimize this query is to transform all the patterns into a 
trie tree and broadcast it. So then each row from the left table only need to 
match its content to the trie tree once. 



> Contains Join for Spark SQL
> ---
>
> Key: SPARK-38238
> URL: https://issues.apache.org/jira/browse/SPARK-38238
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Priority: Major
>
> Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
> the following string contains query:
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON a.text like concat('%', b.pattern, '%');
> {code}
> OR
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON position(b.pattern, a.text) > 0;
> {code}
> If there are many patterns to match in the left table, the query many execute 
> for a long time.
> Actually this join is called *Multi-Pattern String Matching* or *Multi-Way 
> String Matching*, and many algorithm trying to improve this matching. One of  
> the famous algorithm called [*Aho–Corasick 
> algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]
> The basic idea to optimize this query is to transform all the patterns into a 
> trie tree and broadcast it. So then each row from the left table only need to 
> match its content to the trie tree once. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38238) Contains Join for Spark SQL

2022-02-17 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-38238:

Description: 
Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
the following string contains query:
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON a.text like concat('%', b.pattern, '%');
{code}
OR
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON position(b.pattern, a.text) > 0;
{code}
If there are many patterns to match against the left table, the query may 
execute for a long time.

Actually this kind of join is called *Multi-Pattern String Matching* or 
*Multi-Way String Matching*, and many algorithms try to improve this kind of 
matching. One of the best-known is the [*Aho–Corasick 
algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm].

The basic idea for optimizing this query is to transform all the patterns into 
a trie and broadcast it, so that each row from the left table only needs to 
match its content against the trie once.

The query will go from *O(M * N * m * n)* to *O(M * m * max(n))*
M = number of records in the fact table
N = number of records in the patterns table
m = row length of the fact table
n = row length of the patterns table


  was:
Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
the following string contains query:
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON a.text like concat('%', b.pattern, '%');
{code}
OR
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON position(b.pattern, a.text) > 0;
{code}

If there are many patterns to match in the left table, the query many execute 
for a long time.

Actually this join is called *Multi-Pattern String Matching* or *Multi-Way 
String Matching*, and many algorithm trying to improve this matching. One of  
the famous algorithm called [*Aho–Corasick 
algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]

The basic idea to optimize this query is to transform all the patterns into a 
trie tree and broadcast it. So then each row from the left table only need to 
match its content to the trie tree once. 



> Contains Join for Spark SQL
> ---
>
> Key: SPARK-38238
> URL: https://issues.apache.org/jira/browse/SPARK-38238
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Priority: Major
>
> Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
> the following string contains query:
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON a.text like concat('%', b.pattern, '%');
> {code}
> OR
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON position(b.pattern, a.text) > 0;
> {code}
> If there are many patterns to match in the left table, the query many execute 
> for a long time.
> Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way 
> String Matching{*}, and many algorithm trying to improve this matching. One 
> of the famous algorithm called [*Aho–Corasick 
> algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]
> The basic idea to optimize this query is to transform all the patterns into a 
> trie tree and broadcast it. So then each row from the left table only need to 
> match its content to the trie tree once.
> The query will go from *O(M * N * m * n)* to *O(M * m * max(n))*
> M = number of records in the fact table
> N = number of records in the patterns table
> m = row length of the fact table
> n = row length of the patterns table



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38238) Contains Join for Spark SQL

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38238:


Assignee: (was: Apache Spark)

> Contains Join for Spark SQL
> ---
>
> Key: SPARK-38238
> URL: https://issues.apache.org/jira/browse/SPARK-38238
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Priority: Major
>
> Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
> the following string contains query:
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON a.text like concat('%', b.pattern, '%');
> {code}
> OR
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON position(b.pattern, a.text) > 0;
> {code}
> If there are many patterns to match in the left table, the query many execute 
> for a long time.
> Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way 
> String Matching{*}, and many algorithm trying to improve this matching. One 
> of the famous algorithm called [*Aho–Corasick 
> algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]
> The basic idea to optimize this query is to transform all the patterns into a 
> trie tree and broadcast it. So then each row from the left table only need to 
> match its content to the trie tree once.
> The query will go from *O(M * N * m * n)* to *O(M * m * max( n ))*
> M = number of records in the fact table
> N = number of records in the patterns table
> m = row length of the fact table
> n = row length of the patterns table



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38238) Contains Join for Spark SQL

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493807#comment-17493807
 ] 

Apache Spark commented on SPARK-38238:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/35550

> Contains Join for Spark SQL
> ---
>
> Key: SPARK-38238
> URL: https://issues.apache.org/jira/browse/SPARK-38238
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Priority: Major
>
> Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
> the following string contains query:
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON a.text like concat('%', b.pattern, '%');
> {code}
> OR
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON position(b.pattern, a.text) > 0;
> {code}
> If there are many patterns to match in the left table, the query many execute 
> for a long time.
> Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way 
> String Matching{*}, and many algorithm trying to improve this matching. One 
> of the famous algorithm called [*Aho–Corasick 
> algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]
> The basic idea to optimize this query is to transform all the patterns into a 
> trie tree and broadcast it. So then each row from the left table only need to 
> match its content to the trie tree once.
> The query will go from *O(M * N * m * n)* to *O(M * m * max( n ))*
> M = number of records in the fact table
> N = number of records in the patterns table
> m = row length of the fact table
> n = row length of the patterns table



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38238) Contains Join for Spark SQL

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38238:


Assignee: Apache Spark

> Contains Join for Spark SQL
> ---
>
> Key: SPARK-38238
> URL: https://issues.apache.org/jira/browse/SPARK-38238
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Assignee: Apache Spark
>Priority: Major
>
> Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
> the following string contains query:
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON a.text like concat('%', b.pattern, '%');
> {code}
> OR
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON position(b.pattern, a.text) > 0;
> {code}
> If there are many patterns to match in the left table, the query many execute 
> for a long time.
> Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way 
> String Matching{*}, and many algorithm trying to improve this matching. One 
> of the famous algorithm called [*Aho–Corasick 
> algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]
> The basic idea to optimize this query is to transform all the patterns into a 
> trie tree and broadcast it. So then each row from the left table only need to 
> match its content to the trie tree once.
> The query will go from *O(M * N * m * n)* to *O(M * m * max( n ))*
> M = number of records in the fact table
> N = number of records in the patterns table
> m = row length of the fact table
> n = row length of the patterns table



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38238) Contains Join for Spark SQL

2022-02-17 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-38238:

Description: 
Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
the following string contains query:
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON a.text like concat('%', b.pattern, '%');
{code}
OR
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON position(b.pattern, a.text) > 0;
{code}
If there are many patterns to match against the left table, the query may 
execute for a long time.

Actually this kind of join is called *Multi-Pattern String Matching* or 
*Multi-Way String Matching*, and many algorithms try to improve this kind of 
matching. One of the best-known is the [*Aho–Corasick 
algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm].

The basic idea for optimizing this query is to transform all the patterns into 
a trie and broadcast it, so that each row from the left table only needs to 
match its content against the trie once.

The query will go from *O(M * N * m * n)* to *O(M * m * max( n ))*
M = number of records in the fact table
N = number of records in the patterns table
m = row length of the fact table
n = row length of the patterns table

  was:
Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
the following string contains query:
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON a.text like concat('%', b.pattern, '%');
{code}
OR
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON position(b.pattern, a.text) > 0;
{code}
If there are many patterns to match in the left table, the query many execute 
for a long time.

Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way 
String Matching{*}, and many algorithm trying to improve this matching. One of 
the famous algorithm called [*Aho–Corasick 
algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]

The basic idea to optimize this query is to transform all the patterns into a 
trie tree and broadcast it. So then each row from the left table only need to 
match its content to the trie tree once.

The query will go from *O(M * N * m * n)* to *O(M * m * max(n))*
M = number of records in the fact table
N = number of records in the patterns table
m = row length of the fact table
n = row length of the patterns table

sadf


> Contains Join for Spark SQL
> ---
>
> Key: SPARK-38238
> URL: https://issues.apache.org/jira/browse/SPARK-38238
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Priority: Major
>
> Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
> the following string contains query:
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON a.text like concat('%', b.pattern, '%');
> {code}
> OR
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON position(b.pattern, a.text) > 0;
> {code}
> If there are many patterns to match in the left table, the query many execute 
> for a long time.
> Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way 
> String Matching{*}, and many algorithm trying to improve this matching. One 
> of the famous algorithm called [*Aho–Corasick 
> algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]
> The basic idea to optimize this query is to transform all the patterns into a 
> trie tree and broadcast it. So then each row from the left table only need to 
> match its content to the trie tree once.
> The query will go from *O(M * N * m * n)* to *O(M * m * max( n ))*
> M = number of records in the fact table
> N = number of records in the patterns table
> m = row length of the fact table
> n = row length of the patterns table



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38238) Contains Join for Spark SQL

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493808#comment-17493808
 ] 

Apache Spark commented on SPARK-38238:
--

User 'wankunde' has created a pull request for this issue:
https://github.com/apache/spark/pull/35550

> Contains Join for Spark SQL
> ---
>
> Key: SPARK-38238
> URL: https://issues.apache.org/jira/browse/SPARK-38238
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Priority: Major
>
> Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
> the following string contains query:
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON a.text like concat('%', b.pattern, '%');
> {code}
> OR
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON position(b.pattern, a.text) > 0;
> {code}
> If there are many patterns to match in the left table, the query many execute 
> for a long time.
> Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way 
> String Matching{*}, and many algorithm trying to improve this matching. One 
> of the famous algorithm called [*Aho–Corasick 
> algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]
> The basic idea to optimize this query is to transform all the patterns into a 
> trie tree and broadcast it. So then each row from the left table only need to 
> match its content to the trie tree once.
> The query will go from *O(M * N * m * n)* to *O(M * m * max( n ))*
> M = number of records in the fact table
> N = number of records in the patterns table
> m = row length of the fact table
> n = row length of the patterns table



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38238) Contains Join for Spark SQL

2022-02-17 Thread Wan Kun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wan Kun updated SPARK-38238:

Description: 
Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
the following string contains query:
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON a.text like concat('%', b.pattern, '%');
{code}
OR
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON position(b.pattern, a.text) > 0;
{code}
If there are many patterns to match against the left table, the query may 
execute for a long time.

Actually this kind of join is called *Multi-Pattern String Matching* or 
*Multi-Way String Matching*, and many algorithms try to improve this kind of 
matching. One of the best-known is the [*Aho–Corasick 
algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm].

The basic idea for optimizing this kind of query is to transform all the 
patterns into a trie and broadcast it, so each row of the fact table only needs 
to match its content against the trie once.

The query will go from *O(M * N * m * n)* to *O(M * m * max( n ))*
M = number of records in the fact table
N = number of records in the patterns table
m = row length of the fact table
n = row length of the patterns table

  was:
Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
the following string contains query:
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON a.text like concat('%', b.pattern, '%');
{code}
OR
{code:sql}
SELECT a.text, b.pattern
FROM fact_table a
JOIN patterns b
ON position(b.pattern, a.text) > 0;
{code}
If there are many patterns to match in the left table, the query many execute 
for a long time.

Actually this join is called *Multi-Pattern String Matching* or {*}Multi-Way 
String Matching{*}, and many algorithm trying to improve this matching. One of 
the famous algorithm called [*Aho–Corasick 
algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]

The basic idea to optimize this query is to transform all the patterns into a 
trie tree and broadcast it. So then each row from the left table only need to 
match its content to the trie tree once.

The query will go from *O(M * N * m * n)* to *O(M * m * max( n ))*
M = number of records in the fact table
N = number of records in the patterns table
m = row length of the fact table
n = row length of the patterns table


> Contains Join for Spark SQL
> ---
>
> Key: SPARK-38238
> URL: https://issues.apache.org/jira/browse/SPARK-38238
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wan Kun
>Priority: Major
>
> Currently Spark SQL uses a Broadcast Nested Loop join when it has to execute 
> the following string contains query:
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON a.text like concat('%', b.pattern, '%');
> {code}
> OR
> {code:sql}
> SELECT a.text, b.pattern
> FROM fact_table a
> JOIN patterns b
> ON position(b.pattern, a.text) > 0;
> {code}
> If there are many patterns to match in the left table, the query many execute 
> for a long time.
> Actually this kind of join is called *Multi-Pattern String Matching* or 
> {*}Multi-Way String Matching{*}, and many algorithms try to improve this kind 
> of matching. One of the well-knowing algorithm is [*Aho–Corasick 
> algorithm*|https://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_algorithm]
> The basic idea of optimizing this kindof query is to transform all the 
> patterns into a trie tree and broadcast it. So each row of the fact table 
> only need to match its content against the trie tree once.
> The query will go from *O(M * N * m * n)* to *O(M * m * max( n ))*
> M = number of records in the fact table
> N = number of records in the patterns table
> m = row length of the fact table
> n = row length of the patterns table



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38229) Shouldn't check temp/external/ifNotExists with visitReplaceTable in the parser

2022-02-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38229.
-
  Assignee: yikf
Resolution: Fixed

> Shouldn't check temp/external/ifNotExists with visitReplaceTable in the parser
> ---
>
> Key: SPARK-38229
> URL: https://issues.apache.org/jira/browse/SPARK-38229
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Assignee: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> Spark does not support REPLACE TABLE syntax such as CREATE OR REPLACE 
> TEMPORARY TABLE ... / REPLACE EXTERNAL TABLE / REPLACE ... IF NOT EXISTS, so we 
> don't need to check these tokens in the parser.
>  
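
For context, a quick illustration of the kind of statement involved (a sketch 
only, spark-shell style; since the grammar defines no REPLACE TEMPORARY/EXTERNAL 
variants, statements like this are expected to be rejected at parse time rather 
than by these checks):
{code:scala}
// Illustrative only: Spark's SQL grammar has no REPLACE TEMPORARY TABLE form,
// so a statement like this is expected to fail while parsing.
spark.sql("CREATE OR REPLACE TEMPORARY TABLE tmp_t (id INT) USING parquet")
{code}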



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns

2022-02-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-38216:
---

Assignee: yikf

> When creating a Hive table, fail early if all the columns are partitioned 
> columns
> -
>
> Key: SPARK-38216
> URL: https://issues.apache.org/jira/browse/SPARK-38216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Assignee: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> In Hive, the schema and the partition columns must be disjoint sets. If all of 
> a Hive table's columns are partition columns, the non-partition schema is empty 
> and table creation fails inside Hive with an error like:
> `
> throw new HiveException(
> "at least one column must be specified for the table")
> `
> So when creating such a Hive table, Spark should fail early, instead of waiting 
> for Hive to reject it.
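For illustration only (not taken from the ticket), and assuming the unified CREATE TABLE syntax where partition columns are drawn from the column list, the failing case looks roughly like this:
{code:scala}
// Hypothetical reproduction: every declared column is also a partition column,
// so the non-partition schema handed to Hive is empty and Hive throws
// "at least one column must be specified for the table".
spark.sql("""
  CREATE TABLE all_partitioned (id INT, dt STRING)
  USING hive
  PARTITIONED BY (id, dt)
""")
{code}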



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38216) When creating a Hive table, fail early if all the columns are partitioned columns

2022-02-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-38216.
-
Resolution: Fixed

Issue resolved by pull request 35527
[https://github.com/apache/spark/pull/35527]

> When creating a Hive table, fail early if all the columns are partitioned 
> columns
> -
>
> Key: SPARK-38216
> URL: https://issues.apache.org/jira/browse/SPARK-38216
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: yikf
>Assignee: yikf
>Priority: Minor
> Fix For: 3.3.0
>
>
> In Hive, the schema and the partition columns must be disjoint sets. If all of 
> a Hive table's columns are partition columns, the non-partition schema is empty 
> and table creation fails inside Hive with an error like:
> `
> throw new HiveException(
> "at least one column must be specified for the table")
> `
> So when creating such a Hive table, Spark should fail early, instead of waiting 
> for Hive to reject it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-37410) Inline type hints for python/pyspark/ml/recommendation.py

2022-02-17 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz reassigned SPARK-37410:
--

Assignee: Maciej Szymkiewicz

> Inline type hints for python/pyspark/ml/recommendation.py
> -
>
> Key: SPARK-37410
> URL: https://issues.apache.org/jira/browse/SPARK-37410
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Inline type hints from python/pyspark/ml/recommendation.pyi to 
> python/pyspark/ml/recommendation.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-37410) Inline type hints for python/pyspark/ml/recommendation.py

2022-02-17 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz resolved SPARK-37410.

Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35429
[https://github.com/apache/spark/pull/35429]

> Inline type hints for python/pyspark/ml/recommendation.py
> -
>
> Key: SPARK-37410
> URL: https://issues.apache.org/jira/browse/SPARK-37410
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.3.0
>
>
> Inline type hints from python/pyspark/ml/recommendation.pyi to 
> python/pyspark/ml/recommendation.py.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38237) Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38237:


Assignee: (was: Apache Spark)

> Rename back StatefulOpClusteredDistribution to HashClusteredDistribution
> 
>
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We still find HashClusteredDistribution useful for batch queries as well. 
> For example, we had a case with lower parallelism than expected because 
> ClusteredDistribution is used for aggregation and can be satisfied by a 
> HashPartitioning on a subset of the grouping keys (note that the effective 
> parallelism also depends on "cardinality": picking a sub-key group means 
> having less cardinality).
> We propose renaming it back to HashClusteredDistribution while retaining the 
> NOTE for stateful operators. The distribution itself still must not be changed, 
> due to the requirements of stateful operators, but it can be reused for batch 
> cases if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38237) Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38237:


Assignee: Apache Spark

> Rename back StatefulOpClusteredDistribution to HashClusteredDistribution
> 
>
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Assignee: Apache Spark
>Priority: Major
>
> We still find HashClusteredDistribution useful for batch queries as well. 
> For example, we had a case with lower parallelism than expected because 
> ClusteredDistribution is used for aggregation and can be satisfied by a 
> HashPartitioning on a subset of the grouping keys (note that the effective 
> parallelism also depends on "cardinality": picking a sub-key group means 
> having less cardinality).
> We propose renaming it back to HashClusteredDistribution while retaining the 
> NOTE for stateful operators. The distribution itself still must not be changed, 
> due to the requirements of stateful operators, but it can be reused for batch 
> cases if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38237) Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493869#comment-17493869
 ] 

Apache Spark commented on SPARK-38237:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/35551

> Rename back StatefulOpClusteredDistribution to HashClusteredDistribution
> 
>
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We still find HashClusteredDistribution useful for batch queries as well. 
> For example, we had a case with lower parallelism than expected because 
> ClusteredDistribution is used for aggregation and can be satisfied by a 
> HashPartitioning on a subset of the grouping keys (note that the effective 
> parallelism also depends on "cardinality": picking a sub-key group means 
> having less cardinality).
> We propose renaming it back to HashClusteredDistribution while retaining the 
> NOTE for stateful operators. The distribution itself still must not be changed, 
> due to the requirements of stateful operators, but it can be reused for batch 
> cases if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38237) Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493871#comment-17493871
 ] 

Apache Spark commented on SPARK-38237:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/35552

> Rename back StatefulOpClusteredDistribution to HashClusteredDistribution
> 
>
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We still find HashClusteredDistribution useful for batch queries as well. 
> For example, we had a case with lower parallelism than expected because 
> ClusteredDistribution is used for aggregation and can be satisfied by a 
> HashPartitioning on a subset of the grouping keys (note that the effective 
> parallelism also depends on "cardinality": picking a sub-key group means 
> having less cardinality).
> We propose renaming it back to HashClusteredDistribution while retaining the 
> NOTE for stateful operators. The distribution itself still must not be changed, 
> due to the requirements of stateful operators, but it can be reused for batch 
> cases if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] (SPARK-38237) Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

2022-02-17 Thread Jungtaek Lim (Jira)


[ https://issues.apache.org/jira/browse/SPARK-38237 ]


Jungtaek Lim deleted comment on SPARK-38237:
--

was (Author: apachespark):
User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/35551

> Rename back StatefulOpClusteredDistribution to HashClusteredDistribution
> 
>
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We still find HashClusteredDistribution useful for batch queries as well. 
> For example, we had a case with lower parallelism than expected because 
> ClusteredDistribution is used for aggregation and can be satisfied by a 
> HashPartitioning on a subset of the grouping keys (note that the effective 
> parallelism also depends on "cardinality": picking a sub-key group means 
> having less cardinality).
> We propose renaming it back to HashClusteredDistribution while retaining the 
> NOTE for stateful operators. The distribution itself still must not be changed, 
> due to the requirements of stateful operators, but it can be reused for batch 
> cases if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38237) Rename back StatefulOpClusteredDistribution to HashClusteredDistribution

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493872#comment-17493872
 ] 

Apache Spark commented on SPARK-38237:
--

User 'HeartSaVioR' has created a pull request for this issue:
https://github.com/apache/spark/pull/35552

> Rename back StatefulOpClusteredDistribution to HashClusteredDistribution
> 
>
> Key: SPARK-38237
> URL: https://issues.apache.org/jira/browse/SPARK-38237
> Project: Spark
>  Issue Type: Task
>  Components: SQL, Structured Streaming
>Affects Versions: 3.3.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> We still find HashClusteredDistribution useful for batch queries as well. 
> For example, we had a case with lower parallelism than expected because 
> ClusteredDistribution is used for aggregation and can be satisfied by a 
> HashPartitioning on a subset of the grouping keys (note that the effective 
> parallelism also depends on "cardinality": picking a sub-key group means 
> having less cardinality).
> We propose renaming it back to HashClusteredDistribution while retaining the 
> NOTE for stateful operators. The distribution itself still must not be changed, 
> due to the requirements of stateful operators, but it can be reused for batch 
> cases if needed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38188) Support queue scheduling (Introduce queue) with volcano implementations

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38188:


Assignee: (was: Apache Spark)

> Support queue scheduling (Introduce queue) with volcano implementations
> ---
>
> Key: SPARK-38188
> URL: https://issues.apache.org/jira/browse/SPARK-38188
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38188) Support queue scheduling (Introduce queue) with volcano implementations

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38188:


Assignee: Apache Spark

> Support queue scheduling (Introduce queue) with volcano implementations
> ---
>
> Key: SPARK-38188
> URL: https://issues.apache.org/jira/browse/SPARK-38188
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38188) Support queue scheduling (Introduce queue) with volcano implementations

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493911#comment-17493911
 ] 

Apache Spark commented on SPARK-38188:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35553

> Support queue scheduling (Introduce queue) with volcano implementations
> ---
>
> Key: SPARK-38188
> URL: https://issues.apache.org/jira/browse/SPARK-38188
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38239) AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'

2022-02-17 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-38239:
--

 Summary: AttributeError: 'LogisticRegressionModel' object has no 
attribute '_call_java'
 Key: SPARK-38239
 URL: https://issues.apache.org/jira/browse/SPARK-38239
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 3.2.0, 3.1.0, 3.0.0, 2.4.0, 3.3.0
Reporter: Maciej Szymkiewicz


Trying to invoke {{\_\_repr\_\_}} on 
{{pyspark.mllib.classification.LogisticRegressionModel}} leads to 
{{AttributeError}}:

{code:python}
>>> type(model)

>>> model
Traceback (most recent call last):
  File /path/to/python3.9/site-packages/IPython/core/formatters.py:698 in 
__call__
return repr(obj)
  File /path/to/spark/python/pyspark/mllib/classification.py:281 in __repr__
return self._call_java("toString")
AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'
{code}

This problem was introduced in SPARK-14712, where the method was added, with the 
same implementation, for both {{ml}} and {{mllib}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38239) AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38239:


Assignee: Apache Spark

> AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'
> --
>
> Key: SPARK-38239
> URL: https://issues.apache.org/jira/browse/SPARK-38239
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> Trying to invoke {{\_\_repr\_\_}} on 
> {{pyspark.mllib.classification.LogisticRegressionModel}} leads to 
> {{AttributeError}}:
> {code:python}
> >>> type(model)
> 
> >>> model
> Traceback (most recent call last):
>   File /path/to/python3.9/site-packages/IPython/core/formatters.py:698 in 
> __call__
> return repr(obj)
>   File /path/to/spark/python/pyspark/mllib/classification.py:281 in __repr__
> return self._call_java("toString")
> AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'
> {code}
> This problem was introduced in SPARK-14712, where the method was added, with 
> the same implementation, for both {{ml}} and {{mllib}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38239) AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38239:


Assignee: (was: Apache Spark)

> AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'
> --
>
> Key: SPARK-38239
> URL: https://issues.apache.org/jira/browse/SPARK-38239
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Trying to invoke {{\_\_repr\_\_}} on 
> {{pyspark.mllib.classification.LogisticRegressionModel}} leads to 
> {{AttributeError}}:
> {code:python}
> >>> type(model)
> 
> >>> model
> Traceback (most recent call last):
>   File /path/to/python3.9/site-packages/IPython/core/formatters.py:698 in 
> __call__
> return repr(obj)
>   File /path/to/spark/python/pyspark/mllib/classification.py:281 in __repr__
> return self._call_java("toString")
> AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'
> {code}
> This problem was introduced in SPARK-14712, where the method was added, with 
> the same implementation, for both {{ml}} and {{mllib}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38239) AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493918#comment-17493918
 ] 

Apache Spark commented on SPARK-38239:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35554

> AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'
> --
>
> Key: SPARK-38239
> URL: https://issues.apache.org/jira/browse/SPARK-38239
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Trying to invoke {{\_\_repr\_\_}} on 
> {{pyspark.mllib.classification.LogisticRegressionModel}} leads to 
> {{AttributeError}}:
> {code:python}
> >>> type(model)
> 
> >>> model
> Traceback (most recent call last):
>   File /path/to/python3.9/site-packages/IPython/core/formatters.py:698 in 
> __call__
> return repr(obj)
>   File /path/to/spark/python/pyspark/mllib/classification.py:281 in __repr__
> return self._call_java("toString")
> AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'
> {code}
> This problem was introduced in SPARK-14712, where the method was added, with 
> the same implementation, for both {{ml}} and {{mllib}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38239) AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493919#comment-17493919
 ] 

Apache Spark commented on SPARK-38239:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35554

> AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'
> --
>
> Key: SPARK-38239
> URL: https://issues.apache.org/jira/browse/SPARK-38239
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.4.0, 3.0.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> Trying to invoke {{\_\_repr\_\_}} on 
> {{pyspark.mllib.classification.LogisticRegressionModel}} leads to 
> {{AttributeError}}:
> {code:python}
> >>> type(model)
> 
> >>> model
> Traceback (most recent call last):
>   File /path/to/python3.9/site-packages/IPython/core/formatters.py:698 in 
> __call__
> return repr(obj)
>   File /path/to/spark/python/pyspark/mllib/classification.py:281 in __repr__
> return self._call_java("toString")
> AttributeError: 'LogisticRegressionModel' object has no attribute '_call_java'
> {code}
> This problem was introduced in SPARK-14712, where the method was added, with 
> the same implementation, for both {{ml}} and {{mllib}}.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38240) Improve RuntimeReplaceable and add a guideline for adding new functions

2022-02-17 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-38240:
---

 Summary: Improve RuntimeReplaceable and add a guideline for adding 
new functions
 Key: SPARK-38240
 URL: https://issues.apache.org/jira/browse/SPARK-38240
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38240) Improve RuntimeReplaceable and add a guideline for adding new functions

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38240:


Assignee: Apache Spark

> Improve RuntimeReplaceable and add a guideline for adding new functions
> ---
>
> Key: SPARK-38240
> URL: https://issues.apache.org/jira/browse/SPARK-38240
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38240) Improve RuntimeReplaceable and add a guideline for adding new functions

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493930#comment-17493930
 ] 

Apache Spark commented on SPARK-38240:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/35534

> Improve RuntimeReplaceable and add a guideline for adding new functions
> ---
>
> Key: SPARK-38240
> URL: https://issues.apache.org/jira/browse/SPARK-38240
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38240) Improve RuntimeReplaceable and add a guideline for adding new functions

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38240:


Assignee: (was: Apache Spark)

> Improve RuntimeReplaceable and add a guideline for adding new functions
> ---
>
> Key: SPARK-38240
> URL: https://issues.apache.org/jira/browse/SPARK-38240
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.3.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38241) Close KubernetesClient in K8S integrations tests

2022-02-17 Thread Martin Tzvetanov Grigorov (Jira)
Martin Tzvetanov Grigorov created SPARK-38241:
-

 Summary: Close KubernetesClient in K8S integrations tests
 Key: SPARK-38241
 URL: https://issues.apache.org/jira/browse/SPARK-38241
 Project: Spark
  Issue Type: Task
  Components: Kubernetes, Tests
Affects Versions: 3.2.1
Reporter: Martin Tzvetanov Grigorov


The implementations of 
org.apache.spark.deploy.k8s.integrationtest.backend.IntegrationTestBackend 
should close their KubernetesClient instance in #cleanUp() method.
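For illustration, a minimal sketch of the suggested change, assuming the fabric8 client API; the trait and class names only loosely mirror the test backends and are not the actual code:
{code:scala}
import io.fabric8.kubernetes.client.{DefaultKubernetesClient, KubernetesClient}

trait IntegrationTestBackend {
  def initialize(): Unit
  def getKubernetesClient: KubernetesClient
  def cleanUp(): Unit = {}
}

class MinikubeTestBackend extends IntegrationTestBackend {
  private var client: DefaultKubernetesClient = _

  override def initialize(): Unit = {
    // Builds a client from the local kube config.
    client = new DefaultKubernetesClient()
  }

  override def getKubernetesClient: KubernetesClient = client

  override def cleanUp(): Unit = {
    // Release the HTTP connection pool and other resources held by the client.
    if (client != null) client.close()
  }
}
{code}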



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38241) Close KubernetesClient in K8S integrations tests

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493950#comment-17493950
 ] 

Apache Spark commented on SPARK-38241:
--

User 'martin-g' has created a pull request for this issue:
https://github.com/apache/spark/pull/3

> Close KubernetesClient in K8S integrations tests
> 
>
> Key: SPARK-38241
> URL: https://issues.apache.org/jira/browse/SPARK-38241
> Project: Spark
>  Issue Type: Task
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Priority: Minor
>
> The implementations of 
> org.apache.spark.deploy.k8s.integrationtest.backend.IntegrationTestBackend 
> should close their KubernetesClient instance in #cleanUp() method.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38241) Close KubernetesClient in K8S integrations tests

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38241:


Assignee: Apache Spark

> Close KubernetesClient in K8S integrations tests
> 
>
> Key: SPARK-38241
> URL: https://issues.apache.org/jira/browse/SPARK-38241
> Project: Spark
>  Issue Type: Task
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Apache Spark
>Priority: Minor
>
> The implementations of 
> org.apache.spark.deploy.k8s.integrationtest.backend.IntegrationTestBackend 
> should close their KubernetesClient instance in #cleanUp() method.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38241) Close KubernetesClient in K8S integrations tests

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38241:


Assignee: (was: Apache Spark)

> Close KubernetesClient in K8S integrations tests
> 
>
> Key: SPARK-38241
> URL: https://issues.apache.org/jira/browse/SPARK-38241
> Project: Spark
>  Issue Type: Task
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Priority: Minor
>
> The implementations of 
> org.apache.spark.deploy.k8s.integrationtest.backend.IntegrationTestBackend 
> should close their KubernetesClient instance in #cleanUp() method.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38241) Close KubernetesClient in K8S integrations tests

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38241?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493952#comment-17493952
 ] 

Apache Spark commented on SPARK-38241:
--

User 'martin-g' has created a pull request for this issue:
https://github.com/apache/spark/pull/3

> Close KubernetesClient in K8S integrations tests
> 
>
> Key: SPARK-38241
> URL: https://issues.apache.org/jira/browse/SPARK-38241
> Project: Spark
>  Issue Type: Task
>  Components: Kubernetes, Tests
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Priority: Minor
>
> The implementations of 
> org.apache.spark.deploy.k8s.integrationtest.backend.IntegrationTestBackend 
> should close their KubernetesClient instance in #cleanUp() method.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38242) Sort the SparkSubmit debug output

2022-02-17 Thread Martin Tzvetanov Grigorov (Jira)
Martin Tzvetanov Grigorov created SPARK-38242:
-

 Summary: Sort the SparkSubmit debug output 
 Key: SPARK-38242
 URL: https://issues.apache.org/jira/browse/SPARK-38242
 Project: Spark
  Issue Type: Wish
  Components: Spark Submit
Affects Versions: 3.2.1
Reporter: Martin Tzvetanov Grigorov


When '--verbose' is passed to SparkSubmit it prints some useful debug 
information: Main class, Arguments and Spark config.

I find it a bit hard to find information there because the arguments/configs 
are printed in no particular order. I suggest sorting them before printing.

 
{code:java}
 Main class:
org.apache.spark.deploy.k8s.submit.KubernetesClientApplication
Arguments:
--main-class
--primary-java-resource
local:///opt/spark/examples/jars/spark-examples_2.13-3.3.0-SNAPSHOT.jar
org.apache.spark.examples.SparkPi
Spark config:
(spark.app.name,spark-on-k8s-app)
(spark.app.submitTime,1645106476125)
(spark.driver.cores,1)
(spark.driver.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true)
(spark.driver.memory,2048m)
(spark.dynamicAllocation.enabled,true)
(spark.dynamicAllocation.shuffleTracking.enabled,true)
(spark.executor.cores,2)
(spark.executor.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true)
(spark.executor.instances,3)
(spark.executor.memory,2048m)
(spark.jars,local:///opt/spark/examples/jars/spark-examples_2.13-3.3.0-SNAPSHOT.jar)
(spark.kubernetes.allocation.batch.delay,1)
(spark.kubernetes.allocation.batch.size,3)
(spark.kubernetes.authenticate.driver.serviceAccountName,spark-account-name)
(spark.kubernetes.driver.container.image,spark/spark:3.3.0-SNAPSHOT-scala_2.13-11-jre-slim)
(spark.kubernetes.executor.container.image,spark/spark:3.3.0-SNAPSHOT-scala_2.13-11-jre-slim)
(spark.kubernetes.namespace,spark-on-k8s)
(spark.master,k8s://https://192.168.49.2:8443)
(spark.network.timeout,300)
(spark.submit.deployMode,cluster)
(spark.submit.pyFiles,)
Classpath elements:{code}
 

The "Parsed arguments:" order is hardcoded at 
org.apache.spark.deploy.SparkSubmitArguments#toString, so they are still 
shuffled.
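A minimal sketch of the idea (not SparkSubmit's actual printing code): sort the (key, value) pairs by key before printing so related settings end up next to each other.
{code:scala}
val sparkConf: Seq[(String, String)] = Seq(
  ("spark.submit.deployMode", "cluster"),
  ("spark.app.name", "spark-on-k8s-app"),
  ("spark.driver.memory", "2048m"))

// Sorting by key groups the spark.driver.*, spark.kubernetes.*, ... entries together.
sparkConf.sortBy(_._1).foreach { case (k, v) => println(s"($k,$v)") }
// (spark.app.name,spark-on-k8s-app)
// (spark.driver.memory,2048m)
// (spark.submit.deployMode,cluster)
{code}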



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38242) Sort the SparkSubmit debug output

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493970#comment-17493970
 ] 

Apache Spark commented on SPARK-38242:
--

User 'martin-g' has created a pull request for this issue:
https://github.com/apache/spark/pull/35556

> Sort the SparkSubmit debug output 
> --
>
> Key: SPARK-38242
> URL: https://issues.apache.org/jira/browse/SPARK-38242
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Submit
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Priority: Minor
>
> When '--verbose' is passed to SparkSubmit it prints some useful debug 
> information: Main class, Arguments and Spark config.
> I find it a bit hard to find information there because the arguments/configs 
> are printed in no particular order. I suggest sorting them before printing.
>  
> {code:java}
>  Main class:
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication
> Arguments:
> --main-class
> --primary-java-resource
> local:///opt/spark/examples/jars/spark-examples_2.13-3.3.0-SNAPSHOT.jar
> org.apache.spark.examples.SparkPi
> Spark config:
> (spark.app.name,spark-on-k8s-app)
> (spark.app.submitTime,1645106476125)
> (spark.driver.cores,1)
> (spark.driver.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true)
> (spark.driver.memory,2048m)
> (spark.dynamicAllocation.enabled,true)
> (spark.dynamicAllocation.shuffleTracking.enabled,true)
> (spark.executor.cores,2)
> (spark.executor.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true)
> (spark.executor.instances,3)
> (spark.executor.memory,2048m)
> (spark.jars,local:///opt/spark/examples/jars/spark-examples_2.13-3.3.0-SNAPSHOT.jar)
> (spark.kubernetes.allocation.batch.delay,1)
> (spark.kubernetes.allocation.batch.size,3)
> (spark.kubernetes.authenticate.driver.serviceAccountName,spark-account-name)
> (spark.kubernetes.driver.container.image,spark/spark:3.3.0-SNAPSHOT-scala_2.13-11-jre-slim)
> (spark.kubernetes.executor.container.image,spark/spark:3.3.0-SNAPSHOT-scala_2.13-11-jre-slim)
> (spark.kubernetes.namespace,spark-on-k8s)
> (spark.master,k8s://https://192.168.49.2:8443)
> (spark.network.timeout,300)
> (spark.submit.deployMode,cluster)
> (spark.submit.pyFiles,)
> Classpath elements:{code}
>  
> The "Parsed arguments:" order is hardcoded at 
> org.apache.spark.deploy.SparkSubmitArguments#toString, so they are still 
> shuffled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38242) Sort the SparkSubmit debug output

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38242:


Assignee: Apache Spark

> Sort the SparkSubmit debug output 
> --
>
> Key: SPARK-38242
> URL: https://issues.apache.org/jira/browse/SPARK-38242
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Submit
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Apache Spark
>Priority: Minor
>
> When '--verbose' is passed to SparkSubmit it prints some useful debug 
> information: Main class, Arguments and Spark config.
> I find it a bit hard to find information there because the arguments/configs 
> are printed in no particular order. I suggest sorting them before printing.
>  
> {code:java}
>  Main class:
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication
> Arguments:
> --main-class
> --primary-java-resource
> local:///opt/spark/examples/jars/spark-examples_2.13-3.3.0-SNAPSHOT.jar
> org.apache.spark.examples.SparkPi
> Spark config:
> (spark.app.name,spark-on-k8s-app)
> (spark.app.submitTime,1645106476125)
> (spark.driver.cores,1)
> (spark.driver.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true)
> (spark.driver.memory,2048m)
> (spark.dynamicAllocation.enabled,true)
> (spark.dynamicAllocation.shuffleTracking.enabled,true)
> (spark.executor.cores,2)
> (spark.executor.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true)
> (spark.executor.instances,3)
> (spark.executor.memory,2048m)
> (spark.jars,local:///opt/spark/examples/jars/spark-examples_2.13-3.3.0-SNAPSHOT.jar)
> (spark.kubernetes.allocation.batch.delay,1)
> (spark.kubernetes.allocation.batch.size,3)
> (spark.kubernetes.authenticate.driver.serviceAccountName,spark-account-name)
> (spark.kubernetes.driver.container.image,spark/spark:3.3.0-SNAPSHOT-scala_2.13-11-jre-slim)
> (spark.kubernetes.executor.container.image,spark/spark:3.3.0-SNAPSHOT-scala_2.13-11-jre-slim)
> (spark.kubernetes.namespace,spark-on-k8s)
> (spark.master,k8s://https://192.168.49.2:8443)
> (spark.network.timeout,300)
> (spark.submit.deployMode,cluster)
> (spark.submit.pyFiles,)
> Classpath elements:{code}
>  
> The "Parsed arguments:" order is hardcoded at 
> org.apache.spark.deploy.SparkSubmitArguments#toString, so they are still 
> shuffled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38242) Sort the SparkSubmit debug output

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38242?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38242:


Assignee: (was: Apache Spark)

> Sort the SparkSubmit debug output 
> --
>
> Key: SPARK-38242
> URL: https://issues.apache.org/jira/browse/SPARK-38242
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Submit
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Priority: Minor
>
> When '--verbose' is passed to SparkSubmit it prints some useful debug 
> information: Main class, Arguments and Spark config.
> I find it a bit hard to find information there because the arguments/configs 
> are printed in no particular order. I suggest sorting them before printing.
>  
> {code:java}
>  Main class:
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication
> Arguments:
> --main-class
> --primary-java-resource
> local:///opt/spark/examples/jars/spark-examples_2.13-3.3.0-SNAPSHOT.jar
> org.apache.spark.examples.SparkPi
> Spark config:
> (spark.app.name,spark-on-k8s-app)
> (spark.app.submitTime,1645106476125)
> (spark.driver.cores,1)
> (spark.driver.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true)
> (spark.driver.memory,2048m)
> (spark.dynamicAllocation.enabled,true)
> (spark.dynamicAllocation.shuffleTracking.enabled,true)
> (spark.executor.cores,2)
> (spark.executor.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true)
> (spark.executor.instances,3)
> (spark.executor.memory,2048m)
> (spark.jars,local:///opt/spark/examples/jars/spark-examples_2.13-3.3.0-SNAPSHOT.jar)
> (spark.kubernetes.allocation.batch.delay,1)
> (spark.kubernetes.allocation.batch.size,3)
> (spark.kubernetes.authenticate.driver.serviceAccountName,spark-account-name)
> (spark.kubernetes.driver.container.image,spark/spark:3.3.0-SNAPSHOT-scala_2.13-11-jre-slim)
> (spark.kubernetes.executor.container.image,spark/spark:3.3.0-SNAPSHOT-scala_2.13-11-jre-slim)
> (spark.kubernetes.namespace,spark-on-k8s)
> (spark.master,k8s://https://192.168.49.2:8443)
> (spark.network.timeout,300)
> (spark.submit.deployMode,cluster)
> (spark.submit.pyFiles,)
> Classpath elements:{code}
>  
> The "Parsed arguments:" order is hardcoded at 
> org.apache.spark.deploy.SparkSubmitArguments#toString, so they are still 
> shuffled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38242) Sort the SparkSubmit debug output

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17493971#comment-17493971
 ] 

Apache Spark commented on SPARK-38242:
--

User 'martin-g' has created a pull request for this issue:
https://github.com/apache/spark/pull/35556

> Sort the SparkSubmit debug output 
> --
>
> Key: SPARK-38242
> URL: https://issues.apache.org/jira/browse/SPARK-38242
> Project: Spark
>  Issue Type: Wish
>  Components: Spark Submit
>Affects Versions: 3.2.1
>Reporter: Martin Tzvetanov Grigorov
>Priority: Minor
>
> When '--verbose' is passed to SparkSubmit it prints some useful debug 
> information: Main class, Arguments and Spark config.
> I find it a bit hard to find information there because the arguments/configs 
> are printed in no particular order. I suggest sorting them before printing.
>  
> {code:java}
>  Main class:
> org.apache.spark.deploy.k8s.submit.KubernetesClientApplication
> Arguments:
> --main-class
> --primary-java-resource
> local:///opt/spark/examples/jars/spark-examples_2.13-3.3.0-SNAPSHOT.jar
> org.apache.spark.examples.SparkPi
> Spark config:
> (spark.app.name,spark-on-k8s-app)
> (spark.app.submitTime,1645106476125)
> (spark.driver.cores,1)
> (spark.driver.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true)
> (spark.driver.memory,2048m)
> (spark.dynamicAllocation.enabled,true)
> (spark.dynamicAllocation.shuffleTracking.enabled,true)
> (spark.executor.cores,2)
> (spark.executor.extraJavaOptions,-Dio.netty.tryReflectionSetAccessible=true)
> (spark.executor.instances,3)
> (spark.executor.memory,2048m)
> (spark.jars,local:///opt/spark/examples/jars/spark-examples_2.13-3.3.0-SNAPSHOT.jar)
> (spark.kubernetes.allocation.batch.delay,1)
> (spark.kubernetes.allocation.batch.size,3)
> (spark.kubernetes.authenticate.driver.serviceAccountName,spark-account-name)
> (spark.kubernetes.driver.container.image,spark/spark:3.3.0-SNAPSHOT-scala_2.13-11-jre-slim)
> (spark.kubernetes.executor.container.image,spark/spark:3.3.0-SNAPSHOT-scala_2.13-11-jre-slim)
> (spark.kubernetes.namespace,spark-on-k8s)
> (spark.master,k8s://https://192.168.49.2:8443)
> (spark.network.timeout,300)
> (spark.submit.deployMode,cluster)
> (spark.submit.pyFiles,)
> Classpath elements:{code}
>  
> The "Parsed arguments:" order is hardcoded at 
> org.apache.spark.deploy.SparkSubmitArguments#toString, so they are still 
> shuffled.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38243) Unintended exception thrown in pyspark.ml.LogisticRegression.getThreshold

2022-02-17 Thread Maciej Szymkiewicz (Jira)
Maciej Szymkiewicz created SPARK-38243:
--

 Summary: Unintended exception thrown in 
pyspark.ml.LogisticRegression.getThreshold
 Key: SPARK-38243
 URL: https://issues.apache.org/jira/browse/SPARK-38243
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 3.2.0, 3.1.0, 2.4.0, 3.3.0
Reporter: Maciej Szymkiewicz


If {{LogisticRegression.getThreshold}} is called on a model having multiple 
thresholds, it is supposed to raise an exception:
{code:python}
ValueError: Logistic Regression getThreshold only applies to binary 
classification ...
{code}
However, {{thresholds}} ({{{}List[float]{}}}) are incorrectly passed to 
{{{}str.format{}}}, resulting in an unintended {{TypeError}}


{code:python}
>>> from pyspark.ml.classification import LogisticRegression
... 
... model = LogisticRegression(thresholds=[1.0, 2.0, 3.0])
>>> model.getThreshold()
Traceback (most recent call last):
  Input In [7] in 
    model.getThreshold()
  File ~/Workspace/spark/python/pyspark/ml/classification.py:1003 in 
getThreshold
    + ",".join(ts)
Type Error: sequence item 0: expected str instance, float found

{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38243) Unintended exception thrown in pyspark.ml.LogisticRegression.getThreshold

2022-02-17 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-38243:
---
Description: 
If {{LogisticRegression.getThreshold}} is called on a model having multiple 
thresholds, it is supposed to raise an exception:
{code:python}
ValueError: Logistic Regression getThreshold only applies to binary 
classification ...
{code}
However, {{thresholds}} ({{{}List[float]{}}}) are incorrectly passed to 
{{{}str.join{}}}, resulting in an unintended {{TypeError}}


{code:python}
>>> from pyspark.ml.classification import LogisticRegression
... 
... model = LogisticRegression(thresholds=[1.0, 2.0, 3.0])
>>> model.getThreshold()
Traceback (most recent call last):
  Input In [7] in 
    model.getThreshold()
  File ~/Workspace/spark/python/pyspark/ml/classification.py:1003 in 
getThreshold
    + ",".join(ts)
Type Error: sequence item 0: expected str instance, float found

{code}

  was:
If {{LogisticRegression.getThreshold}} is called on a model having multiple 
thresholds, it is supposed to raise an exception:
{code:python}
ValueError: Logistic Regression getThreshold only applies to binary 
classification ...
{code}
However, {{thresholds}} ({{{}List[float]{}}}) are incorrectly passed to 
{{{}str.format{}}}, resulting in an unintended {{TypeError}}


{code:python}
>>> from pyspark.ml.classification import LogisticRegression
... 
... model = LogisticRegression(thresholds=[1.0, 2.0, 3.0])
>>> model.getThreshold()
Traceback (most recent call last):
  Input In [7] in 
    model.getThreshold()
  File ~/Workspace/spark/python/pyspark/ml/classification.py:1003 in 
getThreshold
    + ",".join(ts)
Type Error: sequence item 0: expected str instance, float found

{code}


> Unintended exception thrown in pyspark.ml.LogisticRegression.getThreshold
> -
>
> Key: SPARK-38243
> URL: https://issues.apache.org/jira/browse/SPARK-38243
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> If {{LogisticRegression.getThreshold}} is called on a model having multiple 
> thresholds, it is supposed to raise an exception:
> {code:python}
> ValueError: Logistic Regression getThreshold only applies to binary 
> classification ...
> {code}
> However, {{thresholds}} ({{{}List[float]{}}}) are incorrectly passed to 
> {{{}str.join{}}}, resulting in an unintended {{TypeError}}
> {code:python}
> >>> from pyspark.ml.classification import LogisticRegression
> ... 
> ... model = LogisticRegression(thresholds=[1.0, 2.0, 3.0])
> >>> model.getThreshold()
> Traceback (most recent call last):
>   Input In [7] in 
>     model.getThreshold()
>   File ~/Workspace/spark/python/pyspark/ml/classification.py:1003 in 
> getThreshold
>     + ",".join(ts)
> Type Error: sequence item 0: expected str instance, float found
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-38244) Upgrade kubernetes-client to 5.12.1

2022-02-17 Thread Yikun Jiang (Jira)
Yikun Jiang created SPARK-38244:
---

 Summary: Upgrade kubernetes-client to 5.12.1
 Key: SPARK-38244
 URL: https://issues.apache.org/jira/browse/SPARK-38244
 Project: Spark
  Issue Type: Bug
  Components: Kubernetes
Affects Versions: 3.3.0
Reporter: Yikun Jiang






--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-38243) Unintended exception thrown in pyspark.ml.LogisticRegression.getThreshold

2022-02-17 Thread Maciej Szymkiewicz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maciej Szymkiewicz updated SPARK-38243:
---
Description: 
If {{LogisticRegression.getThreshold}} is called on a model having multiple 
thresholds, it is supposed to raise an exception:
{code:python}
ValueError: Logistic Regression getThreshold only applies to binary 
classification ...
{code}
However, {{thresholds}} ({{{}List[float]{}}}) are incorrectly passed to 
{{{}str.join{}}}, resulting in an unintended {{TypeError}}


{code:python}
>>> from pyspark.ml.classification import LogisticRegression
... 
... model = LogisticRegression(thresholds=[1.0, 2.0, 3.0])
>>> model.getThreshold()
Traceback (most recent call last):
  Input In [7] in 
    model.getThreshold()
  File /path/to/spark/python/pyspark/ml/classification.py:1003 in getThreshold
    + ",".join(ts)
Type Error: sequence item 0: expected str instance, float found

{code}

  was:
If {{LogisticRegression.getThreshold}} is called on a model having multiple 
thresholds, it is supposed to raise an exception:
{code:python}
ValueError: Logistic Regression getThreshold only applies to binary 
classification ...
{code}
However, {{thresholds}} ({{{}List[float]{}}}) are incorrectly passed to 
{{{}str.join{}}}, resulting in an unintended {{TypeError}}


{code:python}
>>> from pyspark.ml.classification import LogisticRegression
... 
... model = LogisticRegression(thresholds=[1.0, 2.0, 3.0])
>>> model.getThreshold()
Traceback (most recent call last):
  Input In [7] in 
    model.getThreshold()
  File ~/Workspace/spark/python/pyspark/ml/classification.py:1003 in 
getThreshold
    + ",".join(ts)
Type Error: sequence item 0: expected str instance, float found

{code}


> Unintended exception thrown in pyspark.ml.LogisticRegression.getThreshold
> -
>
> Key: SPARK-38243
> URL: https://issues.apache.org/jira/browse/SPARK-38243
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> If {{LogisticRegression.getThreshold}} is called on a model having multiple 
> thresholds, it is supposed to raise an exception:
> {code:python}
> ValueError: Logistic Regression getThreshold only applies to binary 
> classification ...
> {code}
> However, {{thresholds}} ({{{}List[float]{}}}) are incorrectly passed to 
> {{{}str.join{}}}, resulting in an unintended {{TypeError}}
> {code:python}
> >>> from pyspark.ml.classification import LogisticRegression
> ... 
> ... model = LogisticRegression(thresholds=[1.0, 2.0, 3.0])
> >>> model.getThreshold()
> Traceback (most recent call last):
>   Input In [7] in 
>     model.getThreshold()
>   File /path/to/spark/python/pyspark/ml/classification.py:1003 in getThreshold
>     + ",".join(ts)
> Type Error: sequence item 0: expected str instance, float found
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38244) Upgrade kubernetes-client to 5.12.1

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38244:


Assignee: (was: Apache Spark)

> Upgrade kubernetes-client to 5.12.1
> ---
>
> Key: SPARK-38244
> URL: https://issues.apache.org/jira/browse/SPARK-38244
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38244) Upgrade kubernetes-client to 5.12.1

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494034#comment-17494034
 ] 

Apache Spark commented on SPARK-38244:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35557

> Upgrade kubernetes-client to 5.12.1
> ---
>
> Key: SPARK-38244
> URL: https://issues.apache.org/jira/browse/SPARK-38244
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38244) Upgrade kubernetes-client to 5.12.1

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38244:


Assignee: Apache Spark

> Upgrade kubernetes-client to 5.12.1
> ---
>
> Key: SPARK-38244
> URL: https://issues.apache.org/jira/browse/SPARK-38244
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38244) Upgrade kubernetes-client to 5.12.1

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494033#comment-17494033
 ] 

Apache Spark commented on SPARK-38244:
--

User 'Yikun' has created a pull request for this issue:
https://github.com/apache/spark/pull/35557

> Upgrade kubernetes-client to 5.12.1
> ---
>
> Key: SPARK-38244
> URL: https://issues.apache.org/jira/browse/SPARK-38244
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38243) Unintended exception thrown in pyspark.ml.LogisticRegression.getThreshold

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38243:


Assignee: Apache Spark

> Unintended exception thrown in pyspark.ml.LogisticRegression.getThreshold
> -
>
> Key: SPARK-38243
> URL: https://issues.apache.org/jira/browse/SPARK-38243
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Assignee: Apache Spark
>Priority: Minor
>
> If {{LogisticRegression.getThreshold}} is called on a model that has multiple 
> thresholds, it is supposed to raise an exception,
> {code:python}
> ValueError: Logistic Regression getThreshold only applies to binary 
> classification ...
> {code}
> However, {{thresholds}} ({{{}List[float]{}}}) are incorrectly passed to 
> {{{}str.join{}}}, resulting in an unintended {{TypeError}}:
> {code:python}
> >>> from pyspark.ml.classification import LogisticRegression
> ... 
> ... model = LogisticRegression(thresholds=[1.0, 2.0, 3.0])
> >>> model.getThreshold()
> Traceback (most recent call last):
>   Input In [7] in 
>     model.getThreshold()
>   File /path/to/spark/python/pyspark/ml/classification.py:1003 in getThreshold
>     + ",".join(ts)
> TypeError: sequence item 0: expected str instance, float found
> {code}
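
For illustration, here is a minimal standalone sketch of the failing join and of what the corrected error path might look like. This is not the actual pyspark source; the guard, message wording, and return expression are assumptions about the intended behavior.

{code:python}
# Standalone sketch (not the actual pyspark code): reproduce the unintended
# TypeError, then show the intended ValueError with thresholds stringified.
ts = [1.0, 2.0, 3.0]  # stand-in for the model's thresholds param (List[float])

try:
    ",".join(ts)  # current code path: joining floats raises TypeError
except TypeError as exc:
    print("unintended:", exc)


def get_threshold(thresholds):
    # Sketch of the intended guard: stringify before joining so the documented
    # ValueError (not a TypeError) is what the caller sees.
    if len(thresholds) != 2:
        raise ValueError(
            "Logistic Regression getThreshold only applies to binary"
            " classification, but thresholds has length != 2. thresholds: "
            + ",".join(map(str, thresholds))
        )
    return 1.0 / (1.0 + thresholds[0] / thresholds[1])


try:
    get_threshold(ts)
except ValueError as exc:
    print("intended:", exc)
{code}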



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-38243) Unintended exception thrown in pyspark.ml.LogisticRegression.getThreshold

2022-02-17 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-38243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494036#comment-17494036
 ] 

Apache Spark commented on SPARK-38243:
--

User 'zero323' has created a pull request for this issue:
https://github.com/apache/spark/pull/35558

> Unintended exception thrown in pyspark.ml.LogisticRegression.getThreshold
> -
>
> Key: SPARK-38243
> URL: https://issues.apache.org/jira/browse/SPARK-38243
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> If {{LogisticRegression.getThreshold}} is called on a model that has multiple 
> thresholds, it is supposed to raise an exception,
> {code:python}
> ValueError: Logistic Regression getThreshold only applies to binary 
> classification ...
> {code}
> However, {{thresholds}} ({{{}List[float]{}}}) are incorrectly passed to 
> {{{}str.join{}}}, resulting in an unintended {{TypeError}}:
> {code:python}
> >>> from pyspark.ml.classification import LogisticRegression
> ... 
> ... model = LogisticRegression(thresholds=[1.0, 2.0, 3.0])
> >>> model.getThreshold()
> Traceback (most recent call last):
>   Input In [7] in 
>     model.getThreshold()
>   File /path/to/spark/python/pyspark/ml/classification.py:1003 in getThreshold
>     + ",".join(ts)
> TypeError: sequence item 0: expected str instance, float found
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38243) Unintended exception thrown in pyspark.ml.LogisticRegression.getThreshold

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-38243:


Assignee: (was: Apache Spark)

> Unintended exception thrown in pyspark.ml.LogisticRegression.getThreshold
> -
>
> Key: SPARK-38243
> URL: https://issues.apache.org/jira/browse/SPARK-38243
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.0, 3.1.0, 3.2.0, 3.3.0
>Reporter: Maciej Szymkiewicz
>Priority: Minor
>
> If {{LogisticRegression.getThreshold}} is called on a model that has multiple 
> thresholds, it is supposed to raise an exception,
> {code:python}
> ValueError: Logistic Regression getThreshold only applies to binary 
> classification ...
> {code}
> However, {{thresholds}} ({{{}List[float]{}}}) are incorrectly passed to 
> {{{}str.join{}}}, resulting in an unintended {{TypeError}}:
> {code:python}
> >>> from pyspark.ml.classification import LogisticRegression
> ... 
> ... model = LogisticRegression(thresholds=[1.0, 2.0, 3.0])
> >>> model.getThreshold()
> Traceback (most recent call last):
>   Input In [7] in 
>     model.getThreshold()
>   File /path/to/spark/python/pyspark/ml/classification.py:1003 in getThreshold
>     + ",".join(ts)
> TypeError: sequence item 0: expected str instance, float found
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33349) ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed

2022-02-17 Thread Kent (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17494121#comment-17494121
 ] 

Kent commented on SPARK-33349:
--

[~jkleckner] Just curious: when you see "too old resource" in the driver pod log, 
does your pod die off and get restarted?

Our driver pod just hangs and the logs stop moving...

The only way we have found to fix this is to manually kill the pod or restart the 
Spark job itself.

This must be affecting many others as well?

Maybe a custom fix is to have a watcher pod that looks for "too old resource" 
in the driver pod log and, if it appears, kills that driver pod, which would then 
hopefully keep the Spark job running...
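
For what it's worth, a rough sketch of that watcher idea follows. It assumes the official kubernetes Python client, in-cluster credentials with permission to read pod logs and delete pods, and placeholder pod/namespace names and poll interval; it also assumes the SparkApplication's restart policy will respawn the deleted driver.

{code:python}
# Rough sketch of a "too old resource" watcher; names and interval are placeholders.
import time

from kubernetes import client, config

DRIVER_POD = "my-spark-driver"   # hypothetical driver pod name
NAMESPACE = "kafka2hdfs"
MARKER = "too old resource"


def main():
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    v1 = client.CoreV1Api()
    while True:
        # Look at the tail of the driver log for the marker string.
        logs = v1.read_namespaced_pod_log(
            name=DRIVER_POD, namespace=NAMESPACE, tail_lines=500
        )
        if MARKER in logs:
            # Kill the stuck driver so the operator's restart policy can respawn it.
            v1.delete_namespaced_pod(name=DRIVER_POD, namespace=NAMESPACE)
        time.sleep(60)


if __name__ == "__main__":
    main()
{code}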

Thanks!

Kent

BTW I voted and will have others vote for this as well!

 

> ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed
> --
>
> Key: SPARK-33349
> URL: https://issues.apache.org/jira/browse/SPARK-33349
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.1, 3.0.2, 3.1.0
>Reporter: Nicola Bova
>Priority: Critical
>
> I launch my spark application with the 
> [spark-on-kubernetes-operator|https://github.com/GoogleCloudPlatform/spark-on-k8s-operator]
>  with the following yaml file:
> {code:yaml}
> apiVersion: sparkoperator.k8s.io/v1beta2
> kind: SparkApplication
> metadata:
>   name: spark-kafka-streamer-test
>   namespace: kafka2hdfs
> spec:
>   type: Scala
>   mode: cluster
>   image: /spark:3.0.2-SNAPSHOT-2.12-0.1.0
>   imagePullPolicy: Always
>   timeToLiveSeconds: 259200
>   mainClass: path.to.my.class.KafkaStreamer
>   mainApplicationFile: spark-kafka-streamer_2.12-spark300-assembly.jar
>   sparkVersion: 3.0.1
>   restartPolicy:
>     type: Always
>   sparkConf:
>     "spark.kafka.consumer.cache.capacity": "8192"
>     "spark.kubernetes.memoryOverheadFactor": "0.3"
>   deps:
>     jars:
>       - my
>       - jar
>       - list
>   hadoopConfigMap: hdfs-config
>   driver:
>     cores: 4
>     memory: 12g
>     labels:
>       version: 3.0.1
>     serviceAccount: default
>     javaOptions: "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties"
>   executor:
>     instances: 4
>     cores: 4
>     memory: 16g
>     labels:
>       version: 3.0.1
>     javaOptions: "-Dlog4j.configuration=file:///opt/spark/log4j/log4j.properties"
> {code}
>  I have tried with both Spark `3.0.1` and `3.0.2-SNAPSHOT` with the ["Restart 
> the watcher when we receive a version changed from 
> k8s"|https://github.com/apache/spark/pull/29533] patch.
> This is the driver log:
> {code}
> 20/11/04 12:16:02 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> ... // my app log, it's a structured streaming app reading from kafka and 
> writing to hdfs
> 20/11/04 13:12:12 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has 
> been closed (this is expected if the application is shutting down.)
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource 
> version: 1574101276 (1574213896)
>  at 
> io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259)
>  at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
>  at 
> okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
>  at 
> okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
>  at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
>  at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
>  at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203)
>  at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
>  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown 
> Source)
>  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown 
> Source)
>  at java.base/java.lang.Thread.run(Unknown Source)
> {code}
> The error above appears after roughly 50 minutes.
> After the exception above, no more logs are produced and the app hangs.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-38244) Upgrade kubernetes-client to 5.12.1

2022-02-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-38244.
---
Fix Version/s: 3.3.0
   Resolution: Fixed

Issue resolved by pull request 35557
[https://github.com/apache/spark/pull/35557]

> Upgrade kubernetes-client to 5.12.1
> ---
>
> Key: SPARK-38244
> URL: https://issues.apache.org/jira/browse/SPARK-38244
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
> Fix For: 3.3.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-38244) Upgrade kubernetes-client to 5.12.1

2022-02-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-38244:
-

Assignee: Yikun Jiang

> Upgrade kubernetes-client to 5.12.1
> ---
>
> Key: SPARK-38244
> URL: https://issues.apache.org/jira/browse/SPARK-38244
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.3.0
>Reporter: Yikun Jiang
>Assignee: Yikun Jiang
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33206) Spark Shuffle Index Cache calculates memory usage wrong

2022-02-17 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33206:


Assignee: Apache Spark

> Spark Shuffle Index Cache calculates memory usage wrong
> ---
>
> Key: SPARK-33206
> URL: https://issues.apache.org/jira/browse/SPARK-33206
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 2.4.0, 3.0.1
>Reporter: Lars Francke
>Assignee: Apache Spark
>Priority: Major
> Attachments: image001(1).png
>
>
> SPARK-21501 changed the Spark shuffle index service to account for memory 
> instead of the number of files.
> Unfortunately, there's a problem with the calculation, which is based on the size 
> information provided by `ShuffleIndexInformation`:
> it uses only the file size of the cached file on disk.
> We're running into OOMs with very small index files (~16 bytes on disk), but 
> the overhead of the ShuffleIndexInformation object around each one is much larger 
> (e.g. 184 bytes, see screenshot). We need to take this into account and should 
> probably add a fixed overhead of somewhere between 152 and 180 bytes 
> according to my tests. I'm not 100% sure what the correct number is, and it will 
> also depend on the architecture etc., so we can't be exact anyway.
> If we do that, we can maybe get rid of the size field in 
> ShuffleIndexInformation to save a few more bytes per entry.
> In effect this means that for small files we use up about 70-100 times as 
> much memory as we intend to. Our NodeManagers OOM with 4 GB and more of 
> indexShuffleCache.
>  
>  
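
To make the under-counting concrete, a toy calculation is sketched below. It is illustrative only: the cache budget is a hypothetical setting, and the 184-byte figure is just the single object's overhead quoted above, so the 70-100x estimate in the report implies the full retained footprint per entry is larger still.

{code:python}
# Toy illustration, not Spark code: the cache admits entries by on-disk file
# size, but each entry also drags along a fixed per-object overhead.
index_file_size = 16      # bytes on disk for a minimal shuffle index file
object_overhead = 184     # assumed overhead of one ShuffleIndexInformation (see report)

cache_budget_bytes = 100 * 1024 * 1024              # hypothetical cache setting
entries_admitted = cache_budget_bytes // index_file_size
retained = entries_admitted * (index_file_size + object_overhead)

print(f"admitted entries: {entries_admitted}")
print(f"retained heap: ~{retained / cache_budget_bytes:.1f}x the configured budget")
{code}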



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


