[jira] [Assigned] (SPARK-33904) Recognize `spark_catalog` in `saveAsTable()` and `insertInto()`

2020-12-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33904:
---

Assignee: Maxim Gekk

> Recognize `spark_catalog` in `saveAsTable()` and `insertInto()`
> ---
>
> Key: SPARK-33904
> URL: https://issues.apache.org/jira/browse/SPARK-33904
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> The v1 INSERT INTO command recognizes `spark_catalog` as the default session 
> catalog:
> {code:sql}
> spark-sql> create table spark_catalog.ns.tbl (c int);
> spark-sql> insert into spark_catalog.ns.tbl select 0;
> spark-sql> select * from spark_catalog.ns.tbl;
> 0
> {code}
> but the `saveAsTable()` and `insertInto()` methods don't allow writing to a 
> table with the explicitly specified catalog `spark_catalog`:
> {code:scala}
> scala> sql("CREATE NAMESPACE spark_catalog.ns")
> scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl")
> org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the 
> identifier spark_catalog.ns.tbl.
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:629)
>   ... 47 elided
> scala> Seq(0).toDF().write.insertInto("spark_catalog.ns.tbl")
> org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the 
> identifier spark_catalog.ns.tbl.
>   at 
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:498)
>   ... 47 elided
> {code}
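For context, here is a minimal sketch of the behavior this ticket targets, assuming the fix is in place (the namespace and table names are illustrative, not from the ticket):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("spark-catalog-demo").getOrCreate()
import spark.implicits._

// With the fix, an explicit `spark_catalog` prefix should resolve to the
// default session catalog, matching the v1 INSERT INTO behavior shown above.
spark.sql("CREATE NAMESPACE IF NOT EXISTS spark_catalog.ns")
Seq(0).toDF("c").write.saveAsTable("spark_catalog.ns.tbl")  // no AnalysisException expected
Seq(1).toDF("c").write.insertInto("spark_catalog.ns.tbl")
spark.table("spark_catalog.ns.tbl").show()
{code}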



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33904) Recognize `spark_catalog` in `saveAsTable()` and `insertInto()`

2020-12-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33904.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30919
[https://github.com/apache/spark/pull/30919]

> Recognize `spark_catalog` in `saveAsTable()` and `insertInto()`
> ---
>
> Key: SPARK-33904
> URL: https://issues.apache.org/jira/browse/SPARK-33904
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> The v1 INSERT INTO command recognizes `spark_catalog` as the default session 
> catalog:
> {code:sql}
> spark-sql> create table spark_catalog.ns.tbl (c int);
> spark-sql> insert into spark_catalog.ns.tbl select 0;
> spark-sql> select * from spark_catalog.ns.tbl;
> 0
> {code}
> but the `saveAsTable()` and `insertInto()` methods don't allow writing to a 
> table with the explicitly specified catalog `spark_catalog`:
> {code:scala}
> scala> sql("CREATE NAMESPACE spark_catalog.ns")
> scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl")
> org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the 
> identifier spark_catalog.ns.tbl.
>   at 
> org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:629)
>   ... 47 elided
> scala> Seq(0).toDF().write.insertInto("spark_catalog.ns.tbl")
> org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the 
> identifier spark_catalog.ns.tbl.
>   at 
> org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:498)
>   ... 47 elided
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33926) Improve the error message in resolving of DSv1 multi-part identifiers

2020-12-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33926.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30963
[https://github.com/apache/spark/pull/30963]

> Improve the error message in resolving of DSv1 multi-part identifiers
> -
>
> Key: SPARK-33926
> URL: https://issues.apache.org/jira/browse/SPARK-33926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> This is a follow up of 
> https://github.com/apache/spark/pull/30915#discussion_r549240857



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33926) Improve the error message in resolving of DSv1 multi-part identifiers

2020-12-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33926:
---

Assignee: Maxim Gekk

> Improve the error message in resolving of DSv1 multi-part identifiers
> -
>
> Key: SPARK-33926
> URL: https://issues.apache.org/jira/browse/SPARK-33926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> This is a follow up of 
> https://github.com/apache/spark/pull/30915#discussion_r549240857



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33940) allow configuring the max column name length in csv writer

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256357#comment-17256357
 ] 

Apache Spark commented on SPARK-33940:
--

User 'CodingCat' has created a pull request for this issue:
https://github.com/apache/spark/pull/30972

> allow configuring the max column name length in csv writer
> --
>
> Key: SPARK-33940
> URL: https://issues.apache.org/jira/browse/SPARK-33940
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nan Zhu
>Priority: Major
>
> The CSV writer has an implicit limit on column name length due to 
> univocity-parsers.
>  
> When we initialize a writer 
> [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211],
> it calls toIdentifierGroupArray, which eventually calls valueOf in 
> NormalizedString.java 
> ([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209]).
>  
> In that stringCache.get call there is a maxStringLength cap 
> [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104],
> which is 1024 by default.
>  
> We do not expose this as a configurable option, which leads to an NPE when a 
> column name is longer than 1024 characters:
>  
> ```
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
> [info]   at 
> org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CsvOutputWriter.scala:44)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
> [info]   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
> [info]   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
> ```
>  
> it could be reproduced by a simple unit test
>  
> ```
> val row1 = Row("a")
> val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
> val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
> df.repartition(1)
>  .write
>  .option("header", "true")
>  .option("maxColumnNameLength", 1025)
>  .csv(dataPath)
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33940) allow configuring the max column name length in csv writer

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33940:


Assignee: (was: Apache Spark)

> allow configuring the max column name length in csv writer
> --
>
> Key: SPARK-33940
> URL: https://issues.apache.org/jira/browse/SPARK-33940
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nan Zhu
>Priority: Major
>
> The CSV writer has an implicit limit on column name length due to 
> univocity-parsers.
>  
> When we initialize a writer 
> [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211],
> it calls toIdentifierGroupArray, which eventually calls valueOf in 
> NormalizedString.java 
> ([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209]).
>  
> In that stringCache.get call there is a maxStringLength cap 
> [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104],
> which is 1024 by default.
>  
> We do not expose this as a configurable option, which leads to an NPE when a 
> column name is longer than 1024 characters:
>  
> ```
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
> [info]   at 
> org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CsvOutputWriter.scala:44)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
> [info]   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
> [info]   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
> ```
>  
> it could be reproduced by a simple unit test
>  
> ```
> val row1 = Row("a")
> val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
> val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
> df.repartition(1)
>  .write
>  .option("header", "true")
>  .option("maxColumnNameLength", 1025)
>  .csv(dataPath)
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33940) allow configuring the max column name length in csv writer

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33940:


Assignee: Apache Spark

> allow configuring the max column name length in csv writer
> --
>
> Key: SPARK-33940
> URL: https://issues.apache.org/jira/browse/SPARK-33940
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Nan Zhu
>Assignee: Apache Spark
>Priority: Major
>
> The CSV writer has an implicit limit on column name length due to 
> univocity-parsers.
>  
> When we initialize a writer 
> [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211],
> it calls toIdentifierGroupArray, which eventually calls valueOf in 
> NormalizedString.java 
> ([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209]).
>  
> In that stringCache.get call there is a maxStringLength cap 
> [https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104],
> which is 1024 by default.
>  
> We do not expose this as a configurable option, which leads to an NPE when a 
> column name is longer than 1024 characters:
>  
> ```
> [info]   Cause: java.lang.NullPointerException:
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)
> [info]   at 
> com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)
> [info]   at 
> org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CsvOutputWriter.scala:44)
> [info]   at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)
> [info]   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
> [info]   at 
> org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:111)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)
> [info]   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)
> ```
>  
> it could be reproduced by a simple unit test
>  
> ```
> val row1 = Row("a")
> val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
> val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
> df.repartition(1)
>  .write
>  .option("header", "true")
>  .option("maxColumnNameLength", 1025)
>  .csv(dataPath)
> ```
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33940) allow configuring the max column name length in csv writer

2020-12-29 Thread Nan Zhu (Jira)
Nan Zhu created SPARK-33940:
---

 Summary: allow configuring the max column name length in csv writer
 Key: SPARK-33940
 URL: https://issues.apache.org/jira/browse/SPARK-33940
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Nan Zhu


The CSV writer has an implicit limit on column name length due to 
univocity-parsers.

When we initialize a writer 
[https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/AbstractWriter.java#L211],
it calls toIdentifierGroupArray, which eventually calls valueOf in 
NormalizedString.java 
([https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/NormalizedString.java#L205-L209]).

In that stringCache.get call there is a maxStringLength cap 
[https://github.com/uniVocity/univocity-parsers/blob/e09114c6879fa6c2c15e7365abc02cda3e193ff7/src/main/java/com/univocity/parsers/common/StringCache.java#L104],
which is 1024 by default.

We do not expose this as a configurable option, which leads to an NPE when a 
column name is longer than 1024 characters:

 

```

[info]   Cause: java.lang.NullPointerException:

[info]   at 
com.univocity.parsers.common.AbstractWriter.submitRow(AbstractWriter.java:349)

[info]   at 
com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:444)

[info]   at 
com.univocity.parsers.common.AbstractWriter.writeHeaders(AbstractWriter.java:410)

[info]   at 
org.apache.spark.sql.catalyst.csv.UnivocityGenerator.writeHeaders(UnivocityGenerator.scala:87)

[info]   at 
org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter$.writeHeaders(CsvOutputWriter.scala:58)

[info]   at 
org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.(CsvOutputWriter.scala:44)

[info]   at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:86)

[info]   at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)

[info]   at 
org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.(FileFormatDataWriter.scala:111)

[info]   at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:269)

[info]   at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:210)

```

 

it could be reproduced by a simple unit test

 

```

val row1 = Row("a")
val superLongHeader = (0 until 1025).map(_ => "c").mkString("")
val df = Seq(s"${row1.getString(0)}").toDF(superLongHeader)
df.repartition(1)
 .write
 .option("header", "true")
 .option("maxColumnNameLength", 1025)
 .csv(dataPath)

```
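A possible workaround, sketched below, is to shorten over-long column names before writing, since the cap applies to the header identifiers; this is not from the ticket, and the helper name is made up:

{code:scala}
import org.apache.spark.sql.DataFrame

// Rename any column whose name exceeds univocity's default 1024-character
// identifier cap. Note: naive truncation can create duplicate names, so this
// is only a sketch of the idea.
def truncateLongColumnNames(df: DataFrame, maxLen: Int = 1024): DataFrame = {
  val shortened = df.columns.map(c => if (c.length > maxLen) c.take(maxLen) else c)
  df.toDF(shortened: _*)
}

// usage, reusing `df` and `dataPath` from the repro above:
// truncateLongColumnNames(df).write.option("header", "true").csv(dataPath)
{code}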

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33927) Fix Spark Release image

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33927:


Assignee: Hyukjin Kwon

> Fix Spark Release image
> ---
>
> Key: SPARK-33927
> URL: https://issues.apache.org/jira/browse/SPARK-33927
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Hyukjin Kwon
>Priority: Blocker
>
> The release script seems to be broken. This is a blocker for Apache Spark 
> 3.1.0 release.
> {code}
> $ cd dev/create-release/spark-rm
> $ docker build -t spark-rm .
> ...
> exit code: 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33927) Fix Spark Release image

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33927.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30971
[https://github.com/apache/spark/pull/30971]

> Fix Spark Release image
> ---
>
> Key: SPARK-33927
> URL: https://issues.apache.org/jira/browse/SPARK-33927
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Hyukjin Kwon
>Priority: Blocker
> Fix For: 3.1.0
>
>
> The release script seems to be broken. This is a blocker for Apache Spark 
> 3.1.0 release.
> {code}
> $ cd dev/create-release/spark-rm
> $ docker build -t spark-rm .
> ...
> exit code: 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33907) Only prune columns of from_json if parsing options is empty

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256336#comment-17256336
 ] 

Apache Spark commented on SPARK-33907:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30970

> Only prune columns of from_json if parsing options is empty
> ---
>
> Key: SPARK-33907
> URL: https://issues.apache.org/jira/browse/SPARK-33907
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
> Fix For: 3.1.0
>
>
> For safety, we should only prune columns from a from_json expression if the 
> parsing options are empty.
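For illustration, a minimal sketch of the pattern the pruning rule targets (the DataFrame `df`, its `json` column, and the schema are made up):

{code:scala}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

val schema = new StructType().add("a", IntegerType).add("b", StringType)

// Only field `a` is selected and no parsing options are given,
// so pruning the parsed schema down to `a` is safe.
val pruned = df.select(from_json(col("json"), schema).getField("a"))

// Parsing options are non-empty (FAILFAST here), so pruning could change
// error-handling behavior; the optimizer should keep the full schema.
val notPruned = df.select(
  from_json(col("json"), schema, Map("mode" -> "FAILFAST")).getField("a"))
{code}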



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33927) Fix Spark Release image

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33927:


Assignee: (was: Apache Spark)

> Fix Spark Release image
> ---
>
> Key: SPARK-33927
> URL: https://issues.apache.org/jira/browse/SPARK-33927
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> The release script seems to be broken. This is a blocker for Apache Spark 
> 3.1.0 release.
> {code}
> $ cd dev/create-release/spark-rm
> $ docker build -t spark-rm .
> ...
> exit code: 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33927) Fix Spark Release image

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33927:


Assignee: Apache Spark

> Fix Spark Release image
> ---
>
> Key: SPARK-33927
> URL: https://issues.apache.org/jira/browse/SPARK-33927
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Blocker
>
> The release script seems to be broken. This is a blocker for Apache Spark 
> 3.1.0 release.
> {code}
> $ cd dev/create-release/spark-rm
> $ docker build -t spark-rm .
> ...
> exit code: 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33927) Fix Spark Release image

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256335#comment-17256335
 ] 

Apache Spark commented on SPARK-33927:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30971

> Fix Spark Release image
> ---
>
> Key: SPARK-33927
> URL: https://issues.apache.org/jira/browse/SPARK-33927
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> The release script seems to be broken. This is a blocker for Apache Spark 
> 3.1.0 release.
> {code}
> $ cd dev/create-release/spark-rm
> $ docker build -t spark-rm .
> ...
> exit code: 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33890) Improve the implement of trim/trimleft/trimright

2020-12-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33890.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30905
[https://github.com/apache/spark/pull/30905]

> Improve the implement of trim/trimleft/trimright
> 
>
> Key: SPARK-33890
> URL: https://issues.apache.org/jira/browse/SPARK-33890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
> Fix For: 3.2.0
>
>
> The current implementations of trim/trimleft/trimright are somewhat redundant.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33890) Improve the implement of trim/trimleft/trimright

2020-12-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33890:
---

Assignee: jiaan.geng

> Improve the implement of trim/trimleft/trimright
> 
>
> Key: SPARK-33890
> URL: https://issues.apache.org/jira/browse/SPARK-33890
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: jiaan.geng
>Assignee: jiaan.geng
>Priority: Major
>
> The current implementations of trim/trimleft/trimright are somewhat redundant.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'

2020-12-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32684:
---

Assignee: angerszhu

> Add a test case for hive serde/default-serde mode's null value '\\N'
> 
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
>
> Hive serde default NULL value is '\N'
> {code:java}
> String nullString = tbl.getProperty(
> serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
> nullSequence = new Text(nullString);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'

2020-12-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32684.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30946
[https://github.com/apache/spark/pull/30946]

> Add a test case for hive serde/default-serde mode's null value '\\N'
> 
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Minor
> Fix For: 3.2.0
>
>
> Hive serde default NULL value is '\N'
> {code:java}
> String nullString = tbl.getProperty(
> serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
> nullSequence = new Text(nullString);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33874) Spark may report PodRunning if there is a sidecar that has not exited

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33874.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/30892

> Spark may report PodRunning if there is a sidecar that has not exited
> -
>
> Key: SPARK-33874
> URL: https://issues.apache.org/jira/browse/SPARK-33874
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2, 3.1.0, 3.2.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Major
> Fix For: 3.1.0
>
>
> This is a continuation of SPARK-30821 which handles the situation where Spark 
> is still running but it may have sidecar containers that exited.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33938) Optimize Like Any/All by LikeSimplification

2020-12-29 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256285#comment-17256285
 ] 

jiaan.geng commented on SPARK-33938:


I'm working on it.

> Optimize Like Any/All by LikeSimplification 
> 
>
> Key: SPARK-33938
> URL: https://issues.apache.org/jira/browse/SPARK-33938
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Priority: Major
>
> We should optimize Like Any/All by LikeSimplification to improve performance.
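For illustration (the table, column, and patterns below are made up), these are the forms in question. LikeSimplification already rewrites single-pattern predicates such as `col LIKE 'abc%'` into a StartsWith check, and the idea is to extend that to the multi-pattern variants:

{code:scala}
// Multi-pattern LIKE predicates that could be simplified into
// StartsWith / EndsWith / Contains checks instead of per-pattern regex matching.
spark.sql("SELECT * FROM t WHERE col LIKE ANY ('abc%', 'def%')")
spark.sql("SELECT * FROM t WHERE col LIKE ALL ('%spark%', '%sql%')")
{code}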



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33932) Clean up KafkaOffsetReader API document

2020-12-29 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh reassigned SPARK-33932:
---

Assignee: L. C. Hsieh  (was: Apache Spark)

> Clean up KafkaOffsetReader API document
> ---
>
> Key: SPARK-33932
> URL: https://issues.apache.org/jira/browse/SPARK-33932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.2.0
>
>
> The KafkaOffsetReader API documentation is duplicated between 
> KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It would be better to 
> centralize it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31946) Failed to register SIGPWR handler on MacOS

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256249#comment-17256249
 ] 

Apache Spark commented on SPARK-31946:
--

User 'Ngone51' has created a pull request for this issue:
https://github.com/apache/spark/pull/30968

> Failed to register SIGPWR handler on MacOS
> --
>
> Key: SPARK-31946
> URL: https://issues.apache.org/jira/browse/SPARK-31946
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
> Environment: macOS 10.14.6
>Reporter: wuyi
>Priority: Major
>
>  
> {code:java}
> 20/06/09 22:54:54 WARN SignalUtils: Failed to register SIGPWR handler - 
> disabling decommission feature.
> java.lang.IllegalArgumentException: Unknown signal: PWR
>   at sun.misc.Signal.(Signal.java:143)
>   at 
> org.apache.spark.util.SignalUtils$.$anonfun$register$1(SignalUtils.scala:83)
>   at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
>   at org.apache.spark.util.SignalUtils$.register(SignalUtils.scala:81)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.onStart(CoarseGrainedExecutorBackend.scala:86)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> It seems that macOS is *POSIX* compliant, but SIGPWR is not specified in the 
> *POSIX* specification. See [https://en.wikipedia.org/wiki/Signal_(IPC)#SIGPWR]
>  
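For illustration only (this is not Spark's SignalUtils code), a minimal sketch of the defensive pattern the warning implies: try to register the handler and disable the feature when the platform rejects the signal:

{code:scala}
import sun.misc.{Signal, SignalHandler}
import scala.util.{Failure, Success, Try}

def tryRegister(signal: String)(action: => Unit): Boolean =
  Try {
    Signal.handle(new Signal(signal), new SignalHandler {
      override def handle(sig: Signal): Unit = action
    })
  } match {
    case Success(_) => true
    case Failure(e) =>
      // On macOS this is "java.lang.IllegalArgumentException: Unknown signal: PWR".
      Console.err.println(s"Failed to register SIG$signal handler: ${e.getMessage}")
      false
  }

// tryRegister("PWR") { /* trigger decommission */ }  // returns false on macOS
{code}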



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31946) Failed to register SIGPWR handler on MacOS

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31946:


Assignee: Apache Spark

> Failed to register SIGPWR handler on MacOS
> --
>
> Key: SPARK-31946
> URL: https://issues.apache.org/jira/browse/SPARK-31946
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
> Environment: macOS 10.14.6
>Reporter: wuyi
>Assignee: Apache Spark
>Priority: Major
>
>  
> {code:java}
> 20/06/09 22:54:54 WARN SignalUtils: Failed to register SIGPWR handler - 
> disabling decommission feature.
> java.lang.IllegalArgumentException: Unknown signal: PWR
>   at sun.misc.Signal.(Signal.java:143)
>   at 
> org.apache.spark.util.SignalUtils$.$anonfun$register$1(SignalUtils.scala:83)
>   at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
>   at org.apache.spark.util.SignalUtils$.register(SignalUtils.scala:81)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.onStart(CoarseGrainedExecutorBackend.scala:86)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> It seems that macOS is *POSIX* compliant, but SIGPWR is not specified in the 
> *POSIX* specification. See [https://en.wikipedia.org/wiki/Signal_(IPC)#SIGPWR]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31946) Failed to register SIGPWR handler on MacOS

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-31946:


Assignee: (was: Apache Spark)

> Failed to register SIGPWR handler on MacOS
> --
>
> Key: SPARK-31946
> URL: https://issues.apache.org/jira/browse/SPARK-31946
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.1.0
> Environment: macOS 10.14.6
>Reporter: wuyi
>Priority: Major
>
>  
> {code:java}
> 20/06/09 22:54:54 WARN SignalUtils: Failed to register SIGPWR handler - 
> disabling decommission feature.
> java.lang.IllegalArgumentException: Unknown signal: PWR
>   at sun.misc.Signal.(Signal.java:143)
>   at 
> org.apache.spark.util.SignalUtils$.$anonfun$register$1(SignalUtils.scala:83)
>   at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
>   at org.apache.spark.util.SignalUtils$.register(SignalUtils.scala:81)
>   at 
> org.apache.spark.executor.CoarseGrainedExecutorBackend.onStart(CoarseGrainedExecutorBackend.scala:86)
>   at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
>   at 
> org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
>   at 
> org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {code}
> It seems that macOS is *POSIX* compliant, but SIGPWR is not specified in the 
> *POSIX* specification. See [https://en.wikipedia.org/wiki/Signal_(IPC)#SIGPWR]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33939) Do not display the seed of uuid/shuffle if user not specify

2020-12-29 Thread ulysses you (Jira)
ulysses you created SPARK-33939:
---

 Summary: Do not display the seed of uuid/shuffle if user not 
specify
 Key: SPARK-33939
 URL: https://issues.apache.org/jira/browse/SPARK-33939
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: ulysses you


Keep the display behavior the same as rand/randn.
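For illustration (a hypothetical spark-shell snippet), the point is that the internally generated seed should not appear in the printed expression when the user did not pass one, mirroring how rand()/randn() are displayed:

{code:scala}
// rand() with no argument is displayed without its auto-generated seed;
// uuid() and shuffle() should be displayed the same way.
spark.range(1)
  .selectExpr("uuid()", "shuffle(array(1, 2, 3))", "rand()")
  .explain()
{code}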



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33938) Optimize Like Any/All by LikeSimplification

2020-12-29 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33938:
---

 Summary: Optimize Like Any/All by LikeSimplification 
 Key: SPARK-33938
 URL: https://issues.apache.org/jira/browse/SPARK-33938
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yuming Wang


We should optimize Like Any/All by LikeSimplification to improve performance.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33934) Support automatically identify the Python file and execute it

2020-12-29 Thread angerszhu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256228#comment-17256228
 ] 

angerszhu commented on SPARK-33934:
---

I will raise a PR soon.

> Support automatically identify the Python file and execute it
> -
>
> Key: SPARK-33934
> URL: https://issues.apache.org/jira/browse/SPARK-33934
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> In Hive script transform, we can use `USING xxx.py`, but in Spark we get an 
> error:
> {code:java}
> Job aborted due to stage failure: Task 17 in stage 530.0 failed 4 times, most 
> recent failure: Lost task 17.3 in stage 530.0 (TID 38639, host, executor 
> 339): org.apache.spark.SparkException: Subprocess exited with status 127. 
> Error: /bin/bash: xxx.py: can't find the command
> {code}
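For context, a hedged sketch of the Hive-style usage in question (the script path and source table are placeholders): the file is added to the session and then named directly in the USING clause. In Hive this works; in Spark it currently fails as shown above.

{code:scala}
// Placeholder script and table names; assumes a Hive-enabled session.
spark.sql("ADD FILE /tmp/test_script.py")
spark.sql("""
  SELECT TRANSFORM(key, value)
    USING 'test_script.py' AS (key string, value string)
  FROM src
""").show()
{code}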



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33932) Clean up KafkaOffsetReader API document

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33932.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/30961

> Clean up KafkaOffsetReader API document
> ---
>
> Key: SPARK-33932
> URL: https://issues.apache.org/jira/browse/SPARK-33932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 3.2.0
>
>
> The KafkaOffsetReader API documentation is duplicated between 
> KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It would be better to 
> centralize it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33937) Move the old partition data to trash instead of deleting it when inserting rewrite hive table

2020-12-29 Thread Jira


[ 
https://issues.apache.org/jira/browse/SPARK-33937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256196#comment-17256196
 ] 

黄海升 commented on SPARK-33937:
-

I'm sorry. It looks like it is.

> Move the old partition data to trash instead of deleting it when inserting 
> rewrite hive table
> -
>
> Key: SPARK-33937
> URL: https://issues.apache.org/jira/browse/SPARK-33937
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: 黄海升
>Priority: Minor
> Fix For: 3.2.0
>
>
> InsertIntoHiveTable should move the old partition data to trash instead of 
> deleting it, because that is what Hive does:
> [https://github.com/apache/hive/blob/9c6f8b76123c88b0c8a98645874722ba80b3c2b0/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java]
> `deleteOldPathForReplace`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33937) Move the old partition data to trash instead of deleting it when inserting rewrite hive table

2020-12-29 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-33937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

黄海升 resolved SPARK-33937.
-
Resolution: Duplicate

> Move the old partition data to trash instead of deleting it when inserting 
> rewrite hive table
> -
>
> Key: SPARK-33937
> URL: https://issues.apache.org/jira/browse/SPARK-33937
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: 黄海升
>Priority: Minor
> Fix For: 3.2.0
>
>
> InsertIntoHiveTable should move the old partition data to trash instead of 
> deleting it, because that is what Hive does:
> [https://github.com/apache/hive/blob/9c6f8b76123c88b0c8a98645874722ba80b3c2b0/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java]
> `deleteOldPathForReplace`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33937) Move the old partition data to trash instead of deleting it when inserting rewrite hive table

2020-12-29 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256192#comment-17256192
 ] 

Chao Sun commented on SPARK-33937:
--

This looks like a duplicate of SPARK-32480. 

> Move the old partition data to trash instead of deleting it when inserting 
> rewrite hive table
> -
>
> Key: SPARK-33937
> URL: https://issues.apache.org/jira/browse/SPARK-33937
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: 黄海升
>Priority: Minor
> Fix For: 3.2.0
>
>
> InsertIntoHiveTable should move the old partition data to trash instead of 
> deleting it, because that is what Hive does:
> [https://github.com/apache/hive/blob/9c6f8b76123c88b0c8a98645874722ba80b3c2b0/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java]
> `deleteOldPathForReplace`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33937) Move the old partition data to trash instead of deleting it when inserting rewrite hive table

2020-12-29 Thread Jira
黄海升 created SPARK-33937:
---

 Summary: Move the old partition data to trash instead of deleting 
it when inserting rewrite hive table
 Key: SPARK-33937
 URL: https://issues.apache.org/jira/browse/SPARK-33937
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: 黄海升
 Fix For: 3.2.0


InsertIntoHiveTable should move the old partition data to trash instead of 
deleting it, because that is what Hive does:

[https://github.com/apache/hive/blob/9c6f8b76123c88b0c8a98645874722ba80b3c2b0/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java]

`deleteOldPathForReplace`



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33635) Performance regression in Kafka read

2020-12-29 Thread David Wyles (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256046#comment-17256046
 ] 

David Wyles edited comment on SPARK-33635 at 12/29/20, 8:48 PM:


[~gsomogyi] I now have my results.
I was so unhappy about these results that I ran all the tests again. The only 
thing that changed between runs was the version of Spark on the cluster; 
everything else was static, including an unchanging input data set from Kafka.

Input-> *672733262* rows

+*Spark 2.4.5*:+

*440* seconds - *1,528,939* rows per second.

+*Spark 3.0.1*:+

*990* seconds - *679,528* rows per second.

These are multiple runs (I even took the best from spark 3.0.1)

I also captured the event logs between these two versions of spark - should 
anyone find them useful.

[event 
logs|https://drive.google.com/drive/folders/1aElmzVWmJqRALQimdOYxdJu559_3EX_9?usp=sharing]

So, no matter what I do, I can only conclude that Spark 2.4.5 was a lot faster 
in this test case.

Is Spark SQL reading the source data twice, just as it would if there was a 
"order by" in the query?

Sample code used (the `config` object holds the Kafka broker and topic settings):

import org.apache.spark.sql.{SaveMode, SparkSession}

val spark =
  SparkSession.builder.appName("Kafka Read Performance")
    .config("spark.executor.memory", "16g")
    .config("spark.cores.max", "10")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///tmp/spark-events")
    .config("spark.eventLog.overwrite", "true")
    .getOrCreate()

import spark.implicits._

val startTime = System.nanoTime()

// Read the whole topic as a batch source.
val df =
  spark
    .read
    .format("kafka")
    .option("kafka.bootstrap.servers", config.brokers)
    .option("subscribe", config.inTopic)
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .option("failOnDataLoss", "false")
    .load()

// Write every record back out to the target topic.
df
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", config.brokers)
  .option("topic", config.outTopic)
  .mode(SaveMode.Append)
  .save()

val endTime = System.nanoTime()

val elapsedSecs = (endTime - startTime) / 1E9

// static input sample was used, fixed row count.

println(s"Took $elapsedSecs secs")
spark.stop()

 



> Performance regression in Kafka read
> 
>
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5 node system. A simple data row of csv data in 
> kafka, evenly distributed between the partitions.
> Open JDK 1.8.0.252
> Spark in stand alone - 5 nodes, 10 workers (2 worker per node, each locked to 
> a distinct NUMA group)
> kafka (v 2.3.1) cluster - 5 nodes (1 broker per node).
> Centos 7.7.1908
> 1 topic, 10 partiions, 1 hour queue life
> (this is just one of the clusters we have; I have tested on all of them and 
> they all 

[jira] [Updated] (SPARK-33936) Add the version when connector methods and interfaces were added

2020-12-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33936:
--
Fix Version/s: (was: 3.2.0)
   3.1.0

> Add the version when connector methods and interfaces were added
> 
>
> Key: SPARK-33936
> URL: https://issues.apache.org/jira/browse/SPARK-33936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.1.0
>
>
> Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the 
> *connector* package. 
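For illustration, a hedged sketch of what such a tag looks like; the interface and method below are placeholders, not the actual connector API:

{code:scala}
trait ExampleConnectorInterface {
  /**
   * Describes what this method returns.
   *
   * @since 3.1.0
   */
  def description(): String
}
{code}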



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33936) Add the version when connector methods and interfaces were added

2020-12-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33936:
--
Fix Version/s: 3.2.0

> Add the version when connector methods and interfaces were added
> 
>
> Key: SPARK-33936
> URL: https://issues.apache.org/jira/browse/SPARK-33936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.1.0, 3.2.0
>
>
> Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the 
> *connector* package. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33936) Add the version when connector methods and interfaces were added

2020-12-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33936.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30966
[https://github.com/apache/spark/pull/30966]

> Add the version when connector methods and interfaces were added
> 
>
> Key: SPARK-33936
> URL: https://issues.apache.org/jira/browse/SPARK-33936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.2.0
>
>
> Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the 
> *connector* package. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33936) Add the version when connector methods and interfaces were added

2020-12-29 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33936:
-

Assignee: Maxim Gekk

> Add the version when connector methods and interfaces were added
> 
>
> Key: SPARK-33936
> URL: https://issues.apache.org/jira/browse/SPARK-33936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the 
> *connector* package. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33936) Add the version when connector methods and interfaces were added

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256116#comment-17256116
 ] 

Apache Spark commented on SPARK-33936:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30967

> Add the version when connector methods and interfaces were added
> 
>
> Key: SPARK-33936
> URL: https://issues.apache.org/jira/browse/SPARK-33936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the 
> *connector* package. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33915) Allow json expression to be pushable column

2020-12-29 Thread Ted Yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255389#comment-17255389
 ] 

Ted Yu edited comment on SPARK-33915 at 12/29/20, 6:51 PM:
---

Here is the plan prior to predicate pushdown:
{code}
2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - 
org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive 
execution enabled for plan: Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS 
phone#33]
   +- Filter (get_json_object(phone#37, $.phone) = 1200)
  +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
 - Cassandra Filters: []
 - Requested Columns: [id,address,phone]
{code}
Here is the plan with pushdown:
{code}
2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - 
org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive 
execution enabled for plan: Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) 
AS phone#33]
   +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
 - Cassandra Filters: [[phone->'phone' = ?, 1200]]
 - Requested Columns: [id,address,phone]

{code}


was (Author: yuzhih...@gmail.com):
Here is the plan prior to predicate pushdown:
{code}
2020-12-26 03:28:59,926 (Time-limited test) [DEBUG - 
org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive 
execution enabled for plan: Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS 
phone#33]
   +- Filter (get_json_object(phone#37, $.phone) = 1200)
  +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
 - Cassandra Filters: []
 - Requested Columns: [id,address,phone]
{code}
Here is the plan with pushdown:
{code}
2020-12-28 01:40:08,150 (Time-limited test) [DEBUG - 
org.apache.spark.internal.Logging.logDebug(Logging.scala:61)] Adaptive 
execution enabled for plan: Sort [id#34 ASC NULLS FIRST], true, 0
+- Project [id#34, address#35, phone#37, get_json_object(phone#37, $.code) AS 
phone#33]
   +- BatchScan[id#34, address#35, phone#37] Cassandra Scan: test.person
 - Cassandra Filters: [["`GetJsonObject(phone#37,$.phone)`" = ?, 1200]]
 - Requested Columns: [id,address,phone]
{code}

> Allow json expression to be pushable column
> ---
>
> Key: SPARK-33915
> URL: https://issues.apache.org/jira/browse/SPARK-33915
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.1
>Reporter: Ted Yu
>Priority: Major
>
> Currently PushableColumnBase provides no support for json / jsonb expression.
> Example of json expression:
> {code}
> get_json_object(phone, '$.code') = '1200'
> {code}
> If a non-string literal is part of the expression, the presence of cast() would 
> complicate the situation.
> The implication is that an implementation of SupportsPushDownFilters doesn't get 
> a chance to perform pushdown even if the third-party DB engine supports json 
> expression pushdown.
> This issue is for discussion and implementation of Spark core changes which 
> would allow a json expression to be recognized as a pushable column.
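For reference, a minimal, self-contained sketch of the expression shape being discussed 
(the data and column names below are made up to mirror the plans in the comment above):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, get_json_object}

val spark = SparkSession.builder().appName("json-pushdown-shape").master("local[*]").getOrCreate()
import spark.implicits._

// Local stand-in for the Cassandra table from the plans above; the data is made up.
val person = Seq(
  (1, "1 Main St", """{"code": "1200", "phone": "555-0100"}"""),
  (2, "2 Side St", """{"code": "1300", "phone": "555-0101"}""")
).toDF("id", "address", "phone")

// The predicate wraps the column in GetJsonObject, so it is not a plain column
// reference and is never offered to SupportsPushDownFilters for pushdown.
person.filter(get_json_object(col("phone"), "$.code") === "1200").explain(true)

spark.stop()
{code}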



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33936) Add the version when connector methods and interfaces were added

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33936:


Assignee: (was: Apache Spark)

> Add the version when connector methods and interfaces were added
> 
>
> Key: SPARK-33936
> URL: https://issues.apache.org/jira/browse/SPARK-33936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the 
> *connector* package. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33936) Add the version when connector methods and interfaces were added

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33936:


Assignee: Apache Spark

> Add the version when connector methods and interfaces were added
> 
>
> Key: SPARK-33936
> URL: https://issues.apache.org/jira/browse/SPARK-33936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Assignee: Apache Spark
>Priority: Minor
>
> Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the 
> *connector* package. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33936) Add the version when connector methods and interfaces were added

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256107#comment-17256107
 ] 

Apache Spark commented on SPARK-33936:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30966

> Add the version when connector methods and interfaces were added
> 
>
> Key: SPARK-33936
> URL: https://issues.apache.org/jira/browse/SPARK-33936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0, 3.2.0
>Reporter: Maxim Gekk
>Priority: Minor
>
> Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the 
> *connector* package. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33936) Add the version when connector methods and interfaces were added

2020-12-29 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-33936:
--

 Summary: Add the version when connector methods and interfaces 
were added
 Key: SPARK-33936
 URL: https://issues.apache.org/jira/browse/SPARK-33936
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0, 3.2.0
Reporter: Maxim Gekk


Add the *since 3.1.0 /3.2.0* tags to new methods and interfaces in the 
*connector* package. 
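For illustration only, a *since* tag on a made-up connector-style interface (the real 
interfaces and exact versions are the ones touched in the linked pull requests):

{code:scala}
/**
 * Illustrative only - not an actual Spark interface.
 *
 * @since 3.2.0
 */
trait ExampleWriteCapability {

  /**
   * Returns true if the underlying source supports the given capability.
   *
   * @since 3.2.0
   */
  def supports(capability: String): Boolean
}
{code}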



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33935) Fix CBOs cost function

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256103#comment-17256103
 ] 

Apache Spark commented on SPARK-33935:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30965

> Fix CBOs cost function 
> ---
>
> Key: SPARK-33935
> URL: https://issues.apache.org/jira/browse/SPARK-33935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Major
>
> The parameter spark.sql.cbo.joinReorder.card.weight is documented as:
> {code:title=spark.sql.cbo.joinReorder.card.weight}
> The weight of cardinality (number of rows) for plan cost comparison in join 
> reorder: rows * weight + size * (1 - weight).
> {code}
> But in the implementation the formula is a bit different:
> {code:title=Current implementation}
> def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
>   if (other.planCost.card == 0 || other.planCost.size == 0) {
> false
>   } else {
> val relativeRows = BigDecimal(this.planCost.card) / 
> BigDecimal(other.planCost.card)
> val relativeSize = BigDecimal(this.planCost.size) / 
> BigDecimal(other.planCost.size)
> relativeRows * conf.joinReorderCardWeight +
>   relativeSize * (1 - conf.joinReorderCardWeight) < 1
>   }
> }
> {code}
> This change has an unfortunate consequence: given two plans A and B, 
> A betterThan B and B betterThan A can both return the same result (false). 
> This happens when one plan has many rows with small sizes and the other has 
> few rows with large sizes.
> Example values that exhibit this phenomenon with the default weight value (0.7):
> A.card = 500, B.card = 300
> A.size = 30, B.size = 80
> Both A betterThan B and B betterThan A would have a score above 1 and would 
> return false.
> A new implementation is proposed that matches the documentation:
> {code:title=Proposed implementation}
> def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
>   val oldCost = BigDecimal(this.planCost.card) * 
> conf.joinReorderCardWeight +
> BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight)
>   val newCost = BigDecimal(other.planCost.card) * 
> conf.joinReorderCardWeight +
> BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight)
>   newCost < oldCost
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33935) Fix CBOs cost function

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256101#comment-17256101
 ] 

Apache Spark commented on SPARK-33935:
--

User 'tanelk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30965

> Fix CBOs cost function 
> ---
>
> Key: SPARK-33935
> URL: https://issues.apache.org/jira/browse/SPARK-33935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Major
>
> The parameter spark.sql.cbo.joinReorder.card.weight is documented as:
> {code:title=spark.sql.cbo.joinReorder.card.weight}
> The weight of cardinality (number of rows) for plan cost comparison in join 
> reorder: rows * weight + size * (1 - weight).
> {code}
> But in the implementation the formula is a bit different:
> {code:title=Current implementation}
> def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
>   if (other.planCost.card == 0 || other.planCost.size == 0) {
> false
>   } else {
> val relativeRows = BigDecimal(this.planCost.card) / 
> BigDecimal(other.planCost.card)
> val relativeSize = BigDecimal(this.planCost.size) / 
> BigDecimal(other.planCost.size)
> relativeRows * conf.joinReorderCardWeight +
>   relativeSize * (1 - conf.joinReorderCardWeight) < 1
>   }
> }
> {code}
> This change has an unfortunate consequence: given two plans A and B, 
> A betterThan B and B betterThan A can both return the same result (false). 
> This happens when one plan has many rows with small sizes and the other has 
> few rows with large sizes.
> Example values that exhibit this phenomenon with the default weight value (0.7):
> A.card = 500, B.card = 300
> A.size = 30, B.size = 80
> Both A betterThan B and B betterThan A would have a score above 1 and would 
> return false.
> A new implementation is proposed that matches the documentation:
> {code:title=Proposed implementation}
> def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
>   val oldCost = BigDecimal(this.planCost.card) * 
> conf.joinReorderCardWeight +
> BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight)
>   val newCost = BigDecimal(other.planCost.card) * 
> conf.joinReorderCardWeight +
> BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight)
>   newCost < oldCost
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33935) Fix CBOs cost function

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33935:


Assignee: (was: Apache Spark)

> Fix CBOs cost function 
> ---
>
> Key: SPARK-33935
> URL: https://issues.apache.org/jira/browse/SPARK-33935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Major
>
> The parameter spark.sql.cbo.joinReorder.card.weight is documented as:
> {code:title=spark.sql.cbo.joinReorder.card.weight}
> The weight of cardinality (number of rows) for plan cost comparison in join 
> reorder: rows * weight + size * (1 - weight).
> {code}
> But in the implementation the formula is a bit different:
> {code:title=Current implementation}
> def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
>   if (other.planCost.card == 0 || other.planCost.size == 0) {
> false
>   } else {
> val relativeRows = BigDecimal(this.planCost.card) / 
> BigDecimal(other.planCost.card)
> val relativeSize = BigDecimal(this.planCost.size) / 
> BigDecimal(other.planCost.size)
> relativeRows * conf.joinReorderCardWeight +
>   relativeSize * (1 - conf.joinReorderCardWeight) < 1
>   }
> }
> {code}
> This change has an unfortunate consequence: given two plans A and B, 
> A betterThan B and B betterThan A can both return the same result (false). 
> This happens when one plan has many rows with small sizes and the other has 
> few rows with large sizes.
> Example values that exhibit this phenomenon with the default weight value (0.7):
> A.card = 500, B.card = 300
> A.size = 30, B.size = 80
> Both A betterThan B and B betterThan A would have a score above 1 and would 
> return false.
> A new implementation is proposed that matches the documentation:
> {code:title=Proposed implementation}
> def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
>   val oldCost = BigDecimal(this.planCost.card) * 
> conf.joinReorderCardWeight +
> BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight)
>   val newCost = BigDecimal(other.planCost.card) * 
> conf.joinReorderCardWeight +
> BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight)
>   newCost < oldCost
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33935) Fix CBOs cost function

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33935:


Assignee: Apache Spark

> Fix CBOs cost function 
> ---
>
> Key: SPARK-33935
> URL: https://issues.apache.org/jira/browse/SPARK-33935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Assignee: Apache Spark
>Priority: Major
>
> The parameter spark.sql.cbo.joinReorder.card.weight is documented as:
> {code:title=spark.sql.cbo.joinReorder.card.weight}
> The weight of cardinality (number of rows) for plan cost comparison in join 
> reorder: rows * weight + size * (1 - weight).
> {code}
> But in the implementation the formula is a bit different:
> {code:title=Current implementation}
> def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
>   if (other.planCost.card == 0 || other.planCost.size == 0) {
> false
>   } else {
> val relativeRows = BigDecimal(this.planCost.card) / 
> BigDecimal(other.planCost.card)
> val relativeSize = BigDecimal(this.planCost.size) / 
> BigDecimal(other.planCost.size)
> relativeRows * conf.joinReorderCardWeight +
>   relativeSize * (1 - conf.joinReorderCardWeight) < 1
>   }
> }
> {code}
> This change has an unfortunate consequence: given two plans A and B, 
> A betterThan B and B betterThan A can both return the same result (false). 
> This happens when one plan has many rows with small sizes and the other has 
> few rows with large sizes.
> Example values that exhibit this phenomenon with the default weight value (0.7):
> A.card = 500, B.card = 300
> A.size = 30, B.size = 80
> Both A betterThan B and B betterThan A would have a score above 1 and would 
> return false.
> A new implementation is proposed that matches the documentation:
> {code:title=Proposed implementation}
> def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
>   val oldCost = BigDecimal(this.planCost.card) * 
> conf.joinReorderCardWeight +
> BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight)
>   val newCost = BigDecimal(other.planCost.card) * 
> conf.joinReorderCardWeight +
> BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight)
>   newCost < oldCost
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33935) Fix CBOs cost function

2020-12-29 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-33935:
---
Issue Type: Bug  (was: Improvement)

> Fix CBOs cost function 
> ---
>
> Key: SPARK-33935
> URL: https://issues.apache.org/jira/browse/SPARK-33935
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Tanel Kiis
>Priority: Major
>
> The parameter spark.sql.cbo.joinReorder.card.weight is documented as:
> {code:title=spark.sql.cbo.joinReorder.card.weight}
> The weight of cardinality (number of rows) for plan cost comparison in join 
> reorder: rows * weight + size * (1 - weight).
> {code}
> But in the implementation the formula is a bit different:
> {code:title=Current implementation}
> def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
>   if (other.planCost.card == 0 || other.planCost.size == 0) {
> false
>   } else {
> val relativeRows = BigDecimal(this.planCost.card) / 
> BigDecimal(other.planCost.card)
> val relativeSize = BigDecimal(this.planCost.size) / 
> BigDecimal(other.planCost.size)
> relativeRows * conf.joinReorderCardWeight +
>   relativeSize * (1 - conf.joinReorderCardWeight) < 1
>   }
> }
> {code}
> This change has an unfortunate consequence: given two plans A and B, 
> A betterThan B and B betterThan A can both return the same result (false). 
> This happens when one plan has many rows with small sizes and the other has 
> few rows with large sizes.
> Example values that exhibit this phenomenon with the default weight value (0.7):
> A.card = 500, B.card = 300
> A.size = 30, B.size = 80
> Both A betterThan B and B betterThan A would have a score above 1 and would 
> return false.
> A new implementation is proposed that matches the documentation:
> {code:title=Proposed implementation}
> def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
>   val oldCost = BigDecimal(this.planCost.card) * 
> conf.joinReorderCardWeight +
> BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight)
>   val newCost = BigDecimal(other.planCost.card) * 
> conf.joinReorderCardWeight +
> BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight)
>   newCost < oldCost
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33935) Fix CBOs cost function

2020-12-29 Thread Tanel Kiis (Jira)
Tanel Kiis created SPARK-33935:
--

 Summary: Fix CBOs cost function 
 Key: SPARK-33935
 URL: https://issues.apache.org/jira/browse/SPARK-33935
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Tanel Kiis


The parameter spark.sql.cbo.joinReorder.card.weight is documented as:
{code:title=spark.sql.cbo.joinReorder.card.weight}
The weight of cardinality (number of rows) for plan cost comparison in join 
reorder: rows * weight + size * (1 - weight).
{code}

But in the implementation the formula is a bit different:
{code:title=Current implementation}
def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
  if (other.planCost.card == 0 || other.planCost.size == 0) {
false
  } else {
val relativeRows = BigDecimal(this.planCost.card) / 
BigDecimal(other.planCost.card)
val relativeSize = BigDecimal(this.planCost.size) / 
BigDecimal(other.planCost.size)
relativeRows * conf.joinReorderCardWeight +
  relativeSize * (1 - conf.joinReorderCardWeight) < 1
  }
}
{code}

This change has an unfortunate consequence: given two plans A and B, 
A betterThan B and B betterThan A can both return the same result (false). This 
happens when one plan has many rows with small sizes and the other has few rows 
with large sizes.

Example values that exhibit this phenomenon with the default weight value (0.7):
A.card = 500, B.card = 300
A.size = 30, B.size = 80
Both A betterThan B and B betterThan A would have a score above 1 and would 
return false.

A new implementation is proposed that matches the documentation:
{code:title=Proposed implementation}
def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
  val oldCost = BigDecimal(this.planCost.card) * conf.joinReorderCardWeight 
+
BigDecimal(this.planCost.size) * (1 - conf.joinReorderCardWeight)
  val newCost = BigDecimal(other.planCost.card) * 
conf.joinReorderCardWeight +
BigDecimal(other.planCost.size) * (1 - conf.joinReorderCardWeight)
  newCost < oldCost
}
{code}
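For reference, a quick numeric check of the A/B example above (it only uses the values 
and the default weight already given):

{code:scala}
// Values from the example: weight w = 0.7, A.card = 500, A.size = 30, B.card = 300, B.size = 80.
val w = BigDecimal("0.7")
val one = BigDecimal(1)

val (aCard, aSize) = (BigDecimal(500), BigDecimal(30))
val (bCard, bSize) = (BigDecimal(300), BigDecimal(80))

// Current implementation: relative ratios compared against 1.
val scoreAvsB = (aCard / bCard) * w + (aSize / bSize) * (one - w)  // ~1.279
val scoreBvsA = (bCard / aCard) * w + (bSize / aSize) * (one - w)  // ~1.220
println(s"A betterThan B: ${scoreAvsB < one}, B betterThan A: ${scoreBvsA < one}")  // both false

// Documented cost: rows * weight + size * (1 - weight); lower cost is better.
val costA = aCard * w + aSize * (one - w)  // 359
val costB = bCard * w + bSize * (one - w)  // 234
println(s"cost(A) = $costA, cost(B) = $costB")  // exactly one plan wins the comparison
{code}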



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33635) Performance regression in Kafka read

2020-12-29 Thread David Wyles (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256046#comment-17256046
 ] 

David Wyles edited comment on SPARK-33635 at 12/29/20, 4:54 PM:


[~gsomogyi] I now have my results.
I was so unhappy about these results that I ran all the tests again. The only 
thing that changed between runs is the version of Spark running on the cluster; 
everything else was static - the data input from Kafka was an unchanging, static 
set of data.

Input-> *672733262* rows

+*Spark 2.4.5*:+

*440* seconds - *1,528,939* rows per second.

+*Spark 3.0.1*:+

*990* seconds - *679,528* rows per second.

These are multiple runs (I even took the best result from Spark 3.0.1).

I also captured the event logs between these two versions of spark - should 
anyone find them useful.

[event 
logs|https://drive.google.com/drive/folders/1aElmzVWmJqRALQimdOYxdJu559_3EX_9?usp=sharing]

So, no matter what I do, I can only conclude that Spark 2.4.5 was a lot faster 
in this test case.

Is Spark SQL reading the source data twice, just as it would if there was a 
"order by" in the query?

Sample code used:

{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark =
  SparkSession.builder.appName("Kafka Read Performance")
    .config("spark.executor.memory", "16g")
    .config("spark.cores.max", "10")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///tmp/spark-events")
    .config("spark.eventLog.overwrite", "true")
    .getOrCreate()

import spark.implicits._

val startTime = System.nanoTime()

val df =
  spark
    .read
    .format("kafka")
    .option("kafka.bootstrap.servers", config.brokers)
    .option("subscribe", config.inTopic)
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .option("failOnDataLoss", "false")
    .load()

df
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", config.brokers)
  .option("topic", config.outTopic)
  .mode(SaveMode.Append)
  .save()

val endTime = System.nanoTime()

val elapsedSecs = (endTime - startTime) / 1E9

// static input sample was used, fixed row count
println(s"Took $elapsedSecs secs")
spark.stop()
{code}

 


was (Author: david.wyles):
[~gsomogyi] I now have my results.
I was so unhappy about these results that I ran all the tests again. The only 
thing that changed between runs is the version of Spark running on the cluster; 
everything else was static - the data input from Kafka was an unchanging, static 
set of data.

Input-> *672733262* rows

+*Spark 2.4.5*:+

*440* seconds - *1,528,939* rows per second.

+*Spark 3.0.1*:+

*990* seconds - *679,528* rows per second.

These are multiple runs (I even took the best result from Spark 3.0.1).

I also captured the event logs between these two versions of spark - should 
anyone find them useful.

[event 
logs|https://drive.google.com/drive/folders/1aElmzVWmJqRALQimdOYxdJu559_3EX_9?usp=sharing]

So, no matter what I do, I can only conclude that Spark 2.4.5 was a lot faster 
in this test case (in my production use case I'm just writing to parquet files 
in HDFS - which is where I noticed the degradation in performance).

Is Spark SQL reading the source data twice, just as it would if there was a 
"order by" in the query?

Sample code used:

{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark =
  SparkSession.builder.appName("Kafka Read Performance")
    .config("spark.executor.memory", "16g")
    .config("spark.cores.max", "10")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///tmp/spark-events")
    .config("spark.eventLog.overwrite", "true")
    .getOrCreate()

import spark.implicits._

val startTime = System.nanoTime()

val df =
  spark
    .read
    .format("kafka")
    .option("kafka.bootstrap.servers", config.brokers)
    .option("subscribe", config.inTopic)
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .option("failOnDataLoss", "false")
    .load()

df
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", config.brokers)
  .option("topic", config.outTopic)
  .mode(SaveMode.Append)
  .save()

val endTime = System.nanoTime()

val elapsedSecs = (endTime - startTime) / 1E9

// static input sample was used, fixed row count
println(s"Took $elapsedSecs secs")
spark.stop()
{code}

 

> Performance regression in Kafka read
> 
>
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5 node system. A simple data row of csv data in 
> kafka, evenly distributed between the partitions.
> Open JDK 1.8.0.252
> Spark in stand alone - 5 nodes, 10 workers (2 worker per node, each locked to 
> a distinct NUMA group)
> kafka (v 2.3.1) cluster - 5 nodes (1 broker per node).
> Centos 7.7.1908

[jira] [Comment Edited] (SPARK-33635) Performance regression in Kafka read

2020-12-29 Thread David Wyles (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256046#comment-17256046
 ] 

David Wyles edited comment on SPARK-33635 at 12/29/20, 4:34 PM:


[~gsomogyi] I now have my results.
I was so unhappy about these results that I ran all the tests again. The only 
thing that changed between runs is the version of Spark running on the cluster; 
everything else was static - the data input from Kafka was an unchanging, static 
set of data.

Input-> *672733262* rows

+*Spark 2.4.5*:+

*440* seconds - *1,528,939* rows per second.

+*Spark 3.0.1*:+

*990* seconds - *679,528* rows per second.

These are multiple runs (I even took the best result from Spark 3.0.1).

I also captured the event logs between these two versions of spark - should 
anyone find them useful.

[event 
logs|https://drive.google.com/drive/folders/1aElmzVWmJqRALQimdOYxdJu559_3EX_9?usp=sharing]

So, no matter what I do, I can only conclude that Spark 2.4.5 was a lot faster 
in this test case (in my production use case I'm just writing to parquet files 
in HDFS - which is where I noticed the degradation in performance).

Is Spark SQL reading the source data twice, just as it would if there was a 
"order by" in the query?

Sample code used:

{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark =
  SparkSession.builder.appName("Kafka Read Performance")
    .config("spark.executor.memory", "16g")
    .config("spark.cores.max", "10")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///tmp/spark-events")
    .config("spark.eventLog.overwrite", "true")
    .getOrCreate()

import spark.implicits._

val startTime = System.nanoTime()

val df =
  spark
    .read
    .format("kafka")
    .option("kafka.bootstrap.servers", config.brokers)
    .option("subscribe", config.inTopic)
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .option("failOnDataLoss", "false")
    .load()

df
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", config.brokers)
  .option("topic", config.outTopic)
  .mode(SaveMode.Append)
  .save()

val endTime = System.nanoTime()

val elapsedSecs = (endTime - startTime) / 1E9

// static input sample was used, fixed row count
println(s"Took $elapsedSecs secs")
spark.stop()
{code}

 


was (Author: david.wyles):
[~gsomogyi] I now have my results.
I was so unhappy about these results that I ran all the tests again. The only 
thing that changed between runs is the version of Spark running on the cluster; 
everything else was static - the data input from Kafka was an unchanging, static 
set of data.

Input-> *672733262* rows

+*Spark 2.4.5*:+

*440* seconds - *1,528,939* rows per second.

+*Spark 3.0.1*:+

*990* seconds - *679,528* rows per second.

These are multiple runs (I even took the best result from Spark 3.0.1).

I also captured the event logs between these two versions of spark - should 
anyone find them useful.

So, no matter what I do, I can only conclude that Spark 2.4.5 was a lot faster 
in this test case (in my production use case I'm just writing to parquet files 
in HDFS - which is where I noticed the degradation in performance).

Is Spark SQL reading the source data twice, just as it would if there was a 
"order by" in the query?

Sample code used:



{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark =
  SparkSession.builder.appName("Kafka Read Performance")
    .config("spark.executor.memory", "16g")
    .config("spark.cores.max", "10")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///tmp/spark-events")
    .config("spark.eventLog.overwrite", "true")
    .getOrCreate()

import spark.implicits._

val startTime = System.nanoTime()

val df =
  spark
    .read
    .format("kafka")
    .option("kafka.bootstrap.servers", config.brokers)
    .option("subscribe", config.inTopic)
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .option("failOnDataLoss", "false")
    .load()

df
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", config.brokers)
  .option("topic", config.outTopic)
  .mode(SaveMode.Append)
  .save()

val endTime = System.nanoTime()

val elapsedSecs = (endTime - startTime) / 1E9

// static input sample was used, fixed row count
println(s"Took $elapsedSecs secs")
spark.stop()
{code}

 

> Performance regression in Kafka read
> 
>
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5 node system. A simple data row of csv data in 
> kafka, evenly distributed between the partitions.
> Open JDK 1.8.0.252
> Spark in stand alone - 5 nodes, 10 workers (2 worker per node, each locked to 
> a distinct NUMA group)
> kafka (v 2.3.1) cluster - 5 nodes (1 broker per node).
> 

[jira] [Commented] (SPARK-33635) Performance regression in Kafka read

2020-12-29 Thread David Wyles (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256046#comment-17256046
 ] 

David Wyles commented on SPARK-33635:
-

[~gsomogyi] I now have my results.
I was so unhappy about these results that I ran all the tests again. The only 
thing that changed between runs is the version of Spark running on the cluster; 
everything else was static - the data input from Kafka was an unchanging, static 
set of data.

Input-> *672733262* rows

+*Spark 2.4.5*:+

*440* seconds - *1,528,939* rows per second.

+*Spark 3.0.1*:+

*990* seconds - *679,528* rows per second.

These are multiple runs (I even took the best result from Spark 3.0.1).

I also captured the event logs between these two versions of spark - should 
anyone find them useful.

So, no matter what I do, I can only conclude that Spark 2.4.5 was a lot faster 
in this test case (in my production use case I'm just writing to parquet files 
in HDFS - which is where I noticed the degradation in performance).

Is Spark SQL reading the source data twice, just as it would if there was a 
"order by" in the query?

Sample code used:



{code:scala}
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark =
  SparkSession.builder.appName("Kafka Read Performance")
    .config("spark.executor.memory", "16g")
    .config("spark.cores.max", "10")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///tmp/spark-events")
    .config("spark.eventLog.overwrite", "true")
    .getOrCreate()

import spark.implicits._

val startTime = System.nanoTime()

val df =
  spark
    .read
    .format("kafka")
    .option("kafka.bootstrap.servers", config.brokers)
    .option("subscribe", config.inTopic)
    .option("startingOffsets", "earliest")
    .option("endingOffsets", "latest")
    .option("failOnDataLoss", "false")
    .load()

df
  .write
  .format("kafka")
  .option("kafka.bootstrap.servers", config.brokers)
  .option("topic", config.outTopic)
  .mode(SaveMode.Append)
  .save()

val endTime = System.nanoTime()

val elapsedSecs = (endTime - startTime) / 1E9

// static input sample was used, fixed row count
println(s"Took $elapsedSecs secs")
spark.stop()
{code}

 

> Performance regression in Kafka read
> 
>
> Key: SPARK-33635
> URL: https://issues.apache.org/jira/browse/SPARK-33635
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
> Environment: A simple 5 node system. A simple data row of csv data in 
> kafka, evenly distributed between the partitions.
> Open JDK 1.8.0.252
> Spark in stand alone - 5 nodes, 10 workers (2 worker per node, each locked to 
> a distinct NUMA group)
> kafka (v 2.3.1) cluster - 5 nodes (1 broker per node).
> Centos 7.7.1908
> 1 topic, 10 partitions, 1 hour queue life
> (this is just one of the clusters we have; I have tested on all of them and 
> they all exhibit the same performance degradation)
>Reporter: David Wyles
>Priority: Major
>
> I have observed a slowdown in the reading of data from Kafka on all of our 
> systems when migrating from Spark 2.4.5 to Spark 3.0.0 (and Spark 3.0.1).
> I have created a sample project to isolate the problem as much as possible; 
> it just reads all data from a Kafka topic (see 
> [https://github.com/codegorillauk/spark-kafka-read]).
> With 2.4.5, across multiple runs, 
>  I get a stable read rate of 1,120,000 (1.12 million) rows per second.
> With 3.0.0 or 3.0.1, across multiple runs,
>  I get a stable read rate of 632,000 (0.632 million) rows per second.
> This represents a *44% loss in performance*, which is a lot.
> I have been working through the spark-sql-kafka-0-10 code base, but changes for 
> Spark 3 have been ongoing for over a year and it's difficult to pinpoint an 
> exact change or reason for the degradation.
> I am happy to help fix this problem, but I will need some assistance as I am 
> unfamiliar with the spark-sql-kafka-0-10 project.
>  
> A sample of the data my test reads (note: its not parsing csv - this is just 
> test data)
>  
> 160692180,001e0610e532,lightsense,tsl250rd,intensity,21853,53.262,acceleration_z,651,ep,290,commit,913,pressure,138,pm1,799,uv_intensity,823,idletime,-372,count,-72,ir_intensity,185,concentration,-61,flags,-532,tx,694.36,ep_heatsink,-556.92,acceleration_x,-221.40,fw,910.53,sample_flow_rate,-959.60,uptime,-515.15,pm10,-768.03,powersupply,214.72,magnetic_field_y,-616.04,alphasense,606.73,AoT_Chicago,053,Racine
>  Ave & 18th St Chicago IL,41.857959,-87.6564270002,AoT Chicago (S) 
> [C],2017/12/15 00:00:00,



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33910) Simplify/Optimize conditional expressions

2020-12-29 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-33910.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

>  Simplify/Optimize conditional expressions
> --
>
> Key: SPARK-33910
> URL: https://issues.apache.org/jira/browse/SPARK-33910
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.2.0
>
>
> 1. Push down the foldable expressions through CaseWhen/If
> 2. Simplify conditional in predicate
> 3. Push the UnaryExpression into (if / case) branches
> 4. Simplify CaseWhen if elseValue is None
> 5. Simplify CaseWhen clauses with (true and false) and (false and true)
> Common use cases are:
> {code:sql}
> create table t1 using parquet as select * from range(100);
> create table t2 using parquet as select * from range(200);
> create temp view v1 as
> select 'a' as event_type, * from t1   
> union all 
> select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2
> {code}
> 1. Avoid reading the whole table.
> {noformat}
> explain select * from v1 where event_type = 'a';
> Before simplify:
> == Physical Plan ==
> Union
> :- *(1) Project [a AS event_type#7, id#9L]
> :  +- *(1) ColumnarToRow
> : +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], 
> Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
> +- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
> id#10L]
>+- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a)
>   +- *(2) ColumnarToRow
>  +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
> [(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> After simplify:
> == Physical Plan ==
> *(1) Project [a AS event_type#8, id#4L]
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], 
> Format: Parquet
> {noformat}
> 2. Push down the conditional expressions to data source.
> {noformat}
> explain select * from v1 where event_type = 'b';
> Before simplify:
> == Physical Plan ==
> Union
> :- LocalTableScan , [event_type#7, id#9L]
> +- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
> id#10L]
>+- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b)
>   +- *(1) ColumnarToRow
>  +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
> [(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> After simplify:
> == Physical Plan ==
> *(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L 
> AS id#4L]
> +- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1))
>+- *(1) ColumnarToRow
>   +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: 
> [isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], 
> PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct
> {noformat}
> 3. Reduce the amount of calculation.
> {noformat}
> Before simplify:
> explain select event_type = 'e' from v1;
> == Physical Plan ==
> Union
> :- *(1) Project [false AS (event_type = e)#37]
> :  +- *(1) ColumnarToRow
> : +- FileScan parquet default.t1[] Batched: true, DataFilters: [], 
> Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
> +- *(2) Project [(CASE WHEN (id#21L = 1) THEN b ELSE c END = e) AS 
> (event_type = e)#38]
>+- *(2) ColumnarToRow
>   +- FileScan parquet default.t2[id#21L] Batched: true, DataFilters: [], 
> Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
> After simplify:
> == Physical Plan ==
> Union
> :- *(1) Project [false AS (event_type = e)#10]
> :  +- *(1) ColumnarToRow
> : +- FileScan parquet default.t1[] Batched: true, DataFilters: [], 
> Format: Parquet,
> +- *(2) Project [false AS (event_type = e)#14]
>+- *(2) ColumnarToRow
>   +- FileScan parquet default.t2[] Batched: true, DataFilters: [], 
> Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
> {noformat}
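Not part of the original description, but a quick way to reproduce the first case in 
spark-shell (assumes a writable default warehouse and a SparkSession named spark):

{code:scala}
spark.sql("create table t1 using parquet as select * from range(100)")
spark.sql("create table t2 using parquet as select * from range(200)")
spark.sql("""
  create temp view v1 as
  select 'a' as event_type, * from t1
  union all
  select case when id = 1 then 'b' else 'c' end as event_type, * from t2
""")

// With the simplification in place, the predicate on the t2 branch folds to
// false, so only t1 should be scanned.
spark.sql("select * from v1 where event_type = 'a'").explain()
{code}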



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33910) Simplify/Optimize conditional expressions

2020-12-29 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-33910:
---

Assignee: Yuming Wang

>  Simplify/Optimize conditional expressions
> --
>
> Key: SPARK-33910
> URL: https://issues.apache.org/jira/browse/SPARK-33910
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> 1. Push down the foldable expressions through CaseWhen/If
> 2. Simplify conditional in predicate
> 3. Push the UnaryExpression into (if / case) branches
> 4. Simplify CaseWhen if elseValue is None
> 5. Simplify CaseWhen clauses with (true and false) and (false and true)
> Common use cases are:
> {code:sql}
> create table t1 using parquet as select * from range(100);
> create table t2 using parquet as select * from range(200);
> create temp view v1 as
> select 'a' as event_type, * from t1   
> union all 
> select CASE WHEN id = 1 THEN 'b' ELSE 'c' end as event_type, * from t2
> {code}
> 1. Avoid reading the whole table.
> {noformat}
> explain select * from v1 where event_type = 'a';
> Before simplify:
> == Physical Plan ==
> Union
> :- *(1) Project [a AS event_type#7, id#9L]
> :  +- *(1) ColumnarToRow
> : +- FileScan parquet default.t1[id#9L] Batched: true, DataFilters: [], 
> Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
> +- *(2) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
> id#10L]
>+- *(2) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = a)
>   +- *(2) ColumnarToRow
>  +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
> [(CASE WHEN (id#10L = 1) THEN b ELSE c END = a)], Format: Parquet, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> After simplify:
> == Physical Plan ==
> *(1) Project [a AS event_type#8, id#4L]
> +- *(1) ColumnarToRow
>+- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], 
> Format: Parquet
> {noformat}
> 2. Push down the conditional expressions to data source.
> {noformat}
> explain select * from v1 where event_type = 'b';
> Before simplify:
> == Physical Plan ==
> Union
> :- LocalTableScan , [event_type#7, id#9L]
> +- *(1) Project [CASE WHEN (id#10L = 1) THEN b ELSE c END AS event_type#8, 
> id#10L]
>+- *(1) Filter (CASE WHEN (id#10L = 1) THEN b ELSE c END = b)
>   +- *(1) ColumnarToRow
>  +- FileScan parquet default.t2[id#10L] Batched: true, DataFilters: 
> [(CASE WHEN (id#10L = 1) THEN b ELSE c END = b)], Format: Parquet, 
> PartitionFilters: [], PushedFilters: [], ReadSchema: struct
> After simplify:
> == Physical Plan ==
> *(1) Project [CASE WHEN (id#5L = 1) THEN b ELSE c END AS event_type#8, id#5L 
> AS id#4L]
> +- *(1) Filter (isnotnull(id#5L) AND (id#5L = 1))
>+- *(1) ColumnarToRow
>   +- FileScan parquet default.t2[id#5L] Batched: true, DataFilters: 
> [isnotnull(id#5L), (id#5L = 1)], Format: Parquet, PartitionFilters: [], 
> PushedFilters: [IsNotNull(id), EqualTo(id,1)], ReadSchema: struct
> {noformat}
> 3. Reduce the amount of calculation.
> {noformat}
> Before simplify:
> explain select event_type = 'e' from v1;
> == Physical Plan ==
> Union
> :- *(1) Project [false AS (event_type = e)#37]
> :  +- *(1) ColumnarToRow
> : +- FileScan parquet default.t1[] Batched: true, DataFilters: [], 
> Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
> +- *(2) Project [(CASE WHEN (id#21L = 1) THEN b ELSE c END = e) AS 
> (event_type = e)#38]
>+- *(2) ColumnarToRow
>   +- FileScan parquet default.t2[id#21L] Batched: true, DataFilters: [], 
> Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: 
> struct
> After simplify:
> == Physical Plan ==
> Union
> :- *(1) Project [false AS (event_type = e)#10]
> :  +- *(1) ColumnarToRow
> : +- FileScan parquet default.t1[] Batched: true, DataFilters: [], 
> Format: Parquet,
> +- *(2) Project [false AS (event_type = e)#14]
>+- *(2) ColumnarToRow
>   +- FileScan parquet default.t2[] Batched: true, DataFilters: [], 
> Format: Parquet, PartitionFilters: [], PushedFilters: [], ReadSchema: struct<>
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33934) Support automatically identify the Python file and execute it

2020-12-29 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-33934:
--
Parent: SPARK-31936
Issue Type: Sub-task  (was: Improvement)

> Support automatically identify the Python file and execute it
> -
>
> Key: SPARK-33934
> URL: https://issues.apache.org/jira/browse/SPARK-33934
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> In Hive script transform, we can use `USING xxx.py`, but in Spark we get the 
> following error: 
> {code:java}
> Job aborted due to stage failure: Task 17 in stage 530.0 failed 4 times, most 
> recent failure: Lost task 17.3 in stage 530.0 (TID 38639, host, executor 
> 339): org.apache.spark.SparkException: Subprocess exited with status 127. 
> Error: /bin/bash: xxx.py: can't find the command
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33934) Support automatically identify the Python file and execute it

2020-12-29 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-33934:
--
Description: 
In Hive script transform, we can use `USING xxx.py`, but in Spark we get the 
following error: 
{code:java}
Job aborted due to stage failure: Task 17 in stage 530.0 failed 4 times, most 
recent failure: Lost task 17.3 in stage 530.0 (TID 38639, host, executor 339): 
org.apache.spark.SparkException: Subprocess exited with status 127. Error: 
/bin/bash: xxx.py: can't find the command
{code}

  was:
In Hive script transform, we can use `USING xxx/py`, but in Spark we get the 
following error: 
{code:java}
Job aborted due to stage failure: Task 17 in stage 530.0 failed 4 times, most 
recent failure: Lost task 17.3 in stage 530.0 (TID 38639, host, executor 339): 
org.apache.spark.SparkException: Subprocess exited with status 127. Error: 
/bin/bash: xxx.py: can't find the command
{code}


> Support automatically identify the Python file and execute it
> -
>
> Key: SPARK-33934
> URL: https://issues.apache.org/jira/browse/SPARK-33934
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> In Hive script transform, we can use `USING xxx.py`, but in Spark we get the 
> following error: 
> {code:java}
> Job aborted due to stage failure: Task 17 in stage 530.0 failed 4 times, most 
> recent failure: Lost task 17.3 in stage 530.0 (TID 38639, host, executor 
> 339): org.apache.spark.SparkException: Subprocess exited with status 127. 
> Error: /bin/bash: xxx.py: can't find the command
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33934) Support automatically identify the Python file and execute it

2020-12-29 Thread angerszhu (Jira)
angerszhu created SPARK-33934:
-

 Summary: Support automatically identify the Python file and 
execute it
 Key: SPARK-33934
 URL: https://issues.apache.org/jira/browse/SPARK-33934
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: angerszhu


In Hive script transform, we can use `USING xxx/py`, but in Spark we get the 
following error: 
{code:java}
Job aborted due to stage failure: Task 17 in stage 530.0 failed 4 times, most 
recent failure: Lost task 17.3 in stage 530.0 (TID 38639, host, executor 339): 
org.apache.spark.SparkException: Subprocess exited with status 127. Error: 
/bin/bash: xxx.py: can't find the command
{code}
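For illustration only (the script name below is hypothetical), the Hive-style usage this 
ticket wants to support looks like:

{code:scala}
// 'test_script.py' is hypothetical; it would need to be shipped to the
// executors (e.g. via --files) and be executable. Today the command is handed
// to /bin/bash as-is, which is what produces the "can't find the command" error.
spark.sql("""
  SELECT TRANSFORM(a, b, c)
  USING 'test_script.py'
  AS (x, y, z)
  FROM (SELECT 1 AS a, 2 AS b, 3 AS c) t
""").show()
{code}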



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32684:


Assignee: (was: Apache Spark)

> Add a test case for hive serde/default-serde mode's null value '\\N'
> 
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Minor
>
> Hive serde default NULL value is '\N'
> {code:java}
> String nullString = tbl.getProperty(
> serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
> nullSequence = new Text(nullString);
> {code}
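A rough sketch of the kind of test being asked for, assuming the no-serde 'cat' transform 
shown elsewhere in this thread; the exact suite and assertions are up to the implementation:

{code:scala}
// NULL goes through the script as a text field; the test would assert that the
// marker used matches Hive's default '\N' rather than some other string.
val rows = spark.sql("""
  SELECT TRANSFORM(a, b, null)
  USING 'cat'
  AS (a, b, c)
  FROM (SELECT 1 AS a, 2 AS b) t
""").collect()

rows.foreach(println)
{code}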



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32684:


Assignee: Apache Spark

> Add a test case for hive serde/default-serde mode's null value '\\N'
> 
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: Apache Spark
>Priority: Minor
>
> Hive serde default NULL value is '\N'
> {code:java}
> String nullString = tbl.getProperty(
> serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
> nullSequence = new Text(nullString);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32684:
-
Priority: Minor  (was: Major)

> Add a test case for hive serde/default-serde mode's null value '\\N'
> 
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Minor
>
> Hive serde default NULL value is '\N'
> {code:java}
> String nullString = tbl.getProperty(
> serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
> nullSequence = new Text(nullString);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32684:
-
Summary: Add a test case for hive serde/default-serde mode's null value 
'\\N'  (was: Scrip transform hive serde/default-serde mode null value keep same 
with hive as '\\N')

> Add a test case for hive serde/default-serde mode's null value '\\N'
> 
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> Hive serde default NULL value is '\N'
> {code:java}
> String nullString = tbl.getProperty(
> serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
> nullSequence = new Text(nullString);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33859) Support V2 ALTER TABLE .. RENAME PARTITION

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256015#comment-17256015
 ] 

Apache Spark commented on SPARK-33859:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30964

> Support V2 ALTER TABLE .. RENAME PARTITION
> --
>
> Key: SPARK-33859
> URL: https://issues.apache.org/jira/browse/SPARK-33859
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Need to implement a v2 execution node for ALTER TABLE .. RENAME PARTITION, 
> similar to the v1 implementation: 
> https://github.com/apache/spark/blob/40c37d69fd003ed6079ee8c139dba5c15915c568/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L513
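For reference, the statement shape this ticket covers (catalog, namespace and partition 
values below are illustrative):

{code:scala}
// 'testcat' is an illustrative v2 catalog name registered via
// spark.sql.catalog.testcat; the SQL syntax itself is the existing form.
spark.sql("""
  ALTER TABLE testcat.ns.tbl
  PARTITION (region = 'us-east')
  RENAME TO PARTITION (region = 'us-west')
""")
{code}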



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33859) Support V2 ALTER TABLE .. RENAME PARTITION

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17256014#comment-17256014
 ] 

Apache Spark commented on SPARK-33859:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30964

> Support V2 ALTER TABLE .. RENAME PARTITION
> --
>
> Key: SPARK-33859
> URL: https://issues.apache.org/jira/browse/SPARK-33859
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Need to implement a v2 execution node for ALTER TABLE .. RENAME PARTITION, 
> similar to the v1 implementation: 
> https://github.com/apache/spark/blob/40c37d69fd003ed6079ee8c139dba5c15915c568/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L513



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32684) Add a test case for hive serde/default-serde mode's null value '\\N'

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-32684:
-
Affects Version/s: (was: 3.0.0)
   3.2.0

> Add a test case for hive serde/default-serde mode's null value '\\N'
> 
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Priority: Major
>
> Hive serde default NULL value is '\N'
> {code:java}
> String nullString = tbl.getProperty(
> serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
> nullSequence = new Text(nullString);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-32684) Script transform hive serde/default-serde mode: keep null value the same as Hive ('\\N')

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-32684:
--

> Script transform hive serde/default-serde mode: keep null value the same as 
> Hive ('\\N')
> -
>
> Key: SPARK-32684
> URL: https://issues.apache.org/jira/browse/SPARK-32684
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Major
>
> The Hive serde default NULL value is '\N':
> {code:java}
> String nullString = tbl.getProperty(
> serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
> nullSequence = new Text(nullString);
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33930) Spark SQL no-serde row format field delimiter default is '\u0001'

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-33930.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30958
[https://github.com/apache/spark/pull/30958]

> Spark SQL no-serde row format field delimiter default is '\u0001'
> ---
>
> Key: SPARK-33930
> URL: https://issues.apache.org/jira/browse/SPARK-33930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
> Fix For: 3.2.0
>
>
> For the same SQL:
> {code:java}
> SELECT TRANSFORM(a, b, c, null)
> ROW FORMAT DELIMITED
> USING 'cat' 
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '&'
> FROM (select 1 as a, 2 as b, 3  as c) t
> {code}
> !image-2020-12-29-13-11-31-336.png!
>  
> !image-2020-12-29-13-11-45-734.png!
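A minimal, hedged sketch of the query shape involved (spark-shell style; the literal values are illustrative): with no FIELDS TERMINATED BY clause at all, the fields exchanged with the script in no-serde mode should be separated by the Hive default '\u0001'.

{code:scala}
// Both the input to 'cat' and the output read back rely on the default delimiter,
// so the three columns are expected to come back intact.
val out = spark.sql(
  """SELECT TRANSFORM(a, b, c)
    |USING 'cat' AS (a, b, c)
    |FROM (SELECT 1 AS a, 2 AS b, 3 AS c) t
    |""".stripMargin)
out.show()
{code}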



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33930) Spark SQL no-serde row format field delimiter default is '\u0001'

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-33930:


Assignee: angerszhu

> Spark SQL no-serde row format field delimiter default is '\u0001'
> ---
>
> Key: SPARK-33930
> URL: https://issues.apache.org/jira/browse/SPARK-33930
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> For the same SQL:
> {code:java}
> SELECT TRANSFORM(a, b, c, null)
> ROW FORMAT DELIMITED
> USING 'cat' 
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '&'
> FROM (select 1 as a, 2 as b, 3  as c) t
> {code}
> !image-2020-12-29-13-11-31-336.png!
>  
> !image-2020-12-29-13-11-45-734.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33909) Check rand functions' seed is legal at the analyzer side

2020-12-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33909.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30923
[https://github.com/apache/spark/pull/30923]

> Check rand functions' seed is legal at the analyzer side
> --
>
> Key: SPARK-33909
> URL: https://issues.apache.org/jira/browse/SPARK-33909
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0
>
>
> It's better to check that the seed expression is legal on the analyzer side 
> instead of at execution time.
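A hedged illustration of the kind of input this check targets (the inline table below is made up): a non-literal seed should be rejected during analysis rather than surfacing as a runtime failure.

{code:scala}
spark.sql("SELECT rand(1)").show()                   // literal seed: fine
spark.sql("SELECT rand(a) FROM VALUES (1) AS t(a)")  // non-foldable seed: expected to fail analysis
{code}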



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33909) Check rand functions' seed is legal at the analyzer side

2020-12-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33909:
---

Assignee: ulysses you

> Check rand functions' seed is legal at the analyzer side
> --
>
> Key: SPARK-33909
> URL: https://issues.apache.org/jira/browse/SPARK-33909
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
>
> It's better to check that the seed expression is legal on the analyzer side 
> instead of at execution time.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33859) Support V2 ALTER TABLE .. RENAME PARTITION

2020-12-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-33859:
---

Assignee: Maxim Gekk

> Support V2 ALTER TABLE .. RENAME PARTITION
> --
>
> Key: SPARK-33859
> URL: https://issues.apache.org/jira/browse/SPARK-33859
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Need to implement a v2 execution node for ALTER TABLE .. RENAME PARTITION, 
> similar to the v1 implementation: 
> https://github.com/apache/spark/blob/40c37d69fd003ed6079ee8c139dba5c15915c568/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L513



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-33859) Support V2 ALTER TABLE .. RENAME PARTITION

2020-12-29 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-33859.
-
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30935
[https://github.com/apache/spark/pull/30935]

> Support V2 ALTER TABLE .. RENAME PARTITION
> --
>
> Key: SPARK-33859
> URL: https://issues.apache.org/jira/browse/SPARK-33859
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.2.0
>
>
> Need to implement a v2 execution node for ALTER TABLE .. RENAME PARTITION, 
> similar to the v1 implementation: 
> https://github.com/apache/spark/blob/40c37d69fd003ed6079ee8c139dba5c15915c568/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L513



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33871) Cannot access column after left semi join and left join

2020-12-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255978#comment-17255978
 ] 

Hyukjin Kwon commented on SPARK-33871:
--

+1 for [~viirya]'s advice here.

> Cannot access column after left semi join and left join
> ---
>
> Key: SPARK-33871
> URL: https://issues.apache.org/jira/browse/SPARK-33871
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Evgenii Samusenko
>Priority: Minor
>
> Cannot access column after left semi join and left join
> {code}
> val col = "c1"
> val df = Seq((1, "a"),(2, "a"),(3, "a"),(4, "a")).toDF(col, "c2")
> val df2 = Seq(1).toDF(col)
> val semiJoin = df.join(df2, df(col) === df2(col), "left_semi")
> val left = df.join(semiJoin, df(col) === semiJoin(col), "left")
> left.show
> +---+---+----+----+
> | c1| c2|  c1|  c2|
> +---+---+----+----+
> |  1|  a|   1|   a|
> |  2|  a|null|null|
> |  3|  a|null|null|
> |  4|  a|null|null|
> +---+---+----+----+
> left.select(semiJoin(col))
> +---+
> | c1|
> +---+
> |  1|
> |  2|
> |  3|
> |  4|
> +---+
> left.select(df(col))
> +---+
> | c1|
> +---+
> |  1|
> |  2|
> |  3|
> |  4|
> +---+
> {code}
>  
>  
>  
>  
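A hedged workaround sketch, continuing from the snippet above (the alias names `d` and `s` are arbitrary): giving each side an explicit alias and resolving columns by qualified name keeps the two `c1` columns distinguishable after the joins.

{code:scala}
import org.apache.spark.sql.functions.col

val d = df.alias("d")
val s = semiJoin.alias("s")
val joined = d.join(s, col("d.c1") === col("s.c1"), "left")
joined.select(col("s.c1")).show()  // expected: 1 for the matched row, null for the rest
{code}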



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33927) Fix Spark Release image

2020-12-29 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255977#comment-17255977
 ] 

Hyukjin Kwon commented on SPARK-33927:
--

Thanks for letting me know, [~dongjoon]. I will likely have to take a look at 
this one this week :-).

> Fix Spark Release image
> ---
>
> Key: SPARK-33927
> URL: https://issues.apache.org/jira/browse/SPARK-33927
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Blocker
>
> The release script seems to be broken. This is a blocker for the Apache Spark 
> 3.1.0 release.
> {code}
> $ cd dev/create-release/spark-rm
> $ docker build -t spark-rm .
> ...
> exit code: 1
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33929) Spark-submit with --packages deequ doesn't pull all jars

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33929:
-
Affects Version/s: (was: 2.3.4)
   (was: 2.3.3)
   (was: 2.3.2)
   (was: 2.3.1)
   (was: 2.3.0)
   2.4.7

> Spark-submit with --packages deequ doesn't pull all jars
> ---
>
> Key: SPARK-33929
> URL: https://issues.apache.org/jira/browse/SPARK-33929
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.4.7
>Reporter: Dustin Smith
>Priority: Major
>
> This issue was marked as solved in SPARK-24074; however, [~hyukjin.kwon] 
> pointed out in the comments that version 2.4.x was experiencing this same 
> problem when using Amazon Deequ.
> This problem exists in the 2.3.x ecosystem as well for Deequ.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33929) Spark-submit with --packages deequ doesn't pull all jars

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-33929:
-
Target Version/s:   (was: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4)

> Spark-submit with --packages deequ doesn't pull all jars
> ---
>
> Key: SPARK-33929
> URL: https://issues.apache.org/jira/browse/SPARK-33929
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Submit
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4
>Reporter: Dustin Smith
>Priority: Major
>
> This issue was marked as solved in SPARK-24074; however, [~hyukjin.kwon] 
> pointed out in the comments that version 2.4.x was experiencing this same 
> problem when using Amazon Deequ.
> This problem exists in the 2.3.x ecosystem as well for Deequ.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32968) Column pruning for CsvToStructs

2020-12-29 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32968?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32968.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 30912
[https://github.com/apache/spark/pull/30912]

> Column pruning for CsvToStructs
> ---
>
> Key: SPARK-32968
> URL: https://issues.apache.org/jira/browse/SPARK-32968
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.2.0
>
>
> We could do column pruning for the CsvToStructs expression if we only require 
> some of its fields.
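A hedged sketch (spark-shell style, with a made-up schema and data) of the query shape this targets, where only one field of the parsed struct is referenced:

{code:scala}
import org.apache.spark.sql.functions.{col, from_csv}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import spark.implicits._

val schema = new StructType().add("a", IntegerType).add("b", StringType)
val parsed = Seq("1,x", "2,y").toDF("value")
  .select(from_csv(col("value"), schema, Map.empty[String, String]).as("csv"))
parsed.select(col("csv.a")).explain()  // with pruning, only field `a` needs to be parsed
{code}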



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33926) Improve the error message in resolving of DSv1 multi-part identifiers

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255953#comment-17255953
 ] 

Apache Spark commented on SPARK-33926:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/30963

> Improve the error message in resolving of DSv1 multi-part identifiers
> -
>
> Key: SPARK-33926
> URL: https://issues.apache.org/jira/browse/SPARK-33926
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Maxim Gekk
>Priority: Major
>
> This is a follow-up of 
> https://github.com/apache/spark/pull/30915#discussion_r549240857



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255904#comment-17255904
 ] 

Apache Spark commented on SPARK-33933:
--

User 'zhongyu09' has created a pull request for this issue:
https://github.com/apache/spark/pull/30962

> Broadcast timeout happened unexpectedly in AQE 
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Assignee: Apache Spark
>Priority: Major
>
> In Spark 3.0, when AQE is enabled, broadcast timeouts often happen in normal 
> queries, as below.
>  
> {code:java}
> Could not execute broadcast in 300 secs. You can increase the timeout for 
> broadcasts via spark.sql.broadcastTimeout or disable broadcast join by 
> setting spark.sql.autoBroadcastJoinThreshold to -1
> {code}
>  
> This usually happens when a broadcast join (with or without a hint) follows a 
> long-running shuffle (more than 5 minutes). Disabling AQE makes the issue 
> disappear.
> The workaround is to increase spark.sql.broadcastTimeout, and it works. But 
> because the data to broadcast is very small, that doesn't make sense.
> After investigation, the root cause appears to be this: when AQE is enabled, in 
> getFinalPhysicalPlan, Spark traverses the physical plan bottom up, creates 
> query stages for the materialized parts via createQueryStages, and materializes 
> those newly created query stages to submit map stages or broadcasts. When a 
> ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job 
> and the broadcast job are submitted almost at the same time, but the map job 
> holds all the computing resources. If the map job runs slowly (when there is a 
> lot of data to process and resources are limited), the broadcast job cannot be 
> started (and finished) before spark.sql.broadcastTimeout, which causes the 
> whole job to fail (introduced in SPARK-31475).
> Code to reproduce:
>  
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
> yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect(){code}
>  
>  
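For reference, a hedged sketch of the two workarounds mentioned above (the values are illustrative, not recommendations):

{code:scala}
// Either give the broadcast job more time to wait out the long map stage...
spark.conf.set("spark.sql.broadcastTimeout", "1200")           // seconds
// ...or disable broadcast joins so the map stage cannot starve the broadcast.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
{code}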



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255903#comment-17255903
 ] 

Apache Spark commented on SPARK-33933:
--

User 'zhongyu09' has created a pull request for this issue:
https://github.com/apache/spark/pull/30962

> Broadcast timeout happened unexpectedly in AQE 
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Priority: Major
>
> In Spark 3.0, when AQE is enabled, broadcast timeouts often happen in normal 
> queries, as below.
>  
> {code:java}
> Could not execute broadcast in 300 secs. You can increase the timeout for 
> broadcasts via spark.sql.broadcastTimeout or disable broadcast join by 
> setting spark.sql.autoBroadcastJoinThreshold to -1
> {code}
>  
> This usually happens when a broadcast join (with or without a hint) follows a 
> long-running shuffle (more than 5 minutes). Disabling AQE makes the issue 
> disappear.
> The workaround is to increase spark.sql.broadcastTimeout, and it works. But 
> because the data to broadcast is very small, that doesn't make sense.
> After investigation, the root cause appears to be this: when AQE is enabled, in 
> getFinalPhysicalPlan, Spark traverses the physical plan bottom up, creates 
> query stages for the materialized parts via createQueryStages, and materializes 
> those newly created query stages to submit map stages or broadcasts. When a 
> ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job 
> and the broadcast job are submitted almost at the same time, but the map job 
> holds all the computing resources. If the map job runs slowly (when there is a 
> lot of data to process and resources are limited), the broadcast job cannot be 
> started (and finished) before spark.sql.broadcastTimeout, which causes the 
> whole job to fail (introduced in SPARK-31475).
> Code to reproduce:
>  
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
> yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33933:


Assignee: (was: Apache Spark)

> Broadcast timeout happened unexpectedly in AQE 
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Priority: Major
>
> In Spark 3.0, when AQE is enabled, broadcast timeouts often happen in normal 
> queries, as below.
>  
> {code:java}
> Could not execute broadcast in 300 secs. You can increase the timeout for 
> broadcasts via spark.sql.broadcastTimeout or disable broadcast join by 
> setting spark.sql.autoBroadcastJoinThreshold to -1
> {code}
>  
> This usually happens when a broadcast join (with or without a hint) follows a 
> long-running shuffle (more than 5 minutes). Disabling AQE makes the issue 
> disappear.
> The workaround is to increase spark.sql.broadcastTimeout, and it works. But 
> because the data to broadcast is very small, that doesn't make sense.
> After investigation, the root cause appears to be this: when AQE is enabled, in 
> getFinalPhysicalPlan, Spark traverses the physical plan bottom up, creates 
> query stages for the materialized parts via createQueryStages, and materializes 
> those newly created query stages to submit map stages or broadcasts. When a 
> ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job 
> and the broadcast job are submitted almost at the same time, but the map job 
> holds all the computing resources. If the map job runs slowly (when there is a 
> lot of data to process and resources are limited), the broadcast job cannot be 
> started (and finished) before spark.sql.broadcastTimeout, which causes the 
> whole job to fail (introduced in SPARK-31475).
> Code to reproduce:
>  
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
> yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33933:


Assignee: Apache Spark

> Broadcast timeout happened unexpectedly in AQE 
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1
>Reporter: Yu Zhong
>Assignee: Apache Spark
>Priority: Major
>
> In Spark 3.0, when AQE is enabled, broadcast timeouts often happen in normal 
> queries, as below.
>  
> {code:java}
> Could not execute broadcast in 300 secs. You can increase the timeout for 
> broadcasts via spark.sql.broadcastTimeout or disable broadcast join by 
> setting spark.sql.autoBroadcastJoinThreshold to -1
> {code}
>  
> This usually happens when a broadcast join (with or without a hint) follows a 
> long-running shuffle (more than 5 minutes). Disabling AQE makes the issue 
> disappear.
> The workaround is to increase spark.sql.broadcastTimeout, and it works. But 
> because the data to broadcast is very small, that doesn't make sense.
> After investigation, the root cause appears to be this: when AQE is enabled, in 
> getFinalPhysicalPlan, Spark traverses the physical plan bottom up, creates 
> query stages for the materialized parts via createQueryStages, and materializes 
> those newly created query stages to submit map stages or broadcasts. When a 
> ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job 
> and the broadcast job are submitted almost at the same time, but the map job 
> holds all the computing resources. If the map job runs slowly (when there is a 
> lot of data to process and resources are limited), the broadcast job cannot be 
> started (and finished) before spark.sql.broadcastTimeout, which causes the 
> whole job to fail (introduced in SPARK-31475).
> Code to reproduce:
>  
> {code:java}
> import java.util.UUID
> import scala.util.Random
> import org.apache.spark.sql.functions._
> import org.apache.spark.sql.SparkSession
> val spark = SparkSession.builder()
>   .master("local[2]")
>   .appName("Test Broadcast").getOrCreate()
> import spark.implicits._
> spark.conf.set("spark.sql.adaptive.enabled", "true")
> val sc = spark.sparkContext
> sc.setLogLevel("INFO")
> val uuid = UUID.randomUUID
> val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
>   for (i <- Range(0, 1 + Random.nextInt(1)))
> yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
> }).toDF("index", "part", "pv", "uuid")
>   .withColumn("md5", md5($"uuid"))
> val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
> val dim = dim_data.toDF("name", "index")
> val result = df.groupBy("index")
>   .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
>   .join(dim, Seq("index"))
>   .collect(){code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31936) Implement ScriptTransform in sql/core

2020-12-29 Thread angerszhu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

angerszhu updated SPARK-31936:
--
Affects Version/s: 3.2.0
   3.1.0

> Implement ScriptTransform in sql/core
> -
>
> Key: SPARK-31936
> URL: https://issues.apache.org/jira/browse/SPARK-31936
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0, 3.2.0
>Reporter: angerszhu
>Assignee: angerszhu
>Priority: Major
>
> ScriptTransformation currently relies on Hive internals. It'd be great if we 
> could implement a native ScriptTransformation in the sql/core module to remove 
> the extra Hive dependency here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2020-12-29 Thread Yu Zhong (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33933?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yu Zhong updated SPARK-33933:
-
Description: 
In Spark 3.0, when AQE is enabled, broadcast timeouts often happen in normal 
queries, as below.

 
{code:java}
Could not execute broadcast in 300 secs. You can increase the timeout for 
broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting 
spark.sql.autoBroadcastJoinThreshold to -1
{code}
 

This usually happens when a broadcast join (with or without a hint) follows a 
long-running shuffle (more than 5 minutes). Disabling AQE makes the issue 
disappear.

The workaround is to increase spark.sql.broadcastTimeout, and it works. But 
because the data to broadcast is very small, that doesn't make sense.

After investigation, the root cause appears to be this: when AQE is enabled, in 
getFinalPhysicalPlan, Spark traverses the physical plan bottom up, creates query 
stages for the materialized parts via createQueryStages, and materializes those 
newly created query stages to submit map stages or broadcasts. When a 
ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and 
the broadcast job are submitted almost at the same time, but the map job holds 
all the computing resources. If the map job runs slowly (when there is a lot of 
data to process and resources are limited), the broadcast job cannot be started 
(and finished) before spark.sql.broadcastTimeout, which causes the whole job to 
fail (introduced in SPARK-31475).

Code to reproduce:

 
{code:java}
import java.util.UUID
import scala.util.Random
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("Test Broadcast").getOrCreate()
import spark.implicits._

spark.conf.set("spark.sql.adaptive.enabled", "true")

val sc = spark.sparkContext
sc.setLogLevel("INFO")
val uuid = UUID.randomUUID

val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
  for (i <- Range(0, 1 + Random.nextInt(1)))
yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
}).toDF("index", "part", "pv", "uuid")
  .withColumn("md5", md5($"uuid"))

val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
val dim = dim_data.toDF("name", "index")

val result = df.groupBy("index")
  .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
  .join(dim, Seq("index"))
  .collect(){code}
 

 

  was:
In Spark 3.0, when AQE is enabled, broadcast timeouts often happen in normal 
queries, as below.

 
{code:java}
Could not execute broadcast in 300 secs. You can increase the timeout for 
broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting 
spark.sql.autoBroadcastJoinThreshold to -1
{code}
 

This usually happens when a broadcast join (with or without a hint) follows a 
long-running shuffle (more than 5 minutes). Disabling AQE makes the issue 
disappear.

The workaround is to increase spark.sql.broadcastTimeout, and it works. But 
because the data to broadcast is very small, that doesn't make sense.

After investigation, the root cause appears to be this: when AQE is enabled, in 
getFinalPhysicalPlan, Spark traverses the physical plan bottom up, creates query 
stages for the materialized parts via createQueryStages, and materializes those 
newly created query stages to submit map stages or broadcasts. When a 
ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and 
the broadcast job are submitted almost at the same time, but the map job holds 
all the computing resources. If the map job runs slowly (when there is a lot of 
data to process and resources are limited), the broadcast job cannot be started 
(and finished) before spark.sql.broadcastTimeout, which causes the whole job to 
fail (introduced in SPARK-31475).

Code to reproduce:

 
{code:java}
import java.util.UUID
import scala.util.Random
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("Test Broadcast").getOrCreate()
import spark.implicits._

val sc = spark.sparkContext
sc.setLogLevel("INFO")
val uuid = UUID.randomUUID

val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
  for (i <- Range(0, 1 + Random.nextInt(1)))
yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
}).toDF("index", "part", "pv", "uuid")
  .withColumn("md5", md5($"uuid"))

val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
val dim = dim_data.toDF("name", "index")

val result = df.groupBy("index")
  .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
  .join(dim, Seq("index"))
  .collect(){code}
 

 


> Broadcast timeout happened unexpectedly in AQE 
> ---
>
> Key: SPARK-33933
> URL: https://issues.apache.org/jira/browse/SPARK-33933
> Project: Spark
>   

[jira] [Created] (SPARK-33933) Broadcast timeout happened unexpectedly in AQE

2020-12-29 Thread Yu Zhong (Jira)
Yu Zhong created SPARK-33933:


 Summary: Broadcast timeout happened unexpectedly in AQE 
 Key: SPARK-33933
 URL: https://issues.apache.org/jira/browse/SPARK-33933
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1, 3.0.0
Reporter: Yu Zhong


In Spark 3.0, when AQE is enabled, broadcast timeouts often happen in normal 
queries, as below.

 
{code:java}
Could not execute broadcast in 300 secs. You can increase the timeout for 
broadcasts via spark.sql.broadcastTimeout or disable broadcast join by setting 
spark.sql.autoBroadcastJoinThreshold to -1
{code}
 

This usually happens when a broadcast join (with or without a hint) follows a 
long-running shuffle (more than 5 minutes). Disabling AQE makes the issue 
disappear.

The workaround is to increase spark.sql.broadcastTimeout, and it works. But 
because the data to broadcast is very small, that doesn't make sense.

After investigation, the root cause appears to be this: when AQE is enabled, in 
getFinalPhysicalPlan, Spark traverses the physical plan bottom up, creates query 
stages for the materialized parts via createQueryStages, and materializes those 
newly created query stages to submit map stages or broadcasts. When a 
ShuffleQueryStage is materialized before a BroadcastQueryStage, the map job and 
the broadcast job are submitted almost at the same time, but the map job holds 
all the computing resources. If the map job runs slowly (when there is a lot of 
data to process and resources are limited), the broadcast job cannot be started 
(and finished) before spark.sql.broadcastTimeout, which causes the whole job to 
fail (introduced in SPARK-31475).

Code to reproduce:

 
{code:java}
import java.util.UUID
import scala.util.Random
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("Test Broadcast").getOrCreate()
import spark.implicits._

val sc = spark.sparkContext
sc.setLogLevel("INFO")
val uuid = UUID.randomUUID

val df = sc.parallelize(Range(0, 1), 1).flatMap(x => {
  for (i <- Range(0, 1 + Random.nextInt(1)))
yield (x % 26, x, Random.nextInt(10), UUID.randomUUID.toString)
}).toDF("index", "part", "pv", "uuid")
  .withColumn("md5", md5($"uuid"))

val dim_data = Range(0, 26).map(x => (('a' + x).toChar.toString, x))
val dim = dim_data.toDF("name", "index")

val result = df.groupBy("index")
  .agg(sum($"pv").alias("pv"), countDistinct("uuid").alias("uv"))
  .join(dim, Seq("index"))
  .collect(){code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33932) Clean up KafkaOffsetReader API document

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33932:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Clean up KafkaOffsetReader API document
> ---
>
> Key: SPARK-33932
> URL: https://issues.apache.org/jira/browse/SPARK-33932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> KafkaOffsetReader API documents are duplicated between 
> KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It would be good if the 
> doc were centralized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33932) Clean up KafkaOffsetReader API document

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33932:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Clean up KafkaOffsetReader API document
> ---
>
> Key: SPARK-33932
> URL: https://issues.apache.org/jira/browse/SPARK-33932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> KafkaOffsetReader API documents are duplicated between 
> KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It would be good if the 
> doc were centralized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-33932) Clean up KafkaOffsetReader API document

2020-12-29 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33932:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Clean up KafkaOffsetReader API document
> ---
>
> Key: SPARK-33932
> URL: https://issues.apache.org/jira/browse/SPARK-33932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> KafkaOffsetReader API documents are duplicated between 
> KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It would be good if the 
> doc were centralized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33932) Clean up KafkaOffsetReader API document

2020-12-29 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33932?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17255862#comment-17255862
 ] 

Apache Spark commented on SPARK-33932:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30961

> Clean up KafkaOffsetReader API document
> ---
>
> Key: SPARK-33932
> URL: https://issues.apache.org/jira/browse/SPARK-33932
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.2.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> KafkaOffsetReader API documents are duplicated between 
> KafkaOffsetReaderConsumer and KafkaOffsetReaderAdmin. It would be good if the 
> doc were centralized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-33932) Clean up KafkaOffsetReader API document

2020-12-29 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-33932:
---

 Summary: Clean up KafkaOffsetReader API document
 Key: SPARK-33932
 URL: https://issues.apache.org/jira/browse/SPARK-33932
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.2.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


KafkaOffsetReader API documents are duplicated between KafkaOffsetReaderConsumer 
and KafkaOffsetReaderAdmin. It would be good if the doc were centralized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org