[jira] [Updated] (SPARK-33391) element_at with CreateArray not respect one based index

2020-11-08 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33391:

Description: 
var df = spark.sql("select element_at(array(3, 2, 1), 0)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 1)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 2)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 3)")
df.printSchema()

root
 |-- element_at(array(3, 2, 1), 0): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 1): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 2): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 3): integer (nullable = true)

 

In this case, the nullable property of element_at over a CreateArray expression is 
not correct. Since element_at uses one-based indexing, index 3 addresses the last valid 
element of the three-element array and should be non-nullable, while index 0 is out of 
range; the nullability appears to be computed as if the index were zero-based.

 

  was:TODO


> element_at with CreateArray not respect one based index
> ---
>
> Key: SPARK-33391
> URL: https://issues.apache.org/jira/browse/SPARK-33391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> var df = spark.sql("select element_at(array(3, 2, 1), 0)")
> df.printSchema()
> df = spark.sql("select element_at(array(3, 2, 1), 1)")
> df.printSchema()
> df = spark.sql("select element_at(array(3, 2, 1), 2)")
> df.printSchema()
> df = spark.sql("select element_at(array(3, 2, 1), 3)")
> df.printSchema()
> root
>  |-- element_at(array(3, 2, 1), 0): integer (nullable = false)
> root
>  |-- element_at(array(3, 2, 1), 1): integer (nullable = false)
> root
>  |-- element_at(array(3, 2, 1), 2): integer (nullable = false)
> root
>  |-- element_at(array(3, 2, 1), 3): integer (nullable = true)
>  
> In this case, the nullable property of element_at over a CreateArray expression 
> is not correct: element_at uses one-based indexing, so index 3 should be 
> non-nullable while index 0 is out of range.
>  






[jira] [Updated] (SPARK-33391) element_at with CreateArray not respect one based index

2020-11-08 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33391:

Summary: element_at with CreateArray not respect one based index  (was: 
element_at not respect one based index)

> element_at with CreateArray not respect one based index
> ---
>
> Key: SPARK-33391
> URL: https://issues.apache.org/jira/browse/SPARK-33391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> TODO






[jira] [Created] (SPARK-33391) element_at not respect one based index

2020-11-08 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33391:
---

 Summary: element_at not respect one based index
 Key: SPARK-33391
 URL: https://issues.apache.org/jira/browse/SPARK-33391
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


TODO






[jira] [Assigned] (SPARK-33390) Make Literal support char array

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33390:


Assignee: Apache Spark

> Make Literal support char array
> ---
>
> Key: SPARK-33390
> URL: https://issues.apache.org/jira/browse/SPARK-33390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Assignee: Apache Spark
>Priority: Minor
>
> Make Literal support char array.






[jira] [Commented] (SPARK-33390) Make Literal support char array

2020-11-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228385#comment-17228385
 ] 

Apache Spark commented on SPARK-33390:
--

User 'ulysses-you' has created a pull request for this issue:
https://github.com/apache/spark/pull/30295

> Make Literal support char array
> ---
>
> Key: SPARK-33390
> URL: https://issues.apache.org/jira/browse/SPARK-33390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> Make Literal support char array.






[jira] [Assigned] (SPARK-33390) Make Literal support char array

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33390:


Assignee: (was: Apache Spark)

> Make Literal support char array
> ---
>
> Key: SPARK-33390
> URL: https://issues.apache.org/jira/browse/SPARK-33390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>
> Make Literal support char array.






[jira] [Created] (SPARK-33390) Make Literal support char array

2020-11-08 Thread ulysses you (Jira)
ulysses you created SPARK-33390:
---

 Summary: Make Literal support char array
 Key: SPARK-33390
 URL: https://issues.apache.org/jira/browse/SPARK-33390
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: ulysses you


Make Literal support char array.
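
A minimal sketch of the intended behavior (an assumption for illustration; the exact change may differ): constructing a Catalyst Literal from an Array[Char] should succeed and behave like a string literal instead of failing as an unsupported type.

{code:scala}
import org.apache.spark.sql.catalyst.expressions.Literal

// Intended result (illustrative): a StringType literal with value "abc",
// instead of an "unsupported literal type" error for Array[Char].
val lit = Literal(Array('a', 'b', 'c'))
{code}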






[jira] [Assigned] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog

2020-11-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32405:
---

Assignee: Huaxin Gao

> Apply table options while creating tables in JDBC Table Catalog
> ---
>
> Key: SPARK-32405
> URL: https://issues.apache.org/jira/browse/SPARK-32405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Huaxin Gao
>Priority: Major
>
> We need to add an API to `JdbcDialect` to generate the SQL statement to 
> specify table options.
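
A minimal sketch of the kind of dialect hook described above (the method name and signature are assumptions for illustration, not the merged API):

{code:scala}
// Illustrative only: render CREATE TABLE options as trailing SQL text,
// e.g. Map("ENGINE" -> "InnoDB") => " ENGINE=InnoDB".
def createTableOptions(options: Map[String, String]): String =
  if (options.isEmpty) "" else options.map { case (k, v) => s"$k=$v" }.mkString(" ", " ", "")
{code}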






[jira] [Resolved] (SPARK-32405) Apply table options while creating tables in JDBC Table Catalog

2020-11-08 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32405.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30154
[https://github.com/apache/spark/pull/30154]

> Apply table options while creating tables in JDBC Table Catalog
> ---
>
> Key: SPARK-32405
> URL: https://issues.apache.org/jira/browse/SPARK-32405
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Maxim Gekk
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.1.0
>
>
> We need to add an API to `JdbcDialect` to generate the SQL statement to 
> specify table options.






[jira] [Created] (SPARK-33389) make internal classes of SparkSession always using active SQLConf

2020-11-08 Thread Lu Lu (Jira)
Lu Lu created SPARK-33389:
-

 Summary: make internal classes of SparkSession always using active 
SQLConf
 Key: SPARK-33389
 URL: https://issues.apache.org/jira/browse/SPARK-33389
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Lu Lu









[jira] [Resolved] (SPARK-33387) Support ordered shuffle block migration

2020-11-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33387.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30293
[https://github.com/apache/spark/pull/30293]

> Support ordered shuffle block migration
> ---
>
> Key: SPARK-33387
> URL: https://issues.apache.org/jira/browse/SPARK-33387
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>
> Since the current shuffle block migration works in a random order, the 
> failure during worker decommission affects all of the shuffles. This issue 
> aims to support ordered migration.






[jira] [Assigned] (SPARK-33387) Support ordered shuffle block migration

2020-11-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-33387:
-

Assignee: Dongjoon Hyun

> Support ordered shuffle block migration
> ---
>
> Key: SPARK-33387
> URL: https://issues.apache.org/jira/browse/SPARK-33387
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> Since the current shuffle block migration works in a random order, the 
> failure during worker decommission affects all of the shuffles. This issue 
> aims to support ordered migration.






[jira] [Updated] (SPARK-33140) make all sub-class of Rule[QueryPlan] using SQLConf.get

2020-11-08 Thread Lu Lu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Lu updated SPARK-33140:
--
Summary: make all sub-class of Rule[QueryPlan] using SQLConf.get  (was: 
make Analyzer rules using SQLConf.get)

> make all sub-class of Rule[QueryPlan] using SQLConf.get
> ---
>
> Key: SPARK-33140
> URL: https://issues.apache.org/jira/browse/SPARK-33140
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
> Fix For: 3.1.0
>
>
> TODO






[jira] [Commented] (SPARK-33140) make Analyzer rules using SQLConf.get

2020-11-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228356#comment-17228356
 ] 

Apache Spark commented on SPARK-33140:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/30294

> make Analyzer rules using SQLConf.get
> -
>
> Key: SPARK-33140
> URL: https://issues.apache.org/jira/browse/SPARK-33140
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
> Fix For: 3.1.0
>
>
> TODO






[jira] [Commented] (SPARK-33140) make Analyzer rules using SQLConf.get

2020-11-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228357#comment-17228357
 ] 

Apache Spark commented on SPARK-33140:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/30294

> make Analyzer rules using SQLConf.get
> -
>
> Key: SPARK-33140
> URL: https://issues.apache.org/jira/browse/SPARK-33140
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
> Fix For: 3.1.0
>
>
> TODO






[jira] [Updated] (SPARK-33371) Support Python 3.9+ in PySpark

2020-11-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33371:
--
Fix Version/s: 3.0.2

> Support Python 3.9+ in PySpark
> --
>
> Key: SPARK-33371
> URL: https://issues.apache.org/jira/browse/SPARK-33371
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.2, 3.1.0
>
>
> Python 3.9 works with PySpark. We should fix setup.py accordingly.






[jira] [Updated] (SPARK-33387) Support ordered shuffle block migration

2020-11-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-33387:
--
Description: Since the current shuffle block migration works in a random 
order, the failure during worker decommission affects all of the shuffles. This 
issue aims to support ordered migration.  (was: Since the current shuffle block 
migration works in a random order like the following, the failure during worker 
decommission affects all of the shuffles. This issue aims to support ordered 
migration.

shuffle_16_1900_0.index
shuffle_19_2123_0.index
shuffle_25_3792_0.index
shuffle_25_3792_0.data
shuffle_19_2123_0.data
shuffle_16_1900_0.data
shuffle_16_2015_0.index
shuffle_16_2015_0.data
shuffle_12_3264_0.index
shuffle_14_4329_0.index
shuffle_20_2463_0.index
shuffle_20_2463_0.data
shuffle_14_4329_0.data)

> Support ordered shuffle block migration
> ---
>
> Key: SPARK-33387
> URL: https://issues.apache.org/jira/browse/SPARK-33387
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since the current shuffle block migration works in a random order, the 
> failure during worker decommission affects all of the shuffles. This issue 
> aims to support ordered migration.






[jira] [Created] (SPARK-33388) Merge In and InSet predicate

2020-11-08 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33388:
---

 Summary: Merge In and InSet predicate
 Key: SPARK-33388
 URL: https://issues.apache.org/jira/browse/SPARK-33388
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


Maybe we should create a base class for {{In}} and {{InSet}}, so that the two 
classes differ only in how they are represented in the expression tree, while 
eval and codegen are shared.

[https://github.com/apache/spark/pull/28269#issuecomment-655365714]
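
A rough sketch of the idea, using made-up class names rather than Spark's actual expressions: the membership test is shared in a base class, and the two children differ only in how the candidate values are represented.

{code:scala}
// Illustrative sketch only, not the real Catalyst classes.
abstract class BaseIn {
  def valueSet: Set[Any]                                       // materialized allowed values
  final def contains(v: Any): Boolean = valueSet.contains(v)   // shared eval logic
}

// "In"-like: values arrive as a sequence (child expressions in the real plan).
class InLike(values: Seq[Any]) extends BaseIn {
  lazy val valueSet: Set[Any] = values.toSet
}

// "InSet"-like: values are already materialized into a set.
class InSetLike(val valueSet: Set[Any]) extends BaseIn
{code}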






[jira] [Assigned] (SPARK-33387) Support ordered shuffle block migration

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33387:


Assignee: Apache Spark

> Support ordered shuffle block migration
> ---
>
> Key: SPARK-33387
> URL: https://issues.apache.org/jira/browse/SPARK-33387
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Since the current shuffle block migration works in a random order like the 
> following, the failure during worker decommission affects all of the 
> shuffles. This issue aims to support ordered migration.
> shuffle_16_1900_0.index
> shuffle_19_2123_0.index
> shuffle_25_3792_0.index
> shuffle_25_3792_0.data
> shuffle_19_2123_0.data
> shuffle_16_1900_0.data
> shuffle_16_2015_0.index
> shuffle_16_2015_0.data
> shuffle_12_3264_0.index
> shuffle_14_4329_0.index
> shuffle_20_2463_0.index
> shuffle_20_2463_0.data
> shuffle_14_4329_0.data






[jira] [Assigned] (SPARK-33387) Support ordered shuffle block migration

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33387:


Assignee: (was: Apache Spark)

> Support ordered shuffle block migration
> ---
>
> Key: SPARK-33387
> URL: https://issues.apache.org/jira/browse/SPARK-33387
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since the current shuffle block migration works in a random order like the 
> following, the failure during worker decommission affects all of the 
> shuffles. This issue aims to support ordered migration.
> shuffle_16_1900_0.index
> shuffle_19_2123_0.index
> shuffle_25_3792_0.index
> shuffle_25_3792_0.data
> shuffle_19_2123_0.data
> shuffle_16_1900_0.data
> shuffle_16_2015_0.index
> shuffle_16_2015_0.data
> shuffle_12_3264_0.index
> shuffle_14_4329_0.index
> shuffle_20_2463_0.index
> shuffle_20_2463_0.data
> shuffle_14_4329_0.data






[jira] [Commented] (SPARK-33387) Support ordered shuffle block migration

2020-11-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228341#comment-17228341
 ] 

Apache Spark commented on SPARK-33387:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/30293

> Support ordered shuffle block migration
> ---
>
> Key: SPARK-33387
> URL: https://issues.apache.org/jira/browse/SPARK-33387
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since the current shuffle block migration works in a random order like the 
> following, the failure during worker decommission affects all of the 
> shuffles. This issue aims to support ordered migration.
> shuffle_16_1900_0.index
> shuffle_19_2123_0.index
> shuffle_25_3792_0.index
> shuffle_25_3792_0.data
> shuffle_19_2123_0.data
> shuffle_16_1900_0.data
> shuffle_16_2015_0.index
> shuffle_16_2015_0.data
> shuffle_12_3264_0.index
> shuffle_14_4329_0.index
> shuffle_20_2463_0.index
> shuffle_20_2463_0.data
> shuffle_14_4329_0.data






[jira] [Assigned] (SPARK-33387) Support ordered shuffle block migration

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33387:


Assignee: (was: Apache Spark)

> Support ordered shuffle block migration
> ---
>
> Key: SPARK-33387
> URL: https://issues.apache.org/jira/browse/SPARK-33387
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Since the current shuffle block migration works in a random order like the 
> following, the failure during worker decommission affects all of the 
> shuffles. This issue aims to support ordered migration.
> shuffle_16_1900_0.index
> shuffle_19_2123_0.index
> shuffle_25_3792_0.index
> shuffle_25_3792_0.data
> shuffle_19_2123_0.data
> shuffle_16_1900_0.data
> shuffle_16_2015_0.index
> shuffle_16_2015_0.data
> shuffle_12_3264_0.index
> shuffle_14_4329_0.index
> shuffle_20_2463_0.index
> shuffle_20_2463_0.data
> shuffle_14_4329_0.data






[jira] [Created] (SPARK-33387) Support ordered shuffle block migration

2020-11-08 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-33387:
-

 Summary: Support ordered shuffle block migration
 Key: SPARK-33387
 URL: https://issues.apache.org/jira/browse/SPARK-33387
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun


Since the current shuffle block migration works in a random order like the 
following, the failure during worker decommission affects all of the shuffles. 
This issue aims to support ordered migration.

shuffle_16_1900_0.index
shuffle_19_2123_0.index
shuffle_25_3792_0.index
shuffle_25_3792_0.data
shuffle_19_2123_0.data
shuffle_16_1900_0.data
shuffle_16_2015_0.index
shuffle_16_2015_0.data
shuffle_12_3264_0.index
shuffle_14_4329_0.index
shuffle_20_2463_0.index
shuffle_20_2463_0.data
shuffle_14_4329_0.data
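
As a sketch of what an ordered migration could look like (an assumption for illustration, not necessarily the merged change): sort the blocks by shuffle id, so an interrupted decommission leaves whole shuffles migrated instead of partially touching all of them.

{code:scala}
// Illustrative ordering only; the (shuffleId, mapId) pair mirrors how shuffle
// block files like shuffle_16_1900_0.data are named.
case class ShuffleBlockInfo(shuffleId: Int, mapId: Long)

def orderForMigration(blocks: Seq[ShuffleBlockInfo]): Seq[ShuffleBlockInfo] =
  blocks.sortBy(b => (b.shuffleId, b.mapId))
{code}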






[jira] [Updated] (SPARK-33244) Unify the code paths for spark.table and spark.read.table

2020-11-08 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-33244:

Description: 
* The code paths of `spark.table` and `spark.read.table` should be the same. 
This behavior was broken in SPARK-32592 since we need to respect options in 
the `spark.read.table` API.
 * Add comments for `spark.table` to emphasize that it also supports streaming 
temp view reading

  was:
* The code paths of `spark.table` and `spark.read.table` should be the same. 
This behavior was broken in SPARK-32592 since we need to respect options in 
the `spark.read.table` API.
 * Add a comment for `spark.table` to emphasize that it also supports streaming 
temp view reading


> Unify the code paths for spark.table and spark.read.table
> -
>
> Key: SPARK-33244
> URL: https://issues.apache.org/jira/browse/SPARK-33244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> * The code paths of `spark.table` and `spark.read.table` should be the same. 
> This behavior was broken in SPARK-32592 since we need to respect options in 
> the `spark.read.table` API.
>  * Add comments for `spark.table` to emphasize that it also supports streaming 
> temp view reading
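
For reference, the two entry points this ticket is about look like this in user code (illustrative usage only; the option name is made up):

{code:scala}
// Both should resolve the same table through the same code path; only
// spark.read.table goes through DataFrameReader and can carry read options.
val t1 = spark.table("events")
val t2 = spark.read.option("someOption", "value").table("events")
{code}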






[jira] [Updated] (SPARK-33244) Unify the code paths for spark.table and spark.read.table

2020-11-08 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-33244:

Description: 
* The code paths of `spark.table` and `spark.read.table` should be the same. 
This behavior was broken in SPARK-32592 since we need to respect options in 
the `spark.read.table` API.
 * Add a comment for `spark.table` to emphasize that it also supports streaming 
temp view reading

  was:The code paths of `spark.table` and `spark.read.table` should be the 
same. This behavior was broken in SPARK-32592 since we need to respect options in 
the `spark.read.table` API.


> Unify the code paths for spark.table and spark.read.table
> -
>
> Key: SPARK-33244
> URL: https://issues.apache.org/jira/browse/SPARK-33244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> * The code paths of `spark.table` and `spark.read.table` should be the same. 
> This behavior was broken in SPARK-32592 since we need to respect options in 
> the `spark.read.table` API.
>  * Add a comment for `spark.table` to emphasize that it also supports streaming 
> temp view reading






[jira] [Updated] (SPARK-33244) Unify the code paths for spark.table and spark.read.table

2020-11-08 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-33244:

Description: The code paths of `spark.table` and `spark.read.table` should 
be the same. This behavior was broken in SPARK-32592 since we need to respect 
options in the `spark.read.table` API.  (was: * Block reading streaming temp view 
via `spark.table` API 
 * The code paths of `spark.table` and `spark.read.table` should be the same. 
This behavior was broken in SPARK-32592 since we need to respect options in 
the `spark.read.table` API.)

> Unify the code paths for spark.table and spark.read.table
> -
>
> Key: SPARK-33244
> URL: https://issues.apache.org/jira/browse/SPARK-33244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> The code paths of `spark.table` and `spark.read.table` should be the same. 
> This behavior is broke in SPARK-32592 since we need to respect options in 
> `spark.read.table` API.






[jira] [Updated] (SPARK-33244) Unify the code paths for spark.table and spark.read.table

2020-11-08 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-33244:

Summary: Unify the code paths for spark.table and spark.read.table  (was: 
Block reading streaming temp view via `spark.table` API)

> Unify the code paths for spark.table and spark.read.table
> -
>
> Key: SPARK-33244
> URL: https://issues.apache.org/jira/browse/SPARK-33244
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> * Block reading streaming temp view via `spark.table` API 
>  * The code paths of `spark.table` and `spark.read.table` should be the same. 
> This behavior was broken in SPARK-32592 since we need to respect options in 
> the `spark.read.table` API.






[jira] [Commented] (SPARK-33166) Provide Search Function in Spark docs site

2020-11-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228335#comment-17228335
 ] 

Apache Spark commented on SPARK-33166:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/30292

> Provide Search Function in Spark docs site
> --
>
> Key: SPARK-33166
> URL: https://issues.apache.org/jira/browse/SPARK-33166
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> In the last few releases, our Spark documentation at 
> https://spark.apache.org/docs/latest/ has become richer. It would be nice to 
> provide a search function to help our users find content faster. 






[jira] [Assigned] (SPARK-33166) Provide Search Function in Spark docs site

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33166:


Assignee: Apache Spark

> Provide Search Function in Spark docs site
> --
>
> Key: SPARK-33166
> URL: https://issues.apache.org/jira/browse/SPARK-33166
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Assignee: Apache Spark
>Priority: Major
>
> In the last few releases, our Spark documentation at 
> https://spark.apache.org/docs/latest/ has become richer. It would be nice to 
> provide a search function to help our users find content faster. 






[jira] [Assigned] (SPARK-33166) Provide Search Function in Spark docs site

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33166:


Assignee: (was: Apache Spark)

> Provide Search Function in Spark docs site
> --
>
> Key: SPARK-33166
> URL: https://issues.apache.org/jira/browse/SPARK-33166
> Project: Spark
>  Issue Type: New Feature
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Xiao Li
>Priority: Major
>
> In the last few releases, our Spark documentation at 
> https://spark.apache.org/docs/latest/ has become richer. It would be nice to 
> provide a search function to help our users find content faster. 






[jira] [Updated] (SPARK-33386) Accessing array elements should failed if index is out of bound.

2020-11-08 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33386:

Description: When ANSI mode is enabled, accessing an array element with an out-of-bound 
index should fail with an exception, but currently it returns null.  (was: TODO)

> Accessing array elements should failed if index is out of bound.
> 
>
> Key: SPARK-33386
> URL: https://issues.apache.org/jira/browse/SPARK-33386
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> When ANSI mode is enabled, accessing an array element with an out-of-bound index 
> should fail with an exception, but currently it returns null.
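
A minimal repro of the expected behavior (illustrative; assumes the {{spark.sql.ansi.enabled}} flag):

{code:scala}
// With ANSI mode on, the out-of-bound access should raise an error at runtime
// instead of silently returning NULL.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT array(1, 2, 3)[5]").show()
{code}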






[jira] [Created] (SPARK-33386) Accessing array elements should failed if index is out of bound.

2020-11-08 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33386:
---

 Summary: Accessing array elements should failed if index is out of 
bound.
 Key: SPARK-33386
 URL: https://issues.apache.org/jira/browse/SPARK-33386
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


TODO






[jira] [Resolved] (SPARK-33384) Delete temporary file when cancelling writing to final path even underlying stream throwing error

2020-11-08 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-33384.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30290
[https://github.com/apache/spark/pull/30290]

> Delete temporary file when cancelling writing to final path even underlying 
> stream throwing error
> -
>
> Key: SPARK-33384
> URL: https://issues.apache.org/jira/browse/SPARK-33384
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 3.1.0
>
>
> In {{RenameBasedFSDataOutputStream.cancel}}, we do two things in a single 
> try/catch block: close the underlying stream and delete the temporary file. 
> Closing the {{OutputStream}} can throw an {{IOException}}, so we may miss 
> deleting the temporary file.






[jira] [Assigned] (SPARK-33385) Bucket pruning support IsNaN

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33385:


Assignee: (was: Apache Spark)

> Bucket pruning support IsNaN
> 
>
> Key: SPARK-33385
> URL: https://issues.apache.org/jira/browse/SPARK-33385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {{IsNaN}} can also support bucket pruning.






[jira] [Assigned] (SPARK-33385) Bucket pruning support IsNaN

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33385:


Assignee: Apache Spark

> Bucket pruning support IsNaN
> 
>
> Key: SPARK-33385
> URL: https://issues.apache.org/jira/browse/SPARK-33385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Assignee: Apache Spark
>Priority: Major
>
> {{IsNaN}} can also support bucket pruning.






[jira] [Commented] (SPARK-33385) Bucket pruning support IsNaN

2020-11-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228324#comment-17228324
 ] 

Apache Spark commented on SPARK-33385:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30291

> Bucket pruning support IsNaN
> 
>
> Key: SPARK-33385
> URL: https://issues.apache.org/jira/browse/SPARK-33385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {{IsNaN}} can also support bucket pruning.






[jira] [Commented] (SPARK-33385) Bucket pruning support IsNaN

2020-11-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228323#comment-17228323
 ] 

Apache Spark commented on SPARK-33385:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/30291

> Bucket pruning support IsNaN
> 
>
> Key: SPARK-33385
> URL: https://issues.apache.org/jira/browse/SPARK-33385
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Yuming Wang
>Priority: Major
>
> {{IsNaN}} can also support bucket pruning.






[jira] [Created] (SPARK-33385) Bucket pruning support IsNaN

2020-11-08 Thread Yuming Wang (Jira)
Yuming Wang created SPARK-33385:
---

 Summary: Bucket pruning support IsNaN
 Key: SPARK-33385
 URL: https://issues.apache.org/jira/browse/SPARK-33385
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Yuming Wang


{{IsNaN}} can also support bucket pruning.
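
An illustrative example of the query shape this would help, assuming a table {{t}} bucketed on a double column {{d}}: the predicate fixes the value to NaN, so only the single bucket NaN hashes to would need to be scanned.

{code:scala}
// Illustrative only: with bucket pruning extended to IsNaN, this filter could
// prune to the one bucket containing Double.NaN instead of scanning all buckets.
spark.sql("SELECT * FROM t WHERE isnan(d)").show()
{code}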






[jira] [Assigned] (SPARK-33352) Fix procedure-like declaration compilation warning in Scala 2.13

2020-11-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-33352:


Assignee: Yang Jie

> Fix procedure-like declaration compilation warning in Scala 2.13
> 
>
> Key: SPARK-33352
> URL: https://issues.apache.org/jira/browse/SPARK-33352
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
>
> Similar to spark-29291, just to track Spark 3.1.0.
> There are two similar compilation warnings about procedure-like declaration 
> in Scala 2.13.3:
>  
> {code:java}
> [WARNING] [Warn] 
> /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: 
> procedure syntax is deprecated for constructors: add `=`, as in method 
> definition
> [WARNING] [Warn] 
> /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211:
>  procedure syntax is deprecated: instead, add `: Unit =` to explicitly 
> declare `run`'s return type
> {code}
>  
> For constructors, the definition should be `this(...) = \{ }` rather than 
> `this(...) \{ }`; for methods without an explicit 
> return type, the definition should be `def methodName(...): Unit = {}` rather 
> than `def methodName(...) {}`.
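
A minimal before/after example of the deprecated procedure syntax the warnings point at (illustrative names only):

{code:scala}
class Worker {
  // Before (procedure syntax, deprecated in Scala 2.13):
  //   def run() { println("running") }

  // After (explicit ": Unit ="):
  def run(): Unit = { println("running") }
}
{code}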






[jira] [Resolved] (SPARK-33352) Fix procedure-like declaration compilation warning in Scala 2.13

2020-11-08 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-33352.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 30255
[https://github.com/apache/spark/pull/30255]

> Fix procedure-like declaration compilation warning in Scala 2.13
> 
>
> Key: SPARK-33352
> URL: https://issues.apache.org/jira/browse/SPARK-33352
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.1.0
>Reporter: Yang Jie
>Assignee: Yang Jie
>Priority: Minor
> Fix For: 3.1.0
>
>
> Similar to spark-29291, just to track Spark 3.1.0.
> There are two similar compilation warnings about procedure-like declaration 
> in Scala 2.13.3:
>  
> {code:java}
> [WARNING] [Warn] 
> /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: 
> procedure syntax is deprecated for constructors: add `=`, as in method 
> definition
> [WARNING] [Warn] 
> /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211:
>  procedure syntax is deprecated: instead, add `: Unit =` to explicitly 
> declare `run`'s return type
> {code}
>  
> For constructors, the definition should be `this(...) = \{ }` rather than 
> `this(...) \{ }`; for methods without an explicit 
> return type, the definition should be `def methodName(...): Unit = {}` rather 
> than `def methodName(...) {}`.






[jira] [Commented] (SPARK-33384) Delete temporary file when cancelling writing to final path even underlying stream throwing error

2020-11-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228270#comment-17228270
 ] 

Apache Spark commented on SPARK-33384:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30290

> Delete temporary file when cancelling writing to final path even underlying 
> stream throwing error
> -
>
> Key: SPARK-33384
> URL: https://issues.apache.org/jira/browse/SPARK-33384
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> In {{RenameBasedFSDataOutputStream.cancel}}, we do two things in a single 
> try/catch block: close the underlying stream and delete the temporary file. 
> Closing the {{OutputStream}} can throw an {{IOException}}, so we may miss 
> deleting the temporary file.






[jira] [Assigned] (SPARK-33384) Delete temporary file when cancelling writing to final path even underlying stream throwing error

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33384:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Delete temporary file when cancelling writing to final path even underlying 
> stream throwing error
> -
>
> Key: SPARK-33384
> URL: https://issues.apache.org/jira/browse/SPARK-33384
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> In {{RenameBasedFSDataOutputStream.cancel}}, we do two things in a single 
> try/catch block: close the underlying stream and delete the temporary file. 
> Closing the {{OutputStream}} can throw an {{IOException}}, so we may miss 
> deleting the temporary file.






[jira] [Commented] (SPARK-33384) Delete temporary file when cancelling writing to final path even underlying stream throwing error

2020-11-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228269#comment-17228269
 ] 

Apache Spark commented on SPARK-33384:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/30290

> Delete temporary file when cancelling writing to final path even underlying 
> stream throwing error
> -
>
> Key: SPARK-33384
> URL: https://issues.apache.org/jira/browse/SPARK-33384
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> In {{RenameBasedFSDataOutputStream.cancel}}, we do two things in a single 
> try/catch block: close the underlying stream and delete the temporary file. 
> Closing the {{OutputStream}} can throw an {{IOException}}, so we may miss 
> deleting the temporary file.






[jira] [Assigned] (SPARK-33384) Delete temporary file when cancelling writing to final path even underlying stream throwing error

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33384:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Delete temporary file when cancelling writing to final path even underlying 
> stream throwing error
> -
>
> Key: SPARK-33384
> URL: https://issues.apache.org/jira/browse/SPARK-33384
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Minor
>
> In {{RenameBasedFSDataOutputStream.cancel}}, we do two things in a single 
> try/catch block: close the underlying stream and delete the temporary file. 
> Closing the {{OutputStream}} can throw an {{IOException}}, so we may miss 
> deleting the temporary file.






[jira] [Created] (SPARK-33384) Delete temporary file when cancelling writing to final path even underlying stream throwing error

2020-11-08 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-33384:
---

 Summary: Delete temporary file when cancelling writing to final 
path even underlying stream throwing error
 Key: SPARK-33384
 URL: https://issues.apache.org/jira/browse/SPARK-33384
 Project: Spark
  Issue Type: Improvement
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


In {{RenameBasedFSDataOutputStream.cancel}}, we do two things in a single 
try/catch block: close the underlying stream and delete the temporary file. 
Closing the {{OutputStream}} can throw an {{IOException}}, so we may miss 
deleting the temporary file.
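
A minimal sketch of the intended fix (illustrative code, not the actual Spark implementation): perform the deletion in a finally block so it runs even when close() throws.

{code:scala}
import java.io.Closeable
import java.nio.file.{Files, Path}

// Illustrative only: the temporary file is removed even if close() throws IOException.
def cancelWrite(stream: Closeable, tempFile: Path): Unit = {
  try stream.close()
  finally Files.deleteIfExists(tempFile)
}
{code}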






[jira] [Assigned] (SPARK-33141) capture SQL configs when creating permanent views

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33141:


Assignee: (was: Apache Spark)

> capture SQL configs when creating permanent views
> -
>
> Key: SPARK-33141
> URL: https://issues.apache.org/jira/browse/SPARK-33141
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> TODO






[jira] [Commented] (SPARK-33141) capture SQL configs when creating permanent views

2020-11-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228069#comment-17228069
 ] 

Apache Spark commented on SPARK-33141:
--

User 'luluorta' has created a pull request for this issue:
https://github.com/apache/spark/pull/30289

> capture SQL configs when creating permanent views
> -
>
> Key: SPARK-33141
> URL: https://issues.apache.org/jira/browse/SPARK-33141
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> TODO






[jira] [Assigned] (SPARK-33141) capture SQL configs when creating permanent views

2020-11-08 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-33141:


Assignee: Apache Spark

> capture SQL configs when creating permanent views
> -
>
> Key: SPARK-33141
> URL: https://issues.apache.org/jira/browse/SPARK-33141
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Apache Spark
>Priority: Major
>
> TODO






[jira] [Commented] (SPARK-33141) capture SQL configs when creating permanent views

2020-11-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228030#comment-17228030
 ] 

Apache Spark commented on SPARK-33141:
--

User 'luluorta' has created a pull request for this issue:
https://github.com/apache/spark/pull/30289

> capture SQL configs when creating permanent views
> -
>
> Key: SPARK-33141
> URL: https://issues.apache.org/jira/browse/SPARK-33141
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> TODO






[jira] [Updated] (SPARK-33383) Improve performance of Column.isin Expression

2020-11-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-33383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Wollschläger updated SPARK-33383:
---
Description: 
While asking [a question on 
Stackoverflow|https://stackoverflow.com/questions/64683189/usage-of-broadcast-variables-when-using-only-spark-sql-api]
 and running some local tests, I came across a performance bottleneck when 
using _Column.isin_ in a _where_ condition.


I have a set of allowed values (a "whitelist") that easily fits in memory 
(about 10k values). I thought simply using the 
_Column.isin_ expression in the SQL API should be the way to go. I assumed it 
would be runtime-equivalent to
{code}
df.filter(row => allowedValues.contains(row.getInt(0)))
{code}


However, when running a few tests locally, I realized that using _Column.isin_ 
is actually about 10 times slower than an _rdd.filter_ or a broadcast inner join.

Shouldn't {code}df.where(col("colname").isin(allowedValues)){code} perform 
(SQL-API overhead aside) as well as {code}df.filter(row => 
allowedValues.contains(row.getInt(0))){code} ?

I used the following dummy code for my local tests:

{code:scala}
package example

import org.apache.spark.sql.functions.{broadcast, col, count}
import org.apache.spark.sql.{DataFrame, SparkSession}

import scala.util.Random

object Test {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Name")
      .master("local[*]")
      .config("spark.driver.host", "localhost")
      .config("spark.ui.enabled", "false")
      .getOrCreate()

    import spark.implicits._

    val _10Million = 1000
    val random = new Random(1048394789305L)

    val values = Seq.fill(_10Million)(random.nextInt())
    val df = values.toDF("value")
    val allowedValues = getRandomElements(values, random, 1)

    println("Starting ...")
    runWithInCollection(spark, df, allowedValues)
    println(" In Collection")
    runWithBroadcastDF(spark, df, allowedValues)
    println(" Broadcast DF")
    runWithBroadcastVariable(spark, df, allowedValues)
    println(" Broadcast Variable")
  }

  def getRandomElements[A](seq: Seq[A], random: Random, size: Int): Set[A] = {
    val builder = Set.newBuilder[A]

    for (i <- 0 until size) {
      builder += getRandomElement(seq, random)
    }

    builder.result()
  }

  def getRandomElement[A](seq: Seq[A], random: Random): A = {
    seq(random.nextInt(seq.length))
  }

  // I expected this one to be almost equivalent to the one with a broadcast-variable,
  // but it's actually about 10 times slower
  def runWithInCollection(spark: SparkSession, df: DataFrame, allowedValues: Set[Int]): Unit = {
    spark.time {
      df.where(col("value").isInCollection(allowedValues)).runTestAggregation()
    }
  }

  // A bit slower than the one with a broadcast variable
  def runWithBroadcastDF(spark: SparkSession, df: DataFrame, allowedValues: Set[Int]): Unit = {
    import spark.implicits._

    val allowedValuesDF = allowedValues.toSeq.toDF("allowedValue")

    spark.time {
      df.join(broadcast(allowedValuesDF), col("value") === col("allowedValue")).runTestAggregation()
    }
  }

  // This is actually the fastest one
  def runWithBroadcastVariable(spark: SparkSession, df: DataFrame, allowedValues: Set[Int]): Unit = {
    val allowedValuesBroadcast = spark.sparkContext.broadcast(allowedValues)

    spark.time {
      df.filter(row => allowedValuesBroadcast.value.contains(row.getInt(0))).runTestAggregation()
    }
  }

  implicit class TestRunner(val df: DataFrame) {

    def runTestAggregation(): Unit = {
      df.agg(count("value")).show()
    }
  }
}
{code}

  was:
When I asked [a question on 
Stackoverflow|https://stackoverflow.com/questions/64683189/usage-of-broadcast-variables-when-using-only-spark-sql-api]
 and running some local tests, I came across a performance bottleneck when 
using the _where_-Condition _Column.isin_.


I have a set of allowed-values ("whitelist") with a size that's handleable 
in-memory really good (about 10k values). I thought simply using the 
_Column.isin_ Expression in the SQL API should be the way to go. I assumed it 
would be runtime equivalent to
{code}
df.filter(row => allowedValues.contains(row.getInt(0)))
{code}


however, when running a few tests locally, I realized that using _Column.isin_ 
is actually about 10 times slower than a _rdd.filter_ or a broadcast-inner-join.

Shouldn't {code}df.where(col("colname").isin(allowedValues)){code} perform 
(SQL-API overhead aside) as good as {code}df.filter(row => 
allowedValues.contains(row.getInt(0))){code} ?

I used the following dummy code for my local tests:

{code:scala}
package example


[jira] [Updated] (SPARK-33383) Improve performance of Column.isin Expression

2020-11-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-33383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Wollschläger updated SPARK-33383:
---
Description: 
While asking [a question on 
Stackoverflow|https://stackoverflow.com/questions/64683189/usage-of-broadcast-variables-when-using-only-spark-sql-api]
 and running some local tests, I came across a performance bottleneck when 
using _Column.isin_ in a _where_ condition.


I have a set of allowed values (a "whitelist") that easily fits in memory 
(about 10k values). I thought simply using the 
_Column.isin_ expression in the SQL API should be the way to go. I assumed it 
would be runtime-equivalent to
{code}
df.filter(row => allowedValues.contains(row.getInt(0)))
{code}


However, when running a few tests locally, I realized that using _Column.isin_ 
is actually about 10 times slower than an _rdd.filter_ or a broadcast inner join.

Shouldn't {code}df.where(col("colname").isin(allowedValues)){code} perform 
(SQL-API overhead aside) as well as {code}df.filter(row => 
allowedValues.contains(row.getInt(0))){code} ?

I used the following dummy code for my local tests:

{code:scala}
package example

import org.apache.spark.sql.functions.{broadcast, col, count}
import org.apache.spark.sql.{DataFrame, SparkSession}

import scala.util.Random

object Test {

def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("Name")
.master("local[*]")
.config("spark.driver.host", "localhost")
.config("spark.ui.enabled", "false")
.getOrCreate()

import spark.implicits._

val _10Million = 1000
val random = new Random(1048394789305L)

val values = Seq.fill(_10Million)(random.nextInt())
val df = Seq.fill(_10Million)(random.nextInt()).toDF("value")
val allowedValues = getRandomElements(values, random, 1)

println("Starting ...")
runWithInCollection(spark, df, allowedValues)
println(" In Collection")
runWithBroadcastDF(spark, df, allowedValues)
println(" Broadcast DF")
runWithBroadcastVariable(spark, df, allowedValues)
println(" Broadcast Variable")
}

def getRandomElements[A](seq: Seq[A], random: Random, size: Int): Set[A] = {
val builder = Set.newBuilder[A]

for (i <- 0 until size) {
builder += getRandomElement(seq, random)
}

builder.result()
}

def getRandomElement[A](seq: Seq[A], random: Random): A = {
seq(random.nextInt(seq.length))
}

// I expected this one to be almost equivalent to the one with a broadcast-variable, but it's actually about 10 times slower
def runWithInCollection(spark: SparkSession, df: DataFrame, allowedValues: 
Set[Int]): Unit = {
spark.time {

df.where(col("value").isInCollection(allowedValues)).runTestAggregation()
}
}

// A bit slower than the one with a broadcast variable
def runWithBroadcastDF(spark: SparkSession, df: DataFrame, allowedValues: 
Set[Int]): Unit = {
import spark.implicits._

val allowedValuesDF = allowedValues.toSeq.toDF("allowedValue")

spark.time {
df.join(broadcast(allowedValuesDF), col("value") === 
col("allowedValue")).runTestAggregation()
}
}

// This is actually the fastest one
def runWithBroadcastVariable(spark: SparkSession, df: DataFrame, 
allowedValues: Set[Int]): Unit = {
val allowedValuesBroadcast = spark.sparkContext.broadcast(allowedValues)

spark.time {
df.filter(row => 
allowedValuesBroadcast.value.contains(row.getInt(0))).runTestAggregation()
}
}

implicit class TestRunner(val df: DataFrame) {

def runTestAggregation(): Unit = {
df.agg(count("value")).show()
}
}
}
{code}

  was:
When I asked [a question on Stackoverflow|https://stackoverflow.com/questions/64683189/usage-of-broadcast-variables-when-using-only-spark-sql-api] and ran some local tests, I came across a performance bottleneck when using the `where` condition `Column.isin`.


I have a set of allowed values (a "whitelist") whose size is easily manageable in memory (about 10k values). I thought simply using the `Column.isin` expression in the SQL API should be the way to go. I assumed it would be equivalent at runtime to
```scala
df.filter(row => allowedValues.contains(row.getInt(0)))
```


{noformat}
fdfsf
{noformat}


However, when running a few tests locally, I realized that using `Column.isin` is actually about 10 times slower than an `rdd.filter` or a broadcast inner join.

Shouldn't `df.where(col("colname").isin(allowedValues))` perform (SQL-API overhead aside) as well as `df.filter(row => allowedValues.contains(row.getInt(0)))`?

{code:scala}

[jira] [Updated] (SPARK-33383) Improve performance of Column.isin Expression

2020-11-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-33383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Wollschläger updated SPARK-33383:
---
Description: 
When I asked [a question on Stackoverflow|https://stackoverflow.com/questions/64683189/usage-of-broadcast-variables-when-using-only-spark-sql-api] and ran some local tests, I came across a performance bottleneck when using the _where_ condition _Column.isin_.


I have a set of allowed values (a "whitelist") whose size is easily manageable in memory (about 10k values). I thought simply using the _Column.isin_ expression in the SQL API should be the way to go. I assumed it would be equivalent at runtime to
{code}
df.filter(row => allowedValues.contains(row.getInt(0)))
{code}


However, when running a few tests locally, I realized that using _Column.isin_ is actually about 10 times slower than an _rdd.filter_ or a broadcast inner join.

Shouldn't {code}df.where(col("colname").isin(allowedValues)){code} perform (SQL-API overhead aside) as well as {code}df.filter(row => allowedValues.contains(row.getInt(0))){code}?

I used the following dummy code for my local tests:

{code:scala}
package example

import org.apache.spark.sql.functions.{broadcast, col, count}
import org.apache.spark.sql.{DataFrame, SparkSession}

import scala.util.Random

object Test {

def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("Name")
.master("local[*]")
.config("spark.driver.host", "localhost")
.config("spark.ui.enabled", "false")
.getOrCreate()

import spark.implicits._

val _10Million = 1000
val random = new Random(1048394789305L)

val values = Seq.fill(_10Million)(random.nextInt())
val df = Seq.fill(_10Million)(random.nextInt()).toDF("value")
val allowedValues = getRandomElements(values, random, 1)

println("Starting ...")
runWithInCollection(spark, df, allowedValues)
println(" In Collection")
runWithBroadcastDF(spark, df, allowedValues)
println(" Broadcast DF")
runWithBroadcastVariable(spark, df, allowedValues)
println(" Broadcast Variable")
}

def getRandomElements[A](seq: Seq[A], random: Random, size: Int): Set[A] = {
val builder = Set.newBuilder[A]

for (i <- 0 until size) {
builder += getRandomElement(seq, random)
}

builder.result()
}

def getRandomElement[A](seq: Seq[A], random: Random): A = {
seq(random.nextInt(seq.length))
}

// I expected this one to be almost equivalent to the one with a broadcast-variable, but it's actually about 10 times slower
def runWithInCollection(spark: SparkSession, df: DataFrame, allowedValues: 
Set[Int]): Unit = {
spark.time {

df.where(col("value").isInCollection(allowedValues)).runTestAggregation()
}
}

// A bit slower than the one with a broadcast variable
def runWithBroadcastDF(spark: SparkSession, df: DataFrame, allowedValues: 
Set[Int]): Unit = {
import spark.implicits._

val allowedValuesDF = allowedValues.toSeq.toDF("allowedValue")

spark.time {
df.join(broadcast(allowedValuesDF), col("value") === 
col("allowedValue")).runTestAggregation()
}
}

// This is actually the fastest one
def runWithBroadcastVariable(spark: SparkSession, df: DataFrame, 
allowedValues: Set[Int]): Unit = {
val allowedValuesBroadcast = spark.sparkContext.broadcast(allowedValues)

spark.time {
df.filter(row => 
allowedValuesBroadcast.value.contains(row.getInt(0))).runTestAggregation()
}
}

implicit class TestRunner(val df: DataFrame) {

def runTestAggregation(): Unit = {
df.agg(count("value")).show()
}
}
}
{code}

  was:
When I asked [a question on Stackoverflow|https://stackoverflow.com/questions/64683189/usage-of-broadcast-variables-when-using-only-spark-sql-api] and ran some local tests, I came across a performance bottleneck when using the _where_ condition _Column.isin_.


I have a set of allowed values (a "whitelist") whose size is easily manageable in memory (about 10k values). I thought simply using the _Column.isin_ expression in the SQL API should be the way to go. I assumed it would be equivalent at runtime to
{code}
df.filter(row => allowedValues.contains(row.getInt(0)))
{code}


However, when running a few tests locally, I realized that using _Column.isin_ is actually about 10 times slower than an _rdd.filter_ or a broadcast inner join.

Shouldn't {code}df.where(col("colname").isin(allowedValues)){code} perform (SQL-API overhead aside) as well as {code}df.filter(row => allowedValues.contains(row.getInt(0))){code}?

I used the following dummy code for my local 

[jira] [Updated] (SPARK-33383) Improve performance of Column.isin Expression

2020-11-08 Thread Jira


 [ 
https://issues.apache.org/jira/browse/SPARK-33383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Wollschläger updated SPARK-33383:
---
Description: 
When I asked [a question on Stackoverflow|https://stackoverflow.com/questions/64683189/usage-of-broadcast-variables-when-using-only-spark-sql-api] and ran some local tests, I came across a performance bottleneck when using the `where` condition `Column.isin`.


I have a set of allowed values (a "whitelist") whose size is easily manageable in memory (about 10k values). I thought simply using the `Column.isin` expression in the SQL API should be the way to go. I assumed it would be equivalent at runtime to
```scala
df.filter(row => allowedValues.contains(row.getInt(0)))
```


{noformat}
fdfsf
{noformat}


However, when running a few tests locally, I realized that using `Column.isin` is actually about 10 times slower than an `rdd.filter` or a broadcast inner join.

Shouldn't `df.where(col("colname").isin(allowedValues))` perform (SQL-API overhead aside) as well as `df.filter(row => allowedValues.contains(row.getInt(0)))`?

{code:scala}
package example

import org.apache.spark.sql.functions.{broadcast, col, count}
import org.apache.spark.sql.{DataFrame, SparkSession}

import scala.util.Random

object Test {

def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("Name")
.master("local[*]")
.config("spark.driver.host", "localhost")
.config("spark.ui.enabled", "false")
.getOrCreate()

import spark.implicits._

val _10Million = 1000
val random = new Random(1048394789305L)

val values = Seq.fill(_10Million)(random.nextInt())
val df = Seq.fill(_10Million)(random.nextInt()).toDF("value")
val allowedValues = getRandomElements(values, random, 1)

println("Starting ...")
runWithInCollection(spark, df, allowedValues)
println(" In Collection")
runWithBroadcastDF(spark, df, allowedValues)
println(" Broadcast DF")
runWithBroadcastVariable(spark, df, allowedValues)
println(" Broadcast Variable")
}

def getRandomElements[A](seq: Seq[A], random: Random, size: Int): Set[A] = {
val builder = Set.newBuilder[A]

for (i <- 0 until size) {
builder += getRandomElement(seq, random)
}

builder.result()
}

def getRandomElement[A](seq: Seq[A], random: Random): A = {
seq(random.nextInt(seq.length))
}

// I expected this one to be almost equivalent to the one with a broadcast-variable, but it's actually about 10 times slower
def runWithInCollection(spark: SparkSession, df: DataFrame, allowedValues: 
Set[Int]): Unit = {
spark.time {

df.where(col("value").isInCollection(allowedValues)).runTestAggregation()
}
}

// A bit slower than the one with a broadcast variable
def runWithBroadcastDF(spark: SparkSession, df: DataFrame, allowedValues: 
Set[Int]): Unit = {
import spark.implicits._

val allowedValuesDF = allowedValues.toSeq.toDF("allowedValue")

spark.time {
df.join(broadcast(allowedValuesDF), col("value") === 
col("allowedValue")).runTestAggregation()
}
}

// This is actually the fastest one
def runWithBroadcastVariable(spark: SparkSession, df: DataFrame, 
allowedValues: Set[Int]): Unit = {
val allowedValuesBroadcast = spark.sparkContext.broadcast(allowedValues)

spark.time {
df.filter(row => 
allowedValuesBroadcast.value.contains(row.getInt(0))).runTestAggregation()
}
}

implicit class TestRunner(val df: DataFrame) {

def runTestAggregation(): Unit = {
df.agg(count("value")).show()
}
}
}
{code}

  was:
When I asked [a question on Stackoverflow|https://stackoverflow.com/questions/64683189/usage-of-broadcast-variables-when-using-only-spark-sql-api] and ran some local tests, I came across a performance bottleneck when using the `where` condition `Column.isin`.

I have a set of allowed values (a "whitelist") whose size is easily manageable in memory (about 10k values). I thought simply using the `Column.isin` expression in the SQL API should be the way to go. I assumed it would be equivalent at runtime to
```scala
df.filter(row => allowedValues.contains(row.getInt(0)))
```

However, when running a few tests locally, I realized that using `Column.isin` is actually about 10 times slower than an `rdd.filter` or a broadcast inner join.

Shouldn't `df.where(col("colname").isin(allowedValues))` perform (SQL-API overhead aside) as well as `df.filter(row => allowedValues.contains(row.getInt(0)))`?

```scala
package example

import org.apache.spark.sql.functions.{broadcast, col, 

[jira] [Created] (SPARK-33383) Improve performance of Column.isin Expression

2020-11-08 Thread Jira
Felix Wollschläger created SPARK-33383:
--

 Summary: Improve performance of Column.isin Expression
 Key: SPARK-33383
 URL: https://issues.apache.org/jira/browse/SPARK-33383
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.1, 2.4.4
 Environment: macOS
Spark(-SQL) 2.4.4 and 3.0.1
Scala 2.12.10
Reporter: Felix Wollschläger


When I asked [a question on Stackoverflow|https://stackoverflow.com/questions/64683189/usage-of-broadcast-variables-when-using-only-spark-sql-api] and ran some local tests, I came across a performance bottleneck when using the `where` condition `Column.isin`.

I have a set of allowed values (a "whitelist") whose size is easily manageable in memory (about 10k values). I thought simply using the `Column.isin` expression in the SQL API should be the way to go. I assumed it would be equivalent at runtime to
```scala
df.filter(row => allowedValues.contains(row.getInt(0)))
```

However, when running a few tests locally, I realized that using `Column.isin` is actually about 10 times slower than an `rdd.filter` or a broadcast inner join.

Shouldn't `df.where(col("colname").isin(allowedValues))` perform (SQL-API overhead aside) as well as `df.filter(row => allowedValues.contains(row.getInt(0)))`?

```scala
package example

import org.apache.spark.sql.functions.{broadcast, col, count}
import org.apache.spark.sql.{DataFrame, SparkSession}

import scala.util.Random

object Test {

def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.appName("Name")
.master("local[*]")
.config("spark.driver.host", "localhost")
.config("spark.ui.enabled", "false")
.getOrCreate()

import spark.implicits._

val _10Million = 1000
val random = new Random(1048394789305L)

val values = Seq.fill(_10Million)(random.nextInt())
val df = Seq.fill(_10Million)(random.nextInt()).toDF("value")
val allowedValues = getRandomElements(values, random, 1)

println("Starting ...")
runWithInCollection(spark, df, allowedValues)
println(" In Collection")
runWithBroadcastDF(spark, df, allowedValues)
println(" Broadcast DF")
runWithBroadcastVariable(spark, df, allowedValues)
println(" Broadcast Variable")
}

def getRandomElements[A](seq: Seq[A], random: Random, size: Int): Set[A] = {
val builder = Set.newBuilder[A]

for (i <- 0 until size) {
builder += getRandomElement(seq, random)
}

builder.result()
}

def getRandomElement[A](seq: Seq[A], random: Random): A = {
seq(random.nextInt(seq.length))
}

// I expected this one to be almost equivalent to the one with a broadcast-variable, but it's actually about 10 times slower
def runWithInCollection(spark: SparkSession, df: DataFrame, allowedValues: 
Set[Int]): Unit = {
spark.time {

df.where(col("value").isInCollection(allowedValues)).runTestAggregation()
}
}

// A bit slower than the one with a broadcast variable
def runWithBroadcastDF(spark: SparkSession, df: DataFrame, allowedValues: 
Set[Int]): Unit = {
import spark.implicits._

val allowedValuesDF = allowedValues.toSeq.toDF("allowedValue")

spark.time {
df.join(broadcast(allowedValuesDF), col("value") === 
col("allowedValue")).runTestAggregation()
}
}

// This is actually the fastest one
def runWithBroadcastVariable(spark: SparkSession, df: DataFrame, 
allowedValues: Set[Int]): Unit = {
val allowedValuesBroadcast = spark.sparkContext.broadcast(allowedValues)

spark.time {
df.filter(row => 
allowedValuesBroadcast.value.contains(row.getInt(0))).runTestAggregation()
}
}

implicit class TestRunner(val df: DataFrame) {

def runTestAggregation(): Unit = {
df.agg(count("value")).show()
}
}
}
```
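
For comparison, here is a minimal sketch of a typed-Dataset variant of the broadcast-variable filter. It is not part of the original test program; it reuses the same seed and assumes the DataFrame has a single integer column named "value", as in the program above, so that the hash-set lookup can run without going through Row.getInt.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count

import scala.util.Random

// Sketch: typed Dataset variant of the broadcast-variable filter above.
object TypedBroadcastFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("TypedBroadcastFilter")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val random = new Random(1048394789305L)
    val values = Seq.fill(1000000)(random.nextInt())
    val allowed = values.take(10000).toSet
    val df = values.toDF("value")

    val allowedBc = spark.sparkContext.broadcast(allowed)

    spark.time {
      df.as[Int]                                   // single int column, so as[Int] is enough
        .filter(v => allowedBc.value.contains(v))  // typed filter, no Row access
        .agg(count("value"))
        .show()
    }

    spark.stop()
  }
}
{code}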



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30527) Add IsNotNull filter when use In, InSet and InSubQuery

2020-11-08 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-30527.
-
Resolution: Invalid

> Add IsNotNull filter when use In, InSet and InSubQuery
> --
>
> Key: SPARK-30527
> URL: https://issues.apache.org/jira/browse/SPARK-30527
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: ulysses you
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32860) Encoders::bean doc incorrectly states maps are not supported

2020-11-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32860.
--
Fix Version/s: 3.1.0
   3.0.2
   Resolution: Fixed

Fixed in https://github.com/apache/spark/pull/30274

> Encoders::bean doc incorrectly states maps are not supported
> 
>
> Key: SPARK-32860
> URL: https://issues.apache.org/jira/browse/SPARK-32860
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.1, 3.1.0
>Reporter: Dan Ziemba
>Assignee: Dan Ziemba
>Priority: Trivial
>  Labels: starter
> Fix For: 3.0.2, 3.1.0
>
>
> The documentation for the bean method in the Encoders class currently states:
> {quote}collection types: only array and java.util.List currently, map support 
> is in progress
> {quote}
> But map support appears to work properly and has been available since 2.1.0 
> according to SPARK-16706.  Documentation should be updated to match what is / 
> is not actually supported (Set, Queue, etc?).
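
A minimal sketch of what working map support looks like with Encoders.bean is shown below; the Record bean and its field names are made up for illustration and are not taken from the ticket.

{code:scala}
import java.util.{Collections, HashMap => JHashMap, Map => JMap}

import scala.beans.BeanProperty

import org.apache.spark.sql.{Encoders, SparkSession}

// Hypothetical Java-style bean with a map-typed property.
class Record {
  @BeanProperty var id: Int = 0
  @BeanProperty var attributes: JMap[String, String] = new JHashMap[String, String]()
}

object BeanMapEncoderCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("BeanMapEncoderCheck")
      .master("local[*]")
      .getOrCreate()

    val r = new Record
    r.setId(1)
    r.getAttributes.put("colour", "blue")

    // Encoders.bean should expose the attributes property as a map<string,string> column.
    val ds = spark.createDataset(Collections.singletonList(r))(Encoders.bean(classOf[Record]))
    ds.printSchema()
    ds.show(truncate = false)

    spark.stop()
  }
}
{code}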



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32860) Encoders::bean doc incorrectly states maps are not supported

2020-11-08 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32860:


Assignee: Dan Ziemba

> Encoders::bean doc incorrectly states maps are not supported
> 
>
> Key: SPARK-32860
> URL: https://issues.apache.org/jira/browse/SPARK-32860
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.4.6, 3.0.1, 3.1.0
>Reporter: Dan Ziemba
>Assignee: Dan Ziemba
>Priority: Trivial
>  Labels: starter
>
> The documentation for the bean method in the Encoders class currently states:
> {quote}collection types: only array and java.util.List currently, map support 
> is in progress
> {quote}
> But map support appears to work properly and has been available since 2.1.0 
> according to SPARK-16706.  Documentation should be updated to match what is / 
> is not actually supported (Set, Queue, etc?).



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33371) Support Python 3.9+ in PySpark

2020-11-08 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17227971#comment-17227971
 ] 

Apache Spark commented on SPARK-33371:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/30288

> Support Python 3.9+ in PySpark
> --
>
> Key: SPARK-33371
> URL: https://issues.apache.org/jira/browse/SPARK-33371
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Python 3.9 works with PySpark. We should fix setup.py.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org