[jira] [Assigned] (SPARK-34916) Reduce tree traversals in transform/resolve function families

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34916:


Assignee: (was: Apache Spark)

> Reduce tree traversals in transform/resolve function families
> -
>
> Key: SPARK-34916
> URL: https://issues.apache.org/jira/browse/SPARK-34916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Yingyi Bu
>Priority: Major
>
> Transform/resolve functions are called ~280k times on average per TPC-DS query, 
> which is far more than necessary. We can reduce those calls with early-exit 
> information and conditions.
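A rough, self-contained sketch of the early-exit idea (names and structure are 
invented for illustration; this is not Spark's actual TreeNode API): each node 
summarizes which patterns occur in its subtree, and a transform skips whole 
subtrees that cannot contain the pattern a rule rewrites.

{code}
// Hypothetical illustration only; not Spark code.
case class Node(name: String, children: List[Node]) {
  // summary of node names present anywhere in this subtree
  lazy val patterns: Set[String] = children.flatMap(_.patterns).toSet + name
}

def transformDownWithPruning(node: Node, pattern: String)(rule: Node => Node): Node = {
  if (!node.patterns.contains(pattern)) {
    node // early exit: the whole subtree is skipped
  } else {
    val rewritten = rule(node)
    rewritten.copy(children =
      rewritten.children.map(child => transformDownWithPruning(child, pattern)(rule)))
  }
}

// Only subtrees that actually contain an "Add" node are visited by the rule.
val plan = Node("Project", List(Node("Filter", List(Node("Add", Nil))), Node("Scan", Nil)))
val optimized = transformDownWithPruning(plan, "Add") { n =>
  if (n.name == "Add") n.copy(name = "OptimizedAdd") else n
}
{code}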






[jira] [Assigned] (SPARK-34916) Reduce tree traversals in transform/resolve function families

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34916:


Assignee: Apache Spark

> Reduce tree traversals in transform/resolve function families
> -
>
> Key: SPARK-34916
> URL: https://issues.apache.org/jira/browse/SPARK-34916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Yingyi Bu
>Assignee: Apache Spark
>Priority: Major
>
> Transform/resolve functions are called ~280k times on average per TPC-DS query, 
> which is far more than necessary. We can reduce those calls with early-exit 
> information and conditions.






[jira] [Commented] (SPARK-34916) Reduce tree traversals in transform/resolve function families

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315290#comment-17315290
 ] 

Apache Spark commented on SPARK-34916:
--

User 'sigmod' has created a pull request for this issue:
https://github.com/apache/spark/pull/32060

> Reduce tree traversals in transform/resolve function families
> -
>
> Key: SPARK-34916
> URL: https://issues.apache.org/jira/browse/SPARK-34916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Yingyi Bu
>Priority: Major
>
> Transform/resolve functions are called ~280k times on average per TPC-DS query, 
> which is far more than necessary. We can reduce those calls with early-exit 
> information and conditions.






[jira] [Commented] (SPARK-34963) Nested column pruning fails to extract case-insensitive struct field from array

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315289#comment-17315289
 ] 

Apache Spark commented on SPARK-34963:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/32059

> Nested column pruning fails to extract case-insensitive struct field from 
> array
> ---
>
> Key: SPARK-34963
> URL: https://issues.apache.org/jira/browse/SPARK-34963
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.2, 3.2.0, 3.1.1
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Under case-insensitive mode, the nested column pruning rule cannot correctly 
> push down the extractor of a struct field from an array of structs, e.g.,
> {code}
> val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
> {code}






[jira] [Assigned] (SPARK-34963) Nested column pruning fails to extract case-insensitive struct field from array

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34963:


Assignee: L. C. Hsieh  (was: Apache Spark)

> Nested column pruning fails to extract case-insensitive struct field from 
> array
> ---
>
> Key: SPARK-34963
> URL: https://issues.apache.org/jira/browse/SPARK-34963
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.2, 3.2.0, 3.1.1
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Under case-insensitive mode, the nested column pruning rule cannot correctly 
> push down the extractor of a struct field from an array of structs, e.g.,
> {code}
> val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
> {code}






[jira] [Assigned] (SPARK-34963) Nested column pruning fails to extract case-insensitive struct field from array

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34963:


Assignee: Apache Spark  (was: L. C. Hsieh)

> Nested column pruning fails to extract case-insensitive struct field from 
> array
> ---
>
> Key: SPARK-34963
> URL: https://issues.apache.org/jira/browse/SPARK-34963
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.2, 3.2.0, 3.1.1
>Reporter: L. C. Hsieh
>Assignee: Apache Spark
>Priority: Major
>
> Under case-insensitive mode, the nested column pruning rule cannot correctly 
> push down the extractor of a struct field from an array of structs, e.g.,
> {code}
> val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
> {code}






[jira] [Commented] (SPARK-34963) Nested column pruning fails to extract case-insensitive struct field from array

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315288#comment-17315288
 ] 

Apache Spark commented on SPARK-34963:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/32059

> Nested column pruning fails to extract case-insensitive struct field from 
> array
> ---
>
> Key: SPARK-34963
> URL: https://issues.apache.org/jira/browse/SPARK-34963
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.7, 3.0.2, 3.2.0, 3.1.1
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
>
> Under case-insensitive mode, the nested column pruning rule cannot correctly 
> push down the extractor of a struct field from an array of structs, e.g.,
> {code}
> val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
> {code}






[jira] [Created] (SPARK-34963) Nested column pruning fails to extract case-insensitive struct field from array

2021-04-05 Thread L. C. Hsieh (Jira)
L. C. Hsieh created SPARK-34963:
---

 Summary: Nested column pruning fails to extract case-insensitive 
struct field from array
 Key: SPARK-34963
 URL: https://issues.apache.org/jira/browse/SPARK-34963
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1, 3.0.2, 2.4.7, 3.2.0
Reporter: L. C. Hsieh
Assignee: L. C. Hsieh


Under case-insensitive mode, the nested column pruning rule cannot correctly push 
down the extractor of a struct field from an array of structs, e.g.,

{code}
val query = spark.table("contacts").select("friends.First", "friends.MiDDle")
{code}
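A hedged setup sketch to reproduce the snippet above (the schema is an assumption 
inferred from the column names, not taken from the report):

{code}
// Assumed schema: friends is an array of structs with first/middle/last fields.
case class Name(first: String, middle: String, last: String)
case class Contact(id: Int, friends: Seq[Name])

import spark.implicits._
Seq(Contact(1, Seq(Name("Susan", "Z", "Smith")))).toDF()
  .write.mode("overwrite").saveAsTable("contacts")

spark.conf.set("spark.sql.caseSensitive", "false")
// Selecting with non-matching case should still allow nested column pruning.
spark.table("contacts").select("friends.First", "friends.MiDDle").explain()
{code}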






[jira] [Resolved] (SPARK-34890) Port/integrate Koalas main codes into PySpark

2021-04-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34890.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32036
[https://github.com/apache/spark/pull/32036]

> Port/integrate Koalas main codes into PySpark
> -
>
> Key: SPARK-34890
> URL: https://issues.apache.org/jira/browse/SPARK-34890
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
> Fix For: 3.2.0
>
>
> This JIRA aims to port the Koalas main code into PySpark.
> Tasks for tests are tracked in a separate JIRA.






[jira] [Assigned] (SPARK-34890) Port/integrate Koalas main codes into PySpark

2021-04-05 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34890:


Assignee: Haejoon Lee

> Port/integrate Koalas main codes into PySpark
> -
>
> Key: SPARK-34890
> URL: https://issues.apache.org/jira/browse/SPARK-34890
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Haejoon Lee
>Priority: Major
>
> This JIRA aims to port the Koalas main code into PySpark.
> Tasks for tests are tracked in a separate JIRA.






[jira] [Assigned] (SPARK-34962) Explicit representation of star in MergeIntoTable's Update and Insert action

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34962:


Assignee: Tathagata Das  (was: Apache Spark)

> Explicit representation of star in MergeIntoTable's Update and Insert action
> 
>
> Key: SPARK-34962
> URL: https://issues.apache.org/jira/browse/SPARK-34962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Currently, UpdateAction and InsertAction in the MergeIntoTable implicitly 
> represent `update set *` and `insert *` with empty assignments. That means 
> there is no way to differentiate between the representations of "update all 
> columns" and "update no columns". For SQL MERGE queries, this inability does 
> not matter because the SQL MERGE grammar that generated the MergeIntoTable 
> plan does not allow "update no columns". However, other ways of generating 
> the MergeIntoTable plan may not have that limitation, and may want to allow 
> specifying "update no columns".  For example, in the Delta Lake project we 
> provide a type-safe Scala API for Merge, where it is perfectly valid to 
> produce a Merge query with an update clause but no update assignments. 
> Currently, we cannot use MergeIntoTable to represent this plan, thus 
> complicating the generation and resolution of merge queries from the Scala API. 
> This should be fixed by having an explicit representation of * in the 
> UpdateAction and InsertAction.






[jira] [Assigned] (SPARK-34962) Explicit representation of star in MergeIntoTable's Update and Insert action

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34962:


Assignee: Apache Spark  (was: Tathagata Das)

> Explicit representation of star in MergeIntoTable's Update and Insert action
> 
>
> Key: SPARK-34962
> URL: https://issues.apache.org/jira/browse/SPARK-34962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Major
>
> Currently, UpdateAction and InsertAction in the MergeIntoTable implicitly 
> represent `update set *` and `insert *` with empty assignments. That means 
> there is no way to differentiate between the representations of "update all 
> columns" and "update no columns". For SQL MERGE queries, this inability does 
> not matter because the SQL MERGE grammar that generated the MergeIntoTable 
> plan does not allow "update no columns". However, other ways of generating 
> the MergeIntoTable plan may not have that limitation, and may want to allow 
> specifying "update no columns".  For example, in the Delta Lake project we 
> provide a type-safe Scala API for Merge, where it is perfectly valid to 
> produce a Merge query with an update clause but no update assignments. 
> Currently, we cannot use MergeIntoTable to represent this plan, thus 
> complicating the generation and resolution of merge queries from the Scala API. 
> This should be fixed by having an explicit representation of * in the 
> UpdateAction and InsertAction.






[jira] [Commented] (SPARK-34962) Explicit representation of star in MergeIntoTable's Update and Insert action

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315175#comment-17315175
 ] 

Apache Spark commented on SPARK-34962:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/32058

> Explicit representation of star in MergeIntoTable's Update and Insert action
> 
>
> Key: SPARK-34962
> URL: https://issues.apache.org/jira/browse/SPARK-34962
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
>
> Currently, UpdateAction and InsertAction in the MergeIntoTable implicitly 
> represent `update set *` and `insert *` with empty assignments. That means 
> there is no way to differentiate between the representations of "update all 
> columns" and "update no columns". For SQL MERGE queries, this inability does 
> not matter because the SQL MERGE grammar that generated the MergeIntoTable 
> plan does not allow "update no columns". However, other ways of generating 
> the MergeIntoTable plan may not have that limitation, and may want to allow 
> specifying "update no columns".  For example, in the Delta Lake project we 
> provide a type-safe Scala API for Merge, where it is perfectly valid to 
> produce a Merge query with an update clause but no update assignments. 
> Currently, we cannot use MergeIntoTable to represent this plan, thus 
> complicating the generation and resolution of merge queries from the Scala API. 
> This should be fixed by having an explicit representation of * in the 
> UpdateAction and InsertAction.






[jira] [Resolved] (SPARK-34935) CREATE TABLE LIKE should respect the reserved properties of tables

2021-04-05 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-34935.
--
Fix Version/s: 3.2.0
 Assignee: Kent Yao
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/32025

> CREATE TABLE LIKE should respect the reserved properties of tables
> --
>
> Key: SPARK-34935
> URL: https://issues.apache.org/jira/browse/SPARK-34935
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Minor
> Fix For: 3.2.0
>
>
> Reserved table properties are disallowed in all DDL commands except CREATE TABLE LIKE; it should respect this restriction as well.
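A hedged illustration of the intended behaviour (assuming "owner" is one of the 
reserved table property keys; table names are placeholders):

{code}
spark.sql("CREATE TABLE src (id INT) USING parquet")
// Reserved keys are already rejected for plain CREATE TABLE; after this change,
// CREATE TABLE LIKE should reject them too instead of silently accepting them.
spark.sql("CREATE TABLE dst LIKE src TBLPROPERTIES ('owner' = 'someone')")
{code}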






[jira] [Resolved] (SPARK-34932) ignore the groupBy expressions in GROUP BY ... GROUPING SETS

2021-04-05 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-34932.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/32022

> ignore the groupBy expressions in GROUP BY ... GROUPING SETS
> 
>
> Key: SPARK-34932
> URL: https://issues.apache.org/jira/browse/SPARK-34932
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.2.0
>
>
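A hedged example of the syntax in question (data and column names are invented). 
Per the change described in the title, the grouping should be determined by the 
GROUPING SETS clause alone, with the expressions listed after GROUP BY ignored:

{code}
spark.sql("""
  SELECT a, b, count(*)
  FROM VALUES (1, 10), (1, 20), (2, 30) AS t(a, b)
  GROUP BY a, b GROUPING SETS ((a), (b))
""").show()
{code}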







[jira] [Created] (SPARK-34962) Explicit representation of star in MergeIntoTable's Update and Insert action

2021-04-05 Thread Tathagata Das (Jira)
Tathagata Das created SPARK-34962:
-

 Summary: Explicit representation of star in MergeIntoTable's 
Update and Insert action
 Key: SPARK-34962
 URL: https://issues.apache.org/jira/browse/SPARK-34962
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.1
Reporter: Tathagata Das
Assignee: Tathagata Das


Currently, UpdateAction and InsertAction in the MergeIntoTable implicitly 
represent `update set *` and `insert *` with empty assignments. That means 
there is no way to differentiate between the representations of "update all 
columns" and "update no columns". For SQL MERGE queries, this inability does 
not matter because the SQL MERGE grammar that generated the MergeIntoTable plan 
does not allow "update no columns". However, other ways of generating the 
MergeIntoTable plan may not have that limitation, and may want to allow 
specifying "update no columns".  For example, in the Delta Lake project we 
provide a type-safe Scala API for Merge, where it is perfectly valid to produce 
a Merge query with an update clause but no update assignments. Currently, we 
cannot use MergeIntoTable to represent this plan, thus complicating the 
generation and resolution of merge queries from the Scala API. 

This should be fixed by having an explicit representation of * in the 
UpdateAction and InsertAction.
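A hedged sketch of the two MERGE shapes discussed above, run through spark.sql 
(table names are placeholders; the comments restate the current representation as 
described here, not the internals of any particular release):

{code}
spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET *     -- currently an UpdateAction with empty assignments
  WHEN NOT MATCHED THEN INSERT *     -- currently an InsertAction with empty assignments
""")
{code}

An "update no columns" clause, which the Delta Scala API can legitimately produce, 
would have the same empty-assignment representation today; an explicit star marker 
removes that ambiguity.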






[jira] [Assigned] (SPARK-34961) Migrate First function from DeclarativeAggregate to TypedImperativeAggregate to improve performance

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34961:


Assignee: Apache Spark

> Migrate First function from DeclarativeAggregate to TypedImperativeAggregate 
> to improve performance
> ---
>
> Key: SPARK-34961
> URL: https://issues.apache.org/jira/browse/SPARK-34961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Pablo Langa Blanco
>Assignee: Apache Spark
>Priority: Minor
>
> The main objective of this change is to improve performance in some cases.
> We have three possibilities when we plan an aggregation. In the first case, 
> with mutable primitive types, HashAggregate is used.
> When we are not using these types, we have two options: if the function 
> implements TypedImperativeAggregate, we use ObjectHashAggregate; otherwise, we 
> fall back to SortAggregate, which is less efficient.
> In this PR I propose to migrate the First function to implement 
> TypedImperativeAggregate to take advantage of this feature 
> (ObjectHashAggregateExec).
> This Jira is related to SPARK-34464
>  






[jira] [Commented] (SPARK-34961) Migrate First function from DeclarativeAggregate to TypedImperativeAggregate to improve performance

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315155#comment-17315155
 ] 

Apache Spark commented on SPARK-34961:
--

User 'planga82' has created a pull request for this issue:
https://github.com/apache/spark/pull/32057

> Migrate First function from DeclarativeAggregate to TypedImperativeAggregate 
> to improve performance
> ---
>
> Key: SPARK-34961
> URL: https://issues.apache.org/jira/browse/SPARK-34961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> The main objective of this change is to improve performance in some cases.
> We have three possibilities when we plan an aggregation. In the first case, 
> with mutable primitive types, HashAggregate is used.
> When we are not using these types, we have two options: if the function 
> implements TypedImperativeAggregate, we use ObjectHashAggregate; otherwise, we 
> fall back to SortAggregate, which is less efficient.
> In this PR I propose to migrate the First function to implement 
> TypedImperativeAggregate to take advantage of this feature 
> (ObjectHashAggregateExec).
> This Jira is related to SPARK-34464
>  






[jira] [Assigned] (SPARK-34961) Migrate First function from DeclarativeAggregate to TypedImperativeAggregate to improve performance

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34961:


Assignee: (was: Apache Spark)

> Migrate First function from DeclarativeAggregate to TypedImperativeAggregate 
> to improve performance
> ---
>
> Key: SPARK-34961
> URL: https://issues.apache.org/jira/browse/SPARK-34961
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Pablo Langa Blanco
>Priority: Minor
>
> The main objective of this change is to improve performance in some cases.
> We have three possibilities when we plan an aggregation. In the first case, 
> with mutable primitive types, HashAggregate is used.
> When we are not using these types, we have two options: if the function 
> implements TypedImperativeAggregate, we use ObjectHashAggregate; otherwise, we 
> fall back to SortAggregate, which is less efficient.
> In this PR I propose to migrate the First function to implement 
> TypedImperativeAggregate to take advantage of this feature 
> (ObjectHashAggregateExec).
> This Jira is related to SPARK-34464
>  






[jira] [Commented] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"

2021-04-05 Thread Pablo Langa Blanco (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315152#comment-17315152
 ] 

Pablo Langa Blanco commented on SPARK-34464:


I've opened SPARK-34961 because I think it's an improvement, not a bug, so I will 
raise a PR there.

> `first` function is sorting the dataset while sometimes it is used to get 
> "any value"
> -
>
> Key: SPARK-34464
> URL: https://issues.apache.org/jira/browse/SPARK-34464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Louis Fruleux
>Priority: Minor
>
> When one wants to groupBy and take any value (not necessarily the first), one 
> usually uses 
> [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485]
>  aggregation function.
> Unfortunately, this method uses a `SortAggregate` for some data types, which 
> is not always necessary and might impact performance. Is this the desired 
> behavior?
>  
>  
> {code:java}
> Current behavior:
>  val df = Seq((0, "value")).toDF("key", "value")
> df.groupBy("key").agg(first("value")).explain()
>  /*
>  == Physical Plan ==
>  SortAggregate(key=key#342, functions=first(value#343, false))
>  +- *(2) Sort key#342 ASC NULLS FIRST, false, 0
>     +- Exchange hashpartitioning(key#342, 200)
>        +- SortAggregate(key=key#342, functions=partial_first(value#343, 
> false))
>           +- *(1) Sort key#342 ASC NULLS FIRST, false, 0
>              +- LocalTableScan key#342, value#343
>  */
> {code}
>  
> My understanding of the source code does not allow me to fully understand why 
> this is the current behavior.
> The solution might be to implement a new aggregate function, but the code 
> would be highly similar to the existing one, and I don't fully understand why 
> this 
> [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45]
>  method falls back to SortAggregate.






[jira] [Resolved] (SPARK-34464) `first` function is sorting the dataset while sometimes it is used to get "any value"

2021-04-05 Thread Pablo Langa Blanco (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pablo Langa Blanco resolved SPARK-34464.

Resolution: Not A Bug

> `first` function is sorting the dataset while sometimes it is used to get 
> "any value"
> -
>
> Key: SPARK-34464
> URL: https://issues.apache.org/jira/browse/SPARK-34464
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Louis Fruleux
>Priority: Minor
>
> When one wants to groupBy and take any value (not necessarily the first), one 
> usually uses 
> [first|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L485]
>  aggregation function.
> Unfortunately, this method uses a `SortAggregate` for some data types, which 
> is not always necessary and might impact performance. Is this the desired 
> behavior?
>  
>  
> {code:java}
> Current behavior:
>  val df = Seq((0, "value")).toDF("key", "value")
> df.groupBy("key").agg(first("value")).explain()
>  /*
>  == Physical Plan ==
>  SortAggregate(key=key#342, functions=first(value#343, false))
>  +- *(2) Sort key#342 ASC NULLS FIRST, false, 0
>     +- Exchange hashpartitioning(key#342, 200)
>        +- SortAggregate(key=key#342, functions=partial_first(value#343, 
> false))
>           +- *(1) Sort key#342 ASC NULLS FIRST, false, 0
>              +- LocalTableScan key#342, value#343
>  */
> {code}
>  
> My understanding of the source code does not allow me to fully understand why 
> this is the current behavior.
> The solution might be to implement a new aggregate function, but the code 
> would be highly similar to the existing one, and I don't fully understand why 
> this 
> [createAggregate|https://github.com/apache/spark/blob/3a299aa6480ac22501512cd0310d31a441d7dfdc/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/AggUtils.scala#L45]
>  method falls back to SortAggregate.






[jira] [Created] (SPARK-34961) Migrate First function from DeclarativeAggregate to TypedImperativeAggregate to improve performance

2021-04-05 Thread Pablo Langa Blanco (Jira)
Pablo Langa Blanco created SPARK-34961:
--

 Summary: Migrate First function from DeclarativeAggregate to 
TypedImperativeAggregate to improve performance
 Key: SPARK-34961
 URL: https://issues.apache.org/jira/browse/SPARK-34961
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Pablo Langa Blanco


The main objective of this change is to improve performance in some cases.

We have three possibilities when we plan an aggregation. In the first case, 
with mutable primitive types, HashAggregate is used.

When we are not using these types, we have two options: if the function 
implements TypedImperativeAggregate, we use ObjectHashAggregate; otherwise, we 
fall back to SortAggregate, which is less efficient.

In this PR I propose to migrate the First function to implement 
TypedImperativeAggregate to take advantage of this feature (ObjectHashAggregateExec).

This Jira is related to SPARK-34464
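A hedged check of which physical aggregate gets planned for first() over a 
non-primitive value column (column names are made up; ObjectHashAggregateExec is 
only used when spark.sql.execution.useObjectHashAggregateExec is enabled, which 
is the default):

{code}
import org.apache.spark.sql.functions.first

val df = Seq((1, "a"), (1, "b"), (2, "c")).toDF("key", "value")
// Before the change, the String-typed value column forces a SortAggregate plan;
// with First as a TypedImperativeAggregate it can use ObjectHashAggregate.
df.groupBy("key").agg(first("value")).explain()
{code}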

 






[jira] [Assigned] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

2021-04-05 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan reassigned SPARK-34949:
---

Assignee: Sumeet

> Executor.reportHeartBeat reregisters blockManager even when Executor is 
> shutting down
> -
>
> Key: SPARK-34949
> URL: https://issues.apache.org/jira/browse/SPARK-34949
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.1.1
> Environment: Resource Manager: K8s
>Reporter: Sumeet
>Assignee: Sumeet
>Priority: Major
>  Labels: Executor, heartbeat
> Fix For: 3.2.0, 3.1.2
>
>
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
> so, when the executors were torn down due to 
> "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor 
> pods being removed from K8s; however, under the "Executors" tab in the Spark UI, 
> I could still see some executors listed as alive. 
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
>  also returned a value greater than 1. 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues RemoveExecutor on a 
> "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the 
> "listenerBus"
>  * "CoarseGrainedExecutorBackend" starts the executor shutdown
>  * "HeartbeatReceiver" picks the "SparkListenerExecutorRemoved" event and 
> removes the executor from "executorLastSeen"
>  * In the meantime, the executor reports a Heartbeat. Now "HeartbeatReceiver" 
> cannot find the "executorId" in "executorLastSeen" and hence responds with 
> "HeartbeatResponse(reregisterBlockManager = true)"
>  * The Executor now calls "env.blockManager.reregister()" and reregisters 
> itself thus creating inconsistency
>  
> *Proposed Solution:*
> The "reportHeartBeat" method is not aware of the fact that Executor is 
> shutting down, it should check "executorShutdown" before reregistering. 
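A toy, self-contained illustration of the race and the proposed guard (all names 
are invented; this is not Spark's Executor code):

{code}
import java.util.concurrent.atomic.AtomicBoolean

class ToyExecutor {
  private val shuttingDown = new AtomicBoolean(false)
  @volatile private var blockManagerRegistered = true

  def shutdown(): Unit = {
    shuttingDown.set(true)
    blockManagerRegistered = false
  }

  // Called when a heartbeat response asks the executor to re-register.
  def onReregisterRequest(): Unit = {
    // Proposed guard: do nothing if the executor is already shutting down,
    // so a late heartbeat cannot resurrect the block manager registration.
    if (!shuttingDown.get()) {
      blockManagerRegistered = true
    }
  }
}
{code}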






[jira] [Resolved] (SPARK-34949) Executor.reportHeartBeat reregisters blockManager even when Executor is shutting down

2021-04-05 Thread Mridul Muralidharan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan resolved SPARK-34949.
-
Fix Version/s: 3.1.2
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 32043
[https://github.com/apache/spark/pull/32043]

> Executor.reportHeartBeat reregisters blockManager even when Executor is 
> shutting down
> -
>
> Key: SPARK-34949
> URL: https://issues.apache.org/jira/browse/SPARK-34949
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.1.1
> Environment: Resource Manager: K8s
>Reporter: Sumeet
>Priority: Major
>  Labels: Executor, heartbeat
> Fix For: 3.2.0, 3.1.2
>
>
> *Problem:*
> I was testing Dynamic Allocation on K8s with about 300 executors. While doing 
> so, when the executors were torn down due to 
> "spark.dynamicAllocation.executorIdleTimeout", I noticed all the executor 
> pods being removed from K8s; however, under the "Executors" tab in the Spark UI, 
> I could still see some executors listed as alive. 
> [spark.sparkContext.statusTracker.getExecutorInfos.length|https://github.com/apache/spark/blob/65da9287bc5112564836a555cd2967fc6b05856f/core/src/main/scala/org/apache/spark/SparkStatusTracker.scala#L100]
>  also returned a value greater than 1. 
>  
> *Cause:*
>  * "CoarseGrainedSchedulerBackend" issues RemoveExecutor on a 
> "executorEndpoint" and publishes "SparkListenerExecutorRemoved" on the 
> "listenerBus"
>  * "CoarseGrainedExecutorBackend" starts the executor shutdown
>  * "HeartbeatReceiver" picks the "SparkListenerExecutorRemoved" event and 
> removes the executor from "executorLastSeen"
>  * In the meantime, the executor reports a Heartbeat. Now "HeartbeatReceiver" 
> cannot find the "executorId" in "executorLastSeen" and hence responds with 
> "HeartbeatResponse(reregisterBlockManager = true)"
>  * The Executor now calls "env.blockManager.reregister()" and reregisters 
> itself thus creating inconsistency
>  
> *Proposed Solution:*
> The "reportHeartBeat" method is not aware of the fact that Executor is 
> shutting down, it should check "executorShutdown" before reregistering. 






[jira] [Resolved] (SPARK-34959) Upgrade SBT to 1.5.0

2021-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34959.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32055
[https://github.com/apache/spark/pull/32055]

> Upgrade SBT to 1.5.0
> 
>
> Key: SPARK-34959
> URL: https://issues.apache.org/jira/browse/SPARK-34959
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.2.0
>
>
> This JIRA issue aims to upgrade SBT to 1.5.0 which has built-in Scala 3 
> support.
> https://github.com/sbt/sbt/releases/tag/v1.5.0






[jira] [Assigned] (SPARK-34959) Upgrade SBT to 1.5.0

2021-04-05 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34959:
-

Assignee: Dongjoon Hyun

> Upgrade SBT to 1.5.0
> 
>
> Key: SPARK-34959
> URL: https://issues.apache.org/jira/browse/SPARK-34959
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>
> This JIRA issue aims to upgrade SBT to 1.5.0 which has built-in Scala 3 
> support.
> https://github.com/sbt/sbt/releases/tag/v1.5.0






[jira] [Commented] (SPARK-34667) Support casting of year-month intervals to strings

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315108#comment-17315108
 ] 

Apache Spark commented on SPARK-34667:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32056

> Support casting of year-month intervals to strings
> --
>
> Key: SPARK-34667
> URL: https://issues.apache.org/jira/browse/SPARK-34667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Extend the Cast expression and support YearMonthIntervalType in casting to 
> StringType.
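A hedged sketch of the intended behaviour, using the ANSI year-month interval 
literal syntax (the exact output string format is an assumption):

{code}
// Expected to return a string rendering of the interval,
// e.g. something like "INTERVAL '1-2' YEAR TO MONTH".
spark.sql("SELECT CAST(INTERVAL '1-2' YEAR TO MONTH AS STRING)").show(truncate = false)
{code}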






[jira] [Assigned] (SPARK-34667) Support casting of year-month intervals to strings

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34667:


Assignee: (was: Apache Spark)

> Support casting of year-month intervals to strings
> --
>
> Key: SPARK-34667
> URL: https://issues.apache.org/jira/browse/SPARK-34667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Extend the Cast expression and support YearMonthIntervalType in casting to 
> StringType.






[jira] [Assigned] (SPARK-34667) Support casting of year-month intervals to strings

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34667:


Assignee: Apache Spark

> Support casting of year-month intervals to strings
> --
>
> Key: SPARK-34667
> URL: https://issues.apache.org/jira/browse/SPARK-34667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Assignee: Apache Spark
>Priority: Major
>
> Extend the Cast expression and support YearMonthIntervalType in casting to 
> StringType.






[jira] [Commented] (SPARK-34667) Support casting of year-month intervals to strings

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315107#comment-17315107
 ] 

Apache Spark commented on SPARK-34667:
--

User 'MaxGekk' has created a pull request for this issue:
https://github.com/apache/spark/pull/32056

> Support casting of year-month intervals to strings
> --
>
> Key: SPARK-34667
> URL: https://issues.apache.org/jira/browse/SPARK-34667
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Max Gekk
>Priority: Major
>
> Extend the Cast expression and support YearMonthIntervalType in casting to 
> StringType.






[jira] [Commented] (SPARK-34960) Aggregate (Min/Max/Count) push down for ORC

2021-04-05 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315070#comment-17315070
 ] 

Cheng Su commented on SPARK-34960:
--

Just FYI we will start sending out code after 
[https://github.com/apache/spark/pull/32049] is merged. cc [~huaxingao], thanks.

> Aggregate (Min/Max/Count) push down for ORC
> ---
>
> Key: SPARK-34960
> URL: https://issues.apache.org/jira/browse/SPARK-34960
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we 
> can also push down certain aggregations into ORC. ORC exposes column 
> statistics in interface `org.apache.orc.Reader` 
> ([https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118]
>  ), which Spark can utilize for aggregate push down.






[jira] [Created] (SPARK-34960) Aggregate (Min/Max/Count) push down for ORC

2021-04-05 Thread Cheng Su (Jira)
Cheng Su created SPARK-34960:


 Summary: Aggregate (Min/Max/Count) push down for ORC
 Key: SPARK-34960
 URL: https://issues.apache.org/jira/browse/SPARK-34960
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.2.0
Reporter: Cheng Su


Similar to Parquet (https://issues.apache.org/jira/browse/SPARK-34952), we can 
also push down certain aggregations into ORC. ORC exposes column statistics in 
interface `org.apache.orc.Reader` 
([https://github.com/apache/orc/blob/master/java/core/src/java/org/apache/orc/Reader.java#L118]
 ), which Spark can utilize for aggregate push down.
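A hedged sketch of reading those file-level column statistics directly through 
the ORC reader API (the path and column index are placeholders; this only 
illustrates that the statistics are available, not the proposed Spark change):

{code}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.orc.{IntegerColumnStatistics, OrcFile}

val reader = OrcFile.createReader(new Path("/tmp/data.orc"), OrcFile.readerOptions(new Configuration()))
val stats = reader.getStatistics      // one ColumnStatistics per column; index 0 is the root struct
val firstCol = stats(1)               // hypothetical: the first data column
println(s"count = ${firstCol.getNumberOfValues}")
firstCol match {
  case s: IntegerColumnStatistics => println(s"min = ${s.getMinimum}, max = ${s.getMaximum}")
  case _ => // other column types expose their own min/max statistics
}
{code}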






[jira] [Commented] (SPARK-28785) Schema for type scala.Array[Float] is not supported

2021-04-05 Thread Sergey Grigorev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315025#comment-17315025
 ] 

Sergey Grigorev commented on SPARK-28785:
-

I also see that error from time to time in my unit tests; sometimes it is for 
Vector[Double], sometimes "Schema for type B is not supported". Is there a way 
to make a stable serializer that is available at compile time?

> Schema for type scala.Array[Float] is not supported
> ---
>
> Key: SPARK-28785
> URL: https://issues.apache.org/jira/browse/SPARK-28785
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.3
> Environment: *Spark version:* 2.3.3
> *OS:* CentOS Linux release 7.3.1611
> *Kernel:* 3.10.0-862.14.4.el7.x86_64
> *Java:* openjdk version "1.8.0_151"
>Reporter: Jigao Fu
>Priority: Major
>
> I use Spark ML to build my application and write some test cases based on the 
> examples of 
> [https://spark.apache.org/docs/latest/ml-features.html|https://spark.apache.org/docs/latest/ml-features.html#word2vec]
>  to test whether my application is compatible with Spark's transformers.
>  
> After upgrading my Spark version from 2.1.1 to 2.3.3, something strange 
> happened. I train a Pipeline model which contains a Word2Vec transformer and 
> save the model locally; most of the time it works pretty well, but 
> sometimes I get the UnsupportedOperationException error:
>  
> Code: 
>  
> {code:java}
> val data = spark.createDataFrame(Seq(
>   (1, "Hi I heard about Spark".split(" ")),
>   (2, "I wish Java could use case classes".split(" ")),
>   (3, "Logistic regression models are neat".split(" "))
> )).toDF("label", "text")
> // transformers
> val word2Vec = new Word2Vec()
>   .setInputCol("text")
>   .setOutputCol("result")
>   .setVectorSize(3)
>   .setMinCount(0)
> val pipeline = new Pipeline().setStages(Array(word2Vec))
> val model = pipeline.fit(data)
> model.write.overwrite.save("./model_data")
> // Then my applicatin will read the model data file...{code}
>  
>  
> Exception:
>  
> {code:java}
> java.lang.UnsupportedOperationException: Schema for type scala.Array[Float] 
> is not supported at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:780)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:715)
>  at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:714)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.getPath$1(ScalaReflection.scala:173)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:298)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:150)
>  at 
> scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:824)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:39)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor(ScalaReflection.scala:150)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:386)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1$$anonfun$7.apply(ScalaReflection.scala:380)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>  at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>  at scala.collection.immutable.List.foreach(List.scala:381) at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:245) at 
> scala.collection.immutable.List.map(List.scala:285) at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:380)
>  at 
> org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$deserializerFor$1.apply(ScalaReflection.scala:150)
>  at 
> scala.reflect.internal.tpe

[jira] [Commented] (SPARK-34959) Upgrade SBT to 1.5.0

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315023#comment-17315023
 ] 

Apache Spark commented on SPARK-34959:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32055

> Upgrade SBT to 1.5.0
> 
>
> Key: SPARK-34959
> URL: https://issues.apache.org/jira/browse/SPARK-34959
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This JIRA issue aims to upgrade SBT to 1.5.0 which has built-in Scala 3 
> support.
> https://github.com/sbt/sbt/releases/tag/v1.5.0






[jira] [Assigned] (SPARK-34959) Upgrade SBT to 1.5.0

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34959:


Assignee: Apache Spark

> Upgrade SBT to 1.5.0
> 
>
> Key: SPARK-34959
> URL: https://issues.apache.org/jira/browse/SPARK-34959
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> This JIRA issue aims to upgrade SBT to 1.5.0 which has built-in Scala 3 
> support.
> https://github.com/sbt/sbt/releases/tag/v1.5.0






[jira] [Commented] (SPARK-34959) Upgrade SBT to 1.5.0

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315022#comment-17315022
 ] 

Apache Spark commented on SPARK-34959:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/32055

> Upgrade SBT to 1.5.0
> 
>
> Key: SPARK-34959
> URL: https://issues.apache.org/jira/browse/SPARK-34959
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This JIRA issue aims to upgrade SBT to 1.5.0 which has built-in Scala 3 
> support.
> https://github.com/sbt/sbt/releases/tag/v1.5.0






[jira] [Assigned] (SPARK-34959) Upgrade SBT to 1.5.0

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34959:


Assignee: (was: Apache Spark)

> Upgrade SBT to 1.5.0
> 
>
> Key: SPARK-34959
> URL: https://issues.apache.org/jira/browse/SPARK-34959
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.2.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> This JIRA issue aims to upgrade SBT to 1.5.0 which has built-in Scala 3 
> support.
> https://github.com/sbt/sbt/releases/tag/v1.5.0






[jira] [Created] (SPARK-34959) Upgrade SBT to 1.5.0

2021-04-05 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-34959:
-

 Summary: Upgrade SBT to 1.5.0
 Key: SPARK-34959
 URL: https://issues.apache.org/jira/browse/SPARK-34959
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.2.0
Reporter: Dongjoon Hyun


This JIRA issue aims to upgrade SBT to 1.5.0 which has built-in Scala 3 support.

https://github.com/sbt/sbt/releases/tag/v1.5.0






[jira] [Resolved] (SPARK-34958) pyspark 2.4.7 in conda

2021-04-05 Thread Ben Caine (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ben Caine resolved SPARK-34958.
---
Resolution: Duplicate

> pyspark 2.4.7 in conda
> --
>
> Key: SPARK-34958
> URL: https://issues.apache.org/jira/browse/SPARK-34958
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.7
>Reporter: Ben Caine
>Priority: Major
>
> Pyspark 2.4.7 does not seem to be available yet in conda: 
> [https://anaconda.org/conda-forge/pyspark/files] shows only versions up to 
> 2.4.6 (for major version 2).
> {noformat}
> conda install pyspark=2.4.7{noformat}
> gives the following:
>  
>  
> {code:java}
> ResolvePackageNotFound:
> 13:44:43
> - pyspark=2.4.7
> {code}
> It would be great if this package were available through conda!
>  






[jira] [Commented] (SPARK-34946) Block unsupported correlated scalar subquery in Aggregate

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315012#comment-17315012
 ] 

Apache Spark commented on SPARK-34946:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/32054

> Block unsupported correlated scalar subquery in Aggregate
> -
>
> Key: SPARK-34946
> URL: https://issues.apache.org/jira/browse/SPARK-34946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, Spark allows Aggregate to host correlated scalar subqueries, but 
> in some cases, those subqueries cannot be rewritten properly in the 
> `RewriteCorrelatedScalarSubquery` rule. The error messages are also 
> confusing. Hence we should block these cases in CheckAnalysis.
>   
> Case 1: correlated scalar subquery in the grouping expressions but not in 
> aggregate expressions 
> {code:java}
> SELECT SUM(c2) FROM t t1 GROUP BY (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = 
> t2.c1)
> {code}
> We get this error:
> {code:java}
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
> something went wrong in analysis  
> {code}
> because the correlated scalar subquery is not rewritten properly: 
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [scalar-subquery#5 [(c1#6 = c1#6#93)]], [sum(c2#7) AS sum(c2)#11L]
> :  +- Aggregate [c1#6], [sum(c2#7) AS sum(c2)#15L, c1#6 AS c1#6#93]
> : +- LocalRelation [c1#6, c2#7]
> +- LocalRelation [c1#6, c2#7]
> {code}
>  
> Case 2: correlated scalar subquery in the aggregate expressions but not in 
> the grouping expressions 
> {code:java}
> SELECT (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = t2.c1), SUM(c2) FROM t t1 
> GROUP BY c1
> {code}
> We get this error:
> {code:java}
> java.lang.IllegalStateException: Couldn't find sum(c2)#69L in 
> [c1#60,sum(c2#61)#64L]
> {code}
> because the transformed correlated scalar subquery output is not present in 
> the grouping expression of the Aggregate:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [c1#60], [sum(c2)#69L AS scalarsubquery(c1)#70L, sum(c2#61) AS 
> sum(c2)#65L]
> +- Project [c1#60, c2#61, sum(c2)#69L]
>+- Join LeftOuter, (c1#60 = c1#60#95)
>   :- LocalRelation [c1#60, c2#61]
>   +- Aggregate [c1#60], [sum(c2#61) AS sum(c2)#69L, c1#60 AS c1#60#95]
>  +- LocalRelation [c1#60, c2#61]
> {code}
>  
>  






[jira] [Commented] (SPARK-34946) Block unsupported correlated scalar subquery in Aggregate

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17315011#comment-17315011
 ] 

Apache Spark commented on SPARK-34946:
--

User 'allisonwang-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/32054

> Block unsupported correlated scalar subquery in Aggregate
> -
>
> Key: SPARK-34946
> URL: https://issues.apache.org/jira/browse/SPARK-34946
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Allison Wang
>Priority: Major
>
> Currently, Spark allows Aggregate to host correlated scalar subqueries, but 
> in some cases those subqueries cannot be rewritten properly by the 
> `RewriteCorrelatedScalarSubquery` rule, and the resulting error messages are 
> confusing. Hence we should block these cases in CheckAnalysis.
>   
> Case 1: correlated scalar subquery in the grouping expressions but not in 
> aggregate expressions 
> {code:java}
> SELECT SUM(c2) FROM t t1 GROUP BY (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = 
> t2.c1)
> {code}
> We get this error:
> {code:java}
> java.lang.AssertionError: assertion failed: Expects 1 field, but got 2; 
> something went wrong in analysis  
> {code}
> because the correlated scalar subquery is not rewritten properly: 
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [scalar-subquery#5 [(c1#6 = c1#6#93)]], [sum(c2#7) AS sum(c2)#11L]
> :  +- Aggregate [c1#6], [sum(c2#7) AS sum(c2)#15L, c1#6 AS c1#6#93]
> : +- LocalRelation [c1#6, c2#7]
> +- LocalRelation [c1#6, c2#7]
> {code}
>  
> Case 2: correlated scalar subquery in the aggregate expressions but not in 
> the grouping expressions 
> {code:java}
> SELECT (SELECT SUM(c2) FROM t t2 WHERE t1.c1 = t2.c1), SUM(c2) FROM t t1 
> GROUP BY c1
> {code}
> We get this error:
> {code:java}
> java.lang.IllegalStateException: Couldn't find sum(c2)#69L in 
> [c1#60,sum(c2#61)#64L]
> {code}
> because the transformed correlated scalar subquery output is not present in 
> the grouping expression of the Aggregate:
> {code:java}
> == Optimized Logical Plan ==
> Aggregate [c1#60], [sum(c2)#69L AS scalarsubquery(c1)#70L, sum(c2#61) AS 
> sum(c2)#65L]
> +- Project [c1#60, c2#61, sum(c2)#69L]
>+- Join LeftOuter, (c1#60 = c1#60#95)
>   :- LocalRelation [c1#60, c2#61]
>   +- Aggregate [c1#60], [sum(c2#61) AS sum(c2)#69L, c1#60 AS c1#60#95]
>  +- LocalRelation [c1#60, c2#61]
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34957) pyspark 2.4.7 in conda

2021-04-05 Thread Ben Caine (Jira)
Ben Caine created SPARK-34957:
-

 Summary: pyspark 2.4.7 in conda
 Key: SPARK-34957
 URL: https://issues.apache.org/jira/browse/SPARK-34957
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.7
Reporter: Ben Caine


Pyspark 2.4.7 does not seem to be available yet in conda: 
[https://anaconda.org/conda-forge/pyspark/files] shows only versions up to 
2.4.6 (for major version 2).
{noformat}
conda install pyspark=2.4.7{noformat}
gives the following:

 

 
{code:java}
ResolvePackageNotFound:
13:44:43
- pyspark=2.4.7
{code}
It would be great if this package were available through conda!
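In the meantime, assuming the 2.4.7 release is already published to PyPI (it appears to 
be), installing it with pip can serve as a stopgap:
{noformat}
pip install pyspark==2.4.7{noformat}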

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34958) pyspark 2.4.7 in conda

2021-04-05 Thread Ben Caine (Jira)
Ben Caine created SPARK-34958:
-

 Summary: pyspark 2.4.7 in conda
 Key: SPARK-34958
 URL: https://issues.apache.org/jira/browse/SPARK-34958
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.7
Reporter: Ben Caine


Pyspark 2.4.7 does not seem to be available yet in conda: 
[https://anaconda.org/conda-forge/pyspark/files] shows only versions up to 
2.4.6 (for major version 2).
{noformat}
conda install pyspark=2.4.7{noformat}
gives the following:

 

 
{code:java}
ResolvePackageNotFound:
13:44:43
- pyspark=2.4.7
{code}
It would be great if this package were available through conda!

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34493) Create "TEXT Files" page for Data Source documents.

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314890#comment-17314890
 ] 

Apache Spark commented on SPARK-34493:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/32053

> Create "TEXT Files" page for Data Source documents.
> ---
>
> Key: SPARK-34493
> URL: https://issues.apache.org/jira/browse/SPARK-34493
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Adding "TEXT Files" page to [Data Sources 
> documents|https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources]
>  which is missing now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34493) Create "TEXT Files" page for Data Source documents.

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34493:


Assignee: (was: Apache Spark)

> Create "TEXT Files" page for Data Source documents.
> ---
>
> Key: SPARK-34493
> URL: https://issues.apache.org/jira/browse/SPARK-34493
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Adding "TEXT Files" page to [Data Sources 
> documents|https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources]
>  which is missing now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34493) Create "TEXT Files" page for Data Source documents.

2021-04-05 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17314891#comment-17314891
 ] 

Apache Spark commented on SPARK-34493:
--

User 'itholic' has created a pull request for this issue:
https://github.com/apache/spark/pull/32053

> Create "TEXT Files" page for Data Source documents.
> ---
>
> Key: SPARK-34493
> URL: https://issues.apache.org/jira/browse/SPARK-34493
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Priority: Major
>
> Adding "TEXT Files" page to [Data Sources 
> documents|https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources]
>  which is missing now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34493) Create "TEXT Files" page for Data Source documents.

2021-04-05 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34493:


Assignee: Apache Spark

> Create "TEXT Files" page for Data Source documents.
> ---
>
> Key: SPARK-34493
> URL: https://issues.apache.org/jira/browse/SPARK-34493
> Project: Spark
>  Issue Type: Sub-task
>  Components: docs
>Affects Versions: 3.2.0
>Reporter: Haejoon Lee
>Assignee: Apache Spark
>Priority: Major
>
> Adding "TEXT Files" page to [Data Sources 
> documents|https://spark.apache.org/docs/latest/sql-data-sources.html#data-sources]
>  which is missing now.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34934) Race condition while registering source in MetricsSystem

2021-04-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-34934.
--
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 32024
[https://github.com/apache/spark/pull/32024]

> Race condition while registering source in MetricsSystem 
> -
>
> Key: SPARK-34934
> URL: https://issues.apache.org/jira/browse/SPARK-34934
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 3.1.1
>Reporter: Harsh Panchal
>Assignee: Harsh Panchal
>Priority: Minor
> Fix For: 3.2.0
>
>
> {{MetricsSystem}} manages a {{mutable.ArrayBuffer}} of metric sources. The 
> {{registerSource}} and {{removeSource}} methods are provided to add and remove 
> sources from the metrics system. Neither method is synchronised, and because the 
> underlying {{mutable.ArrayBuffer}} is not thread safe, unexpected behaviour is 
> possible when they are called concurrently.
> Some background:
> We have created a custom RpcEndpoint which receives messages from executors 
> and creates new metrics by registering custom sources. These messages are 
> processed concurrently on the driver side, causing this issue. 
> The problem also goes unnoticed because {{Inbox}} ignores these exceptions.
> We found this issue in Spark 2.3.2, but it should be present in all later 
> versions.
> For example, we got the exception below due to the race condition:
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException
>   at 
> scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:104)
>   at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48)
>   at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84)
>   at 
> org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:157)
>   * some closure *
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745){noformat}
> While shutting down
> {noformat}
> Exception in thread "stream execution thread for [id = 
> 9a36a08a-f1be-4ad8-b1dd-093f0b53d37d, runId = 
> d668a19c-aced-45c4-963c-c0b93411d1a4]" 
> java.lang.ArrayIndexOutOfBoundsException: 32
>   at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:44)
>   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>   at 
> scala.collection.IndexedSeqOptimized$class.segmentLength(IndexedSeqOptimized.scala:195)
>   at 
> scala.collection.mutable.ArrayBuffer.segmentLength(ArrayBuffer.scala:48)
>   at 
> scala.collection.IndexedSeqOptimized$class.indexWhere(IndexedSeqOptimized.scala:204)
>   at scala.collection.mutable.ArrayBuffer.indexWhere(ArrayBuffer.scala:48)
>   at scala.collection.GenSeqLike$class.indexOf(GenSeqLike.scala:145)
>   at scala.collection.AbstractSeq.indexOf(Seq.scala:41)
>   at scala.collection.GenSeqLike$class.indexOf(GenSeqLike.scala:129)
>   at scala.collection.AbstractSeq.indexOf(Seq.scala:41)
>   at 
> scala.collection.mutable.BufferLike$class.$minus$eq(BufferLike.scala:127)
>   at scala.collection.mutable.AbstractBuffer.$minus$eq(Buffer.scala:49)
>   at 
> org.apache.spark.metrics.MetricsSystem.removeSource(MetricsSystem.scala:167)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:324)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:308)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189){noformat}
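> A minimal standalone sketch of the underlying hazard (plain Scala, an illustration 
> rather than Spark's actual code): concurrent, unsynchronised appends to a shared 
> {{mutable.ArrayBuffer}} can lose elements or fail inside the resize logic with the 
> same {{ArrayIndexOutOfBoundsException}} as above.
> {code:java}
> import scala.collection.mutable.ArrayBuffer
> 
> object BufferRaceDemo {
>   def main(args: Array[String]): Unit = {
>     // shared and unsynchronised, standing in for MetricsSystem's source buffer
>     val sources = ArrayBuffer.empty[String]
>     val threads = (1 to 8).map { id =>
>       new Thread(new Runnable {
>         // each thread plays the role of a concurrent registerSource caller
>         override def run(): Unit =
>           (1 to 100000).foreach(i => sources += s"source-$id-$i")
>       })
>     }
>     threads.foreach(_.start())
>     threads.foreach(_.join())
>     // 800000 entries are expected; a smaller count or an
>     // ArrayIndexOutOfBoundsException indicates the race.
>     println(s"size = ${sources.size}")
>   }
> }
> {code}
> Guarding the register/remove paths with a common lock (or using a thread-safe 
> collection) avoids this.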



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34934) Race condition while registering source in MetricsSystem

2021-04-05 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-34934:


Assignee: Harsh Panchal

> Race condition while registering source in MetricsSystem 
> -
>
> Key: SPARK-34934
> URL: https://issues.apache.org/jira/browse/SPARK-34934
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.2, 3.1.1
>Reporter: Harsh Panchal
>Assignee: Harsh Panchal
>Priority: Minor
>
> {{MetricsSystem}} manages a {{mutable.ArrayBuffer}} of metric sources. The 
> {{registerSource}} and {{removeSource}} methods are provided to add and remove 
> sources from the metrics system. Neither method is synchronised, and because the 
> underlying {{mutable.ArrayBuffer}} is not thread safe, unexpected behaviour is 
> possible when they are called concurrently.
> Some background:
> We have created a custom RpcEndpoint which receives messages from executors 
> and creates new metrics by registering custom sources. These messages are 
> processed concurrently on the driver side, causing this issue. 
> The problem also goes unnoticed because {{Inbox}} ignores these exceptions.
> We found this issue in Spark 2.3.2, but it should be present in all later 
> versions.
> For example, we got the exception below due to the race condition:
> {noformat}
> java.lang.ArrayIndexOutOfBoundsException
>   at 
> scala.collection.mutable.ResizableArray$class.ensureSize(ResizableArray.scala:104)
>   at scala.collection.mutable.ArrayBuffer.ensureSize(ArrayBuffer.scala:48)
>   at scala.collection.mutable.ArrayBuffer.$plus$eq(ArrayBuffer.scala:84)
>   at 
> org.apache.spark.metrics.MetricsSystem.registerSource(MetricsSystem.scala:157)
>   * some closure *
>   at 
> org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
>   at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
>   at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
>   at 
> org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745){noformat}
> While shutting down
> {noformat}
> Exception in thread "stream execution thread for [id = 
> 9a36a08a-f1be-4ad8-b1dd-093f0b53d37d, runId = 
> d668a19c-aced-45c4-963c-c0b93411d1a4]" 
> java.lang.ArrayIndexOutOfBoundsException: 32
>   at 
> scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:44)
>   at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:48)
>   at 
> scala.collection.IndexedSeqOptimized$class.segmentLength(IndexedSeqOptimized.scala:195)
>   at 
> scala.collection.mutable.ArrayBuffer.segmentLength(ArrayBuffer.scala:48)
>   at 
> scala.collection.IndexedSeqOptimized$class.indexWhere(IndexedSeqOptimized.scala:204)
>   at scala.collection.mutable.ArrayBuffer.indexWhere(ArrayBuffer.scala:48)
>   at scala.collection.GenSeqLike$class.indexOf(GenSeqLike.scala:145)
>   at scala.collection.AbstractSeq.indexOf(Seq.scala:41)
>   at scala.collection.GenSeqLike$class.indexOf(GenSeqLike.scala:129)
>   at scala.collection.AbstractSeq.indexOf(Seq.scala:41)
>   at 
> scala.collection.mutable.BufferLike$class.$minus$eq(BufferLike.scala:127)
>   at scala.collection.mutable.AbstractBuffer.$minus$eq(Buffer.scala:49)
>   at 
> org.apache.spark.metrics.MetricsSystem.removeSource(MetricsSystem.scala:167)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runStream$2.apply(StreamExecution.scala:324)
>   at 
> org.apache.spark.util.UninterruptibleThread.runUninterruptibly(UninterruptibleThread.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:308)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189){noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34794) Nested higher-order functions broken in DSL

2021-04-05 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-34794:
---
Labels: correctness  (was: Correctness)

> Nested higher-order functions broken in DSL
> ---
>
> Key: SPARK-34794
> URL: https://issues.apache.org/jira/browse/SPARK-34794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: 3.1.1
>Reporter: Daniel Solow
>Priority: Major
>  Labels: correctness
>
> In Spark 3, if I have:
> {code:java}
> val df = Seq(
> (Seq(1,2,3), Seq("a", "b", "c"))
> ).toDF("numbers", "letters")
> {code}
> and I want to take the cross product of these two arrays, I can do the 
> following in SQL:
> {code:java}
> df.selectExpr("""
> FLATTEN(
> TRANSFORM(
> numbers,
> number -> TRANSFORM(
> letters,
> letter -> (number AS number, letter AS letter)
> )
> )
> ) AS zipped
> """).show(false)
> ++
> |zipped  |
> ++
> |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
> ++
> {code}
> This works fine. But when I try the equivalent using the Scala DSL, the 
> result is wrong:
> {code:java}
> df.select(
> f.flatten(
> f.transform(
> $"numbers",
> (number: Column) => { f.transform(
> $"letters",
> (letter: Column) => { f.struct(
> number.as("number"),
> letter.as("letter")
> ) }
> ) }
> )
> ).as("zipped")
> ).show(10, false)
> ++
> |zipped  |
> ++
> |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
> ++
> {code}
> Note that the numbers are not included in the output. The explain for this 
> second version is:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
> lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
> NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
> false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Analyzed Logical Plan ==
> zipped: array<struct<number:int,letter:string>>
> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
> lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
> x#446, false)), lambda x#445, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Optimized Logical Plan ==
> LocalRelation [zipped#444]
> == Physical Plan ==
> LocalTableScan [zipped#444]
> {code}
> It seems the lambda variable name x is hardcoded, so the outer and inner lambda 
> variables collide (both struct fields reference {{x#446}} in the analyzed plan above) 
> and {{number}} ends up bound to the inner variable. And sure enough: 
> https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647
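> Until this is fixed, a possible workaround (an assumption based on the SQL path above 
> working, not something verified in this ticket) is to route the nested lambdas through 
> {{expr}}, since the SQL parser keeps the user-chosen lambda variable names distinct:
> {code:java}
> import org.apache.spark.sql.functions.expr
> 
> df.select(expr("""
>   FLATTEN(
>     TRANSFORM(
>       numbers,
>       number -> TRANSFORM(
>         letters,
>         letter -> (number AS number, letter AS letter)
>       )
>     )
>   )
> """).as("zipped")).show(false)
> {code}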



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34794) Nested higher-order functions broken in DSL

2021-04-05 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-34794:
---
Affects Version/s: 3.2.0

> Nested higher-order functions broken in DSL
> ---
>
> Key: SPARK-34794
> URL: https://issues.apache.org/jira/browse/SPARK-34794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: 3.1.1
>Reporter: Daniel Solow
>Priority: Major
>
> In Spark 3, if I have:
> {code:java}
> val df = Seq(
> (Seq(1,2,3), Seq("a", "b", "c"))
> ).toDF("numbers", "letters")
> {code}
> and I want to take the cross product of these two arrays, I can do the 
> following in SQL:
> {code:java}
> df.selectExpr("""
> FLATTEN(
> TRANSFORM(
> numbers,
> number -> TRANSFORM(
> letters,
> letter -> (number AS number, letter AS letter)
> )
> )
> ) AS zipped
> """).show(false)
> ++
> |zipped  |
> ++
> |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
> ++
> {code}
> This works fine. But when I try the equivalent using the Scala DSL, the 
> result is wrong:
> {code:java}
> df.select(
> f.flatten(
> f.transform(
> $"numbers",
> (number: Column) => { f.transform(
> $"letters",
> (letter: Column) => { f.struct(
> number.as("number"),
> letter.as("letter")
> ) }
> ) }
> )
> ).as("zipped")
> ).show(10, false)
> ++
> |zipped  |
> ++
> |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
> ++
> {code}
> Note that the numbers are not included in the output. The explain for this 
> second version is:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
> lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
> NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
> false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Analyzed Logical Plan ==
> zipped: array<struct<number:int,letter:string>>
> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
> lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
> x#446, false)), lambda x#445, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Optimized Logical Plan ==
> LocalRelation [zipped#444]
> == Physical Plan ==
> LocalTableScan [zipped#444]
> {code}
> It seems the lambda variable name x is hardcoded. And sure enough: 
> https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34794) Nested higher-order functions broken in DSL

2021-04-05 Thread Tanel Kiis (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-34794:
---
Labels: Correctness  (was: )

> Nested higher-order functions broken in DSL
> ---
>
> Key: SPARK-34794
> URL: https://issues.apache.org/jira/browse/SPARK-34794
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.1
> Environment: 3.1.1
>Reporter: Daniel Solow
>Priority: Major
>  Labels: Correctness
>
> In Spark 3, if I have:
> {code:java}
> val df = Seq(
> (Seq(1,2,3), Seq("a", "b", "c"))
> ).toDF("numbers", "letters")
> {code}
> and I want to take the cross product of these two arrays, I can do the 
> following in SQL:
> {code:java}
> df.selectExpr("""
> FLATTEN(
> TRANSFORM(
> numbers,
> number -> TRANSFORM(
> letters,
> letter -> (number AS number, letter AS letter)
> )
> )
> ) AS zipped
> """).show(false)
> ++
> |zipped  |
> ++
> |[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
> ++
> {code}
> This works fine. But when I try the equivalent using the Scala DSL, the 
> result is wrong:
> {code:java}
> df.select(
> f.flatten(
> f.transform(
> $"numbers",
> (number: Column) => { f.transform(
> $"letters",
> (letter: Column) => { f.struct(
> number.as("number"),
> letter.as("letter")
> ) }
> ) }
> )
> ).as("zipped")
> ).show(10, false)
> ++
> |zipped  |
> ++
> |[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
> ++
> {code}
> Note that the numbers are not included in the output. The explain for this 
> second version is:
> {code:java}
> == Parsed Logical Plan ==
> 'Project [flatten(transform('numbers, lambdafunction(transform('letters, 
> lambdafunction(struct(NamePlaceholder, lambda 'x AS number#442, 
> NamePlaceholder, lambda 'x AS letter#443), lambda 'x, false)), lambda 'x, 
> false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Analyzed Logical Plan ==
> zipped: array<struct<number:int,letter:string>>
> Project [flatten(transform(numbers#308, lambdafunction(transform(letters#309, 
> lambdafunction(struct(number, lambda x#446, letter, lambda x#446), lambda 
> x#446, false)), lambda x#445, false))) AS zipped#444]
> +- Project [_1#303 AS numbers#308, _2#304 AS letters#309]
>+- LocalRelation [_1#303, _2#304]
> == Optimized Logical Plan ==
> LocalRelation [zipped#444]
> == Physical Plan ==
> LocalTableScan [zipped#444]
> {code}
> It seems the lambda variable name x is hardcoded. And sure enough: 
> https://github.com/apache/spark/blob/v3.1.1/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L3647



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org