[jira] [Assigned] (SPARK-32762) Enhance the verification of sql-expression-schema.md

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32762:


Assignee: (was: Apache Spark)

> Enhance the verification of sql-expression-schema.md
> 
>
> Key: SPARK-32762
> URL: https://issues.apache.org/jira/browse/SPARK-32762
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: sql-expression-schema.md is automatically generated by 
> ExpressionsSchemaSuite, but only the expression entries are checked by the 
> suite. So if we manually modify the contents of sql-expression-schema.md, 
> ExpressionsSchemaSuite does not necessarily guarantee that the file is correct.
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32762) Enhance the verification of sql-expression-schema.md

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188149#comment-17188149
 ] 

Apache Spark commented on SPARK-32762:
--

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29608

> Enhance the verification of sql-expression-schema.md
> 
>
> Key: SPARK-32762
> URL: https://issues.apache.org/jira/browse/SPARK-32762
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: sql-expression-schema.md is automatically generated by 
> ExpressionsSchemaSuite, but only the expression entries are checked by the 
> suite. So if we manually modify the contents of sql-expression-schema.md, 
> ExpressionsSchemaSuite does not necessarily guarantee that the file is correct.
>Reporter: Yang Jie
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32762) Enhance the verification of sql-expression-schema.md

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32762:


Assignee: Apache Spark

> Enhance the verification of sql-expression-schema.md
> 
>
> Key: SPARK-32762
> URL: https://issues.apache.org/jira/browse/SPARK-32762
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: sql-expression-schema.md is automatically generated by 
> ExpressionsSchemaSuite, but only the expression entries are checked by the 
> suite. So if we manually modify the contents of sql-expression-schema.md, 
> ExpressionsSchemaSuite does not necessarily guarantee that the file is correct.
>Reporter: Yang Jie
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32762) Enhance the verification of sql-expression-schema.md

2020-08-31 Thread Yang Jie (Jira)
Yang Jie created SPARK-32762:


 Summary: Enhance the verification of sql-expression-schema.md
 Key: SPARK-32762
 URL: https://issues.apache.org/jira/browse/SPARK-32762
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
 Environment: sql-expression-schema.md is automatically generated by 
ExpressionsSchemaSuite, but only the expression entries are checked by the 
suite. So if we manually modify the contents of sql-expression-schema.md, 
ExpressionsSchemaSuite does not necessarily guarantee that the file is correct.
Reporter: Yang Jie
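
A minimal sketch of the kind of stricter check being proposed, assuming the suite writes a freshly generated copy of the markdown next to the checked-in file (the file names below are illustrative, not the suite's actual API):
{code:scala}
import java.nio.file.{Files, Paths}
import scala.collection.JavaConverters._

// Illustrative only: compare the checked-in markdown with a freshly generated copy,
// so that any manual edit to sql-expression-schema.md fails the check.
object SchemaFileCheck {
  def main(args: Array[String]): Unit = {
    val checkedIn   = Files.readAllLines(Paths.get("sql-expression-schema.md")).asScala.toList
    val regenerated = Files.readAllLines(Paths.get("sql-expression-schema.generated.md")).asScala.toList
    require(checkedIn == regenerated,
      "sql-expression-schema.md does not match the generated output; please regenerate it")
  }
}
{code}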






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32187) User Guide - Shipping Python Package

2020-08-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32187:


Assignee: Fabian Höring

> User Guide - Shipping Python Package
> 
>
> Key: SPARK-32187
> URL: https://issues.apache.org/jira/browse/SPARK-32187
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Fabian Höring
>Priority: Major
>
> - Zipped file
> - Python files
> - PEX (?) (see also SPARK-25433)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32761) Planner error when aggregating multiple distinct Constant columns

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32761:


Assignee: Apache Spark

> Planner error when aggregating multiple distinct Constant columns
> -
>
> Key: SPARK-32761
> URL: https://issues.apache.org/jira/browse/SPARK-32761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Linhong Liu
>Assignee: Apache Spark
>Priority: Major
>
> SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3) will trigger this bug.
> The problematic code is:
>  
> {code:java}
> val distinctAggGroups = aggExpressions.filter(_.isDistinct).groupBy { e =>
>   val unfoldableChildren = e.aggregateFunction.children.filter(!_.foldable).toSet
>   if (unfoldableChildren.nonEmpty) {
>     // Only expand the unfoldable children
>     unfoldableChildren
>   } else {
>     // If aggregateFunction's children are all foldable
>     // we must expand at least one of the children (here we take the first child),
>     // or If we don't, we will get the wrong result, for example:
>     // count(distinct 1) will be explained to count(1) after the rewrite function.
>     // Generally, the distinct aggregateFunction should not run
>     // foldable TypeCheck for the first child.
>     e.aggregateFunction.children.take(1).toSet
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32761) Planner error when aggregating multiple distinct Constant columns

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32761:


Assignee: (was: Apache Spark)

> Planner error when aggregating multiple distinct Constant columns
> -
>
> Key: SPARK-32761
> URL: https://issues.apache.org/jira/browse/SPARK-32761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Linhong Liu
>Priority: Major
>
> SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3) will trigger this bug.
> The problematic code is:
>  
> {code:java}
> val distinctAggGroups = aggExpressions.filter(_.isDistinct).groupBy { e =>
>   val unfoldableChildren = e.aggregateFunction.children.filter(!_.foldable).toSet
>   if (unfoldableChildren.nonEmpty) {
>     // Only expand the unfoldable children
>     unfoldableChildren
>   } else {
>     // If aggregateFunction's children are all foldable
>     // we must expand at least one of the children (here we take the first child),
>     // or If we don't, we will get the wrong result, for example:
>     // count(distinct 1) will be explained to count(1) after the rewrite function.
>     // Generally, the distinct aggregateFunction should not run
>     // foldable TypeCheck for the first child.
>     e.aggregateFunction.children.take(1).toSet
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32761) Planner error when aggregating multiple distinct Constant columns

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188146#comment-17188146
 ] 

Apache Spark commented on SPARK-32761:
--

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/29607

> Planner error when aggregating multiple distinct Constant columns
> -
>
> Key: SPARK-32761
> URL: https://issues.apache.org/jira/browse/SPARK-32761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Linhong Liu
>Priority: Major
>
> SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3) will trigger this bug.
> The problematic code is:
>  
> {code:java}
> val distinctAggGroups = aggExpressions.filter(_.isDistinct).groupBy { e =>
>   val unfoldableChildren = e.aggregateFunction.children.filter(!_.foldable).toSet
>   if (unfoldableChildren.nonEmpty) {
>     // Only expand the unfoldable children
>     unfoldableChildren
>   } else {
>     // If aggregateFunction's children are all foldable
>     // we must expand at least one of the children (here we take the first child),
>     // or If we don't, we will get the wrong result, for example:
>     // count(distinct 1) will be explained to count(1) after the rewrite function.
>     // Generally, the distinct aggregateFunction should not run
>     // foldable TypeCheck for the first child.
>     e.aggregateFunction.children.take(1).toSet
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32190) Development - Contribution Guide

2020-08-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32190.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29596
[https://github.com/apache/spark/pull/29596]

> Development - Contribution Guide
> 
>
> Key: SPARK-32190
> URL: https://issues.apache.org/jira/browse/SPARK-32190
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> Will be similar to http://spark.apache.org/contributing.html but will avoid 
> duplication and add more details.
> - Style Guide: PEP 8, with a couple of exceptions such as 
> https://github.com/apache/spark/blob/master/dev/tox.ini#L17.
> - Bug Reports: JIRA. Maybe we should just point back to 
> http://spark.apache.org/contributing.html
> - Contribution Workflow: e.g. 
> https://pandas.pydata.org/docs/development/contributing.html
> - Documentation Contribution: e.g. 
> https://pandas.pydata.org/docs/development/contributing.html
> - Code Contribution: e.g. 
> https://pandas.pydata.org/docs/development/contributing.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32190) Development - Contribution Guide

2020-08-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32190:


Assignee: Hyukjin Kwon

> Development - Contribution Guide
> 
>
> Key: SPARK-32190
> URL: https://issues.apache.org/jira/browse/SPARK-32190
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Will be similar to http://spark.apache.org/contributing.html but will avoid 
> duplication and add more details.
> - Style Guide: PEP 8, with a couple of exceptions such as 
> https://github.com/apache/spark/blob/master/dev/tox.ini#L17.
> - Bug Reports: JIRA. Maybe we should just point back to 
> http://spark.apache.org/contributing.html
> - Contribution Workflow: e.g. 
> https://pandas.pydata.org/docs/development/contributing.html
> - Documentation Contribution: e.g. 
> https://pandas.pydata.org/docs/development/contributing.html
> - Code Contribution: e.g. 
> https://pandas.pydata.org/docs/development/contributing.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32761) Planner error when aggregating multiple distinct Constant columns

2020-08-31 Thread Linhong Liu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Linhong Liu updated SPARK-32761:

Description: 
SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3) will trigger this bug.

The problematic code is:

 
{code:java}
val distinctAggGroups = aggExpressions.filter(_.isDistinct).groupBy { e =>
  val unfoldableChildren = e.aggregateFunction.children.filter(!_.foldable).toSet
  if (unfoldableChildren.nonEmpty) {
    // Only expand the unfoldable children
    unfoldableChildren
  } else {
    // If aggregateFunction's children are all foldable
    // we must expand at least one of the children (here we take the first child),
    // or If we don't, we will get the wrong result, for example:
    // count(distinct 1) will be explained to count(1) after the rewrite function.
    // Generally, the distinct aggregateFunction should not run
    // foldable TypeCheck for the first child.
    e.aggregateFunction.children.take(1).toSet
  }
}
{code}

  was:
SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3) will trigger this bug.

The problematic code is:

 
{code:java}
val distinctAggGroups = aggExpressions.filter(_.isDistinct).groupBy { e =>
  val unfoldableChildren = e.aggregateFunction.children.filter(!_.foldable).toSet
  if (unfoldableChildren.nonEmpty) {
    // Only expand the unfoldable children
    unfoldableChildren
  } else {
    // If aggregateFunction's children are all foldable
    // we must expand at least one of the children (here we take the first child),
    // or If we don't, we will get the wrong result, for example:
    // count(distinct 1) will be explained to count(1) after the rewrite function.
    // Generally, the distinct aggregateFunction should not run
    // foldable TypeCheck for the first child.
    e.aggregateFunction.children.take(1).toSet
  }
}
{code}


> Planner error when aggregating multiple distinct Constant columns
> -
>
> Key: SPARK-32761
> URL: https://issues.apache.org/jira/browse/SPARK-32761
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Linhong Liu
>Priority: Major
>
> SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3) will trigger this bug.
> The problematic code is:
>  
> {code:java}
> val distinctAggGroups = aggExpressions.filter(_.isDistinct).groupBy { e =>
>   val unfoldableChildren = e.aggregateFunction.children.filter(!_.foldable).toSet
>   if (unfoldableChildren.nonEmpty) {
>     // Only expand the unfoldable children
>     unfoldableChildren
>   } else {
>     // If aggregateFunction's children are all foldable
>     // we must expand at least one of the children (here we take the first child),
>     // or If we don't, we will get the wrong result, for example:
>     // count(distinct 1) will be explained to count(1) after the rewrite function.
>     // Generally, the distinct aggregateFunction should not run
>     // foldable TypeCheck for the first child.
>     e.aggregateFunction.children.take(1).toSet
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32761) Planner error when aggregating multiple distinct Constant columns

2020-08-31 Thread Linhong Liu (Jira)
Linhong Liu created SPARK-32761:
---

 Summary: Planner error when aggregating multiple distinct Constant 
columns
 Key: SPARK-32761
 URL: https://issues.apache.org/jira/browse/SPARK-32761
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Linhong Liu


SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3) will trigger this bug.

The problematic code is:

 
{code:java}
val distinctAggGroups = aggExpressions.filter(_.isDistinct).groupBy { e =>
  val unfoldableChildren = e.aggregateFunction.children.filter(!_.foldable).toSet
  if (unfoldableChildren.nonEmpty) {
    // Only expand the unfoldable children
    unfoldableChildren
  } else {
    // If aggregateFunction's children are all foldable
    // we must expand at least one of the children (here we take the first child),
    // or If we don't, we will get the wrong result, for example:
    // count(distinct 1) will be explained to count(1) after the rewrite function.
    // Generally, the distinct aggregateFunction should not run
    // foldable TypeCheck for the first child.
    e.aggregateFunction.children.take(1).toSet
  }
}
{code}
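
For reference, a minimal reproduction sketch (assuming a plain spark-shell; the query matches the description above):
{code:scala}
// Both aggregates have only foldable (constant) children, so the groupBy shown
// above collapses their grouping sets in a way the planner cannot handle.
spark.sql("SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3)").show()
{code}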



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32191) Migration Guide

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188139#comment-17188139
 ] 

Apache Spark commented on SPARK-32191:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29606

> Migration Guide
> ---
>
> Key: SPARK-32191
> URL: https://issues.apache.org/jira/browse/SPARK-32191
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>
> Port http://spark.apache.org/docs/latest/pyspark-migration-guide.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32760) Support for INET data type

2020-08-31 Thread Xiao Li (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188138#comment-17188138
 ] 

Xiao Li commented on SPARK-32760:
-

Supporting a new data type is very expensive. I suggest we close this as LATER. 

> Support for INET data type
> --
>
> Key: SPARK-32760
> URL: https://issues.apache.org/jira/browse/SPARK-32760
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> PostgreSQL has support for the `INET` data type: 
> [https://www.postgresql.org/docs/9.1/datatype-net-types.html]
> We have a few customers who are interested in similar, native support for IP 
> addresses, just like in PostgreSQL.
> The issue with storing IP addresses as strings is that most matches (such as 
> checking whether an IP address belongs to a subnet) cannot leverage Parquet 
> bloom filters. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188131#comment-17188131
 ] 

Apache Spark commented on SPARK-31511:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/29605

> Make BytesToBytesMap iterator() thread-safe
> ---
>
> Key: SPARK-31511
> URL: https://issues.apache.org/jira/browse/SPARK-31511
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 
> 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>  Labels: correctness
> Fix For: 3.1.0, 3.0.2
>
>
> BytesToBytesMap currently has a thread safe and unsafe iterator method. This 
> is somewhat confusing. We should make iterator() thread safe and remove the 
> safeIterator() function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188132#comment-17188132
 ] 

Apache Spark commented on SPARK-31511:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/29605

> Make BytesToBytesMap iterator() thread-safe
> ---
>
> Key: SPARK-31511
> URL: https://issues.apache.org/jira/browse/SPARK-31511
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 
> 3.0.0
>Reporter: Herman van Hövell
>Assignee: Herman van Hövell
>Priority: Major
>  Labels: correctness
> Fix For: 3.1.0, 3.0.2
>
>
> BytesToBytesMap currently has a thread safe and unsafe iterator method. This 
> is somewhat confusing. We should make iterator() thread safe and remove the 
> safeIterator() function.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32759) Support for INET data type

2020-08-31 Thread Ruslan Dautkhanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ruslan Dautkhanov resolved SPARK-32759.
---
Resolution: Duplicate

> Support for INET data type
> --
>
> Key: SPARK-32759
> URL: https://issues.apache.org/jira/browse/SPARK-32759
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Ruslan Dautkhanov
>Priority: Major
>
> PostgreSQL has support for the `INET` data type: 
> [https://www.postgresql.org/docs/9.1/datatype-net-types.html]
> We have a few customers who are interested in similar, native support for IP 
> addresses, just like in PostgreSQL.
> The issue with storing IP addresses as strings is that most matches (such as 
> checking whether an IP address belongs to a subnet) cannot leverage Parquet 
> bloom filters. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32760) Support for INET data type

2020-08-31 Thread Ruslan Dautkhanov (Jira)
Ruslan Dautkhanov created SPARK-32760:
-

 Summary: Support for INET data type
 Key: SPARK-32760
 URL: https://issues.apache.org/jira/browse/SPARK-32760
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.0, 2.4.0, 3.1.0
Reporter: Ruslan Dautkhanov


PostgreSQL has support for the `INET` data type: 

[https://www.postgresql.org/docs/9.1/datatype-net-types.html]

We have a few customers who are interested in similar, native support for IP 
addresses, just like in PostgreSQL.

The issue with storing IP addresses as strings is that most matches (such as 
checking whether an IP address belongs to a subnet) cannot leverage Parquet 
bloom filters. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32759) Support for INET data type

2020-08-31 Thread Ruslan Dautkhanov (Jira)
Ruslan Dautkhanov created SPARK-32759:
-

 Summary: Support for INET data type
 Key: SPARK-32759
 URL: https://issues.apache.org/jira/browse/SPARK-32759
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 3.0.0, 2.4.0, 3.1.0
Reporter: Ruslan Dautkhanov


PostgreSQL has support for the `INET` data type: 

[https://www.postgresql.org/docs/9.1/datatype-net-types.html]

We have a few customers who are interested in similar, native support for IP 
addresses, just like in PostgreSQL.

The issue with storing IP addresses as strings is that most matches (such as 
checking whether an IP address belongs to a subnet) cannot leverage Parquet 
bloom filters. 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partition

2020-08-31 Thread Ivan Tsukanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Tsukanov updated SPARK-32758:
--
Description: 
If we run the following code
{code:scala}
  val sparkConf = new SparkConf()
.setAppName("test-app")
.setMaster("local[1]")
  val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

  import sparkSession.implicits._

  val df = (1 to 10)
.toDF("c1")
.repartition(1000)

  implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)

  df.limit(1)
.map(identity)
.collect()

  df.map(identity)
.limit(1)
.collect()

  Thread.sleep(10)
{code}
we will see that in the first case Spark started 1002 tasks despite limit(1) -

!image-2020-09-01-10-51-09-417.png!

Expected behavior - both scenarios (limit before and after map) will produce 
the same results - one or two tasks to get one value from the DataFrame.

  was:
If we run the following code
{code:scala}
  val sparkConf = new SparkConf()
.setAppName("test-app")
.setMaster("local[1]")
  val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

  import sparkSession.implicits._

  val df = (1 to 10)
.toDF("c1")
.repartition(1000)

  implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)

  df.limit(1)
.map(identity)
.collect()

  df.map(identity)
.limit(1)
.collect()

  Thread.sleep(10)
{code}
we will see that Spark started 1002 tasks despite limit(1) -

!image-2020-09-01-10-51-09-417.png!

Expected behavior - both scenarios (limit before and after map) will produce 
the same results - one or two tasks to get one value from the DataFrame.


> Spark ignores limit(1) and starts tasks for all partition
> -
>
> Key: SPARK-32758
> URL: https://issues.apache.org/jira/browse/SPARK-32758
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ivan Tsukanov
>Priority: Major
> Attachments: image-2020-09-01-10-51-09-417.png
>
>
> If we run the following code
> {code:scala}
>   val sparkConf = new SparkConf()
> .setAppName("test-app")
> .setMaster("local[1]")
>   val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
>   import sparkSession.implicits._
>   val df = (1 to 10)
> .toDF("c1")
> .repartition(1000)
>   implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)
>   df.limit(1)
> .map(identity)
> .collect()
>   df.map(identity)
> .limit(1)
> .collect()
>   Thread.sleep(10)
> {code}
> we will see that in the first case Spark started 1002 tasks despite limit(1) -
> !image-2020-09-01-10-51-09-417.png!
> Expected behavior - both scenarios (limit before and after map) will produce 
> the same results - one or two tasks to get one value from the DataFrame.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partition

2020-08-31 Thread Ivan Tsukanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Tsukanov updated SPARK-32758:
--
Environment: (was:  
 
должен
 )

> Spark ignores limit(1) and starts tasks for all partition
> -
>
> Key: SPARK-32758
> URL: https://issues.apache.org/jira/browse/SPARK-32758
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Ivan Tsukanov
>Priority: Major
> Attachments: image-2020-09-01-10-51-09-417.png
>
>
> If we run the following code
> {code:scala}
>   val sparkConf = new SparkConf()
> .setAppName("test-app")
> .setMaster("local[1]")
>   val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
>   import sparkSession.implicits._
>   val df = (1 to 10)
> .toDF("c1")
> .repartition(1000)
>   implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)
>   df.limit(1)
> .map(identity)
> .collect()
>   df.map(identity)
> .limit(1)
> .collect()
>   Thread.sleep(10)
> {code}
> we will see that Spark started 1002 tasks despite limit(1) -
> !image-2020-09-01-10-51-09-417.png!
> Expected behavior - both scenarios (limit before and after map) will produce 
> the same results - one or two tasks to get one value from the DataFrame.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partition

2020-08-31 Thread Ivan Tsukanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Tsukanov updated SPARK-32758:
--
Description: 
If we run the following code
{code:scala}
  val sparkConf = new SparkConf()
.setAppName("test-app")
.setMaster("local[1]")
  val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

  import sparkSession.implicits._

  val df = (1 to 10)
.toDF("c1")
.repartition(1000)

  implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)

  df.limit(1)
.map(identity)
.collect()

  df.map(identity)
.limit(1)
.collect()

  Thread.sleep(10)
{code}
we will see that Spark started 1002 tasks despite limit(1) -

!image-2020-09-01-10-51-09-417.png!

Expected behavior - both scenarios (limit before and after map) will produce 
the same results - one or two tasks to get one value from the DataFrame.

  was:
If we run the following code
{code:scala}
  val sparkConf = new SparkConf()
.setAppName("test-app")
.setMaster("local[1]")
  val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

  import sparkSession.implicits._

  val df = (1 to 10)
.toDF("c1")
.repartition(1000)

  implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)

  df.limit(1)
.map(identity)
.collect()

  df.map(identity)
.limit(1)
.collect()

  Thread.sleep(10)
{code}
we will see that Spark started 1002 tasks despite limit(1) -

!image-2020-09-01-10-34-47-580.png!  

Expected behavior - both scenarios (limit before and after map) will produce 
the same results - one or two tasks to get one value from the DataFrame.


> Spark ignores limit(1) and starts tasks for all partition
> -
>
> Key: SPARK-32758
> URL: https://issues.apache.org/jira/browse/SPARK-32758
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment:  
>  
> должен
>  
>Reporter: Ivan Tsukanov
>Priority: Major
> Attachments: image-2020-09-01-10-51-09-417.png
>
>
> If we run the following code
> {code:scala}
>   val sparkConf = new SparkConf()
> .setAppName("test-app")
> .setMaster("local[1]")
>   val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
>   import sparkSession.implicits._
>   val df = (1 to 10)
> .toDF("c1")
> .repartition(1000)
>   implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)
>   df.limit(1)
> .map(identity)
> .collect()
>   df.map(identity)
> .limit(1)
> .collect()
>   Thread.sleep(10)
> {code}
> we will see that Spark started 1002 tasks despite limit(1) -
> !image-2020-09-01-10-51-09-417.png!
> Expected behavior - both scenarios (limit before and after map) will produce 
> the same results - one or two tasks to get one value from the DataFrame.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partition

2020-08-31 Thread Ivan Tsukanov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ivan Tsukanov updated SPARK-32758:
--
Attachment: image-2020-09-01-10-51-09-417.png

> Spark ignores limit(1) and starts tasks for all partition
> -
>
> Key: SPARK-32758
> URL: https://issues.apache.org/jira/browse/SPARK-32758
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment:  
>  
> должен
>  
>Reporter: Ivan Tsukanov
>Priority: Major
> Attachments: image-2020-09-01-10-51-09-417.png
>
>
> If we run the following code
> {code:scala}
>   val sparkConf = new SparkConf()
> .setAppName("test-app")
> .setMaster("local[1]")
>   val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
>   import sparkSession.implicits._
>   val df = (1 to 10)
> .toDF("c1")
> .repartition(1000)
>   implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)
>   df.limit(1)
> .map(identity)
> .collect()
>   df.map(identity)
> .limit(1)
> .collect()
>   Thread.sleep(10)
> {code}
> we will see that Spark started 1002 tasks despite limit(1) -
> !image-2020-09-01-10-34-47-580.png!  
> Expected behavior - both scenarios (limit before and after map) will produce 
> the same results - one or two tasks to get one value from the DataFrame.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partition

2020-08-31 Thread Ivan Tsukanov (Jira)
Ivan Tsukanov created SPARK-32758:
-

 Summary: Spark ignores limit(1) and starts tasks for all partition
 Key: SPARK-32758
 URL: https://issues.apache.org/jira/browse/SPARK-32758
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.0
 Environment:  
 
должен
 
Reporter: Ivan Tsukanov


If we run the following code
{code:scala}
  val sparkConf = new SparkConf()
.setAppName("test-app")
.setMaster("local[1]")
  val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()

  import sparkSession.implicits._

  val df = (1 to 10)
.toDF("c1")
.repartition(1000)

  implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)

  df.limit(1)
.map(identity)
.collect()

  df.map(identity)
.limit(1)
.collect()

  Thread.sleep(10)
{code}
we will see that Spark started 1002 tasks despite limit(1) -

!image-2020-09-01-10-34-47-580.png!  

Expected behavior - both scenarios (limit before and after map) will produce 
the same results - one or two tasks to get one value from the DataFrame.
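
One way to make the task counts observable programmatically is a SparkListener; a rough sketch building on the snippet above (the listener API is standard, the expected numbers are only what was observed here, and counts may lag slightly because listener events are delivered asynchronously):
{code:scala}
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

// Count every task that starts; reset the counter between the two scenarios.
val startedTasks = new AtomicInteger(0)
sparkSession.sparkContext.addSparkListener(new SparkListener {
  override def onTaskStart(taskStart: SparkListenerTaskStart): Unit =
    startedTasks.incrementAndGet()
})

startedTasks.set(0)
df.limit(1).map(identity).collect()
println(s"limit before map started ${startedTasks.get()} tasks")  // ~1002 observed here

startedTasks.set(0)
df.map(identity).limit(1).collect()
println(s"limit after map started ${startedTasks.get()} tasks")   // a handful expected
{code}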



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins

2020-08-31 Thread huangtianhua (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188113#comment-17188113
 ] 

huangtianhua commented on SPARK-32691:
--

> testOnly *DistributedSuite -- -z "caching in memory and disk, replicated 
> (encryption = on) (with replication as stream)"
I tested only the case "caching in memory and disk, replicated (encryption = on) 
(with replication as stream)"; it does not fail every time.
I am sorry I can't fix this issue myself. The ARM Jenkins has been failing for a 
few days, so I have uploaded success.log and failure.log as attachments. If 
anybody can help analyze them, I can provide an arm64 instance if needed. 
Thanks all!

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
> Environment: ARM64
>Reporter: huangtianhua
>Priority: Major
> Attachments: failure.log, success.log
>
>
> Tests of org.apache.spark.DistributedSuite are failing on the arm64 Jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins

2020-08-31 Thread huangtianhua (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangtianhua updated SPARK-32691:
-
Attachment: failure.log

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
> Environment: ARM64
>Reporter: huangtianhua
>Priority: Major
> Attachments: failure.log, success.log
>
>
> Tests of org.apache.spark.DistributedSuite are failing on the arm64 Jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins

2020-08-31 Thread huangtianhua (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huangtianhua updated SPARK-32691:
-
Attachment: success.log

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
> Environment: ARM64
>Reporter: huangtianhua
>Priority: Major
> Attachments: failure.log, success.log
>
>
> Tests of org.apache.spark.DistributedSuite are failing on the arm64 Jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32715) Broadcast block pieces may memory leak

2020-08-31 Thread Lantao Jin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lantao Jin updated SPARK-32715:
---
Affects Version/s: 2.4.6
   3.0.0

> Broadcast block pieces may memory leak
> --
>
> Key: SPARK-32715
> URL: https://issues.apache.org/jira/browse/SPARK-32715
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.6, 3.0.0, 3.1.0
>Reporter: Lantao Jin
>Priority: Major
>
> We use the Spark Thrift Server as a long-running service. A bad query submitted a 
> heavy BroadcastNestedLoopJoin operation and drove the driver into full GC. We 
> killed the bad query, but the driver's memory usage stayed high and full GCs 
> kept occurring very frequently. By investigating the GC dump and log, we found 
> that the broadcast block pieces may leak memory.
> 2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure) 
> 2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc):
> 116G->112G(170G), 184.9121920 secs]
> [Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: 
> 116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)]
> num #instances #bytes class name
> --
> 1: 676531691 72035438432 [B
> 2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow
> 3: 99551 12018117568 [Ljava.lang.Object;
> 4: 26570 4349629040 [I
> 5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow;
> 6: 1708819 256299456 [C
> 7: 2338 179615208 [J
> 8: 1703669 54517408 java.lang.String
> 9: 103860 34896960 org.apache.spark.status.TaskDataWrapper
> 10: 177396 25545024 java.net.URI
> ...



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27951) ANSI SQL: NTH_VALUE function

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188105#comment-17188105
 ] 

Apache Spark commented on SPARK-27951:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/29604

> ANSI SQL: NTH_VALUE function
> 
>
> Key: SPARK-27951
> URL: https://issues.apache.org/jira/browse/SPARK-27951
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> ||Function||Return type||Description||
> |{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row|
> [https://www.postgresql.org/docs/8.4/functions-window.html]
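
Until a native implementation lands, the requested semantics can be approximated with existing window primitives; a rough spark-shell sketch (the data and column names are made up for illustration):
{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Approximate nth_value(value, 2) over a running frame: collect the frame into an
// array and take its 2nd element; element_at yields null when the frame is shorter.
val w = Window.partitionBy("grp").orderBy("ord")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val df = Seq(("a", 1, 10), ("a", 2, 20), ("a", 3, 30), ("b", 1, 40)).toDF("grp", "ord", "value")
df.withColumn("nth2", element_at(collect_list(col("value")).over(w), 2)).show()
{code}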



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27951) ANSI SQL: NTH_VALUE function

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188104#comment-17188104
 ] 

Apache Spark commented on SPARK-27951:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/29604

> ANSI SQL: NTH_VALUE function
> 
>
> Key: SPARK-27951
> URL: https://issues.apache.org/jira/browse/SPARK-27951
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Zhu, Lipeng
>Priority: Major
>
> ||Function||Return type||Description||
> |{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of the window frame (counting from 1); null if no such row|
> [https://www.postgresql.org/docs/8.4/functions-window.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32752) Alias breaks for interval typed literals

2020-08-31 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188103#comment-17188103
 ] 

Kent Yao commented on SPARK-32752:
--

We support too many forms of interval representation, and some of them 
conflict; I haven't found a way to separate them yet.

> Alias breaks for interval typed literals
> 
>
> Key: SPARK-32752
> URL: https://issues.apache.org/jira/browse/SPARK-32752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Cases we found:
> {code:java}
> +-- !query
> +select interval '1 day' as day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +no viable alternative at input 'as'(line 1, pos 24)
> +
> +== SQL ==
> +select interval '1 day' as day
> +^^^
> +
> +
> +-- !query
> +select interval '1 day' day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, 
> pos 16)
> +
> +== SQL ==
> +select interval '1 day' day
> +^^^
> +
> +
> +-- !query
> +select interval '1-2' year as y
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16)
> +
> +== SQL ==
> +select interval '1-2' year as y
> +^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32312) Upgrade Apache Arrow to 1.0.0

2020-08-31 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188079#comment-17188079
 ] 

Hyukjin Kwon commented on SPARK-32312:
--

[~bryanc] just out of curiosity, are you still working on this?

> Upgrade Apache Arrow to 1.0.0
> -
>
> Key: SPARK-32312
> URL: https://issues.apache.org/jira/browse/SPARK-32312
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.0.0
>Reporter: Bryan Cutler
>Priority: Major
>
> Apache Arrow will soon release v1.0.0, which provides backward/forward 
> compatibility guarantees as well as a number of fixes and improvements. This 
> will upgrade the Java artifact and the PySpark API. Although PySpark will not 
> need special changes, it might be a good idea to bump up the minimum supported 
> version and update CI testing accordingly.
> TBD: list of important improvements and fixes



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32721) Simplify if clauses with null and boolean

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188075#comment-17188075
 ] 

Apache Spark commented on SPARK-32721:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29603

> Simplify if clauses with null and boolean
> -
>
> Key: SPARK-32721
> URL: https://issues.apache.org/jira/browse/SPARK-32721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> The following if clause:
> {code:sql}
> if(p, null, false) 
> {code}
> can be simplified to:
> {code:sql}
> and(p, null)
> {code}
> And similarly, the following clause:
> {code:sql}
> if(p, null, true)
> {code}
> can be simplified to:
> {code:sql}
> or(not(p), null)
> {code}
> iff predicate {{p}} is deterministic, i.e., can be evaluated to either true 
> or false, but not null.
> {{and}} and {{or}} clauses are more optimization friendly. For instance, by 
> converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can 
> potentially push the filter {{col > 42}} down to data sources to avoid 
> unnecessary IO.
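
A quick sanity check of the claimed equivalence for a deterministic predicate, sketched for spark-shell (the column name is made up):
{code:scala}
// For p = (col > 42), the IF form and the AND form should agree on every row;
// <=> is null-safe equality, so rows where both sides are null still count as equal.
val df = spark.range(100).toDF("col")
val disagreements = df.selectExpr(
    "IF(col > 42, CAST(NULL AS BOOLEAN), false) AS a",
    "col > 42 AND CAST(NULL AS BOOLEAN) AS b")
  .where("NOT (a <=> b)")
  .count()
assert(disagreements == 0)
{code}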



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32721) Simplify if clauses with null and boolean

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188073#comment-17188073
 ] 

Apache Spark commented on SPARK-32721:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/29603

> Simplify if clauses with null and boolean
> -
>
> Key: SPARK-32721
> URL: https://issues.apache.org/jira/browse/SPARK-32721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> The following if clause:
> {code:sql}
> if(p, null, false) 
> {code}
> can be simplified to:
> {code:sql}
> and(p, null)
> {code}
> And similarly, the following clause:
> {code:sql}
> if(p, null, true)
> {code}
> can be simplified to:
> {code:sql}
> or(not(p), null)
> {code}
> iff predicate {{p}} is deterministic, i.e., can be evaluated to either true 
> or false, but not null.
> {{and}} and {{or}} clauses are more optimization friendly. For instance, by 
> converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can 
> potentially push the filter {{col > 42}} down to data sources to avoid 
> unnecessary IO.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32750) Add code-gen for SortAggregateExec

2020-08-31 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188071#comment-17188071
 ] 

Takeshi Yamamuro commented on SPARK-32750:
--

Yea, I don't remember the impl. details in the PR, but you might be able to 
refer to parts of them ;)

> Add code-gen for SortAggregateExec
> --
>
> Key: SPARK-32750
> URL: https://issues.apache.org/jira/browse/SPARK-32750
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> We have had codegen for hash aggregate (`HashAggregateExec`) for a long time, 
> but codegen for sort aggregate (`SortAggregateExec`) is still missing. Sort 
> aggregate is still useful for performance when (1) the data after aggregation 
> is still too big to fit in memory (both hash aggregate and object hash 
> aggregate need to spill), or (2) the user disables hash aggregate and object 
> hash aggregate by config to prefer sort aggregate, e.g. when the aggregate 
> comes after a sort merge join and needs no extra sort at all.
> Creating this Jira to add codegen support for sort aggregate.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32728) Using groupby with rand creates different values when joining table with itself

2020-08-31 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-32728.
--
Resolution: Invalid

> Using groupby with rand creates different values when joining table with 
> itself
> ---
>
> Key: SPARK-32728
> URL: https://issues.apache.org/jira/browse/SPARK-32728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
> Environment: I tested it with Azure Databricks 7.2 (& 6.6) (includes 
> Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11))
> Worker type: Standard_DS3_v2 (2 workers)
>  
>Reporter: Joachim Bargsten
>Priority: Minor
>
> When running the following query in a Python 3 notebook on a cluster with 
> *multiple workers (>1)*, the result is not 0.0, even though I would expect it 
> to be.
> {code:java}
> import pyspark.sql.functions as F
> sdf = spark.range(100)
> sdf = (
> sdf.withColumn("a", F.col("id") + 1)
> .withColumn("b", F.col("id") + 2)
> .withColumn("c", F.col("id") + 3)
> .withColumn("d", F.col("id") + 4)
> .withColumn("e", F.col("id") + 5)
> )
> sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
> sdf = sdf.withColumn("x", F.rand() * F.col("e"))
> sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
> sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - 
> F.col("xx"))).agg(F.sum("delta_x"))
> sum_delta_x = sdf2.head()[0]
> print(f"{sum_delta_x} should be 0.0")
> assert abs(sum_delta_x) < 0.001
> {code}
> If the groupby statement is commented out, the code is working as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32728) Using groupby with rand creates different values when joining table with itself

2020-08-31 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188070#comment-17188070
 ] 

Takeshi Yamamuro commented on SPARK-32728:
--

+1 on the Sean comment, so I'll close this.

> Using groupby with rand creates different values when joining table with 
> itself
> ---
>
> Key: SPARK-32728
> URL: https://issues.apache.org/jira/browse/SPARK-32728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
> Environment: I tested it with Azure Databricks 7.2 (& 6.6) (includes 
> Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11))
> Worker type: Standard_DS3_v2 (2 workers)
>  
>Reporter: Joachim Bargsten
>Priority: Minor
>
> When running the following query in a Python 3 notebook on a cluster with 
> *multiple workers (>1)*, the result is not 0.0, even though I would expect it 
> to be.
> {code:java}
> import pyspark.sql.functions as F
> sdf = spark.range(100)
> sdf = (
> sdf.withColumn("a", F.col("id") + 1)
> .withColumn("b", F.col("id") + 2)
> .withColumn("c", F.col("id") + 3)
> .withColumn("d", F.col("id") + 4)
> .withColumn("e", F.col("id") + 5)
> )
> sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
> sdf = sdf.withColumn("x", F.rand() * F.col("e"))
> sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
> sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - 
> F.col("xx"))).agg(F.sum("delta_x"))
> sum_delta_x = sdf2.head()[0]
> print(f"{sum_delta_x} should be 0.0")
> assert abs(sum_delta_x) < 0.001
> {code}
> If the groupby statement is commented out, the code works as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property

2020-08-31 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188057#comment-17188057
 ] 

L. C. Hsieh commented on SPARK-32693:
-

Ok. Thanks [~maropu]

> Compare two dataframes with same schema except nullable property
> 
>
> Key: SPARK-32693
> URL: https://issues.apache.org/jira/browse/SPARK-32693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.0
>Reporter: david bernuau
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> My aim is to compare two dataframes with very close schemas: same number of 
> fields, with the same names, types and metadata. The only difference comes 
> from the fact that a given field might be nullable in one dataframe and not 
> in the other.
> Here is the code that I used:
> {code:java}
> val session = SparkSession.builder().getOrCreate()
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import java.sql.Timestamp
> import scala.collection.JavaConverters._
> case class A(g: Timestamp, h: Option[Timestamp], i: Int)
> case class B(e: Int, f: Seq[A])
> case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
> case class D(e: Option[Int], f: Seq[C])
> val schema1 = StructType(Array(StructField("a", IntegerType, false), 
> StructField("b", IntegerType, false), StructField("c", IntegerType, false)))
> val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
> val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
> df1.printSchema()
> val schema2 = StructType(Array(StructField("a", IntegerType), 
> StructField("b", IntegerType), StructField("c", IntegerType)))
> val rowSeq2: List[Row] = List(Row(10, 1, 1))
> val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
> df2.printSchema()
> println(s"Number of records for first case : ${df1.except(df2).count()}")
> val schema3 = StructType(
>   Array(
>     StructField("a", IntegerType, false),
>     StructField("b", IntegerType, false),
>     StructField("c", IntegerType, false),
>     StructField("d", ArrayType(StructType(Array(
>       StructField("e", IntegerType, false),
>       StructField("f", ArrayType(StructType(Array(
>         StructField("g", TimestampType),
>         StructField("h", TimestampType),
>         StructField("i", IntegerType, false)))))))))
>   )
> )
> val date1 = new Timestamp(1597589638L)
> val date2 = new Timestamp(1597599638L)
> val rowSeq3: List[Row] = List(
>   Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 1))))),
>   Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))))))
> val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
> df3.printSchema()
> val schema4 = StructType(
>   Array(
>     StructField("a", IntegerType),
>     StructField("b", IntegerType),
>     StructField("c", IntegerType),
>     StructField("d", ArrayType(StructType(Array(
>       StructField("e", IntegerType),
>       StructField("f", ArrayType(StructType(Array(
>         StructField("g", TimestampType),
>         StructField("h", TimestampType),
>         StructField("i", IntegerType)))))))))
>   )
> )
> val rowSeq4: List[Row] = List(
>   Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, None, Some(1)))))))
> val df4 = session.createDataFrame(rowSeq4.asJava, schema4)
> df4.printSchema()
> println(s"Number of records for second case : ${df3.except(df4).count()}")
> {code}
> The preceding code shows what seems to me to be a bug in Spark:
>  * If you consider two dataframes (df1 and df2) with exactly the same 
> schema, except that fields are not nullable in the first dataframe and are 
> nullable in the second, then df1.except(df2).count() works well.
>  * Now, if you consider two other dataframes (df3 and df4) with the same 
> schema (fields nullable on one side and not on the other) that also contain 
> nested fields, then the action df3.except(df4).count gives the following 
> exception: 
> java.lang.IllegalArgumentException: requirement failed: Join keys from two 
> sides should have same types
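
A possible workaround on versions without the fix (not from this thread) is to normalize 
nullability on one side before calling except. A minimal PySpark sketch, assuming the two 
dataframes differ only in nullability and that a SparkSession named spark is in scope:

{code:java}
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType

def align_nullability(df: DataFrame, target_schema: StructType) -> DataFrame:
    # Re-applying an explicit schema (nested fields included) forces the
    # nullability flags of target_schema onto df.
    return spark.createDataFrame(df.rdd, target_schema)

# Usage sketch: make df4's nullability match df3 before the set difference.
# df4_aligned = align_nullability(df4, df3.schema)
# print(df3.subtract(df4_aligned).count())
{code}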



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property

2020-08-31 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188055#comment-17188055
 ] 

Takeshi Yamamuro commented on SPARK-32693:
--

I changed 3.0.2 to 3.0.1 in "Fix Version/s" because the RC3 vote for v3.0.1 
seemed to fail: 
[http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Spark-3-0-1-RC3-td30092.html]

> Compare two dataframes with same schema except nullable property
> 
>
> Key: SPARK-32693
> URL: https://issues.apache.org/jira/browse/SPARK-32693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.0
>Reporter: david bernuau
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> My aim is to compare two dataframes with very close schemas: same number of 
> fields, with the same names, types and metadata. The only difference comes 
> from the fact that a given field might be nullable in one dataframe and not 
> in the other.
> Here is the code that I used:
> {code:java}
> val session = SparkSession.builder().getOrCreate()
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import java.sql.Timestamp
> import scala.collection.JavaConverters._
> case class A(g: Timestamp, h: Option[Timestamp], i: Int)
> case class B(e: Int, f: Seq[A])
> case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
> case class D(e: Option[Int], f: Seq[C])
> val schema1 = StructType(Array(StructField("a", IntegerType, false), 
> StructField("b", IntegerType, false), StructField("c", IntegerType, false)))
> val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
> val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
> df1.printSchema()
> val schema2 = StructType(Array(StructField("a", IntegerType), 
> StructField("b", IntegerType), StructField("c", IntegerType)))
> val rowSeq2: List[Row] = List(Row(10, 1, 1))
> val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
> df2.printSchema()
> println(s"Number of records for first case : ${df1.except(df2).count()}")
> val schema3 = StructType(
>   Array(
>     StructField("a", IntegerType, false),
>     StructField("b", IntegerType, false),
>     StructField("c", IntegerType, false),
>     StructField("d", ArrayType(StructType(Array(
>       StructField("e", IntegerType, false),
>       StructField("f", ArrayType(StructType(Array(
>         StructField("g", TimestampType),
>         StructField("h", TimestampType),
>         StructField("i", IntegerType, false)))))))))
>   )
> )
> val date1 = new Timestamp(1597589638L)
> val date2 = new Timestamp(1597599638L)
> val rowSeq3: List[Row] = List(
>   Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 1))))),
>   Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))))))
> val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
> df3.printSchema()
> val schema4 = StructType(
>   Array(
>     StructField("a", IntegerType),
>     StructField("b", IntegerType),
>     StructField("c", IntegerType),
>     StructField("d", ArrayType(StructType(Array(
>       StructField("e", IntegerType),
>       StructField("f", ArrayType(StructType(Array(
>         StructField("g", TimestampType),
>         StructField("h", TimestampType),
>         StructField("i", IntegerType)))))))))
>   )
> )
> val rowSeq4: List[Row] = List(
>   Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, None, Some(1)))))))
> val df4 = session.createDataFrame(rowSeq4.asJava, schema4)
> df4.printSchema()
> println(s"Number of records for second case : ${df3.except(df4).count()}")
> {code}
> The preceding code shows what seems to me to be a bug in Spark:
>  * If you consider two dataframes (df1 and df2) with exactly the same 
> schema, except that fields are not nullable in the first dataframe and are 
> nullable in the second, then df1.except(df2).count() works well.
>  * Now, if you consider two other dataframes (df3 and df4) with the same 
> schema (fields nullable on one side and not on the other) that also contain 
> nested fields, then the action df3.except(df4).count gives the following 
> exception: 
> java.lang.IllegalArgumentException: requirement failed: Join keys from two 
> sides should have same types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32693) Compare two dataframes with same schema except nullable property

2020-08-31 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-32693:
-
Fix Version/s: (was: 3.0.2)
   3.0.1

> Compare two dataframes with same schema except nullable property
> 
>
> Key: SPARK-32693
> URL: https://issues.apache.org/jira/browse/SPARK-32693
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 3.1.0
>Reporter: david bernuau
>Assignee: L. C. Hsieh
>Priority: Minor
> Fix For: 2.4.7, 3.0.1, 3.1.0
>
>
> My aim is to compare two dataframes with very close schemas: same number of 
> fields, with the same names, types and metadata. The only difference comes 
> from the fact that a given field might be nullable in one dataframe and not 
> in the other.
> Here is the code that I used:
> {code:java}
> val session = SparkSession.builder().getOrCreate()
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types._
> import java.sql.Timestamp
> import scala.collection.JavaConverters._
> case class A(g: Timestamp, h: Option[Timestamp], i: Int)
> case class B(e: Int, f: Seq[A])
> case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
> case class D(e: Option[Int], f: Seq[C])
> val schema1 = StructType(Array(StructField("a", IntegerType, false), 
> StructField("b", IntegerType, false), StructField("c", IntegerType, false)))
> val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
> val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
> df1.printSchema()
> val schema2 = StructType(Array(StructField("a", IntegerType), 
> StructField("b", IntegerType), StructField("c", IntegerType)))
> val rowSeq2: List[Row] = List(Row(10, 1, 1))
> val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
> df2.printSchema()
> println(s"Number of records for first case : ${df1.except(df2).count()}")
> val schema3 = StructType(
>   Array(
>     StructField("a", IntegerType, false),
>     StructField("b", IntegerType, false),
>     StructField("c", IntegerType, false),
>     StructField("d", ArrayType(StructType(Array(
>       StructField("e", IntegerType, false),
>       StructField("f", ArrayType(StructType(Array(
>         StructField("g", TimestampType),
>         StructField("h", TimestampType),
>         StructField("i", IntegerType, false)))))))))
>   )
> )
> val date1 = new Timestamp(1597589638L)
> val date2 = new Timestamp(1597599638L)
> val rowSeq3: List[Row] = List(
>   Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 1))))),
>   Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))))))
> val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
> df3.printSchema()
> val schema4 = StructType(
>   Array(
>     StructField("a", IntegerType),
>     StructField("b", IntegerType),
>     StructField("c", IntegerType),
>     StructField("d", ArrayType(StructType(Array(
>       StructField("e", IntegerType),
>       StructField("f", ArrayType(StructType(Array(
>         StructField("g", TimestampType),
>         StructField("h", TimestampType),
>         StructField("i", IntegerType)))))))))
>   )
> )
> val rowSeq4: List[Row] = List(
>   Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, None, Some(1)))))))
> val df4 = session.createDataFrame(rowSeq4.asJava, schema4)
> df4.printSchema()
> println(s"Number of records for second case : ${df3.except(df4).count()}")
> {code}
> The preceding code shows what seems to me to be a bug in Spark:
>  * If you consider two dataframes (df1 and df2) with exactly the same 
> schema, except that fields are not nullable in the first dataframe and are 
> nullable in the second, then df1.except(df2).count() works well.
>  * Now, if you consider two other dataframes (df3 and df4) with the same 
> schema (fields nullable on one side and not on the other) that also contain 
> nested fields, then the action df3.except(df4).count gives the following 
> exception: 
> java.lang.IllegalArgumentException: requirement failed: Join keys from two 
> sides should have same types



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32721) Simplify if clauses with null and boolean

2020-08-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-32721.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29567
[https://github.com/apache/spark/pull/29567]

> Simplify if clauses with null and boolean
> -
>
> Key: SPARK-32721
> URL: https://issues.apache.org/jira/browse/SPARK-32721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
> Fix For: 3.1.0
>
>
> The following if clause:
> {code:sql}
> if(p, null, false) 
> {code}
> can be simplified to:
> {code:sql}
> and(p, null)
> {code}
> And similarly, the following clause:
> {code:sql}
> if(p, null, true)
> {code}
> can be simplified to:
> {code:sql}
> or(not(p), null)
> {code}
> iff predicate {{p}} is deterministic, i.e., can be evaluated to either true 
> or false, but not null.
> {{and}} and {{or}} clauses are more optimization friendly. For instance, by 
> converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can 
> potentially push the filter {{col > 42}} down to data sources to avoid 
> unnecessary IO.
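
A quick way to sanity-check the proposed rewrites is to evaluate both forms over a predicate 
that is deterministic and never null. A minimal PySpark sketch (assuming a local Spark 3.x 
session; not part of the ticket):

{code:java}
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[2]").appName("if-null-rewrite-check").getOrCreate()

df = spark.range(100).selectExpr(
    "id",
    "if(id > 42, null, false) AS if_false",  # original form
    "id > 42 AND null AS and_form",          # proposed rewrite
    "if(id > 42, null, true) AS if_true",
    "NOT (id > 42) OR null AS or_form",      # proposed rewrite
)

# `id > 42` is deterministic and never null, so each pair must agree row by row
# (`<=>` is Spark's null-safe equality).
assert df.filter("NOT (if_false <=> and_form)").count() == 0
assert df.filter("NOT (if_true <=> or_form)").count() == 0
{code}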



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32721) Simplify if clauses with null and boolean

2020-08-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-32721:
---

Assignee: Chao Sun

> Simplify if clauses with null and boolean
> -
>
> Key: SPARK-32721
> URL: https://issues.apache.org/jira/browse/SPARK-32721
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>
> The following if clause:
> {code:sql}
> if(p, null, false) 
> {code}
> can be simplified to:
> {code:sql}
> and(p, null)
> {code}
> And similarly, the following clause:
> {code:sql}
> if(p, null, true)
> {code}
> can be simplified to:
> {code:sql}
> or(not(p), null)
> {code}
> iff predicate {{p}} is deterministic, i.e., can be evaluated to either true 
> or false, but not null.
> {{and}} and {{or}} clauses are more optimization friendly. For instance, by 
> converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can 
> potentially push the filter {{col > 42}} down to data sources to avoid 
> unnecessary IO.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars

2020-08-31 Thread Luca Canali (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187980#comment-17187980
 ] 

Luca Canali commented on SPARK-32119:
-

I believe it would be good to have this fix also in the upcoming Spark 3.0.1, 
if possible. Any thoughts?

> ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
> --
>
> Key: SPARK-32119
> URL: https://issues.apache.org/jira/browse/SPARK-32119
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.1, 3.1.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Major
> Fix For: 3.1.0
>
>
> ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes
>  when a jar that contains the plugins, and files used by the plugins, are 
> added with the --jars and --files options of spark-submit.
> This is because jars and files added by --jars and --files are not loaded on 
> Executor initialization.
>  I confirmed it works with YARN because jars/files are distributed there via 
> the distributed cache.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32624) Replace getClass.getName with getClass.getCanonicalName in CodegenContext.addReferenceObj

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187974#comment-17187974
 ] 

Apache Spark commented on SPARK-32624:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/29602

> Replace getClass.getName with getClass.getCanonicalName in 
> CodegenContext.addReferenceObj
> -
>
> Key: SPARK-32624
> URL: https://issues.apache.org/jira/browse/SPARK-32624
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> {code:java}
> scala> Array[Byte](1, 2).getClass.getName
> res13: String = [B
> scala> Array[Byte](1, 2).getClass.getCanonicalName
> res14: String = byte[]
> {code}
> {{[B}} is not a valid Java type name in source code. We should use {{byte[]}}. 
> Otherwise we will hit a compile issue:
> {noformat}
> 20:49:54.885 ERROR 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: 
> ...
> /* 029 */ if (!isNull_2) {
> /* 030 */   value_1 = 
> org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
> references[0] /* min */)) >= 0 && 
> org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) 
> references[1] /* max */)) <= 0 ).mightContainBinary(value_2);
> ...
> 20:49:54.886 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr 
> codegen error and falling back to interpreter mode
> java.util.concurrent.ExecutionException: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 30, Column 81: failed to compile: 
> org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 
> 30, Column 81: Unexpected token "[" in primary
>   at 
> com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306)
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13

2020-08-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-32756:
-
Priority: Minor  (was: Major)

> Fix CaseInsensitiveMap in Scala 2.13
> 
>
> Key: SPARK-32756
> URL: https://issues.apache.org/jira/browse/SPARK-32756
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karol Chmist
>Priority: Minor
>
>  
> "Spark SQL" module doesn't compile in Scala 2.13:
> {code:java}
> [info] Compiling 26 Scala sources to 
> /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes...
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+(key.$minus$greater(value))
> [error] this.extraOptions += (key -> value)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+(key.$minus$greater(value))
> [error] this.extraOptions += (key -> value)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+("path".$minus$greater(path))
> [error] Error occurred in an application involving default arguments.
> [error] this.extraOptions += ("path" -> path)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317:
>  type mismatch;
> [error] found : Iterable[(String, String)]
> [error] required: java.util.Map[String,String]
> [error] Error occurred in an application involving default arguments.
> [error] val dsOptions = new CaseInsensitiveStringMap(options.asJava)
> [error] ^
> [info] Iterable[(String, String)] <: java.util.Map[String,String]?
> [info] false
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: DataFrameWriter.this.extraOptions = 
> DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns)))
> [error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY ->
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85:
>  type mismatch;
> [error] found : 
> scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField]
> [error] required: Map[String,OrcFiltersBase.this.OrcPrimitiveField]
> [error] CaseInsensitiveMap(dedupPrimitiveFields)
> [error] ^
> [info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] 
> <: Map[String,OrcFiltersBase.this.OrcPrimitiveField]?
> [info] false
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64:
>  type mismatch;
> [error] found : Iterable[(String, String)]
> [error] required: java.util.Map[String,String]
> [error] new CaseInsensitiveStringMap(withoutPath.asJava)
> [error] ^
> [info] Iterable[(String, String)] <: java.util.Map[String,String]?
> [error] 7 errors found{code}
>  
> The + function in CaseInsensitiveStringMap is missing, resulting in {{+}} from 
> the standard Map being used, which 

[jira] [Commented] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13

2020-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187934#comment-17187934
 ] 

Sean R. Owen commented on SPARK-32756:
--

Meh, maybe not. We aren't going to make this change for 3.0.x. OK leave it

> Fix CaseInsensitiveMap in Scala 2.13
> 
>
> Key: SPARK-32756
> URL: https://issues.apache.org/jira/browse/SPARK-32756
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karol Chmist
>Priority: Major
>
>  
> "Spark SQL" module doesn't compile in Scala 2.13:
> {code:java}
> [info] Compiling 26 Scala sources to 
> /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes...
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+(key.$minus$greater(value))
> [error] this.extraOptions += (key -> value)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+(key.$minus$greater(value))
> [error] this.extraOptions += (key -> value)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+("path".$minus$greater(path))
> [error] Error occurred in an application involving default arguments.
> [error] this.extraOptions += ("path" -> path)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317:
>  type mismatch;
> [error] found : Iterable[(String, String)]
> [error] required: java.util.Map[String,String]
> [error] Error occurred in an application involving default arguments.
> [error] val dsOptions = new CaseInsensitiveStringMap(options.asJava)
> [error] ^
> [info] Iterable[(String, String)] <: java.util.Map[String,String]?
> [info] false
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: DataFrameWriter.this.extraOptions = 
> DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns)))
> [error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY ->
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85:
>  type mismatch;
> [error] found : 
> scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField]
> [error] required: Map[String,OrcFiltersBase.this.OrcPrimitiveField]
> [error] CaseInsensitiveMap(dedupPrimitiveFields)
> [error] ^
> [info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] 
> <: Map[String,OrcFiltersBase.this.OrcPrimitiveField]?
> [info] false
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64:
>  type mismatch;
> [error] found : Iterable[(String, String)]
> [error] required: java.util.Map[String,String]
> [error] new CaseInsensitiveStringMap(withoutPath.asJava)
> [error] ^
> [info] Iterable[(String, String)] <: java.util.Map[String,String]?
> [error] 7 errors found{code}
>  
> The + function in 

[jira] [Commented] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13

2020-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187933#comment-17187933
 ] 

Sean R. Owen commented on SPARK-32756:
--

This should just be a followup for SPARK-32364

> Fix CaseInsensitiveMap in Scala 2.13
> 
>
> Key: SPARK-32756
> URL: https://issues.apache.org/jira/browse/SPARK-32756
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karol Chmist
>Priority: Major
>
>  
> "Spark SQL" module doesn't compile in Scala 2.13:
> {code:java}
> [info] Compiling 26 Scala sources to 
> /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes...
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+(key.$minus$greater(value))
> [error] this.extraOptions += (key -> value)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+(key.$minus$greater(value))
> [error] this.extraOptions += (key -> value)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: this.extraOptions = 
> this.extraOptions.+("path".$minus$greater(path))
> [error] Error occurred in an application involving default arguments.
> [error] this.extraOptions += ("path" -> path)
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317:
>  type mismatch;
> [error] found : Iterable[(String, String)]
> [error] required: java.util.Map[String,String]
> [error] Error occurred in an application involving default arguments.
> [error] val dsOptions = new CaseInsensitiveStringMap(options.asJava)
> [error] ^
> [info] Iterable[(String, String)] <: java.util.Map[String,String]?
> [info] false
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412:
>  value += is not a member of 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] Expression does not convert to assignment because:
> [error] type mismatch;
> [error] found : scala.collection.immutable.Map[String,String]
> [error] required: 
> org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
> [error] expansion: DataFrameWriter.this.extraOptions = 
> DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns)))
> [error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY ->
> [error] ^
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85:
>  type mismatch;
> [error] found : 
> scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField]
> [error] required: Map[String,OrcFiltersBase.this.OrcPrimitiveField]
> [error] CaseInsensitiveMap(dedupPrimitiveFields)
> [error] ^
> [info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] 
> <: Map[String,OrcFiltersBase.this.OrcPrimitiveField]?
> [info] false
> [error] 
> /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64:
>  type mismatch;
> [error] found : Iterable[(String, String)]
> [error] required: java.util.Map[String,String]
> [error] new CaseInsensitiveStringMap(withoutPath.asJava)
> [error] ^
> [info] Iterable[(String, String)] <: java.util.Map[String,String]?
> [error] 7 errors found{code}
>  
> The + function in CaseInsensitiveStringMap is missing, 

[jira] [Commented] (SPARK-32728) Using groupby with rand creates different values when joining table with itself

2020-08-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187932#comment-17187932
 ] 

Sean R. Owen commented on SPARK-32728:
--

I think this is to be expected: you are going to get different values every 
time x is evaluated. It is not somehow fixed at the time you declare the 
column. I would expect it, however, to work as you are expecting if you 
materialize it with cache() + count() for example. Even then there are 
circumstances where it's reevaluated.
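
A minimal PySpark sketch of that suggestion (continuing from the grouped {{sdf}} in the 
reproduction quoted below; cached blocks can still be evicted and recomputed, so this is a 
mitigation rather than a guarantee):

{code:java}
import pyspark.sql.functions as F

# Materialize the random column once before the self-join.
sdf = sdf.withColumn("x", F.rand() * F.col("e")).cache()
sdf.count()  # forces evaluation so both sides of the join reuse the cached x values

sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
sum_delta_x = sdf2.select(F.sum(F.abs(F.col("x") - F.col("xx")))).head()[0]
print(f"{sum_delta_x} should now be 0.0 while the cached blocks remain in memory")
{code}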

> Using groupby with rand creates different values when joining table with 
> itself
> ---
>
> Key: SPARK-32728
> URL: https://issues.apache.org/jira/browse/SPARK-32728
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0
> Environment: I tested it with Azure Databricks 7.2 (& 6.6) (includes 
> Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11))
> Worker type: Standard_DS3_v2 (2 workers)
>  
>Reporter: Joachim Bargsten
>Priority: Minor
>
> When running the following query in a Python 3 notebook on a cluster with 
> *multiple workers (>1)*, the result is not 0.0, even though I would expect it 
> to be.
> {code:java}
> import pyspark.sql.functions as F
> sdf = spark.range(100)
> sdf = (
> sdf.withColumn("a", F.col("id") + 1)
> .withColumn("b", F.col("id") + 2)
> .withColumn("c", F.col("id") + 3)
> .withColumn("d", F.col("id") + 4)
> .withColumn("e", F.col("id") + 5)
> )
> sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e"))
> sdf = sdf.withColumn("x", F.rand() * F.col("e"))
> sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"])
> sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - 
> F.col("xx"))).agg(F.sum("delta_x"))
> sum_delta_x = sdf2.head()[0]
> print(f"{sum_delta_x} should be 0.0")
> assert abs(sum_delta_x) < 0.001
> {code}
> If the groupby statement is commented out, the code works as expected.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32757) Physical InSubqueryExec should be consistent with logical InSubquery

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32757:


Assignee: Wenchen Fan  (was: Apache Spark)

> Physical InSubqueryExec should be consistent with logical InSubquery
> 
>
> Key: SPARK-32757
> URL: https://issues.apache.org/jira/browse/SPARK-32757
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32757) Physical InSubqueryExec should be consistent with logical InSubquery

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187929#comment-17187929
 ] 

Apache Spark commented on SPARK-32757:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29601

> Physical InSubqueryExec should be consistent with logical InSubquery
> 
>
> Key: SPARK-32757
> URL: https://issues.apache.org/jira/browse/SPARK-32757
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32757) Physical InSubqueryExec should be consistent with logical InSubquery

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32757:


Assignee: Apache Spark  (was: Wenchen Fan)

> Physical InSubqueryExec should be consistent with logical InSubquery
> 
>
> Key: SPARK-32757
> URL: https://issues.apache.org/jira/browse/SPARK-32757
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32757) Physical InSubqueryExec should be consistent with logical InSubquery

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187931#comment-17187931
 ] 

Apache Spark commented on SPARK-32757:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/29601

> Physical InSubqueryExec should be consistent with logical InSubquery
> 
>
> Key: SPARK-32757
> URL: https://issues.apache.org/jira/browse/SPARK-32757
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32752) Alias breaks for interval typed literals

2020-08-31 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187928#comment-17187928
 ] 

Wenchen Fan commented on SPARK-32752:
-

[~Qin Yao] do you have any idea how to fix it?

> Alias breaks for interval typed literals
> 
>
> Key: SPARK-32752
> URL: https://issues.apache.org/jira/browse/SPARK-32752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Cases we found:
> {code:java}
> +-- !query
> +select interval '1 day' as day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +no viable alternative at input 'as'(line 1, pos 24)
> +
> +== SQL ==
> +select interval '1 day' as day
> +^^^
> +
> +
> +-- !query
> +select interval '1 day' day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, 
> pos 16)
> +
> +== SQL ==
> +select interval '1 day' day
> +^^^
> +
> +
> +-- !query
> +select interval '1-2' year as y
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16)
> +
> +== SQL ==
> +select interval '1-2' year as y
> +^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32757) Physical InSubqueryExec should be consistent with logical InSubquery

2020-08-31 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-32757:
---

 Summary: Physical InSubqueryExec should be consistent with logical 
InSubquery
 Key: SPARK-32757
 URL: https://issues.apache.org/jira/browse/SPARK-32757
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187902#comment-17187902
 ] 

Apache Spark commented on SPARK-32317:
--

User 'izchen' has created a pull request for this issue:
https://github.com/apache/spark/pull/29600

> Parquet file loading with different schema(Decimal(N, P)) in files is not 
> working as expected
> -
>
> Key: SPARK-32317
> URL: https://issues.apache.org/jira/browse/SPARK-32317
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: It's failing in all environments that I tried.
>Reporter: Krish
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi,
>  
> We generate parquet files that are partitioned on Date on a daily basis, and 
> we sometimes send updates to historical data. What we noticed is that, due to 
> a configuration error, the patch data schema is inconsistent with the earlier 
> files.
> Assume we have files generated with ID and AMOUNT as fields. Historical data 
> has a schema like ID INT, AMOUNT DECIMAL(15,6), and the files we send as 
> updates have AMOUNT DECIMAL(15,2). 
>  
> With two different schemas in a Date partition, when we load the data of a 
> Date into Spark, the data loads but the amount values get manipulated.
>  
> file1.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,6)
>  Content:
>  1,19500.00
>  2,198.34
> file2.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,2)
>  Content:
>  1,19500.00
>  3,198.34
> Load these two files together:
> df3 = spark.read.parquet("output/")
> df3.show()  # we can see the amount getting manipulated here
> +---+----------+
> | ID|    AMOUNT|
> +---+----------+
> |  1|      1.95|
> |  3|  0.019834|
> |  1|  19500.00|
> |  2|    198.34|
> +---+----------+
> Options tried:
> We tried to give the schema as String for all fields, but that didn't work:
> df3 = spark.read.format("parquet").schema(schema).load("output/")
> Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet 
> column cannot be converted in file file*.snappy.parquet. Column: 
> [AMOUNT], Expected: string, Found: INT64"
>  
> I know schema merging works if it finds a few extra columns in one file, but 
> the fields that the files have in common need to have the same schema. That 
> might not work here.
>  
> I am looking for a workaround here, or if there is an option I haven't tried, 
> please point me to it.
>  
> With schema merging I got the error below:
> An error occurred while calling o2272.parquet. : 
> org.apache.spark.SparkException: Failed merging schema: root |-- ID: string 
> (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100)
>  at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95)
>  at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107)
>  at 
> org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
>  at scala.Option.orElse(Option.scala:447) at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:225)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:206) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:674) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
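
Pending a proper fix, one possible workaround (not from the ticket; the file names and the 
(15,6) target type below are assumptions taken from the report) is to read the 
differently-scaled files separately and cast the decimal column to a single type before 
unioning. A PySpark sketch:

{code:java}
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
target_type = "decimal(15,6)"  # the widest scale seen across the files

# Reading each file on its own lets Parquet decode AMOUNT with that file's
# actual scale; the cast afterwards is then a plain value-preserving cast.
df_a = spark.read.parquet("output/file1.snappy.parquet")
df_b = spark.read.parquet("output/file2.snappy.parquet")

df = df_a.withColumn("AMOUNT", col("AMOUNT").cast(target_type)).unionByName(
    df_b.withColumn("AMOUNT", col("AMOUNT").cast(target_type)))
df.show()
{code}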

[jira] [Assigned] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32317:


Assignee: Apache Spark

> Parquet file loading with different schema(Decimal(N, P)) in files is not 
> working as expected
> -
>
> Key: SPARK-32317
> URL: https://issues.apache.org/jira/browse/SPARK-32317
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: It's failing in all environments that I tried.
>Reporter: Krish
>Assignee: Apache Spark
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi,
>  
> We generate parquet files that are partitioned on Date on a daily basis, and 
> we sometimes send updates to historical data. What we noticed is that, due to 
> a configuration error, the patch data schema is inconsistent with the earlier 
> files.
> Assume we have files generated with ID and AMOUNT as fields. Historical data 
> has a schema like ID INT, AMOUNT DECIMAL(15,6), and the files we send as 
> updates have AMOUNT DECIMAL(15,2). 
>  
> With two different schemas in a Date partition, when we load the data of a 
> Date into Spark, the data loads but the amount values get manipulated.
>  
> file1.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,6)
>  Content:
>  1,19500.00
>  2,198.34
> file2.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,2)
>  Content:
>  1,19500.00
>  3,198.34
> Load these two files together:
> df3 = spark.read.parquet("output/")
> df3.show()  # we can see the amount getting manipulated here
> +---+----------+
> | ID|    AMOUNT|
> +---+----------+
> |  1|      1.95|
> |  3|  0.019834|
> |  1|  19500.00|
> |  2|    198.34|
> +---+----------+
> Options tried:
> We tried to give the schema as String for all fields, but that didn't work:
> df3 = spark.read.format("parquet").schema(schema).load("output/")
> Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet 
> column cannot be converted in file file*.snappy.parquet. Column: 
> [AMOUNT], Expected: string, Found: INT64"
>  
> I know schema merging works if it finds a few extra columns in one file, but 
> the fields that the files have in common need to have the same schema. That 
> might not work here.
>  
> I am looking for a workaround here, or if there is an option I haven't tried, 
> please point me to it.
>  
> With schema merging I got the error below:
> An error occurred while calling o2272.parquet. : 
> org.apache.spark.SparkException: Failed merging schema: root |-- ID: string 
> (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100)
>  at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95)
>  at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107)
>  at 
> org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
>  at scala.Option.orElse(Option.scala:447) at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:225)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:206) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:674) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> 

[jira] [Assigned] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32317:


Assignee: (was: Apache Spark)

> Parquet file loading with different schema(Decimal(N, P)) in files is not 
> working as expected
> -
>
> Key: SPARK-32317
> URL: https://issues.apache.org/jira/browse/SPARK-32317
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: It's failing in all environments that I tried.
>Reporter: Krish
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi,
>  
> We generate parquet files that are partitioned on Date on a daily basis, and 
> we sometimes send updates to historical data. What we noticed is that, due to 
> a configuration error, the patch data schema is inconsistent with the earlier 
> files.
> Assume we have files generated with ID and AMOUNT as fields. Historical data 
> has a schema like ID INT, AMOUNT DECIMAL(15,6), and the files we send as 
> updates have AMOUNT DECIMAL(15,2). 
>  
> With two different schemas in a Date partition, when we load the data of a 
> Date into Spark, the data loads but the amount values get manipulated.
>  
> file1.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,6)
>  Content:
>  1,19500.00
>  2,198.34
> file2.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,2)
>  Content:
>  1,19500.00
>  3,198.34
> Load these two files together:
> df3 = spark.read.parquet("output/")
> df3.show()  # we can see the amount getting manipulated here
> +---+----------+
> | ID|    AMOUNT|
> +---+----------+
> |  1|      1.95|
> |  3|  0.019834|
> |  1|  19500.00|
> |  2|    198.34|
> +---+----------+
> Options tried:
> We tried to give the schema as String for all fields, but that didn't work:
> df3 = spark.read.format("parquet").schema(schema).load("output/")
> Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet 
> column cannot be converted in file file*.snappy.parquet. Column: 
> [AMOUNT], Expected: string, Found: INT64"
>  
> I know schema merging works if it finds a few extra columns in one file, but 
> the fields that the files have in common need to have the same schema. That 
> might not work here.
>  
> I am looking for a workaround here, or if there is an option I haven't tried, 
> please point me to it.
>  
> With schema merging I got the error below:
> An error occurred while calling o2272.parquet. : 
> org.apache.spark.SparkException: Failed merging schema: root |-- ID: string 
> (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100)
>  at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95)
>  at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107)
>  at 
> org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
>  at scala.Option.orElse(Option.scala:447) at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:225)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:206) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:674) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> at 
> 

[jira] [Commented] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187900#comment-17187900
 ] 

Apache Spark commented on SPARK-32317:
--

User 'izchen' has created a pull request for this issue:
https://github.com/apache/spark/pull/29600

> Parquet file loading with different schema(Decimal(N, P)) in files is not 
> working as expected
> -
>
> Key: SPARK-32317
> URL: https://issues.apache.org/jira/browse/SPARK-32317
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.0
> Environment: It's failing in all environments that I tried.
>Reporter: Krish
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Hi,
>  
> We generate parquet files that are partitioned on Date on a daily basis, and 
> we sometimes send updates to historical data. What we noticed is that, due to some 
> configuration error, the patch data schema is inconsistent with earlier files.
> Assuming we had files generated with a schema having ID and Amount as fields. 
> Historical data has a schema like ID INT, AMOUNT DECIMAL(15,6), and the 
> files we send as updates have a schema like AMOUNT DECIMAL(15,2). 
>  
> With two different schemas in a Date partition, when we load the data of 
> a Date into Spark, the data loads but the amount gets 
> manipulated.
>  
> file1.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,6)
>  Content:
>  1,19500.00
>  2,198.34
> file2.snappy.parquet
>  ID: INT
>  AMOUNT: DECIMAL(15,2)
>  Content:
>  1,19500.00
>  3,198.34
> Load these two files together:
> df3 = spark.read.parquet("output/")
> df3.show()  # we can see the amount getting manipulated here:
> +---+--------+
> | ID|  AMOUNT|
> +---+--------+
> |  1|    1.95|
> |  3|0.019834|
> |  1|19500.00|
> |  2|  198.34|
> +---+--------+
> Options Tried:
> We tried to give the schema as String for all fields, but that didn't work:
> df3 = spark.read.format("parquet").schema(schema).load("output/")
> Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet 
> column cannot be converted in file file*.snappy.parquet. Column: 
> [AMOUNT], Expected: string, Found: INT64"
>  
> I know schema merging works if it finds a few extra columns in one file, but the 
> fields that are in common need to have the same schema. That might not work 
> here.
>  
> Looking for a workaround here, or if there is an option that I 
> haven't tried, please point me to it.
>  
> With schema merging I got the below error:
> An error occurred while calling o2272.parquet. : 
> org.apache.spark.SparkException: Failed merging schema: root |-- ID: string 
> (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100)
>  at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95)
>  at 
> scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at 
> scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) 
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at 
> org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107)
>  at 
> org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69)
>  at scala.Option.orElse(Option.scala:447) at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82)
>  at 
> org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80)
>  at 
> org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141)
>  at 
> org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:225)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:206) at 
> org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:674) at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
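
A minimal workaround sketch for the report quoted above (an editorial illustration, not from the ticket; the paths follow the example files in the report, and it assumes the files can be grouped by their schema variant): read each variant separately so every file is decoded with its own footer schema, cast AMOUNT to one common decimal type, then union the results.

{code}
// Hedged sketch: each read decodes its file with the correct scale from the footer,
// so the unscaled values are interpreted correctly before the cast to a common type.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType

val part1 = spark.read.parquet("output/file1.snappy.parquet")
  .withColumn("AMOUNT", col("AMOUNT").cast(DecimalType(15, 6)))
val part2 = spark.read.parquet("output/file2.snappy.parquet")
  .withColumn("AMOUNT", col("AMOUNT").cast(DecimalType(15, 6)))

val df3 = part1.unionByName(part2)
df3.show()  // amounts keep their original values because each file was decoded with its own scale
{code}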

[jira] [Updated] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13

2020-08-31 Thread Karol Chmist (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karol Chmist updated SPARK-32756:
-
Description: 
 

"Spark SQL" module doesn't compile in Scala 2.13:

{code:java}
[info] Compiling 26 Scala sources to 
/home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes...
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121:
 value += is not a member of 
org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] Expression does not convert to assignment because:
[error] type mismatch;
[error] found : scala.collection.immutable.Map[String,String]
[error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] expansion: this.extraOptions = 
this.extraOptions.+(key.$minus$greater(value))
[error] this.extraOptions += (key -> value)
[error] ^
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132:
 value += is not a member of 
org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] Expression does not convert to assignment because:
[error] type mismatch;
[error] found : scala.collection.immutable.Map[String,String]
[error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] expansion: this.extraOptions = 
this.extraOptions.+(key.$minus$greater(value))
[error] this.extraOptions += (key -> value)
[error] ^
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294:
 value += is not a member of 
org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] Expression does not convert to assignment because:
[error] type mismatch;
[error] found : scala.collection.immutable.Map[String,String]
[error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] expansion: this.extraOptions = 
this.extraOptions.+("path".$minus$greater(path))
[error] Error occurred in an application involving default arguments.
[error] this.extraOptions += ("path" -> path)
[error] ^
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317:
 type mismatch;
[error] found : Iterable[(String, String)]
[error] required: java.util.Map[String,String]
[error] Error occurred in an application involving default arguments.
[error] val dsOptions = new CaseInsensitiveStringMap(options.asJava)
[error] ^
[info] Iterable[(String, String)] <: java.util.Map[String,String]?
[info] false
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412:
 value += is not a member of 
org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] Expression does not convert to assignment because:
[error] type mismatch;
[error] found : scala.collection.immutable.Map[String,String]
[error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] expansion: DataFrameWriter.this.extraOptions = 
DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns)))
[error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY ->
[error] ^
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85:
 type mismatch;
[error] found : 
scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField]
[error] required: Map[String,OrcFiltersBase.this.OrcPrimitiveField]
[error] CaseInsensitiveMap(dedupPrimitiveFields)
[error] ^
[info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] 
<: Map[String,OrcFiltersBase.this.OrcPrimitiveField]?
[info] false
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64:
 type mismatch;
[error] found : Iterable[(String, String)]
[error] required: java.util.Map[String,String]
[error] new CaseInsensitiveStringMap(withoutPath.asJava)
[error] ^
[info] Iterable[(String, String)] <: java.util.Map[String,String]?
[error] 7 errors found{code}
 

The {{+}} method is missing in CaseInsensitiveMap, so the {{+}} from the 
standard Map is used instead, which returns a Map instead of a CaseInsensitiveMap.
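
A minimal, self-contained sketch of the mechanism (an editorial illustration with made-up names, not Spark's actual classes): in Scala 2.13, a {{+}} inherited from the standard Map returns a plain Map, so the {{options += (key -> value)}} pattern only keeps compiling if the wrapper type defines its own {{+}} that preserves the wrapper type.

{code}
// Hypothetical stand-in for a case-insensitive map wrapper; not Spark code.
import scala.collection.immutable.Map

class LowerKeyMap(underlying: Map[String, String]) {
  // Defining `+` on the wrapper itself keeps the wrapper type, so
  // `options += (k -> v)` still type-checks against `var options: LowerKeyMap`.
  def +(kv: (String, String)): LowerKeyMap =
    new LowerKeyMap(underlying + (kv._1.toLowerCase -> kv._2))
  def get(k: String): Option[String] = underlying.get(k.toLowerCase)
}

object Demo extends App {
  var options = new LowerKeyMap(Map.empty)
  options += ("Path" -> "/tmp/data")   // desugars to options = options.+("Path" -> "/tmp/data")
  println(options.get("path"))         // Some(/tmp/data)
}
{code}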

  was:
[info] Compiling 26 Scala sources to 
/home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes...
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121:
 value += is not a member of 
org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] Expression does not convert to assignment because:
[error] type mismatch;
[error] found : scala.collection.immutable.Map[String,String]
[error] required: 

[jira] [Commented] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns

2020-08-31 Thread Andy Van Yperen-De Deyne (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187897#comment-17187897
 ] 

Andy Van Yperen-De Deyne commented on SPARK-18067:
--

I am facing this performance issue when using Spark 3.0.0. 
This topic was resolved on the 21st of May 2019; however, the PR on GitHub is 
closed... Is there any chance this solution is going to be merged into the master 
branch? Do you have a timeline for that? 

(Maybe this is not the correct location to ask this, in which case I apologize.)

> SortMergeJoin adds shuffle if join predicates have non partitioned columns
> --
>
> Key: SPARK-18067
> URL: https://issues.apache.org/jira/browse/SPARK-18067
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.1
>Reporter: Paul Jones
>Priority: Minor
>  Labels: bulk-closed
>
> Basic setup
> {code}
> scala> case class Data1(key: String, value1: Int)
> scala> case class Data2(key: String, value2: Int)
> scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x))
> .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache
> {code}
> Join on key
> {code}
> scala> partition1.join(partition2, "key").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [key#0], [key#12]
>:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we get a super efficient join with no shuffle.
> But if we add a filter our join gets less efficient and we end up with a 
> shuffle.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" === 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- SortMergeJoin [value1#1,key#0], [value2#13,key#12]
>:- Sort [value1#1 ASC,key#0 ASC], false, 0
>:  +- TungstenExchange hashpartitioning(value1#1,key#0,200), None
>: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>+- Sort [value2#13 ASC,key#12 ASC], false, 0
>   +- TungstenExchange hashpartitioning(value2#13,key#12,200), None
>  +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> And we can avoid the shuffle if use a filter statement that can't be pushed 
> in the join.
> {code}
> scala> partition1.join(partition2, "key").filter($"value1" >= 
> $"value2").explain
> == Physical Plan ==
> Project [key#0,value1#1,value2#13]
> +- Filter (value1#1 >= value2#13)
>+- SortMergeJoin [key#0], [key#12]
>   :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation 
> [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort 
> [key#0 ASC], false, 0, None
>   +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation 
> [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), 
> Sort [key#12 ASC], false, 0, None
> {code}
> What's the best way to avoid the filter pushdown here??



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13

2020-08-31 Thread Karol Chmist (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karol Chmist updated SPARK-32756:
-
Description: 
[info] Compiling 26 Scala sources to 
/home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes...
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121:
 value += is not a member of 
org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] Expression does not convert to assignment because:
[error] type mismatch;
[error] found : scala.collection.immutable.Map[String,String]
[error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] expansion: this.extraOptions = 
this.extraOptions.+(key.$minus$greater(value))
[error] this.extraOptions += (key -> value)
[error] ^
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132:
 value += is not a member of 
org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] Expression does not convert to assignment because:
[error] type mismatch;
[error] found : scala.collection.immutable.Map[String,String]
[error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] expansion: this.extraOptions = 
this.extraOptions.+(key.$minus$greater(value))
[error] this.extraOptions += (key -> value)
[error] ^
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294:
 value += is not a member of 
org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] Expression does not convert to assignment because:
[error] type mismatch;
[error] found : scala.collection.immutable.Map[String,String]
[error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] expansion: this.extraOptions = 
this.extraOptions.+("path".$minus$greater(path))
[error] Error occurred in an application involving default arguments.
[error] this.extraOptions += ("path" -> path)
[error] ^
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317:
 type mismatch;
[error] found : Iterable[(String, String)]
[error] required: java.util.Map[String,String]
[error] Error occurred in an application involving default arguments.
[error] val dsOptions = new CaseInsensitiveStringMap(options.asJava)
[error] ^
[info] Iterable[(String, String)] <: java.util.Map[String,String]?
[info] false
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412:
 value += is not a member of 
org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] Expression does not convert to assignment because:
[error] type mismatch;
[error] found : scala.collection.immutable.Map[String,String]
[error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String]
[error] expansion: DataFrameWriter.this.extraOptions = 
DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns)))
[error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY ->
[error] ^
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85:
 type mismatch;
[error] found : 
scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField]
[error] required: Map[String,OrcFiltersBase.this.OrcPrimitiveField]
[error] CaseInsensitiveMap(dedupPrimitiveFields)
[error] ^
[info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] 
<: Map[String,OrcFiltersBase.this.OrcPrimitiveField]?
[info] false
[error] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64:
 type mismatch;
[error] found : Iterable[(String, String)]
[error] required: java.util.Map[String,String]
[error] new CaseInsensitiveStringMap(withoutPath.asJava)
[error] ^
[info] Iterable[(String, String)] <: java.util.Map[String,String]?
[info] false
[warn] 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala:64:
 method without a parameter list overrides a method with a single empty one
[warn] def next = Identifier.of(namespace, rs.getString("TABLE_NAME"))
[warn] ^
[warn] one warning found
[error] 7 errors found

> Fix CaseInsensitiveMap in Scala 2.13
> 
>
> Key: SPARK-32756
> URL: https://issues.apache.org/jira/browse/SPARK-32756
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karol Chmist
>Priority: Major
>
> [info] Compiling 26 Scala sources 

[jira] [Assigned] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32756:


Assignee: (was: Apache Spark)

> Fix CaseInsensitiveMap in Scala 2.13
> 
>
> Key: SPARK-32756
> URL: https://issues.apache.org/jira/browse/SPARK-32756
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karol Chmist
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32756:


Assignee: Apache Spark

> Fix CaseInsensitiveMap in Scala 2.13
> 
>
> Key: SPARK-32756
> URL: https://issues.apache.org/jira/browse/SPARK-32756
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karol Chmist
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187865#comment-17187865
 ] 

Apache Spark commented on SPARK-32756:
--

User 'karolchmist' has created a pull request for this issue:
https://github.com/apache/spark/pull/29584

> Fix CaseInsensitiveMap in Scala 2.13
> 
>
> Key: SPARK-32756
> URL: https://issues.apache.org/jira/browse/SPARK-32756
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karol Chmist
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187864#comment-17187864
 ] 

Apache Spark commented on SPARK-32756:
--

User 'karolchmist' has created a pull request for this issue:
https://github.com/apache/spark/pull/29584

> Fix CaseInsensitiveMap in Scala 2.13
> 
>
> Key: SPARK-32756
> URL: https://issues.apache.org/jira/browse/SPARK-32756
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Karol Chmist
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13

2020-08-31 Thread Karol Chmist (Jira)
Karol Chmist created SPARK-32756:


 Summary: Fix CaseInsensitiveMap in Scala 2.13
 Key: SPARK-32756
 URL: https://issues.apache.org/jira/browse/SPARK-32756
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Karol Chmist






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32750) Add code-gen for SortAggregateExec

2020-08-31 Thread Cheng Su (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187860#comment-17187860
 ] 

Cheng Su commented on SPARK-32750:
--

[~maropu] - thanks a lot for pointing out existing context. Will take a look. 
Thanks.

> Add code-gen for SortAggregateExec
> --
>
> Key: SPARK-32750
> URL: https://issues.apache.org/jira/browse/SPARK-32750
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Minor
>
> We have had codegen for hash aggregate (`HashAggregateExec`) for a long time, but 
> codegen for sort aggregate (`SortAggregate`) is still missing. Sort aggregate is still 
> useful in terms of performance if (1) the data after aggregation is still too big 
> to fit in memory (both hash aggregate and object hash aggregate need to 
> spill), or (2) the user disables hash aggregate and object hash aggregate by 
> config to prefer sort aggregate, e.g. when the aggregate comes after a sort merge 
> join and does not need a sort at all.
> Create this Jira to add codegen support for sort aggregate.
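
As a small editorial illustration of the gap described above (assuming Spark 3.x planning rules; this is not code from the ticket or its PR): an aggregate whose buffer type is not supported by the hash-based operators, such as {{max}} over a string column, is planned as a SortAggregate, which is exactly the operator that would benefit from codegen.

{code}
// Sketch for spark-shell: surfaces a SortAggregate in the physical plan.
import org.apache.spark.sql.functions.max

spark.range(0, 1000)
  .selectExpr("id % 10 AS k", "CAST(id AS STRING) AS s")
  .groupBy("k")
  .agg(max("s"))   // string max has an immutable buffer, so hash aggregate is not used
  .explain()       // the plan should show SortAggregate(...) rather than HashAggregate(...)
{code}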



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32542:


Assignee: Apache Spark

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: karl wang
>Assignee: Apache Spark
>Priority: Major
>
> Split an Expand into several smaller Expands, each containing the specified 
> number of projections.
> For instance, a query like SELECT a, b, c, d, count(1) FROM table1 GROUP BY 
> a, b, c, d WITH CUBE can expand the original data size by 2^4 times.
> If we specify spark.sql.optimizer.projections.size=4, the Expand will be 
> split into 2^4/4 = 4 smaller Expands. This can reduce the shuffle pressure and improve 
> performance in multidimensional analysis when the data is huge.
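
A sketch of the scenario described above (editorial; {{table1}} and its columns are the hypothetical ones from the description, and {{spark.sql.optimizer.projections.size}} is the configuration proposed by this ticket, not an existing Spark setting):

{code}
// WITH CUBE over 4 grouping columns produces 2^4 = 16 grouping sets, i.e. a single
// Expand with 16 projections; the proposal is to split it into 16 / 4 = 4 Expands.
spark.sql("SET spark.sql.optimizer.projections.size=4")  // proposed config from this ticket
spark.sql(
  """SELECT a, b, c, d, count(1)
    |FROM table1
    |GROUP BY a, b, c, d WITH CUBE""".stripMargin).explain()
{code}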



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32542:


Assignee: (was: Apache Spark)

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
>
> Split an Expand into several smaller Expands, each containing the specified 
> number of projections.
> For instance, a query like SELECT a, b, c, d, count(1) FROM table1 GROUP BY 
> a, b, c, d WITH CUBE can expand the original data size by 2^4 times.
> If we specify spark.sql.optimizer.projections.size=4, the Expand will be 
> split into 2^4/4 = 4 smaller Expands. This can reduce the shuffle pressure and improve 
> performance in multidimensional analysis when the data is huge.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0

2020-08-31 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187849#comment-17187849
 ] 

Chao Sun commented on SPARK-27733:
--

bq. BTW, is the Hive community willing to introduce the version upgrade of Avro in 
their 2.3.x?

It seems non-trivial to upgrade Avro in Hive (see HIVE-21737). Maybe we can 
shade Avro in Hive first so it won't cause conflicts? Even for that we may need 
to first enable CI for Hive branch 2.3 (see 
https://github.com/apache/hive/pull/1398). After that we can cut a release and 
do a vote in the community. 

> Upgrade to Avro 1.10.0
> --
>
> Key: SPARK-27733
> URL: https://issues.apache.org/jira/browse/SPARK-27733
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.1.0
>Reporter: Ismaël Mejía
>Priority: Minor
>
> Avro 1.9.2 was released with many nice features, including reduced size (1MB 
> less), removed dependencies (no paranamer, no shaded guava), and security 
> updates, so it is probably a worthwhile upgrade.
> Avro 1.10.0 has since been released and this is still not done.
> There is at the moment (2020/08) still a blocker because Hive-related 
> transitive dependencies bring in older versions of Avro, so we could say that 
> this is effectively still blocked until HIVE-21737 is solved.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins

2020-08-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187843#comment-17187843
 ] 

Dongjoon Hyun commented on SPARK-32691:
---

Yes. That failure looks like that, but it's still irrelevant to DISK_3, isn't 
it? You can attach that to the PR description and make the scope broader.
{code}
caching on disk, replicated 2 (encryption = on)
{code}

> Test org.apache.spark.DistributedSuite failed on arm64 jenkins
> --
>
> Key: SPARK-32691
> URL: https://issues.apache.org/jira/browse/SPARK-32691
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, Tests
>Affects Versions: 3.1.0
> Environment: ARM64
>Reporter: huangtianhua
>Priority: Major
>
> Tests of org.apache.spark.DistributedSuite fail on the arm64 Jenkins: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ 
> - caching in memory and disk, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory and disk, serialized, replicated (encryption = on) 
> (with replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> - caching in memory, serialized, replicated (encryption = on) (with 
> replication as stream) *** FAILED ***
>   3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191)
> ...
> 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-31 Thread karl wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

karl wang reopened SPARK-32542:
---

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
>
> Split an Expand into several smaller Expands, each containing the specified 
> number of projections.
> For instance, a query like SELECT a, b, c, d, count(1) FROM table1 GROUP BY 
> a, b, c, d WITH CUBE can expand the original data size by 2^4 times.
> If we specify spark.sql.optimizer.projections.size=4, the Expand will be 
> split into 2^4/4 = 4 smaller Expands. This can reduce the shuffle pressure and improve 
> performance in multidimensional analysis when the data is huge.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates

2020-08-31 Thread karl wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

karl wang resolved SPARK-32542.
---
Resolution: Fixed

> Add an optimizer rule to split an Expand into multiple Expands for aggregates
> -
>
> Key: SPARK-32542
> URL: https://issues.apache.org/jira/browse/SPARK-32542
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: karl wang
>Priority: Major
>
> Split an Expand into several smaller Expands, each containing the specified 
> number of projections.
> For instance, a query like SELECT a, b, c, d, count(1) FROM table1 GROUP BY 
> a, b, c, d WITH CUBE can expand the original data size by 2^4 times.
> If we specify spark.sql.optimizer.projections.size=4, the Expand will be 
> split into 2^4/4 = 4 smaller Expands. This can reduce the shuffle pressure and improve 
> performance in multidimensional analysis when the data is huge.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32755:


Assignee: (was: Apache Spark)

> Maintain the order of expressions in AttributeSet and ExpressionSet 
> 
>
> Key: SPARK-32755
> URL: https://issues.apache.org/jira/browse/SPARK-32755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ali Afroozeh
>Priority: Major
>
> Expression identity is based on the ExprId, which is an auto-incremented 
> number. This means that the same query can yield a query plan with different 
> expression ids in different runs. AttributeSet and ExpressionSet internally 
> use a HashSet as the underlying data structure, and therefore cannot 
> guarantee a fixed order of operations in different runs. This can be 
> problematic in cases where we want to check for plan changes in different runs.
> We make the following changes to AttributeSet and ExpressionSet to 
> maintain the insertion order of the elements:
>  * We change the underlying data structure of AttributeSet from HashSet to 
> LinkedHashSet to maintain the insertion order.
>  * ExpressionSet already uses a list to keep track of the expressions; 
> however, since it extends Scala's immutable.Set class, operations such 
> as map and flatMap are delegated to the immutable.Set itself. This means that 
> the result of these operations is no longer an instance of ExpressionSet, but 
> rather an implementation picked up by the parent class. We therefore remove 
> this inheritance from immutable.Set and implement the needed methods 
> directly. ExpressionSet has very specific semantics and it does not make 
> sense to extend immutable.Set anyway.
>  * We change the PlanStabilitySuite to not sort the attributes, to be able to 
> catch changes in the order of expressions in different runs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32755:


Assignee: Apache Spark

> Maintain the order of expressions in AttributeSet and ExpressionSet 
> 
>
> Key: SPARK-32755
> URL: https://issues.apache.org/jira/browse/SPARK-32755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ali Afroozeh
>Assignee: Apache Spark
>Priority: Major
>
> Expression identity is based on the ExprId, which is an auto-incremented 
> number. This means that the same query can yield a query plan with different 
> expression ids in different runs. AttributeSet and ExpressionSet internally 
> use a HashSet as the underlying data structure, and therefore cannot 
> guarantee a fixed order of operations in different runs. This can be 
> problematic in cases where we want to check for plan changes in different runs.
> We make the following changes to AttributeSet and ExpressionSet to 
> maintain the insertion order of the elements:
>  * We change the underlying data structure of AttributeSet from HashSet to 
> LinkedHashSet to maintain the insertion order.
>  * ExpressionSet already uses a list to keep track of the expressions; 
> however, since it extends Scala's immutable.Set class, operations such 
> as map and flatMap are delegated to the immutable.Set itself. This means that 
> the result of these operations is no longer an instance of ExpressionSet, but 
> rather an implementation picked up by the parent class. We therefore remove 
> this inheritance from immutable.Set and implement the needed methods 
> directly. ExpressionSet has very specific semantics and it does not make 
> sense to extend immutable.Set anyway.
>  * We change the PlanStabilitySuite to not sort the attributes, to be able to 
> catch changes in the order of expressions in different runs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187779#comment-17187779
 ] 

Apache Spark commented on SPARK-32755:
--

User 'dbaliafroozeh' has created a pull request for this issue:
https://github.com/apache/spark/pull/29598

> Maintain the order of expressions in AttributeSet and ExpressionSet 
> 
>
> Key: SPARK-32755
> URL: https://issues.apache.org/jira/browse/SPARK-32755
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Ali Afroozeh
>Priority: Major
>
> Expression identity is based on the ExprId, which is an auto-incremented 
> number. This means that the same query can yield a query plan with different 
> expression ids in different runs. AttributeSet and ExpressionSet internally 
> use a HashSet as the underlying data structure, and therefore cannot 
> guarantee a fixed order of operations in different runs. This can be 
> problematic in cases where we want to check for plan changes in different runs.
> We make the following changes to AttributeSet and ExpressionSet to 
> maintain the insertion order of the elements:
>  * We change the underlying data structure of AttributeSet from HashSet to 
> LinkedHashSet to maintain the insertion order.
>  * ExpressionSet already uses a list to keep track of the expressions; 
> however, since it extends Scala's immutable.Set class, operations such 
> as map and flatMap are delegated to the immutable.Set itself. This means that 
> the result of these operations is no longer an instance of ExpressionSet, but 
> rather an implementation picked up by the parent class. We therefore remove 
> this inheritance from immutable.Set and implement the needed methods 
> directly. ExpressionSet has very specific semantics and it does not make 
> sense to extend immutable.Set anyway.
>  * We change the PlanStabilitySuite to not sort the attributes, to be able to 
> catch changes in the order of expressions in different runs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet

2020-08-31 Thread Ali Afroozeh (Jira)
Ali Afroozeh created SPARK-32755:


 Summary: Maintain the order of expressions in AttributeSet and 
ExpressionSet 
 Key: SPARK-32755
 URL: https://issues.apache.org/jira/browse/SPARK-32755
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Ali Afroozeh


Expression identity is based on the ExprId, which is an auto-incremented 
number. This means that the same query can yield a query plan with different 
expression ids in different runs. AttributeSet and ExpressionSet internally use 
a HashSet as the underlying data structure, and therefore cannot guarantee 
a fixed order of operations in different runs. This can be problematic in cases 
where we want to check for plan changes in different runs.

We make the following changes to AttributeSet and ExpressionSet to 
maintain the insertion order of the elements:
 * We change the underlying data structure of AttributeSet from HashSet to 
LinkedHashSet to maintain the insertion order.
 * ExpressionSet already uses a list to keep track of the expressions; however, 
since it extends Scala's immutable.Set class, operations such as map and 
flatMap are delegated to the immutable.Set itself. This means that the result 
of these operations is no longer an instance of ExpressionSet, but rather an 
implementation picked up by the parent class. We therefore remove this 
inheritance from immutable.Set and implement the needed methods directly. 
ExpressionSet has very specific semantics and it does not make sense to extend 
immutable.Set anyway.
 * We change the PlanStabilitySuite to not sort the attributes, to be able to 
catch changes in the order of expressions in different runs.
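
A minimal, self-contained illustration of the ordering issue described above (editorial; the names are hypothetical, not Spark's AttributeSet/ExpressionSet):

{code}
import scala.collection.mutable

// `id` stands in for Spark's auto-incremented ExprId.
case class Attr(name: String, id: Long)

def attrs(startId: Long): Seq[Attr] =
  Seq("a", "b", "c").zipWithIndex.map { case (n, i) => Attr(n, startId + i) }

// Insertion-ordered set: prints a, b, c regardless of which ids were allocated.
println((mutable.LinkedHashSet.empty[Attr] ++ attrs(1)).mkString(", "))
println((mutable.LinkedHashSet.empty[Attr] ++ attrs(100)).mkString(", "))

// Hash-ordered set: the order depends on hashCode(name, id), so it can differ
// between the two "runs" even though the logical attributes are the same.
println((mutable.HashSet.empty[Attr] ++ attrs(1)).mkString(", "))
println((mutable.HashSet.empty[Attr] ++ attrs(100)).mkString(", "))
{code}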



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32742) FileOutputCommitter warns "No Output found for attempt"

2020-08-31 Thread Ryan Luo (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187751#comment-17187751
 ] 

Ryan Luo commented on SPARK-32742:
--

[~hyukjin.kwon] Thanks for advising.

> FileOutputCommitter warns "No Output found for attempt"
> ---
>
> Key: SPARK-32742
> URL: https://issues.apache.org/jira/browse/SPARK-32742
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
> Environment: Hadoop 2.6.0-cdh5.16.2
> YARN(MR2 included)
>  
>Reporter: Ryan Luo
>Priority: Major
>
> Hi team,
> This is my first time reporting an issue here.
> We submitted and ran the Spark job on the cluster. 
> We found that one of the parquet output partitions is missing in the output 
> directory. We checked the Spark job log; all task statuses show 
> success, and the output record size matches the expected number.
> However, when we checked the container log, we found a warning saying 
> *No Output found for attempt_20200819094307_0003_m_02_11*, which stopped 
> the output from being moved from the taskAttemptPath to the output directory. 
> As a result, we are missing some of the output rows.
> Re-running the job solved the issue, however the report is critical for 
> us. It would be appreciated if you could advise on the cause of the issue.
>  
> Below are the container logs:
>  
> {code:java}
> 20/08/19 09:44:57 INFO output.FileOutputCommitter: FileOutputCommitter skip 
> cleanup _temporary folders under output directory:false, ignore cleanup 
> failures: false
> 20/08/19 09:44:57 INFO datasources.SQLHadoopMapReduceCommitProtocol: Using 
> user defined output committer class parquet.hadoop.ParquetOutputCommitter
> 20/08/19 09:44:57 INFO output.FileOutputCommitter: File Output Committer 
> Algorithm version is 2
> 20/08/19 09:44:57 INFO output.FileOutputCommitter: FileOutputCommitter skip 
> cleanup _temporary folders under output directory:false, ignore cleanup 
> failures: false
> 20/08/19 09:44:57 INFO datasources.SQLHadoopMapReduceCommitProtocol: Using 
> output committer class parquet.hadoop.ParquetOutputCommitter
> 20/08/19 09:44:57 INFO codegen.CodeGenerator: Code generated in 12.370642 ms
> 20/08/19 09:44:57 INFO codegen.CodeGenerator: Code generated in 6.927118 ms
> 20/08/19 09:44:57 INFO codegen.CodeGenerator: Code generated in 12.004204 ms
> 20/08/19 09:44:57 INFO parquet.ParquetWriteSupport: Initialized Parquet 
> WriteSupport with Catalyst schema:
> . (skipped)
> 20/08/19 09:44:57 WARN output.FileOutputCommitter: No Output found for 
> attempt_20200819094307_0003_m_02_11
> 20/08/19 09:44:57 INFO mapred.SparkHadoopMapRedUtil: 
> attempt_20200819094307_0003_m_02_11: Committed
> {code}
>  
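
One editorial note on the log above (an assumption to verify, not a conclusion from this ticket): the container reports File Output Committer Algorithm version 2, which renames task output directly into the destination directory at task commit. For comparison, the committer can be pinned to the classic v1 algorithm through a standard Hadoop property; whether that is related to the missing output here is an open question.

{code}
// Hedged sketch: pin the Hadoop FileOutputCommitter to algorithm v1 for a comparison run.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("committer-v1-comparison")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()
{code}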



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32592) Make `DataFrameReader.table` take the specified options

2020-08-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-32592.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 29535
[https://github.com/apache/spark/pull/29535]

> Make `DataFrameReader.table` take the specified options
> ---
>
> Key: SPARK-32592
> URL: https://issues.apache.org/jira/browse/SPARK-32592
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.1.0
>
>
> Currently, `DataFrameReader.table` ignores the specified options. Options 
> specified as in the following example are lost:
> {code:java}
> val df = spark.read
>   .option("partitionColumn", "id")
>   .option("lowerBound", "0")
>   .option("upperBound", "3")
>   .option("numPartitions", "2")
>   .table("h2.test.people")
> {code}
> We need to make `DataFrameReader.table` take the specified options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32592) Make `DataFrameReader.table` take the specified options

2020-08-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-32592:
---

Assignee: Huaxin Gao

> Make `DataFrameReader.table` take the specified options
> ---
>
> Key: SPARK-32592
> URL: https://issues.apache.org/jira/browse/SPARK-32592
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
>
> Currently, `DataFrameReader.table` ignores the specified options. Options 
> specified as in the following example are lost:
> {code:java}
> val df = spark.read
>   .option("partitionColumn", "id")
>   .option("lowerBound", "0")
>   .option("upperBound", "3")
>   .option("numPartitions", "2")
>   .table("h2.test.people")
> {code}
> We need to make `DataFrameReader.table` take the specified options.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32752) Alias breaks for interval typed literals

2020-08-31 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187692#comment-17187692
 ] 

Kent Yao commented on SPARK-32752:
--

cc [~cloud_fan] 

> Alias breaks for interval typed literals
> 
>
> Key: SPARK-32752
> URL: https://issues.apache.org/jira/browse/SPARK-32752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Cases we found:
> {code:java}
> +-- !query
> +select interval '1 day' as day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +no viable alternative at input 'as'(line 1, pos 24)
> +
> +== SQL ==
> +select interval '1 day' as day
> +^^^
> +
> +
> +-- !query
> +select interval '1 day' day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, 
> pos 16)
> +
> +== SQL ==
> +select interval '1 day' day
> +^^^
> +
> +
> +-- !query
> +select interval '1-2' year as y
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16)
> +
> +== SQL ==
> +select interval '1-2' year as y
> +^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32190) Development - Contribution Guide

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187686#comment-17187686
 ] 

Apache Spark commented on SPARK-32190:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29596

> Development - Contribution Guide
> 
>
> Key: SPARK-32190
> URL: https://issues.apache.org/jira/browse/SPARK-32190
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Will be similar to http://spark.apache.org/contributing.html but will avoid 
> duplication and add more details.
> - Style Guide: PEP8 but there are a couple of exceptions such as 
> https://github.com/apache/spark/blob/master/dev/tox.ini#L17.
> - Bug Reports: JIRA. Maybe we should just point back to the link, 
> http://spark.apache.org/contributing.html
> - Contribution Workflow: e.g.) 
> https://pandas.pydata.org/docs/development/contributing.html
> - Documentation Contribution: e.g.) 
> https://pandas.pydata.org/docs/development/contributing.html
> - Code Contribution: e.g.) 
> https://pandas.pydata.org/docs/development/contributing.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32190) Development - Contribution Guide

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32190:


Assignee: (was: Apache Spark)

> Development - Contribution Guide
> 
>
> Key: SPARK-32190
> URL: https://issues.apache.org/jira/browse/SPARK-32190
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Will be similar to http://spark.apache.org/contributing.html but will avoid 
> duplication and add more details.
> - Style Guide: PEP8 but there are a couple of exceptions such as 
> https://github.com/apache/spark/blob/master/dev/tox.ini#L17.
> - Bug Reports: JIRA. Maybe we should just point back to the link, 
> http://spark.apache.org/contributing.html
> - Contribution Workflow: e.g.) 
> https://pandas.pydata.org/docs/development/contributing.html
> - Documentation Contribution: e.g.) 
> https://pandas.pydata.org/docs/development/contributing.html
> - Code Contribution: e.g.) 
> https://pandas.pydata.org/docs/development/contributing.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32190) Development - Contribution Guide

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32190:


Assignee: Apache Spark

> Development - Contribution Guide
> 
>
> Key: SPARK-32190
> URL: https://issues.apache.org/jira/browse/SPARK-32190
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Will be similar to http://spark.apache.org/contributing.html but will avoid 
> duplication and add more details.
> - Style Guide: PEP8 but there are a couple of exceptions such as 
> https://github.com/apache/spark/blob/master/dev/tox.ini#L17.
> - Bug Reports: JIRA. Maybe we should just point back to the link, 
> http://spark.apache.org/contributing.html
> - Contribution Workflow: e.g.) 
> https://pandas.pydata.org/docs/development/contributing.html
> - Documentation Contribution: e.g.) 
> https://pandas.pydata.org/docs/development/contributing.html
> - Code Contribution: e.g.) 
> https://pandas.pydata.org/docs/development/contributing.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32190) Development - Contribution Guide

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187684#comment-17187684
 ] 

Apache Spark commented on SPARK-32190:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29596

> Development - Contribution Guide
> 
>
> Key: SPARK-32190
> URL: https://issues.apache.org/jira/browse/SPARK-32190
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, PySpark
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Will be similar to http://spark.apache.org/contributing.html but will avoid 
> duplication and add more details.
> - Style Guide: PEP8 but there are a couple of exceptions such as 
> https://github.com/apache/spark/blob/master/dev/tox.ini#L17.
> - Bug Reports: JIRA. Maybe we should just point back to the link, 
> http://spark.apache.org/contributing.html
> - Contribution Workflow: e.g.) 
> https://pandas.pydata.org/docs/development/contributing.html
> - Documentation Contribution: e.g.) 
> https://pandas.pydata.org/docs/development/contributing.html
> - Code Contribution: e.g.) 
> https://pandas.pydata.org/docs/development/contributing.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187631#comment-17187631
 ] 

Apache Spark commented on SPARK-32659:
--

User 'wangyum' has created a pull request for this issue:
https://github.com/apache/spark/pull/29595

> Fix the data issue of inserted DPP on non-atomic type
> -
>
> Key: SPARK-32659
> URL: https://issues.apache.org/jira/browse/SPARK-32659
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>  Labels: correctness
> Fix For: 3.0.1, 3.1.0
>
>
> DPP has a data issue when pruning on a non-atomic type, for example:
> {noformat}
>  spark.range(1000)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df1");
> spark.range(100)
>  .select(col("id"), col("id").as("k"))
>  .write
>  .partitionBy("k")
>  .format("parquet")
>  .mode("overwrite")
>  .saveAsTable("df2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
> spark.sql("set 
> spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
> spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = 
> struct(df2.k) AND df2.id < 2").show
> {noformat}
> It should return two records, but it returns an empty result.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32746) Not able to run Pandas UDF

2020-08-31 Thread Rahul Bhatia (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187617#comment-17187617
 ] 

Rahul Bhatia commented on SPARK-32746:
--

[~hyukjin.kwon]

Here are the logs. As mentioned, a further symptom is that the code runs fine 
and completes in 2-4 seconds in standalone mode (on my local machine), while on the 
cluster it shows no progress.
{noformat}
20/08/29 18:51:25 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
does not implement 'receive'
at 
org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/08/30 12:51:25 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
does not implement 'receive'
at 
org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/08/31 06:51:25 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) 
does not implement 'receive'
at 
org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at 
org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at 
org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20/08/31 07:31:04 INFO yarn.YarnAllocator: Completed container 
container_e66_1589136601716_6876212_01_31 on host: 
bhdp4080.prod.bdd.jp.local (state: COMPLETE, exit status: -102)
20/08/31 07:31:04 INFO yarn.YarnAllocator: Container 
container_e66_1589136601716_6876212_01_31 on host: 
bhdp4080.prod.bdd.jp.local was preempted.
20/08/31 07:31:07 INFO yarn.YarnAllocator: Will request 1 executor 
container(s), each with 16 core(s) and 18022 MB memory (including 1638 MB of 
overhead)
20/08/31 07:31:07 INFO yarn.YarnAllocator: Submitted 1 unlocalized container 
requests.
20/08/31 07:31:55 INFO yarn.YarnAllocator: Completed container 
container_e66_1589136601716_6876212_01_30 on host: 
bhdp4365.prod.hnd1.bdd.local (state: COMPLETE, exit status: -102)
20/08/31 07:31:55 INFO yarn.YarnAllocator: Container 
container_e66_1589136601716_6876212_01_30 on host: 
bhdp4365.prod.hnd1.bdd.local was preempted.
20/08/31 07:31:55 INFO yarn.YarnAllocator: Will request 1 executor 
container(s), each with 16 core(s) and 18022 MB memory (including 1638 MB of 
overhead)
20/08/31 07:31:55 INFO yarn.YarnAllocator: Submitted 1 unlocalized container 
requests.
20/08/31 07:37:04 INFO impl.AMRMClientImpl: Received new token for : 
bhdp4564.prod.hnd1.bdd.local:45454
20/08/31 07:37:04 INFO yarn.YarnAllocator: Launching container 
container_e66_1589136601716_6876212_01_32 on host 
bhdp4564.prod.hnd1.bdd.local for executor with ID 31
20/08/31 07:37:04 INFO yarn.YarnAllocator: Received 1 containers from YARN, 
launching executors on 1 of them.
20/08/31 07:37:04 INFO impl.ContainerManagementProtocolProxy: 
yarn.client.max-cached-nodemanagers-proxies : 0
20/08/31 07:37:04 INFO impl.ContainerManagementProtocolProxy: Opening proxy : 
bhdp4564.prod.hnd1.bdd.local:45454

[jira] [Commented] (SPARK-32752) Alias breaks for interval typed literals

2020-08-31 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187613#comment-17187613
 ] 

Kent Yao commented on SPARK-32752:
--

These cases will be captured by the grammar rule `multiUnitsInterval`; e.g., for 
`interval '1 day' day`, the value is `1 day` and the unit is `day`.
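
For context, a small sketch (assuming a Spark 3.0 session named `spark`) contrasting the multi-units forms that `multiUnitsInterval` is designed for with the aliased literal from this ticket:

{code:scala}
import scala.util.Try

// Forms the multiUnitsInterval rule is meant to capture: one or more
// value/unit pairs after the INTERVAL keyword.
spark.sql("SELECT interval 1 day").show()             // parses: value 1, unit day
spark.sql("SELECT interval '1' day '2' hour").show()  // parses: two value/unit pairs

// The ticket's case: '1 day' is taken as the value and the trailing `day` as
// its unit, so parsing fails with "unrecognized number 'day'".
println(Try(spark.sql("SELECT interval '1 day' day")).isFailure)  // true
{code}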

> Alias breaks for interval typed literals
> 
>
> Key: SPARK-32752
> URL: https://issues.apache.org/jira/browse/SPARK-32752
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> Cases we found:
> {code:java}
> +-- !query
> +select interval '1 day' as day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +no viable alternative at input 'as'(line 1, pos 24)
> +
> +== SQL ==
> +select interval '1 day' as day
> +^^^
> +
> +
> +-- !query
> +select interval '1 day' day
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, 
> pos 16)
> +
> +== SQL ==
> +select interval '1 day' day
> +^^^
> +
> +
> +-- !query
> +select interval '1-2' year as y
> +-- !query schema
> +struct<>
> +-- !query output
> +org.apache.spark.sql.catalyst.parser.ParseException
> +
> +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16)
> +
> +== SQL ==
> +select interval '1-2' year as y
> +^^^
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32754) Unify `assertEqualPlans` for join reorder suites

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32754:


Assignee: (was: Apache Spark)

> Unify `assertEqualPlans` for join reorder suites
> 
>
> Key: SPARK-32754
> URL: https://issues.apache.org/jira/browse/SPARK-32754
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Priority: Minor
>
> Now three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite` and 
> `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method, and 
> the logic is almost the same. We can extract the method to a single place for 
> code simplicity.
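
As an illustration of the proposed extraction only (all names below, e.g. `AssertEqualPlansHelper`, are placeholders rather than the actual Spark test classes), the duplicated assertion could move into a single mixin that each suite reuses:

{code:scala}
// Placeholder sketch of the deduplication idea: one shared trait holds the
// comparison logic, and each reorder suite only supplies its own optimizer.
trait AssertEqualPlansHelper {
  type Plan                                  // stand-in for LogicalPlan

  protected def optimize(plan: Plan): Plan   // suite-specific optimizer
  protected def normalize(plan: Plan): Plan = plan

  def assertEqualPlans(original: Plan, expectedBest: Plan): Unit = {
    val optimized = normalize(optimize(original))
    val expected  = normalize(expectedBest)
    assert(optimized == expected, s"plan mismatch:\n$optimized\nvs.\n$expected")
  }
}

// A suite then defines `optimize` once instead of re-implementing the assertion.
class JoinReorderSuiteSketch extends AssertEqualPlansHelper {
  type Plan = String                                    // placeholder plan type
  protected def optimize(plan: Plan): Plan = plan.trim  // placeholder optimizer
}
{code}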



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32754) Unify `assertEqualPlans` for join reorder suites

2020-08-31 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-32754:


Assignee: Apache Spark

> Unify `assertEqualPlans` for join reorder suites
> 
>
> Key: SPARK-32754
> URL: https://issues.apache.org/jira/browse/SPARK-32754
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Assignee: Apache Spark
>Priority: Minor
>
> Now three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite` and 
> `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method, and 
> the logic is almost the same. We can extract the method to a single place for 
> code simplicity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32754) Unify `assertEqualPlans` for join reorder suites

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187591#comment-17187591
 ] 

Apache Spark commented on SPARK-32754:
--

User 'wzhfy' has created a pull request for this issue:
https://github.com/apache/spark/pull/29594

> Unify `assertEqualPlans` for join reorder suites
> 
>
> Key: SPARK-32754
> URL: https://issues.apache.org/jira/browse/SPARK-32754
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Priority: Minor
>
> Now three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite` and 
> `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method, and 
> the logic is almost the same. We can extract the method to a single place for 
> code simplicity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32754) Unify `assertEqualPlans` for join reorder suites

2020-08-31 Thread Zhenhua Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-32754:
-
Summary: Unify `assertEqualPlans` for join reorder suites  (was: Unify 
`assertEqualPlans` for all join reorder suites)

> Unify `assertEqualPlans` for join reorder suites
> 
>
> Key: SPARK-32754
> URL: https://issues.apache.org/jira/browse/SPARK-32754
> Project: Spark
>  Issue Type: Improvement
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Priority: Minor
>
> Now three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite` and 
> `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method, and 
> the logic is almost the same. We can extract the method to a single place for 
> code simplicity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32754) Unify `assertEqualPlans` for all join reorder suites

2020-08-31 Thread Zhenhua Wang (Jira)
Zhenhua Wang created SPARK-32754:


 Summary: Unify `assertEqualPlans` for all join reorder suites
 Key: SPARK-32754
 URL: https://issues.apache.org/jira/browse/SPARK-32754
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.0.0
Reporter: Zhenhua Wang


Now three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite` and 
`StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method, and 
the logic is almost the same. We can extract the method to a single place for 
code simplicity.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec

2020-08-31 Thread Zhenhua Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-32748:
-
Description: 
Since SPARK-22590, local property propagation is supported through 
`SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
`SubqueryExec` when computing `relationFuture`.

The propagation is missed in `SubqueryBroadcastExec`.

  was:
Since 
[SPARK-22590](https://github.com/apache/spark/commit/2854091d12d670b014c41713e72153856f4d3f6a),
 local property propagation is supported through 
`SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
`SubqueryExec` when computing `relationFuture`.

The propagation is missed in `SubqueryBroadcastExec`.


> Support local property propagation in SubqueryBroadcastExec
> ---
>
> Key: SPARK-32748
> URL: https://issues.apache.org/jira/browse/SPARK-32748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Priority: Major
>
> Since SPARK-22590, local property propagation is supported through 
> `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
> `SubqueryExec` when computing `relationFuture`.
> The propagation is missed in `SubqueryBroadcastExec`.
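
For background on why the capture matters (a generic sketch, not Spark's internal code): local properties are kept in thread-locals, so work submitted to another thread does not see them unless they are captured and re-installed, which is roughly what `SQLExecution.withThreadLocalCaptured` arranges for `relationFuture`:

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, Future}

// Generic sketch only; assumes a running SparkSession `spark` and an
// arbitrary property key "my.tag".
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())
Await.ready(Future(()), 10.seconds)  // warm the pool so its thread already exists

spark.sparkContext.setLocalProperty("my.tag", "query-42")

// The pre-existing pool thread was created before the property was set,
// so it does not see the value automatically.
val without = Await.result(
  Future(spark.sparkContext.getLocalProperty("my.tag")), 10.seconds)  // null

// Capturing on the caller thread and re-installing on the worker thread
// (what withThreadLocalCaptured does, in spirit) makes it visible.
val captured = spark.sparkContext.getLocalProperty("my.tag")
val withCapture = Await.result(Future {
  spark.sparkContext.setLocalProperty("my.tag", captured)
  spark.sparkContext.getLocalProperty("my.tag")
}, 10.seconds)  // "query-42"
{code}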



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec

2020-08-31 Thread Zhenhua Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-32748:
-
Description: 
Since 
[SPARK-22590](https://github.com/apache/spark/commit/2854091d12d670b014c41713e72153856f4d3f6a),
 local property propagation is supported through 
`SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
`SubqueryExec` when computing `relationFuture`.

The propagation is missed in `SubqueryBroadcastExec`.

  was:
Currently local property propagation is supported through 
`SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
`SubqueryExec` when computing `relationFuture`.

The propagation is missed in `SubqueryBroadcastExec`.


> Support local property propagation in SubqueryBroadcastExec
> ---
>
> Key: SPARK-32748
> URL: https://issues.apache.org/jira/browse/SPARK-32748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Priority: Major
>
> Since 
> [SPARK-22590](https://github.com/apache/spark/commit/2854091d12d670b014c41713e72153856f4d3f6a),
>  local property propagation is supported through 
> `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
> `SubqueryExec` when computing `relationFuture`.
> The propagation is missed in `SubqueryBroadcastExec`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec

2020-08-31 Thread Zhenhua Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenhua Wang updated SPARK-32748:
-
Description: 
Currently local property propagation is supported through 
`SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
`SubqueryExec` when computing `relationFuture`.

The propagation is missed in `SubqueryBroadcastExec`.

  was:
Currently local property propagation are supported through 
`SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
`SubqueryExec` when computing `relationFuture`.

The propagation is missed in `SubqueryBroadcastExec`.


> Support local property propagation in SubqueryBroadcastExec
> ---
>
> Key: SPARK-32748
> URL: https://issues.apache.org/jira/browse/SPARK-32748
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zhenhua Wang
>Priority: Major
>
> Currently local property propagation is supported through 
> `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and 
> `SubqueryExec` when computing `relationFuture`.
> The propagation is missed in `SubqueryBroadcastExec`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-32747) Deduplicate configuration set/unset in test_sparkSQL_arrow.R

2020-08-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-32747:


Assignee: Hyukjin Kwon

> Deduplicate configuration set/unset in test_sparkSQL_arrow.R
> 
>
> Key: SPARK-32747
> URL: https://issues.apache.org/jira/browse/SPARK-32747
> Project: Spark
>  Issue Type: Test
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> Currently, there are many duplicated configuration set/unset calls in the 
> `test_sparkSQL_arrow.R` test cases. We can just set them once globally and 
> deduplicate such try-catch logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32747) Deduplicate configuration set/unset in test_sparkSQL_arrow.R

2020-08-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-32747.
--
Fix Version/s: 3.1.0
   3.0.1
   Resolution: Fixed

Issue resolved by pull request 29592
[https://github.com/apache/spark/pull/29592]

> Deduplicate configuration set/unset in test_sparkSQL_arrow.R
> 
>
> Key: SPARK-32747
> URL: https://issues.apache.org/jira/browse/SPARK-32747
> Project: Spark
>  Issue Type: Test
>  Components: R, Tests
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.1, 3.1.0
>
>
> Currently, there are many duplicated configuration set/unset calls in the 
> `test_sparkSQL_arrow.R` test cases. We can just set them once globally and 
> deduplicate such try-catch logic.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE

2020-08-31 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187507#comment-17187507
 ] 

Apache Spark commented on SPARK-32753:
--

User 'manuzhang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29593

> Deduplicating and repartitioning the same column create duplicate rows with 
> AQE
> ---
>
> Key: SPARK-32753
> URL: https://issues.apache.org/jira/browse/SPARK-32753
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Priority: Minor
>
> To reproduce:
> {code:java}
> spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
> val df = spark.sql("select id from v1 group by id distribute by id") 
> println(df.collect().toArray.mkString(","))
> println(df.queryExecution.executedPlan)
> // With AQE
> [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
> AdaptiveSparkPlan(isFinalPlan=true)
> +- CustomShuffleReader local
>+- ShuffleQueryStage 0
>   +- Exchange hashpartitioning(id#183L, 10), true
>  +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
> +- Union
>:- *(1) Range (0, 10, step=1, splits=2)
>+- *(2) Range (0, 10, step=1, splits=2)
> // Without AQE
> [4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
> *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
> +- Exchange hashpartitioning(id#206L, 10), true
>+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
>   +- Union
>  :- *(1) Range (0, 10, step=1, splits=2)
>  +- *(2) Range (0, 10, step=1, splits=2){code}
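
As a quick sanity check (sketch only, reusing the `v1` view above), the deduplicated result can be confirmed by disabling AQE for the query via `spark.sql.adaptive.enabled`:

{code:scala}
// Sketch: confirm the expected, deduplicated output with AQE turned off.
// Assumes the temp view v1 from the reproduction above.
spark.sql("SET spark.sql.adaptive.enabled=false")

val deduped = spark.sql("select id from v1 group by id distribute by id")
assert(deduped.collect().length == 10)  // ten distinct ids, no duplicates

spark.sql("SET spark.sql.adaptive.enabled=true")  // restore AQE afterwards
{code}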



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


