[jira] [Assigned] (SPARK-32762) Enhance the verification of sql-expression-schema.md
[ https://issues.apache.org/jira/browse/SPARK-32762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32762:
------------------------------------

    Assignee: (was: Apache Spark)

> Enhance the verification of sql-expression-schema.md
> ----------------------------------------------------
>
>                 Key: SPARK-32762
>                 URL: https://issues.apache.org/jira/browse/SPARK-32762
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>         Environment: sql-expression-schema.md is automatically generated by
> ExpressionsSchemaSuite, but only the expression entries are checked by the
> suite. So if someone manually modifies the contents of
> sql-expression-schema.md, ExpressionsSchemaSuite does not necessarily
> catch the change or guarantee the file's correctness.
>            Reporter: Yang Jie
>            Priority: Minor
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32762) Enhance the verification of sql-expression-schema.md
[ https://issues.apache.org/jira/browse/SPARK-32762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188149#comment-17188149 ]

Apache Spark commented on SPARK-32762:
--------------------------------------

User 'LuciferYang' has created a pull request for this issue:
https://github.com/apache/spark/pull/29608
[jira] [Assigned] (SPARK-32762) Enhance the verification of sql-expression-schema.md
[ https://issues.apache.org/jira/browse/SPARK-32762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32762:
------------------------------------

    Assignee: Apache Spark
[jira] [Created] (SPARK-32762) Enhance the verification of sql-expression-schema.md
Yang Jie created SPARK-32762:
--------------------------------

             Summary: Enhance the verification of sql-expression-schema.md
                 Key: SPARK-32762
                 URL: https://issues.apache.org/jira/browse/SPARK-32762
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.0.0
         Environment: sql-expression-schema.md is automatically generated by
ExpressionsSchemaSuite, but only the expression entries are checked by the
suite. So if someone manually modifies the contents of
sql-expression-schema.md, ExpressionsSchemaSuite does not necessarily
catch the change or guarantee the file's correctness.
            Reporter: Yang Jie
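One possible shape for the stronger check described above (a hedged sketch only; `verifySchemaDoc` and its arguments are hypothetical names, not Spark's actual API): regenerate the whole document from the expression registry and compare it verbatim with the checked-in file, so that any manual edit to sql-expression-schema.md fails the suite, not just a mismatch in the expression entries.

{code:scala}
import scala.io.Source

// Hypothetical sketch: compare the regenerated document with the checked-in
// file character-for-character, so ANY manual edit to sql-expression-schema.md
// is caught, not just a change to the expression entries.
def verifySchemaDoc(regenerated: String, checkedInPath: String): Unit = {
  val source = Source.fromFile(checkedInPath)
  val checkedIn = try source.mkString finally source.close()
  assert(checkedIn == regenerated,
    s"$checkedInPath is out of date; regenerate it via ExpressionsSchemaSuite")
}
{code}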
[jira] [Assigned] (SPARK-32187) User Guide - Shipping Python Package
[ https://issues.apache.org/jira/browse/SPARK-32187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-32187:
------------------------------------

    Assignee: Fabian Höring

> User Guide - Shipping Python Package
> ------------------------------------
>
>                 Key: SPARK-32187
>                 URL: https://issues.apache.org/jira/browse/SPARK-32187
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation, PySpark
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Assignee: Fabian Höring
>            Priority: Major
>
> - Zipped file
> - Python files
> - PEX (?) (see also SPARK-25433)
[jira] [Assigned] (SPARK-32761) Planner error when aggregating multiple distinct Constant columns
[ https://issues.apache.org/jira/browse/SPARK-32761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32761:
------------------------------------

    Assignee: Apache Spark

> Planner error when aggregating multiple distinct Constant columns
> -----------------------------------------------------------------
>
>                 Key: SPARK-32761
>                 URL: https://issues.apache.org/jira/browse/SPARK-32761
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Linhong Liu
>            Assignee: Apache Spark
>            Priority: Major
>
> SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3) will trigger this bug.
> The problematic code is:
>
> {code:java}
> val distinctAggGroups = aggExpressions.filter(_.isDistinct).groupBy { e =>
>   val unfoldableChildren = e.aggregateFunction.children.filter(!_.foldable).toSet
>   if (unfoldableChildren.nonEmpty) {
>     // Only expand the unfoldable children
>     unfoldableChildren
>   } else {
>     // If aggregateFunction's children are all foldable
>     // we must expand at least one of the children (here we take the first child),
>     // or if we don't, we will get the wrong result, for example:
>     // count(distinct 1) will be explained to count(1) after the rewrite function.
>     // Generally, the distinct aggregateFunction should not run
>     // foldable TypeCheck for the first child.
>     e.aggregateFunction.children.take(1).toSet
>   }
> }
> {code}
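The failure mode can be replayed outside Spark (a standalone sketch with stand-in types, not Spark's actual Catalyst classes): when every child is a foldable literal, each aggregate falls back to `children.take(1)`, so COUNT(DISTINCT 2) and COUNT(DISTINCT 2, 3) end up keyed by the same one-element set even though their argument lists differ.

{code:scala}
// Standalone sketch (stand-in types, not Spark's classes): model each DISTINCT
// aggregate by its argument list, with constants marked foldable, and replay
// the grouping key computed in the quoted rewrite code.
case class Arg(value: Int, foldable: Boolean)
case class DistinctAgg(children: Seq[Arg])

def groupKey(e: DistinctAgg): Set[Arg] = {
  val unfoldable = e.children.filter(!_.foldable).toSet
  if (unfoldable.nonEmpty) unfoldable // only expand the unfoldable children
  else e.children.take(1).toSet       // all-foldable: take only the first child
}

// COUNT(DISTINCT 2) and COUNT(DISTINCT 2, 3): every child is a foldable literal.
val aggs = Seq(
  DistinctAgg(Seq(Arg(2, foldable = true))),
  DistinctAgg(Seq(Arg(2, foldable = true), Arg(3, foldable = true)))
)

// Both aggregates collapse into a single group keyed by Set(Arg(2, true)),
// even though their argument lists differ -- the mismatch the planner trips over.
val groups = aggs.groupBy(groupKey)
println(groups.size) // 1
{code}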
[jira] [Assigned] (SPARK-32761) Planner error when aggregating multiple distinct Constant columns
[ https://issues.apache.org/jira/browse/SPARK-32761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-32761:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-32761) Planner error when aggregating multiple distinct Constant columns
[ https://issues.apache.org/jira/browse/SPARK-32761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188146#comment-17188146 ]

Apache Spark commented on SPARK-32761:
--------------------------------------

User 'linhongliu-db' has created a pull request for this issue:
https://github.com/apache/spark/pull/29607
[jira] [Resolved] (SPARK-32190) Development - Contribution Guide
[ https://issues.apache.org/jira/browse/SPARK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-32190.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 29596
[https://github.com/apache/spark/pull/29596]

> Development - Contribution Guide
> --------------------------------
>
>                 Key: SPARK-32190
>                 URL: https://issues.apache.org/jira/browse/SPARK-32190
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation, PySpark
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Major
>             Fix For: 3.1.0
>
> Will be similar to http://spark.apache.org/contributing.html but avoids
> duplication and adds more details:
> - Style Guide: PEP8, with a couple of exceptions such as
>   https://github.com/apache/spark/blob/master/dev/tox.ini#L17.
> - Bug Reports: JIRA. Maybe we should just point back to
>   http://spark.apache.org/contributing.html
> - Contribution Workflow: e.g.
>   https://pandas.pydata.org/docs/development/contributing.html
> - Documentation Contribution: e.g.
>   https://pandas.pydata.org/docs/development/contributing.html
> - Code Contribution: e.g.
>   https://pandas.pydata.org/docs/development/contributing.html
[jira] [Assigned] (SPARK-32190) Development - Contribution Guide
[ https://issues.apache.org/jira/browse/SPARK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon reassigned SPARK-32190:
------------------------------------

    Assignee: Hyukjin Kwon
[jira] [Updated] (SPARK-32761) Planner error when aggregating multiple distinct Constant columns
[ https://issues.apache.org/jira/browse/SPARK-32761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Linhong Liu updated SPARK-32761:
--------------------------------
    Description: The triggering query was corrected from
"SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3)" to
"SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)"; the quoted problematic
code block is unchanged.
[jira] [Created] (SPARK-32761) Planner error when aggregating multiple distinct Constant columns
Linhong Liu created SPARK-32761:
---------------------------------

             Summary: Planner error when aggregating multiple distinct Constant columns
                 Key: SPARK-32761
                 URL: https://issues.apache.org/jira/browse/SPARK-32761
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.0.0
            Reporter: Linhong Liu

SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3) will trigger this bug. The
problematic code is the distinctAggGroups grouping quoted in the messages
above.
[jira] [Commented] (SPARK-32191) Migration Guide
[ https://issues.apache.org/jira/browse/SPARK-32191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188139#comment-17188139 ]

Apache Spark commented on SPARK-32191:
--------------------------------------

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/29606

> Migration Guide
> ---------------
>
>                 Key: SPARK-32191
>                 URL: https://issues.apache.org/jira/browse/SPARK-32191
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Documentation, PySpark
>    Affects Versions: 3.1.0
>            Reporter: Hyukjin Kwon
>            Assignee: L. C. Hsieh
>            Priority: Major
>             Fix For: 3.1.0
>
> Port http://spark.apache.org/docs/latest/pyspark-migration-guide.html
[jira] [Commented] (SPARK-32760) Support for INET data type
[ https://issues.apache.org/jira/browse/SPARK-32760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188138#comment-17188138 ]

Xiao Li commented on SPARK-32760:
---------------------------------

Supporting a new data type is very expensive. I suggest closing it as LATER.

> Support for INET data type
> --------------------------
>
>                 Key: SPARK-32760
>                 URL: https://issues.apache.org/jira/browse/SPARK-32760
>             Project: Spark
>          Issue Type: Sub-task
>          Components: Spark Core
>    Affects Versions: 2.4.0, 3.0.0, 3.1.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
>
> PostgreSQL has support for the `INET` data type:
> https://www.postgresql.org/docs/9.1/datatype-net-types.html
>
> We have a few customers that are interested in similar native support for IP
> addresses, just like in PostgreSQL.
>
> The issue with storing IP addresses as strings is that most matches (such as
> whether an IP address belongs to a subnet) cannot, in most cases, leverage
> Parquet bloom filters.
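To make the motivation concrete, here is a standalone sketch (an illustration only, not a proposed Spark API) of the kind of predicate involved: once an IPv4 address is decoded to a fixed-width integer, subnet membership is a mask-and-compare, whereas a string column must be parsed per row and gives columnar statistics such as bloom filters little to work with.

{code:scala}
import java.net.InetAddress

// Illustration only (not a proposed Spark API): decode an IPv4 dotted quad to
// a Long, then test subnet membership with a mask-and-compare.
def ipv4ToLong(ip: String): Long =
  InetAddress.getByName(ip).getAddress.foldLeft(0L)((acc, b) => (acc << 8) | (b & 0xff))

def inSubnet(ip: String, subnet: String, prefixLen: Int): Boolean = {
  val mask = -1L << (32 - prefixLen)
  (ipv4ToLong(ip) & mask) == (ipv4ToLong(subnet) & mask)
}

println(inSubnet("10.1.2.3", "10.1.0.0", 16))    // true
println(inSubnet("192.168.0.5", "10.1.0.0", 16)) // false
{code}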
[jira] [Commented] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe
[ https://issues.apache.org/jira/browse/SPARK-31511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188131#comment-17188131 ]

Apache Spark commented on SPARK-31511:
--------------------------------------

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/29605

> Make BytesToBytesMap iterator() thread-safe
> -------------------------------------------
>
>                 Key: SPARK-31511
>                 URL: https://issues.apache.org/jira/browse/SPARK-31511
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.4.1, 1.5.2, 1.6.3, 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.0.0
>            Reporter: Herman van Hövell
>            Assignee: Herman van Hövell
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.1.0, 3.0.2
>
> BytesToBytesMap currently has a thread-safe and an unsafe iterator method.
> This is somewhat confusing. We should make iterator() thread-safe and remove
> the safeIterator() function.
[jira] [Commented] (SPARK-31511) Make BytesToBytesMap iterator() thread-safe
[ https://issues.apache.org/jira/browse/SPARK-31511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188132#comment-17188132 ]

Apache Spark commented on SPARK-31511:
--------------------------------------

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/29605
[jira] [Resolved] (SPARK-32759) Support for INET data type
[ https://issues.apache.org/jira/browse/SPARK-32759?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ruslan Dautkhanov resolved SPARK-32759.
---------------------------------------
    Resolution: Duplicate

> Support for INET data type
> --------------------------
>
>                 Key: SPARK-32759
>                 URL: https://issues.apache.org/jira/browse/SPARK-32759
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 2.4.0, 3.0.0, 3.1.0
>            Reporter: Ruslan Dautkhanov
>            Priority: Major
[jira] [Created] (SPARK-32760) Support for INET data type
Ruslan Dautkhanov created SPARK-32760:
-------------------------------------

             Summary: Support for INET data type
                 Key: SPARK-32760
                 URL: https://issues.apache.org/jira/browse/SPARK-32760
             Project: Spark
          Issue Type: Sub-task
          Components: Spark Core
    Affects Versions: 3.0.0, 2.4.0, 3.1.0
            Reporter: Ruslan Dautkhanov
[jira] [Created] (SPARK-32759) Support for INET data type
Ruslan Dautkhanov created SPARK-32759:
-------------------------------------

             Summary: Support for INET data type
                 Key: SPARK-32759
                 URL: https://issues.apache.org/jira/browse/SPARK-32759
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 3.0.0, 2.4.0, 3.1.0
            Reporter: Ruslan Dautkhanov
[jira] [Updated] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partition
[ https://issues.apache.org/jira/browse/SPARK-32758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Tsukanov updated SPARK-32758:
----------------------------------
    Description:

If we run the following code

{code:scala}
val sparkConf = new SparkConf()
  .setAppName("test-app")
  .setMaster("local[1]")

val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
import sparkSession.implicits._

val df = (1 to 10)
  .toDF("c1")
  .repartition(1000)

implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema)

df.limit(1)
  .map(identity)
  .collect()

df.map(identity)
  .limit(1)
  .collect()

Thread.sleep(10)
{code}

we will see that in the first case spark started 1002 tasks despite the fact
there is limit(1) -

!image-2020-09-01-10-51-09-417.png!

Expected behavior - both scenarios (limit before and after map) will produce
the same results - one or two tasks to get one value from the DataFrame.

(was: the same text without "in the first case")
[jira] [Updated] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partition
[ https://issues.apache.org/jira/browse/SPARK-32758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Tsukanov updated SPARK-32758:
----------------------------------
    Environment: (was: должен [Russian: "should"])
[jira] [Updated] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partition
[ https://issues.apache.org/jira/browse/SPARK-32758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Tsukanov updated SPARK-32758:
----------------------------------
    Description: The embedded screenshot reference was updated from
!image-2020-09-01-10-34-47-580.png! to !image-2020-09-01-10-51-09-417.png!;
the rest of the description is unchanged.
[jira] [Updated] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partition
[ https://issues.apache.org/jira/browse/SPARK-32758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ivan Tsukanov updated SPARK-32758:
----------------------------------
    Attachment: image-2020-09-01-10-51-09-417.png
[jira] [Created] (SPARK-32758) Spark ignores limit(1) and starts tasks for all partitions
Ivan Tsukanov created SPARK-32758: - Summary: Spark ignores limit(1) and starts tasks for all partitions Key: SPARK-32758 URL: https://issues.apache.org/jira/browse/SPARK-32758 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.0 Environment: Reporter: Ivan Tsukanov If we run the following code {code:scala} import org.apache.spark.SparkConf import org.apache.spark.sql.{Row, SparkSession} import org.apache.spark.sql.catalyst.encoders.{ExpressionEncoder, RowEncoder} val sparkConf = new SparkConf() .setAppName("test-app") .setMaster("local[1]") val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate() import sparkSession.implicits._ val df = (1 to 10) .toDF("c1") .repartition(1000) implicit val encoder: ExpressionEncoder[Row] = RowEncoder(df.schema) df.limit(1) .map(identity) .collect() df.map(identity) .limit(1) .collect() Thread.sleep(10) {code} we will see that Spark started 1002 tasks despite the limit(1) - !image-2020-09-01-10-34-47-580.png! Expected behavior - both scenarios (limit before and after map) should produce the same result - one or two tasks to get one value from the DataFrame. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
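For reference, the behavior the reporter expects matches ordinary lazy-sequence semantics, where a limit bounds upstream work whether it appears before or after a map. A pure-Python analogy (this illustrates the expectation only; it says nothing about Spark's physical planning):

```python
from itertools import islice

evaluated = []  # records every element the map function actually touches

def identity(x):
    evaluated.append(x)
    return x

data = range(1000)  # stand-in for rows spread over 1000 partitions

# limit(1) before map: only one element flows into the map.
first = list(map(identity, islice(data, 1)))

# map before limit(1): laziness still evaluates only one element.
second = list(islice(map(identity, data), 1))

print(first, second)   # [0] [0]
print(len(evaluated))  # 2 -- one evaluation per scenario, not 2000
```

Either ordering touches a single element; the ticket reports that Spark instead scheduled tasks for every partition.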
[jira] [Commented] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188113#comment-17188113 ] huangtianhua commented on SPARK-32691: -- > testOnly *DistributedSuite -- -z "caching in memory and disk, replicated > (encryption = on) (with replication as stream)" I tested only the case "caching in memory and disk, replicated (encryption = on) (with replication as stream)", and it does not fail every time. I am sorry I can't fix this issue myself; the ARM Jenkins has been failing for a few days. I have uploaded success.log and failure.log as attachments, so anybody who can help analyze them is welcome, and I can provide an arm64 instance if needed. Thanks all! > Test org.apache.spark.DistributedSuite failed on arm64 jenkins > -- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Priority: Major > Attachments: failure.log, success.log > > > Tests of org.apache.spark.DistributedSuite fail on the arm64 Jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huangtianhua updated SPARK-32691: - Attachment: failure.log > Test org.apache.spark.DistributedSuite failed on arm64 jenkins > -- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Priority: Major > Attachments: failure.log, success.log > > > Tests of org.apache.spark.DistributedSuite fail on the arm64 Jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huangtianhua updated SPARK-32691: - Attachment: success.log > Test org.apache.spark.DistributedSuite failed on arm64 jenkins > -- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Priority: Major > Attachments: failure.log, success.log > > > Tests of org.apache.spark.DistributedSuite fail on the arm64 Jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32715) Broadcast block pieces may leak memory
[ https://issues.apache.org/jira/browse/SPARK-32715?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lantao Jin updated SPARK-32715: --- Affects Version/s: 2.4.6 3.0.0 > Broadcast block pieces may leak memory > -- > > Key: SPARK-32715 > URL: https://issues.apache.org/jira/browse/SPARK-32715 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.6, 3.0.0, 3.1.0 >Reporter: Lantao Jin >Priority: Major > > We use the Spark thrift-server as a long-running service. A bad query submitted a > heavy BroadcastNestedLoopJoin operation and drove the driver into full GC. We killed the > bad query, but the driver's memory usage remained high and full GCs > occurred very frequently. By investigating the GC dump and log, we found that > broadcast block pieces may leak memory. > 2020-08-19T18:54:02.824-0700: [Full GC (Allocation Failure) > 2020-08-19T18:54:02.824-0700: [Class Histogram (before full gc): > 116G->112G(170G), 184.9121920 secs] > [Eden: 32.0M(7616.0M)->0.0B(8704.0M) Survivors: 1088.0M->0.0B Heap: > 116.4G(170.0G)->112.9G(170.0G)], [Metaspace: 177285K->177270K(182272K)] > num #instances #bytes class name > -- > 1: 676531691 72035438432 [B > 2: 676502528 32472121344 org.apache.spark.sql.catalyst.expressions.UnsafeRow > 3: 99551 12018117568 [Ljava.lang.Object; > 4: 26570 4349629040 [I > 5: 6 3264536688 [Lorg.apache.spark.sql.catalyst.InternalRow; > 6: 1708819 256299456 [C > 7: 2338 179615208 [J > 8: 1703669 54517408 java.lang.String > 9: 103860 34896960 org.apache.spark.status.TaskDataWrapper > 10: 177396 25545024 java.net.URI > ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27951) ANSI SQL: NTH_VALUE function
[ https://issues.apache.org/jira/browse/SPARK-27951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188105#comment-17188105 ] Apache Spark commented on SPARK-27951: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/29604 > ANSI SQL: NTH_VALUE function > > > Key: SPARK-27951 > URL: https://issues.apache.org/jira/browse/SPARK-27951 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Zhu, Lipeng >Priority: Major > > |{{nth_value(value any, nth integer)}}|same type as {{value}}|returns {{value}} evaluated at the row that is the {{nth}} row of > the window frame (counting from 1); null if no such row| > [https://www.postgresql.org/docs/8.4/functions-window.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
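The semantics quoted from the PostgreSQL docs are easy to state directly. A small pure-Python sketch of NTH_VALUE over a window frame (an illustration of the SQL semantics only, not Spark's implementation):

```python
def nth_value(frame, nth):
    """Return the value at the nth row (1-based) of a window frame,
    or None (SQL NULL) if the frame has fewer than nth rows."""
    if nth < 1:
        raise ValueError("nth must be >= 1")
    return frame[nth - 1] if nth <= len(frame) else None

# With the default frame (start of partition up to the current row),
# each row sees a growing frame:
values = [10, 20, 30]
frames = [values[: i + 1] for i in range(len(values))]
print([nth_value(f, 2) for f in frames])  # [None, 20, 20]
```

The first row's frame has only one row, so NTH_VALUE(..., 2) is NULL there; later rows see the second value.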
[jira] [Commented] (SPARK-32752) Alias breaks for interval typed literals
[ https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188103#comment-17188103 ] Kent Yao commented on SPARK-32752: -- We support too many forms of interval representation, and some of them conflict; I haven't found a way to disambiguate them yet > Alias breaks for interval typed literals > > > Key: SPARK-32752 > URL: https://issues.apache.org/jira/browse/SPARK-32752 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Cases we found: > {code:java} > +-- !query > +select interval '1 day' as day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +no viable alternative at input 'as'(line 1, pos 24) > + > +== SQL == > +select interval '1 day' as day > +^^^ > + > + > +-- !query > +select interval '1 day' day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, > pos 16) > + > +== SQL == > +select interval '1 day' day > +^^^ > + > + > +-- !query > +select interval '1-2' year as y > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16) > + > +== SQL == > +select interval '1-2' year as y > +^^^ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32312) Upgrade Apache Arrow to 1.0.0
[ https://issues.apache.org/jira/browse/SPARK-32312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188079#comment-17188079 ] Hyukjin Kwon commented on SPARK-32312: -- [~bryanc] just out of curiosity, are you still working on this? > Upgrade Apache Arrow to 1.0.0 > - > > Key: SPARK-32312 > URL: https://issues.apache.org/jira/browse/SPARK-32312 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.0.0 >Reporter: Bryan Cutler >Priority: Major > > Apache Arrow will soon release v1.0.0 which provides backward/forward > compatibility guarantees as well as a number of fixes and improvements. This > will upgrade the Java artifact and PySpark API. Although PySpark will not > need special changes, it might be a good idea to bump up minimum supported > version and CI testing. > TBD: list of important improvements and fixes -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32721) Simplify if clauses with null and boolean
[ https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188075#comment-17188075 ] Apache Spark commented on SPARK-32721: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/29603 > Simplify if clauses with null and boolean > - > > Key: SPARK-32721 > URL: https://issues.apache.org/jira/browse/SPARK-32721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.1.0 > > > The following if clause: > {code:sql} > if(p, null, false) > {code} > can be simplified to: > {code:sql} > and(p, null) > {code} > And similarly, the following clause: > {code:sql} > if(p, null, true) > {code} > can be simplified to: > {code:sql} > or(not(p), null) > {code} > iff predicate {{p}} is deterministic, i.e., can be evaluated to either true > or false, but not null. > {{and}} and {{or}} clauses are more optimization friendly. For instance, by > converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can > potentially push the filter {{col > 42}} down to data sources to avoid > unnecessary IO. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
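The equivalences claimed in the ticket can be checked exhaustively under SQL's three-valued logic. A minimal pure-Python model, with None standing in for SQL NULL (a sketch of the semantics, not the optimizer rule itself):

```python
# SQL three-valued logic, with None playing the role of NULL.
def sql_and(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def sql_or(a, b):
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def sql_not(a):
    return None if a is None else (not a)

def sql_if(p, t, f):
    # CASE WHEN p THEN t ELSE f: the t branch fires only when p is true.
    return t if p is True else f

# The ticket's precondition: p is deterministic and never NULL.
for p in (True, False):
    assert sql_if(p, None, False) == sql_and(p, None)
    assert sql_if(p, None, True) == sql_or(sql_not(p), None)
print("both equivalences hold for deterministic, non-null p")
```

Rewriting `if` into `and`/`or` matters because, as the ticket notes, conjunctive predicates are easier to push down to data sources.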
[jira] [Commented] (SPARK-32750) Add code-gen for SortAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-32750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188071#comment-17188071 ] Takeshi Yamamuro commented on SPARK-32750: -- Yeah, I don't remember the implementation details in that PR, but you might be able to reuse parts of it ;) > Add code-gen for SortAggregateExec > -- > > Key: SPARK-32750 > URL: https://issues.apache.org/jira/browse/SPARK-32750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > > We have had codegen for hash aggregate (`HashAggregateExec`) for a long time, but > codegen for sort aggregate (`SortAggregateExec`) is still missing. Sort aggregate is still > useful for performance if (1) the data after aggregation is still too big > to fit in memory (both hash aggregate and object hash aggregate need to > spill), or (2) the user disables hash aggregate and object hash aggregate via > config to prefer sort aggregate, e.g. when the aggregate comes after a sort merge > join and needs no extra sort at all. > Creating this Jira to add codegen support for sort aggregate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32728) Using groupby with rand creates different values when joining table with itself
[ https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-32728. -- Resolution: Invalid > Using groupby with rand creates different values when joining table with > itself > --- > > Key: SPARK-32728 > URL: https://issues.apache.org/jira/browse/SPARK-32728 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 > Environment: I tested it with Azure Databricks 7.2 (& 6.6) (includes > Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11)) > Worker type: Standard_DS3_v2 (2 workers) > >Reporter: Joachim Bargsten >Priority: Minor > > When running following query in a python3 notebook on a cluster with > *multiple workers (>1)*, the result is not 0.0, even though I would expect it > to be. > {code:java} > import pyspark.sql.functions as F > sdf = spark.range(100) > sdf = ( > sdf.withColumn("a", F.col("id") + 1) > .withColumn("b", F.col("id") + 2) > .withColumn("c", F.col("id") + 3) > .withColumn("d", F.col("id") + 4) > .withColumn("e", F.col("id") + 5) > ) > sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e")) > sdf = sdf.withColumn("x", F.rand() * F.col("e")) > sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"]) > sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - > F.col("xx"))).agg(F.sum("delta_x")) > sum_delta_x = sdf2.head()[0] > print(f"{sum_delta_x} should be 0.0") > assert abs(sum_delta_x) < 0.001 > {code} > If the groupby statement is commented out, the code is working as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32728) Using groupby with rand creates different values when joining table with itself
[ https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188070#comment-17188070 ] Takeshi Yamamuro commented on SPARK-32728: -- +1 on the Sean comment, so I'll close this. > Using groupby with rand creates different values when joining table with > itself > --- > > Key: SPARK-32728 > URL: https://issues.apache.org/jira/browse/SPARK-32728 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 > Environment: I tested it with Azure Databricks 7.2 (& 6.6) (includes > Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11)) > Worker type: Standard_DS3_v2 (2 workers) > >Reporter: Joachim Bargsten >Priority: Minor > > When running following query in a python3 notebook on a cluster with > *multiple workers (>1)*, the result is not 0.0, even though I would expect it > to be. > {code:java} > import pyspark.sql.functions as F > sdf = spark.range(100) > sdf = ( > sdf.withColumn("a", F.col("id") + 1) > .withColumn("b", F.col("id") + 2) > .withColumn("c", F.col("id") + 3) > .withColumn("d", F.col("id") + 4) > .withColumn("e", F.col("id") + 5) > ) > sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e")) > sdf = sdf.withColumn("x", F.rand() * F.col("e")) > sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"]) > sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - > F.col("xx"))).agg(F.sum("delta_x")) > sum_delta_x = sdf2.head()[0] > print(f"{sum_delta_x} should be 0.0") > assert abs(sum_delta_x) < 0.001 > {code} > If the groupby statement is commented out, the code is working as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
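The reason this was closed as invalid: rand() is non-deterministic, so each side of the self-join may re-evaluate the plan and draw fresh random numbers. A pure-Python analogy of the two remedies, a fixed seed or materializing the column once (names here are illustrative, not Spark internals):

```python
import random

def build_column(seed=None):
    # Each "evaluation" of the plan draws fresh random numbers
    # unless a seed pins them down.
    rng = random.Random(seed)
    return [rng.random() for _ in range(5)]

# Unseeded: two evaluations (the two sides of a self-join) disagree.
left, right = build_column(), build_column()
print(left == right)  # False (with overwhelming probability)

# Seeded -- or, in Spark, cached/checkpointed before the join --
# both sides see the same values.
left, right = build_column(seed=42), build_column(seed=42)
print(left == right)  # True
```

In PySpark terms, the analogous fixes would be `F.rand(seed=...)` or persisting the dataframe before joining it with itself.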
[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property
[ https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188057#comment-17188057 ] L. C. Hsieh commented on SPARK-32693: - Ok. Thanks [~maropu] > Compare two dataframes with same schema except nullable property > > > Key: SPARK-32693 > URL: https://issues.apache.org/jira/browse/SPARK-32693 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.0 >Reporter: david bernuau >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > My aim is to compare two dataframes with very close schemas: the same number of > fields, with the same names, types and metadata. The only difference comes > from the fact that a given field might be nullable in one dataframe and not > in the other. > Here is the code that I used: > {code:java} > import org.apache.spark.sql.types._ > val session = SparkSession.builder().getOrCreate() > import org.apache.spark.sql.Row > import java.sql.Timestamp > import scala.collection.JavaConverters._ > case class A(g: Timestamp, h: Option[Timestamp], i: Int) > case class B(e: Int, f: Seq[A]) > case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int]) > case class D(e: Option[Int], f: Seq[C]) > val schema1 = StructType(Array(StructField("a", IntegerType, false), > StructField("b", IntegerType, false), StructField("c", IntegerType, false))) > val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2)) > val df1 = session.createDataFrame(rowSeq1.asJava, schema1) > df1.printSchema() > val schema2 = StructType(Array(StructField("a", IntegerType), > StructField("b", IntegerType), StructField("c", IntegerType))) > val rowSeq2: List[Row] = List(Row(10, 1, 1)) > val df2 = session.createDataFrame(rowSeq2.asJava, schema2) > df2.printSchema() > println(s"Number of records for first case : ${df1.except(df2).count()}") > val schema3 = StructType( > Array( > StructField("a", IntegerType, false), > StructField("b", IntegerType, false), > StructField("c", IntegerType, false), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType, > false), StructField("f", ArrayType(StructType(Array(StructField("g", > TimestampType), StructField("h", TimestampType), StructField("i", > IntegerType, false) > > > ) > ) > val date1 = new Timestamp(1597589638L) > val date2 = new Timestamp(1597599638L) > val rowSeq3: List[Row] = List(Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, > 1), Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2)) > val df3 = session.createDataFrame(rowSeq3.asJava, schema3) > df3.printSchema() > val schema4 = StructType( > Array( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("c", IntegerType), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType), > StructField("f", ArrayType(StructType(Array(StructField("g", TimestampType), > StructField("h", TimestampType), StructField("i", IntegerType) > > > ) > ) > val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, > None, Some(1))) > val df4 = session.createDataFrame(rowSeq4.asJava, schema4) > df4.printSchema() > println(s"Number of records for second case : ${df3.except(df4).count()}") > {code} > The preceding code shows what seems to me a bug in Spark: > * If you consider two dataframes (df1 and df2) having exactly the same > schema, except that fields are not nullable for the first dataframe and are > nullable for the second, then doing df1.except(df2).count() works well. > * Now, if you consider two other dataframes (df3 and df4) having the same > schema (with fields nullable on one side and not on the other): if these two > dataframes contain nested fields, then, this time, the action > df3.except(df4).count gives the following exception: > java.lang.IllegalArgumentException: requirement failed: Join keys from two > sides should have same types -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
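A common workaround on affected versions is to relax nullability on both sides before calling except(), so the join key types match. The sketch below works over a simplified dict-based schema (a hypothetical representation chosen for illustration; Spark's real API is StructType/StructField) to show the recursive idea, including the nested-array case that triggers the bug:

```python
def relax_nullability(field):
    """Recursively mark every field nullable so that two otherwise
    identical schemas compare as the same type."""
    relaxed = dict(field, nullable=True)
    dtype = field["type"]
    if isinstance(dtype, dict) and dtype.get("kind") == "array":
        relaxed["type"] = dict(dtype, element=relax_nullability(dtype["element"]))
    elif isinstance(dtype, dict) and dtype.get("kind") == "struct":
        relaxed["type"] = dict(
            dtype, fields=[relax_nullability(f) for f in dtype["fields"]]
        )
    return relaxed

strict = {
    "name": "d",
    "nullable": False,
    "type": {
        "kind": "array",
        "element": {"name": "e", "nullable": False, "type": "int"},
    },
}
loose = {
    "name": "d",
    "nullable": True,
    "type": {
        "kind": "array",
        "element": {"name": "e", "nullable": True, "type": "int"},
    },
}
print(relax_nullability(strict) == relax_nullability(loose))  # True
```

The same traversal, written against StructType/ArrayType, would let df3 and df4 be rebuilt with matching schemas before the except().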
[jira] [Commented] (SPARK-32693) Compare two dataframes with same schema except nullable property
[ https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17188055#comment-17188055 ] Takeshi Yamamuro commented on SPARK-32693: -- I changed 3.0.2 to 3.0.1 in "Fix Version/s" because the RC3 vote for v3.0.1 seemed to fail: [http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Spark-3-0-1-RC3-td30092.html] > Compare two dataframes with same schema except nullable property > > > Key: SPARK-32693 > URL: https://issues.apache.org/jira/browse/SPARK-32693 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.0 >Reporter: david bernuau >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > My aim is to compare two dataframes with very close schemas: the same number of > fields, with the same names, types and metadata. The only difference comes > from the fact that a given field might be nullable in one dataframe and not > in the other. > Here is the code that I used: > {code:java} > import org.apache.spark.sql.types._ > val session = SparkSession.builder().getOrCreate() > import org.apache.spark.sql.Row > import java.sql.Timestamp > import scala.collection.JavaConverters._ > case class A(g: Timestamp, h: Option[Timestamp], i: Int) > case class B(e: Int, f: Seq[A]) > case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int]) > case class D(e: Option[Int], f: Seq[C]) > val schema1 = StructType(Array(StructField("a", IntegerType, false), > StructField("b", IntegerType, false), StructField("c", IntegerType, false))) > val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2)) > val df1 = session.createDataFrame(rowSeq1.asJava, schema1) > df1.printSchema() > val schema2 = StructType(Array(StructField("a", IntegerType), > StructField("b", IntegerType), StructField("c", IntegerType))) > val rowSeq2: List[Row] = List(Row(10, 1, 1)) > val df2 = session.createDataFrame(rowSeq2.asJava, schema2) > df2.printSchema() > println(s"Number of records for first case : ${df1.except(df2).count()}") > val schema3 = StructType( > Array( > StructField("a", IntegerType, false), > StructField("b", IntegerType, false), > StructField("c", IntegerType, false), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType, > false), StructField("f", ArrayType(StructType(Array(StructField("g", > TimestampType), StructField("h", TimestampType), StructField("i", > IntegerType, false) > > > ) > ) > val date1 = new Timestamp(1597589638L) > val date2 = new Timestamp(1597599638L) > val rowSeq3: List[Row] = List(Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, > 1), Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2)) > val df3 = session.createDataFrame(rowSeq3.asJava, schema3) > df3.printSchema() > val schema4 = StructType( > Array( > StructField("a", IntegerType), > StructField("b", IntegerType), > StructField("c", IntegerType), > StructField("d", ArrayType(StructType(Array(StructField("e", IntegerType), > StructField("f", ArrayType(StructType(Array(StructField("g", TimestampType), > StructField("h", TimestampType), StructField("i", IntegerType) > > > ) > ) > val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, > None, Some(1))) > val df4 = session.createDataFrame(rowSeq4.asJava, schema4) > df4.printSchema() > println(s"Number of records for second case : ${df3.except(df4).count()}") > {code} > The preceding code shows what seems to me a bug in Spark: > * If you consider two dataframes (df1 and df2) having exactly the same > schema, except that fields are not nullable for the first dataframe and are > nullable for the second, then doing df1.except(df2).count() works well. > * Now, if you consider two other dataframes (df3 and df4) having the same > schema (with fields nullable on one side and not on the other): if these two > dataframes contain nested fields, then, this time, the action > df3.except(df4).count gives the following exception: > java.lang.IllegalArgumentException: requirement failed: Join keys from two > sides should have same types -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32693) Compare two dataframes with same schema except nullable property
[ https://issues.apache.org/jira/browse/SPARK-32693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-32693: - Fix Version/s: (was: 3.0.2) 3.0.1 > Compare two dataframes with same schema except nullable property > > > Key: SPARK-32693 > URL: https://issues.apache.org/jira/browse/SPARK-32693 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.1.0 >Reporter: david bernuau >Assignee: L. C. Hsieh >Priority: Minor > Fix For: 2.4.7, 3.0.1, 3.1.0 > > > My aim is to compare two dataframes with very close schemas: the same number of fields, with the same names, types and metadata. The only difference is that a given field might be nullable in one dataframe and not in the other. > Here is the code that I used: > {code:java}
val session = SparkSession.builder().getOrCreate()
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
import java.sql.Timestamp
import scala.collection.JavaConverters._

case class A(g: Timestamp, h: Option[Timestamp], i: Int)
case class B(e: Int, f: Seq[A])
case class C(g: Timestamp, h: Option[Timestamp], i: Option[Int])
case class D(e: Option[Int], f: Seq[C])

val schema1 = StructType(Array(StructField("a", IntegerType, false), StructField("b", IntegerType, false), StructField("c", IntegerType, false)))
val rowSeq1: List[Row] = List(Row(10, 1, 1), Row(10, 50, 2))
val df1 = session.createDataFrame(rowSeq1.asJava, schema1)
df1.printSchema()

val schema2 = StructType(Array(StructField("a", IntegerType), StructField("b", IntegerType), StructField("c", IntegerType)))
val rowSeq2: List[Row] = List(Row(10, 1, 1))
val df2 = session.createDataFrame(rowSeq2.asJava, schema2)
df2.printSchema()
println(s"Number of records for first case : ${df1.except(df2).count()}")

val schema3 = StructType(Array(
  StructField("a", IntegerType, false),
  StructField("b", IntegerType, false),
  StructField("c", IntegerType, false),
  StructField("d", ArrayType(StructType(Array(
    StructField("e", IntegerType, false),
    StructField("f", ArrayType(StructType(Array(
      StructField("g", TimestampType),
      StructField("h", TimestampType),
      StructField("i", IntegerType, false)))))))))))
val date1 = new Timestamp(1597589638L)
val date2 = new Timestamp(1597599638L)
val rowSeq3: List[Row] = List(
  Row(10, 1, 1, Seq(B(100, Seq(A(date1, None, 1))))),
  Row(10, 50, 2, Seq(B(101, Seq(A(date2, None, 2))))))
val df3 = session.createDataFrame(rowSeq3.asJava, schema3)
df3.printSchema()

val schema4 = StructType(Array(
  StructField("a", IntegerType),
  StructField("b", IntegerType),
  StructField("c", IntegerType),
  StructField("d", ArrayType(StructType(Array(
    StructField("e", IntegerType),
    StructField("f", ArrayType(StructType(Array(
      StructField("g", TimestampType),
      StructField("h", TimestampType),
      StructField("i", IntegerType)))))))))))
val rowSeq4: List[Row] = List(Row(10, 1, 1, Seq(D(Some(100), Seq(C(date1, None, Some(1)))))))
val df4 = session.createDataFrame(rowSeq4.asJava, schema4)
df4.printSchema()
println(s"Number of records for second case : ${df3.except(df4).count()}")
{code} > The preceding code shows what seems to me to be a bug in Spark: > * Consider two dataframes (df1 and df2) with exactly the same schema, except that the fields are non-nullable in the first dataframe and nullable in the second. Then df1.except(df2).count() works well. > * Now consider two other dataframes (df3 and df4) with the same schema (again with fields nullable on one side and not on the other). If these two dataframes contain nested fields, then this time the action df3.except(df4).count() gives the following exception: > java.lang.IllegalArgumentException: requirement failed: Join keys from two sides should have same types -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
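The behaviour the reporter expects amounts to comparing (and joining on) schemas with nullability normalized away. The sketch below is illustrative only: it uses a toy nested-schema encoding in plain Python with hypothetical helper names, not Spark's actual StructType machinery.

```python
# Illustrative sketch (not Spark's implementation): compare two nested
# schemas for equality while ignoring every `nullable` flag, which is the
# semantics except() needs when only nullability differs between sides.

def strip_nullable(node):
    """Return a copy of a schema node with all 'nullable' flags removed."""
    if isinstance(node, dict):
        return {k: strip_nullable(v) for k, v in node.items() if k != "nullable"}
    if isinstance(node, list):
        return [strip_nullable(n) for n in node]
    return node

def same_schema_ignoring_nullability(s1, s2):
    return strip_nullable(s1) == strip_nullable(s2)

# Toy encoding of a nested field: array<struct<e: int>>
s_strict = {"name": "d", "nullable": False,
            "type": {"element": [{"name": "e", "type": "int", "nullable": False}]}}
s_loose = {"name": "d", "nullable": True,
           "type": {"element": [{"name": "e", "type": "int", "nullable": True}]}}

print(same_schema_ignoring_nullability(s_strict, s_loose))  # True
```

The normalization has to recurse into nested array/struct elements, which is exactly where the reported join-key type mismatch arose.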
[jira] [Resolved] (SPARK-32721) Simplify if clauses with null and boolean
[ https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-32721. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29567 [https://github.com/apache/spark/pull/29567] > Simplify if clauses with null and boolean > - > > Key: SPARK-32721 > URL: https://issues.apache.org/jira/browse/SPARK-32721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.1.0 > > > The following if clause: > {code:sql} > if(p, null, false) > {code} > can be simplified to: > {code:sql} > and(p, null) > {code} > And similarly, the following clause: > {code:sql} > if(p, null, true) > {code} > can be simplified to: > {code:sql} > or(not(p), null) > {code} > iff predicate {{p}} is deterministic, i.e., can be evaluated to either true > or false, but not null. > {{and}} and {{or}} clauses are more optimization friendly. For instance, by > converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can > potentially push the filter {{col > 42}} down to data sources to avoid > unnecessary IO. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
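The rewrite described above can be checked mechanically in SQL's three-valued (Kleene) logic. The sketch below models the semantics in plain Python with None standing in for SQL NULL; it is illustrative, not Spark code.

```python
# Three-valued (SQL/Kleene) logic check of the rewrites
#   if(p, null, false) -> and(p, null)
#   if(p, null, true)  -> or(not(p), null)
# under the stated assumption that p is deterministic and non-null.

def sql_and(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def sql_or(a, b):
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False

def sql_not(a):
    return None if a is None else (not a)

def sql_if(p, then_val, else_val):
    # p is assumed non-null here, matching the issue's precondition
    return then_val if p else else_val

for p in (True, False):
    assert sql_if(p, None, False) == sql_and(p, None)
    assert sql_if(p, None, True) == sql_or(sql_not(p), None)
```

For p = NULL the two sides would differ (if() falls through to the else branch while and(NULL, NULL) stays NULL), which is why the rewrite is restricted to predicates that cannot evaluate to null.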
[jira] [Assigned] (SPARK-32721) Simplify if clauses with null and boolean
[ https://issues.apache.org/jira/browse/SPARK-32721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-32721: --- Assignee: Chao Sun > Simplify if clauses with null and boolean > - > > Key: SPARK-32721 > URL: https://issues.apache.org/jira/browse/SPARK-32721 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > > The following if clause: > {code:sql} > if(p, null, false) > {code} > can be simplified to: > {code:sql} > and(p, null) > {code} > And similarly, the following clause: > {code:sql} > if(p, null, true) > {code} > can be simplified to: > {code:sql} > or(not(p), null) > {code} > iff predicate {{p}} is deterministic, i.e., can be evaluated to either true > or false, but not null. > {{and}} and {{or}} clauses are more optimization friendly. For instance, by > converting {{if(col > 42, null, false)}} to {{and(col > 42, null)}}, we can > potentially push the filter {{col > 42}} down to data sources to avoid > unnecessary IO. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32119) ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
[ https://issues.apache.org/jira/browse/SPARK-32119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187980#comment-17187980 ] Luca Canali commented on SPARK-32119: - I believe it would be good to have this fix also in the upcoming Spark 3.0.1, if possible. Any thoughts? > ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars > -- > > Key: SPARK-32119 > URL: https://issues.apache.org/jira/browse/SPARK-32119 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > Fix For: 3.1.0 > > > ExecutorPlugin can't work with Standalone Cluster or Kubernetes when a jar containing the plugins, together with files used by the plugins, is added via the --jars and --files options of spark-submit. > This is because jars and files added by --jars and --files are not loaded at executor initialization. > I confirmed it works with YARN because there the jars/files are distributed via the distributed cache. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32624) Replace getClass.getName with getClass.getCanonicalName in CodegenContext.addReferenceObj
[ https://issues.apache.org/jira/browse/SPARK-32624?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187974#comment-17187974 ] Apache Spark commented on SPARK-32624: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/29602 > Replace getClass.getName with getClass.getCanonicalName in > CodegenContext.addReferenceObj > - > > Key: SPARK-32624 > URL: https://issues.apache.org/jira/browse/SPARK-32624 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > {code:java} > scala> Array[Byte](1, 2).getClass.getName > res13: String = [B > scala> Array[Byte](1, 2).getClass.getCanonicalName > res14: String = byte[] > {code} > {{[B}} is not a correct Java type. We should use {{byte[]}}; otherwise we will > hit a compile issue: > {noformat} > 20:49:54.885 ERROR > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: > ... > /* 029 */ if (!isNull_2) { > /* 030 */ value_1 = > org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) > references[0] /* min */)) >= 0 && > org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) > references[1] /* max */)) <= 0 ).mightContainBinary(value_2); > ... 
> 20:49:54.886 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr > codegen error and falling back to interpreter mode > java.util.concurrent.ExecutionException: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 30, Column 81: failed to compile: > org.codehaus.commons.compiler.CompileException: File 'generated.java', Line > 30, Column 81: Unexpected token "[" in primary > at > com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:306) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
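The failure mode is easy to model outside the JVM: the class name is spliced verbatim into the generated Java source as a cast, and the runtime name {{[B}} is not valid Java syntax while the canonical name {{byte[]}} is. The sketch below is a deliberate simplification (the template string and helper name are made up, not the real addReferenceObj code).

```python
# Sketch: generated code casts a stored reference to its declared type by
# string substitution. Substituting the JVM runtime name of byte[] ("[B")
# produces text that is not valid Java; the canonical name "byte[]" is fine.

def cast_expr(java_type, idx):
    """Hypothetical stand-in for the cast emitted by codegen."""
    return f"(({java_type}) references[{idx}])"

print(cast_expr("[B", 0))      # (([B) references[0])     -> "Unexpected token '['"
print(cast_expr("byte[]", 0))  # ((byte[]) references[0]) -> valid Java cast
```

This is why the fix swaps getName (which yields JVM descriptors like {{[B}} for arrays) for getCanonicalName in the generated source.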
[jira] [Updated] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-32756: - Priority: Minor (was: Major) > Fix CaseInsensitiveMap in Scala 2.13 > > > Key: SPARK-32756 > URL: https://issues.apache.org/jira/browse/SPARK-32756 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karol Chmist >Priority: Minor > > > "Spark SQL" module doesn't compile in Scala 2.13: > {code:java} > [info] Compiling 26 Scala sources to > /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes... > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+(key.$minus$greater(value)) > [error] this.extraOptions += (key -> value) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+(key.$minus$greater(value)) > [error] this.extraOptions += (key -> value) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294: > value += is not a member of > 
org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+("path".$minus$greater(path)) > [error] Error occurred in an application involving default arguments. > [error] this.extraOptions += ("path" -> path) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317: > type mismatch; > [error] found : Iterable[(String, String)] > [error] required: java.util.Map[String,String] > [error] Error occurred in an application involving default arguments. > [error] val dsOptions = new CaseInsensitiveStringMap(options.asJava) > [error] ^ > [info] Iterable[(String, String)] <: java.util.Map[String,String]? > [info] false > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: DataFrameWriter.this.extraOptions = > DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns))) > [error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY -> > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85: > type mismatch; > [error] found : > scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] > [error] 
required: Map[String,OrcFiltersBase.this.OrcPrimitiveField] > [error] CaseInsensitiveMap(dedupPrimitiveFields) > ^ > [info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] > <: Map[String,OrcFiltersBase.this.OrcPrimitiveField]? > [info] false > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64: > type mismatch; > [error] found : Iterable[(String, String)] > [error] required: java.util.Map[String,String] > [error] new CaseInsensitiveStringMap(withoutPath.asJava) > [error] ^ > [info] Iterable[(String, String)] <: java.util.Map[String,String]? > [error] 7 errors found{code} > > The {{+}} function in CaseInsensitiveMap is missing, resulting in the {{+}} from > the standard Map being used, which
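The shape of the bug is familiar from any collection hierarchy: an inherited merge operation returns the base type and silently drops the subclass's case-insensitive behaviour. A Python analogy (illustrative only, not the Scala code): on Python 3.9+, dict's {{|}} likewise returns a plain dict for subclasses, and the fix mirrors overriding {{+}} so the merge rebuilds the subclass.

```python
# Analogy for CaseInsensitiveMap under Scala 2.13: merging via an inherited
# operation yields the base collection type, losing case-insensitivity.

class Broken(dict):
    """A dict subclass that does not override __or__."""

# dict's | returns a plain dict even for subclasses (Python >= 3.9):
assert type(Broken({"a": 1}) | {"b": 2}) is dict

class CaseInsensitiveDict(dict):
    def __init__(self, data=()):
        super().__init__((k.lower(), v) for k, v in dict(data).items())

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

    # The fix: make the merge return the subclass again, as overriding
    # + on CaseInsensitiveMap does.
    def __or__(self, other):
        return CaseInsensitiveDict({**self, **dict(other)})

m = CaseInsensitiveDict({"Path": "/tmp"})
merged = m | {"Mode": "append"}
print(type(merged).__name__)  # CaseInsensitiveDict, not plain dict
print(merged["PATH"], merged["mode"])
```

Without the override, every {{+=}} call site would quietly degrade the map to a case-sensitive one, which is exactly the compile-time mismatch the log above reports.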
[jira] [Commented] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187934#comment-17187934 ] Sean R. Owen commented on SPARK-32756: -- Meh, maybe not. We aren't going to make this change for 3.0.x. OK leave it > Fix CaseInsensitiveMap in Scala 2.13 > > > Key: SPARK-32756 > URL: https://issues.apache.org/jira/browse/SPARK-32756 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karol Chmist >Priority: Major > > > "Spark SQL" module doesn't compile in Scala 2.13: > {code:java} > [info] Compiling 26 Scala sources to > /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes... > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+(key.$minus$greater(value)) > [error] this.extraOptions += (key -> value) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+(key.$minus$greater(value)) > [error] this.extraOptions += (key -> value) > [error] ^ > [error] > 
/home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+("path".$minus$greater(path)) > [error] Error occurred in an application involving default arguments. > [error] this.extraOptions += ("path" -> path) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317: > type mismatch; > [error] found : Iterable[(String, String)] > [error] required: java.util.Map[String,String] > [error] Error occurred in an application involving default arguments. > [error] val dsOptions = new CaseInsensitiveStringMap(options.asJava) > [error] ^ > [info] Iterable[(String, String)] <: java.util.Map[String,String]? 
> [info] false > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: DataFrameWriter.this.extraOptions = > DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns))) > [error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY -> > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85: > type mismatch; > [error] found : > scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] > [error] required: Map[String,OrcFiltersBase.this.OrcPrimitiveField] > [error] CaseInsensitiveMap(dedupPrimitiveFields) > [error] ^ > [info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] > <: Map[String,OrcFiltersBase.this.OrcPrimitiveField]? > [info] false > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64: > type mismatch; > [error] found : Iterable[(String, String)] > [error] required: java.util.Map[String,String] > [error] new CaseInsensitiveStringMap(withoutPath.asJava) > [error] ^ > [info] Iterable[(String, String)] <: java.util.Map[String,String]? > [error] 7 errors found{code} > > The + function in
[jira] [Commented] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187933#comment-17187933 ] Sean R. Owen commented on SPARK-32756: -- This should just be a followup for SPARK-32364 > Fix CaseInsensitiveMap in Scala 2.13 > > > Key: SPARK-32756 > URL: https://issues.apache.org/jira/browse/SPARK-32756 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karol Chmist >Priority: Major > > > "Spark SQL" module doesn't compile in Scala 2.13: > {code:java} > [info] Compiling 26 Scala sources to > /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes... > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+(key.$minus$greater(value)) > [error] this.extraOptions += (key -> value) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+(key.$minus$greater(value)) > [error] this.extraOptions += (key -> value) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294: > 
value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: this.extraOptions = > this.extraOptions.+("path".$minus$greater(path)) > [error] Error occurred in an application involving default arguments. > [error] this.extraOptions += ("path" -> path) > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317: > type mismatch; > [error] found : Iterable[(String, String)] > [error] required: java.util.Map[String,String] > [error] Error occurred in an application involving default arguments. > [error] val dsOptions = new CaseInsensitiveStringMap(options.asJava) > [error] ^ > [info] Iterable[(String, String)] <: java.util.Map[String,String]? 
> [info] false > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412: > value += is not a member of > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] Expression does not convert to assignment because: > [error] type mismatch; > [error] found : scala.collection.immutable.Map[String,String] > [error] required: > org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] > [error] expansion: DataFrameWriter.this.extraOptions = > DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns))) > [error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY -> > [error] ^ > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85: > type mismatch; > [error] found : > scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] > [error] required: Map[String,OrcFiltersBase.this.OrcPrimitiveField] > [error] CaseInsensitiveMap(dedupPrimitiveFields) > [error] ^ > [info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] > <: Map[String,OrcFiltersBase.this.OrcPrimitiveField]? > [info] false > [error] > /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64: > type mismatch; > [error] found : Iterable[(String, String)] > [error] required: java.util.Map[String,String] > [error] new CaseInsensitiveStringMap(withoutPath.asJava) > [error] ^ > [info] Iterable[(String, String)] <: java.util.Map[String,String]? > [error] 7 errors found{code} > > The + function in CaseInsensitiveStringMap missing,
[jira] [Commented] (SPARK-32728) Using groupby with rand creates different values when joining table with itself
[ https://issues.apache.org/jira/browse/SPARK-32728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187932#comment-17187932 ] Sean R. Owen commented on SPARK-32728: -- I think this is to be expected: you are going to get different values every time x is evaluated. It is not somehow fixed at the time you declare the column. I would expect it, however, to work as you are expecting if you materialize it with cache() + count() for example. Even then there are circumstances where it's reevaluated. > Using groupby with rand creates different values when joining table with > itself > --- > > Key: SPARK-32728 > URL: https://issues.apache.org/jira/browse/SPARK-32728 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 > Environment: I tested it with Azure Databricks 7.2 (& 6.6) (includes > Apache Spark 3.0.0 (& 2.4.5), Scala 2.12 (& 2.11)) > Worker type: Standard_DS3_v2 (2 workers) > >Reporter: Joachim Bargsten >Priority: Minor > > When running following query in a python3 notebook on a cluster with > *multiple workers (>1)*, the result is not 0.0, even though I would expect it > to be. > {code:java} > import pyspark.sql.functions as F > sdf = spark.range(100) > sdf = ( > sdf.withColumn("a", F.col("id") + 1) > .withColumn("b", F.col("id") + 2) > .withColumn("c", F.col("id") + 3) > .withColumn("d", F.col("id") + 4) > .withColumn("e", F.col("id") + 5) > ) > sdf = sdf.groupby(["a", "b", "c", "d"]).agg(F.sum("e").alias("e")) > sdf = sdf.withColumn("x", F.rand() * F.col("e")) > sdf2 = sdf.join(sdf.withColumnRenamed("x", "xx"), ["a", "b", "c", "d"]) > sdf2 = sdf2.withColumn("delta_x", F.abs(F.col('x') - > F.col("xx"))).agg(F.sum("delta_x")) > sum_delta_x = sdf2.head()[0] > print(f"{sum_delta_x} should be 0.0") > assert abs(sum_delta_x) < 0.001 > {code} > If the groupby statement is commented out, the code is working as expected. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
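Sean's explanation above can be modelled without Spark: an uncached column defined with rand() behaves like an expression that is re-executed for each side of the self-join, while cache() + count() behaves like computing the values once and reusing them. An illustrative pure-Python sketch (not Spark code):

```python
# Model: lazy vs. materialized evaluation of a random column in a self-join.
import random

rows = [(a, a + 4) for a in range(1, 6)]  # (group key, e) pairs

def x_column():
    # recomputed on every call, like an uncached rand() * e column
    return {k: random.random() * e for k, e in rows}

# Self-join with lazy evaluation: each side evaluates the expression anew,
# so the two sides see independent draws.
left, right = x_column(), x_column()
delta_lazy = sum(abs(left[k] - right[k]) for k, _ in rows)

# cache() + count() analogue: evaluate once, reuse on both sides.
cached = x_column()
delta_cached = sum(abs(cached[k] - cached[k]) for k, _ in rows)

print(delta_lazy > 0)       # almost surely True
print(delta_cached == 0.0)  # True
```

As the comment notes, even a cached plan can be re-evaluated in some circumstances (e.g. on executor loss), so the only fully reliable approach is to avoid depending on a non-deterministic expression being computed exactly once.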
[jira] [Assigned] (SPARK-32757) Physical InSubqueryExec should be consistent with logical InSubquery
[ https://issues.apache.org/jira/browse/SPARK-32757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32757: Assignee: Wenchen Fan (was: Apache Spark) > Physical InSubqueryExec should be consistent with logical InSubquery > > > Key: SPARK-32757 > URL: https://issues.apache.org/jira/browse/SPARK-32757 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32757) Physical InSubqueryExec should be consistent with logical InSubquery
[ https://issues.apache.org/jira/browse/SPARK-32757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187929#comment-17187929 ] Apache Spark commented on SPARK-32757: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29601 > Physical InSubqueryExec should be consistent with logical InSubquery > > > Key: SPARK-32757 > URL: https://issues.apache.org/jira/browse/SPARK-32757 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32757) Physical InSubqueryExec should be consistent with logical InSubquery
[ https://issues.apache.org/jira/browse/SPARK-32757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32757: Assignee: Apache Spark (was: Wenchen Fan) > Physical InSubqueryExec should be consistent with logical InSubquery > > > Key: SPARK-32757 > URL: https://issues.apache.org/jira/browse/SPARK-32757 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32757) Physical InSubqueryExec should be consistent with logical InSubquery
[ https://issues.apache.org/jira/browse/SPARK-32757?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187931#comment-17187931 ] Apache Spark commented on SPARK-32757: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/29601 > Physical InSubqueryExec should be consistent with logical InSubquery > > > Key: SPARK-32757 > URL: https://issues.apache.org/jira/browse/SPARK-32757 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32752) Alias breaks for interval typed literals
[ https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187928#comment-17187928 ] Wenchen Fan commented on SPARK-32752: - [~Qin Yao] do you have any idea to fix it? > Alias breaks for interval typed literals > > > Key: SPARK-32752 > URL: https://issues.apache.org/jira/browse/SPARK-32752 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Cases we found: > {code:java} > +-- !query > +select interval '1 day' as day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +no viable alternative at input 'as'(line 1, pos 24) > + > +== SQL == > +select interval '1 day' as day > +^^^ > + > + > +-- !query > +select interval '1 day' day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, > pos 16) > + > +== SQL == > +select interval '1 day' day > +^^^ > + > + > +-- !query > +select interval '1-2' year as y > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16) > + > +== SQL == > +select interval '1-2' year as y > +^^^ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32757) Physical InSubqueryExec should be consistent with logical InSubquery
Wenchen Fan created SPARK-32757: --- Summary: Physical InSubqueryExec should be consistent with logical InSubquery Key: SPARK-32757 URL: https://issues.apache.org/jira/browse/SPARK-32757 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected
[ https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187902#comment-17187902 ] Apache Spark commented on SPARK-32317: -- User 'izchen' has created a pull request for this issue: https://github.com/apache/spark/pull/29600 > Parquet file loading with different schema(Decimal(N, P)) in files is not > working as expected > - > > Key: SPARK-32317 > URL: https://issues.apache.org/jira/browse/SPARK-32317 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 > Environment: It's failing in all environments that I tried. >Reporter: Krish >Priority: Major > Labels: easyfix > Original Estimate: 24h > Remaining Estimate: 24h > > Hi, > > We generate Parquet files partitioned on Date on a daily basis, and we sometimes send updates to historical data. What we noticed is that, due to a configuration error, the patch data's schema is inconsistent with the earlier files. > Assume we had files generated with a schema having ID and AMOUNT as fields: the historical data has a schema like ID INT, AMOUNT DECIMAL(15,6), while the files we send as updates have AMOUNT as DECIMAL(15,2). > > With two different schemas in a Date partition, when we load the data of a Date into Spark, the data loads but the amounts get manipulated.
> > file1.snappy.parquet > ID: INT > AMOUNT: DECIMAL(15,6) > Content: > 1,19500.00 > 2,198.34 > file2.snappy.parquet > ID: INT > AMOUNT: DECIMAL(15,2) > Content: > 1,19500.00 > 3,198.34 > Load these two files together: > df3 = spark.read.parquet("output/") > df3.show() # we can see the amount getting manipulated here:
+---+---------+
| ID|   AMOUNT|
+---+---------+
|  1|     1.95|
|  3| 0.019834|
|  1| 19500.00|
|  2|   198.34|
+---+---------+
> Options tried: > We tried to give the schema as String for all fields, but that didn't work: > df3 = spark.read.format("parquet").schema(schema).load("output/") > Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet > column cannot be converted in file file*.snappy.parquet. Column: > [AMOUNT], Expected: string, Found: INT64" > > I know schema merging works if it finds a few extra columns in one file, but the fields the files have in common need to have the same schema, so that might not work here. > > Looking for a workaround here; if there is an option I haven't tried, please point me to it. > > With schema merging I got the error below: > An error occurred while calling o2272.parquet. 
: > org.apache.spark.SparkException: Failed merging schema: root |-- ID: string > (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100) > at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107) > at > org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69) > at scala.Option.orElse(Option.scala:447) at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:225) > at scala.Option.map(Option.scala:230) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:206) at > 
org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:674) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
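The manipulated values above line up exactly with a scale mix-up: Parquet stores a DECIMAL as an unscaled integer plus a scale, and decoding file2's values (written with scale 2) using file1's scale of 6 yields precisely 1.95 and 0.019834. A standalone sketch of that decoding (plain Python with the stdlib decimal module, no Spark involved; the unscaled values are derived from the report's data):

```python
from decimal import Decimal

def decode_unscaled(unscaled: int, scale: int) -> Decimal:
    # Parquet stores a DECIMAL as an unscaled integer plus a scale:
    # value = unscaled * 10**(-scale)
    return Decimal(unscaled).scaleb(-scale)

# file2's values were written with DECIMAL(15,2):
#   19500.00 -> unscaled 1950000,  198.34 -> unscaled 19834
for unscaled in (1950000, 19834):
    correct = decode_unscaled(unscaled, 2)  # scale the writer actually used
    wrong = decode_unscaled(unscaled, 6)    # scale taken from the other file's footer
    print(correct, wrong)  # 19500.00 1.950000, then 198.34 0.019834
```

The "wrong" column matches the 1.95 and 0.019834 rows in df3.show() above, which suggests the bytes are read correctly and only the scale is misapplied.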
[jira] [Assigned] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected
[ https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32317: Assignee: Apache Spark > Parquet file loading with different schema(Decimal(N, P)) in files is not > working as expected > - > > Key: SPARK-32317 > URL: https://issues.apache.org/jira/browse/SPARK-32317 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 > Environment: It's failing in all environments that I tried. >Reporter: Krish >Assignee: Apache Spark >Priority: Major > Labels: easyfix > Original Estimate: 24h > Remaining Estimate: 24h > > Hi, > > We generate parquet files partitioned on Date on a daily basis, and > we sometimes send updates to historical data; what we noticed is that, due to a > configuration error, the patch data schema is inconsistent with the earlier files. > Assume we had files generated with a schema having ID and Amount as fields. > Historical data has a schema like ID INT, AMOUNT DECIMAL(15,6), and the > files we send as updates have a schema like DECIMAL(15,2). > > With two different schemas in a Date partition, when we load the data of > a Date into Spark, the data loads but the amount gets > manipulated. > > file1.snappy.parquet > ID: INT > AMOUNT: DECIMAL(15,6) > Content: > 1,19500.00 > 2,198.34 > file2.snappy.parquet > ID: INT > AMOUNT: DECIMAL(15,2) > Content: > 1,19500.00 > 3,198.34 > Load these two files together > df3 = spark.read.parquet("output/") > df3.show() # we can see the amount getting manipulated here > +-+---+ > |ID| AMOUNT| > +-+---+ > |1|1.95| > |3|0.019834| > |1|19500.00| > |2|198.34| > +-+---+ > Options tried: > We tried to give the schema as String for all fields, but that didn't work > df3 = spark.read.format("parquet").schema(schema).load("output/") > Error: "org.apache.spark.sql.execution.QueryExecutionException: Parquet > column cannot be converted in file file*.snappy.parquet. Column: > [AMOUNT], Expected: string, Found: INT64" > > I know merge schema works if it finds a few extra columns in one file, but the > fields which are in common need to have the same schema. That might not work > here. > > Looking for a workaround here. Or if there is an option which I > haven't tried, you can point me to that. > > With schema merging I got the below error: > An error occurred while calling o2272.parquet. : > org.apache.spark.SparkException: Failed merging schema: root |-- ID: string > (nullable = true) |-- AMOUNT: decimal(15,6) (nullable = true) at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5(SchemaMergeUtils.scala:100) > at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$5$adapted(SchemaMergeUtils.scala:95) > at > scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at > scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.mergeSchemasInParallel(SchemaMergeUtils.scala:95) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.mergeSchemasInParallel(ParquetFileFormat.scala:485) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetUtils$.inferSchema(ParquetUtils.scala:107) > at > org.apache.spark.sql.execution.datasources.v2.parquet.ParquetTable.inferSchema(ParquetTable.scala:44) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.$anonfun$dataSchema$4(FileTable.scala:69) > at scala.Option.orElse(Option.scala:447) at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema$lzycompute(FileTable.scala:69) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.dataSchema(FileTable.scala:63) > at > org.apache.spark.sql.execution.datasources.v2.FileTable.schema$lzycompute(FileTable.scala:82) > at > 
org.apache.spark.sql.execution.datasources.v2.FileTable.schema(FileTable.scala:80) > at > org.apache.spark.sql.execution.datasources.v2.DataSourceV2Relation$.create(DataSourceV2Relation.scala:141) > at > org.apache.spark.sql.DataFrameReader.$anonfun$load$1(DataFrameReader.scala:225) > at scala.Option.map(Option.scala:230) at > org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:206) at > org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:674) at > sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at >
[jira] [Assigned] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected
[ https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32317: Assignee: (was: Apache Spark) > Parquet file loading with different schema(Decimal(N, P)) in files is not > working as expected > - > > Key: SPARK-32317 > URL: https://issues.apache.org/jira/browse/SPARK-32317 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Krish >Priority: Major > Labels: easyfix -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32317) Parquet file loading with different schema(Decimal(N, P)) in files is not working as expected
[ https://issues.apache.org/jira/browse/SPARK-32317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187900#comment-17187900 ] Apache Spark commented on SPARK-32317: -- User 'izchen' has created a pull request for this issue: https://github.com/apache/spark/pull/29600 > Parquet file loading with different schema(Decimal(N, P)) in files is not > working as expected > - > > Key: SPARK-32317 > URL: https://issues.apache.org/jira/browse/SPARK-32317 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Krish >Priority: Major > Labels: easyfix -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karol Chmist updated SPARK-32756: - Description: "Spark SQL" module doesn't compile in Scala 2.13: {code:java} [info] Compiling 26 Scala sources to /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes... [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121: value += is not a member of org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] Expression does not convert to assignment because: [error] type mismatch; [error] found : scala.collection.immutable.Map[String,String] [error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] expansion: this.extraOptions = this.extraOptions.+(key.$minus$greater(value)) [error] this.extraOptions += (key -> value) [error] ^ [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132: value += is not a member of org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] Expression does not convert to assignment because: [error] type mismatch; [error] found : scala.collection.immutable.Map[String,String] [error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] expansion: this.extraOptions = this.extraOptions.+(key.$minus$greater(value)) [error] this.extraOptions += (key -> value) [error] ^ [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294: value += is not a member of org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] Expression does not convert to assignment because: [error] type mismatch; [error] found : scala.collection.immutable.Map[String,String] [error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] expansion: this.extraOptions = 
this.extraOptions.+("path".$minus$greater(path)) [error] Error occurred in an application involving default arguments. [error] this.extraOptions += ("path" -> path) [error] ^ [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317: type mismatch; [error] found : Iterable[(String, String)] [error] required: java.util.Map[String,String] [error] Error occurred in an application involving default arguments. [error] val dsOptions = new CaseInsensitiveStringMap(options.asJava) [error] ^ [info] Iterable[(String, String)] <: java.util.Map[String,String]? [info] false [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412: value += is not a member of org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] Expression does not convert to assignment because: [error] type mismatch; [error] found : scala.collection.immutable.Map[String,String] [error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] expansion: DataFrameWriter.this.extraOptions = DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns))) [error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY -> [error] ^ [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85: type mismatch; [error] found : scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] [error] required: Map[String,OrcFiltersBase.this.OrcPrimitiveField] [error] CaseInsensitiveMap(dedupPrimitiveFields) [error] ^ [info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] <: Map[String,OrcFiltersBase.this.OrcPrimitiveField]? 
[info] false [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64: type mismatch; [error] found : Iterable[(String, String)] [error] required: java.util.Map[String,String] [error] new CaseInsensitiveStringMap(withoutPath.asJava) [error] ^ [info] Iterable[(String, String)] <: java.util.Map[String,String]? [error] 7 errors found{code} The {{+}} function is missing in CaseInsensitiveMap, so the {{+}} inherited from the standard Map is used, which returns a Map instead of a CaseInsensitiveMap. was: [info] Compiling 26 Scala sources to /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes... [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121: value += is not a member of org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] Expression does not convert to assignment because: [error] type mismatch; [error] found : scala.collection.immutable.Map[String,String] [error] required:
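The root cause stated above (an inherited {{+}} that returns the base Map type, silently discarding the case-insensitive wrapper) has a direct analogue in other languages: a dict subclass whose merge operator must be overridden to preserve the subclass. A toy Python sketch of the same shape of bug and fix (illustrative only, not Spark's Scala implementation; requires Python 3.9+ for the dict `|` operator):

```python
class CaseInsensitiveMap(dict):
    """Toy analogue of Spark's CaseInsensitiveMap: keys match case-insensitively."""

    def __init__(self, data=None):
        super().__init__()
        for k, v in (data or {}).items():
            self[k] = v

    def __setitem__(self, key, value):
        super().__setitem__(key.lower(), value)

    def __getitem__(self, key):
        return super().__getitem__(key.lower())

    # Without this override, `self | other` would fall back to dict's merge and
    # return a plain dict, losing case-insensitivity -- the same shape as the
    # Scala 2.13 regression, where Map's `+` returned a plain Map.
    def __or__(self, other):
        merged = CaseInsensitiveMap(self)
        for k, v in dict(other).items():
            merged[k] = v
        return merged

m = CaseInsensitiveMap({"Path": "/tmp/x"})
m2 = m | {"FORMAT": "parquet"}
print(type(m2).__name__, m2["path"], m2["format"])
```

The compile errors above arise precisely because `extraOptions += (key -> value)` expands to `extraOptions = extraOptions + (key -> value)`, and in Scala 2.13 the result of the inherited `+` is no longer assignable back to the `CaseInsensitiveMap[String]` field.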
[jira] [Commented] (SPARK-18067) SortMergeJoin adds shuffle if join predicates have non partitioned columns
[ https://issues.apache.org/jira/browse/SPARK-18067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187897#comment-17187897 ] Andy Van Yperen-De Deyne commented on SPARK-18067: -- I am facing this performance issue when using Spark 3.0.0. This topic was resolved on the 21st of May 2019; however, the PR on GitHub is closed... Is there any chance this solution is going to be merged into the master branch? Do you have a timeline for that? (Maybe this is not the right place to ask this, in which case I apologize.) > SortMergeJoin adds shuffle if join predicates have non partitioned columns > -- > > Key: SPARK-18067 > URL: https://issues.apache.org/jira/browse/SPARK-18067 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.1 >Reporter: Paul Jones >Priority: Minor > Labels: bulk-closed > > Basic setup > {code} > scala> case class Data1(key: String, value1: Int) > scala> case class Data2(key: String, value2: Int) > scala> val partition1 = sc.parallelize(1 to 10).map(x => Data1(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > scala> val partition2 = sc.parallelize(1 to 10).map(x => Data2(s"$x", x)) > .toDF.repartition(col("key")).sortWithinPartitions(col("key")).cache > {code} > Join on key > {code} > scala> partition1.join(partition2, "key").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [key#0], [key#12] >:- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we get a super efficient join with no shuffle. > But if we add a filter, our join gets less efficient and we end up with a > shuffle. 
> {code} > scala> partition1.join(partition2, "key").filter($"value1" === > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- SortMergeJoin [value1#1,key#0], [value2#13,key#12] >:- Sort [value1#1 ASC,key#0 ASC], false, 0 >: +- TungstenExchange hashpartitioning(value1#1,key#0,200), None >: +- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None >+- Sort [value2#13 ASC,key#12 ASC], false, 0 > +- TungstenExchange hashpartitioning(value2#13,key#12,200), None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > And we can avoid the shuffle if use a filter statement that can't be pushed > in the join. > {code} > scala> partition1.join(partition2, "key").filter($"value1" >= > $"value2").explain > == Physical Plan == > Project [key#0,value1#1,value2#13] > +- Filter (value1#1 >= value2#13) >+- SortMergeJoin [key#0], [key#12] > :- InMemoryColumnarTableScan [key#0,value1#1], InMemoryRelation > [key#0,value1#1], true, 1, StorageLevel(true, true, false, true, 1), Sort > [key#0 ASC], false, 0, None > +- InMemoryColumnarTableScan [key#12,value2#13], InMemoryRelation > [key#12,value2#13], true, 1, StorageLevel(true, true, false, true, 1), > Sort [key#12 ASC], false, 0, None > {code} > What's the best way to avoid the filter pushdown here?? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
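The extra TungstenExchange in the plans above appears because data partitioned on key alone does not, in general, satisfy a required distribution over (value, key): the two hash layouts assign rows to different partitions. A toy stand-in for that check (plain Python; the partition count and the crc32 hash are illustrative, not Spark's Murmur3 hashpartitioning):

```python
import zlib

def partition(cols, num_partitions=4):
    # Toy stand-in for Spark's hashpartitioning(cols, n); crc32 over a repr
    # keeps the sketch deterministic across runs.
    return zlib.crc32(repr(cols).encode()) % num_partitions

rows = [("1", 1), ("2", 2), ("3", 3)]  # (key, value) pairs

by_key = [partition((k,)) for k, v in rows]          # layout after repartition(col("key"))
by_value_key = [partition((v, k)) for k, v in rows]  # layout the join now requires
# The two layouts generally place rows in different partitions, so the data
# must be exchanged even though it is already partitioned by key.
print(by_key, by_value_key)
```

This is why the non-equality filter (`>=`) in the last plan avoids the shuffle: it stays a plain Filter above the join instead of being folded into the join keys.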
[jira] [Updated] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karol Chmist updated SPARK-32756: - Description: [info] Compiling 26 Scala sources to /home/karol/workspace/open-source/spark/sql/core/target/scala-2.13/classes... [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:121: value += is not a member of org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] Expression does not convert to assignment because: [error] type mismatch; [error] found : scala.collection.immutable.Map[String,String] [error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] expansion: this.extraOptions = this.extraOptions.+(key.$minus$greater(value)) [error] this.extraOptions += (key -> value) [error] ^ [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:132: value += is not a member of org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] Expression does not convert to assignment because: [error] type mismatch; [error] found : scala.collection.immutable.Map[String,String] [error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] expansion: this.extraOptions = this.extraOptions.+(key.$minus$greater(value)) [error] this.extraOptions += (key -> value) [error] ^ [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:294: value += is not a member of org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] Expression does not convert to assignment because: [error] type mismatch; [error] found : scala.collection.immutable.Map[String,String] [error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] expansion: this.extraOptions = this.extraOptions.+("path".$minus$greater(path)) [error] Error occurred in an application 
involving default arguments. [error] this.extraOptions += ("path" -> path) [error] ^ [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:317: type mismatch; [error] found : Iterable[(String, String)] [error] required: java.util.Map[String,String] [error] Error occurred in an application involving default arguments. [error] val dsOptions = new CaseInsensitiveStringMap(options.asJava) [error] ^ [info] Iterable[(String, String)] <: java.util.Map[String,String]? [info] false [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala:412: value += is not a member of org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] Expression does not convert to assignment because: [error] type mismatch; [error] found : scala.collection.immutable.Map[String,String] [error] required: org.apache.spark.sql.catalyst.util.CaseInsensitiveMap[String] [error] expansion: DataFrameWriter.this.extraOptions = DataFrameWriter.this.extraOptions.+(DataSourceUtils.PARTITIONING_COLUMNS_KEY.$minus$greater(DataSourceUtils.encodePartitioningColumns(columns))) [error] extraOptions += (DataSourceUtils.PARTITIONING_COLUMNS_KEY -> [error] ^ [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala:85: type mismatch; [error] found : scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] [error] required: Map[String,OrcFiltersBase.this.OrcPrimitiveField] [error] CaseInsensitiveMap(dedupPrimitiveFields) [error] ^ [info] scala.collection.MapView[String,OrcFiltersBase.this.OrcPrimitiveField] <: Map[String,OrcFiltersBase.this.OrcPrimitiveField]? 
[info] false [error] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileDataSourceV2.scala:64: type mismatch; [error] found : Iterable[(String, String)] [error] required: java.util.Map[String,String] [error] new CaseInsensitiveStringMap(withoutPath.asJava) [error] ^ [info] Iterable[(String, String)] <: java.util.Map[String,String]? [info] false [warn] /home/karol/workspace/open-source/spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/jdbc/JDBCTableCatalog.scala:64: method without a parameter list overrides a method with a single empty one [warn] def next = Identifier.of(namespace, rs.getString("TABLE_NAME")) [warn] ^ [warn] one warning found [error] 7 errors found > Fix CaseInsensitiveMap in Scala 2.13 > > > Key: SPARK-32756 > URL: https://issues.apache.org/jira/browse/SPARK-32756 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karol Chmist >Priority: Major > > [info] Compiling 26 Scala sources
[jira] [Assigned] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32756: Assignee: (was: Apache Spark) > Fix CaseInsensitiveMap in Scala 2.13 > > > Key: SPARK-32756 > URL: https://issues.apache.org/jira/browse/SPARK-32756 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karol Chmist >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32756: Assignee: Apache Spark > Fix CaseInsensitiveMap in Scala 2.13 > > > Key: SPARK-32756 > URL: https://issues.apache.org/jira/browse/SPARK-32756 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karol Chmist >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13
[ https://issues.apache.org/jira/browse/SPARK-32756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187865#comment-17187865 ] Apache Spark commented on SPARK-32756: -- User 'karolchmist' has created a pull request for this issue: https://github.com/apache/spark/pull/29584 > Fix CaseInsensitiveMap in Scala 2.13 > > > Key: SPARK-32756 > URL: https://issues.apache.org/jira/browse/SPARK-32756 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Karol Chmist >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32756) Fix CaseInsensitiveMap in Scala 2.13
Karol Chmist created SPARK-32756: Summary: Fix CaseInsensitiveMap in Scala 2.13 Key: SPARK-32756 URL: https://issues.apache.org/jira/browse/SPARK-32756 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Karol Chmist -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32750) Add code-gen for SortAggregateExec
[ https://issues.apache.org/jira/browse/SPARK-32750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187860#comment-17187860 ] Cheng Su commented on SPARK-32750: -- [~maropu] - thanks a lot for pointing out the existing context. Will take a look. Thanks. > Add code-gen for SortAggregateExec > -- > > Key: SPARK-32750 > URL: https://issues.apache.org/jira/browse/SPARK-32750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > > We have had codegen for hash aggregate (`HashAggregateExec`) for a long time, but > codegen for sort aggregate (`SortAggregateExec`) is still missing. Sort aggregate is still > useful in terms of performance if (1) the data after aggregation is still too big > to fit in memory (both hash aggregate and object hash aggregate would need to > spill), or (2) the user disables hash aggregate and object hash aggregate by > config to prefer sort aggregate, e.g. when the aggregation comes after a sort merge > join and does not need a sort at all. > Creating this Jira to add codegen support for sort aggregate. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
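For intuition, sort aggregate evaluates groups in one streaming pass over sorted input rather than via an in-memory hash table, which is why it keeps working where hash aggregate would spill. A minimal sketch of that evaluation strategy (plain Python; not Spark's generated code, and summing is just one example aggregate):

```python
from itertools import groupby
from operator import itemgetter

def sort_aggregate(rows, key_idx=0, value_idx=1):
    """Aggregate (here: sum) by key using sorted input instead of a hash table.

    Only one group's running state is held at a time, which is why sort
    aggregate survives inputs where hash aggregate would have to spill."""
    # In Spark the sort usually comes for free from an upstream operator
    # (e.g. sort merge join); here we sort explicitly.
    ordered = sorted(rows, key=itemgetter(key_idx))
    return [(k, sum(r[value_idx] for r in grp))
            for k, grp in groupby(ordered, key=itemgetter(key_idx))]

print(sort_aggregate([("b", 2), ("a", 1), ("b", 5)]))  # -> [('a', 1), ('b', 7)]
```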
[jira] [Assigned] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates
[ https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32542: Assignee: Apache Spark > Add an optimizer rule to split an Expand into multiple Expands for aggregates > - > > Key: SPARK-32542 > URL: https://issues.apache.org/jira/browse/SPARK-32542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: karl wang >Assignee: Apache Spark >Priority: Major > > Split an Expand into several smaller Expands, each containing the specified > number of projections. > For instance, take this SQL: select a, b, c, d, count(1) from table1 group by > a, b, c, d with cube. It can expand the data to 2^4 times its original size. > If we set spark.sql.optimizer.projections.size=4, the Expand will be > split into 2^4/4 smaller Expands. This can reduce the shuffle pressure and improve > performance in multidimensional analysis when the data is huge. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
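The arithmetic in the description above can be made concrete with a short sketch: a CUBE over n grouping columns produces 2^n grouping sets (the projections of the Expand), and chunking them into groups of the configured size yields the smaller Expands. This is an illustration of the counting only; the config name spark.sql.optimizer.projections.size is taken from the issue description, not from released Spark.

```python
from itertools import chain, combinations

cols = ["a", "b", "c", "d"]

# A CUBE over n columns produces 2^n grouping sets, i.e. the projections
# inside the single big Expand operator.
grouping_sets = list(chain.from_iterable(
    combinations(cols, r) for r in range(len(cols) + 1)))
assert len(grouping_sets) == 2 ** len(cols)  # 16 projections for 4 columns

# Splitting with projections.size = 4 gives 16 / 4 = 4 smaller Expands,
# each carrying 4 projections.
size = 4
small_expands = [grouping_sets[i:i + size]
                 for i in range(0, len(grouping_sets), size)]
assert len(small_expands) == 4
assert all(len(chunk) == size for chunk in small_expands)
```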
[jira] [Assigned] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates
[ https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32542: Assignee: (was: Apache Spark) > Add an optimizer rule to split an Expand into multiple Expands for aggregates > - > > Key: SPARK-32542 > URL: https://issues.apache.org/jira/browse/SPARK-32542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: karl wang >Priority: Major > > Split an Expand into several smaller Expands, each containing the specified > number of projections. > For instance, take this SQL: select a, b, c, d, count(1) from table1 group by > a, b, c, d with cube. It can expand the data to 2^4 times its original size. > If we set spark.sql.optimizer.projections.size=4, the Expand will be > split into 2^4/4 smaller Expands. This can reduce the shuffle pressure and improve > performance in multidimensional analysis when the data is huge. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-27733) Upgrade to Avro 1.10.0
[ https://issues.apache.org/jira/browse/SPARK-27733?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187849#comment-17187849 ] Chao Sun commented on SPARK-27733: -- bq. BTW, is the Hive community willing to introduce the Avro version upgrade in their 2.3.x? It seems non-trivial to upgrade Avro in Hive (see HIVE-21737). Maybe we can shade Avro in Hive first so it won't cause conflicts? Even for that we may need to first enable CI for Hive branch 2.3 (see https://github.com/apache/hive/pull/1398). After these we can cut a release and do a vote in the community. > Upgrade to Avro 1.10.0 > -- > > Key: SPARK-27733 > URL: https://issues.apache.org/jira/browse/SPARK-27733 > Project: Spark > Issue Type: Improvement > Components: Build, SQL >Affects Versions: 3.1.0 >Reporter: Ismaël Mejía >Priority: Minor > > Avro 1.9.2 was released with many nice features, including reduced size (1MB > less), removed dependencies (no paranamer, no shaded guava), and security > updates, so it is probably a worthwhile upgrade. > Avro 1.10.0 has since been released and this is still not done. > There is at the moment (2020/08) still a blocker: Hive-related > transitive dependencies bring in older versions of Avro, so this remains > blocked until HIVE-21737 is solved. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32691) Test org.apache.spark.DistributedSuite failed on arm64 jenkins
[ https://issues.apache.org/jira/browse/SPARK-32691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187843#comment-17187843 ] Dongjoon Hyun commented on SPARK-32691: --- Yes. That failure looks like it, but it's still unrelated to DISK_3, isn't it? You can attach that to the PR description and make the scope broader. {code} caching on disk, replicated 2 (encryption = on) {code} > Test org.apache.spark.DistributedSuite failed on arm64 jenkins > -- > > Key: SPARK-32691 > URL: https://issues.apache.org/jira/browse/SPARK-32691 > Project: Spark > Issue Type: Bug > Components: Spark Core, Tests >Affects Versions: 3.1.0 > Environment: ARM64 >Reporter: huangtianhua >Priority: Major > > Tests of org.apache.spark.DistributedSuite fail on the arm64 Jenkins: > https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-arm/ > - caching in memory and disk, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory and disk, serialized, replicated (encryption = on) > (with replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > - caching in memory, serialized, replicated (encryption = on) (with > replication as stream) *** FAILED *** > 3 did not equal 2; got 3 replicas instead of 2 (DistributedSuite.scala:191) > ... > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates
[ https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] karl wang reopened SPARK-32542: --- > Add an optimizer rule to split an Expand into multiple Expands for aggregates > - > > Key: SPARK-32542 > URL: https://issues.apache.org/jira/browse/SPARK-32542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: karl wang >Priority: Major > > Split an Expand into several smaller Expands, each containing the specified > number of projections. > For instance, take this SQL: select a, b, c, d, count(1) from table1 group by > a, b, c, d with cube. It can expand the data to 2^4 times its original size. > If we set spark.sql.optimizer.projections.size=4, the Expand will be > split into 2^4/4 smaller Expands. This can reduce the shuffle pressure and improve > performance in multidimensional analysis when the data is huge. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32542) Add an optimizer rule to split an Expand into multiple Expands for aggregates
[ https://issues.apache.org/jira/browse/SPARK-32542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] karl wang resolved SPARK-32542. --- Resolution: Fixed > Add an optimizer rule to split an Expand into multiple Expands for aggregates > - > > Key: SPARK-32542 > URL: https://issues.apache.org/jira/browse/SPARK-32542 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: karl wang >Priority: Major > > Split an Expand into several smaller Expands, each containing the specified > number of projections. > For instance, take this SQL: select a, b, c, d, count(1) from table1 group by > a, b, c, d with cube. It can expand the data to 2^4 times its original size. > If we set spark.sql.optimizer.projections.size=4, the Expand will be > split into 2^4/4 smaller Expands. This can reduce the shuffle pressure and improve > performance in multidimensional analysis when the data is huge. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet
[ https://issues.apache.org/jira/browse/SPARK-32755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32755: Assignee: (was: Apache Spark) > Maintain the order of expressions in AttributeSet and ExpressionSet > > > Key: SPARK-32755 > URL: https://issues.apache.org/jira/browse/SPARK-32755 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Ali Afroozeh >Priority: Major > > Expression identity is based on ExprId, which is an auto-incremented > number. This means that the same query can yield query plans with different > expression ids in different runs. AttributeSet and ExpressionSet internally > use a HashSet as the underlying data structure, and therefore cannot > guarantee a fixed iteration order across runs. This can be > problematic in cases where we want to check for plan changes between runs. > We make the following changes to AttributeSet and ExpressionSet to > maintain the insertion order of the elements: > * We change the underlying data structure of AttributeSet from HashSet to > LinkedHashSet to maintain the insertion order. > * ExpressionSet already uses a list to keep track of the expressions; > however, since it extends Scala's immutable.Set class, operations such > as map and flatMap are delegated to the immutable.Set itself. This means that > the result of these operations is no longer an instance of ExpressionSet, > but rather an implementation picked by the parent class. We therefore remove > this inheritance from immutable.Set and implement the needed methods > directly. ExpressionSet has very specific semantics and it does not make > sense for it to extend immutable.Set anyway. > * We change PlanStabilitySuite to not sort the attributes, to be able to > catch changes in the order of expressions in different runs.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
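The HashSet-to-LinkedHashSet switch described above can be illustrated with a rough Python analogy. This is not Spark code: Scala's HashSet iterates in hash order, while Java's LinkedHashSet iterates in insertion order; in Python, a plain set plays the role of the former and dict.fromkeys (insertion-ordered since Python 3.7) the latter. The attribute names and ExprIds below are hypothetical.

```python
# Hypothetical attributes with auto-incremented ExprIds, as in the issue.
exprs = ["id#42", "name#7", "age#13"]

unordered = set(exprs)          # iteration order depends on hashing,
                                # like Scala's HashSet
ordered = dict.fromkeys(exprs)  # iteration order == insertion order,
                                # like Java's LinkedHashSet

assert list(ordered) == ["id#42", "name#7", "age#13"]

# With auto-incremented ExprIds, hash order can differ between runs even
# for the same query, while insertion order stays stable -- which is the
# point of switching AttributeSet from HashSet to LinkedHashSet.
```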
[jira] [Assigned] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet
[ https://issues.apache.org/jira/browse/SPARK-32755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32755: Assignee: Apache Spark > Maintain the order of expressions in AttributeSet and ExpressionSet > > > Key: SPARK-32755 > URL: https://issues.apache.org/jira/browse/SPARK-32755 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Ali Afroozeh >Assignee: Apache Spark >Priority: Major > > Expression identity is based on ExprId, which is an auto-incremented > number. This means that the same query can yield query plans with different > expression ids in different runs. AttributeSet and ExpressionSet internally > use a HashSet as the underlying data structure, and therefore cannot > guarantee a fixed iteration order across runs. This can be > problematic in cases where we want to check for plan changes between runs. > We make the following changes to AttributeSet and ExpressionSet to > maintain the insertion order of the elements: > * We change the underlying data structure of AttributeSet from HashSet to > LinkedHashSet to maintain the insertion order. > * ExpressionSet already uses a list to keep track of the expressions; > however, since it extends Scala's immutable.Set class, operations such > as map and flatMap are delegated to the immutable.Set itself. This means that > the result of these operations is no longer an instance of ExpressionSet, > but rather an implementation picked by the parent class. We therefore remove > this inheritance from immutable.Set and implement the needed methods > directly. ExpressionSet has very specific semantics and it does not make > sense for it to extend immutable.Set anyway. > * We change PlanStabilitySuite to not sort the attributes, to be able to > catch changes in the order of expressions in different runs.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet
[ https://issues.apache.org/jira/browse/SPARK-32755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187779#comment-17187779 ] Apache Spark commented on SPARK-32755: -- User 'dbaliafroozeh' has created a pull request for this issue: https://github.com/apache/spark/pull/29598 > Maintain the order of expressions in AttributeSet and ExpressionSet > > > Key: SPARK-32755 > URL: https://issues.apache.org/jira/browse/SPARK-32755 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Ali Afroozeh >Priority: Major > > Expression identity is based on ExprId, which is an auto-incremented > number. This means that the same query can yield query plans with different > expression ids in different runs. AttributeSet and ExpressionSet internally > use a HashSet as the underlying data structure, and therefore cannot > guarantee a fixed iteration order across runs. This can be > problematic in cases where we want to check for plan changes between runs. > We make the following changes to AttributeSet and ExpressionSet to > maintain the insertion order of the elements: > * We change the underlying data structure of AttributeSet from HashSet to > LinkedHashSet to maintain the insertion order. > * ExpressionSet already uses a list to keep track of the expressions; > however, since it extends Scala's immutable.Set class, operations such > as map and flatMap are delegated to the immutable.Set itself. This means that > the result of these operations is no longer an instance of ExpressionSet, > but rather an implementation picked by the parent class. We therefore remove > this inheritance from immutable.Set and implement the needed methods > directly. ExpressionSet has very specific semantics and it does not make > sense for it to extend immutable.Set anyway. > * We change PlanStabilitySuite to not sort the attributes, to be able to > catch changes in the order of expressions in different runs.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32755) Maintain the order of expressions in AttributeSet and ExpressionSet
Ali Afroozeh created SPARK-32755: Summary: Maintain the order of expressions in AttributeSet and ExpressionSet Key: SPARK-32755 URL: https://issues.apache.org/jira/browse/SPARK-32755 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: Ali Afroozeh Expression identity is based on ExprId, which is an auto-incremented number. This means that the same query can yield query plans with different expression ids in different runs. AttributeSet and ExpressionSet internally use a HashSet as the underlying data structure, and therefore cannot guarantee a fixed iteration order across runs. This can be problematic in cases where we want to check for plan changes between runs. We make the following changes to AttributeSet and ExpressionSet to maintain the insertion order of the elements: * We change the underlying data structure of AttributeSet from HashSet to LinkedHashSet to maintain the insertion order. * ExpressionSet already uses a list to keep track of the expressions; however, since it extends Scala's immutable.Set class, operations such as map and flatMap are delegated to the immutable.Set itself. This means that the result of these operations is no longer an instance of ExpressionSet, but rather an implementation picked by the parent class. We therefore remove this inheritance from immutable.Set and implement the needed methods directly. ExpressionSet has very specific semantics and it does not make sense for it to extend immutable.Set anyway. * We change PlanStabilitySuite to not sort the attributes, to be able to catch changes in the order of expressions in different runs. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32742) FileOutputCommitter warns "No Output found for attempt"
[ https://issues.apache.org/jira/browse/SPARK-32742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187751#comment-17187751 ] Ryan Luo commented on SPARK-32742: -- [~hyukjin.kwon] Thanks for advising. > FileOutputCommitter warns "No Output found for attempt" > --- > > Key: SPARK-32742 > URL: https://issues.apache.org/jira/browse/SPARK-32742 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: Hadoop 2.6.0-cdh5.16.2 > YARN(MR2 included) > >Reporter: Ryan Luo >Priority: Major > > Hi team, > This is my first time reporting an issue here. > We submitted and ran a Spark job on the cluster. > We found that one of the parquet output partitions is missing from the output > directory. We checked the Spark job log; all task statuses show > success, and the output record count matches the expected number. > However, in the container log we found a warning saying > *No Output found for attempt_20200819094307_0003_m_02_11*, which prevented > the output from being moved from the taskAttemptPath to the output directory. As a result, we > are missing some of the output rows. > Re-running the job solved the issue, but the report is critical for > us. We would appreciate any advice on the cause of the issue.
> > Below are the container logs: > > {code:java} > 20/08/19 09:44:57 INFO output.FileOutputCommitter: FileOutputCommitter skip > cleanup _temporary folders under output directory:false, ignore cleanup > failures: false > 20/08/19 09:44:57 INFO datasources.SQLHadoopMapReduceCommitProtocol: Using > user defined output committer class parquet.hadoop.ParquetOutputCommitter > 20/08/19 09:44:57 INFO output.FileOutputCommitter: File Output Committer > Algorithm version is 2 > 20/08/19 09:44:57 INFO output.FileOutputCommitter: FileOutputCommitter skip > cleanup _temporary folders under output directory:false, ignore cleanup > failures: false > 20/08/19 09:44:57 INFO datasources.SQLHadoopMapReduceCommitProtocol: Using > output committer class parquet.hadoop.ParquetOutputCommitter > 20/08/19 09:44:57 INFO codegen.CodeGenerator: Code generated in 12.370642 ms > 20/08/19 09:44:57 INFO codegen.CodeGenerator: Code generated in 6.927118 ms > 20/08/19 09:44:57 INFO codegen.CodeGenerator: Code generated in 12.004204 ms > 20/08/19 09:44:57 INFO parquet.ParquetWriteSupport: Initialized Parquet > WriteSupport with Catalyst schema: > . (skipped) > 20/08/19 09:44:57 WARN output.FileOutputCommitter: No Output found for > attempt_20200819094307_0003_m_02_11 > 20/08/19 09:44:57 INFO mapred.SparkHadoopMapRedUtil: > attempt_20200819094307_0003_m_02_11: Committed > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-32592) Make `DataFrameReader.table` take the specified options
[ https://issues.apache.org/jira/browse/SPARK-32592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-32592. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 29535 [https://github.com/apache/spark/pull/29535] > Make `DataFrameReader.table` take the specified options > --- > > Key: SPARK-32592 > URL: https://issues.apache.org/jira/browse/SPARK-32592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > > Currently, `DataFrameReader.table` ignores the specified options; options > set as in the following example are lost. > {code:java} > val df = spark.read > .option("partitionColumn", "id") > .option("lowerBound", "0") > .option("upperBound", "3") > .option("numPartitions", "2") > .table("h2.test.people") > {code} > We need to make `DataFrameReader.table` take the specified options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
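The builder pattern behind the fix above can be sketched in a few lines of Python. This is a minimal analogy, not Spark's DataFrameReader: the reader accumulates options and, after the fix, forwards them to table() instead of silently dropping them.

```python
# Minimal builder analogy for DataFrameReader. Illustrative only; the
# Reader class and its return value are hypothetical.
class Reader:
    def __init__(self):
        self._options = {}

    def option(self, key, value):
        # Accumulate options and return self so calls can be chained,
        # like DataFrameReader.option.
        self._options[key] = value
        return self

    def table(self, name):
        # The fix: forward the accumulated options instead of ignoring them.
        return {"table": name, "options": dict(self._options)}

df = (Reader()
      .option("partitionColumn", "id")
      .option("numPartitions", "2")
      .table("h2.test.people"))
assert df["options"] == {"partitionColumn": "id", "numPartitions": "2"}
```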
[jira] [Assigned] (SPARK-32592) Make `DataFrameReader.table` take the specified options
[ https://issues.apache.org/jira/browse/SPARK-32592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-32592: --- Assignee: Huaxin Gao > Make `DataFrameReader.table` take the specified options > --- > > Key: SPARK-32592 > URL: https://issues.apache.org/jira/browse/SPARK-32592 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > Currently, `DataFrameReader.table` ignores the specified options; options > set as in the following example are lost. > {code:java} > val df = spark.read > .option("partitionColumn", "id") > .option("lowerBound", "0") > .option("upperBound", "3") > .option("numPartitions", "2") > .table("h2.test.people") > {code} > We need to make `DataFrameReader.table` take the specified options. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32752) Alias breaks for interval typed literals
[ https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187692#comment-17187692 ] Kent Yao commented on SPARK-32752: -- cc [~cloud_fan] > Alias breaks for interval typed literals > > > Key: SPARK-32752 > URL: https://issues.apache.org/jira/browse/SPARK-32752 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Cases we found: > {code:java} > +-- !query > +select interval '1 day' as day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +no viable alternative at input 'as'(line 1, pos 24) > + > +== SQL == > +select interval '1 day' as day > +^^^ > + > + > +-- !query > +select interval '1 day' day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, > pos 16) > + > +== SQL == > +select interval '1 day' day > +^^^ > + > + > +-- !query > +select interval '1-2' year as y > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16) > + > +== SQL == > +select interval '1-2' year as y > +^^^ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32190) Development - Contribution Guide
[ https://issues.apache.org/jira/browse/SPARK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187686#comment-17187686 ] Apache Spark commented on SPARK-32190: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29596 > Development - Contribution Guide > > > Key: SPARK-32190 > URL: https://issues.apache.org/jira/browse/SPARK-32190 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Will be similar to http://spark.apache.org/contributing.html but with more > details, avoiding duplication. > - Style Guide: PEP8 but there are a couple of exceptions such as > https://github.com/apache/spark/blob/master/dev/tox.ini#L17. > - Bug Reports: JIRA. Maybe we should just point back to the link, > http://spark.apache.org/contributing.html > - Contribution Workflow: e.g.) > https://pandas.pydata.org/docs/development/contributing.html > - Documentation Contribution: e.g.) > https://pandas.pydata.org/docs/development/contributing.html > - Code Contribution: e.g.) > https://pandas.pydata.org/docs/development/contributing.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32190) Development - Contribution Guide
[ https://issues.apache.org/jira/browse/SPARK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32190: Assignee: (was: Apache Spark) > Development - Contribution Guide > > > Key: SPARK-32190 > URL: https://issues.apache.org/jira/browse/SPARK-32190 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Will be similar to http://spark.apache.org/contributing.html but with more > details, avoiding duplication. > - Style Guide: PEP8 but there are a couple of exceptions such as > https://github.com/apache/spark/blob/master/dev/tox.ini#L17. > - Bug Reports: JIRA. Maybe we should just point back to the link, > http://spark.apache.org/contributing.html > - Contribution Workflow: e.g.) > https://pandas.pydata.org/docs/development/contributing.html > - Documentation Contribution: e.g.) > https://pandas.pydata.org/docs/development/contributing.html > - Code Contribution: e.g.) > https://pandas.pydata.org/docs/development/contributing.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32190) Development - Contribution Guide
[ https://issues.apache.org/jira/browse/SPARK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32190: Assignee: Apache Spark > Development - Contribution Guide > > > Key: SPARK-32190 > URL: https://issues.apache.org/jira/browse/SPARK-32190 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Will be similar to http://spark.apache.org/contributing.html but with more > details, avoiding duplication. > - Style Guide: PEP8 but there are a couple of exceptions such as > https://github.com/apache/spark/blob/master/dev/tox.ini#L17. > - Bug Reports: JIRA. Maybe we should just point back to the link, > http://spark.apache.org/contributing.html > - Contribution Workflow: e.g.) > https://pandas.pydata.org/docs/development/contributing.html > - Documentation Contribution: e.g.) > https://pandas.pydata.org/docs/development/contributing.html > - Code Contribution: e.g.) > https://pandas.pydata.org/docs/development/contributing.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32190) Development - Contribution Guide
[ https://issues.apache.org/jira/browse/SPARK-32190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187684#comment-17187684 ] Apache Spark commented on SPARK-32190: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/29596 > Development - Contribution Guide > > > Key: SPARK-32190 > URL: https://issues.apache.org/jira/browse/SPARK-32190 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Will be similar to http://spark.apache.org/contributing.html but with more > details, avoiding duplication. > - Style Guide: PEP8 but there are a couple of exceptions such as > https://github.com/apache/spark/blob/master/dev/tox.ini#L17. > - Bug Reports: JIRA. Maybe we should just point back to the link, > http://spark.apache.org/contributing.html > - Contribution Workflow: e.g.) > https://pandas.pydata.org/docs/development/contributing.html > - Documentation Contribution: e.g.) > https://pandas.pydata.org/docs/development/contributing.html > - Code Contribution: e.g.) > https://pandas.pydata.org/docs/development/contributing.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32659) Fix the data issue of inserted DPP on non-atomic type
[ https://issues.apache.org/jira/browse/SPARK-32659?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187631#comment-17187631 ] Apache Spark commented on SPARK-32659: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/29595 > Fix the data issue of inserted DPP on non-atomic type > - > > Key: SPARK-32659 > URL: https://issues.apache.org/jira/browse/SPARK-32659 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Labels: correctness > Fix For: 3.0.1, 3.1.0 > > > DPP has a data issue when pruning on a non-atomic type. For example: > {noformat} > spark.range(1000) > .select(col("id"), col("id").as("k")) > .write > .partitionBy("k") > .format("parquet") > .mode("overwrite") > .saveAsTable("df1"); > spark.range(100) > .select(col("id"), col("id").as("k")) > .write > .partitionBy("k") > .format("parquet") > .mode("overwrite") > .saveAsTable("df2") > spark.sql("set > spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2") > spark.sql("set > spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false") > spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = > struct(df2.k) AND df2.id < 2").show > {noformat} > It should return two records, but it returns an empty result. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32746) Not able to run Pandas UDF
[ https://issues.apache.org/jira/browse/SPARK-32746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187617#comment-17187617 ] Rahul Bhatia commented on SPARK-32746: -- [~hyukjin.kwon] Here are the logs. A further symptom, as mentioned, is that the code runs fine and completes in 2-4 seconds in standalone mode (on my local machine), while on the cluster it shows no progress. {noformat} 20/08/29 18:51:25 ERROR netty.Inbox: Ignoring error org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive' at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 20/08/30 12:51:25 ERROR netty.Inbox: Ignoring error org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive' at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 20/08/31 06:51:25 ERROR netty.Inbox: Ignoring error org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive' at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70) at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115) at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) 20/08/31 07:31:04 INFO yarn.YarnAllocator: Completed container container_e66_1589136601716_6876212_01_31 on host: bhdp4080.prod.bdd.jp.local (state: COMPLETE, exit status: -102) 20/08/31 07:31:04 INFO yarn.YarnAllocator: Container container_e66_1589136601716_6876212_01_31 on host: bhdp4080.prod.bdd.jp.local was preempted. 20/08/31 07:31:07 INFO yarn.YarnAllocator: Will request 1 executor container(s), each with 16 core(s) and 18022 MB memory (including 1638 MB of overhead) 20/08/31 07:31:07 INFO yarn.YarnAllocator: Submitted 1 unlocalized container requests. 20/08/31 07:31:55 INFO yarn.YarnAllocator: Completed container container_e66_1589136601716_6876212_01_30 on host: bhdp4365.prod.hnd1.bdd.local (state: COMPLETE, exit status: -102) 20/08/31 07:31:55 INFO yarn.YarnAllocator: Container container_e66_1589136601716_6876212_01_30 on host: bhdp4365.prod.hnd1.bdd.local was preempted. 
20/08/31 07:31:55 INFO yarn.YarnAllocator: Will request 1 executor container(s), each with 16 core(s) and 18022 MB memory (including 1638 MB of overhead) 20/08/31 07:31:55 INFO yarn.YarnAllocator: Submitted 1 unlocalized container requests. 20/08/31 07:37:04 INFO impl.AMRMClientImpl: Received new token for : bhdp4564.prod.hnd1.bdd.local:45454 20/08/31 07:37:04 INFO yarn.YarnAllocator: Launching container container_e66_1589136601716_6876212_01_32 on host bhdp4564.prod.hnd1.bdd.local for executor with ID 31 20/08/31 07:37:04 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 1 of them. 20/08/31 07:37:04 INFO impl.ContainerManagementProtocolProxy: yarn.client.max-cached-nodemanagers-proxies : 0 20/08/31 07:37:04 INFO impl.ContainerManagementProtocolProxy: Opening proxy : bhdp4564.prod.hnd1.bdd.local:45454
[jira] [Commented] (SPARK-32752) Alias breaks for interval typed literals
[ https://issues.apache.org/jira/browse/SPARK-32752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187613#comment-17187613 ] Kent Yao commented on SPARK-32752: -- these cases are captured by the grammar rule `multiUnitsInterval`; e.g. for `interval '1 day' day`, the value is `1 day` and the unit is `day` > Alias breaks for interval typed literals > > > Key: SPARK-32752 > URL: https://issues.apache.org/jira/browse/SPARK-32752 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > > Cases we found: > {code:java} > +-- !query > +select interval '1 day' as day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +no viable alternative at input 'as'(line 1, pos 24) > + > +== SQL == > +select interval '1 day' as day > +^^^ > + > + > +-- !query > +select interval '1 day' day > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1 day day' to interval, unrecognized number 'day'(line 1, > pos 16) > + > +== SQL == > +select interval '1 day' day > +^^^ > + > + > +-- !query > +select interval '1-2' year as y > +-- !query schema > +struct<> > +-- !query output > +org.apache.spark.sql.catalyst.parser.ParseException > + > +Error parsing ' 1-2 year' to interval, invalid value '1-2'(line 1, pos 16) > + > +== SQL == > +select interval '1-2' year as y > +^^^ > {code}
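Kent Yao's comment can be pictured with a toy sketch (plain Python, emphatically not Spark's real ANTLR grammar): once the parser is inside the interval literal rule, it greedily consumes any following unit keyword, so a trailing identifier such as `day` is swallowed as another unit and never reaches the alias rule.

```python
# Toy illustration of why `interval '1 day' day` leaves no token for an
# alias: a multiUnitsInterval-style rule keeps consuming unit keywords.
# This is a simplified, hypothetical model of the grammar, not Spark's parser.

UNIT_KEYWORDS = {"year", "month", "week", "day", "hour", "minute", "second"}

def parse_interval(tokens):
    """Greedily parse `interval '<str>' unit*`; return (literal, units, rest)."""
    assert tokens[0] == "interval"
    literal = tokens[1]            # the quoted string, e.g. "'1 day'"
    i = 2
    units = []
    # Keep consuming unit keywords; a would-be alias named `day` is eaten here.
    while i < len(tokens) and tokens[i] in UNIT_KEYWORDS:
        units.append(tokens[i])
        i += 1
    return literal, units, tokens[i:]

print(parse_interval(["interval", "'1 day'", "day"]))
# the trailing `day` becomes a unit; nothing remains to serve as an alias
```

With an explicit `as`, the loop stops immediately and the `as day` tokens remain, which in the real grammar then hit the "no viable alternative at input 'as'" error shown in the issue.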
[jira] [Assigned] (SPARK-32754) Unify `assertEqualPlans` for join reorder suites
[ https://issues.apache.org/jira/browse/SPARK-32754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32754: Assignee: (was: Apache Spark) > Unify `assertEqualPlans` for join reorder suites > > > Key: SPARK-32754 > URL: https://issues.apache.org/jira/browse/SPARK-32754 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Priority: Minor > > Now the three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite`, > `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method and > the logic is almost the same. We can extract the method to a single place for > code simplicity.
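The proposed cleanup is the standard extract-to-shared-base move. A hedged sketch of the shape of the refactoring (in Python for brevity; the real suites are Scala test classes, and the class and method names below are illustrative, not Spark's actual ones):

```python
# Hypothetical sketch of the SPARK-32754 refactoring: hoist the duplicated
# plan-comparison helper into one base class that all three suites reuse,
# instead of each suite defining its own near-identical copy.

class PlanTestBase:
    def normalize(self, plan):
        # Stand-in for Spark's plan normalization (e.g. normalizing expr IDs).
        return sorted(plan)

    def assert_equal_plans(self, optimized, expected):
        assert self.normalize(optimized) == self.normalize(expected), \
            f"Plans differ: {optimized} vs {expected}"

# The three suites now share one helper rather than three copies.
class JoinReorderSuite(PlanTestBase): pass
class StarJoinReorderSuite(PlanTestBase): pass
class StarJoinCostBasedReorderSuite(PlanTestBase): pass

JoinReorderSuite().assert_equal_plans(["join(a,b)", "scan(c)"],
                                      ["scan(c)", "join(a,b)"])
```

The payoff is that a future change to plan normalization only needs to be made in one place.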
[jira] [Assigned] (SPARK-32754) Unify `assertEqualPlans` for join reorder suites
[ https://issues.apache.org/jira/browse/SPARK-32754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-32754: Assignee: Apache Spark > Unify `assertEqualPlans` for join reorder suites > > > Key: SPARK-32754 > URL: https://issues.apache.org/jira/browse/SPARK-32754 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Assignee: Apache Spark >Priority: Minor > > Now the three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite`, > `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method and > the logic is almost the same. We can extract the method to a single place for > code simplicity.
[jira] [Commented] (SPARK-32754) Unify `assertEqualPlans` for join reorder suites
[ https://issues.apache.org/jira/browse/SPARK-32754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187591#comment-17187591 ] Apache Spark commented on SPARK-32754: -- User 'wzhfy' has created a pull request for this issue: https://github.com/apache/spark/pull/29594 > Unify `assertEqualPlans` for join reorder suites > > > Key: SPARK-32754 > URL: https://issues.apache.org/jira/browse/SPARK-32754 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Priority: Minor > > Now the three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite`, > `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method and > the logic is almost the same. We can extract the method to a single place for > code simplicity.
[jira] [Updated] (SPARK-32754) Unify `assertEqualPlans` for join reorder suites
[ https://issues.apache.org/jira/browse/SPARK-32754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-32754: - Summary: Unify `assertEqualPlans` for join reorder suites (was: Unify `assertEqualPlans` for all join reorder suites) > Unify `assertEqualPlans` for join reorder suites > > > Key: SPARK-32754 > URL: https://issues.apache.org/jira/browse/SPARK-32754 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Priority: Minor > > Now the three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite`, > `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method and > the logic is almost the same. We can extract the method to a single place for > code simplicity.
[jira] [Created] (SPARK-32754) Unify `assertEqualPlans` for all join reorder suites
Zhenhua Wang created SPARK-32754: Summary: Unify `assertEqualPlans` for all join reorder suites Key: SPARK-32754 URL: https://issues.apache.org/jira/browse/SPARK-32754 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.0.0 Reporter: Zhenhua Wang Now the three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite`, `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method and the logic is almost the same. We can extract the method to a single place for code simplicity.
[jira] [Updated] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-32748: - Description: Since SPARK-22590, local property propagation is supported through `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and `SubqueryExec` when computing `relationFuture`. The propagation is missed in `SubqueryBroadcastExec`. was: Since [SPARK-22590](https://github.com/apache/spark/commit/2854091d12d670b014c41713e72153856f4d3f6a), local property propagation is supported through `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and `SubqueryExec` when computing `relationFuture`. The propagation is missed in `SubqueryBroadcastExec`. > Support local property propagation in SubqueryBroadcastExec > --- > > Key: SPARK-32748 > URL: https://issues.apache.org/jira/browse/SPARK-32748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Priority: Major > > Since SPARK-22590, local property propagation is supported through > `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and > `SubqueryExec` when computing `relationFuture`. > The propagation is missed in `SubqueryBroadcastExec`.
[jira] [Updated] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-32748: - Description: Since [SPARK-22590](https://github.com/apache/spark/commit/2854091d12d670b014c41713e72153856f4d3f6a), local property propagation is supported through `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and `SubqueryExec` when computing `relationFuture`. The propagation is missed in `SubqueryBroadcastExec`. was: Currently local property propagation is supported through `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and `SubqueryExec` when computing `relationFuture`. The propagation is missed in `SubqueryBroadcastExec`. > Support local property propagation in SubqueryBroadcastExec > --- > > Key: SPARK-32748 > URL: https://issues.apache.org/jira/browse/SPARK-32748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Priority: Major > > Since > [SPARK-22590](https://github.com/apache/spark/commit/2854091d12d670b014c41713e72153856f4d3f6a), > local property propagation is supported through > `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and > `SubqueryExec` when computing `relationFuture`. > The propagation is missed in `SubqueryBroadcastExec`.
[jira] [Updated] (SPARK-32748) Support local property propagation in SubqueryBroadcastExec
[ https://issues.apache.org/jira/browse/SPARK-32748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhenhua Wang updated SPARK-32748: - Description: Currently local property propagation is supported through `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and `SubqueryExec` when computing `relationFuture`. The propagation is missed in `SubqueryBroadcastExec`. was: Currently local property propagation are supported through `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and `SubqueryExec` when computing `relationFuture`. The propagation is missed in `SubqueryBroadcastExec`. > Support local property propagation in SubqueryBroadcastExec > --- > > Key: SPARK-32748 > URL: https://issues.apache.org/jira/browse/SPARK-32748 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zhenhua Wang >Priority: Major > > Currently local property propagation is supported through > `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and > `SubqueryExec` when computing `relationFuture`. > The propagation is missed in `SubqueryBroadcastExec`.
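The mechanism at stake can be sketched without Spark. `SQLExecution.withThreadLocalCaptured` (Scala, in Spark) snapshots the caller's thread-local properties and re-installs them inside the pool thread that computes the future; below is a hedged Python analogue of that pattern (names like `with_thread_local_captured` are illustrative, not Spark's API):

```python
# Python analogue of capturing thread-local "local properties" on the caller
# thread and re-installing them in the executor thread that runs the future,
# as BroadcastExchangeExec/SubqueryExec do and SubqueryBroadcastExec should.
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()

def set_local_property(key, value):
    props = getattr(_local, "props", {})
    props[key] = value
    _local.props = props

def get_local_property(key):
    return getattr(_local, "props", {}).get(key)

def with_thread_local_captured(fn, executor):
    captured = dict(getattr(_local, "props", {}))  # snapshot on caller thread
    def wrapper():
        _local.props = captured                    # re-install in pool thread
        return fn()
    return executor.submit(wrapper)

executor = ThreadPoolExecutor(max_workers=1)
set_local_property("spark.job.tag", "subquery-broadcast")
future = with_thread_local_captured(
    lambda: get_local_property("spark.job.tag"), executor)
print(future.result())  # -> subquery-broadcast, despite running on a pool thread
```

Without the capture/re-install step, the pool thread would see whatever properties it inherited when the pool was created, which is exactly the gap this issue reports for `SubqueryBroadcastExec`.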
[jira] [Assigned] (SPARK-32747) Deduplicate configuration set/unset in test_sparkSQL_arrow.R
[ https://issues.apache.org/jira/browse/SPARK-32747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-32747: Assignee: Hyukjin Kwon > Deduplicate configuration set/unset in test_sparkSQL_arrow.R > > > Key: SPARK-32747 > URL: https://issues.apache.org/jira/browse/SPARK-32747 > Project: Spark > Issue Type: Test > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > Currently, there are many duplicated set/unset calls in `test_sparkSQL_arrow.R` > test cases. We can just set the configuration once globally and deduplicate such try-catch > logic.
[jira] [Resolved] (SPARK-32747) Deduplicate configuration set/unset in test_sparkSQL_arrow.R
[ https://issues.apache.org/jira/browse/SPARK-32747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-32747. -- Fix Version/s: 3.1.0 3.0.1 Resolution: Fixed Issue resolved by pull request 29592 [https://github.com/apache/spark/pull/29592] > Deduplicate configuration set/unset in test_sparkSQL_arrow.R > > > Key: SPARK-32747 > URL: https://issues.apache.org/jira/browse/SPARK-32747 > Project: Spark > Issue Type: Test > Components: R, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.1, 3.1.0 > > > Currently, there are many duplicated set/unset calls in `test_sparkSQL_arrow.R` > test cases. We can just set the configuration once globally and deduplicate such try-catch > logic.
[jira] [Commented] (SPARK-32753) Deduplicating and repartitioning the same column create duplicate rows with AQE
[ https://issues.apache.org/jira/browse/SPARK-32753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17187507#comment-17187507 ] Apache Spark commented on SPARK-32753: -- User 'manuzhang' has created a pull request for this issue: https://github.com/apache/spark/pull/29593 > Deduplicating and repartitioning the same column create duplicate rows with > AQE > --- > > Key: SPARK-32753 > URL: https://issues.apache.org/jira/browse/SPARK-32753 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Manu Zhang >Priority: Minor > > To reproduce: > {code:java} > spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") > val df = spark.sql("select id from v1 group by id distribute by id") > println(df.collect().toArray.mkString(",")) > println(df.queryExecution.executedPlan) > // With AQE > [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] > AdaptiveSparkPlan(isFinalPlan=true) > +- CustomShuffleReader local >+- ShuffleQueryStage 0 > +- Exchange hashpartitioning(id#183L, 10), true > +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) > +- Union >:- *(1) Range (0, 10, step=1, splits=2) >+- *(2) Range (0, 10, step=1, splits=2) > // Without AQE > [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] > *(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Exchange hashpartitioning(id#206L, 10), true >+- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) > +- Union > :- *(1) Range (0, 10, step=1, splits=2) > +- *(2) Range (0, 10, step=1, splits=2){code}
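The symptom in the plans above can be modeled without Spark. Deduplication after a hash shuffle is only correct because all rows with the same key meet in one partition; if an AQE local reader instead lets each upstream mapper's output be consumed independently, keys present in both `range(10)` inputs survive twice. A toy model (hypothetical, mirroring the reproduction, not Spark's actual execution code):

```python
# Toy model of SPARK-32753: per-partition dedup is correct after a hash
# shuffle (equal keys land in one partition) but wrong when each mapper's
# output is deduplicated on its own, as with the local shuffle reader.

def dedup(rows):
    return list(dict.fromkeys(rows))  # preserves order, drops repeats

mapper_outputs = [list(range(10)), list(range(10))]  # the unioned Ranges

# Without AQE: shuffle rows by hash(key) into partitions, dedup per partition.
num_partitions = 10
partitions = [[] for _ in range(num_partitions)]
for out in mapper_outputs:
    for row in out:
        partitions[hash(row) % num_partitions].append(row)
global_result = [r for p in partitions for r in dedup(p)]

# Buggy local-reader behavior: each mapper's output deduplicated separately.
local_result = [r for out in mapper_outputs for r in dedup(out)]

print(sorted(global_result))  # 10 distinct ids
print(sorted(local_result))   # every id appears twice, as in the bug report
```

The fix direction implied by the report is that AQE must not substitute a local reader when the shuffle's hash distribution is what makes the downstream aggregate correct.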