[jira] [Resolved] (SPARK-33477) Hive partition pruning support date type
[ https://issues.apache.org/jira/browse/SPARK-33477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-33477.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 30408
[https://github.com/apache/spark/pull/30408]

> Hive partition pruning support date type
>
> Key: SPARK-33477
> URL: https://issues.apache.org/jira/browse/SPARK-33477
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Assignee: Yuming Wang
> Priority: Major
> Fix For: 3.1.0
>
> Hive partition pruning can support date type:
> https://issues.apache.org/jira/browse/HIVE-5679
> https://github.com/apache/hive/commit/5106bf1c8671740099fca8e1a7d4b37afe97137f

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
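The linked HIVE-5679 change lets predicates on date-typed partition columns be evaluated by the metastore, so only matching partitions are listed. As a rough illustration of the idea (not the actual Spark or Hive code; the helper name below is hypothetical), pruning amounts to rendering the predicate as a metastore partition-filter string:

```python
from datetime import date

# Hypothetical sketch, not Spark/Hive source: an equality predicate on a
# partition column is turned into a filter string the metastore can evaluate
# server-side. Date values travel as quoted ISO-8601 strings, which is the
# capability HIVE-5679 adds on the metastore side.

def to_partition_filter(column: str, value) -> str:
    if isinstance(value, date):
        # date partition values are rendered as quoted ISO strings
        return f"{column} = '{value.isoformat()}'"
    if isinstance(value, str):
        return f"{column} = '{value}'"
    return f"{column} = {value}"

print(to_partition_filter("dt", date(2020, 1, 1)))  # dt = '2020-01-01'
print(to_partition_filter("hour", 12))              # hour = 12
```

Before this kind of support, a date-typed predicate fell back to listing every partition client-side and filtering in Spark, which is far slower for tables with many partitions.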
[jira] [Commented] (SPARK-33548) Peak Execution Memory not display on Spark Executor UI intuitively
[ https://issues.apache.org/jira/browse/SPARK-33548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238525#comment-17238525 ]

Apache Spark commented on SPARK-33548:
--------------------------------------
User 'JQ-Cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30495

> Peak Execution Memory not display on Spark Executor UI intuitively
>
> Key: SPARK-33548
> URL: https://issues.apache.org/jira/browse/SPARK-33548
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.0.0, 3.0.1
> Reporter: xuziqiJS
> Priority: Major
>
> Currently, Peak Execution Memory can only be obtained through the REST API
> and is not displayed intuitively on the Spark Executor UI, even though Spark
> users depend on this metric when tuning executor memory. It is therefore
> important to display peak memory usage on the Spark UI.
[jira] [Assigned] (SPARK-33548) Peak Execution Memory not display on Spark Executor UI intuitively
[ https://issues.apache.org/jira/browse/SPARK-33548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33548:
------------------------------------
Assignee: (was: Apache Spark)

> Peak Execution Memory not display on Spark Executor UI intuitively
>
> Key: SPARK-33548
> URL: https://issues.apache.org/jira/browse/SPARK-33548
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.0.0, 3.0.1
> Reporter: xuziqiJS
> Priority: Major
>
> Currently, Peak Execution Memory can only be obtained through the REST API
> and is not displayed intuitively on the Spark Executor UI, even though Spark
> users depend on this metric when tuning executor memory. It is therefore
> important to display peak memory usage on the Spark UI.
[jira] [Assigned] (SPARK-33548) Peak Execution Memory not display on Spark Executor UI intuitively
[ https://issues.apache.org/jira/browse/SPARK-33548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33548:
------------------------------------
Assignee: Apache Spark

> Peak Execution Memory not display on Spark Executor UI intuitively
>
> Key: SPARK-33548
> URL: https://issues.apache.org/jira/browse/SPARK-33548
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.0.0, 3.0.1
> Reporter: xuziqiJS
> Assignee: Apache Spark
> Priority: Major
>
> Currently, Peak Execution Memory can only be obtained through the REST API
> and is not displayed intuitively on the Spark Executor UI, even though Spark
> users depend on this metric when tuning executor memory. It is therefore
> important to display peak memory usage on the Spark UI.
[jira] [Commented] (SPARK-33548) Peak Execution Memory not display on Spark Executor UI intuitively
[ https://issues.apache.org/jira/browse/SPARK-33548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238522#comment-17238522 ]

Apache Spark commented on SPARK-33548:
--------------------------------------
User 'JQ-Cao' has created a pull request for this issue:
https://github.com/apache/spark/pull/30495

> Peak Execution Memory not display on Spark Executor UI intuitively
>
> Key: SPARK-33548
> URL: https://issues.apache.org/jira/browse/SPARK-33548
> Project: Spark
> Issue Type: Improvement
> Components: Web UI
> Affects Versions: 3.0.0, 3.0.1
> Reporter: xuziqiJS
> Priority: Major
>
> Currently, Peak Execution Memory can only be obtained through the REST API
> and is not displayed intuitively on the Spark Executor UI, even though Spark
> users depend on this metric when tuning executor memory. It is therefore
> important to display peak memory usage on the Spark UI.
[jira] [Updated] (SPARK-31710) Fail casting numeric to timestamp by default
[ https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31710:
----------------------------------
    Labels: (was: correctness)

> Fail casting numeric to timestamp by default
>
> Key: SPARK-31710
> URL: https://issues.apache.org/jira/browse/SPARK-31710
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.5
> Environment: hdp: 2.7.7, spark: 2.4.5
> Reporter: philipse
> Assignee: philipse
> Priority: Major
> Fix For: 3.1.0
>
> Hi Team,
> Steps to reproduce:
> {code:java}
> create table test(id bigint);
> insert into test select 1586318188000;
> create table test1(id bigint) partitioned by (year string);
> insert overwrite table test1 partition(year) select 234, cast(id as TIMESTAMP)
> from test;
> {code}
> Let's check the result.
> Case 1:
> *select * from test1;*
> 234 | 52238-06-04 13:06:400.0
> -- the result is wrong
> Case 2:
> *select 234, cast(id as TIMESTAMP) from test;*
>
> java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
> at java.sql.Timestamp.valueOf(Timestamp.java:237)
> at org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441)
> at org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421)
> at org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530)
> at org.apache.hive.beeline.Rows$Row.<init>(Rows.java:166)
> at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:43)
> at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756)
> at org.apache.hive.beeline.Commands.execute(Commands.java:826)
> at org.apache.hive.beeline.Commands.sql(Commands.java:670)
> at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974)
> at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810)
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767)
> at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480)
> at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
> Error: Unrecognized column type: TIMESTAMP_TYPE (state=,code=0)
>
> I tried Hive, and it works well; the conversion is correct:
> {code:java}
> select 234, cast(id as TIMESTAMP) from test;
> 234 2020-04-08 11:56:28
> {code}
> Two questions:
> q1: If we forbid this conversion, should we keep all cases consistent?
> q2: If we allow the conversion in some cases, should we check the length of
> the long value? The code seems to always convert to microseconds with
> * 1000000L no matter how large the input is; if a value of the wrong length
> is converted to a timestamp, we could raise an error instead.
> {code:java}
> // converting seconds to us
> private[this] def longToTimestamp(t: Long): Long = t * 1000000L{code}
>
> Thanks!
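The wrong value in Case 1 follows directly from the conversion the reporter quotes: the cast treats the bigint as seconds and scales it to Spark's internal microsecond representation, so a value that is actually epoch milliseconds ends up roughly 1000x too far in the future. A small Python sketch mirroring the quoted Scala helper (not the actual Spark code path):

```python
MICROS_PER_SECOND = 1_000_000

def long_to_timestamp(t: int) -> int:
    # mirrors the quoted Catalyst helper: the input is assumed to be *seconds*
    # and is scaled to microseconds, Spark's internal timestamp unit
    return t * MICROS_PER_SECOND

value = 1586318188000  # actually epoch *milliseconds* (2020-04-08 UTC)
micros = long_to_timestamp(value)

# Interpreted as seconds since 1970, this lands ~50,000 years in the future.
# 31,556,952 is the mean Gregorian year in seconds (approximation for the demo).
approx_year = 1970 + (micros // MICROS_PER_SECOND) // 31_556_952
print(approx_year)  # 52238, matching the bogus "52238-06-04" result in Case 1
```

This is why the issue's fix makes the numeric-to-timestamp cast fail by default: silently multiplying by 1,000,000 produces a plausible-looking but wrong timestamp whenever the input is not in seconds.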
[jira] [Commented] (SPARK-33551) Do not use custom shuffle reader for repartition
[ https://issues.apache.org/jira/browse/SPARK-33551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238516#comment-17238516 ]

Apache Spark commented on SPARK-33551:
--------------------------------------
User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/30494

> Do not use custom shuffle reader for repartition
>
> Key: SPARK-33551
> URL: https://issues.apache.org/jira/browse/SPARK-33551
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.1
> Reporter: Wei Xue
> Priority: Major
>
> We should have a more thorough fix for all sorts of custom shuffle readers
> when the original query has a repartition shuffle, based on the discussions
> on the initial PR: [https://github.com/apache/spark/pull/29797].
[jira] [Assigned] (SPARK-33551) Do not use custom shuffle reader for repartition
[ https://issues.apache.org/jira/browse/SPARK-33551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33551:
------------------------------------
Assignee: Apache Spark

> Do not use custom shuffle reader for repartition
>
> Key: SPARK-33551
> URL: https://issues.apache.org/jira/browse/SPARK-33551
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.1
> Reporter: Wei Xue
> Assignee: Apache Spark
> Priority: Major
>
> We should have a more thorough fix for all sorts of custom shuffle readers
> when the original query has a repartition shuffle, based on the discussions
> on the initial PR: [https://github.com/apache/spark/pull/29797].
[jira] [Assigned] (SPARK-33551) Do not use custom shuffle reader for repartition
[ https://issues.apache.org/jira/browse/SPARK-33551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33551:
------------------------------------
Assignee: (was: Apache Spark)

> Do not use custom shuffle reader for repartition
>
> Key: SPARK-33551
> URL: https://issues.apache.org/jira/browse/SPARK-33551
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.0.1
> Reporter: Wei Xue
> Priority: Major
>
> We should have a more thorough fix for all sorts of custom shuffle readers
> when the original query has a repartition shuffle, based on the discussions
> on the initial PR: [https://github.com/apache/spark/pull/29797].
[jira] [Created] (SPARK-33551) Do not use custom shuffle reader for repartition
Wei Xue created SPARK-33551:
----------------------------
    Summary: Do not use custom shuffle reader for repartition
    Key: SPARK-33551
    URL: https://issues.apache.org/jira/browse/SPARK-33551
    Project: Spark
    Issue Type: Bug
    Components: SQL
    Affects Versions: 3.0.1
    Reporter: Wei Xue

We should have a more thorough fix for all sorts of custom shuffle readers
when the original query has a repartition shuffle, based on the discussions
on the initial PR: [https://github.com/apache/spark/pull/29797].
[jira] [Updated] (SPARK-33550) Recover hive-service-rpc to built-in Hive version when we upgrade built-in Hive to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-33550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-33550:
--------------------------------
    Description: https://github.com/apache/spark/pull/30478#discussion_r529179587

> Recover hive-service-rpc to built-in Hive version when we upgrade built-in
> Hive to 3.1.2
>
> Key: SPARK-33550
> URL: https://issues.apache.org/jira/browse/SPARK-33550
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Priority: Major
>
> https://github.com/apache/spark/pull/30478#discussion_r529179587
[jira] [Updated] (SPARK-33550) Recover hive-service-rpc to built-in Hive version when we upgrade built-in Hive to 3.1.2
[ https://issues.apache.org/jira/browse/SPARK-33550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yuming Wang updated SPARK-33550:
--------------------------------
    Description:
        Recover hive-service-rpc to built-in Hive version when we upgrade
        built-in Hive to 3.1.2. Please see
        https://github.com/apache/spark/pull/30478#discussion_r529179587
        for more details.
    was: https://github.com/apache/spark/pull/30478#discussion_r529179587

> Recover hive-service-rpc to built-in Hive version when we upgrade built-in
> Hive to 3.1.2
>
> Key: SPARK-33550
> URL: https://issues.apache.org/jira/browse/SPARK-33550
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Yuming Wang
> Priority: Major
>
> Recover hive-service-rpc to built-in Hive version when we upgrade built-in
> Hive to 3.1.2. Please see
> https://github.com/apache/spark/pull/30478#discussion_r529179587 for more
> details.
[jira] [Created] (SPARK-33550) Recover hive-service-rpc to built-in Hive version when we upgrade built-in Hive to 3.1.2
Yuming Wang created SPARK-33550:
--------------------------------
    Summary: Recover hive-service-rpc to built-in Hive version when we
             upgrade built-in Hive to 3.1.2
    Key: SPARK-33550
    URL: https://issues.apache.org/jira/browse/SPARK-33550
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 3.1.0
    Reporter: Yuming Wang
[jira] [Updated] (SPARK-31710) Fail casting numeric to timestamp by default
[ https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-31710:
----------------------------------
    Labels: correctness (was: )

> Fail casting numeric to timestamp by default
>
> Key: SPARK-31710
> URL: https://issues.apache.org/jira/browse/SPARK-31710
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.5
> Environment: hdp: 2.7.7, spark: 2.4.5
> Reporter: philipse
> Assignee: philipse
> Priority: Major
> Labels: correctness
> Fix For: 3.1.0
>
> Hi Team,
> Steps to reproduce:
> {code:java}
> create table test(id bigint);
> insert into test select 1586318188000;
> create table test1(id bigint) partitioned by (year string);
> insert overwrite table test1 partition(year) select 234, cast(id as TIMESTAMP)
> from test;
> {code}
> Let's check the result.
> Case 1:
> *select * from test1;*
> 234 | 52238-06-04 13:06:400.0
> -- the result is wrong
> Case 2:
> *select 234, cast(id as TIMESTAMP) from test;*
>
> java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
> at java.sql.Timestamp.valueOf(Timestamp.java:237)
> at org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441)
> at org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421)
> at org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530)
> at org.apache.hive.beeline.Rows$Row.<init>(Rows.java:166)
> at org.apache.hive.beeline.BufferedRows.<init>(BufferedRows.java:43)
> at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756)
> at org.apache.hive.beeline.Commands.execute(Commands.java:826)
> at org.apache.hive.beeline.Commands.sql(Commands.java:670)
> at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974)
> at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810)
> at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767)
> at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480)
> at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
> at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
> Error: Unrecognized column type: TIMESTAMP_TYPE (state=,code=0)
>
> I tried Hive, and it works well; the conversion is correct:
> {code:java}
> select 234, cast(id as TIMESTAMP) from test;
> 234 2020-04-08 11:56:28
> {code}
> Two questions:
> q1: If we forbid this conversion, should we keep all cases consistent?
> q2: If we allow the conversion in some cases, should we check the length of
> the long value? The code seems to always convert to microseconds with
> * 1000000L no matter how large the input is; if a value of the wrong length
> is converted to a timestamp, we could raise an error instead.
> {code:java}
> // converting seconds to us
> private[this] def longToTimestamp(t: Long): Long = t * 1000000L{code}
>
> Thanks!
[jira] [Assigned] (SPARK-33549) Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
[ https://issues.apache.org/jira/browse/SPARK-33549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33549:
------------------------------------
Assignee: Gengliang Wang (was: Apache Spark)

> Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
>
> Key: SPARK-33549
> URL: https://issues.apache.org/jira/browse/SPARK-33549
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Minor
>
> In the current master branch, there is a new configuration
> `spark.sql.legacy.allowCastNumericToTimestamp` which controls whether casting
> numeric types to Timestamp is allowed. The default value is true.
> After https://github.com/apache/spark/pull/30260, the type conversion between
> Timestamp type and numeric types is disallowed in ANSI mode, so we don't need
> a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for
> disallowing the conversion.
> We should remove the configuration.
[jira] [Commented] (SPARK-33549) Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
[ https://issues.apache.org/jira/browse/SPARK-33549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238501#comment-17238501 ]

Apache Spark commented on SPARK-33549:
--------------------------------------
User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/30493

> Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
>
> Key: SPARK-33549
> URL: https://issues.apache.org/jira/browse/SPARK-33549
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Gengliang Wang
> Assignee: Gengliang Wang
> Priority: Minor
>
> In the current master branch, there is a new configuration
> `spark.sql.legacy.allowCastNumericToTimestamp` which controls whether casting
> numeric types to Timestamp is allowed. The default value is true.
> After https://github.com/apache/spark/pull/30260, the type conversion between
> Timestamp type and numeric types is disallowed in ANSI mode, so we don't need
> a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for
> disallowing the conversion.
> We should remove the configuration.
[jira] [Assigned] (SPARK-33549) Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
[ https://issues.apache.org/jira/browse/SPARK-33549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-33549:
------------------------------------
Assignee: Apache Spark (was: Gengliang Wang)

> Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
>
> Key: SPARK-33549
> URL: https://issues.apache.org/jira/browse/SPARK-33549
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Gengliang Wang
> Assignee: Apache Spark
> Priority: Minor
>
> In the current master branch, there is a new configuration
> `spark.sql.legacy.allowCastNumericToTimestamp` which controls whether casting
> numeric types to Timestamp is allowed. The default value is true.
> After https://github.com/apache/spark/pull/30260, the type conversion between
> Timestamp type and numeric types is disallowed in ANSI mode, so we don't need
> a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for
> disallowing the conversion.
> We should remove the configuration.
[jira] [Resolved] (SPARK-33533) BasicConnectionProvider should consider case-sensitivity for properties.
[ https://issues.apache.org/jira/browse/SPARK-33533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun resolved SPARK-33533.
-----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 30485
[https://github.com/apache/spark/pull/30485]

> BasicConnectionProvider should consider case-sensitivity for properties.
>
> Key: SPARK-33533
> URL: https://issues.apache.org/jira/browse/SPARK-33533
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Kousuke Saruta
> Assignee: Kousuke Saruta
> Priority: Critical
> Fix For: 3.1.0
>
> After SPARK-32001, BasicConnectionProvider doesn't consider case sensitivity
> for properties. Because of this issue, OracleIntegrationSuite doesn't pass.
[jira] [Assigned] (SPARK-33224) Expose watermark information on SS UI
[ https://issues.apache.org/jira/browse/SPARK-33224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim reassigned SPARK-33224:
------------------------------------
Assignee: Jungtaek Lim

> Expose watermark information on SS UI
>
> Key: SPARK-33224
> URL: https://issues.apache.org/jira/browse/SPARK-33224
> Project: Spark
> Issue Type: Sub-task
> Components: Structured Streaming, Web UI
> Affects Versions: 3.0.1
> Reporter: Gabor Somogyi
> Assignee: Jungtaek Lim
> Priority: Major
[jira] [Updated] (SPARK-33533) BasicConnectionProvider should consider case-sensitivity for properties.
[ https://issues.apache.org/jira/browse/SPARK-33533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-33533:
----------------------------------
    Priority: Critical (was: Major)

> BasicConnectionProvider should consider case-sensitivity for properties.
>
> Key: SPARK-33533
> URL: https://issues.apache.org/jira/browse/SPARK-33533
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Kousuke Saruta
> Assignee: Kousuke Saruta
> Priority: Critical
>
> After SPARK-32001, BasicConnectionProvider doesn't consider case sensitivity
> for properties. Because of this issue, OracleIntegrationSuite doesn't pass.
[jira] [Resolved] (SPARK-33224) Expose watermark information on SS UI
[ https://issues.apache.org/jira/browse/SPARK-33224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jungtaek Lim resolved SPARK-33224.
----------------------------------
    Fix Version/s: 3.1.0
       Resolution: Fixed

Issue resolved by pull request 30427
[https://github.com/apache/spark/pull/30427]

> Expose watermark information on SS UI
>
> Key: SPARK-33224
> URL: https://issues.apache.org/jira/browse/SPARK-33224
> Project: Spark
> Issue Type: Sub-task
> Components: Structured Streaming, Web UI
> Affects Versions: 3.0.1
> Reporter: Gabor Somogyi
> Assignee: Jungtaek Lim
> Priority: Major
> Fix For: 3.1.0
[jira] [Created] (SPARK-33549) Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
Gengliang Wang created SPARK-33549:
-----------------------------------
    Summary: Remove configuration spark.sql.legacy.allowCastNumericToTimestamp
    Key: SPARK-33549
    URL: https://issues.apache.org/jira/browse/SPARK-33549
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 3.1.0
    Reporter: Gengliang Wang
    Assignee: Gengliang Wang

In the current master branch, there is a new configuration
`spark.sql.legacy.allowCastNumericToTimestamp` which controls whether casting
numeric types to Timestamp is allowed. The default value is true.

After https://github.com/apache/spark/pull/30260, the type conversion between
Timestamp type and numeric types is disallowed in ANSI mode, so we don't need
a separate configuration `spark.sql.legacy.allowCastNumericToTimestamp` for
disallowing the conversion.

We should remove the configuration.
[jira] [Updated] (SPARK-33542) Group exception messages in catalyst/catalog
[ https://issues.apache.org/jira/browse/SPARK-33542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allison Wang updated SPARK-33542:
---------------------------------
    Summary: Group exception messages in catalyst/catalog
             (was: Group exceptions in catalyst/catalog)

> Group exception messages in catalyst/catalog
>
> Key: SPARK-33542
> URL: https://issues.apache.org/jira/browse/SPARK-33542
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Allison Wang
> Priority: Major
[jira] [Updated] (SPARK-33539) Standardize exception messages in Spark
[ https://issues.apache.org/jira/browse/SPARK-33539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allison Wang updated SPARK-33539:
---------------------------------
    Description:
        In the SPIP: Standardize Exception Messages in Spark, there are three
        major improvements proposed:
        # Group error messages in dedicated files.
        # Establish an error message guideline for developers.
        # Improve error message quality.
        The first step is to centralize error messages for each component into
        its own dedicated file(s). This can help with auditing error messages
        and with subsequent tasks to establish a guideline and improve message
        quality in the future.
        A general rule of thumb for grouping exceptions:
        * AnalysisException => QueryCompilationErrors
        * SparkException, RuntimeException (UnsupportedOperationException,
          IllegalStateException...) => QueryExecutionErrors
        Here is an example PR to group all `AnalysisException` in Analyzer into
        QueryCompilationErrors: [https://github.com/apache/spark/pull/29497]
        Please see the SPIP:
        [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
        for more details.
    was:
        In the SPIP: Standardize Exception Messages in Spark, we have proposed
        three major tasks to standardize exception messages in Spark:
        # Group error messages in dedicated files.
        # Establish an error message guideline for developers.
        # Improve error message quality.
        The first step is to centralize error messages for each component into
        its own dedicated file(s). This can help with auditing error messages
        and with subsequent tasks to establish a guideline and improve message
        quality in the future.
        A general rule of thumb for grouping exceptions:
        * AnalysisException => QueryCompilationErrors
        * SparkException, RuntimeException (UnsupportedOperationException,
          IllegalStateException...) => QueryExecutionErrors
        Here is an example PR to group all `AnalysisException` in Analyzer into
        QueryCompilationErrors: [https://github.com/apache/spark/pull/29497]
        Please see the SPIP:
        [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
        for more details.

> Standardize exception messages in Spark
>
> Key: SPARK-33539
> URL: https://issues.apache.org/jira/browse/SPARK-33539
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 3.1.0
> Reporter: Allison Wang
> Priority: Major
>
> In the SPIP: Standardize Exception Messages in Spark, there are three major
> improvements proposed:
> # Group error messages in dedicated files.
> # Establish an error message guideline for developers.
> # Improve error message quality.
> The first step is to centralize error messages for each component into its
> own dedicated file(s). This can help with auditing error messages and with
> subsequent tasks to establish a guideline and improve message quality in the
> future.
> A general rule of thumb for grouping exceptions:
> * AnalysisException => QueryCompilationErrors
> * SparkException, RuntimeException (UnsupportedOperationException,
>   IllegalStateException...) => QueryExecutionErrors
> Here is an example PR to group all `AnalysisException` in Analyzer into
> QueryCompilationErrors: [https://github.com/apache/spark/pull/29497]
> Please see the SPIP:
> [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing]
> for more details.
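The "dedicated file" pattern described above can be sketched outside Spark as a module of named error constructors. The class and helper names below are illustrative, not Spark's actual API:

```python
# Illustrative sketch of grouping error constructors into one dedicated class,
# as the SPIP proposes for QueryCompilationErrors; all names are hypothetical.

class AnalysisException(Exception):
    """Stand-in for Spark's AnalysisException."""

class QueryCompilationErrors:
    """Every compilation-time error message lives here, so messages can be
    audited, deduplicated, and standardized in a single place."""

    @staticmethod
    def unresolved_column_error(name: str, candidates: list) -> AnalysisException:
        return AnalysisException(
            f"Column '{name}' does not exist. Did you mean one of: "
            + ", ".join(candidates) + "?"
        )

# Call sites raise a named error instead of formatting strings inline:
err = QueryCompilationErrors.unresolved_column_error("ky", ["key", "sales"])
print(err)  # Column 'ky' does not exist. Did you mean one of: key, sales?
```

Because each message has exactly one constructor, changing the wording or tone of an error is a one-line edit, and the full message inventory can be reviewed by reading one file.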
[jira] [Updated] (SPARK-33541) Group exception messages in catalyst/expressions
[ https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allison Wang updated SPARK-33541:
---------------------------------
    Summary: Group exception messages in catalyst/expressions
             (was: Group exceptions in catalyst/expressions)

> Group exception messages in catalyst/expressions
>
> Key: SPARK-33541
> URL: https://issues.apache.org/jira/browse/SPARK-33541
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 3.1.0
> Reporter: Allison Wang
> Priority: Major
[jira] [Assigned] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output
[ https://issues.apache.org/jira/browse/SPARK-33071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33071: Assignee: Apache Spark > Join with ambiguous column succeeding but giving wrong output > - > > Key: SPARK-33071 > URL: https://issues.apache.org/jira/browse/SPARK-33071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.1, 3.1.0 >Reporter: George >Assignee: Apache Spark >Priority: Critical > Labels: correctness > > When joining two datasets where one column in each dataset is sourced from > the same input dataset, the join successfully runs, but does not select the > correct columns, leading to incorrect output. > Repro using pyspark: > {code:java} > sc.version > import pyspark.sql.functions as F > d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, 'units' > : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', 'sales': 1, > 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}] > input_df = spark.createDataFrame(d) > df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales')) > df2 = input_df.groupBy("key").agg(F.sum('units').alias('units')) > df1 = df1.filter(F.col("key") != F.lit("c")) > df2 = df2.filter(F.col("key") != F.lit("d")) > ret = df1.join(df2, df1.key == df2.key, "full").select( > df1["key"].alias("df1_key"), > df2["key"].alias("df2_key"), > df1["sales"], > df2["units"], > F.coalesce(df1["key"], df2["key"]).alias("key")) > ret.show() > ret.explain(){code} > output for 2.4.4: > {code:java} > >>> sc.version > u'2.4.4' > >>> import pyspark.sql.functions as F > >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, > >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, {'key': 'c', > >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}] > >>> input_df = spark.createDataFrame(d) > >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales')) > >>> df2 = 
input_df.groupBy("key").agg(F.sum('units').alias('units')) > >>> df1 = df1.filter(F.col("key") != F.lit("c")) > >>> df2 = df2.filter(F.col("key") != F.lit("d")) > >>> ret = df1.join(df2, df1.key == df2.key, "full").select( > ... df1["key"].alias("df1_key"), > ... df2["key"].alias("df2_key"), > ... df1["sales"], > ... df2["units"], > ... F.coalesce(df1["key"], df2["key"]).alias("key")) > 20/10/05 15:46:14 WARN Column: Constructing trivially true equals predicate, > 'key#213 = key#213'. Perhaps you need to use aliases. > >>> ret.show() > +---+---+-+-++ > |df1_key|df2_key|sales|units| key| > +---+---+-+-++ > | d| d|3| null| d| > | null| null| null|2|null| > | b| b|5| 10| b| > | a| a|3|6| a| > +---+---+-+-++>>> ret.explain() > == Physical Plan == > *(5) Project [key#213 AS df1_key#258, key#213 AS df2_key#259, sales#223L, > units#230L, coalesce(key#213, key#213) AS key#260] > +- SortMergeJoin [key#213], [key#237], FullOuter >:- *(2) Sort [key#213 ASC NULLS FIRST], false, 0 >: +- *(2) HashAggregate(keys=[key#213], functions=[sum(sales#214L)]) >: +- Exchange hashpartitioning(key#213, 200) >:+- *(1) HashAggregate(keys=[key#213], > functions=[partial_sum(sales#214L)]) >: +- *(1) Project [key#213, sales#214L] >: +- *(1) Filter (isnotnull(key#213) && NOT (key#213 = c)) >: +- Scan ExistingRDD[key#213,sales#214L,units#215L] >+- *(4) Sort [key#237 ASC NULLS FIRST], false, 0 > +- *(4) HashAggregate(keys=[key#237], functions=[sum(units#239L)]) > +- Exchange hashpartitioning(key#237, 200) > +- *(3) HashAggregate(keys=[key#237], > functions=[partial_sum(units#239L)]) >+- *(3) Project [key#237, units#239L] > +- *(3) Filter (isnotnull(key#237) && NOT (key#237 = d)) > +- Scan ExistingRDD[key#237,sales#238L,units#239L] > {code} > output for 3.0.1: > {code:java} > // code placeholder > >>> sc.version > u'3.0.1' > >>> import pyspark.sql.functions as F > >>> d = [{'key': 'a', 'sales': 1, 'units' : 2}, {'key': 'a', 'sales': 2, > >>> 'units' : 4}, {'key': 'b', 'sales': 5, 'units' : 10}, 
{'key': 'c', > >>> 'sales': 1, 'units' : 2}, {'key': 'd', 'sales': 3, 'units' : 6}] > >>> input_df = spark.createDataFrame(d) > /usr/local/lib/python2.7/site-packages/pyspark/sql/session.py:381: > UserWarning: inferring schema from dict is deprecated,please use > pyspark.sql.Row instead > warnings.warn("inferring schema from dict is deprecated," > >>> df1 = input_df.groupBy("key").agg(F.sum('sales').alias('sales')) > >>> df2 = input_df.groupBy("key").agg(F.sum('units').alias('units')) > >>> df1 =
[jira] [Commented] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output
[ https://issues.apache.org/jira/browse/SPARK-33071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238481#comment-17238481 ] Apache Spark commented on SPARK-33071: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/30488 > Join with ambiguous column succeeding but giving wrong output > - > > Key: SPARK-33071 > URL: https://issues.apache.org/jira/browse/SPARK-33071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.1, 3.1.0 >Reporter: George >Priority: Critical > Labels: correctness > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33071) Join with ambiguous column succeeding but giving wrong output
[ https://issues.apache.org/jira/browse/SPARK-33071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33071: Assignee: (was: Apache Spark) > Join with ambiguous column succeeding but giving wrong output > - > > Key: SPARK-33071 > URL: https://issues.apache.org/jira/browse/SPARK-33071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.1, 3.1.0 >Reporter: George >Priority: Critical > Labels: correctness > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
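For reference, the correct answer for the repro data in this ticket can be computed without Spark at all. The following pure-Python sketch (a hypothetical reference implementation, not Spark code) shows what the full outer join should return: key 'c' should survive with a null sales side and 'd' with a null units side, unlike the buggy plan, which resolves both key columns to the same attribute:

```python
# Reference full outer join over the repro data, to contrast with the
# incorrect Spark output shown above.

def full_outer_join(left, right):
    """left/right: dicts key -> value. Returns key -> (left_val, right_val),
    with None where a key is absent on one side."""
    keys = set(left) | set(right)
    return {k: (left.get(k), right.get(k)) for k in sorted(keys)}

rows = [("a", 1, 2), ("a", 2, 4), ("b", 5, 10), ("c", 1, 2), ("d", 3, 6)]

# groupBy("key").agg(sum(...)) for each side
sales, units = {}, {}
for key, s, u in rows:
    sales[key] = sales.get(key, 0) + s
    units[key] = units.get(key, 0) + u

df1 = {k: v for k, v in sales.items() if k != "c"}  # df1.filter(key != 'c')
df2 = {k: v for k, v in units.items() if k != "d"}  # df2.filter(key != 'd')

expected = full_outer_join(df1, df2)
print(expected)
# {'a': (3, 6), 'b': (5, 10), 'c': (None, 2), 'd': (3, None)}
```

Note how rows 'c' and 'd' each pair a real value with None; the reported Spark output instead coalesces both key columns from one side, which is why it is tagged as a correctness bug.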
[jira] [Resolved] (SPARK-33543) Migrate SHOW COLUMNS to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33543. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30490 [https://github.com/apache/spark/pull/30490] > Migrate SHOW COLUMNS to new resolution framework > > > Key: SPARK-33543 > URL: https://issues.apache.org/jira/browse/SPARK-33543 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > Fix For: 3.1.0 > > > Migrate SHOW COLUMNS to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33543) Migrate SHOW COLUMNS to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33543: --- Assignee: Terry Kim > Migrate SHOW COLUMNS to new resolution framework > > > Key: SPARK-33543 > URL: https://issues.apache.org/jira/browse/SPARK-33543 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > > Migrate SHOW COLUMNS to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33544) explode should not filter when used with CreateArray
[ https://issues.apache.org/jira/browse/SPARK-33544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238473#comment-17238473 ] L. C. Hsieh commented on SPARK-33544: - Thanks [~hyukjin.kwon]. Will help review if [~tgraves] creates a patch. > explode should not filter when used with CreateArray > > > Key: SPARK-33544 > URL: https://issues.apache.org/jira/browse/SPARK-33544 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to > insert a filter for not null and size > 0 when using inner explode/inline. > This is fine in most cases, but the extra filter is not needed if the explode > is over a created array that does not use literals (the optimization already > handles literals). In that case you know the array isn't null and has a > size. It already handles the empty array. > for instance: > val df = someDF.selectExpr("number", "explode(array(word, col3))") > So in this case we shouldn't be inserting the extra Filter, and that filter > can also get pushed down into e.g. a Parquet reader. This just causes > extra overhead. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
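The semantics at issue can be sketched in plain Python (a hypothetical model of `explode(array(word, col3))`, not Spark's implementation): when the array is built from a fixed list of columns, its size is fixed by that list, so it can never be NULL or empty, and the extra `IsNotNull AND size > 0` filter from SPARK-32295 cannot eliminate anything:

```python
# Model of explode over array(col1, col2, ...): one output row per element.
# The array is constructed per input row from existing columns, so it always
# has exactly len(array_cols) elements, never NULL and never empty.

def explode_rows(rows, array_cols):
    out = []
    for row in rows:
        arr = [row[c] for c in array_cols]  # array(word, col3)
        for elem in arr:                    # inner explode: no filtering
            out.append({"number": row["number"], "col": elem})
    return out

rows = [{"number": 1, "word": "x", "col3": None},
        {"number": 2, "word": "y", "col3": "z"}]

exploded = explode_rows(rows, ["word", "col3"])
print(len(exploded))  # 4: every row yields len(array_cols) output rows
```

NULL *elements* still pass through (row 1 carries a None), but no input row is ever dropped, which is why the inserted filter is pure overhead here, and worse if it gets pushed down into the file scan.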
[jira] [Commented] (SPARK-33544) explode should not filter when used with CreateArray
[ https://issues.apache.org/jira/browse/SPARK-33544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238469#comment-17238469 ] Hyukjin Kwon commented on SPARK-33544: -- cc [~viirya] FYI > explode should not filter when used with CreateArray > > > Key: SPARK-33544 > URL: https://issues.apache.org/jira/browse/SPARK-33544 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33548) Peak Execution Memory not display on Spark Executor UI intuitively
[ https://issues.apache.org/jira/browse/SPARK-33548?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238466#comment-17238466 ] xuziqiJS commented on SPARK-33548: -- I will fix it, please assign the task to me. > Peak Execution Memory not display on Spark Executor UI intuitively > -- > > Key: SPARK-33548 > URL: https://issues.apache.org/jira/browse/SPARK-33548 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0, 3.0.1 >Reporter: xuziqiJS >Priority: Major > > Currently, Peak Execution Memory can only be obtained through the REST API and > cannot be displayed on the Spark Executor UI intuitively, even though Spark > users depend on this metric when tuning executor memory. Therefore, it is very > important to display the peak memory usage on the Spark UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33548) Peak Execution Memory not display on Spark Executor UI intuitively
xuziqiJS created SPARK-33548: Summary: Peak Execution Memory not display on Spark Executor UI intuitively Key: SPARK-33548 URL: https://issues.apache.org/jira/browse/SPARK-33548 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.1, 3.0.0 Reporter: xuziqiJS Currently, Peak Execution Memory can only be obtained through the REST API and cannot be displayed on the Spark Executor UI intuitively, even though Spark users depend on this metric when tuning executor memory. Therefore, it is very important to display the peak memory usage on the Spark UI. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
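Today the value is reachable via the monitoring REST API's executor list (`GET /api/v1/applications/<app-id>/executors`, whose entries carry a `peakMemoryMetrics` object). The sketch below parses a trimmed, hypothetical sample of one executor entry; the metric field names follow Spark's executor metrics, but the surrounding JSON is made up for illustration:

```python
import json

# Trimmed, hypothetical sample of one entry from the /executors endpoint.
sample = json.loads("""
{
  "id": "1",
  "peakMemoryMetrics": {
    "JVMHeapMemory": 536870912,
    "JVMOffHeapMemory": 104857600,
    "OnHeapExecutionMemory": 268435456,
    "OffHeapExecutionMemory": 0
  }
}
""")

def peak_execution_memory(executor):
    """Total peak execution memory (on-heap + off-heap) for one executor,
    as a UI page could compute it from the REST payload."""
    m = executor.get("peakMemoryMetrics", {})
    return m.get("OnHeapExecutionMemory", 0) + m.get("OffHeapExecutionMemory", 0)

print(peak_execution_memory(sample))  # 268435456
```

This is the extraction step users currently have to script themselves, which is the ticket's point: surfacing the same number directly on the Executors page would remove the detour through the REST API.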
[jira] [Resolved] (SPARK-33494) Do not use local shuffle reader for repartition
[ https://issues.apache.org/jira/browse/SPARK-33494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-33494. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30432 [https://github.com/apache/spark/pull/30432] > Do not use local shuffle reader for repartition > --- > > Key: SPARK-33494 > URL: https://issues.apache.org/jira/browse/SPARK-33494 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33547) Doc Type Construct Literal usage
angerszhu created SPARK-33547: - Summary: Doc Type Construct Literal usage Key: SPARK-33547 URL: https://issues.apache.org/jira/browse/SPARK-33547 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.1.0 Reporter: angerszhu Add Doc about type construct literal in [https://spark.apache.org/docs/3.0.1/sql-ref-literals.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33547) Doc Type Construct Literal usage
[ https://issues.apache.org/jira/browse/SPARK-33547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238453#comment-17238453 ] angerszhu commented on SPARK-33547: --- Working on this > Doc Type Construct Literal usage > > > Key: SPARK-33547 > URL: https://issues.apache.org/jira/browse/SPARK-33547 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: angerszhu >Priority: Major > > Add Doc about type construct literal in > [https://spark.apache.org/docs/3.0.1/sql-ref-literals.html] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33546) CREATE TABLE LIKE should resolve hive serde correctly like CREATE TABLE
[ https://issues.apache.org/jira/browse/SPARK-33546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-33546: Description: Currently there are several inconsistencies: # CREATE TABLE LIKE does not validate the user-specified hive serde. e.g., STORED AS PARQUET can't be used with ROW FORMAT SERDE. # CREATE TABLE LIKE requires STORED AS and ROW FORMAT SERDE to be specified together, which is not necessary. # CREATE TABLE LIKE does not respect the default hive serde. > CREATE TABLE LIKE should resolve hive serde correctly like CREATE TABLE > --- > > Key: SPARK-33546 > URL: https://issues.apache.org/jira/browse/SPARK-33546 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > > Currently there are several inconsistencies: > # CREATE TABLE LIKE does not validate the user-specified hive serde. e.g., > STORED AS PARQUET can't be used with ROW FORMAT SERDE. > # CREATE TABLE LIKE requires STORED AS and ROW FORMAT SERDE to be specified > together, which is not necessary. > # CREATE TABLE LIKE does not respect the default hive serde. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33252) Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
[ https://issues.apache.org/jira/browse/SPARK-33252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33252. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 30413 [https://github.com/apache/spark/pull/30413] > Migration to NumPy documentation style in MLlib (pyspark.mllib.*) > - > > Key: SPARK-33252 > URL: https://issues.apache.org/jira/browse/SPARK-33252 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.2.0 > > > This JIRA aims to migrate to NumPy documentation style in MLlib > (pyspark.mllib.*). Please also see the parent JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
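For context, "NumPy documentation style" refers to the numpydoc docstring format with `Parameters`, `Returns`, and `Examples` sections. Below is a hypothetical function (not an actual pyspark.mllib API) showing the target format:

```python
def normalize(vector, p=2.0):
    """Normalize a vector to unit p-norm.

    .. versionadded:: 3.1.0

    Parameters
    ----------
    vector : list of float
        Values to normalize.
    p : float, optional
        Norm order (default 2.0).

    Returns
    -------
    list of float
        The input values scaled to unit p-norm.

    Examples
    --------
    >>> normalize([3.0, 4.0])
    [0.6, 0.8]
    """
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    return [x / norm for x in vector]
```

Migrating docstrings to this structure is what lets the Sphinx/numpydoc toolchain render consistent parameter tables and doctests across the pyspark.mllib reference docs.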
[jira] [Created] (SPARK-33546) CREATE TABLE LIKE should resolve hive serde correctly like CREATE TABLE
Wenchen Fan created SPARK-33546: --- Summary: CREATE TABLE LIKE should resolve hive serde correctly like CREATE TABLE Key: SPARK-33546 URL: https://issues.apache.org/jira/browse/SPARK-33546 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 2.4.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33252) Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
[ https://issues.apache.org/jira/browse/SPARK-33252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33252: Assignee: Maciej Szymkiewicz > Migration to NumPy documentation style in MLlib (pyspark.mllib.*) > - > > Key: SPARK-33252 > URL: https://issues.apache.org/jira/browse/SPARK-33252 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > > This JIRA aims to migrate to NumPy documentation style in MLlib > (pyspark.mllib.*). Please also see the parent JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33252) Migration to NumPy documentation style in MLlib (pyspark.mllib.*)
[ https://issues.apache.org/jira/browse/SPARK-33252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-33252: - Fix Version/s: (was: 3.2.0) 3.1.0 > Migration to NumPy documentation style in MLlib (pyspark.mllib.*) > - > > Key: SPARK-33252 > URL: https://issues.apache.org/jira/browse/SPARK-33252 > Project: Spark > Issue Type: Sub-task > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > This JIRA aims to migrate to NumPy documentation style in MLlib > (pyspark.mllib.*). Please also see the parent JIRA. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33534) Allow specifying a minimum number of bytes in a split of a file
[ https://issues.apache.org/jira/browse/SPARK-33534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-33534: - Component/s: (was: Input/Output) SQL > Allow specifying a minimum number of bytes in a split of a file > --- > > Key: SPARK-33534 > URL: https://issues.apache.org/jira/browse/SPARK-33534 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Niels Basjes >Priority: Major > > *Background* > A long time ago I wrote a way of reading a (usually large) gzipped > file that allows better distribution of the load over an Apache > Hadoop cluster: [https://github.com/nielsbasjes/splittablegzip] > It seems people still need this kind of functionality, and it turns out my > code works without modification in conjunction with Apache Spark. > See for example: > - SPARK-29102 > - [https://stackoverflow.com/q/28127119/877069] > - [https://stackoverflow.com/q/27531816/877069] > So [~nchammas] provided documentation to my project a while ago on how to use > it with Spark. > [https://github.com/nielsbasjes/splittablegzip/blob/master/README-Spark.md] > *The problem* > Now some people have indicated getting errors from this feature of mine. > The fact is that this functionality cannot read a split if it is too small (the > number of bytes read from disk and the number of bytes coming out of the > compression are different). So my code uses the {{io.file.buffer.size}} > setting but also has a hard-coded lower limit on split size of 4 KiB. > The problem I found when looking into the reports I got is that Spark > does not have a minimum number of bytes in a split. > In fact: when I created a test file and then set > {{spark.sql.files.maxPartitionBytes}} to exactly 1 byte less than the size of > my test file, my library gave the error: > {{java.lang.IllegalArgumentException: The provided InputSplit (562686;562687] > is 1 bytes which is too small. 
(Minimum is 65536)}} > I found the code that does this calculation here > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartition.scala#L74 > *Proposed enhancement* > So what I propose is to have a new setting > ({{spark.sql.files.minPartitionBytes}} ?) that will guarantee that no split > of a file is smaller than a configured number of bytes. > I also propose to have this set to something like 64KiB as a default. > Having some constraints on the values of > {{spark.sql.files.minPartitionBytes}} and possibly in relation with > {{spark.sql.files.maxPartitionBytes}} would be fine. > *Notes* > Hadoop already has code that does this: > https://github.com/apache/hadoop/blob/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.java#L456 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
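The proposal above can be sketched in a few lines of Python (a hypothetical model of the FilePartition logic, not Spark's Scala code; the `min_partition_bytes` parameter name is illustrative): split the file by `maxPartitionBytes` as today, then merge any tail split that falls below the floor.

```python
# Model of the proposed split planning: size-based splitting with a
# minimum-split-size floor, applied to the failing case from the report.

def plan_splits(file_size, max_partition_bytes, min_partition_bytes=64 * 1024):
    """Return a list of (offset, length) splits covering the file.
    Any final split smaller than the floor is merged into its predecessor."""
    splits = []
    offset = 0
    while offset < file_size:
        length = min(max_partition_bytes, file_size - offset)
        splits.append((offset, length))
        offset += length
    if len(splits) > 1 and splits[-1][1] < min_partition_bytes:
        last_off, last_len = splits.pop()
        prev_off, prev_len = splits.pop()
        splits.append((prev_off, prev_len + last_len))
    return splits

# The reported failure: maxPartitionBytes set to 1 byte less than the file
# size would otherwise produce a 1-byte split (562686;562687].
print(plan_splits(562687, 562686))  # [(0, 562687)]
```

With the 64 KiB default floor, the degenerate 1-byte tail split is merged away, which is exactly the guarantee the splittable-gzip codec needs.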
[jira] [Assigned] (SPARK-33457) Adjust mypy configuration
[ https://issues.apache.org/jira/browse/SPARK-33457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-33457: Assignee: Maciej Szymkiewicz > Adjust mypy configuration > - > > Key: SPARK-33457 > URL: https://issues.apache.org/jira/browse/SPARK-33457 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > > At the moment, with the exception of type ignores, we use the default MyPy > configuration. This already provides decent coverage, but is somewhat less > restrictive than the configurations used in {{typeshed}} and {{pyspark-stubs}}. > We should consider at least the following: > - {{strict_optional}} > - {{no_implicit_optional}} > It might also be a good idea to add {{disallow_untyped_defs}}, which would > allow us to catch any instances of user-facing code that are missing > annotations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
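The flags under discussion are standard mypy settings; a minimal configuration fragment enabling them might look like the sketch below (illustrative only, not the configuration Spark actually adopted):

```ini
# Hypothetical mypy.ini fragment with the options named in the ticket.
[mypy]
# Treat None as a distinct type; Optional must be explicit in annotations.
strict_optional = True
# A default of None does not silently make a parameter Optional.
no_implicit_optional = True
# Flag any function definition that lacks type annotations.
disallow_untyped_defs = True
```

`disallow_untyped_defs` is the one most likely to surface gaps in user-facing APIs, since it errors on every unannotated `def` rather than only on mismatched types.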
[jira] [Resolved] (SPARK-33457) Adjust mypy configuration
[ https://issues.apache.org/jira/browse/SPARK-33457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-33457. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30382 [https://github.com/apache/spark/pull/30382] > Adjust mypy configuration > - > > Key: SPARK-33457 > URL: https://issues.apache.org/jira/browse/SPARK-33457 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.1.0 >Reporter: Maciej Szymkiewicz >Assignee: Maciej Szymkiewicz >Priority: Major > Fix For: 3.1.0 > > > At the moment, with the exception of type ignores, we use the default MyPy > configuration. This already provides decent coverage, but is somewhat less > restrictive than the configurations used in {{typeshed}} and {{pyspark-stubs}}. > We should consider at least the following: > - {{strict_optional}} > - {{no_implicit_optional}} > It might also be a good idea to add {{disallow_untyped_defs}}, which would > allow us to catch any instances of user-facing code that are missing > annotations. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code
[ https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238411#comment-17238411 ] Asif edited comment on SPARK-19875 at 11/24/20, 11:43 PM: -- [~maropu], [~sameerag] [~jay.pranavamurthi] I have created a PR for SPARK-33152, which fixes the OOM or unreasonable compile time in queries. The PR is [pr-for-spark-33152|https://github.com/apache/spark/pull/30185] I cannot get anybody for code review. The explanation of the logic used is in the PR. If needed, we can go through the code together. This is going to be used by Workday in production. was (Author: ashahid7): [~maropu], [~sameerag] [~jay.pranavamurthi] I have created a PR for SPARK-33152, which fixes the OOM or unreasonable compile time in queries. The PR is [pr-for-spark-33152|https://github.com/apache/spark/pull/30185] I cannot get anybody for code review. The explanation of the logic used is in the PR > Map->filter on many columns gets stuck in constraint inference optimization > code > > > Key: SPARK-19875 > URL: https://issues.apache.org/jira/browse/SPARK-19875 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi >Priority: Major > Labels: bulk-closed > Attachments: TestFilter.scala, test10cols.csv, test50cols.csv > > > The attached code (TestFilter.scala) works with a 10-column csv dataset, but > gets stuck with a 50-column csv dataset. Both datasets are attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code
[ https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238411#comment-17238411 ] Asif commented on SPARK-19875: -- [~maropu] I have created a PR for SPARK-33152, which fixes the OOM or unreasonable compile time in queries. The PR is [pr-for-spark-33152|https://github.com/apache/spark/pull/30185] I cannot get anybody for code review. The explanation of the logic used is in the PR > Map->filter on many columns gets stuck in constraint inference optimization > code > > > Key: SPARK-19875 > URL: https://issues.apache.org/jira/browse/SPARK-19875 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi >Priority: Major > Labels: bulk-closed > Attachments: TestFilter.scala, test10cols.csv, test50cols.csv > > > The attached code (TestFilter.scala) works with a 10-column csv dataset, but > gets stuck with a 50-column csv dataset. Both datasets are attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19875) Map->filter on many columns gets stuck in constraint inference optimization code
[ https://issues.apache.org/jira/browse/SPARK-19875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238411#comment-17238411 ] Asif edited comment on SPARK-19875 at 11/24/20, 11:42 PM: -- [~maropu], [~sameerag] [~jay.pranavamurthi] I have created a PR for SPARK-33152, which fixes the OOM or unreasonable compile time in queries. The PR is [pr-for-spark-33152|https://github.com/apache/spark/pull/30185] I cannot get anybody for code review. The explanation of the logic used is in the PR was (Author: ashahid7): [~maropu] I have created a PR for SPARK-33152, which fixes the OOM or unreasonable compile time in queries. The PR is [pr-for-spark-33152|https://github.com/apache/spark/pull/30185] I cannot get anybody for code review. The explanation of the logic used is in the PR > Map->filter on many columns gets stuck in constraint inference optimization > code > > > Key: SPARK-19875 > URL: https://issues.apache.org/jira/browse/SPARK-19875 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Jay Pranavamurthi >Priority: Major > Labels: bulk-closed > Attachments: TestFilter.scala, test10cols.csv, test50cols.csv > > > The attached code (TestFilter.scala) works with a 10-column csv dataset, but > gets stuck with a 50-column csv dataset. Both datasets are attached. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33287) Expose state custom metrics information on SS UI
[ https://issues.apache.org/jira/browse/SPARK-33287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-33287: Assignee: Gabor Somogyi > Expose state custom metrics information on SS UI > > > Key: SPARK-33287 > URL: https://issues.apache.org/jira/browse/SPARK-33287 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming, Web UI >Affects Versions: 3.0.1 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > > Since not all custom metrics hold useful information, it would be good to add > the possibility to exclude some of them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33287) Expose state custom metrics information on SS UI
[ https://issues.apache.org/jira/browse/SPARK-33287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-33287. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 30336 [https://github.com/apache/spark/pull/30336] > Expose state custom metrics information on SS UI > > > Key: SPARK-33287 > URL: https://issues.apache.org/jira/browse/SPARK-33287 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming, Web UI >Affects Versions: 3.0.1 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.0 > > > Since not all custom metrics hold useful information, it would be good to add > the possibility to exclude some of them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33545) Support Fallback Storage during Worker decommission
[ https://issues.apache.org/jira/browse/SPARK-33545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33545: Assignee: (was: Apache Spark) > Support Fallback Storage during Worker decommission > --- > > Key: SPARK-33545 > URL: https://issues.apache.org/jira/browse/SPARK-33545 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33545) Support Fallback Storage during Worker decommission
[ https://issues.apache.org/jira/browse/SPARK-33545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238392#comment-17238392 ] Apache Spark commented on SPARK-33545: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/30492 > Support Fallback Storage during Worker decommission > --- > > Key: SPARK-33545 > URL: https://issues.apache.org/jira/browse/SPARK-33545 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33545) Support Fallback Storage during Worker decommission
[ https://issues.apache.org/jira/browse/SPARK-33545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33545: Assignee: Apache Spark > Support Fallback Storage during Worker decommission > --- > > Key: SPARK-33545 > URL: https://issues.apache.org/jira/browse/SPARK-33545 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33545) Support Fallback Storage during Worker decommission
Dongjoon Hyun created SPARK-33545: - Summary: Support Fallback Storage during Worker decommission Key: SPARK-33545 URL: https://issues.apache.org/jira/browse/SPARK-33545 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33544) explode should not filter when used with CreateArray
[ https://issues.apache.org/jira/browse/SPARK-33544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238371#comment-17238371 ] Thomas Graves commented on SPARK-33544: --- I'm working on a patch for this. > explode should not filter when used with CreateArray > > > Key: SPARK-33544 > URL: https://issues.apache.org/jira/browse/SPARK-33544 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.0 >Reporter: Thomas Graves >Priority: Major > > https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to > insert a filter for not-null and size > 0 when using inner explode/inline. > This is fine in most cases, but the extra filter is not needed if the explode > is applied to a CreateArray that does not use Literals (Literals are already > handled). In that case the values are known to be non-null and the array has > a size; the empty-array case is also already handled. > For instance: > val df = someDF.selectExpr("number", "explode(array(word, col3))") > In this case we shouldn't insert the extra Filter, which can also get pushed > down into, for example, a Parquet reader, causing extra overhead. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33544) explode should not filter when used with CreateArray
Thomas Graves created SPARK-33544: - Summary: explode should not filter when used with CreateArray Key: SPARK-33544 URL: https://issues.apache.org/jira/browse/SPARK-33544 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.0 Reporter: Thomas Graves https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to insert a filter for not-null and size > 0 when using inner explode/inline. This is fine in most cases, but the extra filter is not needed if the explode is applied to a CreateArray that does not use Literals (Literals are already handled). In that case the values are known to be non-null and the array has a size; the empty-array case is also already handled. For instance: val df = someDF.selectExpr("number", "explode(array(word, col3))") In this case we shouldn't insert the extra Filter, which can also get pushed down into, for example, a Parquet reader, causing extra overhead. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
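The redundancy SPARK-33544 describes can be illustrated outside of Spark (this is plain Python, not Spark code or Spark's optimizer logic): an array built directly from the row's own columns, as `array(word, col3)` is, can never be null or empty, so a not-null / size > 0 pre-filter removes nothing.

```python
# Pure-Python illustration of the argument above: when explode() is
# applied to array(col1, col2), the array is constructed from the row
# itself, so it is never null and never empty. Null *elements* are still
# allowed; the filter checks the array, not its elements.

rows = [
    {"number": 1, "word": "a", "col3": "b"},
    {"number": 2, "word": None, "col3": "c"},  # a null element is fine
]

def explode_created_array(row):
    arr = [row["word"], row["col3"]]  # array(word, col3): always size 2
    return [{"number": row["number"], "col": v} for v in arr]

exploded = [out for r in rows for out in explode_created_array(r)]

# The filter SPARK-32295 inserts (arr IS NOT NULL AND size(arr) > 0)
# keeps every row here, i.e. it is a no-op for this shape of query.
survivors = [r for r in rows
             if [r["word"], r["col3"]] is not None
             and len([r["word"], r["col3"]]) > 0]
assert survivors == rows
assert len(exploded) == 2 * len(rows)
```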
[jira] [Commented] (SPARK-33492) DSv2: Append/Overwrite/ReplaceTable should invalidate cache
[ https://issues.apache.org/jira/browse/SPARK-33492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238363#comment-17238363 ] Apache Spark commented on SPARK-33492: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/30491 > DSv2: Append/Overwrite/ReplaceTable should invalidate cache > --- > > Key: SPARK-33492 > URL: https://issues.apache.org/jira/browse/SPARK-33492 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Fix For: 3.1.0 > > > Unlike in DSv1, currently in DSv2 we don't invalidate table caches for > operations such as append, overwrite table by expr/partition, replace table, > etc. We should fix these so that the behavior is consistent between v1 and v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file
[ https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-32670: --- Assignee: Xinyi Yu (was: Xiao Li) > Group exception messages in Catalyst Analyzer in one file > - > > Key: SPARK-32670 > URL: https://issues.apache.org/jira/browse/SPARK-32670 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiao Li >Assignee: Xinyi Yu >Priority: Minor > Fix For: 3.1.0 > > > For standardization of error messages and its maintenance, we can try to > group the exception messages into a single file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32670) Group exception messages in Catalyst Analyzer in one file
[ https://issues.apache.org/jira/browse/SPARK-32670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-32670: Parent: SPARK-33539 Issue Type: Sub-task (was: Improvement) > Group exception messages in Catalyst Analyzer in one file > - > > Key: SPARK-32670 > URL: https://issues.apache.org/jira/browse/SPARK-32670 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Xiao Li >Assignee: Xiao Li >Priority: Minor > Fix For: 3.1.0 > > > For standardization of error messages and its maintenance, we can try to > group the exception messages into a single file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33543) Migrate SHOW COLUMNS to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33543: Assignee: (was: Apache Spark) > Migrate SHOW COLUMNS to new resolution framework > > > Key: SPARK-33543 > URL: https://issues.apache.org/jira/browse/SPARK-33543 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Migrate SHOW COLUMNS to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33543) Migrate SHOW COLUMNS to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33543: Assignee: Apache Spark > Migrate SHOW COLUMNS to new resolution framework > > > Key: SPARK-33543 > URL: https://issues.apache.org/jira/browse/SPARK-33543 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Assignee: Apache Spark >Priority: Minor > > Migrate SHOW COLUMNS to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33543) Migrate SHOW COLUMNS to new resolution framework
[ https://issues.apache.org/jira/browse/SPARK-33543?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238343#comment-17238343 ] Apache Spark commented on SPARK-33543: -- User 'imback82' has created a pull request for this issue: https://github.com/apache/spark/pull/30490 > Migrate SHOW COLUMNS to new resolution framework > > > Key: SPARK-33543 > URL: https://issues.apache.org/jira/browse/SPARK-33543 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Terry Kim >Priority: Minor > > Migrate SHOW COLUMNS to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33543) Migrate SHOW COLUMNS to new resolution framework
Terry Kim created SPARK-33543: - Summary: Migrate SHOW COLUMNS to new resolution framework Key: SPARK-33543 URL: https://issues.apache.org/jira/browse/SPARK-33543 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Terry Kim Migrate SHOW COLUMNS to new resolution framework. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33542) Group exceptions in catalyst/catalog
Allison Wang created SPARK-33542: Summary: Group exceptions in catalyst/catalog Key: SPARK-33542 URL: https://issues.apache.org/jira/browse/SPARK-33542 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Allison Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33541) Group exceptions in catalyst/expressions
[ https://issues.apache.org/jira/browse/SPARK-33541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-33541: - Summary: Group exceptions in catalyst/expressions (was: Group AnalysisException in catalyst/expressions into QueryCompilationErrors) > Group exceptions in catalyst/expressions > > > Key: SPARK-33541 > URL: https://issues.apache.org/jira/browse/SPARK-33541 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Allison Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33539) Standardize exception messages in Spark
[ https://issues.apache.org/jira/browse/SPARK-33539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-33539: - Description: In the SPIP: Standardize Exception Messages in Spark, we have proposed three major tasks to standardize exception messages in Spark: # Group error messages in dedicated files. # Establish an error message guideline for developers. # Improve error message quality. The first step is to centralize error messages for each component into its own dedicated file(s). This can help with auditing error messages and subsequent tasks to establish a guideline and improve message quality in the future. A general rule of thumb for grouping exceptions: * AnalysisException => QueryCompilationErrors * SparkException, RuntimeException(UnsupportedOperationException, IllegalStateException...) => QueryExecutionErrors Here is an example PR to group all `AnalysisException` in Analyzer into QueryCompilationErrors: [https://github.com/apache/spark/pull/29497] Please see the SPIP: [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing] for more details. was: In the SPIP: Standardize Exception Messages in Spark, we have proposed three major tasks to standardize exception messages in Spark: # Group error messages in dedicated files. # Establish an error message guideline for developers. # Improve error message quality. The first step is to centralize error messages for each component into its own dedicated file(s). This can help with auditing error messages and subsequent tasks to establish a guideline and improve message quality in the future. Here is an example PR to group all `AnalysisException` in Analyzer into QueryCompilationErrors: [https://github.com/apache/spark/pull/29497] Please see the SPIP: [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing] for more details. 
> Standardize exception messages in Spark > --- > > Key: SPARK-33539 > URL: https://issues.apache.org/jira/browse/SPARK-33539 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.0 >Reporter: Allison Wang >Priority: Major > > In the SPIP: Standardize Exception Messages in Spark, we have proposed three > major tasks to standardize exception messages in Spark: > # Group error messages in dedicated files. > # Establish an error message guideline for developers. > # Improve error message quality. > The first step is to centralize error messages for each component into its > own dedicated file(s). This can help with auditing error messages and > subsequent tasks to establish a guideline and improve message quality in the > future. > A general rule of thumb for grouping exceptions: > * AnalysisException => QueryCompilationErrors > * SparkException, RuntimeException(UnsupportedOperationException, > IllegalStateException...) => QueryExecutionErrors > Here is an example PR to group all `AnalysisException` in Analyzer into > QueryCompilationErrors: [https://github.com/apache/spark/pull/29497] > Please see the SPIP: > [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing] > for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
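The grouping rule of thumb above can be sketched in miniature. The sketch below is a generic Python illustration of the pattern, not Spark's actual API (Spark's QueryCompilationErrors/QueryExecutionErrors live in Scala with different signatures): every call site asks a dedicated errors module for a ready-made exception instead of formatting the message inline, which makes the full set of messages auditable in one place.

```python
# Miniature sketch of the "dedicated error file" pattern from the SPIP.
# All names and messages here are illustrative, not Spark's actual API.

class AnalysisError(Exception):
    """Stand-in for AnalysisException (analysis/compilation errors)."""

class ExecutionError(RuntimeError):
    """Stand-in for runtime query-execution errors."""

# --- query_compilation_errors (AnalysisException-style errors) ---------
def unresolved_column_error(name: str) -> AnalysisError:
    # Centralizing the message makes it easy to audit and standardize.
    return AnalysisError(f"cannot resolve column name: {name}")

# --- query_execution_errors (runtime errors) ---------------------------
def division_by_zero_error() -> ExecutionError:
    return ExecutionError("division by zero")

# A call site raises a named error rather than formatting a string:
# raise unresolved_column_error("col3")
```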
[jira] [Created] (SPARK-33541) Group AnalysisException in catalyst/expressions into QueryCompilationErrors
Allison Wang created SPARK-33541: Summary: Group AnalysisException in catalyst/expressions into QueryCompilationErrors Key: SPARK-33541 URL: https://issues.apache.org/jira/browse/SPARK-33541 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Allison Wang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24266) Spark client terminates while driver is still running
[ https://issues.apache.org/jira/browse/SPARK-24266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24266: -- Fix Version/s: 2.4.8 > Spark client terminates while driver is still running > - > > Key: SPARK-24266 > URL: https://issues.apache.org/jira/browse/SPARK-24266 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 2.3.0, 3.0.0 >Reporter: Chun Chen >Assignee: Stijn De Haes >Priority: Critical > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > {code} > Warning: Ignoring non-spark config property: Default=system properties > included when running spark-submit. > 18/05/11 14:50:12 WARN Config: Error reading service account token from: > [/var/run/secrets/kubernetes.io/serviceaccount/token]. Ignoring. > 18/05/11 14:50:12 INFO HadoopStepsOrchestrator: Hadoop Conf directory: > Some(/data/tesla/spark-2.2.0-k8s-0.5.0-bin-2.7.3/hadoop-conf) > 18/05/11 14:50:15 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 18/05/11 14:50:15 WARN DomainSocketFactory: The short-circuit local reads > feature cannot be used because libhadoop cannot be loaded. > 18/05/11 14:50:16 INFO HadoopConfBootstrapImpl: HADOOP_CONF_DIR defined. 
> Mounting Hadoop specific files > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: N/A >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:17 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: N/A >container images: N/A >phase: Pending >status: [] > 18/05/11 14:50:18 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, > default-token-xvjt9 >node name: tbds-100-98-45-69 >start time: 2018-05-11T06:50:17Z >container images: 
docker.oa.com:8080/gaia/spark-driver-cos:20180503_9 >phase: Pending >status: [ContainerStatus(containerID=null, > image=docker.oa.com:8080/gaia/spark-driver-cos:20180503_9, imageID=, > lastState=ContainerState(running=null, terminated=null, waiting=null, > additionalProperties={}), name=spark-kubernetes-driver, ready=false, > restartCount=0, state=ContainerState(running=null, terminated=null, > waiting=ContainerStateWaiting(message=null, reason=PodInitializing, > additionalProperties={}), additionalProperties={}), additionalProperties={})] > 18/05/11 14:50:19 INFO Client: Waiting for application spark-64-293-980 to > finish... > 18/05/11 14:50:25 INFO LoggingPodStatusWatcherImpl: State changed, new state: >pod name: spark-64-293-980-1526021412180-driver >namespace: tione-603074457 >labels: network -> FLOATINGIP, spark-app-selector -> > spark-2843da19c690485b93780ad7992a101e, spark-role -> driver >pod uid: 90558303-54e7-11e8-9e64-525400da65d8 >creation time: 2018-05-11T06:50:17Z >service account name: default >volumes: spark-local-dir-0-spark-local, spark-init-properties, > download-jars-volume, download-files, spark-init-secret, hadoop-properties, >
[jira] [Updated] (SPARK-33539) Standardize exception messages in Spark
[ https://issues.apache.org/jira/browse/SPARK-33539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-33539: - Description: In the SPIP: Standardize Exception Messages in Spark, we have proposed three major tasks to standardize exception messages in Spark: # Group error messages in dedicated files. # Establish an error message guideline for developers. # Improve error message quality. The first step is to centralize error messages for each component into its own dedicated file(s). This change can help with auditing error messages and subsequent tasks to establish a guideline and improve message quality in the future. Here is an example PR to group all `AnalysisException` in Analyzer into QueryCompilationErrors: [https://github.com/apache/spark/pull/29497] Please see the SPIP: [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing] for more details. was: In the SPIP: Standardize Exception Messages in Spark, we have proposed three major tasks to standardize exception messages in Spark: # Group error messages in dedicated files. # Establish an error message guideline for developers. # Improve error message quality. The first step is to centralize error messages for each component into its own dedicated file(s). This change can help with auditing error messages and subsequent tasks to establish a guideline and improve message quality in the future. Here is an example PR to group all `AnalysisException` in Analyzer into QueryCompilationErrors: [https://github.com/apache/spark/pull/29497] Please see the SPIP: https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing for more details. 
> Standardize exception messages in Spark > --- > > Key: SPARK-33539 > URL: https://issues.apache.org/jira/browse/SPARK-33539 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.0 >Reporter: Allison Wang >Priority: Major > > In the SPIP: Standardize Exception Messages in Spark, we have proposed three > major tasks to standardize exception messages in Spark: > # Group error messages in dedicated files. > # Establish an error message guideline for developers. > # Improve error message quality. > The first step is to centralize error messages for each component into its > own dedicated file(s). This change can help with auditing error messages and > subsequent tasks to establish a guideline and improve message quality in the > future. > Here is an example PR to group all `AnalysisException` in Analyzer into > QueryCompilationErrors: [https://github.com/apache/spark/pull/29497] > Please see the SPIP: > [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing] > for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33539) Standardize exception messages in Spark
[ https://issues.apache.org/jira/browse/SPARK-33539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Allison Wang updated SPARK-33539: - Description: In the SPIP: Standardize Exception Messages in Spark, we have proposed three major tasks to standardize exception messages in Spark: # Group error messages in dedicated files. # Establish an error message guideline for developers. # Improve error message quality. The first step is to centralize error messages for each component into its own dedicated file(s). This can help with auditing error messages and subsequent tasks to establish a guideline and improve message quality in the future. Here is an example PR to group all `AnalysisException` in Analyzer into QueryCompilationErrors: [https://github.com/apache/spark/pull/29497] Please see the SPIP: [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing] for more details. was: In the SPIP: Standardize Exception Messages in Spark, we have proposed three major tasks to standardize exception messages in Spark: # Group error messages in dedicated files. # Establish an error message guideline for developers. # Improve error message quality. The first step is to centralize error messages for each component into its own dedicated file(s). This change can help with auditing error messages and subsequent tasks to establish a guideline and improve message quality in the future. Here is an example PR to group all `AnalysisException` in Analyzer into QueryCompilationErrors: [https://github.com/apache/spark/pull/29497] Please see the SPIP: [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing] for more details. 
> Standardize exception messages in Spark > --- > > Key: SPARK-33539 > URL: https://issues.apache.org/jira/browse/SPARK-33539 > Project: Spark > Issue Type: Improvement > Components: Spark Core, SQL >Affects Versions: 3.1.0 >Reporter: Allison Wang >Priority: Major > > In the SPIP: Standardize Exception Messages in Spark, we have proposed three > major tasks to standardize exception messages in Spark: > # Group error messages in dedicated files. > # Establish an error message guideline for developers. > # Improve error message quality. > The first step is to centralize error messages for each component into its > own dedicated file(s). This can help with auditing error messages and > subsequent tasks to establish a guideline and improve message quality in the > future. > Here is an example PR to group all `AnalysisException` in Analyzer into > QueryCompilationErrors: [https://github.com/apache/spark/pull/29497] > Please see the SPIP: > [https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing] > for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33540) Subexpression elimination for interpreted predicate
L. C. Hsieh created SPARK-33540: --- Summary: Subexpression elimination for interpreted predicate Key: SPARK-33540 URL: https://issues.apache.org/jira/browse/SPARK-33540 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.1.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh We can support subexpression elimination for interpreted predicates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
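The idea behind SPARK-33540 can be illustrated outside Spark: when two clauses of a predicate share a subtree, an interpreter can compute the shared subtree once per row instead of once per occurrence. The sketch below is a hypothetical Python model of that caching (tuple-based expression trees are an illustrative stand-in, not Spark's Catalyst expressions):

```python
# Illustrative sketch (not Spark's implementation): interpret a predicate
# whose clauses share a common subexpression, memoizing repeated subtrees
# so the shared part is evaluated only once per row.
def evaluate(expr, row, cache=None):
    """Evaluate a tiny tuple-encoded expression tree against one row."""
    if cache is None:
        cache = {}
    if expr in cache:               # tuples are hashable: tree == cache key
        return cache[expr]
    op = expr[0]
    if op == "col":
        result = row[expr[1]]
    elif op == "lit":
        result = expr[1]
    elif op == "add":
        result = evaluate(expr[1], row, cache) + evaluate(expr[2], row, cache)
    elif op == "gt":
        result = evaluate(expr[1], row, cache) > evaluate(expr[2], row, cache)
    elif op == "lt":
        result = evaluate(expr[1], row, cache) < evaluate(expr[2], row, cache)
    elif op == "and":
        result = evaluate(expr[1], row, cache) and evaluate(expr[2], row, cache)
    else:
        raise ValueError(f"unknown op: {op}")
    cache[expr] = result
    return result

# (a + b) > 1 AND (a + b) < 10 -- the shared "a + b" is computed once per row.
shared = ("add", ("col", "a"), ("col", "b"))
pred = ("and", ("gt", shared, ("lit", 1)), ("lt", shared, ("lit", 10)))
```

The per-row `cache` is the whole trick: the second clause finds `shared` already evaluated and reuses the result.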
[jira] [Created] (SPARK-33539) Standardize exception messages in Spark
Allison Wang created SPARK-33539: Summary: Standardize exception messages in Spark Key: SPARK-33539 URL: https://issues.apache.org/jira/browse/SPARK-33539 Project: Spark Issue Type: Improvement Components: Spark Core, SQL Affects Versions: 3.1.0 Reporter: Allison Wang In the SPIP: Standardize Exception Messages in Spark, we have proposed three major tasks to standardize exception messages in Spark: # Group error messages in dedicated files. # Establish an error message guideline for developers. # Improve error message quality. The first step is to centralize error messages for each component into its own dedicated file(s). This change can help with auditing error messages and subsequent tasks to establish a guideline and improve message quality in the future. Here is an example PR to group all `AnalysisException` in Analyzer into QueryCompilationErrors: [https://github.com/apache/spark/pull/29497] Please see the SPIP: https://docs.google.com/document/d/1XGj1o3xAFh8BA7RCn3DtwIPC6--hIFOaNUNSlpaOIZs/edit?usp=sharing for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-33535) export LANG to en_US.UTF-8 in jenkins test script
[ https://issues.apache.org/jira/browse/SPARK-33535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-33535. --- Fix Version/s: 2.4.8 3.0.2 3.1.0 Resolution: Fixed Issue resolved by pull request 30487 [https://github.com/apache/spark/pull/30487] > export LANG to en_US.UTF-8 in jenkins test script > - > > Key: SPARK-33535 > URL: https://issues.apache.org/jira/browse/SPARK-33535 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.1.0, 3.0.2, 2.4.8 > > > > {code:java} > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1 > get binary type > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2 > get binary type > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3 > get binary type > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4 > get binary type > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5 > get binary type{code} > > failed Jenkins tests and passed GitHub Actions. 
The error message is as follows: > > > {code:java} > Error Messageorg.scalatest.exceptions.TestFailedException: "[?](" did not > equal "[�]("Stacktracesbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�](" > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > at > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302) > at > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26$adapted(SparkThriftServerProtocolVersionsSuite.scala:300) > at > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.testExecuteStatementWithProtocolVersion(SparkThriftServerProtocolVersionsSuite.scala:68) > at > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$24(SparkThriftServerProtocolVersionsSuite.scala:300) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > {code} > > It seems that the "LANG" of some build machines is not "en_US.UTF-8" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
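For context, the mismatch reported above ("[?](" versus "[�](") is the classic symptom of a non-UTF-8 locale: under a lossy encoding, the non-ASCII replacement character comes back as "?". A minimal Python illustration of the same effect, independent of Spark or Jenkins:

```python
# U+FFFD is the Unicode replacement character that appears in the failing
# assertion. Under a UTF-8 locale it round-trips; under an ASCII-only
# locale a lossy encode turns it into "?", which is what the test observed.
s = "\ufffd"
utf8_bytes = s.encode("utf-8")                       # clean 3-byte encoding
ascii_lossy = s.encode("ascii", errors="replace")    # degrades to b"?"
print(utf8_bytes)    # b'\xef\xbf\xbd'
print(ascii_lossy)   # b'?'
```

Exporting LANG=en_US.UTF-8 in the Jenkins script keeps the build machines on the first code path.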
[jira] [Assigned] (SPARK-33535) export LANG to en_US.UTF-8 in jenkins test script
[ https://issues.apache.org/jira/browse/SPARK-33535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-33535: - Assignee: Yang Jie > export LANG to en_US.UTF-8 in jenkins test script > - > > Key: SPARK-33535 > URL: https://issues.apache.org/jira/browse/SPARK-33535 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > > > {code:java} > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1 > get binary type > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2 > get binary type > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3 > get binary type > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4 > get binary type > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5 > get binary type{code} > > failed Jenkins tests and passed GitHub Actions. 
The error message is as follows: > > > {code:java} > Error Messageorg.scalatest.exceptions.TestFailedException: "[?](" did not > equal "[�]("Stacktracesbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�](" > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > at > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302) > at > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26$adapted(SparkThriftServerProtocolVersionsSuite.scala:300) > at > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.testExecuteStatementWithProtocolVersion(SparkThriftServerProtocolVersionsSuite.scala:68) > at > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$24(SparkThriftServerProtocolVersionsSuite.scala:300) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > {code} > > It seems that the "LANG" of some build machines is not "en_US.UTF-8" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33531) [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator
[ https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mori[A]rty updated SPARK-33531: --- Description: Use a new method, SparkPlan#executeTakeToIterator, to implement CollectLimitExec#executeToIterator and avoid the shuffle caused by invoking the parent method SparkPlan#executeToIterator. When running a SparkThriftServer with spark.sql.thriftServer.incrementalCollect enabled, the extra shuffle leads to a significant performance issue for SQL queries that end with LIMIT. was: CollectLimitExec#executeToIterator should be implemented using CollectLimitExec#executeCollect to avoid shuffle caused by invoking parent method SparkPlan#executeToIterator. When running a SparkThriftServer and spark.sql.thriftServer.incrementalCollect is enabled, this will lead to a significant performance issue for SQLs terminated with LIMIT. > [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator > --- > > Key: SPARK-33531 > URL: https://issues.apache.org/jira/browse/SPARK-33531 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.1 >Reporter: Mori[A]rty >Priority: Major > > Use a new method, SparkPlan#executeTakeToIterator, to implement > CollectLimitExec#executeToIterator and avoid the shuffle caused by invoking > the parent method SparkPlan#executeToIterator. > When running a SparkThriftServer with > spark.sql.thriftServer.incrementalCollect enabled, the extra shuffle leads > to a significant performance issue for SQL queries that end with LIMIT. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
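The proposed change can be modeled abstractly: with a LIMIT query, pulling rows through a take-style iterator consumes partitions in order and stops early, instead of shuffling all output to one place first. A hypothetical Python sketch of that behavior (the function name echoes the proposal; it is illustrative, not Spark's API):

```python
# Conceptual model of a take-based iterator for a LIMIT query: partitions
# are scanned lazily and in order, and iteration stops as soon as `limit`
# rows have been produced -- no global shuffle of the full result is needed.
def execute_take_to_iterator(partitions, limit):
    """Yield at most `limit` rows from an iterable of partitions."""
    remaining = limit
    for part in partitions:        # consume partitions one at a time
        for row in part:
            if remaining == 0:
                return             # stop early; later partitions never read
            yield row
            remaining -= 1

rows = list(execute_take_to_iterator([[1, 2], [3, 4], [5]], limit=3))
# -> [1, 2, 3]; the partition [5] is never touched
```

This mirrors why incremental collect in the Thrift server benefits: the consumer sees rows as soon as the first partitions produce them.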
[jira] [Assigned] (SPARK-33531) [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator
[ https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33531: Assignee: (was: Apache Spark) > [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator > --- > > Key: SPARK-33531 > URL: https://issues.apache.org/jira/browse/SPARK-33531 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.1 >Reporter: Mori[A]rty >Priority: Major > > CollectLimitExec#executeToIterator should be implemented using > CollectLimitExec#executeCollect to avoid shuffle caused by invoking parent > method SparkPlan#executeToIterator. > When running a SparkThriftServer and > spark.sql.thriftServer.incrementalCollect is enabled, this will lead to a > significant performance issue for SQLs terminated with LIMIT. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33531) [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator
[ https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238257#comment-17238257 ] Apache Spark commented on SPARK-33531: -- User 'hammertank' has created a pull request for this issue: https://github.com/apache/spark/pull/30489 > [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator > --- > > Key: SPARK-33531 > URL: https://issues.apache.org/jira/browse/SPARK-33531 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.1 >Reporter: Mori[A]rty >Priority: Major > > CollectLimitExec#executeToIterator should be implemented using > CollectLimitExec#executeCollect to avoid shuffle caused by invoking parent > method SparkPlan#executeToIterator. > When running a SparkThriftServer and > spark.sql.thriftServer.incrementalCollect is enabled, this will lead to a > significant performance issue for SQLs terminated with LIMIT. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33531) [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator
[ https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33531: Assignee: Apache Spark > [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator > --- > > Key: SPARK-33531 > URL: https://issues.apache.org/jira/browse/SPARK-33531 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.1 >Reporter: Mori[A]rty >Assignee: Apache Spark >Priority: Major > > CollectLimitExec#executeToIterator should be implemented using > CollectLimitExec#executeCollect to avoid shuffle caused by invoking parent > method SparkPlan#executeToIterator. > When running a SparkThriftServer and > spark.sql.thriftServer.incrementalCollect is enabled, this will lead to a > significant performance issue for SQLs terminated with LIMIT. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32792) Improve in filter pushdown for ParquetFilters
[ https://issues.apache.org/jira/browse/SPARK-32792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-32792: Parent: SPARK-25419 Issue Type: Sub-task (was: Improvement) > Improve in filter pushdown for ParquetFilters > - > > Key: SPARK-32792 > URL: https://issues.apache.org/jira/browse/SPARK-32792 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > Support pushing down a `GreaterThanOrEqual` predicate on the minimum value and a > `LessThanOrEqual` predicate on the maximum value when the number of IN values exceeds > `spark.sql.parquet.pushdown.inFilterThreshold`. For example: > ```sql > SELECT * FROM t WHERE id IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15) > ``` > We will push down `id >= 1 and id <= 15`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
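The proposal above amounts to a small rewrite rule: an IN list too large to push down verbatim degrades to a [min, max] range filter instead of no filter at all. A hypothetical Python model (the function name and tuple filter encoding are illustrative; only the threshold config name comes from the issue):

```python
# Sketch of the proposed fallback: small IN lists are pushed down as-is,
# while large ones are rewritten into a coarse but still useful range
# predicate [min, max] that Parquet row-group statistics can exploit.
IN_FILTER_THRESHOLD = 10  # stands in for spark.sql.parquet.pushdown.inFilterThreshold

def rewrite_in_filter(column, values, threshold=IN_FILTER_THRESHOLD):
    if len(values) <= threshold:
        return [(column, "in", sorted(values))]     # push the IN list itself
    return [(column, ">=", min(values)),            # otherwise: range bounds
            (column, "<=", max(values))]

filters = rewrite_in_filter("id", [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15])
# -> [("id", ">=", 1), ("id", "<=", 15)], matching the example in the issue
```

The range filter is weaker than the IN list (it admits 13 and 14 here), so Spark would still re-evaluate the original predicate on the rows that survive pruning.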
[jira] [Created] (SPARK-33538) Directly push IN predicates to the Hive Metastore
Yuming Wang created SPARK-33538: --- Summary: Directly push IN predicates to the Hive Metastore Key: SPARK-33538 URL: https://issues.apache.org/jira/browse/SPARK-33538 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang Hive 2.0 supports directly pushing IN predicates to the Hive Metastore. Please see https://issues.apache.org/jira/browse/HIVE-11726 for more details. We should use this API to improve performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-33477) Hive partition pruning support date type
[ https://issues.apache.org/jira/browse/SPARK-33477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33477: Parent: SPARK-33537 Issue Type: Sub-task (was: Improvement) > Hive partition pruning support date type > - > > Key: SPARK-33477 > URL: https://issues.apache.org/jira/browse/SPARK-33477 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Hive partition pruning can support date type: > https://issues.apache.org/jira/browse/HIVE-5679 > https://github.com/apache/hive/commit/5106bf1c8671740099fca8e1a7d4b37afe97137f -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-27421) RuntimeException when querying a view on a partitioned parquet table
[ https://issues.apache.org/jira/browse/SPARK-27421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-27421: Parent: SPARK-33537 Issue Type: Sub-task (was: Bug) > RuntimeException when querying a view on a partitioned parquet table > > > Key: SPARK-27421 > URL: https://issues.apache.org/jira/browse/SPARK-27421 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0, 2.4.1 > Environment: Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit > Server VM, Java 1.8.0_141) >Reporter: Eric Maynard >Assignee: Yuming Wang >Priority: Minor > Fix For: 2.4.8, 3.0.2, 3.1.0 > > > When running a simple query, I get the following stacktrace: > {code} > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. You can set the Spark configuration > setting spark.sql.hive.manageFilesourcePartitions to false to work around > this problem, however this will result in degraded performance. 
Please report > a bug: https://issues.apache.org/jira/browse/SPARK > at > org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:772) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:686) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283) > at > org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:221) > at > org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:220) > at > org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:266) > at > org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:684) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1268) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1261) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) > at > org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1261) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:262) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:957) > at > org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:256) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:255) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:27) > at > org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$.apply(PruneFileSourcePartitions.scala:26) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) > at > scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.WrappedArray.foldLeft(WrappedArray.scala:35) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) > at
[jira] [Updated] (SPARK-33458) Hive partition pruning support Contains, StartsWith and EndsWith predicate
[ https://issues.apache.org/jira/browse/SPARK-33458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-33458: Parent: SPARK-33537 Issue Type: Sub-task (was: Improvement) > Hive partition pruning support Contains, StartsWith and EndsWith predicate > -- > > Key: SPARK-33458 > URL: https://issues.apache.org/jira/browse/SPARK-33458 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > Fix For: 3.1.0 > > > Hive partition pruning can support Contains, StartsWith and EndsWith > predicate: > https://github.com/apache/hive/blob/0c2c8a7f57330880f156466526bc0fdc94681035/metastore/src/test/org/apache/hadoop/hive/metastore/TestHiveMetaStore.java#L1074-L1075 > https://github.com/apache/hive/commit/0c2c8a7f57330880f156466526bc0fdc94681035#diff-b1200d4259fafd48d7bbd0050e89772218813178f68461a2e82551c52319b282 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33537) Hive Metastore filter pushdown improvement
Yuming Wang created SPARK-33537: --- Summary: Hive Metastore filter pushdown improvement Key: SPARK-33537 URL: https://issues.apache.org/jira/browse/SPARK-33537 Project: Spark Issue Type: Umbrella Components: SQL Affects Versions: 3.1.0 Reporter: Yuming Wang Assignee: Yuming Wang This is an umbrella ticket to track Hive Metastore filter pushdown improvements. It includes: 1. Date type pushdown 2. Like pushdown 3. InSet pushdown improvements and other fixes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
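As background for this umbrella: pushing a partition filter to the Hive Metastore means rendering it as a filter string the metastore's partition-by-filter API accepts; predicate shapes the metastore cannot express are dropped and evaluated client-side instead. The sketch below is a hypothetical Python model of such a converter (names and type coverage are illustrative, not Spark's actual implementation):

```python
# Hypothetical renderer of simple (column, op, value) predicates into a
# metastore-style filter string. Values of types the target cannot express
# are skipped, leaving those partitions to be pruned on the client side --
# which is exactly the gap the date/LIKE/InSet sub-tasks aim to close.
def to_metastore_filter(preds):
    parts = []
    for col, op, value in preds:
        if isinstance(value, str):
            parts.append(f'{col} {op} "{value}"')   # strings are quoted
        elif isinstance(value, int):
            parts.append(f"{col} {op} {value}")     # integrals as literals
        else:
            continue  # unsupported type: omit, fall back to client pruning
    return " and ".join(parts)

print(to_metastore_filter([("dt", "=", "2020-01-01"), ("hour", ">=", 12)]))
# dt = "2020-01-01" and hour >= 12
```

Each supported predicate shape the converter learns (dates, LIKE prefixes, IN sets) moves pruning work from the Spark driver into the metastore's backing database.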
[jira] [Updated] (SPARK-33531) [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator
[ https://issues.apache.org/jira/browse/SPARK-33531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mori[A]rty updated SPARK-33531: --- Description: CollectLimitExec#executeToIterator should be implemented using CollectLimitExec#executeCollect to avoid shuffle caused by invoking parent method SparkPlan#executeToIterator. When running a SparkThriftServer and spark.sql.thriftServer.incrementalCollect is enabled, this will lead to a significant performance issue for SQLs terminated with LIMIT. was:CollectLimitExec#executeToIterator should be implemented using CollectLimitExec#executeCollect to avoid shuffle caused by invoking parent method SparkPlan#executeToIterator. > [SQL] Avoid shuffle when calling CollectLimitExec#executeToIterator > --- > > Key: SPARK-33531 > URL: https://issues.apache.org/jira/browse/SPARK-33531 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 3.0.1 >Reporter: Mori[A]rty >Priority: Major > > CollectLimitExec#executeToIterator should be implemented using > CollectLimitExec#executeCollect to avoid shuffle caused by invoking parent > method SparkPlan#executeToIterator. > When running a SparkThriftServer and > spark.sql.thriftServer.incrementalCollect is enabled, this will lead to a > significant performance issue for SQLs terminated with LIMIT. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33536) Incorrect join results when joining twice with the same DF
[ https://issues.apache.org/jira/browse/SPARK-33536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238174#comment-17238174 ] Apache Spark commented on SPARK-33536: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/30488 > Incorrect join results when joining twice with the same DF > -- > > Key: SPARK-33536 > URL: https://issues.apache.org/jira/browse/SPARK-33536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: wuyi >Priority: Major > > {code:java} > val emp1 = Seq[TestData]( > TestData(1, "sales"), > TestData(2, "personnel"), > TestData(3, "develop"), > TestData(4, "IT")).toDS() > val emp2 = Seq[TestData]( > TestData(1, "sales"), > TestData(2, "personnel"), > TestData(3, "develop")).toDS() > val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*")) > emp1.join(emp3, emp1.col("key") === emp3.col("key"), > "left_outer").select(emp1.col("*"), emp3.col("key").as("e2")).show() > // wrong result > +---+-+---+ > |key|value| e2| > +---+-+---+ > | 1|sales| 1| > | 2|personnel| 2| > | 3| develop| 3| > | 4| IT| 4| > +---+-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33536) Incorrect join results when joining twice with the same DF
[ https://issues.apache.org/jira/browse/SPARK-33536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33536: Assignee: (was: Apache Spark) > Incorrect join results when joining twice with the same DF > -- > > Key: SPARK-33536 > URL: https://issues.apache.org/jira/browse/SPARK-33536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: wuyi >Priority: Major > > {code:java} > val emp1 = Seq[TestData]( > TestData(1, "sales"), > TestData(2, "personnel"), > TestData(3, "develop"), > TestData(4, "IT")).toDS() > val emp2 = Seq[TestData]( > TestData(1, "sales"), > TestData(2, "personnel"), > TestData(3, "develop")).toDS() > val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*")) > emp1.join(emp3, emp1.col("key") === emp3.col("key"), > "left_outer").select(emp1.col("*"), emp3.col("key").as("e2")).show() > // wrong result > +---+-+---+ > |key|value| e2| > +---+-+---+ > | 1|sales| 1| > | 2|personnel| 2| > | 3| develop| 3| > | 4| IT| 4| > +---+-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33536) Incorrect join results when joining twice with the same DF
[ https://issues.apache.org/jira/browse/SPARK-33536?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33536: Assignee: Apache Spark > Incorrect join results when joining twice with the same DF > -- > > Key: SPARK-33536 > URL: https://issues.apache.org/jira/browse/SPARK-33536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > {code:java} > val emp1 = Seq[TestData]( > TestData(1, "sales"), > TestData(2, "personnel"), > TestData(3, "develop"), > TestData(4, "IT")).toDS() > val emp2 = Seq[TestData]( > TestData(1, "sales"), > TestData(2, "personnel"), > TestData(3, "develop")).toDS() > val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*")) > emp1.join(emp3, emp1.col("key") === emp3.col("key"), > "left_outer").select(emp1.col("*"), emp3.col("key").as("e2")).show() > // wrong result > +---+-+---+ > |key|value| e2| > +---+-+---+ > | 1|sales| 1| > | 2|personnel| 2| > | 3| develop| 3| > | 4| IT| 4| > +---+-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33536) Incorrect join results when joining twice with the same DF
[ https://issues.apache.org/jira/browse/SPARK-33536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238173#comment-17238173 ] Apache Spark commented on SPARK-33536: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/30488 > Incorrect join results when joining twice with the same DF > -- > > Key: SPARK-33536 > URL: https://issues.apache.org/jira/browse/SPARK-33536 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.0.1, 3.1.0 >Reporter: wuyi >Priority: Major > > {code:java} > val emp1 = Seq[TestData]( > TestData(1, "sales"), > TestData(2, "personnel"), > TestData(3, "develop"), > TestData(4, "IT")).toDS() > val emp2 = Seq[TestData]( > TestData(1, "sales"), > TestData(2, "personnel"), > TestData(3, "develop")).toDS() > val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*")) > emp1.join(emp3, emp1.col("key") === emp3.col("key"), > "left_outer").select(emp1.col("*"), emp3.col("key").as("e2")).show() > // wrong result > +---+-+---+ > |key|value| e2| > +---+-+---+ > | 1|sales| 1| > | 2|personnel| 2| > | 3| develop| 3| > | 4| IT| 4| > +---+-+---+ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33535) export LANG to en_US.UTF-8 in jenkins test script
[ https://issues.apache.org/jira/browse/SPARK-33535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238170#comment-17238170 ] Apache Spark commented on SPARK-33535: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/30487 > export LANG to en_US.UTF-8 in jenkins test script > - > > Key: SPARK-33535 > URL: https://issues.apache.org/jira/browse/SPARK-33535 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Yang Jie >Priority: Major > > > {code:java} > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V1 > get binary type > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V2 > get binary type > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V3 > get binary type > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V4 > get binary type > > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.HIVE_CLI_SERVICE_PROTOCOL_V5 > get binary type{code} > > failed Jenkins tests and passed GitHub Actions. 
The error message is as follows: > > > {code:java} > Error Message: org.scalatest.exceptions.TestFailedException: "[?](" did not > equal "[�](" > Stacktrace: sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedException: "[?](" did not equal "[�](" > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) > at > org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) > at > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26(SparkThriftServerProtocolVersionsSuite.scala:302) > at > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$26$adapted(SparkThriftServerProtocolVersionsSuite.scala:300) > at > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.testExecuteStatementWithProtocolVersion(SparkThriftServerProtocolVersionsSuite.scala:68) > at > org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$24(SparkThriftServerProtocolVersionsSuite.scala:300) > at > scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) > {code} > > It seems that "LANG" on some build machines is not set to "en_US.UTF-8".
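The mismatch between "?" and "�" is a typical locale-encoding symptom. A minimal sketch of the kind of fix the title describes (assuming the build machine has the en_US.UTF-8 locale generated; the actual Jenkins script change is in the linked PR):

```shell
# Pin a UTF-8 locale before launching the test suite so non-ASCII
# bytes round-trip instead of degrading to "?".
export LANG=en_US.UTF-8
export LC_ALL=en_US.UTF-8
# Sanity check: print the effective values.
echo "LANG=$LANG"
echo "LC_ALL=$LC_ALL"
```

Note that LC_ALL, when set, overrides LANG and every LC_* category, so exporting both makes the setting robust against per-category overrides on the build machine.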
[jira] [Assigned] (SPARK-33535) export LANG to en_US.UTF-8 in jenkins test script
[ https://issues.apache.org/jira/browse/SPARK-33535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33535: Assignee: (was: Apache Spark)
[jira] [Assigned] (SPARK-33535) export LANG to en_US.UTF-8 in jenkins test script
[ https://issues.apache.org/jira/browse/SPARK-33535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33535: Assignee: Apache Spark
[jira] [Commented] (SPARK-33535) export LANG to en_US.UTF-8 in jenkins test script
[ https://issues.apache.org/jira/browse/SPARK-33535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238169#comment-17238169 ] Apache Spark commented on SPARK-33535: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/30487
[jira] [Commented] (SPARK-33530) Support --archives option natively
[ https://issues.apache.org/jira/browse/SPARK-33530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238165#comment-17238165 ] Apache Spark commented on SPARK-33530: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30486 > Support --archives option natively > -- > > Key: SPARK-33530 > URL: https://issues.apache.org/jira/browse/SPARK-33530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, {{spark-submit --archives}} and the {{spark.yarn.dist.archives}} > configuration are only supported in YARN mode: > {code} > spark-submit --help > ... > Spark on YARN only: > --queue QUEUE_NAME The YARN queue to submit to (Default: > "default"). > --archives ARCHIVES Comma separated list of archives to be > extracted into the > working directory of each executor. > {code} > This is critical for PySpark, which needs to ship other packages > together with the application; see also > https://hyukjin-spark.readthedocs.io/en/stable/user_guide/python_packaging.html#using-zipped-virtual-environment. > Because this feature is missing, PySpark cannot use a conda environment to ship other > packages in non-YARN modes.
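To illustrate what the --archives option ships, here is a hedged sketch of the packaging step (hypothetical file and directory names; the spark-submit invocation is shown only as a comment because it assumes a YARN cluster):

```shell
# Build a tarball of a local environment directory. With --archives,
# each executor extracts it into its working directory under the
# alias given after '#'.
mkdir -p myenv/bin
printf 'echo hello\n' > myenv/bin/run.sh
tar -czf environment.tar.gz myenv
# Hypothetical submit command (requires a YARN cluster):
#   spark-submit --master yarn --archives environment.tar.gz#environment app.py
tar -tzf environment.tar.gz
```

The '#alias' suffix controls the extraction directory name on the executor side, which is what lets a packed conda environment be activated at a predictable path.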
[jira] [Commented] (SPARK-33530) Support --archives option natively
[ https://issues.apache.org/jira/browse/SPARK-33530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238167#comment-17238167 ] Apache Spark commented on SPARK-33530: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30486
[jira] [Assigned] (SPARK-33530) Support --archives option natively
[ https://issues.apache.org/jira/browse/SPARK-33530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33530: Assignee: Apache Spark
[jira] [Assigned] (SPARK-33530) Support --archives option natively
[ https://issues.apache.org/jira/browse/SPARK-33530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33530: Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-33530) Support --archives option natively
[ https://issues.apache.org/jira/browse/SPARK-33530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17238166#comment-17238166 ] Apache Spark commented on SPARK-33530: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/30486
[jira] [Created] (SPARK-33536) Incorrect join results when joining twice with the same DF
wuyi created SPARK-33536: Summary: Incorrect join results when joining twice with the same DF Key: SPARK-33536 URL: https://issues.apache.org/jira/browse/SPARK-33536 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.0.0, 3.1.0 Reporter: wuyi