[jira] [Updated] (SPARK-30886) Warn two-parameter TRIM/LTRIM/RTRIM functions
[ https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30886: -- Issue Type: Task (was: Bug) > Warn two-parameter TRIM/LTRIM/RTRIM functions > - > > Key: SPARK-30886 > URL: https://issues.apache.org/jira/browse/SPARK-30886 > Project: Spark > Issue Type: Task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > The Apache Spark community decided to keep the existing esoteric two-parameter > use cases with a proper warning. This JIRA aims to show a warning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
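The danger with the two-parameter form is that the second argument is a *set* of characters to trim, which users routinely misread as a substring. As an analogy only (this is Python's `str.strip`, not Spark's TRIM implementation), the same character-set semantics produce surprising results:

```python
# Analogy only: Python's str.strip(chars), like the optional trim argument of
# SQL TRIM, takes a SET of characters rather than a substring. This is the
# kind of esoteric behavior the JIRA proposes warning about.

def demo():
    # Behaves as most users expect: strip leading/trailing 'x' characters.
    assert "xxabcxx".strip("x") == "abc"
    # Surprise: ".txt" is the set {'.', 't', 'x'}, not the suffix ".txt",
    # so every trailing character in that set is eaten, leaving "commen".
    assert "comment.txt".strip(".txt") == "commen"

demo()
```

A warning on the analogous two-parameter TRIM/LTRIM/RTRIM calls nudges users toward the unambiguous `TRIM(trimStr FROM srcStr)` syntax.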
[jira] [Created] (SPARK-30891) Arrange version info of history
jiaan.geng created SPARK-30891: -- Summary: Arrange version info of history Key: SPARK-30891 URL: https://issues.apache.org/jira/browse/SPARK-30891 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0 Reporter: jiaan.geng core/src/main/scala/org/apache/spark/internal/config/History.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30890) Arrange version info of history
jiaan.geng created SPARK-30890: -- Summary: Arrange version info of history Key: SPARK-30890 URL: https://issues.apache.org/jira/browse/SPARK-30890 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: jiaan.geng core/src/main/scala/org/apache/spark/internal/config/History.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30886) Warn two-parameter TRIM/LTRIM/RTRIM functions
[ https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040729#comment-17040729 ] Dongjoon Hyun commented on SPARK-30886: --- Thanks to SPARK-28126, we can give a directional warning for the dangerous cases. > Warn two-parameter TRIM/LTRIM/RTRIM functions > - > > Key: SPARK-30886 > URL: https://issues.apache.org/jira/browse/SPARK-30886 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > The Apache Spark community decided to keep the existing esoteric two-parameter > use cases with a proper warning. This JIRA aims to show a warning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30889) Arrange version info of worker
jiaan.geng created SPARK-30889: -- Summary: Arrange version info of worker Key: SPARK-30889 URL: https://issues.apache.org/jira/browse/SPARK-30889 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: jiaan.geng core/src/main/scala/org/apache/spark/internal/config/Worker.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30888) Arrange version info of network
[ https://issues.apache.org/jira/browse/SPARK-30888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-30888: --- Description: spark/core/src/main/scala/org/apache/spark/internal/config/Network.scala > Arrange version info of network > --- > > Key: SPARK-30888 > URL: https://issues.apache.org/jira/browse/SPARK-30888 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > spark/core/src/main/scala/org/apache/spark/internal/config/Network.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30886) Warn two-parameter TRIM/LTRIM/RTRIM functions
[ https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30886: -- Affects Version/s: (was: 2.4.5) (was: 2.3.4) > Warn two-parameter TRIM/LTRIM/RTRIM functions > - > > Key: SPARK-30886 > URL: https://issues.apache.org/jira/browse/SPARK-30886 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > The Apache Spark community decided to keep the existing esoteric two-parameter > use cases with a proper warning. This JIRA aims to show a warning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30888) Arrange version info of network
jiaan.geng created SPARK-30888: -- Summary: Arrange version info of network Key: SPARK-30888 URL: https://issues.apache.org/jira/browse/SPARK-30888 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.1.0 Reporter: jiaan.geng -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30887) Arrange version info of deploy
jiaan.geng created SPARK-30887: -- Summary: Arrange version info of deploy Key: SPARK-30887 URL: https://issues.apache.org/jira/browse/SPARK-30887 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.0, 3.1.0 Reporter: jiaan.geng core/src/main/scala/org/apache/spark/internal/config/Deploy.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30886) Warn two-parameter TRIM/LTRIM/RTRIM functions
Dongjoon Hyun created SPARK-30886: - Summary: Warn two-parameter TRIM/LTRIM/RTRIM functions Key: SPARK-30886 URL: https://issues.apache.org/jira/browse/SPARK-30886 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5, 2.3.4, 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30886) Warn two-parameter TRIM/LTRIM/RTRIM functions
[ https://issues.apache.org/jira/browse/SPARK-30886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30886: -- Description: The Apache Spark community decided to keep the existing esoteric two-parameter use cases with a proper warning. This JIRA aims to show a warning. > Warn two-parameter TRIM/LTRIM/RTRIM functions > - > > Key: SPARK-30886 > URL: https://issues.apache.org/jira/browse/SPARK-30886 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > The Apache Spark community decided to keep the existing esoteric two-parameter > use cases with a proper warning. This JIRA aims to show a warning. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30885) V1 table name should be fully qualified if catalog name is provided
[ https://issues.apache.org/jira/browse/SPARK-30885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Terry Kim updated SPARK-30885: -- Summary: V1 table name should be fully qualified if catalog name is provided (was: V1 table name should be fully qualified) > V1 table name should be fully qualified if catalog name is provided > --- > > Key: SPARK-30885 > URL: https://issues.apache.org/jira/browse/SPARK-30885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Terry Kim >Priority: Major > > For the following example, > {code:java} > sql("CREATE TABLE t USING json AS SELECT 1 AS i") > sql("SELECT * FROM spark_catalog.t") > {code} > `spark_catalog.t` is expanded to `spark_catalog.default.t` assuming that the > current namespace is `default`. However, this is not consistent with V2 > behavior where namespace should be provided if the catalog name is also > provided. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30885) V1 table name should be fully qualified
[ https://issues.apache.org/jira/browse/SPARK-30885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Terry Kim updated SPARK-30885: -- Description: For the following example, {code:java} sql("CREATE TABLE t USING json AS SELECT 1 AS i") sql("SELECT * FROM spark_catalog.t") {code} `spark_catalog.t` is expanded to `spark_catalog.default.t` assuming that the current namespace is `default`. However, this is not consistent with V2 behavior where namespace should be provided if the catalog name is also provided. was: For the following example, {code:java} sql("CREATE TABLE t USING json AS SELECT 1 AS i") sql("SELECT * FROM spark_catalog.t") {code} works, and `spark_catalog.t` is expanded to `spark_catalog.default.t` assuming the current namespace is set to `default`. However, this is not consistent with V2 behavior where namespace should be provided if the catalog name is also provided. > V1 table name should be fully qualified > --- > > Key: SPARK-30885 > URL: https://issues.apache.org/jira/browse/SPARK-30885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Terry Kim >Priority: Major > > For the following example, > {code:java} > sql("CREATE TABLE t USING json AS SELECT 1 AS i") > sql("SELECT * FROM spark_catalog.t") > {code} > `spark_catalog.t` is expanded to `spark_catalog.default.t` assuming that the > current namespace is `default`. However, this is not consistent with V2 > behavior where namespace should be provided if the catalog name is also > provided. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30885) V1 table name should be fully qualified
Terry Kim created SPARK-30885: - Summary: V1 table name should be fully qualified Key: SPARK-30885 URL: https://issues.apache.org/jira/browse/SPARK-30885 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Terry Kim For the following example, {code:java} sql("CREATE TABLE t USING json AS SELECT 1 AS i") sql("SELECT * FROM spark_catalog.t") {code} works, and `spark_catalog.t` is expanded to `spark_catalog.default.t` assuming the current namespace is set to `default`. However, this is not consistent with V2 behavior where namespace should be provided if the catalog name is also provided. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
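The inconsistency above can be sketched as a resolution rule. The helper below is hypothetical (not Spark's actual analyzer code): it shows the proposed V2-style behavior of rejecting a catalog-qualified name that omits the namespace, instead of silently expanding `spark_catalog.t` to `spark_catalog.default.t`.

```python
# Hypothetical sketch of the proposed resolution rule (not Spark's resolver):
# if a catalog name is provided, the namespace must be provided as well.

def resolve_v1_name(parts, current_namespace="default",
                    catalogs=("spark_catalog",)):
    """Return a fully qualified (catalog, namespace, table) triple."""
    if parts[0] in catalogs:
        if len(parts) < 3:
            # Old V1 behavior silently expanded spark_catalog.t to
            # spark_catalog.default.t; the proposal is to reject instead.
            raise ValueError(f"namespace required: {'.'.join(parts)}")
        return tuple(parts[:3])
    # No catalog given: fall back to the current namespace.
    if len(parts) == 1:
        return ("spark_catalog", current_namespace, parts[0])
    return ("spark_catalog", parts[0], parts[1])

# spark_catalog.default.t resolves; bare t uses the current namespace.
assert resolve_v1_name(["spark_catalog", "default", "t"]) == \
    ("spark_catalog", "default", "t")
assert resolve_v1_name(["t"]) == ("spark_catalog", "default", "t")
```

Under this rule, `SELECT * FROM spark_catalog.t` fails fast rather than depending on the session's current namespace.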
[jira] [Assigned] (SPARK-30884) Upgrade to Py4J 0.10.9
[ https://issues.apache.org/jira/browse/SPARK-30884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-30884: - Assignee: Dongjoon Hyun > Upgrade to Py4J 0.10.9 > -- > > Key: SPARK-30884 > URL: https://issues.apache.org/jira/browse/SPARK-30884 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > > This issue aims to upgrade Py4J from 0.10.8.1 to 0.10.9. > Py4J 0.10.9 is released with the following fixes. > - https://www.py4j.org/changelog.html#py4j-0-10-9 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30884) Upgrade to Py4J 0.10.9
[ https://issues.apache.org/jira/browse/SPARK-30884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30884: -- Affects Version/s: (was: 2.4.5) 3.0.0 > Upgrade to Py4J 0.10.9 > -- > > Key: SPARK-30884 > URL: https://issues.apache.org/jira/browse/SPARK-30884 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to upgrade Py4J from 0.10.8.1 to 0.10.9. > Py4J 0.10.9 is released with the following fixes. > - https://www.py4j.org/changelog.html#py4j-0-10-9 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30884) Upgrade to Py4J 0.10.9
[ https://issues.apache.org/jira/browse/SPARK-30884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30884: -- Affects Version/s: (was: 3.1.0) 2.4.5 > Upgrade to Py4J 0.10.9 > -- > > Key: SPARK-30884 > URL: https://issues.apache.org/jira/browse/SPARK-30884 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.5 >Reporter: Dongjoon Hyun >Priority: Major > > This issue aims to upgrade Py4J from 0.10.8.1 to 0.10.9. > Py4J 0.10.9 is released with the following fixes. > - https://www.py4j.org/changelog.html#py4j-0-10-9 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30884) Upgrade to Py4J 0.10.9
Dongjoon Hyun created SPARK-30884: - Summary: Upgrade to Py4J 0.10.9 Key: SPARK-30884 URL: https://issues.apache.org/jira/browse/SPARK-30884 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.1.0 Reporter: Dongjoon Hyun This issue aims to upgrade Py4J from 0.10.8.1 to 0.10.9. Py4J 0.10.9 is released with the following fixes. - https://www.py4j.org/changelog.html#py4j-0-10-9 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30876) Optimizer cannot infer from inferred constraints with join
[ https://issues.apache.org/jira/browse/SPARK-30876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-30876: Description: How to reproduce this issue: {code:sql} create table t1(a int, b int, c int); create table t2(a int, b int, c int); create table t3(a int, b int, c int); select count(*) from t1 join t2 join t3 on (t1.a = t2.b and t2.b = t3.c and t3.c = 1); {code} Spark 2.3+: {noformat} == Physical Plan == *(4) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition, true, [id=#102] +- *(3) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(3) Project +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight :- *(3) Project [b#10] : +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight : :- *(3) Project [a#6] : : +- *(3) Filter isnotnull(a#6) : : +- *(3) ColumnarToRow : :+- FileScan parquet default.t1[a#6] Batched: true, DataFilters: [isnotnull(a#6)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#87] :+- *(1) Project [b#10] : +- *(1) Filter (isnotnull(b#10) AND (b#10 = 1)) : +- *(1) ColumnarToRow : +- FileScan parquet default.t2[b#10] Batched: true, DataFilters: [isnotnull(b#10), (b#10 = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(b), EqualTo(b,1)], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#96] +- *(2) Project [c#14] +- *(2) Filter (isnotnull(c#14) AND (c#14 = 1)) +- *(2) ColumnarToRow +- FileScan parquet default.t3[c#14] Batched: true, DataFilters: [isnotnull(c#14), (c#14 = 1)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c), EqualTo(c,1)], ReadSchema: struct Time taken: 3.785 seconds, Fetched 1 row(s) {noformat} Spark 2.2.x: {noformat} == Physical Plan == *HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition +- *HashAggregate(keys=[], functions=[partial_count(1)]) +- *Project +- *SortMergeJoin [b#19], [c#23], Inner :- *Project [b#19] : +- *SortMergeJoin [a#15], [b#19], Inner : :- *Sort [a#15 ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(a#15, 200) : : +- *Filter (isnotnull(a#15) && (a#15 = 1)) : :+- HiveTableScan [a#15], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#15, b#16, c#17] : +- *Sort [b#19 ASC NULLS FIRST], false, 0 :+- Exchange hashpartitioning(b#19, 200) : +- *Filter (isnotnull(b#19) && (b#19 = 1)) : +- HiveTableScan [b#19], HiveTableRelation `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#18, b#19, c#20] +- *Sort [c#23 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(c#23, 200) +- *Filter (isnotnull(c#23) && (c#23 = 1)) +- HiveTableScan [c#23], HiveTableRelation `default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#21, b#22, c#23] Time taken: 0.728 seconds, Fetched 1 row(s) {noformat} Spark 2.2 can infer {{(a#15 = 1)}}, but Spark 2.3+ can't. 
was: How to reproduce this issue: {code:sql} create table t1(a int, b int, c int); create table t2(a int, b int, c int); create table t3(a int, b int, c int); select count(*) from t1 join t2 join t3 on (t1.a = t2.b and t2.b = t3.c and t3.c = 1) {code} Spark 2.3+: {noformat} == Physical Plan == *(4) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition, true, [id=#102] +- *(3) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(3) Project +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight :- *(3) Project [b#10] : +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight : :- *(3) Project [a#6] : : +- *(3) Filter isnotnull(a#6) : : +- *(3) ColumnarToRow : :+- FileScan parquet default.t1[a#6] Batched: true, DataFilters: [isnotnull(a#6)], Format: Parquet, Location:
[jira] [Updated] (SPARK-30876) Optimizer cannot infer from inferred constraints with join
[ https://issues.apache.org/jira/browse/SPARK-30876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-30876: Description: How to reproduce this issue: {code:sql} create table t1(a int, b int, c int); create table t2(a int, b int, c int); create table t3(a int, b int, c int); select count(*) from t1 join t2 join t3 on (t1.a = t2.b and t2.b = t3.c and t3.c = 1) {code} Spark 2.3+: {noformat} == Physical Plan == *(4) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition, true, [id=#102] +- *(3) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(3) Project +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight :- *(3) Project [b#10] : +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight : :- *(3) Project [a#6] : : +- *(3) Filter isnotnull(a#6) : : +- *(3) ColumnarToRow : :+- FileScan parquet default.t1[a#6] Batched: true, DataFilters: [isnotnull(a#6)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#87] :+- *(1) Project [b#10] : +- *(1) Filter (isnotnull(b#10) AND (b#10 = 1)) : +- *(1) ColumnarToRow : +- FileScan parquet default.t2[b#10] Batched: true, DataFilters: [isnotnull(b#10), (b#10 = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(b), EqualTo(b,1)], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#96] +- *(2) Project [c#14] +- *(2) Filter (isnotnull(c#14) AND (c#14 = 1)) +- *(2) ColumnarToRow +- FileScan parquet default.t3[c#14] Batched: true, DataFilters: [isnotnull(c#14), (c#14 = 1)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c), EqualTo(c,1)], ReadSchema: struct Time taken: 3.785 seconds, Fetched 1 row(s) {noformat} Spark 2.2.x: {noformat} == Physical Plan == *HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition +- *HashAggregate(keys=[], functions=[partial_count(1)]) +- *Project +- *SortMergeJoin [b#19], [c#23], Inner :- *Project [b#19] : +- *SortMergeJoin [a#15], [b#19], Inner : :- *Sort [a#15 ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(a#15, 200) : : +- *Filter (isnotnull(a#15) && (a#15 = 1)) : :+- HiveTableScan [a#15], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#15, b#16, c#17] : +- *Sort [b#19 ASC NULLS FIRST], false, 0 :+- Exchange hashpartitioning(b#19, 200) : +- *Filter (isnotnull(b#19) && (b#19 = 1)) : +- HiveTableScan [b#19], HiveTableRelation `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#18, b#19, c#20] +- *Sort [c#23 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(c#23, 200) +- *Filter (isnotnull(c#23) && (c#23 = 1)) +- HiveTableScan [c#23], HiveTableRelation `default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#21, b#22, c#23] Time taken: 0.728 seconds, Fetched 1 row(s) {noformat} Spark 2.2 can infer {{(a#15 = 1)}}, but Spark 2.3+ can't. 
was: How to reproduce this issue: {code:sql} create table t1(a int, b int, c int); create table t2(a int, b int, c int); create table t3(a int, b int, c int); {code} Spark 2.3+: {noformat} == Physical Plan == *(4) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition, true, [id=#102] +- *(3) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(3) Project +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight :- *(3) Project [b#10] : +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight : :- *(3) Project [a#6] : : +- *(3) Filter isnotnull(a#6) : : +- *(3) ColumnarToRow : :+- FileScan parquet default.t1[a#6] Batched: true, DataFilters: [isnotnull(a#6)], Format: Parquet, Location:
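The missing inference is transitive: from t1.a = t2.b, t2.b = t3.c, and t3.c = 1, Spark 2.2 derived a#15 = 1 while Spark 2.3+ does not. A minimal union-find closure (illustrative only, not Catalyst's constraint-propagation code) shows the intended propagation:

```python
# Illustrative sketch (not Catalyst's implementation): propagate a literal
# predicate across an equality chain so every equivalent column inherits it,
# as Spark 2.2 did when it inferred (a#15 = 1).

def infer_literal_filters(equalities, literal_filters):
    """equalities: pairs of column names; literal_filters: {column: value}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    for a, b in equalities:
        union(a, b)

    # Every column in the same equivalence class inherits the literal filter.
    inferred = {}
    for col, value in literal_filters.items():
        root = find(col)
        for other in list(parent):
            if find(other) == root:
                inferred[other] = value
    return inferred

filters = infer_literal_filters(
    [("t1.a", "t2.b"), ("t2.b", "t3.c")], {"t3.c": 1})
assert filters == {"t1.a": 1, "t2.b": 1, "t3.c": 1}
```

With t1.a = 1 inferred, the scan of t1 could push down an EqualTo(a,1) filter instead of only IsNotNull(a), matching the 2.2 plans shown above.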
[jira] [Updated] (SPARK-30883) Tests that use setWritable, setReadable and setExecutable should be cancelled when the user is root
[ https://issues.apache.org/jira/browse/SPARK-30883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] deshanxiao updated SPARK-30883: --- Environment: The Java API *setWritable, setReadable and setExecutable* doesn't work well when the user is root. Maybe we could cancel these tests, or fail fast when the mvn test is starting. (was: The Java API *setWritable, setReadable and setExecutable* doesn't work when the user is root. Maybe we could cancel these tests, or fail fast when the mvn test is starting.) > Tests that use setWritable, setReadable and setExecutable should be cancelled > when the user is root > --- > > Key: SPARK-30883 > URL: https://issues.apache.org/jira/browse/SPARK-30883 > Project: Spark > Issue Type: Improvement > Components: Tests >Affects Versions: 3.0.0 > Environment: The Java API *setWritable, setReadable and setExecutable* > doesn't work well when the user is root. Maybe we could cancel these tests > or fail fast when the mvn test is starting. >Reporter: deshanxiao >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30883) Tests that use setWritable, setReadable and setExecutable should be cancelled when the user is root
deshanxiao created SPARK-30883: -- Summary: Tests that use setWritable, setReadable and setExecutable should be cancelled when the user is root Key: SPARK-30883 URL: https://issues.apache.org/jira/browse/SPARK-30883 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.0.0 Environment: The Java API *setWritable, setReadable and setExecutable* doesn't work when the user is root. Maybe we could cancel these tests, or fail fast when the mvn test is starting. Reporter: deshanxiao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
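The underlying reason is that POSIX permission bits do not restrict root, so `File.setWritable(false)`-style denial tests are meaningless for uid 0. One way the proposed "cancel" could look (a hypothetical guard helper, not Spark's test code) is a fast skip keyed on the effective uid:

```python
# Hypothetical guard (not Spark's actual test code): permission-denial tests
# only make sense when the OS actually enforces the permission bits, i.e.
# for non-root users. Root (euid 0) bypasses them, so skip up front.

def should_skip_permission_tests(euid):
    """Return True when permission-based tests cannot produce a denial."""
    return euid == 0

# In a test harness one might then write, for example:
#   import os, unittest
#   @unittest.skipIf(should_skip_permission_tests(os.geteuid()),
#                    "running as root: setWritable/setReadable are no-ops")

assert should_skip_permission_tests(0) is True
assert should_skip_permission_tests(1000) is False
```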
[jira] [Comment Edited] (SPARK-30873) Handling Node Decommissioning for Yarn cluster manager in Spark
[ https://issues.apache.org/jira/browse/SPARK-30873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039884#comment-17039884 ] Saurabh Chawla edited comment on SPARK-30873 at 2/20/20 3:43 AM: - We have raised the WIP PR for this. cc [~holden] [~itskals][~amargoor] was (Author: saurabhc100): We have raised the WIP PR for this. cc [~holdenkarau] [~itskals][~amargoor] > Handling Node Decommissioning for Yarn cluster manager in Spark > -- > > Key: SPARK-30873 > URL: https://issues.apache.org/jira/browse/SPARK-30873 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.0.0 >Reporter: Saurabh Chawla >Priority: Major > > In many public cloud environments, the node loss (in the case of AWS > Spot loss, Spot blocks and GCP preemptible VMs) is a planned and informed > activity. > The cloud provider intimates the cluster manager about the possible loss of a > node ahead of time. A few examples are listed here: > a) Spot loss in AWS (2 min before the event) > b) GCP Pre-emptible VM loss (30 seconds before the event) > c) AWS Spot block loss with info on termination time (generally a few tens of > minutes before decommission as configured in Yarn) > This JIRA tries to make Spark leverage the knowledge of the node loss in the > future, and tries to adjust the scheduling of tasks to minimise the impact on > the application. > It is well known that when a host is lost, the executors, its running tasks, > their caches and also shuffle data are lost. This could result in wastage of > compute and other resources. > The focus here is to build a framework for YARN, that can be extended for > other cluster managers, to handle such a scenario. > The framework must handle one or more of the following: > 1) Prevent new tasks from starting on any executors on decommissioning nodes. 
> 2) Decide to kill the running tasks so that they can be restarted elsewhere > (assuming they will not complete within the deadline) OR we can allow them to > continue hoping they will finish within the deadline. > 3) Clear the shuffle data entry from the MapOutputTracker for the > decommissioned node's hostname to prevent the shuffle FetchFailed exception. > The most significant advantage of unregistering shuffle outputs is that when > Spark schedules the first re-attempt to compute the missing blocks, it > notices all of the missing blocks from decommissioned nodes and recovers in > only one attempt. This speeds up the recovery process significantly over the > scheduled Spark implementation, where stages might be rescheduled multiple > times to recompute missing shuffles from all nodes, and prevents jobs from > being stuck for hours failing and recomputing. > 4) Prevent the stage from aborting due to the FetchFailed exception in the > case of decommissioning of a node. In Spark there is a number of consecutive > stage attempts allowed before a stage is aborted. This is controlled by the > config spark.stage.maxConsecutiveAttempts. Not accounting fetch failures due > to decommissioning of nodes towards stage failure improves the reliability of > the system. > Main components of the change: > 1) Get the ClusterInfo update from the Resource Manager -> Application Master > -> Spark Driver. > 2) DecommissionTracker, residing inside the driver, tracks all the > decommissioned nodes and takes the necessary actions and state transitions. 
> 3) Based on the decommission node list, add hooks in the code to achieve: > a) No new task on the executor > b) Remove shuffle data mapping info for the node to be decommissioned from > the mapOutputTracker > c) Do not count fetchFailure from decommissioned nodes towards stage failure > On receiving info that a node is to be decommissioned, the below actions need > to be performed by DecommissionTracker on the driver: > * Add an entry for the node in DecommissionTracker with the termination time > and node state as "DECOMMISSIONING". > * Stop assigning any new tasks on executors on the nodes which are candidates > for decommission. This makes sure that, slowly, as the tasks finish, the > usage of this node dies down. > * Kill all the executors for the decommissioning nodes after a configurable > period of time, say "spark.graceful.decommission.executor.leasetimePct". This > killing ensures two things. Firstly, the task failure will be attributed in > the job failure count. Secondly, it avoids generating more shuffle data on > the node that will eventually be lost. The node state is set to > "EXECUTOR_DECOMMISSIONED". > * Mark shuffle data on the node as unavailable after > "spark.qubole.graceful.decommission.shuffedata.leasetimePct" time. This will > ensure that recomputation of missing shuffle partitions is done early, rather > than reducers failing with a time-consuming FetchFailure. The node state is > set to
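The tracker described above can be sketched as a small state machine. Everything below is hypothetical (class shape, state names beyond those quoted, and the lease thresholds are illustrative, not the proposed implementation): a node is registered with a termination deadline, becomes unschedulable immediately, and advances through states as configured fractions of the lease elapse.

```python
# Minimal sketch of the DecommissionTracker state machine described above.
# Hypothetical: class layout and default thresholds are illustrative only.

DECOMMISSIONING = "DECOMMISSIONING"
EXECUTOR_DECOMMISSIONED = "EXECUTOR_DECOMMISSIONED"
SHUFFLE_UNAVAILABLE = "SHUFFLE_UNAVAILABLE"

class DecommissionTracker:
    def __init__(self, executor_lease_pct=0.5, shuffle_lease_pct=0.9):
        self.nodes = {}  # host -> {"start", "deadline", "state"}
        self.executor_lease_pct = executor_lease_pct
        self.shuffle_lease_pct = shuffle_lease_pct

    def add_node(self, host, now, deadline):
        # Register the node; from now on no new tasks go to it.
        self.nodes[host] = {"start": now, "deadline": deadline,
                            "state": DECOMMISSIONING}

    def is_schedulable(self, host):
        return host not in self.nodes

    def tick(self, host, now):
        # Advance state as the configured fractions of the lease elapse.
        info = self.nodes[host]
        elapsed = (now - info["start"]) / (info["deadline"] - info["start"])
        if elapsed >= self.shuffle_lease_pct:
            info["state"] = SHUFFLE_UNAVAILABLE   # unregister shuffle outputs
        elif elapsed >= self.executor_lease_pct:
            info["state"] = EXECUTOR_DECOMMISSIONED  # kill executors

tracker = DecommissionTracker()
tracker.add_node("host1", now=0, deadline=100)
assert not tracker.is_schedulable("host1")
tracker.tick("host1", now=60)  # past the executor lease, before shuffle lease
assert tracker.nodes["host1"]["state"] == EXECUTOR_DECOMMISSIONED
```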
[jira] [Resolved] (SPARK-30856) SQLContext retains reference to unusable instance after SparkContext restarted
[ https://issues.apache.org/jira/browse/SPARK-30856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30856. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27610 [https://github.com/apache/spark/pull/27610] > SQLContext retains reference to unusable instance after SparkContext restarted > -- > > Key: SPARK-30856 > URL: https://issues.apache.org/jira/browse/SPARK-30856 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.5 >Reporter: Alex Favaro >Priority: Major > Fix For: 3.1.0 > > > When the underlying SQLContext is instantiated for a SparkSession, the > instance is saved as a class attribute and returned from subsequent calls to > SQLContext.getOrCreate(). If the SparkContext is stopped and a new one > started, the SQLContext class attribute is never cleared so any code which > calls SQLContext.getOrCreate() will get a SQLContext with a reference to the > old, unusable SparkContext. > A similar issue was identified and fixed for SparkSession in SPARK-19055, but > the fix did not change SQLContext as well. I ran into this because mllib > still > [uses|https://github.com/apache/spark/blob/master/python/pyspark/mllib/common.py#L105] > SQLContext.getOrCreate() under the hood. > I've already written a fix for this, which I'll be sharing in a PR, that > clears the class attribute on SQLContext when the SparkSession is stopped. > Another option would be to deprecate SQLContext.getOrCreate() entirely since > the corresponding Scala > [method|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/SQLContext.html#getOrCreate-org.apache.spark.SparkContext-] > is itself deprecated. That seems like a larger change for a relatively minor > issue, however. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
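The bug pattern is easy to reproduce with a simplified mock (these are stand-in classes, not PySpark's actual `SparkContext`/`SQLContext`): a class-level cache in `getOrCreate()` that is never invalidated keeps handing back an instance bound to the stopped context.

```python
# Simplified mock of the reported bug (not PySpark's real classes): the
# class attribute set by getOrCreate() survives a context restart, so the
# cached SQLContext still references the old, stopped SparkContext.

class FakeSparkContext:
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

class FakeSQLContext:
    _instantiatedContext = None  # class attribute, never cleared on stop()

    def __init__(self, sc):
        self.sparkContext = sc

    @classmethod
    def getOrCreate(cls, sc):
        if cls._instantiatedContext is None:
            cls._instantiatedContext = cls(sc)
        return cls._instantiatedContext

sc1 = FakeSparkContext()
ctx = FakeSQLContext.getOrCreate(sc1)
sc1.stop()

sc2 = FakeSparkContext()  # "restarted" context
stale = FakeSQLContext.getOrCreate(sc2)
# The cache still returns the context bound to the stopped sc1:
assert stale.sparkContext is sc1
assert stale.sparkContext.stopped
```

The fix described in the ticket amounts to clearing `_instantiatedContext` when the session/context is stopped, mirroring what SPARK-19055 did for SparkSession.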
[jira] [Updated] (SPARK-30860) Different behavior between rolling and non-rolling event log
[ https://issues.apache.org/jira/browse/SPARK-30860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-30860: - Priority: Major (was: Minor) > Different behavior between rolling and non-rolling event log > > > Key: SPARK-30860 > URL: https://issues.apache.org/jira/browse/SPARK-30860 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Adam Binford >Priority: Major > > When creating a rolling event log, the application directory is created with > a call to FileSystem.mkdirs, with the file permission 770. The default > behavior of HDFS is to set the permission of a file created with > FileSystem.create or FileSystem.mkdirs to (P & ^umask), where P is the > permission in the API call and umask is a system value set by > fs.permissions.umask-mode and defaults to 0022. This means, with default > settings, any mkdirs call can have at most 755 permissions, which causes > rolling event log directories to be created with 750 permissions. This causes > the history server to be unable to prune old applications if they are not run > by the same user running the history server. > This is not a problem for non-rolling logs, because it uses > SparkHadoopUtils.createFile for Hadoop 2 backward compatibility, and then > calls FileSystem.setPermission with 770 after the file has been created. > setPermission doesn't have the umask applied to it, so this works fine. > Obviously this could be fixed by changing fs.permissions.umask-mode, but I'm > not sure the reason that's set in the first place or if this would hurt > anything else. The main issue is there is different behavior between rolling > and non-rolling event logs that might want to be updated in this repo to be > consistent across each. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
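The permission arithmetic described in SPARK-30860 is easy to check directly. This is a minimal sketch of HDFS's `requested & ~umask` rule (the constant and function names are illustrative, not Hadoop API), showing why a `mkdirs` call with 770 lands at 750 under the default 022 umask while `setPermission` is unaffected:

```python
# Sketch of how HDFS derives the effective mode for create/mkdirs:
# effective = requested & ~umask. With the default fs.permissions.umask-mode
# of 022, mkdirs(770) yields 750 - dropping the group-write bit the history
# server needs to prune other users' rolling event log directories.

DEFAULT_UMASK = 0o022  # default value of fs.permissions.umask-mode

def effective_mode(requested: int, umask: int = DEFAULT_UMASK) -> int:
    return requested & ~umask

assert effective_mode(0o770) == 0o750  # rolling log dir: group write lost
assert effective_mode(0o755) == 0o755  # 755 survives the default umask intact
```

`FileSystem.setPermission` applies the mode as given, which is why the non-rolling path (create, then setPermission(770)) ends up with the intended bits.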
[jira] [Resolved] (SPARK-28990) SparkSQL invalid call to toAttribute on unresolved object, tree: *
[ https://issues.apache.org/jira/browse/SPARK-28990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-28990. -- Resolution: Cannot Reproduce > SparkSQL invalid call to toAttribute on unresolved object, tree: * > -- > > Key: SPARK-28990 > URL: https://issues.apache.org/jira/browse/SPARK-28990 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3 >Reporter: fengchaoge >Priority: Major > > SparkSQL create table as select from one table which may not exists throw > exceptions like: > {code} > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > toAttribute on unresolved object, tree: > {code} > This is not friendly, spark user may have no idea about what's wrong. > Simple sql can reproduce it,like this: > {code} > spark-sql (default)> create table default.spark as select * from default.dual; > {code} > {code} > 2019-09-05 16:27:24,127 INFO (main) [Logging.scala:logInfo(54)] - Parsing > command: create table default.spark as select * from default.dual > 2019-09-05 16:27:24,772 ERROR (main) [Logging.scala:logError(91)] - Failed in > [create table default.spark as select * from default.dual] > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > toAttribute on unresolved object, tree: * > at > org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:245) > at > org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52) > at > org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234) > at scala.collection.immutable.List.foreach(List.scala:392) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:234) > at 
scala.collection.immutable.List.map(List.scala:296) > at > org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:52) > at > org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:160) > at > org.apache.spark.sql.hive.HiveAnalysis$$anonfun$apply$3.applyOrElse(HiveStrategies.scala:148) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1$$anonfun$2.apply(AnalysisHelper.scala:108) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:107) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$anonfun$resolveOperatorsDown$1.apply(AnalysisHelper.scala:106) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperatorsDown(AnalysisHelper.scala:106) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29) > at > org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.resolveOperators(AnalysisHelper.scala:73) > at > org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29) > at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:148) > at org.apache.spark.sql.hive.HiveAnalysis$.apply(HiveStrategies.scala:147) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:87) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1$$anonfun$apply$1.apply(RuleExecutor.scala:84) > at > 
scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) > at > scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) > at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:48) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:84) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$execute$1.apply(RuleExecutor.scala:76) > at scala.collection.immutable.List.foreach(List.scala:392) > at > org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:76) > at > org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:127) > at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:121) > at >
[jira] [Resolved] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30332. -- Resolution: Cannot Reproduce I am resolving this. I can't reproduce and take a look for the cause. > When running sql query with limit catalyst throw StackOverFlow exception > - > > Key: SPARK-30332 > URL: https://issues.apache.org/jira/browse/SPARK-30332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark version 3.0.0-preview >Reporter: Izek Greenfield >Priority: Major > Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, > AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, > T_41233.csv > > > Running that SQL: > {code:sql} > SELECT BT_capital.asof_date, > BT_capital.run_id, > BT_capital.v, > BT_capital.id, > BT_capital.entity, > BT_capital.level_1, > BT_capital.level_2, > BT_capital.level_3, > BT_capital.level_4, > BT_capital.level_5, > BT_capital.level_6, > BT_capital.path_bt_capital, > BT_capital.line_item, > t0.target_line_item, > t0.line_description, > BT_capital.col_item, > BT_capital.rep_amount, > root.orgUnitId, > root.cptyId, > root.instId, > root.startDate, > root.maturityDate, > root.amount, > root.nominalAmount, > root.quantity, > root.lkupAssetLiability, > root.lkupCurrency, > root.lkupProdType, > root.interestResetDate, > root.interestResetTerm, > root.noticePeriod, > root.historicCostAmount, > root.dueDate, > root.lkupResidence, > root.lkupCountryOfUltimateRisk, > root.lkupSector, > root.lkupIndustry, > root.lkupAccountingPortfolioType, > root.lkupLoanDepositTerm, > root.lkupFixedFloating, > root.lkupCollateralType, > root.lkupRiskType, > root.lkupEligibleRefinancing, > root.lkupHedging, > root.lkupIsOwnIssued, > root.lkupIsSubordinated, > root.lkupIsQuoted, > root.lkupIsSecuritised, > root.lkupIsSecuritisedServiced, > root.lkupIsSyndicated, > root.lkupIsDeRecognised, > root.lkupIsRenegotiated, > root.lkupIsTransferable, 
> root.lkupIsNewBusiness, > root.lkupIsFiduciary, > root.lkupIsNonPerforming, > root.lkupIsInterGroup, > root.lkupIsIntraGroup, > root.lkupIsRediscounted, > root.lkupIsCollateral, > root.lkupIsExercised, > root.lkupIsImpaired, > root.facilityId, > root.lkupIsOTC, > root.lkupIsDefaulted, > root.lkupIsSavingsPosition, > root.lkupIsForborne, > root.lkupIsDebtRestructuringLoan, > root.interestRateAAR, > root.interestRateAPRC, > root.custom1, > root.custom2, > root.custom3, > root.lkupSecuritisationType, > root.lkupIsCashPooling, > root.lkupIsEquityParticipationGTE10, > root.lkupIsConvertible, > root.lkupEconomicHedge, > root.lkupIsNonCurrHeldForSale, > root.lkupIsEmbeddedDerivative, > root.lkupLoanPurpose, > root.lkupRegulated, > root.lkupRepaymentType, > root.glAccount, > root.lkupIsRecourse, > root.lkupIsNotFullyGuaranteed, > root.lkupImpairmentStage, > root.lkupIsEntireAmountWrittenOff, > root.lkupIsLowCreditRisk, > root.lkupIsOBSWithinIFRS9, > root.lkupIsUnderSpecialSurveillance, > root.lkupProtection, > root.lkupIsGeneralAllowance, > root.lkupSectorUltimateRisk, > root.cptyOrgUnitId, > root.name, > root.lkupNationality, > root.lkupSize, > root.lkupIsSPV, > root.lkupIsCentralCounterparty, > root.lkupIsMMRMFI, > root.lkupIsKeyManagement, > root.lkupIsOtherRelatedParty, > root.lkupResidenceProvince, > root.lkupIsTradingBook, > root.entityHierarchy_entityId, > root.entityHierarchy_Residence, > root.lkupLocalCurrency, > root.cpty_entityhierarchy_entityId, > root.lkupRelationship, > root.cpty_lkupRelationship, > root.entityNationality, > root.lkupRepCurrency, > root.startDateFinancialYear, > root.numEmployees, > root.numEmployeesTotal, > root.collateralAmount, > root.guaranteeAmount, > root.impairmentSpecificIndividual, > root.impairmentSpecificCollective, > root.impairmentGeneral, > root.creditRiskAmount, > root.provisionSpecificIndividual, > root.provisionSpecificCollective, > root.provisionGeneral, > root.writeOffAmount, > root.interest, > root.fairValueAmount, > 
root.grossCarryingAmount, > root.carryingAmount, > root.code, > root.lkupInstrumentType, > root.price, > root.amountAtIssue, > root.yield, > root.totalFacilityAmount, > root.facility_rate, > root.spec_indiv_est, > root.spec_coll_est, > root.coll_inc_loss, > root.impairment_amount, > root.provision_amount, > root.accumulated_impairment, > root.exclusionFlag, > root.lkupIsHoldingCompany, > root.instrument_startDate, > root.entityResidence, > fxRate.enumerator, > fxRate.lkupFromCurrency, > fxRate.rate, > fxRate.custom1, > fxRate.custom2, > fxRate.custom3, > GB_position.lkupIsECGDGuaranteed, > GB_position.lkupIsMultiAcctOffsetMortgage, > GB_position.lkupIsIndexLinked, > GB_position.lkupIsRetail, > GB_position.lkupCollateralLocation, > GB_position.percentAboveBBR, >
[jira] [Commented] (SPARK-30332) When running sql query with limit catalyst throw StackOverFlow exception
[ https://issues.apache.org/jira/browse/SPARK-30332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040560#comment-17040560 ] Hyukjin Kwon commented on SPARK-30332: -- Where is error message, and how do you create the tables? There are many ways to narrow down the problem. You can remove the columns one by one, and see if it still reproduces the issue. Simplifying the multiple CSV files. Unifying the file names for readability. > When running sql query with limit catalyst throw StackOverFlow exception > - > > Key: SPARK-30332 > URL: https://issues.apache.org/jira/browse/SPARK-30332 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: spark version 3.0.0-preview >Reporter: Izek Greenfield >Priority: Major > Attachments: AGGR_41380.csv, AGGR_41390.csv, AGGR_41406.csv, > AGGR_41406.csv, AGGR_41410.csv, AGGR_41418.csv, PORTFOLIO_41446.csv, > T_41233.csv > > > Running that SQL: > {code:sql} > SELECT BT_capital.asof_date, > BT_capital.run_id, > BT_capital.v, > BT_capital.id, > BT_capital.entity, > BT_capital.level_1, > BT_capital.level_2, > BT_capital.level_3, > BT_capital.level_4, > BT_capital.level_5, > BT_capital.level_6, > BT_capital.path_bt_capital, > BT_capital.line_item, > t0.target_line_item, > t0.line_description, > BT_capital.col_item, > BT_capital.rep_amount, > root.orgUnitId, > root.cptyId, > root.instId, > root.startDate, > root.maturityDate, > root.amount, > root.nominalAmount, > root.quantity, > root.lkupAssetLiability, > root.lkupCurrency, > root.lkupProdType, > root.interestResetDate, > root.interestResetTerm, > root.noticePeriod, > root.historicCostAmount, > root.dueDate, > root.lkupResidence, > root.lkupCountryOfUltimateRisk, > root.lkupSector, > root.lkupIndustry, > root.lkupAccountingPortfolioType, > root.lkupLoanDepositTerm, > root.lkupFixedFloating, > root.lkupCollateralType, > root.lkupRiskType, > root.lkupEligibleRefinancing, > root.lkupHedging, > root.lkupIsOwnIssued, > 
root.lkupIsSubordinated, > root.lkupIsQuoted, > root.lkupIsSecuritised, > root.lkupIsSecuritisedServiced, > root.lkupIsSyndicated, > root.lkupIsDeRecognised, > root.lkupIsRenegotiated, > root.lkupIsTransferable, > root.lkupIsNewBusiness, > root.lkupIsFiduciary, > root.lkupIsNonPerforming, > root.lkupIsInterGroup, > root.lkupIsIntraGroup, > root.lkupIsRediscounted, > root.lkupIsCollateral, > root.lkupIsExercised, > root.lkupIsImpaired, > root.facilityId, > root.lkupIsOTC, > root.lkupIsDefaulted, > root.lkupIsSavingsPosition, > root.lkupIsForborne, > root.lkupIsDebtRestructuringLoan, > root.interestRateAAR, > root.interestRateAPRC, > root.custom1, > root.custom2, > root.custom3, > root.lkupSecuritisationType, > root.lkupIsCashPooling, > root.lkupIsEquityParticipationGTE10, > root.lkupIsConvertible, > root.lkupEconomicHedge, > root.lkupIsNonCurrHeldForSale, > root.lkupIsEmbeddedDerivative, > root.lkupLoanPurpose, > root.lkupRegulated, > root.lkupRepaymentType, > root.glAccount, > root.lkupIsRecourse, > root.lkupIsNotFullyGuaranteed, > root.lkupImpairmentStage, > root.lkupIsEntireAmountWrittenOff, > root.lkupIsLowCreditRisk, > root.lkupIsOBSWithinIFRS9, > root.lkupIsUnderSpecialSurveillance, > root.lkupProtection, > root.lkupIsGeneralAllowance, > root.lkupSectorUltimateRisk, > root.cptyOrgUnitId, > root.name, > root.lkupNationality, > root.lkupSize, > root.lkupIsSPV, > root.lkupIsCentralCounterparty, > root.lkupIsMMRMFI, > root.lkupIsKeyManagement, > root.lkupIsOtherRelatedParty, > root.lkupResidenceProvince, > root.lkupIsTradingBook, > root.entityHierarchy_entityId, > root.entityHierarchy_Residence, > root.lkupLocalCurrency, > root.cpty_entityhierarchy_entityId, > root.lkupRelationship, > root.cpty_lkupRelationship, > root.entityNationality, > root.lkupRepCurrency, > root.startDateFinancialYear, > root.numEmployees, > root.numEmployeesTotal, > root.collateralAmount, > root.guaranteeAmount, > root.impairmentSpecificIndividual, > root.impairmentSpecificCollective, > 
root.impairmentGeneral, > root.creditRiskAmount, > root.provisionSpecificIndividual, > root.provisionSpecificCollective, > root.provisionGeneral, > root.writeOffAmount, > root.interest, > root.fairValueAmount, > root.grossCarryingAmount, > root.carryingAmount, > root.code, > root.lkupInstrumentType, > root.price, > root.amountAtIssue, > root.yield, > root.totalFacilityAmount, > root.facility_rate, > root.spec_indiv_est, > root.spec_coll_est, > root.coll_inc_loss, > root.impairment_amount, > root.provision_amount, > root.accumulated_impairment, > root.exclusionFlag, > root.lkupIsHoldingCompany, > root.instrument_startDate, > root.entityResidence, > fxRate.enumerator, > fxRate.lkupFromCurrency, > fxRate.rate, > fxRate.custom1, > fxRate.custom2, > fxRate.custom3, >
[jira] [Commented] (SPARK-30868) Throw Exception if runHive(sql) failed
[ https://issues.apache.org/jira/browse/SPARK-30868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040557#comment-17040557 ] Jackey Lee commented on SPARK-30868: [~Ankitraj] No, I think we should just throw an exception once a Hive statement fails. Otherwise, the user cannot detect that the statement has failed. Have you created a Jira for this? > Throw Exception if runHive(sql) failed > -- > > Key: SPARK-30868 > URL: https://issues.apache.org/jira/browse/SPARK-30868 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Jackey Lee >Priority: Major > > At present, HiveClientImpl.runHive does not throw an exception when a > statement fails, so it cannot report the error back to the caller. > Example: > {code:scala} > spark.sql("add jar file:///tmp/test.jar").show() > spark.sql("show databases").show() > {code} > /tmp/test.jar doesn't exist, so the add jar statement fails. However, this > code runs to completion without causing an application failure.
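A hedged sketch of the pattern SPARK-30868 asks for, using a plain subprocess rather than Spark's HiveClientImpl (which this code does not reproduce): check the result of each statement and raise, so the caller can detect the failure instead of continuing silently:

```python
# Illustrative only: run_statement is a stand-in for runHive(sql). The point
# is the control flow - a non-zero result becomes an exception the caller can
# catch, rather than being swallowed.

import subprocess
import sys

def run_statement(cmd):
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        # propagate the failure instead of returning normally
        raise RuntimeError(
            f"statement failed ({proc.returncode}): {proc.stderr.strip()}")
    return proc.stdout

# a failing "statement" now surfaces as an exception the caller can handle
try:
    run_statement([sys.executable, "-c", "raise SystemExit(3)"])
except RuntimeError as e:
    print("caught:", e)
```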
[jira] [Commented] (SPARK-30837) spark-master-test-k8s is broken
[ https://issues.apache.org/jira/browse/SPARK-30837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040555#comment-17040555 ] Hyukjin Kwon commented on SPARK-30837: -- Thanks! I see a green light :-). Resolving it. > spark-master-test-k8s is broken > --- > > Key: SPARK-30837 > URL: https://issues.apache.org/jira/browse/SPARK-30837 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Project Infra >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20K8s%20Builds/job/spark-master-test-k8s/471/console > {code} > + /home/jenkins/bin/session_lock_resource.py minikube > File "/home/jenkins/bin/session_lock_resource.py", line 140 > child_body_func = lambda(success_callback): _lock_and_wait( > ^ > SyntaxError: invalid syntax > {code}
[jira] [Resolved] (SPARK-30837) spark-master-test-k8s is broken
[ https://issues.apache.org/jira/browse/SPARK-30837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30837. -- Resolution: Fixed > spark-master-test-k8s is broken > --- > > Key: SPARK-30837 > URL: https://issues.apache.org/jira/browse/SPARK-30837 > Project: Spark > Issue Type: Bug > Components: Kubernetes, Project Infra >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20K8s%20Builds/job/spark-master-test-k8s/471/console > {code} > + /home/jenkins/bin/session_lock_resource.py minikube > File "/home/jenkins/bin/session_lock_resource.py", line 140 > child_body_func = lambda(success_callback): _lock_and_wait( > ^ > SyntaxError: invalid syntax > {code}
[jira] [Updated] (SPARK-30876) Optimizer cannot infer from inferred constraints with join
[ https://issues.apache.org/jira/browse/SPARK-30876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-30876: Summary: Optimizer cannot infer from inferred constraints with join (was: Optimizer cannot infer more constraint) > Optimizer cannot infer from inferred constraints with join > -- > > Key: SPARK-30876 > URL: https://issues.apache.org/jira/browse/SPARK-30876 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Yuming Wang >Priority: Major > > How to reproduce this issue: > {code:sql} > create table t1(a int, b int, c int); > create table t2(a int, b int, c int); > create table t3(a int, b int, c int); > {code} > Spark 2.3+: > {noformat} > == Physical Plan == > *(4) HashAggregate(keys=[], functions=[count(1)]) > +- Exchange SinglePartition, true, [id=#102] >+- *(3) HashAggregate(keys=[], functions=[partial_count(1)]) > +- *(3) Project > +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight > :- *(3) Project [b#10] > : +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight > : :- *(3) Project [a#6] > : : +- *(3) Filter isnotnull(a#6) > : : +- *(3) ColumnarToRow > : :+- FileScan parquet default.t1[a#6] Batched: true, > DataFilters: [isnotnull(a#6)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., > PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: > struct > : +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), > [id=#87] > :+- *(1) Project [b#10] > : +- *(1) Filter (isnotnull(b#10) AND (b#10 = 1)) > : +- *(1) ColumnarToRow > : +- FileScan parquet default.t2[b#10] Batched: > true, DataFilters: [isnotnull(b#10), (b#10 = 1)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., > PartitionFilters: [], PushedFilters: [IsNotNull(b), 
EqualTo(b,1)], > ReadSchema: struct > +- BroadcastExchange > HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), > [id=#96] >+- *(2) Project [c#14] > +- *(2) Filter (isnotnull(c#14) AND (c#14 = 1)) > +- *(2) ColumnarToRow > +- FileScan parquet default.t3[c#14] Batched: true, > DataFilters: [isnotnull(c#14), (c#14 = 1)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., > PartitionFilters: [], PushedFilters: [IsNotNull(c), EqualTo(c,1)], > ReadSchema: struct > Time taken: 3.785 seconds, Fetched 1 row(s) > {noformat} > Spark 2.2.x: > {noformat} > == Physical Plan == > *HashAggregate(keys=[], functions=[count(1)]) > +- Exchange SinglePartition >+- *HashAggregate(keys=[], functions=[partial_count(1)]) > +- *Project > +- *SortMergeJoin [b#19], [c#23], Inner > :- *Project [b#19] > : +- *SortMergeJoin [a#15], [b#19], Inner > : :- *Sort [a#15 ASC NULLS FIRST], false, 0 > : : +- Exchange hashpartitioning(a#15, 200) > : : +- *Filter (isnotnull(a#15) && (a#15 = 1)) > : :+- HiveTableScan [a#15], HiveTableRelation > `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#15, > b#16, c#17] > : +- *Sort [b#19 ASC NULLS FIRST], false, 0 > :+- Exchange hashpartitioning(b#19, 200) > : +- *Filter (isnotnull(b#19) && (b#19 = 1)) > : +- HiveTableScan [b#19], HiveTableRelation > `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#18, > b#19, c#20] > +- *Sort [c#23 ASC NULLS FIRST], false, 0 >+- Exchange hashpartitioning(c#23, 200) > +- *Filter (isnotnull(c#23) && (c#23 = 1)) > +- HiveTableScan [c#23], HiveTableRelation > `default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#21, > b#22, c#23] > Time taken: 0.728 seconds, Fetched 1 row(s) > {noformat} > Spark 2.2 can infer {{(a#15 = 1)}}, but Spark 2.3+ can't. 
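The inference SPARK-30876 says Spark 2.2 performed but 2.3+ does not can be sketched as a fixpoint over equality constraints. This toy version (not Catalyst code) shows how the filter b = 1 propagates through the join conditions a = b and b = c to yield a = 1 and c = 1, which is what lets each scan be filtered before the join:

```python
# Toy constraint propagation: given column-equality pairs and known constant
# constraints, iterate to a fixpoint so every column transitively equal to a
# constant inherits that constant.

def infer_constants(equalities, constants):
    """equalities: set of (col, col) pairs; constants: dict col -> value."""
    inferred = dict(constants)
    changed = True
    while changed:                      # repeat until no new fact is derived
        changed = False
        for x, y in equalities:
            for a, b in ((x, y), (y, x)):
                if a in inferred and b not in inferred:
                    inferred[b] = inferred[a]
                    changed = True
    return inferred

# join conditions t1.a = t2.b and t2.b = t3.c, plus the filter b = 1
result = infer_constants({("a", "b"), ("b", "c")}, {"b": 1})
assert result == {"a": 1, "b": 1, "c": 1}
```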
[jira] [Updated] (SPARK-30882) Inaccurate results with higher precision ApproximatePercentile results
[ https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nandini malempati updated SPARK-30882: -- Summary: Inaccurate results with higher precision ApproximatePercentile results (was: Inaccurate results even with higher precision ApproximatePercentile results ) > Inaccurate results with higher precision ApproximatePercentile results > --- > > Key: SPARK-30882 > URL: https://issues.apache.org/jira/browse/SPARK-30882 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.4 >Reporter: Nandini malempati >Priority: Major > > Results of ApproximatePercentile should have better accuracy with increased > precision, as per the documentation provided here: > [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala] > But I'm seeing nondeterministic behavior. On a data set of size 450, > accuracy 100 returns better results than 500 (P25 and P90 look > fine, but P75 is very off), and accuracy 700 gives exact results. But > this behavior is not consistent. 
> {code:scala} > // Some comments here > package com.microsoft.teams.test.utils > import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession} > import org.scalatest.{Assertions, FunSpec} > import org.apache.spark.sql.functions.{lit, _} > import > org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile > import org.apache.spark.sql.types.{IntegerType, StringType, StructField, > StructType} > import scala.collection.mutable.ListBuffer > class PercentilesTest extends FunSpec with Assertions { > it("check percentiles with different precision") { > val schema = List(StructField("MetricName", StringType), > StructField("DataPoint", IntegerType)) > val data = new ListBuffer[Row] > for(i <- 1 to 450) { data += Row("metric", i)} > import spark.implicits._ > val df = createDF(schema, data.toSeq) > val accuracy1000 = > df.groupBy("MetricName").agg(percentile_approx($"DataPoint", > typedLit(Seq(0.1, 0.25, 0.75, 0.9)), > ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY)) > val accuracy1M = > df.groupBy("MetricName").agg(percentile_approx($"DataPoint", > typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 100)) > val accuracy5M = > df.groupBy("MetricName").agg(percentile_approx($"DataPoint", > typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 500)) > val accuracy7M = > df.groupBy("MetricName").agg(percentile_approx($"DataPoint", > typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 700)) > val accuracy10M = > df.groupBy("MetricName").agg(percentile_approx($"DataPoint", > typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 1000)) > accuracy1000.show(1, false) > accuracy1M.show(1, false) > accuracy5M.show(1, false) > accuracy7M.show(1, false) > accuracy10M.show(1, false) > } > def percentile_approx(col: Column, percentage: Column, accuracy: Column): > Column = { > val expr = new ApproximatePercentile( > col.expr, percentage.expr, accuracy.expr > ).toAggregateExpression > new Column(expr) > } > def percentile_approx(col: Column, percentage: Column, accuracy: Int): > Column = percentile_approx( > col, percentage, 
lit(accuracy) > ) > lazy val spark: SparkSession = { > SparkSession > .builder() > .master("local") > .appName("spark tests") > .getOrCreate() > } > def createDF(schema: List[StructField], data: Seq[Row]): DataFrame = { > spark.createDataFrame( > spark.sparkContext.parallelize(data), > StructType(schema)) > } > } > {code} > Above is a test run to reproduce the error. Below are few runs with different > accuracies. P25 and P90 looks fine. But P75 is the max of the column. > +--+--+ > |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 700)| > +--+--+ > |metric |[45, 1125000, 3375000, 405] | > +--+--+ > > +--+--+ > |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)| > +--+--+ > |metric |[45, 1125000, 450, 405] | > +--+--+ > > +--+--+ > |MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 100)| >
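As a baseline for the report above, the exact percentiles of the test column 1..450 are easy to compute. This sketch uses the nearest-rank definition (one of several percentile conventions, and not ApproximatePercentile's algorithm), which makes clear that a P75 equal to the column max of 450 is far off from the true value near 338:

```python
# Exact nearest-rank percentile: the ceil(pct * n / 100)-th smallest value.
# Integer arithmetic throughout, so no float-rounding surprises at the rank
# boundary (e.g. 0.1 * 450 is not exactly 45 in binary floating point).

def nearest_rank(data, pct):
    """pct is a whole-number percentile, e.g. 75 for P75."""
    s = sorted(data)
    rank = -(-pct * len(s) // 100)   # ceiling division via negated floor
    return s[rank - 1]

data = list(range(1, 451))           # the test column from the report
exact = [nearest_rank(data, p) for p in (10, 25, 75, 90)]
print(exact)                         # -> [45, 113, 338, 405]
```

Against this baseline, the reported [45, ..., 450, 405] result shows P10 and P90 landing on the exact values while P75 collapses to the maximum of the column.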
[jira] [Updated] (SPARK-30882) Inaccurate results even with higher precision ApproximatePercentile results
[ https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nandini malempati updated SPARK-30882:
--------------------------------------
    Description: 
Results of ApproximatePercentile should have better accuracy with increased precision, per the documentation provided here:
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]

But I'm seeing nondeterministic behavior. On a data set of size 450, accuracy 100 returns better results than 500 (P25 and P90 look fine, but P75 is very off), and accuracy 700 gives exact results. But this behavior is not consistent.

{code:scala}
package com.microsoft.teams.test.utils

import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.scalatest.{Assertions, FunSpec}
import org.apache.spark.sql.functions.{lit, _}
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

import scala.collection.mutable.ListBuffer

class PercentilesTest extends FunSpec with Assertions {

  it("check percentiles with different precision") {
    val schema = List(StructField("MetricName", StringType), StructField("DataPoint", IntegerType))
    val data = new ListBuffer[Row]
    for (i <- 1 to 450) { data += Row("metric", i) }
    import spark.implicits._
    val df = createDF(schema, data.toSeq)
    val accuracy1000 = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
    val accuracy1M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 100))
    val accuracy5M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 500))
    val accuracy7M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 700))
    val accuracy10M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 1000))
    accuracy1000.show(1, false)
    accuracy1M.show(1, false)
    accuracy5M.show(1, false)
    accuracy7M.show(1, false)
    accuracy10M.show(1, false)
  }

  def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
    val expr = new ApproximatePercentile(col.expr, percentage.expr, accuracy.expr).toAggregateExpression
    new Column(expr)
  }

  def percentile_approx(col: Column, percentage: Column, accuracy: Int): Column =
    percentile_approx(col, percentage, lit(accuracy))

  lazy val spark: SparkSession = SparkSession.builder()
    .master("local")
    .appName("spark tests")
    .getOrCreate()

  def createDF(schema: List[StructField], data: Seq[Row]): DataFrame =
    spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))
}
{code}

Above is a test run to reproduce the error. In this example, with accuracy 500, P25 and P90 look fine, but P75 is the max of the column:

+----------+------------------------------------------------------+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
+----------+------------------------------------------------------+
|metric    |[45, 1125000, 450, 405]                               |
+----------+------------------------------------------------------+

This is breaking our reports, as there is no proper definition of accuracy. We have data sets of size more than 2700. After studying the pattern, I found that inaccurate percentiles always have the "max" of the column as the value. P50 and P99 might be right in a few cases, but P75 can be wrong. Is there a way to define what the correct accuracy would be for a given dataset size?

  was:
Results of ApproximatePercentile should have better accuracy with increased precision, per the documentation provided here:
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]

But I'm seeing nondeterministic behavior. On a data set of size 450, accuracy 100 returns better results than 500, and accuracy 700 gives exact results. But this behavior is not consistent.
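One way to frame the closing question about choosing accuracy: Spark's documentation for its approximate percentile functions describes 1.0/accuracy as the relative error of the approximation, which is a rank-based bound, so on N rows the returned value should be within roughly N/accuracy rank positions of the exact percentile. A minimal sketch of that arithmetic (plain Python for illustration; the helper name is ours):

```python
# Worst-case rank error implied by the documented guarantee that
# 1.0 / accuracy is the relative (rank-based) error of the approximation.
def worst_case_rank_error(n_rows: int, accuracy: int) -> float:
    """How many rank positions the returned value may be off by, at most."""
    return n_rows / accuracy

# For the 450-row dataset in the report:
print(worst_case_rank_error(450, 100))  # 4.5 -> a few ranks of slack
print(worst_case_rank_error(450, 500))  # 0.9 -> should be essentially exact
```

By that bound, accuracy 500 on 450 rows should never return the column max for P75, which supports treating the reported behavior as a bug rather than expected approximation error.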
[jira] [Updated] (SPARK-30882) Inaccurate results even with higher precision ApproximatePercentile results
[ https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nandini malempati updated SPARK-30882:
--------------------------------------
    Description: 
Results of ApproximatePercentile should have better accuracy with increased precision, per the documentation provided here:
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]

But I'm seeing nondeterministic behavior. On a data set of size 450, accuracy 100 returns better results than 500 (P25 and P90 look fine, but P75 is very off), and accuracy 700 gives exact results. But this behavior is not consistent.

{code:scala}
package com.microsoft.teams.test.utils

import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.scalatest.{Assertions, FunSpec}
import org.apache.spark.sql.functions.{lit, _}
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

import scala.collection.mutable.ListBuffer

class PercentilesTest extends FunSpec with Assertions {

  it("check percentiles with different precision") {
    val schema = List(StructField("MetricName", StringType), StructField("DataPoint", IntegerType))
    val data = new ListBuffer[Row]
    for (i <- 1 to 450) { data += Row("metric", i) }
    import spark.implicits._
    val df = createDF(schema, data.toSeq)
    val accuracy1000 = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
    val accuracy1M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 100))
    val accuracy5M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 500))
    val accuracy7M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 700))
    val accuracy10M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 1000))
    accuracy1000.show(1, false)
    accuracy1M.show(1, false)
    accuracy5M.show(1, false)
    accuracy7M.show(1, false)
    accuracy10M.show(1, false)
  }

  def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
    val expr = new ApproximatePercentile(col.expr, percentage.expr, accuracy.expr).toAggregateExpression
    new Column(expr)
  }

  def percentile_approx(col: Column, percentage: Column, accuracy: Int): Column =
    percentile_approx(col, percentage, lit(accuracy))

  lazy val spark: SparkSession = SparkSession.builder()
    .master("local")
    .appName("spark tests")
    .getOrCreate()

  def createDF(schema: List[StructField], data: Seq[Row]): DataFrame =
    spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))
}
{code}

Above is a test run to reproduce the error. Below are a few runs with different accuracies. P25 and P90 look fine, but P75 is the max of the column:

+----------+------------------------------------------------------+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 700)|
+----------+------------------------------------------------------+
|metric    |[45, 1125000, 3375000, 405]                           |
+----------+------------------------------------------------------+

+----------+------------------------------------------------------+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
+----------+------------------------------------------------------+
|metric    |[45, 1125000, 450, 405]                               |
+----------+------------------------------------------------------+

+----------+------------------------------------------------------+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 100)|
+----------+------------------------------------------------------+
|metric    |[45, 1124998, 3374996, 405]                           |
+----------+------------------------------------------------------+

+----------+----------------------------------------------------+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 1)|
+----------+----------------------------------------------------+
|metric    |[45, 1124848, 3374638, 405]                         |
+----------+----------------------------------------------------+

This is breaking our reports, as there is no proper definition of accuracy. We have data sets of size more than 2700. After studying the pattern, I found that inaccurate percentiles always have the "max" of the column as the value.
[jira] [Updated] (SPARK-30882) Inaccurate ApproximatePercentile results
[ https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nandini malempati updated SPARK-30882:
--------------------------------------
    Summary: Inaccurate ApproximatePercentile results  (was: ApproximatePercentile results)

> Inaccurate ApproximatePercentile results
> ----------------------------------------
>
>                 Key: SPARK-30882
>                 URL: https://issues.apache.org/jira/browse/SPARK-30882
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.4
>            Reporter: Nandini malempati
>            Priority: Major
>
> Results of ApproximatePercentile should have better accuracy with increased
> precision, per the documentation provided here:
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
> But I'm seeing nondeterministic behavior. On a data set of size 450,
> accuracy 100 returns better results than 500, and accuracy 700 gives exact
> results. But this behavior is not consistent.
> [...]
> Is there a way to define what the correct accuracy would be for a given
> dataset size ?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30882) Inaccurate results even with higher precision ApproximatePercentile results
[ https://issues.apache.org/jira/browse/SPARK-30882?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nandini malempati updated SPARK-30882:
--------------------------------------
    Summary: Inaccurate results even with higher precision ApproximatePercentile results  (was: Inaccurate ApproximatePercentile results)

> Inaccurate results even with higher precision ApproximatePercentile results
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-30882
>                 URL: https://issues.apache.org/jira/browse/SPARK-30882
>             Project: Spark
>          Issue Type: Bug
>          Components: Java API
>    Affects Versions: 2.4.4
>            Reporter: Nandini malempati
>            Priority: Major
>
> Results of ApproximatePercentile should have better accuracy with increased
> precision, per the documentation provided here:
> [https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]
> But I'm seeing nondeterministic behavior. On a data set of size 450,
> accuracy 100 returns better results than 500, and accuracy 700 gives exact
> results. But this behavior is not consistent.
> [...]
> Is there a way to define what the correct accuracy would be for a given
> dataset size ?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30882) ApproximatePercentile results
Nandini malempati created SPARK-30882:
-----------------------------------------

             Summary: ApproximatePercentile results
                 Key: SPARK-30882
                 URL: https://issues.apache.org/jira/browse/SPARK-30882
             Project: Spark
          Issue Type: Bug
          Components: Java API
    Affects Versions: 2.4.4
            Reporter: Nandini malempati


Results of ApproximatePercentile should have better accuracy with increased precision, per the documentation provided here:
[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/ApproximatePercentile.scala]

But I'm seeing nondeterministic behavior. On a data set of size 450, accuracy 100 returns better results than 500, and accuracy 700 gives exact results. But this behavior is not consistent.

{code:scala}
package com.microsoft.teams.test.utils

import org.apache.spark.sql.{Column, DataFrame, Row, SparkSession}
import org.scalatest.{Assertions, FunSpec}
import org.apache.spark.sql.functions.{lit, _}
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

import scala.collection.mutable.ListBuffer

class PercentilesTest extends FunSpec with Assertions {

  it("check percentiles with different precision") {
    val schema = List(StructField("MetricName", StringType), StructField("DataPoint", IntegerType))
    val data = new ListBuffer[Row]
    for (i <- 1 to 450) { data += Row("metric", i) }
    import spark.implicits._
    val df = createDF(schema, data.toSeq)
    val accuracy1000 = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), ApproximatePercentile.DEFAULT_PERCENTILE_ACCURACY))
    val accuracy1M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 100))
    val accuracy5M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 500))
    val accuracy7M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 700))
    val accuracy10M = df.groupBy("MetricName").agg(percentile_approx($"DataPoint",
      typedLit(Seq(0.1, 0.25, 0.75, 0.9)), 1000))
    accuracy1000.show(1, false)
    accuracy1M.show(1, false)
    accuracy5M.show(1, false)
    accuracy7M.show(1, false)
    accuracy10M.show(1, false)
  }

  def percentile_approx(col: Column, percentage: Column, accuracy: Column): Column = {
    val expr = new ApproximatePercentile(col.expr, percentage.expr, accuracy.expr).toAggregateExpression
    new Column(expr)
  }

  def percentile_approx(col: Column, percentage: Column, accuracy: Int): Column =
    percentile_approx(col, percentage, lit(accuracy))

  lazy val spark: SparkSession = SparkSession.builder()
    .master("local")
    .appName("spark tests")
    .getOrCreate()

  def createDF(schema: List[StructField], data: Seq[Row]): DataFrame =
    spark.createDataFrame(spark.sparkContext.parallelize(data), StructType(schema))
}
{code}

Above is a test run to reproduce the error. In this example, with accuracy 500, P25 and P90 look fine, but P75 is the max of the column:

+----------+------------------------------------------------------+
|MetricName|percentile_approx(DataPoint, [0.1,0.25,0.75,0.9], 500)|
+----------+------------------------------------------------------+
|metric    |[45, 1125000, 450, 405]                               |
+----------+------------------------------------------------------+

This is breaking our reports, as there is no proper definition of accuracy. We have data sets of size more than 2700. After studying the pattern, I found that inaccurate percentiles always have the "max" of the column as the value. P50 and P99 might be right in a few cases, but P75 can be wrong. Is there a way to define what the correct accuracy would be for a given dataset size?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
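For comparison with the approximate results above, the exact percentiles of the synthetic 1..450 column can be computed directly. This sketch uses the nearest-rank definition (one common exact definition, an assumption on our part; plain Python rather than the Scala test, and the helper name is ours):

```python
import math

# Exact nearest-rank percentile: the smallest element whose rank is
# at least ceil(p * n). ApproximatePercentile also returns actual
# elements of the column, so the values are directly comparable.
def exact_percentile(sorted_data, p):
    rank = max(1, math.ceil(p * len(sorted_data)))
    return sorted_data[rank - 1]

data = list(range(1, 451))  # the 450-row "metric" column from the test
print([exact_percentile(data, p) for p in (0.1, 0.25, 0.75, 0.9)])
# [45, 113, 338, 405]
```

P10 = 45 and P90 = 405 match the reported output exactly, which makes a P75 equal to the column max (450) stand out as the anomaly.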
[jira] [Resolved] (SPARK-30802) Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite
[ https://issues.apache.org/jira/browse/SPARK-30802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-30802. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27555 [https://github.com/apache/spark/pull/27555] > Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test > suite > --- > > Key: SPARK-30802 > URL: https://issues.apache.org/jira/browse/SPARK-30802 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > > Similar to https://issues.apache.org/jira/browse/SPARK-29754, we can use > Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite. > Also, we should use common code to minimize code duplication. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30802) Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite
[ https://issues.apache.org/jira/browse/SPARK-30802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-30802: - Priority: Minor (was: Major) > Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test > suite > --- > > Key: SPARK-30802 > URL: https://issues.apache.org/jira/browse/SPARK-30802 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > > Similar to https://issues.apache.org/jira/browse/SPARK-29754, we can use > Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite. > Also, we should use common code to minimize code duplication. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-30802) Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite
[ https://issues.apache.org/jira/browse/SPARK-30802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-30802: Assignee: Huaxin Gao > Use Summarizer instead of MultivariateOnlineSummarizer in Aggregator test > suite > --- > > Key: SPARK-30802 > URL: https://issues.apache.org/jira/browse/SPARK-30802 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Similar to https://issues.apache.org/jira/browse/SPARK-29754, we can use > Summarizer instead of MultivariateOnlineSummarizer in Aggregator test suite. > Also, we should use common code to minimize code duplication. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30881) Revise the doc of spark.sql.sources.parallelPartitionDiscovery.threshold
[ https://issues.apache.org/jira/browse/SPARK-30881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang updated SPARK-30881: --- Priority: Minor (was: Major) > Revise the doc of spark.sql.sources.parallelPartitionDiscovery.threshold > > > Key: SPARK-30881 > URL: https://issues.apache.org/jira/browse/SPARK-30881 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Minor > > The doc of configuration > "spark.sql.sources.parallelPartitionDiscovery.threshold" is not accurate on > the part "This applies to Parquet, ORC, CSV, JSON and LibSVM data sources". > We should revise it as effective on all the file-based data sources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30881) Revise the doc of spark.sql.sources.parallelPartitionDiscovery.threshold
Gengliang Wang created SPARK-30881: -- Summary: Revise the doc of spark.sql.sources.parallelPartitionDiscovery.threshold Key: SPARK-30881 URL: https://issues.apache.org/jira/browse/SPARK-30881 Project: Spark Issue Type: Documentation Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang Assignee: Gengliang Wang The doc of configuration "spark.sql.sources.parallelPartitionDiscovery.threshold" is not accurate on the part "This applies to Parquet, ORC, CSV, JSON and LibSVM data sources". We should revise it as effective on all the file-based data sources. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21065) Spark Streaming concurrentJobs + StreamingJobProgressListener conflict
[ https://issues.apache.org/jira/browse/SPARK-21065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040456#comment-17040456 ] Artur Sukhenko commented on SPARK-21065: [~zsxwing] Is `spark.streaming.concurrentJobs` still (2.2/2.3/2.4) risky? > Spark Streaming concurrentJobs + StreamingJobProgressListener conflict > -- > > Key: SPARK-21065 > URL: https://issues.apache.org/jira/browse/SPARK-21065 > Project: Spark > Issue Type: Bug > Components: DStreams, Scheduler, Web UI >Affects Versions: 2.1.0 >Reporter: Dan Dutrow >Priority: Major > > My streaming application has 200+ output operations, many of them stateful > and several of them windowed. In an attempt to reduce the processing times, I > set "spark.streaming.concurrentJobs" to 2+. Initial results are very > positive, cutting our processing time from ~3 minutes to ~1 minute, but > eventually we encounter an exception as follows: > (Note that 149697756 ms is 2017-06-09 03:06:00, so it's trying to get a > batch from 45 minutes before the exception is thrown.) 
> 2017-06-09 03:50:28,259 [Spark Listener Bus] ERROR > org.apache.spark.streaming.scheduler.StreamingListenerBus - Listener > StreamingJobProgressListener threw an exception > java.util.NoSuchElementException: key not found: 149697756 ms > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:59) > at scala.collection.mutable.HashMap.apply(HashMap.scala:65) > at > org.apache.spark.streaming.ui.StreamingJobProgressListener.onOutputOperationCompleted(StreamingJobProgressListener.scala:128) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:67) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.doPostEvent(StreamingListenerBus.scala:29) > at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:63) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.postToAll(StreamingListenerBus.scala:29) > at > org.apache.spark.streaming.scheduler.StreamingListenerBus.onOtherEvent(StreamingListenerBus.scala:43) > ... > The Spark code causing the exception is here: > https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC125 > override def onOutputOperationCompleted( > outputOperationCompleted: StreamingListenerOutputOperationCompleted): > Unit = synchronized { > // This method is called before onBatchCompleted > {color:red}runningBatchUIData(outputOperationCompleted.outputOperationInfo.batchTime).{color} > updateOutputOperationInfo(outputOperationCompleted.outputOperationInfo) > } > It seems to me that it may be caused by that batch being removed earlier. 
> https://github.com/apache/spark/blob/branch-2.1/streaming/src/main/scala/org/apache/spark/streaming/ui/StreamingJobProgressListener.scala#LC102 > override def onBatchCompleted(batchCompleted: > StreamingListenerBatchCompleted): Unit = { > synchronized { > waitingBatchUIData.remove(batchCompleted.batchInfo.batchTime) > > {color:red}runningBatchUIData.remove(batchCompleted.batchInfo.batchTime){color} > val batchUIData = BatchUIData(batchCompleted.batchInfo) > completedBatchUIData.enqueue(batchUIData) > if (completedBatchUIData.size > batchUIDataLimit) { > val removedBatch = completedBatchUIData.dequeue() > batchTimeToOutputOpIdSparkJobIdPair.remove(removedBatch.batchTime) > } > totalCompletedBatches += 1L > totalProcessedRecords += batchUIData.numRecords > } > } > What is the solution here? Should I make my spark streaming context remember > duration a lot longer? ssc.remember(batchDuration * rememberMultiple) > Otherwise, it seems like there should be some kind of existence check on > runningBatchUIData before dereferencing it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
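The existence check suggested at the end of the report can be sketched as follows. This is a minimal, hypothetical illustration of the guard (not the actual Spark Scala code; names simplified): look the batch up with `.get()` instead of indexing, so a late event for a batch that `onBatchCompleted` already evicted is ignored instead of raising an error (the Python analogue of Scala's `NoSuchElementException` from `HashMap.apply`).

```python
# Hypothetical sketch of the proposed existence check (not Spark's code).
# running_batch_ui_data maps batchTime -> per-batch UI data.
running_batch_ui_data = {}

def on_output_operation_completed(batch_time, op_info):
    batch = running_batch_ui_data.get(batch_time)
    if batch is None:
        return  # batch already completed and evicted; drop the late event
    batch.setdefault("ops", []).append(op_info)

# A late event for an evicted batch no longer raises an exception:
on_output_operation_completed(149697756, {"id": 0})
```

With `concurrentJobs` > 1, output-operation events can arrive after the batch finishes, which is exactly when the unguarded `apply` call blows up.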
[jira] [Updated] (SPARK-30879) Refine doc-building workflow
[ https://issues.apache.org/jira/browse/SPARK-30879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30879: -- Affects Version/s: (was: 3.0.0) 3.1.0 > Refine doc-building workflow > > > Key: SPARK-30879 > URL: https://issues.apache.org/jira/browse/SPARK-30879 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > There are a few rough edges in the workflow for building docs that could be > refined: > * sudo pip installing stuff > * no pinned versions of any doc dependencies > * using some deprecated options > * race condition with jekyll serve -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30880) Delete Sphinx Makefile cruft
[ https://issues.apache.org/jira/browse/SPARK-30880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30880: -- Description: (was: There are a few rough edges in the workflow for building docs that could be refined: * sudo pip installing stuff * no pinned versions of any doc dependencies * using some deprecated options * race condition with jekyll serve) > Delete Sphinx Makefile cruft > > > Key: SPARK-30880 > URL: https://issues.apache.org/jira/browse/SPARK-30880 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30880) Delete Sphinx Makefile cruft
[ https://issues.apache.org/jira/browse/SPARK-30880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30880: -- Affects Version/s: (was: 3.0.0) 3.1.0 > Delete Sphinx Makefile cruft > > > Key: SPARK-30880 > URL: https://issues.apache.org/jira/browse/SPARK-30880 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30880) Delete Sphinx Makefile cruft
Dongjoon Hyun created SPARK-30880: - Summary: Delete Sphinx Makefile cruft Key: SPARK-30880 URL: https://issues.apache.org/jira/browse/SPARK-30880 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.0.0 Reporter: Nicholas Chammas There are a few rough edges in the workflow for building docs that could be refined: * sudo pip installing stuff * no pinned versions of any doc dependencies * using some deprecated options * race condition with jekyll serve -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30731) Update deprecated Mkdocs option
[ https://issues.apache.org/jira/browse/SPARK-30731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-30731. --- Fix Version/s: 2.4.6 Assignee: Nicholas Chammas Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/27626 > Update deprecated Mkdocs option > --- > > Key: SPARK-30731 > URL: https://issues.apache.org/jira/browse/SPARK-30731 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Trivial > Fix For: 2.4.6 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
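The ticket does not name the deprecated option. A common Mkdocs deprecation of this kind (an assumption here, not confirmed by the ticket) is the Mkdocs 1.0 rename of `pages:` to `nav:` in `mkdocs.yml`:

```yaml
# Before (deprecated since Mkdocs 1.0):
# pages:
#   - Home: index.md
# After:
nav:
  - Home: index.md
```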
[jira] [Updated] (SPARK-30731) Update deprecated Mkdocs option
[ https://issues.apache.org/jira/browse/SPARK-30731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30731: -- Issue Type: Bug (was: Improvement) > Update deprecated Mkdocs option > --- > > Key: SPARK-30731 > URL: https://issues.apache.org/jira/browse/SPARK-30731 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > There are a few rough edges in the workflow for building docs that could be > refined: > * sudo pip installing stuff > * no pinned versions of any doc dependencies > * using some deprecated options > * race condition with jekyll serve -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30731) Update deprecated Mkdocs option
[ https://issues.apache.org/jira/browse/SPARK-30731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30731: -- Description: (was: There are a few rough edges in the workflow for building docs that could be refined: * sudo pip installing stuff * no pinned versions of any doc dependencies * using some deprecated options * race condition with jekyll serve) > Update deprecated Mkdocs option > --- > > Key: SPARK-30731 > URL: https://issues.apache.org/jira/browse/SPARK-30731 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30731) Update deprecated Mkdocs option
[ https://issues.apache.org/jira/browse/SPARK-30731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30731: -- Priority: Trivial (was: Minor) > Update deprecated Mkdocs option > --- > > Key: SPARK-30731 > URL: https://issues.apache.org/jira/browse/SPARK-30731 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Trivial > > There are a few rough edges in the workflow for building docs that could be > refined: > * sudo pip installing stuff > * no pinned versions of any doc dependencies > * using some deprecated options > * race condition with jekyll serve -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30879) Refine doc-building workflow
Dongjoon Hyun created SPARK-30879: - Summary: Refine doc-building workflow Key: SPARK-30879 URL: https://issues.apache.org/jira/browse/SPARK-30879 Project: Spark Issue Type: Improvement Components: Documentation Affects Versions: 3.0.0 Reporter: Nicholas Chammas There are a few rough edges in the workflow for building docs that could be refined: * sudo pip installing stuff * no pinned versions of any doc dependencies * using some deprecated options * race condition with jekyll serve -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30731) Update deprecated Mkdocs option
[ https://issues.apache.org/jira/browse/SPARK-30731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30731: -- Summary: Update deprecated Mkdocs option (was: Refine doc-building workflow) > Update deprecated Mkdocs option > --- > > Key: SPARK-30731 > URL: https://issues.apache.org/jira/browse/SPARK-30731 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.0.0 >Reporter: Nicholas Chammas >Priority: Minor > > There are a few rough edges in the workflow for building docs that could be > refined: > * sudo pip installing stuff > * no pinned versions of any doc dependencies > * using some deprecated options > * race condition with jekyll serve -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30811) CTE that refers to non-existent table with same name causes StackOverflowError
[ https://issues.apache.org/jira/browse/SPARK-30811?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040299#comment-17040299 ] Dongjoon Hyun commented on SPARK-30811: --- The test cases are landed to `master/branch-3.0` to prevent future regression. - https://github.com/apache/spark/pull/27635 > CTE that refers to non-existent table with same name causes StackOverflowError > -- > > Key: SPARK-30811 > URL: https://issues.apache.org/jira/browse/SPARK-30811 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5 >Reporter: Herman van Hövell >Assignee: Herman van Hövell >Priority: Major > Fix For: 2.4.6 > > > The following query causes a StackOverflowError: > {noformat} > WITH t AS (SELECT 1 FROM nonexist.t) SELECT * FROM t > {noformat} > This only happens when the CTE refers to a non-existent table with the same > name and a database qualifier. This is caused by a couple of things: > * {{CTESubstitution}} runs analysis on the CTE, but this does not throw an > exception because the table has a database qualifier. The reason is that we > don't fail is because we re-attempt to resolve the relation in a later rule. > * {{CTESubstitution}} replace logic does not check if the table it is > replacing has a database, it shouldn't replace the relation if it does. So > now we will happily replace {{nonexist.t}} with {{t}}. > * {{CTESubstitution}} transforms down, this means it will keep replacing > {{t}} with itself, creating an infinite recursion. > This is not an issue for master/3.0. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
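The guard described in the second bullet can be sketched as below. This is a simplified, hypothetical shape of the fix (not the actual Catalyst `CTESubstitution` API): a relation carrying a database qualifier such as `nonexist.t` must never be replaced by the CTE `t`, which is what triggered the infinite self-substitution.

```python
# Hypothetical sketch of the CTESubstitution guard (names simplified,
# not the actual Catalyst rule): only substitute a CTE for an
# unqualified table reference with a matching name.

def should_substitute(cte_name, database, table):
    """Return True only for an unqualified reference matching the CTE name."""
    return database is None and table.lower() == cte_name.lower()

assert should_substitute("t", None, "T") is True
assert should_substitute("t", "nonexist", "t") is False  # the bug's case
```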
[jira] [Resolved] (SPARK-29908) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Burak Yavuz resolved SPARK-29908. - Resolution: Duplicate > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Blocker > > Currently, any data source that that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30878) improve the CREATE TABLE document
Wenchen Fan created SPARK-30878: --- Summary: improve the CREATE TABLE document Key: SPARK-30878 URL: https://issues.apache.org/jira/browse/SPARK-30878 Project: Spark Issue Type: Documentation Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30877) Make Aggregate higher order function with Spark Aggregate Expressions
Herman van Hövell created SPARK-30877: - Summary: Make Aggregate higher order function with Spark Aggregate Expressions Key: SPARK-30877 URL: https://issues.apache.org/jira/browse/SPARK-30877 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Herman van Hövell The aggregate higher order function should be able to work with spark's aggregate functions. For example: {{aggregate(xs, x -> avg(x - 1) + 1)}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30876) Optimizer cannot infer more constraint
Yuming Wang created SPARK-30876: --- Summary: Optimizer cannot infer more constraint Key: SPARK-30876 URL: https://issues.apache.org/jira/browse/SPARK-30876 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.5, 2.3.4, 3.0.0 Reporter: Yuming Wang How to reproduce this issue: {code:sql} create table t1(a int, b int, c int); create table t2(a int, b int, c int); create table t3(a int, b int, c int); {code} Spark 2.3+: {noformat} == Physical Plan == *(4) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition, true, [id=#102] +- *(3) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(3) Project +- *(3) BroadcastHashJoin [b#10], [c#14], Inner, BuildRight :- *(3) Project [b#10] : +- *(3) BroadcastHashJoin [a#6], [b#10], Inner, BuildRight : :- *(3) Project [a#6] : : +- *(3) Filter isnotnull(a#6) : : +- *(3) ColumnarToRow : :+- FileScan parquet default.t1[a#6] Batched: true, DataFilters: [isnotnull(a#6)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct : +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#87] :+- *(1) Project [b#10] : +- *(1) Filter (isnotnull(b#10) AND (b#10 = 1)) : +- *(1) ColumnarToRow : +- FileScan parquet default.t2[b#10] Batched: true, DataFilters: [isnotnull(b#10), (b#10 = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(b), EqualTo(b,1)], ReadSchema: struct +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#96] +- *(2) Project [c#14] +- *(2) Filter (isnotnull(c#14) AND (c#14 = 1)) +- *(2) ColumnarToRow +- FileScan parquet default.t3[c#14] Batched: true, DataFilters: [isnotnull(c#14), (c#14 = 1)], Format: Parquet, Location: 
InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c), EqualTo(c,1)], ReadSchema: struct Time taken: 3.785 seconds, Fetched 1 row(s) {noformat} Spark 2.2.x: {noformat} == Physical Plan == *HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition +- *HashAggregate(keys=[], functions=[partial_count(1)]) +- *Project +- *SortMergeJoin [b#19], [c#23], Inner :- *Project [b#19] : +- *SortMergeJoin [a#15], [b#19], Inner : :- *Sort [a#15 ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(a#15, 200) : : +- *Filter (isnotnull(a#15) && (a#15 = 1)) : :+- HiveTableScan [a#15], HiveTableRelation `default`.`t1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#15, b#16, c#17] : +- *Sort [b#19 ASC NULLS FIRST], false, 0 :+- Exchange hashpartitioning(b#19, 200) : +- *Filter (isnotnull(b#19) && (b#19 = 1)) : +- HiveTableScan [b#19], HiveTableRelation `default`.`t2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#18, b#19, c#20] +- *Sort [c#23 ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(c#23, 200) +- *Filter (isnotnull(c#23) && (c#23 = 1)) +- HiveTableScan [c#23], HiveTableRelation `default`.`t3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#21, b#22, c#23] Time taken: 0.728 seconds, Fetched 1 row(s) {noformat} Spark 2.2 can infer {{(a#15 = 1)}}, but Spark 2.3+ can't. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
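The reproduce section above creates the tables but the query itself is not included. From the physical plans (joins on `t1.a = t2.b` and `t2.b = t3.c`, with `b` and `c` filtered to 1), a query of roughly this shape is implied; this is a reconstruction for illustration, not stated in the ticket:

```sql
-- Assumed repro shape (reconstructed from the plans above):
SELECT count(*)
FROM t1
JOIN t2 ON t1.a = t2.b
JOIN t3 ON t2.b = t3.c
WHERE t2.b = 1;
-- Spark 2.2 also infers t1.a = 1 from a = b AND b = 1; Spark 2.3+ does not.
```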
[jira] [Commented] (SPARK-24497) ANSI SQL: Recursive query
[ https://issues.apache.org/jira/browse/SPARK-24497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040121#comment-17040121 ] Peter Toth commented on SPARK-24497: Thanks [~dmateusp] for your +1. That PR is mine and I will resolve the conflicts soon. Hopefully it will get some review after that. > ANSI SQL: Recursive query > - > > Key: SPARK-24497 > URL: https://issues.apache.org/jira/browse/SPARK-24497 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > h3. *Examples* > Here is an example for {{WITH RECURSIVE}} clause usage. Table "department" > represents the structure of an organization as an adjacency list. > {code:sql} > CREATE TABLE department ( > id INTEGER PRIMARY KEY, -- department ID > parent_department INTEGER REFERENCES department, -- upper department ID > name TEXT -- department name > ); > INSERT INTO department (id, parent_department, "name") > VALUES > (0, NULL, 'ROOT'), > (1, 0, 'A'), > (2, 1, 'B'), > (3, 2, 'C'), > (4, 2, 'D'), > (5, 0, 'E'), > (6, 4, 'F'), > (7, 5, 'G'); > -- department structure represented here is as follows: > -- > -- ROOT-+->A-+->B-+->C > -- | | > -- | +->D-+->F > -- +->E-+->G > {code} > > To extract all departments under A, you can use the following recursive > query: > {code:sql} > WITH RECURSIVE subdepartment AS > ( > -- non-recursive term > SELECT * FROM department WHERE name = 'A' > UNION ALL > -- recursive term > SELECT d.* > FROM > department AS d > JOIN > subdepartment AS sd > ON (d.parent_department = sd.id) > ) > SELECT * > FROM subdepartment > ORDER BY name; > {code} > More details: > [http://wiki.postgresql.org/wiki/CTEReadme] > [https://info.teradata.com/htmlpubs/DB_TTU_16_00/index.html#page/SQL_Reference/B035-1141-160K/lqe1472241402390.html] > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org
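For the sample data in the quoted example, the recursive query should return the subtree rooted at A. This expected result is worked out here for illustration (it is not part of the ticket):

```sql
-- Expected result of the WITH RECURSIVE query above, ordered by name:
--  id | parent_department | name
--   1 |                 0 | A
--   2 |                 1 | B
--   3 |                 2 | C
--   4 |                 2 | D
--   6 |                 4 | F
-- E and G are excluded: they hang under ROOT, not under A.
```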
[jira] [Commented] (SPARK-30875) Revisit the decision of writing parquet TIMESTAMP_MICROS by default
[ https://issues.apache.org/jira/browse/SPARK-30875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17040042#comment-17040042 ] Wenchen Fan commented on SPARK-30875: - cc [~maxgekk] [~hyukjin.kwon] > Revisit the decision of writing parquet TIMESTAMP_MICROS by default > --- > > Key: SPARK-30875 > URL: https://issues.apache.org/jira/browse/SPARK-30875 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > In Spark 3.0, we write out timestamp values as parquet TIMESTAMP_MICROS by > default, instead of INT96. This is good in general as Spark can read all > kinds of parquet timestamps, but works better with TIMESTAMP_MICROS. > However, this brings some troubles with hive compatibility. Spark can use > native parquet writer to write hive parquet tables, which may break hive > compatibility if Spark writes TIMESTAMP_MICROS. > We can switch back to INT96 by default, or fix it: > 1. when using native parquet writer to write hive parquet tables, write > timestamp as INT96. > 2. when creating tables in `HiveExternalCatalog.createTable`, don't claim the > parquet table is hive compatible if it has timestamp columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
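Until a decision is made, there is a per-session escape hatch via the existing writer config (assuming the Spark 3.0 default described above; shown here as a sketch, not a recommendation from the ticket):

```sql
-- Switch the parquet writer back to INT96 for the current session:
SET spark.sql.parquet.outputTimestampType = INT96;
```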
[jira] [Updated] (SPARK-30875) Revisit the decision of writing parquet TIMESTAMP_MICROS by default
[ https://issues.apache.org/jira/browse/SPARK-30875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30875: Description: In Spark 3.0, we write out timestamp values as parquet TIMESTAMP_MICROS by default, instead of INT96. This is good in general as Spark can read all kinds of parquet timestamps, but works better with TIMESTAMP_MICROS. However, this brings some troubles with hive compatibility. Spark can use native parquet writer to write hive parquet tables, which may break hive compatibility if Spark writes TIMESTAMP_MICROS. We can switch back to INT96 by default, or fix it: 1. when using native parquet writer to write hive parquet tables, write timestamp as INT96. 2. when creating tables in `HiveExternalCatalog.createTable`, don't claim the parquet table is hive compatible if it has timestamp columns. was:In Spark 3.0, we write out timestamp values as parquet > Revisit the decision of writing parquet TIMESTAMP_MICROS by default > --- > > Key: SPARK-30875 > URL: https://issues.apache.org/jira/browse/SPARK-30875 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > In Spark 3.0, we write out timestamp values as parquet TIMESTAMP_MICROS by > default, instead of INT96. This is good in general as Spark can read all > kinds of parquet timestamps, but works better with TIMESTAMP_MICROS. > However, this brings some troubles with hive compatibility. Spark can use > native parquet writer to write hive parquet tables, which may break hive > compatibility if Spark writes TIMESTAMP_MICROS. > We can switch back to INT96 by default, or fix it: > 1. when using native parquet writer to write hive parquet tables, write > timestamp as INT96. > 2. when creating tables in `HiveExternalCatalog.createTable`, don't claim the > parquet table is hive compatible if it has timestamp columns. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30875) Revisit the decision of writing parquet TIMESTAMP_MICROS by default
[ https://issues.apache.org/jira/browse/SPARK-30875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30875: Description: In Spark 3.0, we write out timestamp values as parquet > Revisit the decision of writing parquet TIMESTAMP_MICROS by default > --- > > Key: SPARK-30875 > URL: https://issues.apache.org/jira/browse/SPARK-30875 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Priority: Major > > In Spark 3.0, we write out timestamp values as parquet -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30875) Revisit the decision of writing parquet TIMESTAMP_MICROS by default
Wenchen Fan created SPARK-30875: --- Summary: Revisit the decision of writing parquet TIMESTAMP_MICROS by default Key: SPARK-30875 URL: https://issues.apache.org/jira/browse/SPARK-30875 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-30874) Support Postgres Kerberos login in JDBC connector
Gabor Somogyi created SPARK-30874: - Summary: Support Postgres Kerberos login in JDBC connector Key: SPARK-30874 URL: https://issues.apache.org/jira/browse/SPARK-30874 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 2.4.5 Reporter: Gabor Somogyi -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12312) JDBC connection to Kerberos secured databases fails on remote executors
[ https://issues.apache.org/jira/browse/SPARK-12312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039975#comment-17039975 ] Gabor Somogyi commented on SPARK-12312: --- After deep analysis it has turned out each database type needs custom implementation so creating subtasks to handle them. > JDBC connection to Kerberos secured databases fails on remote executors > --- > > Key: SPARK-12312 > URL: https://issues.apache.org/jira/browse/SPARK-12312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2, 2.4.2 >Reporter: nabacg >Priority: Minor > > When loading DataFrames from JDBC datasource with Kerberos authentication, > remote executors (yarn-client/cluster etc. modes) fail to establish a > connection due to lack of Kerberos ticket or ability to generate it. > This is a real issue when trying to ingest data from kerberized data sources > (SQL Server, Oracle) in enterprise environment where exposing simple > authentication access is not an option due to IT policy issues. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30763) Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract
[ https://issues.apache.org/jira/browse/SPARK-30763?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30763. - Fix Version/s: 2.4.6 3.0.0 Assignee: jiaan.geng Resolution: Fixed > Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract > - > > Key: SPARK-30763 > URL: https://issues.apache.org/jira/browse/SPARK-30763 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.0.0, 2.4.6 > > > The current implementation of regexp_extract throws an unhandled exception, > shown below: > SELECT regexp_extract('1a 2b 14m', '\\d+') > > {code:java} > [info] org.apache.spark.SparkException: Job aborted due to stage failure: > Task 1 in stage 22.0 failed 1 times, most recent failure: Lost task 1.0 in > stage 22.0 (TID 33, 192.168.1.6, executor driver): > java.lang.IndexOutOfBoundsException: No group 1 > [info] at java.util.regex.Matcher.group(Matcher.java:538) > [info] at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > [info] at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > [info] at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729) > [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > [info] at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) > [info] at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1804) > [info] at org.apache.spark.rdd.RDD.$anonfun$count$1(RDD.scala:1227) > [info] at org.apache.spark.rdd.RDD.$anonfun$count$1$adapted(RDD.scala:1227) > [info] at > org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2156) > [info] at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > [info] at org.apache.spark.scheduler.Task.run(Task.scala:127) > [info] at > 
org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444) > [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) > [info] at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447) > [info] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [info] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [info] at java.lang.Thread.run(Thread.java:748) > {code} > > I think we should handle this exception properly. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
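The failure above comes from `java.util.regex.Matcher.group(1)` being called on a pattern that defines no capturing groups: regexp_extract defaults to group index 1, but `\\d+` has zero groups. A minimal sketch of the root cause and the up-front validation the ticket asks for, in plain Python rather than Spark's actual Scala code path:

```python
import re

# Illustrative analogy: Python's re module fails the same way as
# java.util.regex when a group index exceeds the groups in the pattern.
def extract(pattern, text, group_idx=1):
    """Mimic regexp_extract semantics: return the matched group, empty
    string on no match, and a clear error for an invalid group index."""
    compiled = re.compile(pattern)
    if group_idx > compiled.groups:
        # The gist of the SPARK-30763 fix: validate the group index up
        # front instead of surfacing a low-level IndexOutOfBoundsException.
        raise ValueError(
            f"Regex group index {group_idx} exceeds the {compiled.groups} "
            f"group(s) defined by pattern {pattern!r}")
    m = compiled.search(text)
    return m.group(group_idx) if m else ""

print(extract(r"(\d+)", "1a 2b 14m"))   # a capturing group exists -> 1
try:
    extract(r"\d+", "1a 2b 14m")        # no capturing groups -> clear error
except ValueError as e:
    print(e)
```

The names `extract` and its validation message are hypothetical; only the group-index semantics mirror the ticket.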
[jira] [Resolved] (SPARK-30786) Block replication is not retried on other BlockManagers when it fails on 1 of the peers
[ https://issues.apache.org/jira/browse/SPARK-30786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-30786. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27539 [https://github.com/apache/spark/pull/27539] > Block replication is not retried on other BlockManagers when it fails on 1 of > the peers > --- > > Key: SPARK-30786 > URL: https://issues.apache.org/jira/browse/SPARK-30786 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Prakhar Jain >Assignee: Prakhar Jain >Priority: Major > Fix For: 3.1.0 > > > When we cache an RDD with replication > 1, the RDD block is first cached > locally on one of the BlockManagers and then replicated to > (replication-1) other BlockManagers. While replicating a block, if > replication fails on one of the peers, it is supposed to retry the > replication on some other peer (based on the > "spark.storage.maxReplicationFailures" config). But currently this retry does not > happen. > Logs of one of the executors trying to replicate: > {noformat} > 20/02/10 09:01:47 INFO Executor: Starting executor ID 1 on host > wn11-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net > . > . > . 
> 20/02/10 09:06:45 INFO Executor: Running task 244.0 in stage 3.0 (TID 550) > 20/02/10 09:06:45 DEBUG BlockManager: Getting local block rdd_13_244 > 20/02/10 09:06:45 DEBUG BlockManager: Block rdd_13_244 was not found > 20/02/10 09:06:45 DEBUG BlockManager: Getting remote block rdd_13_244 > 20/02/10 09:06:45 DEBUG BlockManager: Block rdd_13_244 not found > 20/02/10 09:06:46 INFO MemoryStore: Block rdd_13_244 stored as values in > memory (estimated size 33.3 MB, free 44.2 MB) > 20/02/10 09:06:46 DEBUG BlockManager: Told master about block rdd_13_244 > 20/02/10 09:06:46 DEBUG BlockManager: Put block rdd_13_244 locally took 947 > ms > 20/02/10 09:06:46 DEBUG BlockManager: Level for block rdd_13_244 is > StorageLevel(memory, deserialized, 3 replicas) > 20/02/10 09:06:46 TRACE BlockManager: Trying to replicate rdd_13_244 of > 34908552 bytes to BlockManagerId(2, > wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None) > 20/02/10 09:06:47 TRACE BlockManager: Replicated rdd_13_244 of 34908552 bytes > to BlockManagerId(2, > wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None) > in 205.849858 ms > 20/02/10 09:06:47 TRACE BlockManager: Trying to replicate rdd_13_244 of > 34908552 bytes to BlockManagerId(5, > wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None) > 20/02/10 09:06:47 TRACE BlockManager: Replicated rdd_13_244 of 34908552 bytes > to BlockManagerId(5, > wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None) > in 180.501504 ms > 20/02/10 09:06:47 DEBUG BlockManager: Replicating rdd_13_244 of 34908552 > bytes to 2 peer(s) took 387.381168 ms > 20/02/10 09:06:47 DEBUG BlockManager: block rdd_13_244 replicated to > BlockManagerId(5, > wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None), > BlockManagerId(2, > wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None) > 20/02/10 09:06:47 DEBUG BlockManager: Put block rdd_13_244 remotely 
took 423 > ms > 20/02/10 09:06:47 DEBUG BlockManager: Putting block rdd_13_244 with > replication took 1371 ms > 20/02/10 09:06:47 DEBUG BlockManager: Getting local block rdd_13_244 > 20/02/10 09:06:47 DEBUG BlockManager: Level for block rdd_13_244 is > StorageLevel(memory, deserialized, 3 replicas) > 20/02/10 09:06:47 INFO Executor: Finished task 244.0 in stage 3.0 (TID 550). > 2253 bytes result sent to driver > {noformat} > Logs of other executor where the block is being replicated to: > {noformat} > 20/02/10 09:01:47 INFO Executor: Starting executor ID 5 on host > wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net > . > . > . > 20/02/10 09:06:47 INFO MemoryStore: Will not store rdd_13_244 > 20/02/10 09:06:47 WARN MemoryStore: Not enough space to cache rdd_13_244 in > memory! (computed 4.2 MB so far) > 20/02/10 09:06:47 INFO MemoryStore: Memory use = 4.9 GB (blocks) + 7.3 MB > (scratch space shared across 2 tasks(s)) = 4.9 GB. Storage limit = 4.9 GB. > 20/02/10 09:06:47 DEBUG BlockManager: Put block rdd_13_244 locally took 12 ms > 20/02/10 09:06:47 WARN BlockManager: Block rdd_13_244 could not be removed as > it was not found on disk or in memory > 20/02/10 09:06:47 WARN BlockManager: Putting block rdd_13_244 failed > 20/02/10 09:06:47 DEBUG BlockManager: Putting block rdd_13_244 without > replication took 13 ms > {noformat} > Note here that the block replication
[jira] [Assigned] (SPARK-30786) Block replication is not retried on other BlockManagers when it fails on 1 of the peers
[ https://issues.apache.org/jira/browse/SPARK-30786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-30786: --- Assignee: Prakhar Jain > Block replication is not retried on other BlockManagers when it fails on 1 of > the peers > --- > > Key: SPARK-30786 > URL: https://issues.apache.org/jira/browse/SPARK-30786 > Project: Spark > Issue Type: Bug > Components: Block Manager, Spark Core >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Prakhar Jain >Assignee: Prakhar Jain >Priority: Major > > When we cache an RDD with replication > 1, Firstly the RDD block is cached > locally on one of the BlockManager and then it is replicated to > (replication-1) number of BlockManagers. While replicating a block, if > replication fails on one of the peers, it is supposed to retry the > replication on some other peer (based on > "spark.storage.maxReplicationFailures" config). But currently this doesn't > happen because of some issue. > Logs of 1 of the executor which is trying to replicate: > {noformat} > 20/02/10 09:01:47 INFO Executor: Starting executor ID 1 on host > wn11-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net > . > . > . 
> 20/02/10 09:06:45 INFO Executor: Running task 244.0 in stage 3.0 (TID 550) > 20/02/10 09:06:45 DEBUG BlockManager: Getting local block rdd_13_244 > 20/02/10 09:06:45 DEBUG BlockManager: Block rdd_13_244 was not found > 20/02/10 09:06:45 DEBUG BlockManager: Getting remote block rdd_13_244 > 20/02/10 09:06:45 DEBUG BlockManager: Block rdd_13_244 not found > 20/02/10 09:06:46 INFO MemoryStore: Block rdd_13_244 stored as values in > memory (estimated size 33.3 MB, free 44.2 MB) > 20/02/10 09:06:46 DEBUG BlockManager: Told master about block rdd_13_244 > 20/02/10 09:06:46 DEBUG BlockManager: Put block rdd_13_244 locally took 947 > ms > 20/02/10 09:06:46 DEBUG BlockManager: Level for block rdd_13_244 is > StorageLevel(memory, deserialized, 3 replicas) > 20/02/10 09:06:46 TRACE BlockManager: Trying to replicate rdd_13_244 of > 34908552 bytes to BlockManagerId(2, > wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None) > 20/02/10 09:06:47 TRACE BlockManager: Replicated rdd_13_244 of 34908552 bytes > to BlockManagerId(2, > wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None) > in 205.849858 ms > 20/02/10 09:06:47 TRACE BlockManager: Trying to replicate rdd_13_244 of > 34908552 bytes to BlockManagerId(5, > wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None) > 20/02/10 09:06:47 TRACE BlockManager: Replicated rdd_13_244 of 34908552 bytes > to BlockManagerId(5, > wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None) > in 180.501504 ms > 20/02/10 09:06:47 DEBUG BlockManager: Replicating rdd_13_244 of 34908552 > bytes to 2 peer(s) took 387.381168 ms > 20/02/10 09:06:47 DEBUG BlockManager: block rdd_13_244 replicated to > BlockManagerId(5, > wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36463, None), > BlockManagerId(2, > wn10-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net, 36711, None) > 20/02/10 09:06:47 DEBUG BlockManager: Put block rdd_13_244 remotely 
took 423 > ms > 20/02/10 09:06:47 DEBUG BlockManager: Putting block rdd_13_244 with > replication took 1371 ms > 20/02/10 09:06:47 DEBUG BlockManager: Getting local block rdd_13_244 > 20/02/10 09:06:47 DEBUG BlockManager: Level for block rdd_13_244 is > StorageLevel(memory, deserialized, 3 replicas) > 20/02/10 09:06:47 INFO Executor: Finished task 244.0 in stage 3.0 (TID 550). > 2253 bytes result sent to driver > {noformat} > Logs of other executor where the block is being replicated to: > {noformat} > 20/02/10 09:01:47 INFO Executor: Starting executor ID 5 on host > wn2-prakha.mvqvy0u1catevlxn5wwhjss34f.bx.internal.cloudapp.net > . > . > . > 20/02/10 09:06:47 INFO MemoryStore: Will not store rdd_13_244 > 20/02/10 09:06:47 WARN MemoryStore: Not enough space to cache rdd_13_244 in > memory! (computed 4.2 MB so far) > 20/02/10 09:06:47 INFO MemoryStore: Memory use = 4.9 GB (blocks) + 7.3 MB > (scratch space shared across 2 tasks(s)) = 4.9 GB. Storage limit = 4.9 GB. > 20/02/10 09:06:47 DEBUG BlockManager: Put block rdd_13_244 locally took 12 ms > 20/02/10 09:06:47 WARN BlockManager: Block rdd_13_244 could not be removed as > it was not found on disk or in memory > 20/02/10 09:06:47 WARN BlockManager: Putting block rdd_13_244 failed > 20/02/10 09:06:47 DEBUG BlockManager: Putting block rdd_13_244 without > replication took 13 ms > {noformat} > Note here that the block replication failed in Executor-5 with log line "Not > enough space to cache rdd_13_244 in memory!". But Executor-1 shows that block > is
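The intended retry behaviour this ticket restores can be sketched as a peer-selection loop that re-draws from the remaining peers on a failed put, up to a failure budget. This is an illustrative model (the function and parameter names are hypothetical), not the actual BlockManager code:

```python
import random

def replicate(block, peers, target_replicas, max_replication_failures=1,
              put_block=None):
    """Try to place `target_replicas` copies of `block` on `peers`.

    On a failed put, draw another candidate instead of giving up, until
    the failure budget (cf. spark.storage.maxReplicationFailures, which
    defaults to 1) is exhausted. Illustrative sketch only.
    """
    candidates = list(peers)
    random.shuffle(candidates)  # mimic random peer selection
    replicated, failures = [], 0
    while candidates and len(replicated) < target_replicas:
        peer = candidates.pop()
        if put_block(peer, block):
            replicated.append(peer)
        else:
            failures += 1
            if failures > max_replication_failures:
                break  # budget exhausted; accept fewer replicas
    return replicated

# Peer "b" rejects the block (e.g. not enough memory, as in the logs
# above), so the block should still land on two of the remaining peers.
placed = replicate("rdd_13_244", ["a", "b", "c", "d"], 2,
                   put_block=lambda peer, blk: peer != "b")
print(placed)
```

The bug described above corresponds to breaking out of the loop on the first failed put instead of consulting the failure budget.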
[jira] [Comment Edited] (SPARK-26346) Upgrade parquet to 1.11.1
[ https://issues.apache.org/jira/browse/SPARK-26346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039904#comment-17039904 ] Ismaël Mejía edited comment on SPARK-26346 at 2/19/20 10:54 AM: It seems this issue is blocked by dependency conflicts caused by Spark's dependency on Hive. Hive depends on both Avro 1.7 (for Hive 1.x) and Avro 1.8 (for Hive 2.x). Parquet depends on Avro 1.9. So there are issues from API changes in Avro. I proposed a patch for Hive with the hope that it gets backported into Hive 2.x to unlock the Avro 1.9 upgrade on Spark (and transitively this issue) but so far I have not received any attention from the Hive community; for more info see HIVE-21737 and [my multiple calls for attention in the Hive ML|https://lists.apache.org/thread.html/rc6c672ad4a5e255957d54d80ff83bf48eacece2828a86bc6cedd9c4c%40%3Cdev.hive.apache.org%3E]. If someone can ping somebody in the Hive community who can help this advance that would be great! was (Author: iemejia): It seems this issue is blocked by dependency conflicts caused by Spark's dependency on Hive. Hive depends on both Avro 1.7 (for Hive 1.x) and Avro 1.8 (for Hive 2.x). Parquet depends on Avro 1.9. So there are issues from API changes in Avro. I proposed a patch for Hive with the hope that it gets backported into Hive 2.x but so far I have not received any attention from the Hive community; for more info see HIVE-21737 and [my multiple calls for attention in the Hive ML|https://lists.apache.org/thread.html/rc6c672ad4a5e255957d54d80ff83bf48eacece2828a86bc6cedd9c4c%40%3Cdev.hive.apache.org%3E]. If someone can ping somebody in the Hive community who can help this advance that would be great! 
> Upgrade parquet to 1.11.1 > - > > Key: SPARK-26346 > URL: https://issues.apache.org/jira/browse/SPARK-26346 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26346) Upgrade parquet to 1.11.1
[ https://issues.apache.org/jira/browse/SPARK-26346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039904#comment-17039904 ] Ismaël Mejía commented on SPARK-26346: -- It seems this issue is blocked by dependency conflicts caused by Spark's dependency on Hive. Hive depends on both Avro 1.7 (for Hive 1.x) and Avro 1.8 (for Hive 2.x). Parquet depends on Avro 1.9. So there are issues from API changes in Avro. I proposed a patch for Hive with the hope that it gets backported into Hive 2.x but so far I have not received any attention from the Hive community; for more info see HIVE-21737 and [my multiple calls for attention in the Hive ML|https://lists.apache.org/thread.html/rc6c672ad4a5e255957d54d80ff83bf48eacece2828a86bc6cedd9c4c%40%3Cdev.hive.apache.org%3E]. If someone can ping somebody in the Hive community who can help this advance that would be great! > Upgrade parquet to 1.11.1 > - > > Key: SPARK-26346 > URL: https://issues.apache.org/jira/browse/SPARK-26346 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30873) Handling Node Decommissioning for Yarn cluster manager in Spark
[ https://issues.apache.org/jira/browse/SPARK-30873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039884#comment-17039884 ] Saurabh Chawla commented on SPARK-30873: We have raised the WIP PR for this. cc [~holdenkarau] [~itskals][~amargoor] > Handling Node Decommissioning for Yarn cluster manager in Spark > -- > > Key: SPARK-30873 > URL: https://issues.apache.org/jira/browse/SPARK-30873 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.0.0 >Reporter: Saurabh Chawla >Priority: Major > > In many public cloud environments, node loss (in the case of AWS > Spot loss, Spot blocks and GCP preemptible VMs) is a planned and informed > activity. > The cloud provider intimates the cluster manager about the possible loss of a > node ahead of time. A few examples are listed here: > a) Spot loss in AWS (2 min before the event) > b) GCP Pre-emptible VM loss (30 seconds before the event) > c) AWS Spot block loss with info on termination time (generally a few tens of > minutes before decommission, as configured in Yarn) > This JIRA tries to make Spark leverage the knowledge of the upcoming node > loss, and tries to adjust the scheduling of tasks to minimise the impact on > the application. > It is well known that when a host is lost, the executors, its running tasks, > their caches and also shuffle data are lost. This could result in wastage of > compute and other resources. > The focus here is to build a framework for YARN that can be extended for > other cluster managers to handle such scenarios. > The framework must handle one or more of the following: > 1) Prevent new tasks from starting on any executors on decommissioning nodes. > 2) Decide to kill the running tasks so that they can be restarted elsewhere > (assuming they will not complete within the deadline) OR allow them to > continue hoping they will finish within the deadline. 
> 3) Clear the shuffle data entry from the MapOutputTracker for a decommissioned > node's hostname to prevent the shuffle FetchFailed exception. The most significant > advantage of unregistering shuffle outputs is that, when Spark schedules the first > re-attempt to compute the missing blocks, it notices all of the missing > blocks from decommissioned nodes and recovers in only one attempt. This > speeds up the recovery process significantly over the current Spark > implementation, where stages might be rescheduled multiple times to recompute > missing shuffles from all nodes, and prevents jobs from being stuck for hours > failing and recomputing. > 4) Prevent the stage from aborting due to the FetchFailed exception when a > node is decommissioned. In Spark there is a limited number of consecutive stage > attempts allowed before a stage is aborted. This is controlled by the config > spark.stage.maxConsecutiveAttempts. Not counting fetch failures due to > decommissioning of nodes towards stage failure improves the reliability of > the system. > Main components of the change: > 1) Get the ClusterInfo update from the Resource Manager -> Application Master > -> Spark Driver. > 2) DecommissionTracker, which resides inside the driver, tracks all the decommissioned > nodes and takes the necessary actions and state transitions. > 3) Based on the decommission node list, add hooks in the code to achieve: > a) No new tasks on the executor > b) Remove shuffle data mapping info for the node to be decommissioned from > the mapOutputTracker > c) Do not count fetchFailure from decommissioned nodes towards stage failure > On receiving info that a node is to be decommissioned, the below actions > need to be performed by the DecommissionTracker on the driver: > * Add an entry for the node in the DecommissionTracker with the termination time and > node state as "DECOMMISSIONING". > * Stop assigning any new tasks on executors on the nodes which are candidates > for decommission. This makes sure that, as tasks finish, the usage of > this node slowly dies down. 
> * Kill all the executors for the decommissioning nodes after a configurable > period of time, say "spark.graceful.decommission.executor.leasetimePct". This > killing ensures two things. Firstly, the task failure will be attributed to the > job failure count. Second, it avoids generating more shuffle data on the node > that will eventually be lost. The node state is set to > "EXECUTOR_DECOMMISSIONED". > * Mark shuffle data on the node as unavailable after > "spark.qubole.graceful.decommission.shuffedata.leasetimePct" time. This will > ensure that recomputation of missing shuffle partitions is done early, rather > than reducers failing with a time-consuming FetchFailure. The node state is > set to "SHUFFLE_DECOMMISSIONED". > * Mark the node as Terminated after the termination time. Now the state of the > node is "TERMINATED". > * Remove the node
[jira] [Updated] (SPARK-30873) Handling Node Decommissioning for Yarn cluster manager in Spark
[ https://issues.apache.org/jira/browse/SPARK-30873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Saurabh Chawla updated SPARK-30873: --- Description: In many public cloud environments, the node loss (in case of AWS SpotLoss,Spot blocks and GCP preemptible VMs) is a planned and informed activity. The cloud provider intimates the cluster manager about the possible loss of node ahead of time. Few examples is listed here: a) Spot loss in AWS(2 min before event) b) GCP Pre-emptible VM loss (30 second before event) c) AWS Spot block loss with info on termination time (generally few tens of minutes before decommission as configured in Yarn) This JIRA tries to make spark leverage the knowledge of the node loss in future, and tries to adjust the scheduling of tasks to minimise the impact on the application. It is well known that when a host is lost, the executors, its running tasks, their caches and also Shuffle data is lost. This could result in wastage of compute and other resources. The focus here is to build a framework for YARN, that can be extended for other cluster managers to handle such scenario. The framework must handle one or more of the following:- 1) Prevent new tasks from starting on any executors on decommissioning Nodes. 2) Decide to kill the running tasks so that they can be restarted elsewhere (assuming they will not complete within the deadline) OR we can allow them to continue hoping they will finish within deadline. 3) Clear the shuffle data entry from MapOutputTracker of decommission node hostname to prevent the shuffle fetchfailed exception.The most significant advantage of unregistering shuffle outputs when Spark schedules the first re-attempt to compute the missing blocks, it notices all of the missing blocks from decommissioned nodes and recovers in only one attempt. 
This speeds up the recovery process significantly over the scheduled Spark implementation, where stages might be rescheduled multiple times to recompute missing shuffles from all nodes, and prevent jobs from being stuck for hours failing and recomputing. 4) Prevent the stage to abort due to the fetchfailed exception in case of decommissioning of node. In Spark there is number of consecutive stage attempts allowed before a stage is aborted.This is controlled by the config spark.stage.maxConsecutiveAttempts. Not accounting fetch fails due decommissioning of nodes towards stage failure improves the reliability of the system. Main components of change 1) Get the ClusterInfo update from the Resource Manager -> Application Master -> Spark Driver. 2) DecommissionTracker, resides inside driver, tracks all the decommissioned nodes and take necessary action and state transition. 3) Based on the decommission node list add hooks at code to achieve a) No new task on executor b) Remove shuffle data mapping info for the node to be decommissioned from the mapOutputTracker c) Do not count fetchFailure from decommissioned towards stage failure On the receiving info that node is to be decommissioned, the below action needs to be performed by DecommissionTracker on driver: * Add the entry of Nodes in DecommissionTracker with termination time and node state as "DECOMMISSIONING". * Stop assigning any new tasks on executors on the nodes which are candidate for decommission. This makes sure slowly as the tasks finish the usage of this node would die down. * Kill all the executors for the decommissioning nodes after configurable period of time, say "spark.graceful.decommission.executor.leasetimePct". This killing ensures two things. Firstly, the task failure will be attributed in job failure count. Second, avoid generation on more shuffle data on the node that will eventually be lost. The node state is set to "EXECUTOR_DECOMMISSIONED". 
* Mark Shuffle data on the node as unavailable after "spark.qubole.graceful.decommission.shuffedata.leasetimePct" time. This will ensure that recomputation of missing shuffle partition is done early, rather than reducers failing with a time-consuming FetchFailure. The node state is set to "SHUFFLE_DECOMMISSIONED". * Mark Node as Terminated after the termination time. Now the state of the node is "TERMINATED". * Remove the node entry from Decommission Tracker if the same host name is reused.(This is not uncommon in many public cloud environments). was: In many public cloud environments, the node loss (in case of AWS SpotLoss,Spot blocks and GCP preemptible VMs) is a planned and informed activity. The cloud provider intimates the cluster manager about the possible loss of node ahead of time. Few exmaples is listed here: a) Spot loss in AWS(2 min before event) b) GCP Pre-emptible VM loss (30 second before event) c) AWS Spot block loss with info on termination time (generally few tens of minutes before decommission as configured in Yarn) This JIRA tries to make spark leverage the
[jira] [Created] (SPARK-30873) Handling Node Decommissioning for Yarn cluster manager in Spark
Saurabh Chawla created SPARK-30873: -- Summary: Handling Node Decommissioning for Yarn cluster manager in Spark Key: SPARK-30873 URL: https://issues.apache.org/jira/browse/SPARK-30873 Project: Spark Issue Type: Improvement Components: Spark Core, YARN Affects Versions: 3.0.0 Reporter: Saurabh Chawla In many public cloud environments, node loss (in the case of AWS Spot loss, Spot blocks and GCP preemptible VMs) is a planned and informed activity. The cloud provider intimates the cluster manager about the possible loss of a node ahead of time. A few examples are listed here: a) Spot loss in AWS (2 min before the event) b) GCP Pre-emptible VM loss (30 seconds before the event) c) AWS Spot block loss with info on termination time (generally a few tens of minutes before decommission, as configured in Yarn) This JIRA tries to make Spark leverage the knowledge of the upcoming node loss, and tries to adjust the scheduling of tasks to minimise the impact on the application. It is well known that when a host is lost, the executors, its running tasks, their caches and also shuffle data are lost. This could result in wastage of compute and other resources. The focus here is to build a framework for YARN that can be extended for other cluster managers to handle such scenarios. The framework must handle one or more of the following: 1) Prevent new tasks from starting on any executors on decommissioning nodes. 2) Decide to kill the running tasks so that they can be restarted elsewhere (assuming they will not complete within the deadline) OR allow them to continue hoping they will finish within the deadline. 3) Clear the shuffle data entry from the MapOutputTracker for a decommissioned node's hostname to prevent the shuffle FetchFailed exception. The most significant advantage of unregistering shuffle outputs is that, when Spark schedules the first re-attempt to compute the missing blocks, it notices all of the missing blocks from decommissioned nodes and recovers in only one attempt. 
This speeds up the recovery process significantly over the current Spark implementation, where stages might be rescheduled multiple times to recompute missing shuffles from all nodes, and prevents jobs from being stuck for hours failing and recomputing. 4) Prevent the stage from aborting due to the FetchFailed exception when a node is decommissioned. In Spark there is a limited number of consecutive stage attempts allowed before a stage is aborted. This is controlled by the config spark.stage.maxConsecutiveAttempts. Not counting fetch failures due to decommissioning of nodes towards stage failure improves the reliability of the system. Main components of the change: 1) Get the ClusterInfo update from the Resource Manager -> Application Master -> Spark Driver. 2) DecommissionTracker, which resides inside the driver, tracks all the decommissioned nodes and takes the necessary actions and state transitions. 3) Based on the decommission node list, add hooks in the code to achieve: a) No new tasks on the executor b) Remove shuffle data mapping info for the node to be decommissioned from the mapOutputTracker c) Do not count fetchFailure from decommissioned nodes towards stage failure On receiving info that a node is to be decommissioned, the below actions need to be performed by the DecommissionTracker on the driver: * Add an entry for the node in the DecommissionTracker with the termination time and node state as "DECOMMISSIONING". * Stop assigning any new tasks on executors on the nodes which are candidates for decommission. This makes sure that, as tasks finish, the usage of this node slowly dies down. * Kill all the executors for the decommissioning nodes after a configurable period of time, say "spark.graceful.decommission.executor.leasetimePct". This killing ensures two things. Firstly, the task failure will be attributed to the job failure count. Second, it avoids generating more shuffle data on the node that will eventually be lost. The node state is set to "EXECUTOR_DECOMMISSIONED". 
* Mark shuffle data on the node as unavailable after "spark.qubole.graceful.decommission.shuffedata.leasetimePct" time. This will ensure that recomputation of missing shuffle partitions is done early, rather than reducers failing with a time-consuming FetchFailure. The node state is set to "SHUFFLE_DECOMMISSIONED". * Mark the node as Terminated after the termination time. Now the state of the node is "TERMINATED". * Remove the node entry from the Decommission Tracker if the same host name is reused. (This is not uncommon in many public cloud environments.) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
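The node lifecycle proposed above (DECOMMISSIONING, then EXECUTOR_DECOMMISSIONED, then SHUFFLE_DECOMMISSIONED, then TERMINATED) can be sketched as a small state machine driven by the fraction of the termination lease that has elapsed. The class name, thresholds and lease fractions here are illustrative assumptions, not the proposed Spark implementation:

```python
import enum
from dataclasses import dataclass, field

class NodeState(enum.Enum):
    DECOMMISSIONING = 1          # no new tasks scheduled here
    EXECUTOR_DECOMMISSIONED = 2  # executors killed
    SHUFFLE_DECOMMISSIONED = 3   # shuffle outputs unregistered
    TERMINATED = 4               # node is gone

@dataclass
class DecommissionTracker:
    """Driver-side sketch of the tracker's state transitions.

    The percentages stand in for the proposed *.leasetimePct configs.
    """
    executor_lease_pct: float = 0.5  # kill executors after 50% of lease
    shuffle_lease_pct: float = 0.9   # drop shuffle data after 90% of lease
    nodes: dict = field(default_factory=dict)  # host -> (added, term, state)

    def add_node(self, host, now, termination_time):
        self.nodes[host] = (now, termination_time, NodeState.DECOMMISSIONING)

    def state(self, host, now):
        added, term, _ = self.nodes[host]
        frac = (now - added) / (term - added)  # fraction of lease elapsed
        if frac >= 1.0:
            s = NodeState.TERMINATED
        elif frac >= self.shuffle_lease_pct:
            s = NodeState.SHUFFLE_DECOMMISSIONED
        elif frac >= self.executor_lease_pct:
            s = NodeState.EXECUTOR_DECOMMISSIONED
        else:
            s = NodeState.DECOMMISSIONING
        self.nodes[host] = (added, term, s)
        return s

tracker = DecommissionTracker()
tracker.add_node("wn2-prakha", now=0, termination_time=120)  # 2-min notice
print(tracker.state("wn2-prakha", now=30))    # DECOMMISSIONING
print(tracker.state("wn2-prakha", now=70))    # EXECUTOR_DECOMMISSIONED
print(tracker.state("wn2-prakha", now=110))   # SHUFFLE_DECOMMISSIONED
print(tracker.state("wn2-prakha", now=125))   # TERMINATED
```

In the real proposal these transitions would also trigger side effects (unregister shuffle outputs, exclude the host from scheduling); the sketch only models the state progression.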
[jira] [Commented] (SPARK-26346) Upgrade parquet to 1.11.1
[ https://issues.apache.org/jira/browse/SPARK-26346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039871#comment-17039871 ] Gabor Szadovszky commented on SPARK-26346: -- [~h-vetinari], Parquet 1.11.1 was initiated to support the Spark integration. I would not do the release until Spark can confirm that 1.11.1-SNAPSHOT works correctly. I've added PARQUET-1796 to 1.11.1. > Upgrade parquet to 1.11.1 > - > > Key: SPARK-26346 > URL: https://issues.apache.org/jira/browse/SPARK-26346 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30872) Constraints inferred from inferred attributes
[ https://issues.apache.org/jira/browse/SPARK-30872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-30872: Affects Version/s: (was: 3.0.0) 3.1.0 > Constraints inferred from inferred attributes > - > > Key: SPARK-30872 > URL: https://issues.apache.org/jira/browse/SPARK-30872 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > scala> spark.range(20).selectExpr("id as a", "id as b", "id as > c").write.saveAsTable("t1") > scala> spark.sql("select count(*) from t1 where a = b and b = c and (c = 3 or > c = 13)").explain(false) > == Physical Plan == > *(2) HashAggregate(keys=[], functions=[count(1)]) > +- Exchange SinglePartition, true, [id=#76] >+- *(1) HashAggregate(keys=[], functions=[partial_count(1)]) > +- *(1) Project > +- *(1) Filter (((isnotnull(c#36L) AND ((b#35L = 3) OR (b#35L = > 13))) AND isnotnull(b#35L)) AND (a#34L = c#36L)) AND isnotnull(a#34L)) AND > (a#34L = b#35L)) AND (b#35L = c#36L)) AND ((c#36L = 3) OR (c#36L = 13))) > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[a#34L,b#35L,c#36L] Batched: > true, DataFilters: [isnotnull(c#36L), ((b#35L = 3) OR (b#35L = 13)), > isnotnull(b#35L), (a#34L = c#36L), isnotnull(a#..., Format: Parquet, > Location: > InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., > PartitionFilters: [], PushedFilters: [IsNotNull(c), > Or(EqualTo(b,3),EqualTo(b,13)), IsNotNull(b), IsNotNull(a), > Or(EqualTo(c,3),EqualT..., ReadSchema: struct > {code} > We can infer more constraints: {{(a#34L = 3) OR (a#34L = 13)}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
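The missing inference {{(a#34L = 3) OR (a#34L = 13)}} follows from propagating the predicate on c across the equivalence class {a, b, c} built from a = b and b = c. A union-find sketch of that idea (an illustration of the requested inference, not Catalyst's actual InferFiltersFromConstraints rule):

```python
def infer_constraints(equalities, predicates):
    """Propagate per-column value constraints across equality chains.

    equalities: list of (col, col) pairs from the WHERE clause, e.g.
                [("a", "b"), ("b", "c")] for `a = b and b = c`.
    predicates: dict col -> set of allowed values, e.g. {"c": {3, 13}}
                for `c = 3 or c = 13`.
    Returns the constraint map extended to every column in the same
    equivalence class as a constrained column.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Union the two sides of every equality predicate.
    for lhs, rhs in equalities:
        parent[find(lhs)] = find(rhs)

    # Attach each known predicate to its class root, then fan back out.
    by_root = {find(col): vals for col, vals in predicates.items()}
    return {col: by_root[find(col)] for col in parent if find(col) in by_root}

# `a = b and b = c and (c = 3 or c = 13)` implies the same disjunction on a.
inferred = infer_constraints([("a", "b"), ("b", "c")], {"c": {3, 13}})
print(inferred)
```

Here a ends up constrained to {3, 13}, which is exactly the extra filter the ticket wants pushed down.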
[jira] [Updated] (SPARK-30872) Constraints inferred from inferred attributes
[ https://issues.apache.org/jira/browse/SPARK-30872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-30872: Description: {code:scala} scala> spark.range(20).selectExpr("id as a", "id as b", "id as c").write.saveAsTable("t1") scala> spark.sql("select count(*) from t1 where a = b and b = c and (c = 3 or c = 13)").explain(false) == Physical Plan == *(2) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition, true, [id=#76] +- *(1) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(1) Project +- *(1) Filter (((isnotnull(c#36L) AND ((b#35L = 3) OR (b#35L = 13))) AND isnotnull(b#35L)) AND (a#34L = c#36L)) AND isnotnull(a#34L)) AND (a#34L = b#35L)) AND (b#35L = c#36L)) AND ((c#36L = 3) OR (c#36L = 13))) +- *(1) ColumnarToRow +- FileScan parquet default.t1[a#34L,b#35L,c#36L] Batched: true, DataFilters: [isnotnull(c#36L), ((b#35L = 3) OR (b#35L = 13)), isnotnull(b#35L), (a#34L = c#36L), isnotnull(a#..., Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c), Or(EqualTo(b,3),EqualTo(b,13)), IsNotNull(b), IsNotNull(a), Or(EqualTo(c,3),EqualT..., ReadSchema: struct {code} We can infer more constraints: {{(a#34L = 3) OR (a#34L = 13)}}. 
was: {code:scala} scala> spark.range(20).selectExpr("id as a", "id as b", "id as c").write.saveAsTable("t1") scala> spark.sql("select count(*) from t1 where a = b and b = c and (c = 3 or c = 13)").explain(false) == Physical Plan == *(2) HashAggregate(keys=[], functions=[count(1)]) +- Exchange SinglePartition, true, [id=#76] +- *(1) HashAggregate(keys=[], functions=[partial_count(1)]) +- *(1) Project +- *(1) Filter (((isnotnull(c#36L) AND ((b#35L = 3) OR (b#35L = 13))) AND isnotnull(b#35L)) AND (a#34L = c#36L)) AND isnotnull(a#34L)) AND (a#34L = b#35L)) AND (b#35L = c#36L)) AND ((c#36L = 3) OR (c#36L = 13))) +- *(1) ColumnarToRow +- FileScan parquet default.t1[a#34L,b#35L,c#36L] Batched: true, DataFilters: [isnotnull(c#36L), ((b#35L = 3) OR (b#35L = 13)), isnotnull(b#35L), (a#34L = c#36L), isnotnull(a#..., Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., PartitionFilters: [], PushedFilters: [IsNotNull(c), Or(EqualTo(b,3),EqualTo(b,13)), IsNotNull(b), IsNotNull(a), Or(EqualTo(c,3),EqualT..., ReadSchema: struct {code} > Constraints inferred from inferred attributes > - > > Key: SPARK-30872 > URL: https://issues.apache.org/jira/browse/SPARK-30872 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > {code:scala} > scala> spark.range(20).selectExpr("id as a", "id as b", "id as > c").write.saveAsTable("t1") > scala> spark.sql("select count(*) from t1 where a = b and b = c and (c = 3 or > c = 13)").explain(false) > == Physical Plan == > *(2) HashAggregate(keys=[], functions=[count(1)]) > +- Exchange SinglePartition, true, [id=#76] >+- *(1) HashAggregate(keys=[], functions=[partial_count(1)]) > +- *(1) Project > +- *(1) Filter (((isnotnull(c#36L) AND ((b#35L = 3) OR (b#35L = > 13))) AND isnotnull(b#35L)) AND (a#34L = c#36L)) AND isnotnull(a#34L)) AND > (a#34L = b#35L)) AND (b#35L = c#36L)) AND ((c#36L = 3) OR (c#36L = 
13))) > +- *(1) ColumnarToRow >+- FileScan parquet default.t1[a#34L,b#35L,c#36L] Batched: > true, DataFilters: [isnotnull(c#36L), ((b#35L = 3) OR (b#35L = 13)), > isnotnull(b#35L), (a#34L = c#36L), isnotnull(a#..., Format: Parquet, > Location: > InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., > PartitionFilters: [], PushedFilters: [IsNotNull(c), > Or(EqualTo(b,3),EqualTo(b,13)), IsNotNull(b), IsNotNull(a), > Or(EqualTo(c,3),EqualT..., ReadSchema: struct > {code} > We can infer more constraints: {{(a#34L = 3) OR (a#34L = 13)}}. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30349) The result is wrong when joining tables with selecting the same columns
[ https://issues.apache.org/jira/browse/SPARK-30349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30349. -- Resolution: Cannot Reproduce Resolving this due to no feedback from the author. > The result is wrong when joining tables with selecting the same columns > --- > > Key: SPARK-30349 > URL: https://issues.apache.org/jira/browse/SPARK-30349 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: cen yuhai >Priority: Blocker > Labels: correctness > Attachments: screenshot-1.png, screenshot-2.png > > > {code:sql} > // code placeholder > with tmp as( > select > log_date, > buvid, > manga_id, > sum(readtime) readtime > from > manga.dwd_app_readtime_xt_dt > where > log_date >= 20191220 > group by > log_date, > buvid, > manga_id > ) > select > t.log_date, > GET_JSON_OBJECT(t.extended_fields, '$.type'), > count(distinct t.buvid), > count(distinct t0.buvid), > count(distinct t1.buvid), > count(distinct t2.buvid), > count( > distinct case > when t1.buvid = t0.buvid then t1.buvid > end > ), > count( > distinct case > when t1.buvid = t0.buvid > and t1.buvid = t2.buvid then t1.buvid > end > ), > count( > distinct case > when t0.buvid = t2.buvid then t0.buvid > end > ), > sum(readtime), > avg(readtime), > sum( > case > when t0.buvid = t3.buvid then readtime > end > ), > avg( > case > when t0.buvid = t3.buvid then readtime > end > ) > from > manga.manga_tfc_app_ubt_d t > join manga.manga_tfc_app_ubt_d t1 on t.buvid = t1.buvid > and t1.log_date >= 20191220 > and t1.event_id = 'manga.manga-detail.0.0.pv' > and to_date(t.stime) = TO_DATE(t1.stime) > and GET_JSON_OBJECT(t1.extended_fields, '$.manga_id') = > GET_JSON_OBJECT(t.extended_fields, '$.manga_id') > left join manga.manga_buvid_minlog t0 on t.buvid = t0.buvid > and t0.log_date = 20191223 > and t0.minlog >= '2019-12-20' > and to_date(t.stime) = TO_DATE(t0.minlog) > left join manga.dwb_tfc_app_launch_df t2 on t.buvid = t2.buvid > and 
t2.log_date >= 20191220 > and DATE_ADD(to_date(t.stime), 1) = to_date(t2.stime) > left join tmp t3 on t1.buvid = t3.buvid > and t3.log_date >= 20191220 > and t3.manga_id = GET_JSON_OBJECT(t.extended_fields, '$.manga_id') > where > t.log_date >= 20191220 > and t.event_id = 'manga.homepage-recommend.detail.0.click' > group by > t.log_date, > GET_JSON_OBJECT(t.extended_fields, '$.type') > {code} > !screenshot-1.png! > The result of hive 2.3 is ok > !screenshot-2.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29908) Support partitioning for DataSource V2 tables in DataFrameWriter.save
[ https://issues.apache.org/jira/browse/SPARK-29908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039784#comment-17039784 ] Hyukjin Kwon commented on SPARK-29908: -- [~brkyvz], the PRs were closed. Do we still need this? Can you take action on the JIRA too? > Support partitioning for DataSource V2 tables in DataFrameWriter.save > - > > Key: SPARK-29908 > URL: https://issues.apache.org/jira/browse/SPARK-29908 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Burak Yavuz >Priority: Blocker > > Currently, any data source that upgrades to DataSource V2 loses the > partition transform information when using DataFrameWriter.save. The main > reason is the lack of an API for "creating" a table with partitioning and > schema information for V2 tables without a catalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29699) Different answers in nested aggregates with window functions
[ https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29699: - Priority: Critical (was: Blocker) > Different answers in nested aggregates with window functions > > > Key: SPARK-29699 > URL: https://issues.apache.org/jira/browse/SPARK-29699 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Critical > > A nested aggregate below with a window function seems to have different > answers in the `rsum` column between PgSQL and Spark; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# > postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; > a | b | sum | rsum > ---+---+-+-- > 1 | 1 | 8 |8 > 1 | 2 | 2 | 10 > 1 | | 10 | 20 > 2 | 2 | 2 | 22 > 2 | | 2 | 24 >| | 12 | 36 > (6 rows) > {code} > {code:java} > scala> sql(""" > | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > | from gstest2 group by rollup (a,b) order by rsum, a, b > | """).show() > +++--++ > > | a| b|sum(c)|rsum| > +++--++ > |null|null|12| 12| > | 1|null|10| 22| > | 1| 1| 8| 30| > | 1| 2| 2| 32| > | 2|null| 2| 34| > | 2| 2| 2| 36| > +++--++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
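The diverging `rsum` columns are consistent with the two engines' default null placement in ascending sorts (PostgreSQL sorts NULLs last, Spark sorts them first), which changes where the rollup subtotal rows land in the window's `order by a, b` and therefore what the running sum accumulates. The sketch below replays the running sum over the same six rollup groups under both orderings; it is a plain-Python simulation, not either engine's implementation.

```python
# Replay sum(sum(c)) over (order by a, b) on the rollup groups from the
# ticket, under the two default null orderings for an ASC sort:
# PostgreSQL (NULLS LAST) vs Spark (NULLS FIRST).

rollup_groups = [  # (a, b, sum(c)) rows produced by GROUP BY ROLLUP (a, b)
    (1, 1, 8), (1, 2, 2), (1, None, 10),
    (2, 2, 2), (2, None, 2), (None, None, 12),
]

def running_sum(groups, nulls_first):
    # Sort by (a, b) with the requested null placement, then accumulate.
    null_rank = -1 if nulls_first else 1
    key = lambda r: tuple((0, v) if v is not None else (null_rank, 0)
                          for v in r[:2])
    total, out = 0, []
    for _a, _b, s in sorted(groups, key=key):
        total += s
        out.append(total)
    return out

print(running_sum(rollup_groups, nulls_first=False))  # PostgreSQL-style
print(running_sum(rollup_groups, nulls_first=True))   # Spark-style
```

The first call reproduces PostgreSQL's rsum column (8, 10, 20, 22, 24, 36) and the second reproduces Spark's (12, 22, 30, 32, 34, 36), so both engines compute a consistent running sum; they just disagree on the default sort position of the NULL-keyed subtotal rows.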
[jira] [Updated] (SPARK-29699) Different answers in nested aggregates with window functions
[ https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29699: - Target Version/s: (was: 3.0.0) > Different answers in nested aggregates with window functions > > > Key: SPARK-29699 > URL: https://issues.apache.org/jira/browse/SPARK-29699 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Critical > > A nested aggregate below with a window function seems to have different > answers in the `rsum` column between PgSQL and Spark; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# > postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; > a | b | sum | rsum > ---+---+-+-- > 1 | 1 | 8 |8 > 1 | 2 | 2 | 10 > 1 | | 10 | 20 > 2 | 2 | 2 | 22 > 2 | | 2 | 24 >| | 12 | 36 > (6 rows) > {code} > {code:java} > scala> sql(""" > | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > | from gstest2 group by rollup (a,b) order by rsum, a, b > | """).show() > +++--++ > > | a| b|sum(c)|rsum| > +++--++ > |null|null|12| 12| > | 1|null|10| 22| > | 1| 1| 8| 30| > | 1| 2| 2| 32| > | 2|null| 2| 34| > | 2| 2| 2| 36| > +++--++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29699) Different answers in nested aggregates with window functions
[ https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29699: - Labels: (was: correctness) > Different answers in nested aggregates with window functions > > > Key: SPARK-29699 > URL: https://issues.apache.org/jira/browse/SPARK-29699 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Blocker > > A nested aggregate below with a window function seems to have different > answers in the `rsum` column between PgSQL and Spark; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# > postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; > a | b | sum | rsum > ---+---+-+-- > 1 | 1 | 8 |8 > 1 | 2 | 2 | 10 > 1 | | 10 | 20 > 2 | 2 | 2 | 22 > 2 | | 2 | 24 >| | 12 | 36 > (6 rows) > {code} > {code:java} > scala> sql(""" > | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > | from gstest2 group by rollup (a,b) order by rsum, a, b > | """).show() > +++--++ > > | a| b|sum(c)|rsum| > +++--++ > |null|null|12| 12| > | 1|null|10| 22| > | 1| 1| 8| 30| > | 1| 2| 2| 32| > | 2|null| 2| 34| > | 2| 2| 2| 36| > +++--++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29699) Different answers in nested aggregates with window functions
[ https://issues.apache.org/jira/browse/SPARK-29699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17039783#comment-17039783 ] Hyukjin Kwon commented on SPARK-29699: -- I lowered to Critical+ for now. > Different answers in nested aggregates with window functions > > > Key: SPARK-29699 > URL: https://issues.apache.org/jira/browse/SPARK-29699 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Critical > > A nested aggregate below with a window function seems to have different > answers in the `rsum` column between PgSQL and Spark; > {code:java} > postgres=# create table gstest2 (a integer, b integer, c integer, d integer, > e integer, f integer, g integer, h integer); > postgres=# insert into gstest2 values > postgres-# (1, 1, 1, 1, 1, 1, 1, 1), > postgres-# (1, 1, 1, 1, 1, 1, 1, 2), > postgres-# (1, 1, 1, 1, 1, 1, 2, 2), > postgres-# (1, 1, 1, 1, 1, 2, 2, 2), > postgres-# (1, 1, 1, 1, 2, 2, 2, 2), > postgres-# (1, 1, 1, 2, 2, 2, 2, 2), > postgres-# (1, 1, 2, 2, 2, 2, 2, 2), > postgres-# (1, 2, 2, 2, 2, 2, 2, 2), > postgres-# (2, 2, 2, 2, 2, 2, 2, 2); > INSERT 0 9 > postgres=# > postgres=# select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > postgres-# from gstest2 group by rollup (a,b) order by rsum, a, b; > a | b | sum | rsum > ---+---+-+-- > 1 | 1 | 8 |8 > 1 | 2 | 2 | 10 > 1 | | 10 | 20 > 2 | 2 | 2 | 22 > 2 | | 2 | 24 >| | 12 | 36 > (6 rows) > {code} > {code:java} > scala> sql(""" > | select a, b, sum(c), sum(sum(c)) over (order by a,b) as rsum > | from gstest2 group by rollup (a,b) order by rsum, a, b > | """).show() > +++--++ > > | a| b|sum(c)|rsum| > +++--++ > |null|null|12| 12| > | 1|null|10| 22| > | 1| 1| 8| 30| > | 1| 2| 2| 32| > | 2|null| 2| 34| > | 2| 2| 2| 36| > +++--++ > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: 
issues-h...@spark.apache.org