[jira] [Created] (SPARK-41558) Disable Coverage in python.pyspark.tests.test_memory_profiler
Hyukjin Kwon created SPARK-41558: Summary: Disable Coverage in python.pyspark.tests.test_memory_profiler Key: SPARK-41558 URL: https://issues.apache.org/jira/browse/SPARK-41558 Project: Spark Issue Type: Test Components: PySpark Affects Versions: 3.4.0 Reporter: Hyukjin Kwon https://github.com/apache/spark/actions/runs/3712125552/jobs/6293347848
{code}
==
FAIL [13.173s]: test_memory_profiler (pyspark.tests.test_memory_profiler.MemoryProfilerTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/tests/test_memory_profiler.py", line 56, in test_memory_profiler
    self.assertTrue("plus_one" in fake_out.getvalue())
AssertionError: False is not true
==
FAIL [3.986s]: test_profile_pandas_function_api (pyspark.tests.test_memory_profiler.MemoryProfilerTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/tests/test_memory_profiler.py", line 87, in test_profile_pandas_function_api
    self.assertTrue(f_name in fake_out.getvalue())
AssertionError: False is not true
==
FAIL [3.722s]: test_profile_pandas_udf (pyspark.tests.test_memory_profiler.MemoryProfilerTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/tests/test_memory_profiler.py", line 69, in test_profile_pandas_udf
    self.assertTrue(f_name in fake_out.getvalue())
AssertionError: False is not true
--
Ran 3 tests in 20.882s
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
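The three failures share one pattern: each test redirects stdout, triggers a profiler report, and asserts that the profiled function's name appears in the captured text. A minimal self-contained sketch of that capture-and-assert technique, with a hypothetical `show_profile` standing in for the real PySpark profiler output (this is not PySpark code):

```python
import io
from contextlib import redirect_stdout

def show_profile(func_name: str) -> None:
    # Hypothetical stand-in for PySpark's profiler report: it prints
    # a summary that includes the profiled function's name.
    print(f"Profile of UDF<id=2, name={func_name}>")

fake_out = io.StringIO()
with redirect_stdout(fake_out):
    show_profile("plus_one")

# Mirrors the failing assertion: the function name must appear
# in the captured stdout.
assert "plus_one" in fake_out.getvalue()
```

The tests fail whenever the report is never printed (for example when coverage instrumentation interferes with the profiler), since the captured buffer then lacks the function name.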
[jira] [Assigned] (SPARK-41426) Protobuf serializer for ResourceProfileWrapper
[ https://issues.apache.org/jira/browse/SPARK-41426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41426: Assignee: (was: Apache Spark) > Protobuf serializer for ResourceProfileWrapper > -- > > Key: SPARK-41426 > URL: https://issues.apache.org/jira/browse/SPARK-41426 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41426) Protobuf serializer for ResourceProfileWrapper
[ https://issues.apache.org/jira/browse/SPARK-41426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648870#comment-17648870 ] Apache Spark commented on SPARK-41426: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39105 > Protobuf serializer for ResourceProfileWrapper > -- > > Key: SPARK-41426 > URL: https://issues.apache.org/jira/browse/SPARK-41426 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41426) Protobuf serializer for ResourceProfileWrapper
[ https://issues.apache.org/jira/browse/SPARK-41426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41426: Assignee: Apache Spark > Protobuf serializer for ResourceProfileWrapper > -- > > Key: SPARK-41426 > URL: https://issues.apache.org/jira/browse/SPARK-41426 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41426) Protobuf serializer for ResourceProfileWrapper
[ https://issues.apache.org/jira/browse/SPARK-41426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648868#comment-17648868 ] Apache Spark commented on SPARK-41426: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39105 > Protobuf serializer for ResourceProfileWrapper > -- > > Key: SPARK-41426 > URL: https://issues.apache.org/jira/browse/SPARK-41426 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41426) Protobuf serializer for ResourceProfileWrapper
[ https://issues.apache.org/jira/browse/SPARK-41426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648867#comment-17648867 ] Sandeep Singh commented on SPARK-41426: --- The underlying serializer is already there; working on a PR. > Protobuf serializer for ResourceProfileWrapper > -- > > Key: SPARK-41426 > URL: https://issues.apache.org/jira/browse/SPARK-41426 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41557) Union of tables with and without metadata column fails when used in join
[ https://issues.apache.org/jira/browse/SPARK-41557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648866#comment-17648866 ] Shardul Mahadik commented on SPARK-41557: - cc: [~Gengliang.Wang] [~cloud_fan] > Union of tables with and without metadata column fails when used in join > > > Key: SPARK-41557 > URL: https://issues.apache.org/jira/browse/SPARK-41557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0 >Reporter: Shardul Mahadik >Priority: Major > > Here is a test case that can be added to {{MetadataColumnSuite}} to > demonstrate the issue > {code:scala} > test("SPARK-41557: Union of tables with and without metadata column should > work") { > withTable(tbl) { > sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)") > checkAnswer( > spark.sql( > s""" > SELECT b.* > FROM RANGE(1) > LEFT JOIN ( > SELECT id FROM $tbl > UNION ALL > SELECT id FROM RANGE(10) > ) b USING(id) > """), > Seq(Row(0)) > ) > } > } > {code} > Here a table with metadata columns {{$tbl}} is unioned with a table without > metadata columns {{RANGE(10)}}. If this result is later used in a join, query > analysis fails, reporting a mismatch in the number of columns in the union, caused > by the metadata columns. However, we explicitly project > only one column during the union, so the union should be valid.
> {code} > org.apache.spark.sql.AnalysisException: [NUM_COLUMNS_MISMATCH] UNION can only > be performed on inputs with the same number of columns, but the first input > has 3 columns and the second input has 1 columns.; line 5 pos 16; > 'Project [id#26L] > +- 'Project [id#26L, id#26L] >+- 'Project [id#28L, id#26L] > +- 'Join LeftOuter, (id#28L = id#26L) > :- Range (0, 1, step=1, splits=None) > +- 'SubqueryAlias b > +- 'Union false, false >:- Project [id#26L, index#30, _partition#31] >: +- SubqueryAlias testcat.t >: +- RelationV2[id#26L, data#27, index#30, _partition#31] > testcat.t testcat.t >+- Project [id#29L] > +- Range (0, 10, step=1, splits=None) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41557) Union of tables with and without metadata column fails when used in join
[ https://issues.apache.org/jira/browse/SPARK-41557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shardul Mahadik updated SPARK-41557: Description: Here is a test case that can be added to {{MetadataColumnSuite}} to demonstrate the issue {code:scala} test("SPARK-X: Union of tables with and without metadata column should work") { withTable(tbl) { sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)") checkAnswer( spark.sql( s""" SELECT b.* FROM RANGE(1) LEFT JOIN ( SELECT id FROM $tbl UNION ALL SELECT id FROM RANGE(10) ) b USING(id) """), Seq(Row(0)) ) } } {code} Here a table with metadata columns {{$tbl}} is unioned with a table without metadata columns {{RANGE(10)}}. If this result is later used in a join, query analysis fails, reporting a mismatch in the number of columns in the union, caused by the metadata columns. However, we explicitly project only one column during the union, so the union should be valid. {code} org.apache.spark.sql.AnalysisException: [NUM_COLUMNS_MISMATCH] UNION can only be performed on inputs with the same number of columns, but the first input has 3 columns and the second input has 1 columns.; line 5 pos 16; 'Project [id#26L] +- 'Project [id#26L, id#26L] +- 'Project [id#28L, id#26L] +- 'Join LeftOuter, (id#28L = id#26L) :- Range (0, 1, step=1, splits=None) +- 'SubqueryAlias b +- 'Union false, false :- Project [id#26L, index#30, _partition#31] : +- SubqueryAlias testcat.t : +- RelationV2[id#26L, data#27, index#30, _partition#31] testcat.t testcat.t +- Project [id#29L] +- Range (0, 10, step=1, splits=None) {code} was: Here is a test case that can be added to {{MetadataColumnSuite}} to demonstrate the issue {code:scala} test("SPARK-X: Union of tables with and without metadata column should work") { withTable(tbl) { sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)") checkAnswer( spark.sql( s""" SELECT b.* FROM RANGE(1) LEFT JOIN ( SELECT id FROM $tbl UNION ALL SELECT id FROM RANGE(10) ) b USING(id) """), Seq(Row(0)) ) } } {code} Here a table with metadata columns {{$tbl}} is unioned with a table without metadata columns {{RANGE(10)}}. If this result is later used in a join, query analysis fails, reporting a mismatch in the number of columns in the union, caused by the metadata columns. However, we explicitly project only one column during the union, so the union should be valid. {code} org.apache.spark.sql.AnalysisException: [NUM_COLUMNS_MISMATCH] UNION can only be performed on inputs with the same number of columns, but the first input has 3 columns and the second input has 1 columns.; line 5 pos 16; 'Project [id#26L] +- 'Project [id#26L, id#26L] +- 'Project [id#28L, id#26L] +- 'Join LeftOuter, (id#28L = id#26L) :- Range (0, 1, step=1, splits=None) +- 'SubqueryAlias b +- 'Union false, false :- Project [id#26L, index#30, _partition#31] : +- SubqueryAlias testcat.t : +- RelationV2[id#26L, data#27, index#30, _partition#31] testcat.t testcat.t +- Project [id#29L] +- Range (0, 10, step=1, splits=None) {code} > Union of tables with and without metadata column fails when used in join > > > Key: SPARK-41557 > URL: https://issues.apache.org/jira/browse/SPARK-41557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0 >Reporter: Shardul Mahadik >Priority: Major > > Here is a test case that can be added to {{MetadataColumnSuite}} to > demonstrate the issue > {code:scala} > test("SPARK-X: Union of tables with and without metadata column should > work") { > withTable(tbl) { > sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)") > checkAnswer( > spark.sql( > s""" > SELECT b.* > FROM RANGE(1) > LEFT JOIN ( > SELECT id FROM $tbl > UNION ALL > SELECT id FROM RANGE(10) > ) b USING(id) > """), > Seq(Row(0)) > ) > } > } > {code} > Here a table with metadata columns {{$tbl}} is unioned with a table without > metadata columns {{RANGE(10)}}. If this result is later used in a join, query > a
[jira] [Updated] (SPARK-41557) Union of tables with and without metadata column fails when used in join
[ https://issues.apache.org/jira/browse/SPARK-41557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shardul Mahadik updated SPARK-41557: Description: Here is a test case that can be added to {{MetadataColumnSuite}} to demonstrate the issue {code:scala} test("SPARK-41557: Union of tables with and without metadata column should work") { withTable(tbl) { sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)") checkAnswer( spark.sql( s""" SELECT b.* FROM RANGE(1) LEFT JOIN ( SELECT id FROM $tbl UNION ALL SELECT id FROM RANGE(10) ) b USING(id) """), Seq(Row(0)) ) } } {code} Here a table with metadata columns {{$tbl}} is unioned with a table without metadata columns {{RANGE(10)}}. If this result is later used in a join, query analysis fails, reporting a mismatch in the number of columns in the union, caused by the metadata columns. However, we explicitly project only one column during the union, so the union should be valid. {code} org.apache.spark.sql.AnalysisException: [NUM_COLUMNS_MISMATCH] UNION can only be performed on inputs with the same number of columns, but the first input has 3 columns and the second input has 1 columns.; line 5 pos 16; 'Project [id#26L] +- 'Project [id#26L, id#26L] +- 'Project [id#28L, id#26L] +- 'Join LeftOuter, (id#28L = id#26L) :- Range (0, 1, step=1, splits=None) +- 'SubqueryAlias b +- 'Union false, false :- Project [id#26L, index#30, _partition#31] : +- SubqueryAlias testcat.t : +- RelationV2[id#26L, data#27, index#30, _partition#31] testcat.t testcat.t +- Project [id#29L] +- Range (0, 10, step=1, splits=None) {code} was: Here is a test case that can be added to {{MetadataColumnSuite}} to demonstrate the issue {code:scala} test("SPARK-X: Union of tables with and without metadata column should work") { withTable(tbl) { sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)") checkAnswer( spark.sql( s""" SELECT b.* FROM RANGE(1) LEFT JOIN ( SELECT id FROM $tbl UNION ALL SELECT id FROM RANGE(10) ) b USING(id) """), Seq(Row(0)) ) } } {code} Here a table with metadata columns {{$tbl}} is unioned with a table without metadata columns {{RANGE(10)}}. If this result is later used in a join, query analysis fails, reporting a mismatch in the number of columns in the union, caused by the metadata columns. However, we explicitly project only one column during the union, so the union should be valid. {code} org.apache.spark.sql.AnalysisException: [NUM_COLUMNS_MISMATCH] UNION can only be performed on inputs with the same number of columns, but the first input has 3 columns and the second input has 1 columns.; line 5 pos 16; 'Project [id#26L] +- 'Project [id#26L, id#26L] +- 'Project [id#28L, id#26L] +- 'Join LeftOuter, (id#28L = id#26L) :- Range (0, 1, step=1, splits=None) +- 'SubqueryAlias b +- 'Union false, false :- Project [id#26L, index#30, _partition#31] : +- SubqueryAlias testcat.t : +- RelationV2[id#26L, data#27, index#30, _partition#31] testcat.t testcat.t +- Project [id#29L] +- Range (0, 10, step=1, splits=None) {code} > Union of tables with and without metadata column fails when used in join > > > Key: SPARK-41557 > URL: https://issues.apache.org/jira/browse/SPARK-41557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.2, 3.4.0 >Reporter: Shardul Mahadik >Priority: Major > > Here is a test case that can be added to {{MetadataColumnSuite}} to > demonstrate the issue > {code:scala} > test("SPARK-41557: Union of tables with and without metadata column should > work") { > withTable(tbl) { > sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)") > checkAnswer( > spark.sql( > s""" > SELECT b.* > FROM RANGE(1) > LEFT JOIN ( > SELECT id FROM $tbl > UNION ALL > SELECT id FROM RANGE(10) > ) b USING(id) > """), > Seq(Row(0)) > ) > } > } > {code} > Here a table with metadata columns {{$tbl}} is unioned with a table without > metadata columns {{RANGE(10)}}. If this result is later used in a join, query > ana
[jira] [Created] (SPARK-41557) Union of tables with and without metadata column fails when used in join
Shardul Mahadik created SPARK-41557: --- Summary: Union of tables with and without metadata column fails when used in join Key: SPARK-41557 URL: https://issues.apache.org/jira/browse/SPARK-41557 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.2, 3.4.0 Reporter: Shardul Mahadik Here is a test case that can be added to {{MetadataColumnSuite}} to demonstrate the issue:
{code:scala}
test("SPARK-X: Union of tables with and without metadata column should work") {
  withTable(tbl) {
    sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)")
    checkAnswer(
      spark.sql(
        s"""
        SELECT b.*
        FROM RANGE(1)
        LEFT JOIN (
          SELECT id FROM $tbl
          UNION ALL
          SELECT id FROM RANGE(10)
        ) b USING(id)
        """),
      Seq(Row(0))
    )
  }
}
{code}
Here a table with metadata columns ({{$tbl}}) is unioned with a table without metadata columns ({{RANGE(10)}}). If this result is later used in a join, query analysis fails, reporting a mismatch in the number of columns in the union, caused by the metadata columns. However, we explicitly project only one column during the union, so the union should be valid.
{code}
org.apache.spark.sql.AnalysisException: [NUM_COLUMNS_MISMATCH] UNION can only be performed on inputs with the same number of columns, but the first input has 3 columns and the second input has 1 columns.; line 5 pos 16;
'Project [id#26L]
+- 'Project [id#26L, id#26L]
   +- 'Project [id#28L, id#26L]
      +- 'Join LeftOuter, (id#28L = id#26L)
         :- Range (0, 1, step=1, splits=None)
         +- 'SubqueryAlias b
            +- 'Union false, false
               :- Project [id#26L, index#30, _partition#31]
               :  +- SubqueryAlias testcat.t
               :     +- RelationV2[id#26L, data#27, index#30, _partition#31] testcat.t testcat.t
               +- Project [id#29L]
                  +- Range (0, 10, step=1, splits=None)
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
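The invariant the analyzer is enforcing can be illustrated outside Spark: UNION ALL is only defined for inputs of equal arity, and the implicitly added metadata columns inflate one branch. The helper and row values below are a plain-Python sketch, not Spark internals:

```python
# Hypothetical rows: the left branch carries two extra metadata
# columns (an index and a partition tag) that were added implicitly.
left = [(0, "idx-0", "part-0"), (1, "idx-1", "part-1")]
right = [(5,), (6,)]

def union_all(a, b):
    # UNION ALL requires all inputs to have the same number of
    # columns; this is the check the analyzer performs.
    arities = {len(row) for row in a + b}
    if len(arities) != 1:
        raise ValueError("NUM_COLUMNS_MISMATCH")
    return a + b

# Unioning the raw branches fails, like the reported plan does:
try:
    union_all(left, right)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
assert mismatch_raised

# Projecting each branch down to the single `id` column first, as the
# query text already does, makes the union valid:
projected = union_all([(row[0],) for row in left], right)
assert projected == [(0,), (1,), (5,), (6,)]
```

The bug is that the analyzer effectively performs the equivalent of the first call (metadata columns still attached to one branch) even though the query only selects `id` from both branches.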
[jira] [Assigned] (SPARK-41425) Protobuf serializer for RDDStorageInfoWrapper
[ https://issues.apache.org/jira/browse/SPARK-41425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41425: Assignee: (was: Apache Spark) > Protobuf serializer for RDDStorageInfoWrapper > - > > Key: SPARK-41425 > URL: https://issues.apache.org/jira/browse/SPARK-41425 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41425) Protobuf serializer for RDDStorageInfoWrapper
[ https://issues.apache.org/jira/browse/SPARK-41425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648862#comment-17648862 ] Apache Spark commented on SPARK-41425: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39104 > Protobuf serializer for RDDStorageInfoWrapper > - > > Key: SPARK-41425 > URL: https://issues.apache.org/jira/browse/SPARK-41425 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41425) Protobuf serializer for RDDStorageInfoWrapper
[ https://issues.apache.org/jira/browse/SPARK-41425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41425: Assignee: Apache Spark > Protobuf serializer for RDDStorageInfoWrapper > - > > Key: SPARK-41425 > URL: https://issues.apache.org/jira/browse/SPARK-41425 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41425) Protobuf serializer for RDDStorageInfoWrapper
[ https://issues.apache.org/jira/browse/SPARK-41425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648860#comment-17648860 ] Sandeep Singh commented on SPARK-41425: --- Working on this > Protobuf serializer for RDDStorageInfoWrapper > - > > Key: SPARK-41425 > URL: https://issues.apache.org/jira/browse/SPARK-41425 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41556) input_file_position
[ https://issues.apache.org/jira/browse/SPARK-41556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648856#comment-17648856 ] gabrywu edited comment on SPARK-41556 at 12/17/22 5:18 AM: --- [~yumwang] [~petertoth] What do you think of it? was (Author: gabry.wu): [~yumwang] [~ptoth] What do you think of it? > input_file_position > -- > > Key: SPARK-41556 > URL: https://issues.apache.org/jira/browse/SPARK-41556 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.1 >Reporter: gabrywu >Priority: Trivial > > As of now, we have 3 built-in UDFs related to input files and blocks. Can > we provide a new UDF that returns the current record position within a file or > block? It would sometimes be useful, and we could treat this position (called ROWID > in Oracle) as a physical primary key. > > |input_file_block_length()|Returns the length of the block being read, or -1 > if not available.| > |input_file_block_start()|Returns the start offset of the block being read, > or -1 if not available.| > |input_file_name()|Returns the name of the file being read, or empty string > if not available.| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41556) input_file_position
[ https://issues.apache.org/jira/browse/SPARK-41556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648856#comment-17648856 ] gabrywu commented on SPARK-41556: - [~yumwang] [~ptoth] What do you think of it? > input_file_position > -- > > Key: SPARK-41556 > URL: https://issues.apache.org/jira/browse/SPARK-41556 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.3.1 >Reporter: gabrywu >Priority: Trivial > > As of now, we have 3 built-in UDFs related to input files and blocks. Can > we provide a new UDF that returns the current record position within a file or > block? It would sometimes be useful, and we could treat this position (called ROWID > in Oracle) as a physical primary key. > > |input_file_block_length()|Returns the length of the block being read, or -1 > if not available.| > |input_file_block_start()|Returns the start offset of the block being read, > or -1 if not available.| > |input_file_name()|Returns the name of the file being read, or empty string > if not available.| -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41447) Reduce the number of doMergeApplicationListing invocations
[ https://issues.apache.org/jira/browse/SPARK-41447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] shuyouZZ updated SPARK-41447: - Description: When restarting the history server, we can see many {{INFO FsHistoryProvider: Finished parsing application_xxx}} messages in the history server log, followed by {{INFO FsHistoryProvider: Deleting expired event log for application_xxx}}. This happens because expired, unlisted log files are also parsed, and {{checkAndCleanLog}} then deletes the parsed info, which means the parsing was unnecessary. If there are a large number of expired log files in the log directory, this slows down replay. To avoid this, we can add logic to clean up these expired log files before calling {{doMergeApplicationListing}}. was: When restarting the history server, the previous logic executes {{checkForLogs}} first, which causes the expired event log files to be parsed, and then executes {{checkAndCleanLog}} to delete the parsed info, which is unnecessary. In the history server log, we can see many {{INFO FsHistoryProvider: Finished parsing application_xxx}} messages followed by {{INFO FsHistoryProvider: Deleting expired event log for application_xxx}}. If there are a large number of expired log files in the log directory, this slows down replay. To avoid this, we can put {{cleanLogs}} before {{checkForLogs}}. In addition, since {{cleanLogs}} is executed before {{checkForLogs}}, when the history server is starting, the expired log info may not exist in the listing db, so we need to clean up these log files in {{cleanLogs}}.
> Reduce the number of doMergeApplicationListing invocations > -- > > Key: SPARK-41447 > URL: https://issues.apache.org/jira/browse/SPARK-41447 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: shuyouZZ >Assignee: shuyouZZ >Priority: Minor > Fix For: 3.4.0 > > > When restarting the history server, we can see many {{INFO FsHistoryProvider: Finished parsing application_xxx}} > messages in the history server log, followed by > {{INFO FsHistoryProvider: Deleting expired event log for application_xxx}}. > This happens because expired, unlisted log files are also parsed, and > {{checkAndCleanLog}} then deletes the parsed info, which means the > parsing was unnecessary. > If there are a large number of expired log files in the log directory, it > will slow down replay. > To avoid this, we can add logic to clean up these expired log > files before calling {{doMergeApplicationListing}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
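The proposed ordering, clean expired logs first so they are never parsed, can be sketched in plain Python. `clean_expired_logs` below is a hypothetical helper illustrating the idea, not Spark's FsHistoryProvider code:

```python
import os
import tempfile
import time

def clean_expired_logs(log_dir: str, max_age_seconds: float, now: float) -> list:
    """Delete event-log files older than max_age_seconds and return the
    surviving paths, so that only live logs get parsed afterwards.

    Plain-Python sketch of the clean-first-then-parse ordering the
    ticket proposes; not Spark's actual implementation.
    """
    survivors = []
    for entry in list(os.scandir(log_dir)):
        if now - entry.stat().st_mtime > max_age_seconds:
            os.remove(entry.path)  # expired: never worth parsing
        else:
            survivors.append(entry.path)
    return survivors

# Demo on a throwaway directory with one "old" and one "new" log.
with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "application_1.log")
    new = os.path.join(d, "application_2.log")
    for p in (old, new):
        open(p, "w").close()
    past = time.time() - 3600
    os.utime(old, (past, past))  # make one file look an hour old

    live = clean_expired_logs(d, max_age_seconds=600, now=time.time())
    assert live == [new]          # only the live log remains to parse
    assert not os.path.exists(old)
```

Parsing only the surviving list avoids the "Finished parsing" followed immediately by "Deleting expired event log" pattern described above.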
[jira] [Created] (SPARK-41556) input_file_position
gabrywu created SPARK-41556: --- Summary: input_file_position Key: SPARK-41556 URL: https://issues.apache.org/jira/browse/SPARK-41556 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.3.1 Reporter: gabrywu As of now, we have 3 built-in UDFs related to input files and blocks. Can we provide a new UDF that returns the current record position within a file or block? It would sometimes be useful, and we could treat this position (called ROWID in Oracle) as a physical primary key.
|input_file_block_length()|Returns the length of the block being read, or -1 if not available.|
|input_file_block_start()|Returns the start offset of the block being read, or -1 if not available.|
|input_file_name()|Returns the name of the file being read, or empty string if not available.|
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
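A plain-Python sketch of what such an `input_file_position()` could expose: the offset of each record within its file, which together with `input_file_name()` would form a ROWID-like key. The helper name is hypothetical and the sketch works on any seekable text stream:

```python
import io

def read_with_positions(f):
    """Yield (offset, record) pairs for a text stream: each record is
    tagged with its starting offset within the stream, the kind of
    per-record position the proposed UDF would return.
    """
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        yield pos, line

# Demo: three records and their starting offsets.
data = io.StringIO("alpha\nbeta\ngamma\n")
records = list(read_with_positions(data))
assert records[0] == (0, "alpha\n")
assert records[1] == (6, "beta\n")
assert records[2] == (11, "gamma\n")
```

Within a Spark file split, the analogous value would be the record's offset relative to `input_file_block_start()`, making (file name, position) stable across reads of the same files.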
[jira] [Commented] (SPARK-41546) pyspark_types_to_proto_types should support StructType.
[ https://issues.apache.org/jira/browse/SPARK-41546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648848#comment-17648848 ] Apache Spark commented on SPARK-41546: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/39103 > pyspark_types_to_proto_types should support StructType. > > > Key: SPARK-41546 > URL: https://issues.apache.org/jira/browse/SPARK-41546 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > pyspark_types_to_proto_types does not support StructType yet. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
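Supporting StructType in a type-to-proto conversion is essentially a recursion over the struct's fields. A hedged sketch using plain classes and nested dicts in place of PySpark's real DataType and Connect protobuf classes (all names below are illustrative, not the actual API):

```python
# Illustrative stand-ins for PySpark data types (not the real classes).
class LongType:
    pass

class StringType:
    pass

class StructField:
    def __init__(self, name, dataType, nullable=True):
        self.name, self.dataType, self.nullable = name, dataType, nullable

class StructType:
    def __init__(self, fields):
        self.fields = fields

def to_proto_like(dt):
    """Convert a type to a nested dict shaped like a proto message.
    The StructType case is the new part: recurse into each field."""
    if isinstance(dt, LongType):
        return {"long": {}}
    if isinstance(dt, StringType):
        return {"string": {}}
    if isinstance(dt, StructType):
        return {"struct": {"fields": [
            {"name": f.name,
             "data_type": to_proto_like(f.dataType),
             "nullable": f.nullable}
            for f in dt.fields]}}
    raise ValueError(f"unsupported type: {type(dt).__name__}")

schema = StructType([StructField("id", LongType(), False),
                     StructField("data", StringType())])
msg = to_proto_like(schema)
assert msg["struct"]["fields"][0]["name"] == "id"
```

Because the struct case calls `to_proto_like` recursively, nested structs come for free once the top-level case exists.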
[jira] (SPARK-41546) pyspark_types_to_proto_types should support StructType.
[ https://issues.apache.org/jira/browse/SPARK-41546 ] jiaan.geng deleted comment on SPARK-41546: was (Author: beliefer): I'm working on it. > pyspark_types_to_proto_types should support StructType. > > > Key: SPARK-41546 > URL: https://issues.apache.org/jira/browse/SPARK-41546 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > pyspark_types_to_proto_types does not support StructType yet. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41546) pyspark_types_to_proto_types should support StructType.
[ https://issues.apache.org/jira/browse/SPARK-41546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648847#comment-17648847 ] Apache Spark commented on SPARK-41546: -- User 'beliefer' has created a pull request for this issue: https://github.com/apache/spark/pull/39103 > pyspark_types_to_proto_types should support StructType. > > > Key: SPARK-41546 > URL: https://issues.apache.org/jira/browse/SPARK-41546 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > pyspark_types_to_proto_types does not support StructType yet. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41546) pyspark_types_to_proto_types should support StructType.
[ https://issues.apache.org/jira/browse/SPARK-41546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41546: Assignee: Apache Spark > pyspark_types_to_proto_types should support StructType. > > > Key: SPARK-41546 > URL: https://issues.apache.org/jira/browse/SPARK-41546 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Assignee: Apache Spark >Priority: Major > > pyspark_types_to_proto_types does not support StructType yet. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41546) pyspark_types_to_proto_types should support StructType.
[ https://issues.apache.org/jira/browse/SPARK-41546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41546: Assignee: (was: Apache Spark) > pyspark_types_to_proto_types should support StructType. > > > Key: SPARK-41546 > URL: https://issues.apache.org/jira/browse/SPARK-41546 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: jiaan.geng >Priority: Major > > pyspark_types_to_proto_types does not support StructType yet. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41555) Multiple SparkSessions should share a single SQLAppStatusStore
[ https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-41555: --- Priority: Minor (was: Major) > Multi sparkSession should share single SQLAppStatusStore > > > Key: SPARK-41555 > URL: https://issues.apache.org/jira/browse/SPARK-41555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.1, 3.3.0 >Reporter: jiahong.li >Priority: Minor > Attachments: muti-SQLStore.png, muti-sqltab.png > > > In Spark, if we create multiple SparkSessions in a program, we get multiple SQL tabs in the UI. At the same time, we get multiple SQLAppStatusListener objects, which wastes memory. > Code like this: > > {code:java} > def main(args: Array[String]): Unit = { > val sparkConf = new SparkConf() > .setAppName("demo") > .setMaster("local[*]") > val spark = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > setDefaultSession(null) > setActiveSession(null) > val spark2 = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > import spark.implicits._ > val testData = spark.sparkContext > .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() > testData.createOrReplaceTempView("testTable") > val testData2 = spark.sparkContext.parallelize( > TestData2(1, "1") :: > TestData2(1, "2") :: > TestData2(2, "1") :: > TestData2(2, "2") :: > TestData2(3, "1") :: > TestData2(3, "2") :: > Nil, 2).toDF() > testData2.createOrReplaceTempView("testTable2") > val query = "select ind2, count(*) from (select * from testTable2 join testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group by ind2" > spark.sql(query).collect() > Thread.sleep(50) > spark.stop() > } > case class TestData(ind: Int, name: String) > case class TestData2(ind2: Int, name: String) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper
[ https://issues.apache.org/jira/browse/SPARK-41421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-41421. Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39096 [https://github.com/apache/spark/pull/39096] > Protobuf serializer for ApplicationEnvironmentInfoWrapper > - > > Key: SPARK-41421 > URL: https://issues.apache.org/jira/browse/SPARK-41421 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper
[ https://issues.apache.org/jira/browse/SPARK-41421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-41421: -- Assignee: Sandeep Singh > Protobuf serializer for ApplicationEnvironmentInfoWrapper > - > > Key: SPARK-41421 > URL: https://issues.apache.org/jira/browse/SPARK-41421 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Sandeep Singh >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41162) Anti-join must not be pushed below aggregation with ambiguous predicates
[ https://issues.apache.org/jira/browse/SPARK-41162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shardul Mahadik updated SPARK-41162: Labels: correctness (was: ) > Anti-join must not be pushed below aggregation with ambiguous predicates > > > Key: SPARK-41162 > URL: https://issues.apache.org/jira/browse/SPARK-41162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Major > Labels: correctness > > The following query should return a single row as all values for {{id}} > except for the largest will be eliminated by the anti-join: > {code} > val ids = Seq(1, 2, 3).toDF("id").distinct() > val result = ids.withColumn("id", $"id" + 1).join(ids, "id", > "left_anti").collect() > assert(result.length == 1) > {code} > Without the {{distinct()}}, the assertion is true. With {{distinct()}}, the > assertion should still hold but is false. > Rule {{PushDownLeftSemiAntiJoin}} pushes the {{Join}} below the left > {{Aggregate}} with join condition {{(id#750 + 1) = id#750}}, which can never > be true. > {code} > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin === > !Join LeftAnti, (id#752 = id#750) 'Aggregate [id#750], > [(id#750 + 1) AS id#752] > !:- Aggregate [id#750], [(id#750 + 1) AS id#752] +- 'Join LeftAnti, > ((id#750 + 1) = id#750) > !: +- LocalRelation [id#750] :- LocalRelation > [id#750] > !+- Aggregate [id#750], [id#750] +- Aggregate [id#750], > [id#750] > ! +- LocalRelation [id#750]+- LocalRelation > [id#750] > {code} > The optimizer then rightly removes the left-anti join altogether, returning > the left child only. > Rule {{PushDownLeftSemiAntiJoin}} should not push down predicates that > reference left *and* right child. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
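The correctness argument in the report can be checked with plain set semantics: a left-anti join keeps the left rows with no match on the right. Projecting id to id + 1 first and then anti-joining (the correct order) leaves exactly one row, while pushing the join below the projection rewrites the condition into (id + 1) = id, which never holds, so nothing is eliminated. A small Python sketch of the two evaluation orders:

```python
ids = {1, 2, 3}  # the distinct ids

# Correct plan: apply the projection id -> id + 1 first, then the
# left-anti join against the original ids (keep non-matching rows).
projected = {i + 1 for i in ids}                    # {2, 3, 4}
correct = {i for i in projected if i not in ids}    # only 4 has no match

# Buggy plan: PushDownLeftSemiAntiJoin moves the anti-join below the
# projection and rewrites the condition to (id + 1) = id, which is
# false for every row, so the anti-join eliminates nothing.
never_true = lambda i: (i + 1) == i
kept = {i for i in ids if not never_true(i)}        # nothing filtered
buggy = {i + 1 for i in kept}                       # three rows instead of one
```

This matches the report: the assertion `result.length == 1` holds for `correct` but fails under the pushed-down plan.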
[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore
[ https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-41555: --- Description: In spark , if we create multi sparkSession in the program, we will get multi-SQLTab in UI, At the same time, we will get muti-SQLAppStatusListener object, it is waste of memory. code like this: {code:java} // code placeholder def main(args: Array[String]): Unit = { val sparkConf = new SparkConf() .setAppName("demo") .setMaster("local[*]") val spark = SparkSession.builder() .config(sparkConf) .getOrCreate() setDefaultSession(null) setActiveSession(null) val spark2 = SparkSession.builder() .config(sparkConf) .getOrCreate() import spark.implicits._ val testData = spark.sparkContext .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() testData.createOrReplaceTempView("testTable") val testData2 = spark.sparkContext.parallelize( TestData2(1, "1") :: TestData2(1, "2") :: TestData2(2, "1") :: TestData2(2, "2") :: TestData2(3, "1") :: TestData2(3, "2") :: Nil, 2).toDF() testData2.createOrReplaceTempView("testTable2") val query = "select ind2,count(*) from ( select * from testTable2 join testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group by ind2" spark.sql(query).collect() Thread.sleep(50) spark.stop() } case class TestData(ind: Int, name: String) case class TestData2(ind2: Int, name: String) {code} was: In spark , if we create multi sparkSession in the program, we will get multi-SQLTab in UI, At the same time, we will get muti-SQLAppStatusListener object, it is waste of memory. 
code like this: {code:java} // code placeholder def main(args: Array[String]): Unit = { val sparkConf = new SparkConf() .setAppName("demo") .setMaster("local[*]") val spark = SparkSession.builder() .config(sparkConf) .getOrCreate() setDefaultSession(null) setActiveSession(null) val spark2 = SparkSession.builder() .config(sparkConf) .getOrCreate() import spark.implicits._ val testData = spark.sparkContext .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() testData.createOrReplaceTempView("testTable") val testData2 = spark.sparkContext.parallelize( TestData2(1, "1") :: TestData2(1, "2") :: TestData2(2, "1") :: TestData2(2, "2") :: TestData2(3, "1") :: TestData2(3, "2") :: Nil, 2).toDF() testData2.createOrReplaceTempView("testTable2") val query = "select ind2,count(*) from ( select * from testTable2 join testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group by ind2" spark.sql(query).collect() Thread.sleep(50) spark.stop() } {code} > Multi sparkSession should share single SQLAppStatusStore > > > Key: SPARK-41555 > URL: https://issues.apache.org/jira/browse/SPARK-41555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.1, 3.3.0 >Reporter: jiahong.li >Priority: Major > Attachments: muti-SQLStore.png, muti-sqltab.png > > > In spark , if we create multi sparkSession in the program, we will get > multi-SQLTab in UI, > At the same time, we will get muti-SQLAppStatusListener object, it is waste > of memory. 
> code like this: > > {code:java} > // code placeholder > def main(args: Array[String]): Unit = { > val sparkConf = new SparkConf() > .setAppName("demo") > .setMaster("local[*]") > val spark = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > setDefaultSession(null) > setActiveSession(null) > val spark2 = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > import spark.implicits._ > val testData = spark.sparkContext > .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() > testData.createOrReplaceTempView("testTable") > val testData2 = spark.sparkContext.parallelize( > TestData2(1, "1") :: > TestData2(1, "2") :: > TestData2(2, "1") :: > TestData2(2, "2") :: > TestData2(3, "1") :: > TestData2(3, "2") :: > Nil, 2).toDF() > testData2.createOrReplaceTempView("testTable2") > val query = "select ind2,count(*) from ( select * from testTable2 join > testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') > group by ind2" > spark.sql(query).collect() > Thread.sleep(50) > spark.stop() > } > case class TestData(ind: Int, name: String) > case class TestData2(ind2: Int, name: String) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore
[ https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648843#comment-17648843 ] Apache Spark commented on SPARK-41555: -- User 'monkeyboy123' has created a pull request for this issue: https://github.com/apache/spark/pull/39102 > Multi sparkSession should share single SQLAppStatusStore > > > Key: SPARK-41555 > URL: https://issues.apache.org/jira/browse/SPARK-41555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.1, 3.3.0 >Reporter: jiahong.li >Priority: Major > Attachments: muti-SQLStore.png, muti-sqltab.png > > > In spark , if we create multi sparkSession in the program, we will get > multi-SQLTab in UI, > At the same time, we will get muti-SQLAppStatusListener object, it is waste > of memory. > code like this: > > {code:java} > // code placeholder > def main(args: Array[String]): Unit = { > val sparkConf = new SparkConf() > .setAppName("demo") > .setMaster("local[*]") > val spark = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > setDefaultSession(null) > setActiveSession(null) > val spark2 = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > import spark.implicits._ > val testData = spark.sparkContext > .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() > testData.createOrReplaceTempView("testTable") > val testData2 = spark.sparkContext.parallelize( > TestData2(1, "1") :: > TestData2(1, "2") :: > TestData2(2, "1") :: > TestData2(2, "2") :: > TestData2(3, "1") :: > TestData2(3, "2") :: > Nil, 2).toDF() > testData2.createOrReplaceTempView("testTable2") > val query = "select ind2,count(*) from ( select * from testTable2 join > testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') > group by ind2" > spark.sql(query).collect() > Thread.sleep(50) > spark.stop() > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore
[ https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648841#comment-17648841 ] Apache Spark commented on SPARK-41555: -- User 'monkeyboy123' has created a pull request for this issue: https://github.com/apache/spark/pull/39101 > Multi sparkSession should share single SQLAppStatusStore > > > Key: SPARK-41555 > URL: https://issues.apache.org/jira/browse/SPARK-41555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.1, 3.3.0 >Reporter: jiahong.li >Priority: Major > Attachments: muti-SQLStore.png, muti-sqltab.png > > > In spark , if we create multi sparkSession in the program, we will get > multi-SQLTab in UI, > At the same time, we will get muti-SQLAppStatusListener object, it is waste > of memory. > code like this: > > {code:java} > // code placeholder > def main(args: Array[String]): Unit = { > val sparkConf = new SparkConf() > .setAppName("demo") > .setMaster("local[*]") > val spark = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > setDefaultSession(null) > setActiveSession(null) > val spark2 = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > import spark.implicits._ > val testData = spark.sparkContext > .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() > testData.createOrReplaceTempView("testTable") > val testData2 = spark.sparkContext.parallelize( > TestData2(1, "1") :: > TestData2(1, "2") :: > TestData2(2, "1") :: > TestData2(2, "2") :: > TestData2(3, "1") :: > TestData2(3, "2") :: > Nil, 2).toDF() > testData2.createOrReplaceTempView("testTable2") > val query = "select ind2,count(*) from ( select * from testTable2 join > testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') > group by ind2" > spark.sql(query).collect() > Thread.sleep(50) > spark.stop() > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore
[ https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41555: Assignee: (was: Apache Spark) > Multi sparkSession should share single SQLAppStatusStore > > > Key: SPARK-41555 > URL: https://issues.apache.org/jira/browse/SPARK-41555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.1, 3.3.0 >Reporter: jiahong.li >Priority: Major > Attachments: muti-SQLStore.png, muti-sqltab.png > > > In spark , if we create multi sparkSession in the program, we will get > multi-SQLTab in UI, > At the same time, we will get muti-SQLAppStatusListener object, it is waste > of memory. > code like this: > > {code:java} > // code placeholder > def main(args: Array[String]): Unit = { > val sparkConf = new SparkConf() > .setAppName("demo") > .setMaster("local[*]") > val spark = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > setDefaultSession(null) > setActiveSession(null) > val spark2 = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > import spark.implicits._ > val testData = spark.sparkContext > .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() > testData.createOrReplaceTempView("testTable") > val testData2 = spark.sparkContext.parallelize( > TestData2(1, "1") :: > TestData2(1, "2") :: > TestData2(2, "1") :: > TestData2(2, "2") :: > TestData2(3, "1") :: > TestData2(3, "2") :: > Nil, 2).toDF() > testData2.createOrReplaceTempView("testTable2") > val query = "select ind2,count(*) from ( select * from testTable2 join > testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') > group by ind2" > spark.sql(query).collect() > Thread.sleep(50) > spark.stop() > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore
[ https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41555: Assignee: Apache Spark > Multi sparkSession should share single SQLAppStatusStore > > > Key: SPARK-41555 > URL: https://issues.apache.org/jira/browse/SPARK-41555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.1, 3.3.0 >Reporter: jiahong.li >Assignee: Apache Spark >Priority: Major > Attachments: muti-SQLStore.png, muti-sqltab.png > > > In spark , if we create multi sparkSession in the program, we will get > multi-SQLTab in UI, > At the same time, we will get muti-SQLAppStatusListener object, it is waste > of memory. > code like this: > > {code:java} > // code placeholder > def main(args: Array[String]): Unit = { > val sparkConf = new SparkConf() > .setAppName("demo") > .setMaster("local[*]") > val spark = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > setDefaultSession(null) > setActiveSession(null) > val spark2 = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > import spark.implicits._ > val testData = spark.sparkContext > .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() > testData.createOrReplaceTempView("testTable") > val testData2 = spark.sparkContext.parallelize( > TestData2(1, "1") :: > TestData2(1, "2") :: > TestData2(2, "1") :: > TestData2(2, "2") :: > TestData2(3, "1") :: > TestData2(3, "2") :: > Nil, 2).toDF() > testData2.createOrReplaceTempView("testTable2") > val query = "select ind2,count(*) from ( select * from testTable2 join > testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') > group by ind2" > spark.sql(query).collect() > Thread.sleep(50) > spark.stop() > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore
[ https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648840#comment-17648840 ] Apache Spark commented on SPARK-41555: -- User 'monkeyboy123' has created a pull request for this issue: https://github.com/apache/spark/pull/39101 > Multi sparkSession should share single SQLAppStatusStore > > > Key: SPARK-41555 > URL: https://issues.apache.org/jira/browse/SPARK-41555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.1, 3.3.0 >Reporter: jiahong.li >Priority: Major > Attachments: muti-SQLStore.png, muti-sqltab.png > > > In spark , if we create multi sparkSession in the program, we will get > multi-SQLTab in UI, > At the same time, we will get muti-SQLAppStatusListener object, it is waste > of memory. > code like this: > > {code:java} > // code placeholder > def main(args: Array[String]): Unit = { > val sparkConf = new SparkConf() > .setAppName("demo") > .setMaster("local[*]") > val spark = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > setDefaultSession(null) > setActiveSession(null) > val spark2 = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > import spark.implicits._ > val testData = spark.sparkContext > .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() > testData.createOrReplaceTempView("testTable") > val testData2 = spark.sparkContext.parallelize( > TestData2(1, "1") :: > TestData2(1, "2") :: > TestData2(2, "1") :: > TestData2(2, "2") :: > TestData2(3, "1") :: > TestData2(3, "2") :: > Nil, 2).toDF() > testData2.createOrReplaceTempView("testTable2") > val query = "select ind2,count(*) from ( select * from testTable2 join > testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') > group by ind2" > spark.sql(query).collect() > Thread.sleep(50) > spark.stop() > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore
[ https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-41555: --- Description: In spark , if we create multi sparkSession in the program, we will get multi-SQLTab in UI, At the same time, we will get muti-SQLAppStatusListener object, it is waste of memory. code like this: {code:java} // code placeholder def main(args: Array[String]): Unit = { val sparkConf = new SparkConf() .setAppName("demo") .setMaster("local[*]") val spark = SparkSession.builder() .config(sparkConf) .getOrCreate() setDefaultSession(null) setActiveSession(null) val spark2 = SparkSession.builder() .config(sparkConf) .getOrCreate() import spark.implicits._ val testData = spark.sparkContext .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() testData.createOrReplaceTempView("testTable") val testData2 = spark.sparkContext.parallelize( TestData2(1, "1") :: TestData2(1, "2") :: TestData2(2, "1") :: TestData2(2, "2") :: TestData2(3, "1") :: TestData2(3, "2") :: Nil, 2).toDF() testData2.createOrReplaceTempView("testTable2") val query = "select ind2,count(*) from ( select * from testTable2 join testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group by ind2" spark.sql(query).collect() Thread.sleep(50) spark.stop() } {code} was: In spark , if we create multi sparkSession in the program, we will get multi-SQLTab in UI, At the same time, we will get muti-SQLAppStatusListener object, it is waste of memory. 
code like this: def main(args: Array[String]): Unit = { val sparkConf = new SparkConf() .setAppName("demo") .setMaster("local[*]") val spark = SparkSession.builder() .config(sparkConf) .getOrCreate() setDefaultSession(null) setActiveSession(null) val spark2 = SparkSession.builder() .config(sparkConf) .getOrCreate() import spark.implicits._ val testData = spark.sparkContext .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() testData.createOrReplaceTempView("testTable") val testData2 = spark.sparkContext.parallelize( TestData2(1, "1") :: TestData2(1, "2") :: TestData2(2, "1") :: TestData2(2, "2") :: TestData2(3, "1") :: TestData2(3, "2") :: Nil, 2).toDF() testData2.createOrReplaceTempView("testTable2") val query = "select ind2,count(*) from ( select * from testTable2 join testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group by ind2" spark.sql(query).collect() Thread.sleep(50) spark.stop() } > Multi sparkSession should share single SQLAppStatusStore > > > Key: SPARK-41555 > URL: https://issues.apache.org/jira/browse/SPARK-41555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.1, 3.3.0 >Reporter: jiahong.li >Priority: Major > Attachments: muti-SQLStore.png, muti-sqltab.png > > > In spark , if we create multi sparkSession in the program, we will get > multi-SQLTab in UI, > At the same time, we will get muti-SQLAppStatusListener object, it is waste > of memory. 
> code like this: > > {code:java} > // code placeholder > def main(args: Array[String]): Unit = { > val sparkConf = new SparkConf() > .setAppName("demo") > .setMaster("local[*]") > val spark = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > setDefaultSession(null) > setActiveSession(null) > val spark2 = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > import spark.implicits._ > val testData = spark.sparkContext > .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() > testData.createOrReplaceTempView("testTable") > val testData2 = spark.sparkContext.parallelize( > TestData2(1, "1") :: > TestData2(1, "2") :: > TestData2(2, "1") :: > TestData2(2, "2") :: > TestData2(3, "1") :: > TestData2(3, "2") :: > Nil, 2).toDF() > testData2.createOrReplaceTempView("testTable2") > val query = "select ind2,count(*) from ( select * from testTable2 join > testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') > group by ind2" > spark.sql(query).collect() > Thread.sleep(50) > spark.stop() > } > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore
[ https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-41555: --- Description: In spark , if we create multi sparkSession in the program, we will get multi-SQLTab in UI, At the same time, we will get muti-SQLAppStatusListener object, it is waste of memory. code like this: def main(args: Array[String]): Unit = { val sparkConf = new SparkConf() .setAppName("demo") .setMaster("local[*]") val spark = SparkSession.builder() .config(sparkConf) .getOrCreate() setDefaultSession(null) setActiveSession(null) val spark2 = SparkSession.builder() .config(sparkConf) .getOrCreate() import spark.implicits._ val testData = spark.sparkContext .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() testData.createOrReplaceTempView("testTable") val testData2 = spark.sparkContext.parallelize( TestData2(1, "1") :: TestData2(1, "2") :: TestData2(2, "1") :: TestData2(2, "2") :: TestData2(3, "1") :: TestData2(3, "2") :: Nil, 2).toDF() testData2.createOrReplaceTempView("testTable2") val query = "select ind2,count(*) from ( select * from testTable2 join testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group by ind2" spark.sql(query).collect() Thread.sleep(50) spark.stop() } was: In spark , if we create multi sparkSession in the program, we will get multi-SQLTab in UI, At the same time, we will get muti-SQLAppStatusListener object, it is waste of memory. 
code like this: ``` def main(args: Array[String]): Unit = { val sparkConf = new SparkConf() .setAppName("demo") .setMaster("local[*]") val spark = SparkSession.builder() .config(sparkConf) .getOrCreate() setDefaultSession(null) setActiveSession(null) val spark2 = SparkSession.builder() .config(sparkConf) .getOrCreate() import spark.implicits._ val testData = spark.sparkContext .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() testData.createOrReplaceTempView("testTable") val testData2 = spark.sparkContext.parallelize( TestData2(1, "1") :: TestData2(1, "2") :: TestData2(2, "1") :: TestData2(2, "2") :: TestData2(3, "1") :: TestData2(3, "2") :: Nil, 2).toDF() testData2.createOrReplaceTempView("testTable2") val query = "select ind2,count(*) from ( select * from testTable2 join testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group by ind2" spark.sql(query).collect() Thread.sleep(50) spark.stop() } ``` > Multi sparkSession should share single SQLAppStatusStore > > > Key: SPARK-41555 > URL: https://issues.apache.org/jira/browse/SPARK-41555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.1, 3.3.0 >Reporter: jiahong.li >Priority: Major > Attachments: muti-SQLStore.png, muti-sqltab.png > > > In spark , if we create multi sparkSession in the program, we will get > multi-SQLTab in UI, > At the same time, we will get muti-SQLAppStatusListener object, it is waste > of memory. 
> code like this: > > def main(args: Array[String]): Unit = { > val sparkConf = new SparkConf() > .setAppName("demo") > .setMaster("local[*]") > val spark = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > setDefaultSession(null) > setActiveSession(null) > val spark2 = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > import spark.implicits._ > val testData = spark.sparkContext > .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() > testData.createOrReplaceTempView("testTable") > val testData2 = spark.sparkContext.parallelize( > TestData2(1, "1") :: > TestData2(1, "2") :: > TestData2(2, "1") :: > TestData2(2, "2") :: > TestData2(3, "1") :: > TestData2(3, "2") :: > Nil, 2).toDF() > testData2.createOrReplaceTempView("testTable2") > val query = "select ind2,count(*) from ( select * from testTable2 join > testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') > group by ind2" > spark.sql(query).collect() > Thread.sleep(50) > spark.stop() > } > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore
[ https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-41555: --- Description: In spark , if we create multi sparkSession in the program, we will get multi-SQLTab in UI, At the same time, we will get muti-SQLAppStatusListener object, it is waste of memory. code like this: ``` def main(args: Array[String]): Unit = { val sparkConf = new SparkConf() .setAppName("demo") .setMaster("local[*]") val spark = SparkSession.builder() .config(sparkConf) .getOrCreate() setDefaultSession(null) setActiveSession(null) val spark2 = SparkSession.builder() .config(sparkConf) .getOrCreate() import spark.implicits._ val testData = spark.sparkContext .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() testData.createOrReplaceTempView("testTable") val testData2 = spark.sparkContext.parallelize( TestData2(1, "1") :: TestData2(1, "2") :: TestData2(2, "1") :: TestData2(2, "2") :: TestData2(3, "1") :: TestData2(3, "2") :: Nil, 2).toDF() testData2.createOrReplaceTempView("testTable2") val query = "select ind2,count(*) from ( select * from testTable2 join testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group by ind2" spark.sql(query).collect() Thread.sleep(50) spark.stop() } ``` was: In spark , if we create multi sparkSession in the program, we will get multi-SQLTab in UI, At the same time, we will get muti-SQLAppStatusListener object, it is waste of memory. > Multi sparkSession should share single SQLAppStatusStore > > > Key: SPARK-41555 > URL: https://issues.apache.org/jira/browse/SPARK-41555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.1, 3.3.0 >Reporter: jiahong.li >Priority: Major > Attachments: muti-SQLStore.png, muti-sqltab.png > > > In spark , if we create multi sparkSession in the program, we will get > multi-SQLTab in UI, > At the same time, we will get muti-SQLAppStatusListener object, it is waste > of memory. 
> code like this: > ``` > def main(args: Array[String]): Unit = { > val sparkConf = new SparkConf() > .setAppName("demo") > .setMaster("local[*]") > val spark = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > setDefaultSession(null) > setActiveSession(null) > val spark2 = SparkSession.builder() > .config(sparkConf) > .getOrCreate() > import spark.implicits._ > val testData = spark.sparkContext > .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF() > testData.createOrReplaceTempView("testTable") > val testData2 = spark.sparkContext.parallelize( > TestData2(1, "1") :: > TestData2(1, "2") :: > TestData2(2, "1") :: > TestData2(2, "2") :: > TestData2(3, "1") :: > TestData2(3, "2") :: > Nil, 2).toDF() > testData2.createOrReplaceTempView("testTable2") > val query = "select ind2,count(*) from ( select * from testTable2 join > testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') > group by ind2" > spark.sql(query).collect() > Thread.sleep(50) > spark.stop() > } > ``` -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore
[ https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-41555: --- Description: In spark , if we create multi sparkSession in the program, we will get multi-SQLTab in UI, At the same time, we will get muti-SQLAppStatusListener object, it is waste of memory. was: In spark , if we create multi sparkSession in the program, we will get multi-SQL tab in UI, At the same time, we will get muti-SQLAppStatusListener object, it is waste of memory. > Multi sparkSession should share single SQLAppStatusStore > > > Key: SPARK-41555 > URL: https://issues.apache.org/jira/browse/SPARK-41555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.1, 3.3.0 >Reporter: jiahong.li >Priority: Major > Attachments: muti-SQLStore.png, muti-sqltab.png > > > In spark , if we create multi sparkSession in the program, we will get > multi-SQLTab in UI, > At the same time, we will get muti-SQLAppStatusListener object, it is waste > of memory. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore
[ https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiahong.li updated SPARK-41555: --- Attachment: muti-sqltab.png muti-SQLStore.png > Multi sparkSession should share single SQLAppStatusStore > > > Key: SPARK-41555 > URL: https://issues.apache.org/jira/browse/SPARK-41555 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.1, 3.2.1, 3.3.0 >Reporter: jiahong.li >Priority: Major > Attachments: muti-SQLStore.png, muti-sqltab.png > > > In spark , if we create multi sparkSession in the program, we will get > multi-SQL tab in UI, > At the same time, we will get muti-SQLAppStatusListener object, it is waste > of memory. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore
jiahong.li created SPARK-41555:
--
Summary: Multi sparkSession should share single SQLAppStatusStore
Key: SPARK-41555
URL: https://issues.apache.org/jira/browse/SPARK-41555
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 3.3.0, 3.2.1, 3.1.1
Reporter: jiahong.li

In Spark, if we create multiple SparkSessions in one program, we get multiple SQL tabs in the UI and, at the same time, multiple SQLAppStatusListener objects, which wastes memory.

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper
[ https://issues.apache.org/jira/browse/SPARK-41422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648835#comment-17648835 ] Gengliang Wang edited comment on SPARK-41422 at 12/17/22 12:33 AM: --- [~techaddict] I have a PR for this one already. Sorry I didn't claim it. I will claim next time. The ExecutorMetrics is a bit tricky, so I am doing it by myself. was (Author: gengliang.wang): [~techaddict] I have a PR for this one already. Sorry I didn't claim it. > Protobuf serializer for ExecutorSummaryWrapper > -- > > Key: SPARK-41422 > URL: https://issues.apache.org/jira/browse/SPARK-41422 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper
[ https://issues.apache.org/jira/browse/SPARK-41422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41422: Assignee: Apache Spark > Protobuf serializer for ExecutorSummaryWrapper > -- > > Key: SPARK-41422 > URL: https://issues.apache.org/jira/browse/SPARK-41422 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper
[ https://issues.apache.org/jira/browse/SPARK-41422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648837#comment-17648837 ] Apache Spark commented on SPARK-41422: -- User 'gengliangwang' has created a pull request for this issue: https://github.com/apache/spark/pull/39100 > Protobuf serializer for ExecutorSummaryWrapper > -- > > Key: SPARK-41422 > URL: https://issues.apache.org/jira/browse/SPARK-41422 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper
[ https://issues.apache.org/jira/browse/SPARK-41422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41422: Assignee: (was: Apache Spark) > Protobuf serializer for ExecutorSummaryWrapper > -- > > Key: SPARK-41422 > URL: https://issues.apache.org/jira/browse/SPARK-41422 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41162) Anti-join must not be pushed below aggregation with ambiguous predicates
[ https://issues.apache.org/jira/browse/SPARK-41162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648836#comment-17648836 ] SHU WANG commented on SPARK-41162: -- [~shardulm] Yes. Checked with Spark 3.1.2, and it's also an issue. > Anti-join must not be pushed below aggregation with ambiguous predicates > > > Key: SPARK-41162 > URL: https://issues.apache.org/jira/browse/SPARK-41162 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Enrico Minack >Priority: Major > > The following query should return a single row as all values for {{id}} > except for the largest will be eliminated by the anti-join: > {code} > val ids = Seq(1, 2, 3).toDF("id").distinct() > val result = ids.withColumn("id", $"id" + 1).join(ids, "id", > "left_anti").collect() > assert(result.length == 1) > {code} > Without the {{distinct()}}, the assertion is true. With {{distinct()}}, the > assertion should still hold but is false. > Rule {{PushDownLeftSemiAntiJoin}} pushes the {{Join}} below the left > {{Aggregate}} with join condition {{(id#750 + 1) = id#750}}, which can never > be true. > {code} > === Applying Rule > org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin === > !Join LeftAnti, (id#752 = id#750) 'Aggregate [id#750], > [(id#750 + 1) AS id#752] > !:- Aggregate [id#750], [(id#750 + 1) AS id#752] +- 'Join LeftAnti, > ((id#750 + 1) = id#750) > !: +- LocalRelation [id#750] :- LocalRelation > [id#750] > !+- Aggregate [id#750], [id#750] +- Aggregate [id#750], > [id#750] > ! +- LocalRelation [id#750]+- LocalRelation > [id#750] > {code} > The optimizer then rightly removes the left-anti join altogether, returning > the left child only. > Rule {{PushDownLeftSemiAntiJoin}} should not push down predicates that > reference left *and* right child. 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
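The expected result in the snippet above can be checked with plain Python sets, which model left-anti-join semantics directly (keep left rows that have no match on the right). This only illustrates the intended answer, not the optimizer rule.

```python
# Distinct ids, as produced by ids.distinct() in the report.
ids = {1, 2, 3}

# withColumn("id", $"id" + 1) shifts every id up by one.
shifted = {i + 1 for i in ids}  # {2, 3, 4}

# A left anti-join keeps only left rows with no match on the right,
# so every value except the largest shifted one is eliminated.
result = shifted - ids
```

Here `result` is `{4}`: exactly one row, matching the assertion `result.length == 1` that the mis-pushed plan violates.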
[jira] [Commented] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper
[ https://issues.apache.org/jira/browse/SPARK-41422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648835#comment-17648835 ] Gengliang Wang commented on SPARK-41422: [~techaddict] I have a PR for this one already. Sorry I didn't claim it. > Protobuf serializer for ExecutorSummaryWrapper > -- > > Key: SPARK-41422 > URL: https://issues.apache.org/jira/browse/SPARK-41422 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper
[ https://issues.apache.org/jira/browse/SPARK-41422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648832#comment-17648832 ] Sandeep Singh commented on SPARK-41422: --- Working on this, will create a PR soon. > Protobuf serializer for ExecutorSummaryWrapper > -- > > Key: SPARK-41422 > URL: https://issues.apache.org/jira/browse/SPARK-41422 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41530) Rename MedianHeap to PercentileMap and support percentile
[ https://issues.apache.org/jira/browse/SPARK-41530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-41530: -- Summary: Rename MedianHeap to PercentileMap and support percentile (was: extend MedianHeap to support percentile) > Rename MedianHeap to PercentileMap and support percentile > - > > Key: SPARK-41530 > URL: https://issues.apache.org/jira/browse/SPARK-41530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41530) extend MedianHeap to support percentile
[ https://issues.apache.org/jira/browse/SPARK-41530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41530: - Assignee: Wenchen Fan > extend MedianHeap to support percentile > --- > > Key: SPARK-41530 > URL: https://issues.apache.org/jira/browse/SPARK-41530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41530) extend MedianHeap to support percentile
[ https://issues.apache.org/jira/browse/SPARK-41530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-41530. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39076 [https://github.com/apache/spark/pull/39076] > extend MedianHeap to support percentile > --- > > Key: SPARK-41530 > URL: https://issues.apache.org/jira/browse/SPARK-41530 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41554) Decimal.changePrecision produces ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-41554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648830#comment-17648830 ] Apache Spark commented on SPARK-41554: -- User 'fe2s' has created a pull request for this issue: https://github.com/apache/spark/pull/39099 > Decimal.changePrecision produces ArrayIndexOutOfBoundsException > --- > > Key: SPARK-41554 > URL: https://issues.apache.org/jira/browse/SPARK-41554 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Oleksiy Dyagilev >Priority: Major > > {{Reducing Decimal scale by more than 18 produces exception.}} > {code:java} > Decimal(1, 38, 19).changePrecision(38, 0){code} > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 19 > at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:377) > at > org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:328){code} > Reproducing with SQL query: > {code:java} > sql("select cast(cast(cast(cast(id as decimal(38,15)) as decimal(38,30)) as > decimal(38,37)) as decimal(38,17)) from range(3)").show{code} > The bug exists for {{Decimal}} that is stored using compact long only, it > works fine with {{Decimal}} that uses {{scala.math.BigDecimal}} internally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41554) Decimal.changePrecision produces ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-41554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41554: Assignee: Apache Spark > Decimal.changePrecision produces ArrayIndexOutOfBoundsException > --- > > Key: SPARK-41554 > URL: https://issues.apache.org/jira/browse/SPARK-41554 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Oleksiy Dyagilev >Assignee: Apache Spark >Priority: Major > > {{Reducing Decimal scale by more than 18 produces exception.}} > {code:java} > Decimal(1, 38, 19).changePrecision(38, 0){code} > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 19 > at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:377) > at > org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:328){code} > Reproducing with SQL query: > {code:java} > sql("select cast(cast(cast(cast(id as decimal(38,15)) as decimal(38,30)) as > decimal(38,37)) as decimal(38,17)) from range(3)").show{code} > The bug exists for {{Decimal}} that is stored using compact long only, it > works fine with {{Decimal}} that uses {{scala.math.BigDecimal}} internally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41554) Decimal.changePrecision produces ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-41554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41554: Assignee: (was: Apache Spark) > Decimal.changePrecision produces ArrayIndexOutOfBoundsException > --- > > Key: SPARK-41554 > URL: https://issues.apache.org/jira/browse/SPARK-41554 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Oleksiy Dyagilev >Priority: Major > > {{Reducing Decimal scale by more than 18 produces exception.}} > {code:java} > Decimal(1, 38, 19).changePrecision(38, 0){code} > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 19 > at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:377) > at > org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:328){code} > Reproducing with SQL query: > {code:java} > sql("select cast(cast(cast(cast(id as decimal(38,15)) as decimal(38,30)) as > decimal(38,37)) as decimal(38,17)) from range(3)").show{code} > The bug exists for {{Decimal}} that is stored using compact long only, it > works fine with {{Decimal}} that uses {{scala.math.BigDecimal}} internally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41554) Decimal.changePrecision produces ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-41554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648829#comment-17648829 ] Apache Spark commented on SPARK-41554: -- User 'fe2s' has created a pull request for this issue: https://github.com/apache/spark/pull/39099 > Decimal.changePrecision produces ArrayIndexOutOfBoundsException > --- > > Key: SPARK-41554 > URL: https://issues.apache.org/jira/browse/SPARK-41554 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.1 >Reporter: Oleksiy Dyagilev >Priority: Major > > {{Reducing Decimal scale by more than 18 produces exception.}} > {code:java} > Decimal(1, 38, 19).changePrecision(38, 0){code} > {code:java} > java.lang.ArrayIndexOutOfBoundsException: 19 > at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:377) > at > org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:328){code} > Reproducing with SQL query: > {code:java} > sql("select cast(cast(cast(cast(id as decimal(38,15)) as decimal(38,30)) as > decimal(38,37)) as decimal(38,17)) from range(3)").show{code} > The bug exists for {{Decimal}} that is stored using compact long only, it > works fine with {{Decimal}} that uses {{scala.math.BigDecimal}} internally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41447) Reduce the number of doMergeApplicationListing invocations
[ https://issues.apache.org/jira/browse/SPARK-41447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-41447: -- Priority: Minor (was: Major) > Reduce the number of doMergeApplicationListing invocations > -- > > Key: SPARK-41447 > URL: https://issues.apache.org/jira/browse/SPARK-41447 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: shuyouZZ >Assignee: shuyouZZ >Priority: Minor > Fix For: 3.4.0 > > > When restarting the history server, the previous logic is to execute > {{checkForLogs}} first, which will cause the expired event log files to be > parsed, and then execute {{checkAndCleanLog}} to delete parsed info, which is > unnecessary. In history server log, we can see many {{INFO FsHIstoryProvider: > Finished parsing application_xxx}} followed by {{{}INFO FsHIstoryProvider: > Deleting expired event log for application_xxx{}}}. If there are a large > number of expired log files in the log directory, it will affect the speed of > replay. > In order to avoid this, we can put {{cleanLogs}} before {{{}checkForLogs{}}}. > In addition, since {{cleanLogs}} is executed before {{{}checkForLogs{}}}, > when the history server is starting, the expired log info may not exist in > the listing db, so we need to clean up these log files in {{{}cleanLogs{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
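The reordering the description proposes (run cleanLogs before checkForLogs, so expired files are never parsed) can be sketched with a toy provider. The class and method names below are illustrative, not the FsHistoryProvider API.

```python
class ToyHistoryProvider:
    """Toy model of the fix: clean expired logs first, then scan the rest."""

    def __init__(self, logs, max_age):
        self.logs = dict(logs)  # file name -> last-modified time
        self.max_age = max_age
        self.parsed = []

    def clean_logs(self, now):
        # Delete expired files up front so they are never replayed.
        self.logs = {name: t for name, t in self.logs.items()
                     if now - t <= self.max_age}

    def check_for_logs(self):
        for name in self.logs:
            self.parsed.append(name)  # stands in for parsing/replay

    def start(self, now):
        self.clean_logs(now)      # cleanLogs before checkForLogs
        self.check_for_logs()

provider = ToyHistoryProvider({"app-old": 0, "app-new": 95}, max_age=10)
provider.start(now=100)
```

After `start`, only `app-new` is parsed; `app-old` is removed before the scan, avoiding the "Finished parsing" immediately followed by "Deleting expired event log" pattern described above.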
[jira] [Updated] (SPARK-41447) Reduce the number of doMergeApplicationListing invocations
[ https://issues.apache.org/jira/browse/SPARK-41447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-41447: -- Summary: Reduce the number of doMergeApplicationListing invocations (was: clean up expired event log files that don't exist in listing db) > Reduce the number of doMergeApplicationListing invocations > -- > > Key: SPARK-41447 > URL: https://issues.apache.org/jira/browse/SPARK-41447 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: shuyouZZ >Assignee: shuyouZZ >Priority: Major > Fix For: 3.4.0 > > > When restarting the history server, the previous logic is to execute > {{checkForLogs}} first, which will cause the expired event log files to be > parsed, and then execute {{checkAndCleanLog}} to delete parsed info, which is > unnecessary. In history server log, we can see many {{INFO FsHIstoryProvider: > Finished parsing application_xxx}} followed by {{{}INFO FsHIstoryProvider: > Deleting expired event log for application_xxx{}}}. If there are a large > number of expired log files in the log directory, it will affect the speed of > replay. > In order to avoid this, we can put {{cleanLogs}} before {{{}checkForLogs{}}}. > In addition, since {{cleanLogs}} is executed before {{{}checkForLogs{}}}, > when the history server is starting, the expired log info may not exist in > the listing db, so we need to clean up these log files in {{{}cleanLogs{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-41447) clean up expired event log files that don't exist in listing db
[ https://issues.apache.org/jira/browse/SPARK-41447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-41447. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 38983 [https://github.com/apache/spark/pull/38983] > clean up expired event log files that don't exist in listing db > --- > > Key: SPARK-41447 > URL: https://issues.apache.org/jira/browse/SPARK-41447 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: shuyouZZ >Assignee: shuyouZZ >Priority: Major > Fix For: 3.4.0 > > > When restarting the history server, the previous logic is to execute > {{checkForLogs}} first, which will cause the expired event log files to be > parsed, and then execute {{checkAndCleanLog}} to delete parsed info, which is > unnecessary. In history server log, we can see many {{INFO FsHIstoryProvider: > Finished parsing application_xxx}} followed by {{{}INFO FsHIstoryProvider: > Deleting expired event log for application_xxx{}}}. If there are a large > number of expired log files in the log directory, it will affect the speed of > replay. > In order to avoid this, we can put {{cleanLogs}} before {{{}checkForLogs{}}}. > In addition, since {{cleanLogs}} is executed before {{{}checkForLogs{}}}, > when the history server is starting, the expired log info may not exist in > the listing db, so we need to clean up these log files in {{{}cleanLogs{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41447) clean up expired event log files that don't exist in listing db
[ https://issues.apache.org/jira/browse/SPARK-41447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41447: - Assignee: shuyouZZ > clean up expired event log files that don't exist in listing db > --- > > Key: SPARK-41447 > URL: https://issues.apache.org/jira/browse/SPARK-41447 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: shuyouZZ >Assignee: shuyouZZ >Priority: Major > > When restarting the history server, the previous logic is to execute > {{checkForLogs}} first, which will cause the expired event log files to be > parsed, and then execute {{checkAndCleanLog}} to delete parsed info, which is > unnecessary. In history server log, we can see many {{INFO FsHIstoryProvider: > Finished parsing application_xxx}} followed by {{{}INFO FsHIstoryProvider: > Deleting expired event log for application_xxx{}}}. If there are a large > number of expired log files in the log directory, it will affect the speed of > replay. > In order to avoid this, we can put {{cleanLogs}} before {{{}checkForLogs{}}}. > In addition, since {{cleanLogs}} is executed before {{{}checkForLogs{}}}, > when the history server is starting, the expired log info may not exist in > the listing db, so we need to clean up these log files in {{{}cleanLogs{}}}. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41554) Decimal.changePrecision produces ArrayIndexOutOfBoundsException
Oleksiy Dyagilev created SPARK-41554: Summary: Decimal.changePrecision produces ArrayIndexOutOfBoundsException Key: SPARK-41554 URL: https://issues.apache.org/jira/browse/SPARK-41554 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.3.1 Reporter: Oleksiy Dyagilev {{Reducing Decimal scale by more than 18 produces exception.}} {code:java} Decimal(1, 38, 19).changePrecision(38, 0){code} {code:java} java.lang.ArrayIndexOutOfBoundsException: 19 at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:377) at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:328){code} Reproducing with SQL query: {code:java} sql("select cast(cast(cast(cast(id as decimal(38,15)) as decimal(38,30)) as decimal(38,37)) as decimal(38,17)) from range(3)").show{code} The bug exists for {{Decimal}} that is stored using compact long only, it works fine with {{Decimal}} that uses {{scala.math.BigDecimal}} internally. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
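The stack trace is consistent with a fixed powers-of-ten lookup table on the compact (long-backed) path: 10^18 is the largest power of ten that fits in a signed 64-bit long, so a table indexed by the scale change has 19 entries, and a scale drop of 19 reads past its end. The sketch below imitates that shape in pure Python; it is not Spark's Decimal implementation.

```python
# 19-entry table, 10**0 .. 10**18 -- 10**18 is the largest power of ten
# that fits in a signed 64-bit long (assumed analogue of the compact path).
POW_10 = [10 ** i for i in range(19)]

def change_scale(unscaled, scale, new_scale):
    """Rescale the unscaled long value of a compact decimal (sketch)."""
    diff = scale - new_scale
    if diff > 0:
        # Mirrors the reported failure: once diff > 18 this index is out
        # of range, as in Decimal(1, 38, 19).changePrecision(38, 0).
        return unscaled // POW_10[diff]
    return unscaled * POW_10[-diff]
```

In this sketch `change_scale(1, 19, 0)` raises `IndexError`, the Python analogue of the `ArrayIndexOutOfBoundsException: 19` above; a long-backed implementation has to special-case scale drops larger than 18 instead of indexing the table.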
[jira] [Resolved] (SPARK-41552) Upgrade kubernetes-client to 6.3.1
[ https://issues.apache.org/jira/browse/SPARK-41552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-41552. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 39094 [https://github.com/apache/spark/pull/39094] > Upgrade kubernetes-client to 6.3.1 > -- > > Key: SPARK-41552 > URL: https://issues.apache.org/jira/browse/SPARK-41552 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41553) Change num_files to repartition
[ https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648795#comment-17648795 ] Apache Spark commented on SPARK-41553: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/39098 > Change num_files to repartition > --- > > Key: SPARK-41553 > URL: https://issues.apache.org/jira/browse/SPARK-41553 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Functions have this signature. > > def to_json( > (..) > num_files: Optional[int] = None, > > > .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and > writes > multiple `part-...` files in the directory when `path` is specified. > This behavior was inherited from Apache Spark. The number of files can > be controlled by `num_files`. > > > > if num_files is not None: > warnings.warn( > "`num_files` has been deprecated and might be removed in a future version. " > "Use `DataFrame.spark.repartition` instead.", > FutureWarning, > ) > > > I will change num_files to repartition -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41553) Change num_files to repartition
[ https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41553: Assignee: Apache Spark > Change num_files to repartition > --- > > Key: SPARK-41553 > URL: https://issues.apache.org/jira/browse/SPARK-41553 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Assignee: Apache Spark >Priority: Major > > Functions have this signature. > > def to_json( > (..) > num_files: Optional[int] = None, > > > .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and > writes > multiple `part-...` files in the directory when `path` is specified. > This behavior was inherited from Apache Spark. The number of files can > be controlled by `num_files`. > > > > if num_files is not None: > warnings.warn( > "`num_files` has been deprecated and might be removed in a future version. " > "Use `DataFrame.spark.repartition` instead.", > FutureWarning, > ) > > > I will change num_files to repartition -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41553) Change num_files to repartition
[ https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648793#comment-17648793 ] Apache Spark commented on SPARK-41553: -- User 'bjornjorgensen' has created a pull request for this issue: https://github.com/apache/spark/pull/39098 > Change num_files to repartition > --- > > Key: SPARK-41553 > URL: https://issues.apache.org/jira/browse/SPARK-41553 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Functions have this signature. > > def to_json( > (..) > num_files: Optional[int] = None, > > > .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and > writes > multiple `part-...` files in the directory when `path` is specified. > This behavior was inherited from Apache Spark. The number of files can > be controlled by `num_files`. > > > > if num_files is not None: > warnings.warn( > "`num_files` has been deprecated and might be removed in a future version. " > "Use `DataFrame.spark.repartition` instead.", > FutureWarning, > ) > > > I will change num_files to repartition -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41553) Change num_files to repartition
[ https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41553: Assignee: (was: Apache Spark) > Change num_files to repartition > --- > > Key: SPARK-41553 > URL: https://issues.apache.org/jira/browse/SPARK-41553 > Project: Spark > Issue Type: Improvement > Components: Pandas API on Spark >Affects Versions: 3.4.0 >Reporter: Bjørn Jørgensen >Priority: Major > > Functions have this signature. > > def to_json( > (..) > num_files: Optional[int] = None, > > > .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and > writes > multiple `part-...` files in the directory when `path` is specified. > This behavior was inherited from Apache Spark. The number of files can > be controlled by `num_files`. > > > > if num_files is not None: > warnings.warn( > "`num_files` has been deprecated and might be removed in a future version. " > "Use `DataFrame.spark.repartition` instead.", > FutureWarning, > ) > > > I will change num_files to repartition -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
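The docstring and warning quoted in the description follow the standard `warnings`-module deprecation pattern. A standalone sketch follows; the function body is hypothetical, and only the warning text comes from the report.

```python
import warnings
from typing import Optional

def to_json(path: str, num_files: Optional[int] = None, **options) -> None:
    """Write JSON part files under `path` (writing elided in this sketch)."""
    if num_files is not None:
        # Warn once per call site when the deprecated argument is passed.
        warnings.warn(
            "`num_files` has been deprecated and might be removed in a "
            "future version. Use `DataFrame.spark.repartition` instead.",
            FutureWarning,
        )
    # ... repartition-and-write would happen here ...
```

`FutureWarning` (rather than `DeprecationWarning`) is the user-facing choice here: Python shows it by default even when the caller is application code, whereas `DeprecationWarning` is hidden outside `__main__`.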
[jira] [Assigned] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions
[ https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41049: Assignee: Apache Spark > Nondeterministic expressions have unstable values if they are children of > CodegenFallback expressions > - > > Key: SPARK-41049 > URL: https://issues.apache.org/jira/browse/SPARK-41049 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Guy Boo >Assignee: Apache Spark >Priority: Major > > h2. Expectation > For a given row, Nondeterministic expressions are expected to have stable > values. > {code:scala} > import org.apache.spark.sql.functions._ > val df = sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > df.select(v1, v1).collect{code} > Returns a set like this: > |8777|8777| > |1357|1357| > |3435|3435| > |9204|9204| > |3870|3870| > where both columns always have the same value, but what that value is changes > from row to row. This is different from the following: > {code:scala} > df.select(rand(), rand()).collect{code} > In this case, because the rand() calls are distinct, the values in both > columns should be different. > h2. Problem > This expectation does not appear to be stable in the event that any > subsequent expression is a CodegenFallback. 
This program: > {code:scala} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > val sparkSession = SparkSession.builder().getOrCreate() > val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback > df.select(v1, v1, v2, v2).collect {code} > produces output like this: > |8159|8159|8159|2028| > |8320|8320|8320|1640| > |7937|7937|7937|769| > |436|436|436|8924| > |8924|8924|2827|2731| > Not sure why the first call via the CodegenFallback path should be correct > while subsequent calls aren't. > h2. Workaround > If the Nondeterministic expression is moved to a separate, earlier select() > call, so the CodegenFallback instead only refers to a column reference, then > the problem seems to go away. But this workaround may not be reliable if > optimization is ever able to restructure adjacent select()s. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
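The instability described in this issue can be mimicked outside Spark. The sketch below is plain Python and purely illustrative: `NondeterministicExpr`, `row_reevaluating`, and `row_materialized` are invented names, not Spark classes. It models a nondeterministic expression that yields a fresh value on every `eval()` (a deterministic counter stands in for `rand()` so the behavior is reproducible); a consumer that re-evaluates the child, as the CodegenFallback path appears to do, sees different values within one row, while materializing the value once, like the earlier-`select()` workaround, keeps the row stable:

```python
import itertools

class NondeterministicExpr:
    """Toy stand-in for a nondeterministic expression: each eval()
    returns a fresh value (a counter here, in place of rand())."""
    def __init__(self):
        self._counter = itertools.count()

    def eval(self):
        return next(self._counter)

def row_reevaluating(expr):
    # Models the buggy behavior: both "columns" re-evaluate the child
    # expression, so one row sees two different values.
    return (expr.eval(), expr.eval())

def row_materialized(expr):
    # Models the workaround: evaluate once in an earlier projection,
    # then refer to the materialized column in both places.
    v = expr.eval()
    return (v, v)

unstable = row_reevaluating(NondeterministicExpr())  # (0, 1): values differ
stable = row_materialized(NondeterministicExpr())    # (0, 0): values agree
```

The fix direction suggested by the workaround is the same: ensure the nondeterministic child is evaluated exactly once per row before any consumer, codegen'd or fallback, reads it.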
[jira] [Commented] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions
[ https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648792#comment-17648792 ] Apache Spark commented on SPARK-41049: -- User 'NarekDW' has created a pull request for this issue: https://github.com/apache/spark/pull/39097 > Nondeterministic expressions have unstable values if they are children of > CodegenFallback expressions > - > > Key: SPARK-41049 > URL: https://issues.apache.org/jira/browse/SPARK-41049 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Guy Boo >Priority: Major > > h2. Expectation > For a given row, Nondeterministic expressions are expected to have stable > values. > {code:scala} > import org.apache.spark.sql.functions._ > val df = sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > df.select(v1, v1).collect{code} > Returns a set like this: > |8777|8777| > |1357|1357| > |3435|3435| > |9204|9204| > |3870|3870| > where both columns always have the same value, but what that value is changes > from row to row. This is different from the following: > {code:scala} > df.select(rand(), rand()).collect{code} > In this case, because the rand() calls are distinct, the values in both > columns should be different. > h2. Problem > This expectation does not appear to be stable in the event that any > subsequent expression is a CodegenFallback. 
This program: > {code:scala} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > val sparkSession = SparkSession.builder().getOrCreate() > val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback > df.select(v1, v1, v2, v2).collect {code} > produces output like this: > |8159|8159|8159|{color:#ff}2028{color}| > |8320|8320|8320|{color:#ff}1640{color}| > |7937|7937|7937|{color:#ff}769{color}| > |436|436|436|{color:#ff}8924{color}| > |8924|8924|2827|{color:#ff}2731{color}| > Not sure why the first call via the CodegenFallback path should be correct > while subsequent calls aren't. > h2. Workaround > If the Nondeterministic expression is moved to a separate, earlier select() > call, so the CodegenFallback instead only refers to a column reference, then > the problem seems to go away. But this workaround may not be reliable if > optimization is ever able to restructure adjacent select()s. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions
[ https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41049: Assignee: (was: Apache Spark) > Nondeterministic expressions have unstable values if they are children of > CodegenFallback expressions > - > > Key: SPARK-41049 > URL: https://issues.apache.org/jira/browse/SPARK-41049 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2 >Reporter: Guy Boo >Priority: Major > > h2. Expectation > For a given row, Nondeterministic expressions are expected to have stable > values. > {code:scala} > import org.apache.spark.sql.functions._ > val df = sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > df.select(v1, v1).collect{code} > Returns a set like this: > |8777|8777| > |1357|1357| > |3435|3435| > |9204|9204| > |3870|3870| > where both columns always have the same value, but what that value is changes > from row to row. This is different from the following: > {code:scala} > df.select(rand(), rand()).collect{code} > In this case, because the rand() calls are distinct, the values in both > columns should be different. > h2. Problem > This expectation does not appear to be stable in the event that any > subsequent expression is a CodegenFallback. 
This program: > {code:scala} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > import org.apache.spark.sql.functions._ > val sparkSession = SparkSession.builder().getOrCreate() > val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x") > val v1 = rand().*(lit(1)).cast(IntegerType) > val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback > df.select(v1, v1, v2, v2).collect {code} > produces output like this: > |8159|8159|8159|{color:#ff}2028{color}| > |8320|8320|8320|{color:#ff}1640{color}| > |7937|7937|7937|{color:#ff}769{color}| > |436|436|436|{color:#ff}8924{color}| > |8924|8924|2827|{color:#ff}2731{color}| > Not sure why the first call via the CodegenFallback path should be correct > while subsequent calls aren't. > h2. Workaround > If the Nondeterministic expression is moved to a separate, earlier select() > call, so the CodegenFallback instead only refers to a column reference, then > the problem seems to go away. But this workaround may not be reliable if > optimization is ever able to restructure adjacent select()s. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41553) Change num_files to repartition
Bjørn Jørgensen created SPARK-41553: --- Summary: Change num_files to repartition Key: SPARK-41553 URL: https://issues.apache.org/jira/browse/SPARK-41553 Project: Spark Issue Type: Improvement Components: Pandas API on Spark Affects Versions: 3.4.0 Reporter: Bjørn Jørgensen Functions have this signature. def to_json( (..) num_files: Optional[int] = None, .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and writes multiple `part-...` files in the directory when `path` is specified. This behavior was inherited from Apache Spark. The number of files can be controlled by `num_files`. if num_files is not None: warnings.warn( "`num_files` has been deprecated and might be removed in a future version. " "Use `DataFrame.spark.repartition` instead.", FutureWarning, ) I will change num_files to repartition -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648786#comment-17648786 ] Gengliang Wang edited comment on SPARK-41053 at 12/16/22 8:30 PM: -- [~techaddict] Thanks! Feel free to take any one of the subtasks and leave a comment that you are working on it. You can follow the PR for TaskDataWrapper: [https://github.com/apache/spark/pull/39048] was (Author: gengliang.wang): [~techaddict] Thanks! Feel free to take any one of the subtasks. You can follow the PR for TaskDataWrapper: https://github.com/apache/spark/pull/39048 > Better Spark UI scalability and Driver stability for large applications > --- > > Key: SPARK-41053 > URL: https://issues.apache.org/jira/browse/SPARK-41053 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > Attachments: Better Spark UI scalability and Driver stability for > large applications.pdf > > > After SPARK-18085, the Spark history server (SHS) became more scalable for > processing large applications by supporting a persistent > KV-store (LevelDB/RocksDB) as the storage layer. > For the live Spark UI, however, all the data is still stored in memory, which can > put memory pressure on the Spark driver for large applications. > For better Spark UI scalability and Driver stability, I propose to: > * {*}Support storing all the UI data in a persistent KV store{*}. > RocksDB/LevelDB provide low memory overhead, and their write/read performance is > fast enough to serve the read/write workload of the live UI. SHS can leverage > the persistent KV store to speed up its startup. > * *Support a new Protobuf serializer for all the UI data.* The new > serializer is expected to be faster, according to benchmarks. It will be the > default serializer for the persistent KV store of the live UI; for event logs, > it is optional. The current serializer for UI data is JSON. 
When writing to the > persistent KV-store, GZip compression is applied. Since there is compression > support in RocksDB/LevelDB, the new serializer won’t compress the output > before writing to the persistent KV store. Here is a benchmark of > writing/reading 100,000 SQLExecutionUIData to/from RocksDB: > > |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total > Size(MB)*|*Result total size in memory(MB)*| > |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868| > |*Protobuf*|109.9|34.3|858|2105| > I am also proposing to support only RocksDB, instead of both LevelDB & RocksDB, in > the live UI. > SPIP: > [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing] > SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
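The trade-off visible in the benchmark above, a compressed text format that is slow to encode versus a binary format that encodes faster but occupies more memory once decoded, can be sketched with stdlib-only stand-ins. Spark's actual UI classes and Protobuf schemas are not available here, so `gzip`+`json` represents the current serializer and `pickle` stands in for a binary serializer that relies on the KV store's own compression:

```python
import gzip
import json
import pickle

# A hypothetical UI record, standing in for SQLExecutionUIData.
record = {"id": 42, "description": "query", "metrics": list(range(100))}

# Current KVStore path: JSON text, gzip-compressed before writing.
json_gz = gzip.compress(json.dumps(record).encode("utf-8"))

# Binary serializer (pickle as a stand-in for Protobuf): no extra
# compression layer, mirroring the proposal to lean on RocksDB/LevelDB's
# built-in compression instead of compressing in the serializer.
binary = pickle.dumps(record)

# Both representations round-trip to the same record; they differ in
# encode/decode cost and on-disk/in-memory size, which is exactly what
# the benchmark table measures.
decoded_json = json.loads(gzip.decompress(json_gz).decode("utf-8"))
decoded_bin = pickle.loads(binary)
```

The design question the table answers is where compression should live: in the serializer (JSON+gzip today) or in the storage engine (the Protobuf proposal).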
[jira] [Commented] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648786#comment-17648786 ] Gengliang Wang commented on SPARK-41053: [~techaddict] Thanks! Feel free to take anyone from the subtasks. You can follow the PR for TaskDataWrapper: https://github.com/apache/spark/pull/39048 > Better Spark UI scalability and Driver stability for large applications > --- > > Key: SPARK-41053 > URL: https://issues.apache.org/jira/browse/SPARK-41053 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > Attachments: Better Spark UI scalability and Driver stability for > large applications.pdf > > > After SPARK-18085, the Spark history server(SHS) becomes more scalable for > processing large applications by supporting a persistent > KV-store(LevelDB/RocksDB) as the storage layer. > As for the live Spark UI, all the data is still stored in memory, which can > bring memory pressures to the Spark driver for large applications. > For better Spark UI scalability and Driver stability, I propose to > * {*}Support storing all the UI data in a persistent KV store{*}. > RocksDB/LevelDB provides low memory overhead. Their write/read performance is > fast enough to serve the write/read workload for live UI. SHS can leverage > the persistent KV store to fasten its startup. > * *Support a new Protobuf serializer for all the UI data.* The new > serializer is supposed to be faster, according to benchmarks. It will be the > default serializer for the persistent KV store of live UI. As for event logs, > it is optional. The current serializer for UI data is JSON. When writing > persistent KV-store, there is GZip compression. Since there is compression > support in RocksDB/LevelDB, the new serializer won’t compress the output > before writing to the persistent KV store. 
Here is a benchmark of > writing/reading 100,000 SQLExecutionUIData to/from RocksDB: > > |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total > Size(MB)*|*Result total size in memory(MB)*| > |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868| > |*Protobuf*|109.9|34.3|858|2105| > I am also proposing to support RocksDB instead of both LevelDB & RocksDB in > the live UI. > SPIP: > [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing] > SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper
[ https://issues.apache.org/jira/browse/SPARK-41421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648785#comment-17648785 ] Apache Spark commented on SPARK-41421: -- User 'techaddict' has created a pull request for this issue: https://github.com/apache/spark/pull/39096 > Protobuf serializer for ApplicationEnvironmentInfoWrapper > - > > Key: SPARK-41421 > URL: https://issues.apache.org/jira/browse/SPARK-41421 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper
[ https://issues.apache.org/jira/browse/SPARK-41421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41421: Assignee: Apache Spark > Protobuf serializer for ApplicationEnvironmentInfoWrapper > - > > Key: SPARK-41421 > URL: https://issues.apache.org/jira/browse/SPARK-41421 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper
[ https://issues.apache.org/jira/browse/SPARK-41421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41421: Assignee: (was: Apache Spark) > Protobuf serializer for ApplicationEnvironmentInfoWrapper > - > > Key: SPARK-41421 > URL: https://issues.apache.org/jira/browse/SPARK-41421 > Project: Spark > Issue Type: Sub-task > Components: Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications
[ https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648782#comment-17648782 ] Sandeep Singh commented on SPARK-41053: --- [~Gengliang.Wang] I'm willing to take some tasks from this list > Better Spark UI scalability and Driver stability for large applications > --- > > Key: SPARK-41053 > URL: https://issues.apache.org/jira/browse/SPARK-41053 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 3.4.0 >Reporter: Gengliang Wang >Priority: Major > Attachments: Better Spark UI scalability and Driver stability for > large applications.pdf > > > After SPARK-18085, the Spark history server(SHS) becomes more scalable for > processing large applications by supporting a persistent > KV-store(LevelDB/RocksDB) as the storage layer. > As for the live Spark UI, all the data is still stored in memory, which can > bring memory pressures to the Spark driver for large applications. > For better Spark UI scalability and Driver stability, I propose to > * {*}Support storing all the UI data in a persistent KV store{*}. > RocksDB/LevelDB provides low memory overhead. Their write/read performance is > fast enough to serve the write/read workload for live UI. SHS can leverage > the persistent KV store to fasten its startup. > * *Support a new Protobuf serializer for all the UI data.* The new > serializer is supposed to be faster, according to benchmarks. It will be the > default serializer for the persistent KV store of live UI. As for event logs, > it is optional. The current serializer for UI data is JSON. When writing > persistent KV-store, there is GZip compression. Since there is compression > support in RocksDB/LevelDB, the new serializer won’t compress the output > before writing to the persistent KV store. 
Here is a benchmark of > writing/reading 100,000 SQLExecutionUIData to/from RocksDB: > > |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total > Size(MB)*|*Result total size in memory(MB)*| > |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868| > |*Protobuf*|109.9|34.3|858|2105| > I am also proposing to support RocksDB instead of both LevelDB & RocksDB in > the live UI. > SPIP: > [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing] > SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41365) Stages UI page fails to load for proxy in some yarn versions
[ https://issues.apache.org/jira/browse/SPARK-41365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41365: - Assignee: miracle > Stages UI page fails to load for proxy in some yarn versions > - > > Key: SPARK-41365 > URL: https://issues.apache.org/jira/browse/SPARK-41365 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.1 > Environment: as above >Reporter: Mars >Assignee: miracle >Priority: Major > Fix For: 3.3.2, 3.4.0 > > Attachments: image-2022-12-02-17-53-03-003.png > > > My environment is CDH 5.8. I click through to the Spark UI from the YARN interface; > when visiting the stage URI, the page fails to load. The URI is > {code:java} > http://:8088/proxy/application_1669877165233_0021/stages/stage/?id=0&attempt=0 > {code} > !image-2022-12-02-17-53-03-003.png|width=430,height=697! > Server error stack trace: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.status.api.v1.StagesResource.$anonfun$doPagination$1(StagesResource.scala:207) > at > org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142) > at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135) > at > org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31) > at > org.apache.spark.status.api.v1.StagesResource.doPagination(StagesResource.scala:206) > at > org.apache.spark.status.api.v1.StagesResource.$anonfun$taskTable$1(StagesResource.scala:161) > at > org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142) > at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135) 
> at > org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31) > at > org.apache.spark.status.api.v1.StagesResource.taskTable(StagesResource.scala:145) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){code} > > This issue is similar to the following two; the final symptom is the same, > because the parameter is encoded twice: > https://issues.apache.org/jira/browse/SPARK-32467 > https://issues.apache.org/jira/browse/SPARK-33611 > Those two issues address two scenarios that avoid double encoding: > 1. https redirect proxy > 2. reverse proxy enabled (spark.ui.reverseProxy) in Nginx > But if the parameter is encoded twice for other reasons, as in this issue (YARN proxy), it > will also fail. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
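The double-encoding failure described in this report is easy to reproduce with the stdlib. The parameter name below is only an illustrative stage-page query parameter; the point is that a value percent-encoded twice no longer matches the original after the single decode the server performs:

```python
from urllib.parse import quote, unquote

# Illustrative stage-page query parameter (not the exact one in the report).
param = "order.col=Task ID&order.desc=true"

encoded_once = quote(param, safe="")          # what the browser sends
encoded_twice = quote(encoded_once, safe="")  # a proxy re-encoding the URL

# A server that decodes once still sees percent-escapes, not the real value:
seen_by_server = unquote(encoded_twice)
# Only a second decode recovers the original parameter:
recovered = unquote(seen_by_server)
```

After one decode, `seen_by_server` equals the singly-encoded string rather than `param`, which is why parsing the pagination parameters then fails downstream.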
[jira] [Assigned] (SPARK-41365) Stages UI page fails to load for proxy in some yarn versions
[ https://issues.apache.org/jira/browse/SPARK-41365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41365: - Assignee: Mars (was: miracle) > Stages UI page fails to load for proxy in some yarn versions > - > > Key: SPARK-41365 > URL: https://issues.apache.org/jira/browse/SPARK-41365 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.1 > Environment: as above >Reporter: Mars >Assignee: Mars >Priority: Major > Fix For: 3.3.2, 3.4.0 > > Attachments: image-2022-12-02-17-53-03-003.png > > > My environment CDH 5.8 , click to enter the spark UI from the yarn interface > when visit the stage URI, it fails to load, URI is > {code:java} > http://:8088/proxy/application_1669877165233_0021/stages/stage/?id=0&attempt=0 > {code} > !image-2022-12-02-17-53-03-003.png|width=430,height=697! > Server error stack trace: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.status.api.v1.StagesResource.$anonfun$doPagination$1(StagesResource.scala:207) > at > org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142) > at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135) > at > org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31) > at > org.apache.spark.status.api.v1.StagesResource.doPagination(StagesResource.scala:206) > at > org.apache.spark.status.api.v1.StagesResource.$anonfun$taskTable$1(StagesResource.scala:161) > at > org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142) > at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137) > at > 
org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135) > at > org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31) > at > org.apache.spark.status.api.v1.StagesResource.taskTable(StagesResource.scala:145) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){code} > > The issue is similar to, the final phenomenon of the issue is the same, > because the parameter encode twice > https://issues.apache.org/jira/browse/SPARK-32467 > https://issues.apache.org/jira/browse/SPARK-33611 > The two issues solve two scenarios to avoid encode twice: > 1. https redirect proxy > 2. set reverse proxy enabled (spark.ui.reverseProxy) in Nginx > But if encode twice due to other reasons, such as this issue (yarn proxy), it > will also fail -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41365) Stages UI page fails to load for proxy in some yarn versions
[ https://issues.apache.org/jira/browse/SPARK-41365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41365: - Assignee: (was: miracle) > Stages UI page fails to load for proxy in some yarn versions > - > > Key: SPARK-41365 > URL: https://issues.apache.org/jira/browse/SPARK-41365 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.1 > Environment: as above >Reporter: Mars >Priority: Major > Fix For: 3.3.2, 3.4.0 > > Attachments: image-2022-12-02-17-53-03-003.png > > > My environment CDH 5.8 , click to enter the spark UI from the yarn interface > when visit the stage URI, it fails to load, URI is > {code:java} > http://:8088/proxy/application_1669877165233_0021/stages/stage/?id=0&attempt=0 > {code} > !image-2022-12-02-17-53-03-003.png|width=430,height=697! > Server error stack trace: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.status.api.v1.StagesResource.$anonfun$doPagination$1(StagesResource.scala:207) > at > org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142) > at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135) > at > org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31) > at > org.apache.spark.status.api.v1.StagesResource.doPagination(StagesResource.scala:206) > at > org.apache.spark.status.api.v1.StagesResource.$anonfun$taskTable$1(StagesResource.scala:161) > at > org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142) > at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135) > at > 
org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31) > at > org.apache.spark.status.api.v1.StagesResource.taskTable(StagesResource.scala:145) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){code} > > The issue is similar to, the final phenomenon of the issue is the same, > because the parameter encode twice > https://issues.apache.org/jira/browse/SPARK-32467 > https://issues.apache.org/jira/browse/SPARK-33611 > The two issues solve two scenarios to avoid encode twice: > 1. https redirect proxy > 2. set reverse proxy enabled (spark.ui.reverseProxy) in Nginx > But if encode twice due to other reasons, such as this issue (yarn proxy), it > will also fail -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41365) Stages UI page fails to load for proxy in some yarn versions
[ https://issues.apache.org/jira/browse/SPARK-41365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41365: - Assignee: miracle > Stages UI page fails to load for proxy in some yarn versions > - > > Key: SPARK-41365 > URL: https://issues.apache.org/jira/browse/SPARK-41365 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.1 > Environment: as above >Reporter: Mars >Assignee: miracle >Priority: Major > Fix For: 3.3.2, 3.4.0 > > Attachments: image-2022-12-02-17-53-03-003.png > > > My environment CDH 5.8 , click to enter the spark UI from the yarn interface > when visit the stage URI, it fails to load, URI is > {code:java} > http://:8088/proxy/application_1669877165233_0021/stages/stage/?id=0&attempt=0 > {code} > !image-2022-12-02-17-53-03-003.png|width=430,height=697! > Server error stack trace: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.status.api.v1.StagesResource.$anonfun$doPagination$1(StagesResource.scala:207) > at > org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142) > at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135) > at > org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31) > at > org.apache.spark.status.api.v1.StagesResource.doPagination(StagesResource.scala:206) > at > org.apache.spark.status.api.v1.StagesResource.$anonfun$taskTable$1(StagesResource.scala:161) > at > org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142) > at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135) 
> at > org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31) > at > org.apache.spark.status.api.v1.StagesResource.taskTable(StagesResource.scala:145) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){code} > > This issue is similar to SPARK-32467 and SPARK-33611: the symptom is the same because the URL parameters are encoded twice. > https://issues.apache.org/jira/browse/SPARK-32467 > https://issues.apache.org/jira/browse/SPARK-33611 > Those two issues fixed double encoding in two scenarios: > 1. an https redirect proxy > 2. reverse proxying (spark.ui.reverseProxy enabled) behind Nginx > But when the parameters are encoded twice for any other reason, such as the yarn proxy in this issue, the page fails to load as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
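The double-encoding failure described in this issue can be reproduced outside Spark. Below is a minimal, hypothetical Java sketch (class and helper names are invented for illustration): a query string that two proxies encode in a row can no longer be recovered by a server that decodes it only once.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class DoubleEncodeDemo {
    // One encoding pass, as a redirect or proxy might apply it.
    static String encode(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String query = "id=0&attempt=0";
        String once = encode(query);   // "id%3D0%26attempt%3D0"
        String twice = encode(once);   // "id%253D0%2526attempt%253D0"

        // A server that decodes a single time gets back the
        // singly-encoded string, not the original parameters,
        // so the id/attempt lookup finds nothing.
        String decodedOnce = URLDecoder.decode(twice, StandardCharsets.UTF_8);
        System.out.println(decodedOnce.equals(query)); // false
        System.out.println(decodedOnce.equals(once));  // true
    }
}
```

This is why fixing one proxy scenario at a time (SPARK-32467, SPARK-33611) still leaves other double-encoding paths broken.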
[jira] [Resolved] (SPARK-41365) Stages UI page fails to load for proxy in some yarn versions
[ https://issues.apache.org/jira/browse/SPARK-41365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-41365. --- Fix Version/s: 3.3.2 3.4.0 Target Version/s: (was: 3.3.1) Resolution: Fixed > Stages UI page fails to load for proxy in some yarn versions > - > > Key: SPARK-41365 > URL: https://issues.apache.org/jira/browse/SPARK-41365 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 3.3.1 > Environment: as above >Reporter: Mars >Priority: Major > Fix For: 3.3.2, 3.4.0 > > Attachments: image-2022-12-02-17-53-03-003.png > > > My environment is CDH 5.8. Clicking through from the YARN interface to the Spark UI and then visiting the stage URI fails to load; the URI is > {code:java} > http://:8088/proxy/application_1669877165233_0021/stages/stage/?id=0&attempt=0 > {code} > !image-2022-12-02-17-53-03-003.png|width=430,height=697! > Server error stack trace: > {code:java} > Caused by: java.lang.NullPointerException > at > org.apache.spark.status.api.v1.StagesResource.$anonfun$doPagination$1(StagesResource.scala:207) > at > org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142) > at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135) > at > org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31) > at > org.apache.spark.status.api.v1.StagesResource.doPagination(StagesResource.scala:206) > at > org.apache.spark.status.api.v1.StagesResource.$anonfun$taskTable$1(StagesResource.scala:161) > at > org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142) > at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147) > at > org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137) > at > 
org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135) > at > org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31) > at > org.apache.spark.status.api.v1.StagesResource.taskTable(StagesResource.scala:145) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){code} > > This issue is similar to SPARK-32467 and SPARK-33611: the symptom is the same because the URL parameters are encoded twice. > https://issues.apache.org/jira/browse/SPARK-32467 > https://issues.apache.org/jira/browse/SPARK-33611 > Those two issues fixed double encoding in two scenarios: > 1. an https redirect proxy > 2. reverse proxying (spark.ui.reverseProxy enabled) behind Nginx > But when the parameters are encoded twice for any other reason, such as the yarn proxy in this issue, the page fails to load as well. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41552) Upgrade kubernetes-client to 6.3.1
[ https://issues.apache.org/jira/browse/SPARK-41552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-41552: - Assignee: Dongjoon Hyun > Upgrade kubernetes-client to 6.3.1 > -- > > Key: SPARK-41552 > URL: https://issues.apache.org/jira/browse/SPARK-41552 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-38062) FallbackStorage shouldn't attempt to resolve arbitrary "remote" hostname
[ https://issues.apache.org/jira/browse/SPARK-38062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-38062: -- Parent: SPARK-41550 Issue Type: Sub-task (was: Improvement) > FallbackStorage shouldn't attempt to resolve arbitrary "remote" hostname > > > Key: SPARK-38062 > URL: https://issues.apache.org/jira/browse/SPARK-38062 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Erik Krogen >Assignee: Erik Krogen >Priority: Major > Fix For: 3.3.0 > > > {{FallbackStorage}} uses a placeholder block manager ID: > {code:scala} > private[spark] object FallbackStorage extends Logging { > /** We use one block manager id as a place holder. */ > val FALLBACK_BLOCK_MANAGER_ID: BlockManagerId = BlockManagerId("fallback", > "remote", 7337) > {code} > That second argument is normally interpreted as a hostname, but is passed as > the string "remote" in this case. > {{BlockManager}} will consider this placeholder as one of the peers in some > cases: > {code:language=scala|title=BlockManager.scala} > private[storage] def getPeers(forceFetch: Boolean): Seq[BlockManagerId] = { > peerFetchLock.synchronized { > ... > if (cachedPeers.isEmpty && > > conf.get(config.STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).isDefined) { > Seq(FallbackStorage.FALLBACK_BLOCK_MANAGER_ID) > } else { > cachedPeers > } > } > } > {code} > {{BlockManagerDecommissioner.ShuffleMigrationRunnable}} will then attempt to > perform an upload to this placeholder ID: > {code:scala} > try { > blocks.foreach { case (blockId, buffer) => > logDebug(s"Migrating sub-block ${blockId}") > bm.blockTransferService.uploadBlockSync( > peer.host, > peer.port, > peer.executorId, > blockId, > buffer, > StorageLevel.DISK_ONLY, > null) // class tag, we don't need for shuffle > logDebug(s"Migrated sub-block $blockId") > } > logInfo(s"Migrated $shuffleBlockInfo to $peer") > } catch { > case e: IOException => > ... 
> if > (bm.migratableResolver.getMigrationBlocks(shuffleBlockInfo).size < > blocks.size) { > logWarning(s"Skipping block $shuffleBlockInfo, block > deleted.") > } else if (fallbackStorage.isDefined) { > fallbackStorage.foreach(_.copy(shuffleBlockInfo, bm)) > } else { > logError(s"Error occurred during migrating > $shuffleBlockInfo", e) > keepRunning = false > } > {code} > Since "remote" is not expected to be a resolvable hostname, an > {{IOException}} occurs, and {{fallbackStorage}} is used. But, we shouldn't > try to resolve this. First off, it's completely unnecessary and strange to be > treating the placeholder ID as a resolvable hostname, relying on an exception > to realize that we need to use the {{fallbackStorage}}. > To make matters worse, in some network environments, "remote" may be a > resolvable hostname, completely breaking this functionality. In the > particular environment that I use for running automated tests, there is a DNS > entry for "remote" which, when you attempt to connect to it, will hang for a > long period of time. This essentially hangs the executor decommission > process, and in the case of unit tests, breaks {{FallbackStorageSuite}} as it > exceeds its timeouts. I'm not sure, but it's possible this is related to > SPARK-35584 as well (if sometimes in the GA environment, it takes a long time > for the OS to decide that "remote" is not a valid hostname). > We shouldn't attempt to treat this placeholder ID as a real hostname. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
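The fix direction suggested above can be sketched as follows, in Java with invented names (Spark's real code is Scala and works with BlockManagerId instances): recognize the fallback placeholder explicitly instead of trying to resolve "remote" and relying on an IOException.

```java
// Hypothetical sketch, not Spark's implementation: the placeholder
// peer is identified by its executor id, so the migration path can
// branch on it directly rather than attempting a hostname lookup.
public class FallbackGuard {
    // Stand-in for the executor id of FallbackStorage.FALLBACK_BLOCK_MANAGER_ID.
    static final String FALLBACK_EXECUTOR_ID = "fallback";

    static boolean isFallback(String peerExecutorId) {
        return FALLBACK_EXECUTOR_ID.equals(peerExecutorId);
    }

    // Decide where a shuffle block goes without ever resolving "remote".
    static String chooseTarget(String peerExecutorId, String peerHost) {
        if (isFallback(peerExecutorId)) {
            return "fallback-storage";   // copy via the fallback storage path
        }
        return peerHost;                 // real peer: upload over the network
    }
}
```

With a guard like this, a DNS entry for the literal hostname "remote" can never hang the decommission process.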
[jira] [Updated] (SPARK-40060) Add numberDecommissioningExecutors metric
[ https://issues.apache.org/jira/browse/SPARK-40060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40060: -- Parent: SPARK-41550 Issue Type: Sub-task (was: Improvement) > Add numberDecommissioningExecutors metric > - > > Key: SPARK-40060 > URL: https://issues.apache.org/jira/browse/SPARK-40060 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.4.0 > > > The number of decommissioning executors should be exposed as a metric. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40060) Add numberDecommissioningExecutors metric
[ https://issues.apache.org/jira/browse/SPARK-40060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648762#comment-17648762 ] Dongjoon Hyun commented on SPARK-40060: --- I collected this as a subtask of SPARK-41550. > Add numberDecommissioningExecutors metric > - > > Key: SPARK-40060 > URL: https://issues.apache.org/jira/browse/SPARK-40060 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.4.0 > > > The number of decommissioning executors should be exposed as a metric. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40269) Randomize the orders of peer in BlockManagerDecommissioner
[ https://issues.apache.org/jira/browse/SPARK-40269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648761#comment-17648761 ] Dongjoon Hyun commented on SPARK-40269: --- I collected this as a subtask of SPARK-41550. > Randomize the orders of peer in BlockManagerDecommissioner > -- > > Key: SPARK-40269 > URL: https://issues.apache.org/jira/browse/SPARK-40269 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.4.0 > > > Randomize the orders of peer in BlockManagerDecommissioner to avoid migrating > data to the same set of nodes -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40269) Randomize the orders of peer in BlockManagerDecommissioner
[ https://issues.apache.org/jira/browse/SPARK-40269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40269: -- Parent: SPARK-41550 Issue Type: Sub-task (was: Improvement) > Randomize the orders of peer in BlockManagerDecommissioner > -- > > Key: SPARK-40269 > URL: https://issues.apache.org/jira/browse/SPARK-40269 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.4.0 > > > Randomize the orders of peer in BlockManagerDecommissioner to avoid migrating > data to the same set of nodes -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner
[ https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648760#comment-17648760 ] Dongjoon Hyun commented on SPARK-40636: --- I collected this as a subtask of SPARK-41550. > Fix wrong remained shuffles log in BlockManagerDecommissioner > - > > Key: SPARK-40636 > URL: https://issues.apache.org/jira/browse/SPARK-40636 > Project: Spark > Issue Type: Sub-task > Components: Block Manager >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.3.1, 3.2.3, 3.4.0 > > > BlockManagerDecommissioner should log correct remained shuffles. > {code:java} > 4 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:15.035 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:45.069 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40481) Ignore stage fetch failure caused by decommissioned executor
[ https://issues.apache.org/jira/browse/SPARK-40481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648759#comment-17648759 ] Dongjoon Hyun commented on SPARK-40481: --- I collected this as a subtask of SPARK-41550. > Ignore stage fetch failure caused by decommissioned executor > > > Key: SPARK-40481 > URL: https://issues.apache.org/jira/browse/SPARK-40481 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.4.0 > > > When executor decommission is enabled, many stage failures can be caused by FetchFailed from decommissioned executors, which in turn can fail the whole job. It would be better not to count such failures toward `spark.stage.maxConsecutiveAttempts`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40481) Ignore stage fetch failure caused by decommissioned executor
[ https://issues.apache.org/jira/browse/SPARK-40481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40481: -- Parent: SPARK-41550 Issue Type: Sub-task (was: Improvement) > Ignore stage fetch failure caused by decommissioned executor > > > Key: SPARK-40481 > URL: https://issues.apache.org/jira/browse/SPARK-40481 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.4.0 > > > When executor decommission is enabled, many stage failures can be caused by FetchFailed from decommissioned executors, which in turn can fail the whole job. It would be better not to count such failures toward `spark.stage.maxConsecutiveAttempts`. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
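The behavior SPARK-40481 proposes can be sketched as a tiny policy function (hypothetical names; this is not the DAGScheduler code): fetch failures attributed to a decommissioned executor leave the consecutive-failure count untouched.

```java
// Hypothetical sketch of the counting policy, not Spark's code.
public class StageRetryPolicy {
    // Only failures NOT caused by a decommissioned executor advance the
    // counter that is compared against spark.stage.maxConsecutiveAttempts.
    static int nextFailureCount(int current, boolean fromDecommissionedExecutor) {
        return fromDecommissionedExecutor ? current : current + 1;
    }

    static boolean shouldAbortStage(int failureCount, int maxConsecutiveAttempts) {
        return failureCount >= maxConsecutiveAttempts;
    }
}
```

Under this policy, a wave of FetchFailed errors from a scale-down event alone can never push a stage over its retry limit.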
[jira] [Updated] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner
[ https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40636: -- Parent: SPARK-41550 Issue Type: Sub-task (was: Bug) > Fix wrong remained shuffles log in BlockManagerDecommissioner > - > > Key: SPARK-40636 > URL: https://issues.apache.org/jira/browse/SPARK-40636 > Project: Spark > Issue Type: Sub-task > Components: Block Manager >Affects Versions: 3.3.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Minor > Fix For: 3.3.1, 3.2.3, 3.4.0 > > > BlockManagerDecommissioner should log correct remained shuffles. > {code:java} > 4 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:15.035 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained. > 2022-09-30 17:42:45.069 PDT > 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40596) Populate ExecutorDecommission with more informative messages
[ https://issues.apache.org/jira/browse/SPARK-40596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648758#comment-17648758 ] Dongjoon Hyun commented on SPARK-40596: --- I collected this as a subtask of SPARK-41550 > Populate ExecutorDecommission with more informative messages > > > Key: SPARK-40596 > URL: https://issues.apache.org/jira/browse/SPARK-40596 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > > Currently the message in {{ExecutorDecommission}} is a fixed value > {{{}"Executor decommission."{}}}, and it is the same for all cases, including > spot instance interruptions and auto-scaling down. We should put a detailed > message in {{ExecutorDecommission}} to better differentiate those cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40979) Keep removed executor info in decommission state
[ https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648757#comment-17648757 ] Dongjoon Hyun commented on SPARK-40979: --- I collected this as a subtask of SPARK-41550 > Keep removed executor info in decommission state > > > Key: SPARK-40979 > URL: https://issues.apache.org/jira/browse/SPARK-40979 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Major > Fix For: 3.4.0 > > > Executors removed due to decommission should be kept in a separate set. To > avoid OOM, the set size will be limited to 1K or 10K entries. > FetchFailed caused by a decommissioned executor falls into 2 categories: > # When the FetchFailed reaches the DAGScheduler, the executor is still alive, or is > lost but the loss info hasn't reached TaskSchedulerImpl yet. This is already > handled in SPARK-40979 > # The FetchFailed is caused by the loss of the decommissioned executor, so the decommission info has already been > removed from TaskSchedulerImpl. Keeping such info for a short period is > good enough: even with the removed-executor set capped at 10K entries, memory usage is > at most about 10MB. In practice, it is rare to have a cluster of > over 10K executors, and the chance that all of them are decommissioned and lost at the > same time is small. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40596) Populate ExecutorDecommission with more informative messages
[ https://issues.apache.org/jira/browse/SPARK-40596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40596: -- Parent: SPARK-41550 Issue Type: Sub-task (was: Improvement) > Populate ExecutorDecommission with more informative messages > > > Key: SPARK-40596 > URL: https://issues.apache.org/jira/browse/SPARK-40596 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Bo Zhang >Assignee: Bo Zhang >Priority: Major > > Currently the message in {{ExecutorDecommission}} is a fixed value > {{{}"Executor decommission."{}}}, and it is the same for all cases, including > spot instance interruptions and auto-scaling down. We should put a detailed > message in {{ExecutorDecommission}} to better differentiate those cases. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40979) Keep removed executor info in decommission state
[ https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40979: -- Parent: SPARK-41550 Issue Type: Sub-task (was: Improvement) > Keep removed executor info in decommission state > > > Key: SPARK-40979 > URL: https://issues.apache.org/jira/browse/SPARK-40979 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Zhongwei Zhu >Assignee: Zhongwei Zhu >Priority: Major > Fix For: 3.4.0 > > > Executors removed due to decommission should be kept in a separate set. To > avoid OOM, the set size will be limited to 1K or 10K entries. > FetchFailed caused by a decommissioned executor falls into 2 categories: > # When the FetchFailed reaches the DAGScheduler, the executor is still alive, or is > lost but the loss info hasn't reached TaskSchedulerImpl yet. This is already > handled in SPARK-40979 > # The FetchFailed is caused by the loss of the decommissioned executor, so the decommission info has already been > removed from TaskSchedulerImpl. Keeping such info for a short period is > good enough: even with the removed-executor set capped at 10K entries, memory usage is > at most about 10MB. In practice, it is rare to have a cluster of > over 10K executors, and the chance that all of them are decommissioned and lost at the > same time is small. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
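The size-bounded set described in SPARK-40979 (capped at 1K or 10K entries to avoid OOM) could look like the following Java sketch; the names are invented and Spark's actual data structure may differ. A LinkedHashMap with removeEldestEntry gives an insertion-ordered set that evicts the oldest executor once the cap is exceeded.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: remember recently removed decommissioned
// executors so a late FetchFailed can still be attributed to
// decommissioning after TaskSchedulerImpl forgets the executor.
public class BoundedExecutorSet {
    private final Set<String> ids;

    public BoundedExecutorSet(int maxSize) {
        this.ids = Collections.newSetFromMap(
            new LinkedHashMap<String, Boolean>() {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > maxSize;   // drop the oldest entry past the cap
                }
            });
    }

    public void add(String executorId) { ids.add(executorId); }

    public boolean wasDecommissioned(String executorId) { return ids.contains(executorId); }

    public int size() { return ids.size(); }
}
```

With a cap of 10K short executor-id strings, the memory footprint stays in the low megabytes, matching the estimate in the issue.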
[jira] [Commented] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s
[ https://issues.apache.org/jira/browse/SPARK-40379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648756#comment-17648756 ] Dongjoon Hyun commented on SPARK-40379: --- Hi, [~holden]. We want to go GA with `Dynamic Allocation on K8s`. I collected this individual task there as a subtask because this is good. Please let me know if you want to collect this into somewhere else. > Propagate decommission executor loss reason during onDisconnect in K8s > -- > > Key: SPARK-40379 > URL: https://issues.apache.org/jira/browse/SPARK-40379 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.4.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > Fix For: 3.4.0 > > > Currently if an executor has been sent a decommission message and then it > disconnects from the scheduler we only disable the executor depending on the > K8s status events to drive the rest of the state transitions. However, the > K8s status events can become overwhelmed on large clusters so we should check > if an executor is in a decommissioning state when it is disconnected and use > that reason instead of waiting on the K8s status events so we have more > accurate logging information. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s
[ https://issues.apache.org/jira/browse/SPARK-40379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-40379: -- Parent: SPARK-41550 Issue Type: Sub-task (was: Improvement) > Propagate decommission executor loss reason during onDisconnect in K8s > -- > > Key: SPARK-40379 > URL: https://issues.apache.org/jira/browse/SPARK-40379 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes, Spark Core >Affects Versions: 3.4.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > Fix For: 3.4.0 > > > Currently if an executor has been sent a decommission message and then it > disconnects from the scheduler we only disable the executor depending on the > K8s status events to drive the rest of the state transitions. However, the > K8s status events can become overwhelmed on large clusters so we should check > if an executor is in a decommissioning state when it is disconnected and use > that reason instead of waiting on the K8s status events so we have more > accurate logging information. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41552) Upgrade kubernetes-client to 6.3.1
[ https://issues.apache.org/jira/browse/SPARK-41552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41552: Assignee: (was: Apache Spark) > Upgrade kubernetes-client to 6.3.1 > -- > > Key: SPARK-41552 > URL: https://issues.apache.org/jira/browse/SPARK-41552 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41552) Upgrade kubernetes-client to 6.3.1
[ https://issues.apache.org/jira/browse/SPARK-41552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648752#comment-17648752 ] Apache Spark commented on SPARK-41552: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/39094 > Upgrade kubernetes-client to 6.3.1 > -- > > Key: SPARK-41552 > URL: https://issues.apache.org/jira/browse/SPARK-41552 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-41552) Upgrade kubernetes-client to 6.3.1
[ https://issues.apache.org/jira/browse/SPARK-41552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41552: Assignee: Apache Spark > Upgrade kubernetes-client to 6.3.1 > -- > > Key: SPARK-41552 > URL: https://issues.apache.org/jira/browse/SPARK-41552 > Project: Spark > Issue Type: Improvement > Components: Build, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41552) Upgrade kubernetes-client to 6.3.1
Dongjoon Hyun created SPARK-41552: - Summary: Upgrade kubernetes-client to 6.3.1 Key: SPARK-41552 URL: https://issues.apache.org/jira/browse/SPARK-41552 Project: Spark Issue Type: Improvement Components: Build, Kubernetes Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-39322) Remove `Experimental` from `spark.dynamicAllocation.shuffleTracking.enabled`
[ https://issues.apache.org/jira/browse/SPARK-39322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-39322: -- Parent: SPARK-41550 Issue Type: Sub-task (was: Documentation) > Remove `Experimental` from `spark.dynamicAllocation.shuffleTracking.enabled` > > > Key: SPARK-39322 > URL: https://issues.apache.org/jira/browse/SPARK-39322 > Project: Spark > Issue Type: Sub-task > Components: Documentation >Affects Versions: 3.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org