[jira] [Created] (SPARK-41558) Disable Coverage in python.pyspark.tests.test_memory_profiler

2022-12-16 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-41558:


 Summary: Disable Coverage in 
python.pyspark.tests.test_memory_profiler
 Key: SPARK-41558
 URL: https://issues.apache.org/jira/browse/SPARK-41558
 Project: Spark
  Issue Type: Test
  Components: PySpark
Affects Versions: 3.4.0
Reporter: Hyukjin Kwon


https://github.com/apache/spark/actions/runs/3712125552/jobs/6293347848

{code}
==
FAIL [13.173s]: test_memory_profiler 
(pyspark.tests.test_memory_profiler.MemoryProfilerTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/tests/test_memory_profiler.py", line 
56, in test_memory_profiler
self.assertTrue("plus_one" in fake_out.getvalue())
AssertionError: False is not true

==
FAIL [3.986s]: test_profile_pandas_function_api 
(pyspark.tests.test_memory_profiler.MemoryProfilerTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/tests/test_memory_profiler.py", line 
87, in test_profile_pandas_function_api
self.assertTrue(f_name in fake_out.getvalue())
AssertionError: False is not true

==
FAIL [3.722s]: test_profile_pandas_udf 
(pyspark.tests.test_memory_profiler.MemoryProfilerTests)
--
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/tests/test_memory_profiler.py", line 
69, in test_profile_pandas_udf
self.assertTrue(f_name in fake_out.getvalue())
AssertionError: False is not true

--
Ran 3 tests in 20.882s
{code}
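One hedged way to act on this ticket's summary, skipping these tests when coverage instrumentation is active, could look like the sketch below. The `COVERAGE_PROCESS_START` check is an assumption (coverage.py uses that environment variable for subprocess measurement); it is not necessarily the detection the eventual fix uses, and the test body here is a placeholder.

```python
import os
import unittest

# Assumption: a coverage run can be detected via COVERAGE_PROCESS_START,
# which coverage.py sets when measuring subprocesses. This may not be the
# exact check the real fix uses.
has_coverage = "COVERAGE_PROCESS_START" in os.environ


class MemoryProfilerTests(unittest.TestCase):
    @unittest.skipIf(has_coverage, "Flaky under coverage; see SPARK-41558")
    def test_memory_profiler(self):
        # Placeholder body; the real test asserts that profiler output
        # (e.g. the "plus_one" UDF name) appears in captured stdout.
        self.assertTrue(True)
```

Under a coverage job the test would then be reported as skipped instead of failing the build.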



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-41426) Protobuf serializer for ResourceProfileWrapper

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41426:


Assignee: (was: Apache Spark)

> Protobuf serializer for ResourceProfileWrapper
> --
>
> Key: SPARK-41426
> URL: https://issues.apache.org/jira/browse/SPARK-41426
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Commented] (SPARK-41426) Protobuf serializer for ResourceProfileWrapper

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648870#comment-17648870
 ] 

Apache Spark commented on SPARK-41426:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39105

> Protobuf serializer for ResourceProfileWrapper
> --
>
> Key: SPARK-41426
> URL: https://issues.apache.org/jira/browse/SPARK-41426
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-41426) Protobuf serializer for ResourceProfileWrapper

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41426:


Assignee: Apache Spark

> Protobuf serializer for ResourceProfileWrapper
> --
>
> Key: SPARK-41426
> URL: https://issues.apache.org/jira/browse/SPARK-41426
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-41426) Protobuf serializer for ResourceProfileWrapper

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648868#comment-17648868
 ] 

Apache Spark commented on SPARK-41426:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39105

> Protobuf serializer for ResourceProfileWrapper
> --
>
> Key: SPARK-41426
> URL: https://issues.apache.org/jira/browse/SPARK-41426
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Commented] (SPARK-41426) Protobuf serializer for ResourceProfileWrapper

2022-12-16 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648867#comment-17648867
 ] 

Sandeep Singh commented on SPARK-41426:
---

The underlying serializer is already there; working on a PR.

> Protobuf serializer for ResourceProfileWrapper
> --
>
> Key: SPARK-41426
> URL: https://issues.apache.org/jira/browse/SPARK-41426
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Commented] (SPARK-41557) Union of tables with and without metadata column fails when used in join

2022-12-16 Thread Shardul Mahadik (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648866#comment-17648866
 ] 

Shardul Mahadik commented on SPARK-41557:
-

cc: [~Gengliang.Wang] [~cloud_fan]


> Union of tables with and without metadata column fails when used in join
> 
>
> Key: SPARK-41557
> URL: https://issues.apache.org/jira/browse/SPARK-41557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> Here is a test case that can be added to {{MetadataColumnSuite}} to 
> demonstrate the issue
> {code:scala}
>   test("SPARK-41557: Union of tables with and without metadata column should 
> work") {
> withTable(tbl) {
>   sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)")
>   checkAnswer(
> spark.sql(
>   s"""
> SELECT b.*
> FROM RANGE(1)
>   LEFT JOIN (
> SELECT id FROM $tbl
> UNION ALL
> SELECT id FROM RANGE(10)
>   ) b USING(id)
>   """),
> Seq(Row(0))
>   )
> }
>   }
>  {code}
> Here a table with metadata columns, {{$tbl}}, is unioned with a table without 
> metadata columns, {{RANGE(10)}}. If the result is later used in a join, query 
> analysis fails, reporting a mismatch in the number of columns of the union 
> caused by the metadata columns. However, we explicitly project only one 
> column into the union, so the union should be valid.
> {code}
> org.apache.spark.sql.AnalysisException: [NUM_COLUMNS_MISMATCH] UNION can only 
> be performed on inputs with the same number of columns, but the first input 
> has 3 columns and the second input has 1 columns.; line 5 pos 16;
> 'Project [id#26L]
> +- 'Project [id#26L, id#26L]
>+- 'Project [id#28L, id#26L]
>   +- 'Join LeftOuter, (id#28L = id#26L)
>  :- Range (0, 1, step=1, splits=None)
>  +- 'SubqueryAlias b
> +- 'Union false, false
>:- Project [id#26L, index#30, _partition#31]
>:  +- SubqueryAlias testcat.t
>: +- RelationV2[id#26L, data#27, index#30, _partition#31] 
> testcat.t testcat.t
>+- Project [id#29L]
>   +- Range (0, 10, step=1, splits=None)
> {code}






[jira] [Updated] (SPARK-41557) Union of tables with and without metadata column fails when used in join

2022-12-16 Thread Shardul Mahadik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shardul Mahadik updated SPARK-41557:

Description: 
Here is a test case that can be added to {{MetadataColumnSuite}} to demonstrate 
the issue
{code:scala}
  test("SPARK-X: Union of tables with and without metadata column should 
work") {
withTable(tbl) {
  sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)")
  checkAnswer(
spark.sql(
  s"""
SELECT b.*
FROM RANGE(1)
  LEFT JOIN (
SELECT id FROM $tbl
UNION ALL
SELECT id FROM RANGE(10)
  ) b USING(id)
  """),
Seq(Row(0))
  )
}
  }
 {code}

Here a table with metadata columns, {{$tbl}}, is unioned with a table without 
metadata columns, {{RANGE(10)}}. If the result is later used in a join, query 
analysis fails, reporting a mismatch in the number of columns of the union 
caused by the metadata columns. However, we explicitly project only one column 
into the union, so the union should be valid.

{code}
org.apache.spark.sql.AnalysisException: [NUM_COLUMNS_MISMATCH] UNION can only 
be performed on inputs with the same number of columns, but the first input has 
3 columns and the second input has 1 columns.; line 5 pos 16;
'Project [id#26L]
+- 'Project [id#26L, id#26L]
   +- 'Project [id#28L, id#26L]
  +- 'Join LeftOuter, (id#28L = id#26L)
 :- Range (0, 1, step=1, splits=None)
 +- 'SubqueryAlias b
+- 'Union false, false
   :- Project [id#26L, index#30, _partition#31]
   :  +- SubqueryAlias testcat.t
   : +- RelationV2[id#26L, data#27, index#30, _partition#31] 
testcat.t testcat.t
   +- Project [id#29L]
  +- Range (0, 10, step=1, splits=None)
{code}

  was:
Here is a test case that can be added to {{MetadataColumnSuite}} to demonstrate 
the issue
{code:scala}
test("SPARK-X: Union of tables with and without metadata column should 
work") {
withTable(tbl) {
  sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)")
  checkAnswer(
spark.sql(
  s"""
SELECT b.*
FROM RANGE(1)
  LEFT JOIN (
SELECT id FROM $tbl
UNION ALL
SELECT id FROM RANGE(10)
  ) b USING(id)
  """),
Seq(Row(0))
  )
}
  }
 {code}

Here a table with metadata columns, {{$tbl}}, is unioned with a table without 
metadata columns, {{RANGE(10)}}. If the result is later used in a join, query 
analysis fails, reporting a mismatch in the number of columns of the union 
caused by the metadata columns. However, we explicitly project only one column 
into the union, so the union should be valid.

{code}
org.apache.spark.sql.AnalysisException: [NUM_COLUMNS_MISMATCH] UNION can only 
be performed on inputs with the same number of columns, but the first input has 
3 columns and the second input has 1 columns.; line 5 pos 16;
'Project [id#26L]
+- 'Project [id#26L, id#26L]
   +- 'Project [id#28L, id#26L]
  +- 'Join LeftOuter, (id#28L = id#26L)
 :- Range (0, 1, step=1, splits=None)
 +- 'SubqueryAlias b
+- 'Union false, false
   :- Project [id#26L, index#30, _partition#31]
   :  +- SubqueryAlias testcat.t
   : +- RelationV2[id#26L, data#27, index#30, _partition#31] 
testcat.t testcat.t
   +- Project [id#29L]
  +- Range (0, 10, step=1, splits=None)
{code}


> Union of tables with and without metadata column fails when used in join
> 
>
> Key: SPARK-41557
> URL: https://issues.apache.org/jira/browse/SPARK-41557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.2, 3.4.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> Here is a test case that can be added to {{MetadataColumnSuite}} to 
> demonstrate the issue
> {code:scala}
>   test("SPARK-X: Union of tables with and without metadata column should 
> work") {
> withTable(tbl) {
>   sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)")
>   checkAnswer(
> spark.sql(
>   s"""
> SELECT b.*
> FROM RANGE(1)
>   LEFT JOIN (
> SELECT id FROM $tbl
> UNION ALL
> SELECT id FROM RANGE(10)
>   ) b USING(id)
>   """),
> Seq(Row(0))
>   )
> }
>   }
>  {code}
> Here a table with metadata columns, {{$tbl}}, is unioned with a table without 
> metadata columns, {{RANGE(10)}}. If the result is later used in a join, query 
> analysis fails, reporting a mismatch in the number of columns of the union 
> caused by the metadata columns. However, we explicitly project only one 
> column into the union, so the union should be valid.


[jira] [Created] (SPARK-41557) Union of tables with and without metadata column fails when used in join

2022-12-16 Thread Shardul Mahadik (Jira)
Shardul Mahadik created SPARK-41557:
---

 Summary: Union of tables with and without metadata column fails 
when used in join
 Key: SPARK-41557
 URL: https://issues.apache.org/jira/browse/SPARK-41557
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.2, 3.4.0
Reporter: Shardul Mahadik


Here is a test case that can be added to {{MetadataColumnSuite}} to demonstrate 
the issue
{code:scala}
test("SPARK-X: Union of tables with and without metadata column should 
work") {
withTable(tbl) {
  sql(s"CREATE TABLE $tbl (id bigint, data string) PARTITIONED BY (id)")
  checkAnswer(
spark.sql(
  s"""
SELECT b.*
FROM RANGE(1)
  LEFT JOIN (
SELECT id FROM $tbl
UNION ALL
SELECT id FROM RANGE(10)
  ) b USING(id)
  """),
Seq(Row(0))
  )
}
  }
 {code}

Here a table with metadata columns, {{$tbl}}, is unioned with a table without 
metadata columns, {{RANGE(10)}}. If the result is later used in a join, query 
analysis fails, reporting a mismatch in the number of columns of the union 
caused by the metadata columns. However, we explicitly project only one column 
into the union, so the union should be valid.

{code}
org.apache.spark.sql.AnalysisException: [NUM_COLUMNS_MISMATCH] UNION can only 
be performed on inputs with the same number of columns, but the first input has 
3 columns and the second input has 1 columns.; line 5 pos 16;
'Project [id#26L]
+- 'Project [id#26L, id#26L]
   +- 'Project [id#28L, id#26L]
  +- 'Join LeftOuter, (id#28L = id#26L)
 :- Range (0, 1, step=1, splits=None)
 +- 'SubqueryAlias b
+- 'Union false, false
   :- Project [id#26L, index#30, _partition#31]
   :  +- SubqueryAlias testcat.t
   : +- RelationV2[id#26L, data#27, index#30, _partition#31] 
testcat.t testcat.t
   +- Project [id#29L]
  +- Range (0, 10, step=1, splits=None)
{code}






[jira] [Assigned] (SPARK-41425) Protobuf serializer for RDDStorageInfoWrapper

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41425:


Assignee: (was: Apache Spark)

> Protobuf serializer for RDDStorageInfoWrapper
> -
>
> Key: SPARK-41425
> URL: https://issues.apache.org/jira/browse/SPARK-41425
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Commented] (SPARK-41425) Protobuf serializer for RDDStorageInfoWrapper

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648862#comment-17648862
 ] 

Apache Spark commented on SPARK-41425:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39104

> Protobuf serializer for RDDStorageInfoWrapper
> -
>
> Key: SPARK-41425
> URL: https://issues.apache.org/jira/browse/SPARK-41425
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-41425) Protobuf serializer for RDDStorageInfoWrapper

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41425:


Assignee: Apache Spark

> Protobuf serializer for RDDStorageInfoWrapper
> -
>
> Key: SPARK-41425
> URL: https://issues.apache.org/jira/browse/SPARK-41425
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-41425) Protobuf serializer for RDDStorageInfoWrapper

2022-12-16 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648860#comment-17648860
 ] 

Sandeep Singh commented on SPARK-41425:
---

Working on this

> Protobuf serializer for RDDStorageInfoWrapper
> -
>
> Key: SPARK-41425
> URL: https://issues.apache.org/jira/browse/SPARK-41425
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Comment Edited] (SPARK-41556) input_file_position

2022-12-16 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648856#comment-17648856
 ] 

gabrywu edited comment on SPARK-41556 at 12/17/22 5:18 AM:
---

[~yumwang] [~petertoth]  What do you think of it?


was (Author: gabry.wu):
[~yumwang] [~ptoth] What do you think of it?

> input_file_position
> --
>
> Key: SPARK-41556
> URL: https://issues.apache.org/jira/browse/SPARK-41556
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: gabrywu
>Priority: Trivial
>
> As of now, we have three built-in UDFs related to input files and blocks. Can 
> we provide a new UDF that returns the current record position within a file 
> or block? It can be useful: this position (called ROWID in Oracle) could 
> serve as a physical primary key.
>  
> |input_file_block_length()|Returns the length of the block being read, or -1 
> if not available.|
> |input_file_block_start()|Returns the start offset of the block being read, 
> or -1 if not available.|
> |input_file_name()|Returns the name of the file being read, or empty string 
> if not available.|






[jira] [Commented] (SPARK-41556) input_file_position

2022-12-16 Thread gabrywu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648856#comment-17648856
 ] 

gabrywu commented on SPARK-41556:
-

[~yumwang] [~ptoth] What do you think of it?

> input_file_position
> --
>
> Key: SPARK-41556
> URL: https://issues.apache.org/jira/browse/SPARK-41556
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: gabrywu
>Priority: Trivial
>
> As of now, we have three built-in UDFs related to input files and blocks. Can 
> we provide a new UDF that returns the current record position within a file 
> or block? It can be useful: this position (called ROWID in Oracle) could 
> serve as a physical primary key.
>  
> |input_file_block_length()|Returns the length of the block being read, or -1 
> if not available.|
> |input_file_block_start()|Returns the start offset of the block being read, 
> or -1 if not available.|
> |input_file_name()|Returns the name of the file being read, or empty string 
> if not available.|






[jira] [Updated] (SPARK-41447) Reduce the number of doMergeApplicationListing invocations

2022-12-16 Thread shuyouZZ (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

shuyouZZ updated SPARK-41447:
-
Description: 
When restarting the history server, the history server log shows many 
{{INFO FsHistoryProvider: Finished parsing application_xxx}} entries followed by
{{INFO FsHistoryProvider: Deleting expired event log for application_xxx}}.
This happens because expired, unlisted log files are also parsed, after which 
{{checkAndCleanLog}} deletes the parsed info, so the parsing was unnecessary.

If there are a large number of expired log files in the log directory, this 
slows down replay.

To avoid this, we can clean up these expired log files before calling 
{{doMergeApplicationListing}}.

  was:
When restarting the history server, the previous logic executed 
{{checkForLogs}} first, causing expired event log files to be parsed; 
{{checkAndCleanLog}} then deleted the parsed info, which is unnecessary. In the 
history server log, we can see many {{INFO FsHistoryProvider: Finished parsing 
application_xxx}} entries followed by {{INFO FsHistoryProvider: Deleting 
expired event log for application_xxx}}. If there are a large number of expired 
log files in the log directory, this slows down replay.

To avoid this, we can run {{cleanLogs}} before {{checkForLogs}}.

In addition, since {{cleanLogs}} then runs before {{checkForLogs}}, the expired 
log info may not yet exist in the listing db when the history server is 
starting, so we need to clean up these log files in {{cleanLogs}}.


> Reduce the number of doMergeApplicationListing invocations
> --
>
> Key: SPARK-41447
> URL: https://issues.apache.org/jira/browse/SPARK-41447
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: shuyouZZ
>Assignee: shuyouZZ
>Priority: Minor
> Fix For: 3.4.0
>
>
> When restarting the history server, the history server log shows many 
> {{INFO FsHistoryProvider: Finished parsing application_xxx}} entries followed 
> by {{INFO FsHistoryProvider: Deleting expired event log for application_xxx}}.
> This happens because expired, unlisted log files are also parsed, after which 
> {{checkAndCleanLog}} deletes the parsed info, so the parsing was unnecessary.
> If there are a large number of expired log files in the log directory, this 
> slows down replay.
> To avoid this, we can clean up these expired log files before calling 
> {{doMergeApplicationListing}}.
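The proposed ordering can be sketched in plain Python: partition event logs by age first, so the expired ones can be deleted up front and never parsed. All names below (the function, the retention constant standing in for {{spark.history.fs.cleaner.maxAge}}) are illustrative, not FsHistoryProvider's actual internals.

```python
import time

# Hypothetical retention window, standing in for spark.history.fs.cleaner.maxAge.
MAX_LOG_AGE_SECONDS = 7 * 24 * 3600


def split_expired_logs(log_files, now=None):
    """Partition (path, mtime_epoch_seconds) pairs into fresh and expired.

    Deleting `expired` before parsing `fresh` means the listing code is never
    invoked for logs that would immediately be cleaned up again.
    """
    now = time.time() if now is None else now
    fresh, expired = [], []
    for path, mtime in log_files:
        (expired if now - mtime > MAX_LOG_AGE_SECONDS else fresh).append(path)
    return fresh, expired
```

Deleting the expired list first avoids the paired "Finished parsing" / "Deleting expired event log" messages described above.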






[jira] [Created] (SPARK-41556) input_file_position

2022-12-16 Thread gabrywu (Jira)
gabrywu created SPARK-41556:
---

 Summary: input_file_position
 Key: SPARK-41556
 URL: https://issues.apache.org/jira/browse/SPARK-41556
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.3.1
Reporter: gabrywu


As of now, we have three built-in UDFs related to input files and blocks. Can 
we provide a new UDF that returns the current record position within a file or 
block? It can be useful: this position (called ROWID in Oracle) could serve as 
a physical primary key.

 
|input_file_block_length()|Returns the length of the block being read, or -1 if 
not available.|
|input_file_block_start()|Returns the start offset of the block being read, or 
-1 if not available.|
|input_file_name()|Returns the name of the file being read, or empty string if 
not available.|
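Conceptually, the proposed input_file_position() would expose where each record starts inside its file. A plain-Python illustration of why such a position works as a physical key (the UDF name and semantics are this proposal's, not an existing Spark API):

```python
import io


def records_with_positions(f):
    """Yield (start_offset, record) pairs from a line-delimited file.

    The start offset is a stable physical identifier for the record within
    its file -- the idea behind the proposed input_file_position() UDF.
    """
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            return
        yield pos, line.rstrip("\n")


# Example: offsets 0, 2 and 5 uniquely identify the three records.
rows = list(records_with_positions(io.StringIO("a\nbb\nccc\n")))
```

Combined with input_file_name(), such an offset would pin down a single record across the whole input, much like Oracle's ROWID.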






[jira] [Commented] (SPARK-41546) pyspark_types_to_proto_types should support StructType.

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648848#comment-17648848
 ] 

Apache Spark commented on SPARK-41546:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/39103

> pyspark_types_to_proto_types should support StructType.
> 
>
> Key: SPARK-41546
> URL: https://issues.apache.org/jira/browse/SPARK-41546
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> pyspark_types_to_proto_types doesn't support StructType now.
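Supporting StructType mostly means recursing over its fields. The sketch below shows that recursion shape using stand-in classes; it does not use pyspark's real types or the Spark Connect protobuf messages, so every name and the dict output format are illustrative only.

```python
# Stand-ins for pyspark.sql.types; the real converter targets Spark Connect
# protobuf messages, whose schema is not reproduced here.
class StringType: pass
class LongType: pass

class StructField:
    def __init__(self, name, dataType):
        self.name = name
        self.dataType = dataType

class StructType:
    def __init__(self, fields):
        self.fields = fields


def to_proto_like(dt):
    """Recursively lower a (stand-in) Spark type to a dict -- the same shape
    of recursion pyspark_types_to_proto_types needs for StructType support."""
    if isinstance(dt, StringType):
        return {"kind": "string"}
    if isinstance(dt, LongType):
        return {"kind": "long"}
    if isinstance(dt, StructType):
        return {"kind": "struct",
                "fields": [{"name": f.name, "type": to_proto_like(f.dataType)}
                           for f in dt.fields]}
    raise TypeError(f"unsupported type: {dt!r}")
```

The key point is the self-call for each field, which handles arbitrarily nested structs for free.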






[jira] (SPARK-41546) pyspark_types_to_proto_types should support StructType.

2022-12-16 Thread jiaan.geng (Jira)


[ https://issues.apache.org/jira/browse/SPARK-41546 ]


jiaan.geng deleted comment on SPARK-41546:


was (Author: beliefer):
I'm working on it.

> pyspark_types_to_proto_types should support StructType.
> 
>
> Key: SPARK-41546
> URL: https://issues.apache.org/jira/browse/SPARK-41546
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> pyspark_types_to_proto_types doesn't support StructType now.






[jira] [Commented] (SPARK-41546) pyspark_types_to_proto_types should support StructType.

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648847#comment-17648847
 ] 

Apache Spark commented on SPARK-41546:
--

User 'beliefer' has created a pull request for this issue:
https://github.com/apache/spark/pull/39103

> pyspark_types_to_proto_types should support StructType.
> 
>
> Key: SPARK-41546
> URL: https://issues.apache.org/jira/browse/SPARK-41546
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> pyspark_types_to_proto_types doesn't support StructType now.






[jira] [Assigned] (SPARK-41546) pyspark_types_to_proto_types should support StructType.

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41546:


Assignee: Apache Spark

> pyspark_types_to_proto_types should support StructType.
> 
>
> Key: SPARK-41546
> URL: https://issues.apache.org/jira/browse/SPARK-41546
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Assignee: Apache Spark
>Priority: Major
>
> pyspark_types_to_proto_types does not support StructType yet.






[jira] [Assigned] (SPARK-41546) pyspark_types_to_proto_types should support StructType.

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41546:


Assignee: (was: Apache Spark)

> pyspark_types_to_proto_types should support StructType.
> 
>
> Key: SPARK-41546
> URL: https://issues.apache.org/jira/browse/SPARK-41546
> Project: Spark
>  Issue Type: Sub-task
>  Components: Connect
>Affects Versions: 3.4.0
>Reporter: jiaan.geng
>Priority: Major
>
> pyspark_types_to_proto_types does not support StructType yet.






[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-41555:
---
Priority: Minor  (was: Major)

> Multi sparkSession should share single SQLAppStatusStore
> 
>
> Key: SPARK-41555
> URL: https://issues.apache.org/jira/browse/SPARK-41555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1, 3.3.0
>Reporter: jiahong.li
>Priority: Minor
> Attachments: muti-SQLStore.png, muti-sqltab.png
>
>
> In Spark, if we create multiple SparkSessions in one program, we get
> multiple SQLTab instances in the UI.
> At the same time, we get multiple SQLAppStatusListener objects, which
> wastes memory.
> Code like this:
>  
> {code:java}
> // code placeholder
> def main(args: Array[String]): Unit = {
> val sparkConf = new SparkConf()
> .setAppName("demo")
> .setMaster("local[*]")
> val spark = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> setDefaultSession(null)
> setActiveSession(null)
> val spark2 = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> import spark.implicits._
> val testData = spark.sparkContext
> .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
> testData.createOrReplaceTempView("testTable")
> val testData2 = spark.sparkContext.parallelize(
> TestData2(1, "1") ::
> TestData2(1, "2") ::
> TestData2(2, "1") ::
> TestData2(2, "2") ::
> TestData2(3, "1") ::
> TestData2(3, "2") ::
> Nil, 2).toDF()
> testData2.createOrReplaceTempView("testTable2")
> val query = "select ind2,count(*) from ( select * from testTable2 join 
> testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') 
> group by ind2"
> spark.sql(query).collect()
> Thread.sleep(50)
> spark.stop()
> }
> case class TestData(ind: Int, name: String)
> case class TestData2(ind2: Int, name: String) {code}
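
A minimal sketch of the sharing the title asks for: sessions built on the same
underlying context reuse one status store instead of each allocating its own
listener, store, and SQL tab. SessionStub and SQLAppStatusStoreStub below are
hypothetical stand-ins, not Spark's actual classes.

```python
# Hypothetical stand-ins: one shared status store per "context", so a
# second session reuses the first session's store instead of allocating
# another SQLAppStatusListener/SQLAppStatusStore (and another SQL tab).
class SQLAppStatusStoreStub:
    pass

class SessionStub:
    _stores = {}  # keyed by context id; models "one store per SparkContext"

    def __init__(self, ctx_id):
        # setdefault creates the store on first use and returns the same
        # object for every later session on that context.
        self.status_store = SessionStub._stores.setdefault(
            ctx_id, SQLAppStatusStoreStub())

s1 = SessionStub("local-ctx")
s2 = SessionStub("local-ctx")  # second session on the same context
shared = s1.status_store is s2.status_store
```

With this shape, the repro above would register one listener and render one
SQL tab regardless of how many sessions are created on the context.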






[jira] [Resolved] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper

2022-12-16 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang resolved SPARK-41421.

Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39096
[https://github.com/apache/spark/pull/39096]

> Protobuf serializer for ApplicationEnvironmentInfoWrapper
> -
>
> Key: SPARK-41421
> URL: https://issues.apache.org/jira/browse/SPARK-41421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Assigned] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper

2022-12-16 Thread Gengliang Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gengliang Wang reassigned SPARK-41421:
--

Assignee: Sandeep Singh

> Protobuf serializer for ApplicationEnvironmentInfoWrapper
> -
>
> Key: SPARK-41421
> URL: https://issues.apache.org/jira/browse/SPARK-41421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Sandeep Singh
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Updated] (SPARK-41162) Anti-join must not be pushed below aggregation with ambiguous predicates

2022-12-16 Thread Shardul Mahadik (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shardul Mahadik updated SPARK-41162:

Labels: correctness  (was: )

> Anti-join must not be pushed below aggregation with ambiguous predicates
> 
>
> Key: SPARK-41162
> URL: https://issues.apache.org/jira/browse/SPARK-41162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Major
>  Labels: correctness
>
> The following query should return a single row, because every value of {{id}} 
> except the largest is eliminated by the anti-join:
> {code}
> val ids = Seq(1, 2, 3).toDF("id").distinct()
> val result = ids.withColumn("id", $"id" + 1).join(ids, "id", 
> "left_anti").collect()
> assert(result.length == 1)
> {code}
> Without the {{distinct()}}, the assertion is true. With {{distinct()}}, the 
> assertion should still hold but is false.
> Rule {{PushDownLeftSemiAntiJoin}} pushes the {{Join}} below the left 
> {{Aggregate}} with join condition {{(id#750 + 1) = id#750}}, which can never 
> be true.
> {code}
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
> !Join LeftAnti, (id#752 = id#750)  'Aggregate [id#750], 
> [(id#750 + 1) AS id#752]
> !:- Aggregate [id#750], [(id#750 + 1) AS id#752]   +- 'Join LeftAnti, 
> ((id#750 + 1) = id#750)
> !:  +- LocalRelation [id#750] :- LocalRelation 
> [id#750]
> !+- Aggregate [id#750], [id#750]  +- Aggregate [id#750], 
> [id#750]
> !   +- LocalRelation [id#750]+- LocalRelation 
> [id#750]
> {code}
> The optimizer then rightly removes the left-anti join altogether, returning 
> the left child only.
> Rule {{PushDownLeftSemiAntiJoin}} must not push down join conditions that 
> reference both the left *and* the right child.
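
The correctness difference can be reproduced with plain Python sets standing in
for the two plans; this models the semantics only and does not use Spark.

```python
ids = {1, 2, 3}  # result of the distinct()/Aggregate

# Correct plan: evaluate id + 1 above the aggregate, then anti-join.
shifted = {i + 1 for i in ids}   # {2, 3, 4}
correct = shifted - ids          # only the largest shifted id survives

# Buggy pushed-down plan: the anti-join runs below the aggregate with the
# rewritten condition (id + 1) = id, which is never true, so the anti-join
# removes nothing; the aggregate then shifts every id.
kept = {i for i in ids if not (i + 1 == i)}  # condition never matches
buggy = {i + 1 for i in kept}                # all three rows survive
```

One row versus three: exactly the failing assertion in the repro above.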






[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-41555:
---
Description: 
In Spark, if we create multiple SparkSessions in one program, we get 
multiple SQLTab instances in the UI.

At the same time, we get multiple SQLAppStatusListener objects, which wastes 
memory.

Code like this:

 
{code:java}
// code placeholder
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
.setAppName("demo")
.setMaster("local[*]")

val spark = SparkSession.builder()
.config(sparkConf)
.getOrCreate()

setDefaultSession(null)
setActiveSession(null)

val spark2 = SparkSession.builder()
.config(sparkConf)
.getOrCreate()

import spark.implicits._
val testData = spark.sparkContext
.parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
testData.createOrReplaceTempView("testTable")
val testData2 = spark.sparkContext.parallelize(
TestData2(1, "1") ::
TestData2(1, "2") ::
TestData2(2, "1") ::
TestData2(2, "2") ::
TestData2(3, "1") ::
TestData2(3, "2") ::
Nil, 2).toDF()

testData2.createOrReplaceTempView("testTable2")
val query = "select ind2,count(*) from ( select * from testTable2 join 
testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group 
by ind2"
spark.sql(query).collect()

Thread.sleep(50)
spark.stop()
}
case class TestData(ind: Int, name: String)
case class TestData2(ind2: Int, name: String) {code}

  was:
In Spark, if we create multiple SparkSessions in one program, we get 
multiple SQLTab instances in the UI.

At the same time, we get multiple SQLAppStatusListener objects, which wastes 
memory.

Code like this:

 
{code:java}
// code placeholder
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
.setAppName("demo")
.setMaster("local[*]")

val spark = SparkSession.builder()
.config(sparkConf)
.getOrCreate()

setDefaultSession(null)
setActiveSession(null)

val spark2 = SparkSession.builder()
.config(sparkConf)
.getOrCreate()

import spark.implicits._
val testData = spark.sparkContext
.parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
testData.createOrReplaceTempView("testTable")
val testData2 = spark.sparkContext.parallelize(
TestData2(1, "1") ::
TestData2(1, "2") ::
TestData2(2, "1") ::
TestData2(2, "2") ::
TestData2(3, "1") ::
TestData2(3, "2") ::
Nil, 2).toDF()

testData2.createOrReplaceTempView("testTable2")
val query = "select ind2,count(*) from ( select * from testTable2 join 
testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group 
by ind2"
spark.sql(query).collect()

Thread.sleep(50)
spark.stop()
}
{code}


> Multi sparkSession should share single SQLAppStatusStore
> 
>
> Key: SPARK-41555
> URL: https://issues.apache.org/jira/browse/SPARK-41555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1, 3.3.0
>Reporter: jiahong.li
>Priority: Major
> Attachments: muti-SQLStore.png, muti-sqltab.png
>
>
> In Spark, if we create multiple SparkSessions in one program, we get
> multiple SQLTab instances in the UI.
> At the same time, we get multiple SQLAppStatusListener objects, which
> wastes memory.
> Code like this:
>  
> {code:java}
> // code placeholder
> def main(args: Array[String]): Unit = {
> val sparkConf = new SparkConf()
> .setAppName("demo")
> .setMaster("local[*]")
> val spark = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> setDefaultSession(null)
> setActiveSession(null)
> val spark2 = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> import spark.implicits._
> val testData = spark.sparkContext
> .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
> testData.createOrReplaceTempView("testTable")
> val testData2 = spark.sparkContext.parallelize(
> TestData2(1, "1") ::
> TestData2(1, "2") ::
> TestData2(2, "1") ::
> TestData2(2, "2") ::
> TestData2(3, "1") ::
> TestData2(3, "2") ::
> Nil, 2).toDF()
> testData2.createOrReplaceTempView("testTable2")
> val query = "select ind2,count(*) from ( select * from testTable2 join 
> testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') 
> group by ind2"
> spark.sql(query).collect()
> Thread.sleep(50)
> spark.stop()
> }
> case class TestData(ind: Int, name: String)
> case class TestData2(ind2: Int, name: String) {code}






[jira] [Commented] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648843#comment-17648843
 ] 

Apache Spark commented on SPARK-41555:
--

User 'monkeyboy123' has created a pull request for this issue:
https://github.com/apache/spark/pull/39102

> Multi sparkSession should share single SQLAppStatusStore
> 
>
> Key: SPARK-41555
> URL: https://issues.apache.org/jira/browse/SPARK-41555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1, 3.3.0
>Reporter: jiahong.li
>Priority: Major
> Attachments: muti-SQLStore.png, muti-sqltab.png
>
>
> In Spark, if we create multiple SparkSessions in one program, we get
> multiple SQLTab instances in the UI.
> At the same time, we get multiple SQLAppStatusListener objects, which
> wastes memory.
> Code like this:
>  
> {code:java}
> // code placeholder
> def main(args: Array[String]): Unit = {
> val sparkConf = new SparkConf()
> .setAppName("demo")
> .setMaster("local[*]")
> val spark = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> setDefaultSession(null)
> setActiveSession(null)
> val spark2 = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> import spark.implicits._
> val testData = spark.sparkContext
> .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
> testData.createOrReplaceTempView("testTable")
> val testData2 = spark.sparkContext.parallelize(
> TestData2(1, "1") ::
> TestData2(1, "2") ::
> TestData2(2, "1") ::
> TestData2(2, "2") ::
> TestData2(3, "1") ::
> TestData2(3, "2") ::
> Nil, 2).toDF()
> testData2.createOrReplaceTempView("testTable2")
> val query = "select ind2,count(*) from ( select * from testTable2 join 
> testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') 
> group by ind2"
> spark.sql(query).collect()
> Thread.sleep(50)
> spark.stop()
> }
> {code}






[jira] [Commented] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648841#comment-17648841
 ] 

Apache Spark commented on SPARK-41555:
--

User 'monkeyboy123' has created a pull request for this issue:
https://github.com/apache/spark/pull/39101

> Multi sparkSession should share single SQLAppStatusStore
> 
>
> Key: SPARK-41555
> URL: https://issues.apache.org/jira/browse/SPARK-41555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1, 3.3.0
>Reporter: jiahong.li
>Priority: Major
> Attachments: muti-SQLStore.png, muti-sqltab.png
>
>
> In Spark, if we create multiple SparkSessions in one program, we get
> multiple SQLTab instances in the UI.
> At the same time, we get multiple SQLAppStatusListener objects, which
> wastes memory.
> Code like this:
>  
> {code:java}
> // code placeholder
> def main(args: Array[String]): Unit = {
> val sparkConf = new SparkConf()
> .setAppName("demo")
> .setMaster("local[*]")
> val spark = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> setDefaultSession(null)
> setActiveSession(null)
> val spark2 = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> import spark.implicits._
> val testData = spark.sparkContext
> .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
> testData.createOrReplaceTempView("testTable")
> val testData2 = spark.sparkContext.parallelize(
> TestData2(1, "1") ::
> TestData2(1, "2") ::
> TestData2(2, "1") ::
> TestData2(2, "2") ::
> TestData2(3, "1") ::
> TestData2(3, "2") ::
> Nil, 2).toDF()
> testData2.createOrReplaceTempView("testTable2")
> val query = "select ind2,count(*) from ( select * from testTable2 join 
> testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') 
> group by ind2"
> spark.sql(query).collect()
> Thread.sleep(50)
> spark.stop()
> }
> {code}






[jira] [Assigned] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41555:


Assignee: (was: Apache Spark)

> Multi sparkSession should share single SQLAppStatusStore
> 
>
> Key: SPARK-41555
> URL: https://issues.apache.org/jira/browse/SPARK-41555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1, 3.3.0
>Reporter: jiahong.li
>Priority: Major
> Attachments: muti-SQLStore.png, muti-sqltab.png
>
>
> In Spark, if we create multiple SparkSessions in one program, we get
> multiple SQLTab instances in the UI.
> At the same time, we get multiple SQLAppStatusListener objects, which
> wastes memory.
> Code like this:
>  
> {code:java}
> // code placeholder
> def main(args: Array[String]): Unit = {
> val sparkConf = new SparkConf()
> .setAppName("demo")
> .setMaster("local[*]")
> val spark = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> setDefaultSession(null)
> setActiveSession(null)
> val spark2 = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> import spark.implicits._
> val testData = spark.sparkContext
> .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
> testData.createOrReplaceTempView("testTable")
> val testData2 = spark.sparkContext.parallelize(
> TestData2(1, "1") ::
> TestData2(1, "2") ::
> TestData2(2, "1") ::
> TestData2(2, "2") ::
> TestData2(3, "1") ::
> TestData2(3, "2") ::
> Nil, 2).toDF()
> testData2.createOrReplaceTempView("testTable2")
> val query = "select ind2,count(*) from ( select * from testTable2 join 
> testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') 
> group by ind2"
> spark.sql(query).collect()
> Thread.sleep(50)
> spark.stop()
> }
> {code}






[jira] [Assigned] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41555:


Assignee: Apache Spark

> Multi sparkSession should share single SQLAppStatusStore
> 
>
> Key: SPARK-41555
> URL: https://issues.apache.org/jira/browse/SPARK-41555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1, 3.3.0
>Reporter: jiahong.li
>Assignee: Apache Spark
>Priority: Major
> Attachments: muti-SQLStore.png, muti-sqltab.png
>
>
> In Spark, if we create multiple SparkSessions in one program, we get
> multiple SQLTab instances in the UI.
> At the same time, we get multiple SQLAppStatusListener objects, which
> wastes memory.
> Code like this:
>  
> {code:java}
> // code placeholder
> def main(args: Array[String]): Unit = {
> val sparkConf = new SparkConf()
> .setAppName("demo")
> .setMaster("local[*]")
> val spark = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> setDefaultSession(null)
> setActiveSession(null)
> val spark2 = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> import spark.implicits._
> val testData = spark.sparkContext
> .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
> testData.createOrReplaceTempView("testTable")
> val testData2 = spark.sparkContext.parallelize(
> TestData2(1, "1") ::
> TestData2(1, "2") ::
> TestData2(2, "1") ::
> TestData2(2, "2") ::
> TestData2(3, "1") ::
> TestData2(3, "2") ::
> Nil, 2).toDF()
> testData2.createOrReplaceTempView("testTable2")
> val query = "select ind2,count(*) from ( select * from testTable2 join 
> testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') 
> group by ind2"
> spark.sql(query).collect()
> Thread.sleep(50)
> spark.stop()
> }
> {code}






[jira] [Commented] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648840#comment-17648840
 ] 

Apache Spark commented on SPARK-41555:
--

User 'monkeyboy123' has created a pull request for this issue:
https://github.com/apache/spark/pull/39101

> Multi sparkSession should share single SQLAppStatusStore
> 
>
> Key: SPARK-41555
> URL: https://issues.apache.org/jira/browse/SPARK-41555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1, 3.3.0
>Reporter: jiahong.li
>Priority: Major
> Attachments: muti-SQLStore.png, muti-sqltab.png
>
>
> In Spark, if we create multiple SparkSessions in one program, we get
> multiple SQLTab instances in the UI.
> At the same time, we get multiple SQLAppStatusListener objects, which
> wastes memory.
> Code like this:
>  
> {code:java}
> // code placeholder
> def main(args: Array[String]): Unit = {
> val sparkConf = new SparkConf()
> .setAppName("demo")
> .setMaster("local[*]")
> val spark = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> setDefaultSession(null)
> setActiveSession(null)
> val spark2 = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> import spark.implicits._
> val testData = spark.sparkContext
> .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
> testData.createOrReplaceTempView("testTable")
> val testData2 = spark.sparkContext.parallelize(
> TestData2(1, "1") ::
> TestData2(1, "2") ::
> TestData2(2, "1") ::
> TestData2(2, "2") ::
> TestData2(3, "1") ::
> TestData2(3, "2") ::
> Nil, 2).toDF()
> testData2.createOrReplaceTempView("testTable2")
> val query = "select ind2,count(*) from ( select * from testTable2 join 
> testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') 
> group by ind2"
> spark.sql(query).collect()
> Thread.sleep(50)
> spark.stop()
> }
> {code}






[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-41555:
---
Description: 
In Spark, if we create multiple SparkSessions in one program, we get 
multiple SQLTab instances in the UI.

At the same time, we get multiple SQLAppStatusListener objects, which wastes 
memory.

Code like this:

 
{code:java}
// code placeholder
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
.setAppName("demo")
.setMaster("local[*]")

val spark = SparkSession.builder()
.config(sparkConf)
.getOrCreate()

setDefaultSession(null)
setActiveSession(null)

val spark2 = SparkSession.builder()
.config(sparkConf)
.getOrCreate()

import spark.implicits._
val testData = spark.sparkContext
.parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
testData.createOrReplaceTempView("testTable")
val testData2 = spark.sparkContext.parallelize(
TestData2(1, "1") ::
TestData2(1, "2") ::
TestData2(2, "1") ::
TestData2(2, "2") ::
TestData2(3, "1") ::
TestData2(3, "2") ::
Nil, 2).toDF()

testData2.createOrReplaceTempView("testTable2")
val query = "select ind2,count(*) from ( select * from testTable2 join 
testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group 
by ind2"
spark.sql(query).collect()

Thread.sleep(50)
spark.stop()
}
{code}

  was:
In Spark, if we create multiple SparkSessions in one program, we get 
multiple SQLTab instances in the UI.

At the same time, we get multiple SQLAppStatusListener objects, which wastes 
memory.

Code like this:

 

def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
.setAppName("demo")
.setMaster("local[*]")

val spark = SparkSession.builder()
.config(sparkConf)
.getOrCreate()

setDefaultSession(null)
setActiveSession(null)

val spark2 = SparkSession.builder()
.config(sparkConf)
.getOrCreate()

import spark.implicits._
val testData = spark.sparkContext
.parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
testData.createOrReplaceTempView("testTable")
val testData2 = spark.sparkContext.parallelize(
TestData2(1, "1") ::
TestData2(1, "2") ::
TestData2(2, "1") ::
TestData2(2, "2") ::
TestData2(3, "1") ::
TestData2(3, "2") ::
Nil, 2).toDF()

testData2.createOrReplaceTempView("testTable2")
val query = "select ind2,count(*) from ( select * from testTable2 join 
testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group 
by ind2"
spark.sql(query).collect()

Thread.sleep(50)
spark.stop()
}

 


> Multi sparkSession should share single SQLAppStatusStore
> 
>
> Key: SPARK-41555
> URL: https://issues.apache.org/jira/browse/SPARK-41555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1, 3.3.0
>Reporter: jiahong.li
>Priority: Major
> Attachments: muti-SQLStore.png, muti-sqltab.png
>
>
> In Spark, if we create multiple SparkSessions in one program, we get
> multiple SQLTab instances in the UI.
> At the same time, we get multiple SQLAppStatusListener objects, which
> wastes memory.
> Code like this:
>  
> {code:java}
> // code placeholder
> def main(args: Array[String]): Unit = {
> val sparkConf = new SparkConf()
> .setAppName("demo")
> .setMaster("local[*]")
> val spark = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> setDefaultSession(null)
> setActiveSession(null)
> val spark2 = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> import spark.implicits._
> val testData = spark.sparkContext
> .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
> testData.createOrReplaceTempView("testTable")
> val testData2 = spark.sparkContext.parallelize(
> TestData2(1, "1") ::
> TestData2(1, "2") ::
> TestData2(2, "1") ::
> TestData2(2, "2") ::
> TestData2(3, "1") ::
> TestData2(3, "2") ::
> Nil, 2).toDF()
> testData2.createOrReplaceTempView("testTable2")
> val query = "select ind2,count(*) from ( select * from testTable2 join 
> testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') 
> group by ind2"
> spark.sql(query).collect()
> Thread.sleep(50)
> spark.stop()
> }
> {code}






[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-41555:
---
Description: 
In Spark, if we create multiple SparkSessions in one program, we get 
multiple SQLTab instances in the UI.

At the same time, we get multiple SQLAppStatusListener objects, which wastes 
memory.

Code like this:

 

def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
.setAppName("demo")
.setMaster("local[*]")

val spark = SparkSession.builder()
.config(sparkConf)
.getOrCreate()

setDefaultSession(null)
setActiveSession(null)

val spark2 = SparkSession.builder()
.config(sparkConf)
.getOrCreate()

import spark.implicits._
val testData = spark.sparkContext
.parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
testData.createOrReplaceTempView("testTable")
val testData2 = spark.sparkContext.parallelize(
TestData2(1, "1") ::
TestData2(1, "2") ::
TestData2(2, "1") ::
TestData2(2, "2") ::
TestData2(3, "1") ::
TestData2(3, "2") ::
Nil, 2).toDF()

testData2.createOrReplaceTempView("testTable2")
val query = "select ind2,count(*) from ( select * from testTable2 join 
testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group 
by ind2"
spark.sql(query).collect()

Thread.sleep(50)
spark.stop()
}

 

  was:
In Spark, if we create multiple SparkSessions in one program, we get 
multiple SQLTab instances in the UI.

At the same time, we get multiple SQLAppStatusListener objects, which wastes 
memory.

Code like this:

```

def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf()
.setAppName("demo")
.setMaster("local[*]")

val spark = SparkSession.builder()
.config(sparkConf)
.getOrCreate()

setDefaultSession(null)
setActiveSession(null)

val spark2 = SparkSession.builder()
.config(sparkConf)
.getOrCreate()

import spark.implicits._
val testData = spark.sparkContext
.parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
testData.createOrReplaceTempView("testTable")
val testData2 = spark.sparkContext.parallelize(
TestData2(1, "1") ::
TestData2(1, "2") ::
TestData2(2, "1") ::
TestData2(2, "2") ::
TestData2(3, "1") ::
TestData2(3, "2") ::
Nil, 2).toDF()

testData2.createOrReplaceTempView("testTable2")
val query = "select ind2,count(*) from ( select * from testTable2 join 
testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group 
by ind2"
spark.sql(query).collect()

Thread.sleep(50)
spark.stop()
}

```


> Multi sparkSession should share single SQLAppStatusStore
> 
>
> Key: SPARK-41555
> URL: https://issues.apache.org/jira/browse/SPARK-41555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1, 3.3.0
>Reporter: jiahong.li
>Priority: Major
> Attachments: muti-SQLStore.png, muti-sqltab.png
>
>
> In Spark, if we create multiple SparkSessions in one program, we get
> multiple SQLTab instances in the UI.
> At the same time, we get multiple SQLAppStatusListener objects, which
> wastes memory.
> Code like this:
>  
> def main(args: Array[String]): Unit = {
> val sparkConf = new SparkConf()
> .setAppName("demo")
> .setMaster("local[*]")
> val spark = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> setDefaultSession(null)
> setActiveSession(null)
> val spark2 = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> import spark.implicits._
> val testData = spark.sparkContext
> .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
> testData.createOrReplaceTempView("testTable")
> val testData2 = spark.sparkContext.parallelize(
> TestData2(1, "1") ::
> TestData2(1, "2") ::
> TestData2(2, "1") ::
> TestData2(2, "2") ::
> TestData2(3, "1") ::
> TestData2(3, "2") ::
> Nil, 2).toDF()
> testData2.createOrReplaceTempView("testTable2")
> val query = "select ind2,count(*) from ( select * from testTable2 join 
> testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') 
> group by ind2"
> spark.sql(query).collect()
> Thread.sleep(50)
> spark.stop()
> }
>  






[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-41555:
---
Description: 
In Spark, if we create multiple SparkSessions in a program, we get multiple 
SQL tabs in the UI.

At the same time, we get multiple SQLAppStatusListener objects, which wastes 
memory.

code like this:

```

def main(args: Array[String]): Unit = {
  val sparkConf = new SparkConf()
    .setAppName("demo")
    .setMaster("local[*]")

  val spark = SparkSession.builder()
    .config(sparkConf)
    .getOrCreate()

  setDefaultSession(null)
  setActiveSession(null)

  val spark2 = SparkSession.builder()
    .config(sparkConf)
    .getOrCreate()

  import spark.implicits._
  val testData = spark.sparkContext
    .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
  testData.createOrReplaceTempView("testTable")
  val testData2 = spark.sparkContext.parallelize(
    TestData2(1, "1") ::
    TestData2(1, "2") ::
    TestData2(2, "1") ::
    TestData2(2, "2") ::
    TestData2(3, "1") ::
    TestData2(3, "2") ::
    Nil, 2).toDF()

  testData2.createOrReplaceTempView("testTable2")
  val query = "select ind2,count(*) from ( select * from testTable2 join testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') group by ind2"
  spark.sql(query).collect()

  Thread.sleep(50)
  spark.stop()
}

```

  was:
In spark , if we create multi sparkSession in the program, we will get 
multi-SQLTab in UI, 

At the same time, we will get muti-SQLAppStatusListener object, it is waste of 
memory.


> Multi sparkSession should share single SQLAppStatusStore
> 
>
> Key: SPARK-41555
> URL: https://issues.apache.org/jira/browse/SPARK-41555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1, 3.3.0
>Reporter: jiahong.li
>Priority: Major
> Attachments: muti-SQLStore.png, muti-sqltab.png
>
>
> In Spark, if we create multiple SparkSessions in a program, we get multiple 
> SQL tabs in the UI.
> At the same time, we get multiple SQLAppStatusListener objects, which wastes 
> memory.
> code like this:
> ```
> def main(args: Array[String]): Unit = {
> val sparkConf = new SparkConf()
> .setAppName("demo")
> .setMaster("local[*]")
> val spark = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> setDefaultSession(null)
> setActiveSession(null)
> val spark2 = SparkSession.builder()
> .config(sparkConf)
> .getOrCreate()
> import spark.implicits._
> val testData = spark.sparkContext
> .parallelize((1 to 3).map(i => TestData(i, i.toString))).toDF()
> testData.createOrReplaceTempView("testTable")
> val testData2 = spark.sparkContext.parallelize(
> TestData2(1, "1") ::
> TestData2(1, "2") ::
> TestData2(2, "1") ::
> TestData2(2, "2") ::
> TestData2(3, "1") ::
> TestData2(3, "2") ::
> Nil, 2).toDF()
> testData2.createOrReplaceTempView("testTable2")
> val query = "select ind2,count(*) from ( select * from testTable2 join 
> testTable on testTable.ind = testTable2.ind2 where testTable.name <> '1') 
> group by ind2"
> spark.sql(query).collect()
> Thread.sleep(50)
> spark.stop()
> }
> ```






[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-41555:
---
Description: 
In Spark, if we create multiple SparkSessions in a program, we get multiple 
SQL tabs in the UI.

At the same time, we get multiple SQLAppStatusListener objects, which wastes 
memory.

  was:
In spark , if we create multi sparkSession in the program, we will get 
multi-SQL tab in UI,

At the same time, we will get muti-SQLAppStatusListener object, it is waste of 
memory.


> Multi sparkSession should share single SQLAppStatusStore
> 
>
> Key: SPARK-41555
> URL: https://issues.apache.org/jira/browse/SPARK-41555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1, 3.3.0
>Reporter: jiahong.li
>Priority: Major
> Attachments: muti-SQLStore.png, muti-sqltab.png
>
>
> In Spark, if we create multiple SparkSessions in a program, we get multiple 
> SQL tabs in the UI.
> At the same time, we get multiple SQLAppStatusListener objects, which wastes 
> memory.






[jira] [Updated] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread jiahong.li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jiahong.li updated SPARK-41555:
---
Attachment: muti-sqltab.png
muti-SQLStore.png

> Multi sparkSession should share single SQLAppStatusStore
> 
>
> Key: SPARK-41555
> URL: https://issues.apache.org/jira/browse/SPARK-41555
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.1, 3.3.0
>Reporter: jiahong.li
>Priority: Major
> Attachments: muti-SQLStore.png, muti-sqltab.png
>
>
> In Spark, if we create multiple SparkSessions in a program, we get multiple 
> SQL tabs in the UI.
> At the same time, we get multiple SQLAppStatusListener objects, which wastes 
> memory.






[jira] [Created] (SPARK-41555) Multi sparkSession should share single SQLAppStatusStore

2022-12-16 Thread jiahong.li (Jira)
jiahong.li created SPARK-41555:
--

 Summary: Multi sparkSession should share single SQLAppStatusStore
 Key: SPARK-41555
 URL: https://issues.apache.org/jira/browse/SPARK-41555
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0, 3.2.1, 3.1.1
Reporter: jiahong.li


In Spark, if we create multiple SparkSessions in a program, we get multiple 
SQL tabs in the UI.

At the same time, we get multiple SQLAppStatusListener objects, which wastes 
memory.






[jira] [Comment Edited] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper

2022-12-16 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648835#comment-17648835
 ] 

Gengliang Wang edited comment on SPARK-41422 at 12/17/22 12:33 AM:
---

[~techaddict] I have a PR for this one already. Sorry I didn't claim it. 

I will claim it next time. ExecutorMetrics is a bit tricky, so I am doing it 
myself.


was (Author: gengliang.wang):
[~techaddict] I have a PR for this one already. Sorry I didn't claim it. 

> Protobuf serializer for ExecutorSummaryWrapper
> --
>
> Key: SPARK-41422
> URL: https://issues.apache.org/jira/browse/SPARK-41422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41422:


Assignee: Apache Spark

> Protobuf serializer for ExecutorSummaryWrapper
> --
>
> Key: SPARK-41422
> URL: https://issues.apache.org/jira/browse/SPARK-41422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Commented] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648837#comment-17648837
 ] 

Apache Spark commented on SPARK-41422:
--

User 'gengliangwang' has created a pull request for this issue:
https://github.com/apache/spark/pull/39100

> Protobuf serializer for ExecutorSummaryWrapper
> --
>
> Key: SPARK-41422
> URL: https://issues.apache.org/jira/browse/SPARK-41422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41422:


Assignee: (was: Apache Spark)

> Protobuf serializer for ExecutorSummaryWrapper
> --
>
> Key: SPARK-41422
> URL: https://issues.apache.org/jira/browse/SPARK-41422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Commented] (SPARK-41162) Anti-join must not be pushed below aggregation with ambiguous predicates

2022-12-16 Thread SHU WANG (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648836#comment-17648836
 ] 

SHU WANG commented on SPARK-41162:
--

[~shardulm] Yes. Checked with Spark 3.1.2, and it's also an issue.

> Anti-join must not be pushed below aggregation with ambiguous predicates
> 
>
> Key: SPARK-41162
> URL: https://issues.apache.org/jira/browse/SPARK-41162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Enrico Minack
>Priority: Major
>
> The following query should return a single row as all values for {{id}} 
> except for the largest will be eliminated by the anti-join:
> {code}
> val ids = Seq(1, 2, 3).toDF("id").distinct()
> val result = ids.withColumn("id", $"id" + 1).join(ids, "id", 
> "left_anti").collect()
> assert(result.length == 1)
> {code}
> Without the {{distinct()}}, the assertion is true. With {{distinct()}}, the 
> assertion should still hold but is false.
> Rule {{PushDownLeftSemiAntiJoin}} pushes the {{Join}} below the left 
> {{Aggregate}} with join condition {{(id#750 + 1) = id#750}}, which can never 
> be true.
> {code}
> === Applying Rule 
> org.apache.spark.sql.catalyst.optimizer.PushDownLeftSemiAntiJoin ===
> !Join LeftAnti, (id#752 = id#750)  'Aggregate [id#750], 
> [(id#750 + 1) AS id#752]
> !:- Aggregate [id#750], [(id#750 + 1) AS id#752]   +- 'Join LeftAnti, 
> ((id#750 + 1) = id#750)
> !:  +- LocalRelation [id#750] :- LocalRelation 
> [id#750]
> !+- Aggregate [id#750], [id#750]  +- Aggregate [id#750], 
> [id#750]
> !   +- LocalRelation [id#750]+- LocalRelation 
> [id#750]
> {code}
> The optimizer then rightly removes the left-anti join altogether, returning 
> the left child only.
> Rule {{PushDownLeftSemiAntiJoin}} should not push down predicates that 
> reference left *and* right child.
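The unsound rewrite described above can be reproduced without Spark. A small Python sketch of the two plans over plain sets (the names are illustrative, not Spark code):

```python
# Reproduction of the unsoundness with plain Python sets (illustrative, not
# Spark code). "ids" plays the role of the distinct relation.
ids = {1, 2, 3}

# Correct plan: project id + 1 above the aggregate, then anti-join.
projected = {i + 1 for i in ids}          # {2, 3, 4}
correct = projected - ids                 # anti-join keeps only {4}

# Buggy pushed-down plan: the condition (id + 1) = id now compares the same
# attribute, so it is false for every row; the anti-join eliminates nothing
# and the aggregate projects all three rows.
survivors = {i for i in ids if i + 1 != i}   # always keeps every row
buggy = {i + 1 for i in survivors}           # {2, 3, 4}

assert correct == {4}          # the expected single row
assert buggy == {2, 3, 4}      # the wrong three-row result
```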






[jira] [Commented] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper

2022-12-16 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648835#comment-17648835
 ] 

Gengliang Wang commented on SPARK-41422:


[~techaddict] I have a PR for this one already. Sorry I didn't claim it. 

> Protobuf serializer for ExecutorSummaryWrapper
> --
>
> Key: SPARK-41422
> URL: https://issues.apache.org/jira/browse/SPARK-41422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Commented] (SPARK-41422) Protobuf serializer for ExecutorSummaryWrapper

2022-12-16 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41422?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648832#comment-17648832
 ] 

Sandeep Singh commented on SPARK-41422:
---

Working on this, will create a PR soon.

> Protobuf serializer for ExecutorSummaryWrapper
> --
>
> Key: SPARK-41422
> URL: https://issues.apache.org/jira/browse/SPARK-41422
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Updated] (SPARK-41530) Rename MedianHeap to PercentileMap and support percentile

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41530:
--
Summary: Rename MedianHeap to PercentileMap and support percentile  (was: 
extend MedianHeap to support percentile)

> Rename MedianHeap to PercentileMap and support percentile
> -
>
> Key: SPARK-41530
> URL: https://issues.apache.org/jira/browse/SPARK-41530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>
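Spark's actual structure lives in Spark Core and is not shown in this ticket; purely as an illustration of the rename's intent (a median-only container generalized to arbitrary percentiles), a running-percentile tracker might be sketched as:

```python
import bisect

class PercentileTracker:
    """Illustrative stand-in (not Spark's PercentileMap): keep inserted
    values sorted so any percentile is a single index lookup."""

    def __init__(self):
        self._sorted = []

    def insert(self, value):
        bisect.insort(self._sorted, value)  # O(n) insert, keeps order

    def percentile(self, p):
        """Nearest-rank percentile for p in [0.0, 1.0]."""
        if not self._sorted:
            raise ValueError("empty tracker")
        idx = min(int(p * len(self._sorted)), len(self._sorted) - 1)
        return self._sorted[idx]

    def median(self):
        return self.percentile(0.5)

t = PercentileTracker()
for v in [5, 1, 4, 2, 3]:
    t.insert(v)
assert t.median() == 3          # generalizes what a median heap provides
assert t.percentile(1.0) == 5
```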







[jira] [Assigned] (SPARK-41530) extend MedianHeap to support percentile

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41530:
-

Assignee: Wenchen Fan

> extend MedianHeap to support percentile
> ---
>
> Key: SPARK-41530
> URL: https://issues.apache.org/jira/browse/SPARK-41530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
>







[jira] [Resolved] (SPARK-41530) extend MedianHeap to support percentile

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41530.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39076
[https://github.com/apache/spark/pull/39076]

> extend MedianHeap to support percentile
> ---
>
> Key: SPARK-41530
> URL: https://issues.apache.org/jira/browse/SPARK-41530
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-41554) Decimal.changePrecision produces ArrayIndexOutOfBoundsException

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648830#comment-17648830
 ] 

Apache Spark commented on SPARK-41554:
--

User 'fe2s' has created a pull request for this issue:
https://github.com/apache/spark/pull/39099

> Decimal.changePrecision produces ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-41554
> URL: https://issues.apache.org/jira/browse/SPARK-41554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Oleksiy Dyagilev
>Priority: Major
>
> Reducing a Decimal's scale by more than 18 produces an exception.
> {code:java}
> Decimal(1, 38, 19).changePrecision(38, 0){code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 19
>     at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:377)
>     at 
> org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:328){code}
> Reproducing with SQL query:
> {code:java}
> sql("select cast(cast(cast(cast(id as decimal(38,15)) as decimal(38,30)) as 
> decimal(38,37)) as decimal(38,17)) from range(3)").show{code}
> The bug exists for {{Decimal}} that is stored using compact long only, it 
> works fine with {{Decimal}} that uses {{scala.math.BigDecimal}} internally.






[jira] [Assigned] (SPARK-41554) Decimal.changePrecision produces ArrayIndexOutOfBoundsException

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41554:


Assignee: Apache Spark

> Decimal.changePrecision produces ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-41554
> URL: https://issues.apache.org/jira/browse/SPARK-41554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Oleksiy Dyagilev
>Assignee: Apache Spark
>Priority: Major
>
> Reducing a Decimal's scale by more than 18 produces an exception.
> {code:java}
> Decimal(1, 38, 19).changePrecision(38, 0){code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 19
>     at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:377)
>     at 
> org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:328){code}
> Reproducing with SQL query:
> {code:java}
> sql("select cast(cast(cast(cast(id as decimal(38,15)) as decimal(38,30)) as 
> decimal(38,37)) as decimal(38,17)) from range(3)").show{code}
> The bug exists for {{Decimal}} that is stored using compact long only, it 
> works fine with {{Decimal}} that uses {{scala.math.BigDecimal}} internally.






[jira] [Assigned] (SPARK-41554) Decimal.changePrecision produces ArrayIndexOutOfBoundsException

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41554:


Assignee: (was: Apache Spark)

> Decimal.changePrecision produces ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-41554
> URL: https://issues.apache.org/jira/browse/SPARK-41554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Oleksiy Dyagilev
>Priority: Major
>
> Reducing a Decimal's scale by more than 18 produces an exception.
> {code:java}
> Decimal(1, 38, 19).changePrecision(38, 0){code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 19
>     at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:377)
>     at 
> org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:328){code}
> Reproducing with SQL query:
> {code:java}
> sql("select cast(cast(cast(cast(id as decimal(38,15)) as decimal(38,30)) as 
> decimal(38,37)) as decimal(38,17)) from range(3)").show{code}
> The bug exists for {{Decimal}} that is stored using compact long only, it 
> works fine with {{Decimal}} that uses {{scala.math.BigDecimal}} internally.






[jira] [Commented] (SPARK-41554) Decimal.changePrecision produces ArrayIndexOutOfBoundsException

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648829#comment-17648829
 ] 

Apache Spark commented on SPARK-41554:
--

User 'fe2s' has created a pull request for this issue:
https://github.com/apache/spark/pull/39099

> Decimal.changePrecision produces ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-41554
> URL: https://issues.apache.org/jira/browse/SPARK-41554
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.1
>Reporter: Oleksiy Dyagilev
>Priority: Major
>
> Reducing a Decimal's scale by more than 18 produces an exception.
> {code:java}
> Decimal(1, 38, 19).changePrecision(38, 0){code}
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: 19
>     at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:377)
>     at 
> org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:328){code}
> Reproducing with SQL query:
> {code:java}
> sql("select cast(cast(cast(cast(id as decimal(38,15)) as decimal(38,30)) as 
> decimal(38,37)) as decimal(38,17)) from range(3)").show{code}
> The bug exists for {{Decimal}} that is stored using compact long only, it 
> works fine with {{Decimal}} that uses {{scala.math.BigDecimal}} internally.






[jira] [Updated] (SPARK-41447) Reduce the number of doMergeApplicationListing invocations

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41447:
--
Priority: Minor  (was: Major)

> Reduce the number of doMergeApplicationListing invocations
> --
>
> Key: SPARK-41447
> URL: https://issues.apache.org/jira/browse/SPARK-41447
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: shuyouZZ
>Assignee: shuyouZZ
>Priority: Minor
> Fix For: 3.4.0
>
>
> When restarting the history server, the previous logic is to execute 
> {{checkForLogs}} first, which will cause the expired event log files to be 
> parsed, and then execute {{checkAndCleanLog}} to delete parsed info, which is 
> unnecessary. In the history server log, we can see many {{INFO FsHistoryProvider: 
> Finished parsing application_xxx}} followed by {{{}INFO FsHistoryProvider: 
> Deleting expired event log for application_xxx{}}}. If there are a large 
> number of expired log files in the log directory, it will affect the speed of 
> replay.
> In order to avoid this, we can put {{cleanLogs}} before {{{}checkForLogs{}}}.
> In addition, since {{cleanLogs}} is executed before {{{}checkForLogs{}}}, 
> when the history server is starting, the expired log info may not exist in 
> the listing db, so we need to clean up these log files in {{{}cleanLogs{}}}.
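The reordering described above can be illustrated without Spark. A Python sketch with hypothetical names (not Spark's API) of why running the cleaner before the scanner avoids parsing files that are about to be deleted:

```python
# Illustrative sketch (hypothetical names, not Spark's API): logs older
# than MAX_AGE seconds are expired and should never reach the parser.
MAX_AGE = 100.0
parsed = []

def clean_logs(log_dir, now):
    """Delete expired logs first, mirroring cleanLogs."""
    for name, mtime in list(log_dir.items()):
        if now - mtime > MAX_AGE:
            del log_dir[name]

def check_for_logs(log_dir):
    """Parse what remains, mirroring checkForLogs / doMergeApplicationListing."""
    for name in log_dir:
        parsed.append(name)

now = 1000.0
log_dir = {"application_fresh": 950.0, "application_expired": 100.0}
clean_logs(log_dir, now)    # expired file removed before scanning ...
check_for_logs(log_dir)     # ... so only the fresh log is ever parsed
assert parsed == ["application_fresh"]
```

Running the two calls in the opposite order would parse the expired file first and only delete it afterwards, which is the wasted work the patch removes.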






[jira] [Updated] (SPARK-41447) Reduce the number of doMergeApplicationListing invocations

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-41447:
--
Summary: Reduce the number of doMergeApplicationListing invocations  (was: 
clean up expired event log files that don't exist in listing db)

> Reduce the number of doMergeApplicationListing invocations
> --
>
> Key: SPARK-41447
> URL: https://issues.apache.org/jira/browse/SPARK-41447
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: shuyouZZ
>Assignee: shuyouZZ
>Priority: Major
> Fix For: 3.4.0
>
>
> When restarting the history server, the previous logic is to execute 
> {{checkForLogs}} first, which will cause the expired event log files to be 
> parsed, and then execute {{checkAndCleanLog}} to delete parsed info, which is 
> unnecessary. In the history server log, we can see many {{INFO FsHistoryProvider: 
> Finished parsing application_xxx}} followed by {{{}INFO FsHistoryProvider: 
> Deleting expired event log for application_xxx{}}}. If there are a large 
> number of expired log files in the log directory, it will affect the speed of 
> replay.
> In order to avoid this, we can put {{cleanLogs}} before {{{}checkForLogs{}}}.
> In addition, since {{cleanLogs}} is executed before {{{}checkForLogs{}}}, 
> when the history server is starting, the expired log info may not exist in 
> the listing db, so we need to clean up these log files in {{{}cleanLogs{}}}.






[jira] [Resolved] (SPARK-41447) clean up expired event log files that don't exist in listing db

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41447.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 38983
[https://github.com/apache/spark/pull/38983]

> clean up expired event log files that don't exist in listing db
> ---
>
> Key: SPARK-41447
> URL: https://issues.apache.org/jira/browse/SPARK-41447
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: shuyouZZ
>Assignee: shuyouZZ
>Priority: Major
> Fix For: 3.4.0
>
>
> When restarting the history server, the previous logic is to execute 
> {{checkForLogs}} first, which will cause the expired event log files to be 
> parsed, and then execute {{checkAndCleanLog}} to delete parsed info, which is 
> unnecessary. In the history server log, we can see many {{INFO FsHistoryProvider: 
> Finished parsing application_xxx}} followed by {{{}INFO FsHistoryProvider: 
> Deleting expired event log for application_xxx{}}}. If there are a large 
> number of expired log files in the log directory, it will affect the speed of 
> replay.
> In order to avoid this, we can put {{cleanLogs}} before {{{}checkForLogs{}}}.
> In addition, since {{cleanLogs}} is executed before {{{}checkForLogs{}}}, 
> when the history server is starting, the expired log info may not exist in 
> the listing db, so we need to clean up these log files in {{{}cleanLogs{}}}.






[jira] [Assigned] (SPARK-41447) clean up expired event log files that don't exist in listing db

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41447:
-

Assignee: shuyouZZ

> clean up expired event log files that don't exist in listing db
> ---
>
> Key: SPARK-41447
> URL: https://issues.apache.org/jira/browse/SPARK-41447
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: shuyouZZ
>Assignee: shuyouZZ
>Priority: Major
>
> When restarting the history server, the previous logic is to execute 
> {{checkForLogs}} first, which will cause the expired event log files to be 
> parsed, and then execute {{checkAndCleanLog}} to delete parsed info, which is 
> unnecessary. In the history server log, we can see many {{INFO FsHistoryProvider: 
> Finished parsing application_xxx}} followed by {{{}INFO FsHistoryProvider: 
> Deleting expired event log for application_xxx{}}}. If there are a large 
> number of expired log files in the log directory, it will affect the speed of 
> replay.
> In order to avoid this, we can put {{cleanLogs}} before {{{}checkForLogs{}}}.
> In addition, since {{cleanLogs}} is executed before {{{}checkForLogs{}}}, 
> when the history server is starting, the expired log info may not exist in 
> the listing db, so we need to clean up these log files in {{{}cleanLogs{}}}.






[jira] [Created] (SPARK-41554) Decimal.changePrecision produces ArrayIndexOutOfBoundsException

2022-12-16 Thread Oleksiy Dyagilev (Jira)
Oleksiy Dyagilev created SPARK-41554:


 Summary: Decimal.changePrecision produces 
ArrayIndexOutOfBoundsException
 Key: SPARK-41554
 URL: https://issues.apache.org/jira/browse/SPARK-41554
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.1
Reporter: Oleksiy Dyagilev


Reducing a Decimal's scale by more than 18 produces an exception.
{code:java}
Decimal(1, 38, 19).changePrecision(38, 0){code}
{code:java}
java.lang.ArrayIndexOutOfBoundsException: 19
    at org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:377)
    at 
org.apache.spark.sql.types.Decimal.changePrecision(Decimal.scala:328){code}
Reproducing with SQL query:
{code:java}
sql("select cast(cast(cast(cast(id as decimal(38,15)) as decimal(38,30)) as 
decimal(38,37)) as decimal(38,17)) from range(3)").show{code}
The bug exists for {{Decimal}} that is stored using compact long only, it works 
fine with {{Decimal}} that uses {{scala.math.BigDecimal}} internally.
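The failure mode can be sketched without Spark. Below is an illustrative Python analog (not Spark's code) of a compact decimal: an unscaled long plus a scale, where rescaling divides by a precomputed power-of-ten table that only covers exponents 0..18:

```python
# Illustrative sketch (not Spark's code) of the failure mode: a compact
# decimal is an unscaled long plus a scale, and rescaling divides by a
# precomputed power-of-ten table that only covers exponents 0..18.
POW_10 = [10 ** i for i in range(19)]  # indices 0..18

def change_scale_compact(unscaled, scale, new_scale):
    diff = scale - new_scale
    if diff >= len(POW_10):
        # Spark 3.3.1 indexed the table at diff unconditionally here, raising
        # ArrayIndexOutOfBoundsException when the scale drops by more than 18.
        raise IndexError(f"pow-10 table has no entry {diff}")
    return unscaled // POW_10[diff]

assert change_scale_compact(12345, 3, 1) == 123  # scale drop of 2 is fine
try:
    change_scale_compact(1, 19, 0)               # scale drop of 19, as reported
except IndexError:
    pass  # reproduces the out-of-bounds analog
```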






[jira] [Resolved] (SPARK-41552) Upgrade kubernetes-client to 6.3.1

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41552.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

Issue resolved by pull request 39094
[https://github.com/apache/spark/pull/39094]

> Upgrade kubernetes-client to 6.3.1
> --
>
> Key: SPARK-41552
> URL: https://issues.apache.org/jira/browse/SPARK-41552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.4.0
>
>







[jira] [Commented] (SPARK-41553) Change num_files to repartition

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648795#comment-17648795
 ] 

Apache Spark commented on SPARK-41553:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/39098

> Change num_files to repartition
> ---
>
> Key: SPARK-41553
> URL: https://issues.apache.org/jira/browse/SPARK-41553
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Functions have this signature. 
>  
> def to_json(
> (..)
> num_files: Optional[int] = None,
>  
>  
> .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and 
> writes
> multiple `part-...` files in the directory when `path` is specified.
> This behavior was inherited from Apache Spark. The number of files can
> be controlled by `num_files`.
>  
>  
>  
> if num_files is not None:
> warnings.warn(
> "`num_files` has been deprecated and might be removed in a future version. "
> "Use `DataFrame.spark.repartition` instead.",
> FutureWarning,
> )
>  
>  
> I will change num_files to repartition






[jira] [Assigned] (SPARK-41553) Change num_files to repartition

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41553:


Assignee: Apache Spark

> Change num_files to repartition
> ---
>
> Key: SPARK-41553
> URL: https://issues.apache.org/jira/browse/SPARK-41553
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Assignee: Apache Spark
>Priority: Major
>
> Functions have this signature. 
>  
> def to_json(
> (..)
> num_files: Optional[int] = None,
>  
>  
> .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and 
> writes
> multiple `part-...` files in the directory when `path` is specified.
> This behavior was inherited from Apache Spark. The number of files can
> be controlled by `num_files`.
>  
>  
>  
> if num_files is not None:
> warnings.warn(
> "`num_files` has been deprecated and might be removed in a future version. "
> "Use `DataFrame.spark.repartition` instead.",
> FutureWarning,
> )
>  
>  
> I will change num_files to repartition






[jira] [Commented] (SPARK-41553) Change num_files to repartition

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648793#comment-17648793
 ] 

Apache Spark commented on SPARK-41553:
--

User 'bjornjorgensen' has created a pull request for this issue:
https://github.com/apache/spark/pull/39098

> Change num_files to repartition
> ---
>
> Key: SPARK-41553
> URL: https://issues.apache.org/jira/browse/SPARK-41553
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Functions have this signature. 
>  
> def to_json(
> (..)
> num_files: Optional[int] = None,
>  
>  
> .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and 
> writes
> multiple `part-...` files in the directory when `path` is specified.
> This behavior was inherited from Apache Spark. The number of files can
> be controlled by `num_files`.
>  
>  
>  
> if num_files is not None:
> warnings.warn(
> "`num_files` has been deprecated and might be removed in a future version. "
> "Use `DataFrame.spark.repartition` instead.",
> FutureWarning,
> )
>  
>  
> I will change num_files to repartition






[jira] [Assigned] (SPARK-41553) Change num_files to repartition

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41553:


Assignee: (was: Apache Spark)

> Change num_files to repartition
> ---
>
> Key: SPARK-41553
> URL: https://issues.apache.org/jira/browse/SPARK-41553
> Project: Spark
>  Issue Type: Improvement
>  Components: Pandas API on Spark
>Affects Versions: 3.4.0
>Reporter: Bjørn Jørgensen
>Priority: Major
>
> Functions have this signature. 
>  
> def to_json(
> (..)
> num_files: Optional[int] = None,
>  
>  
> .. note:: pandas-on-Spark writes JSON files into the directory, `path`, and 
> writes
> multiple `part-...` files in the directory when `path` is specified.
> This behavior was inherited from Apache Spark. The number of files can
> be controlled by `num_files`.
>  
>  
>  
> if num_files is not None:
> warnings.warn(
> "`num_files` has been deprecated and might be removed in a future version. "
> "Use `DataFrame.spark.repartition` instead.",
> FutureWarning,
> )
>  
>  
> I will change num_files to repartition






[jira] [Assigned] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41049:


Assignee: Apache Spark

> Nondeterministic expressions have unstable values if they are children of 
> CodegenFallback expressions
> -
>
> Key: SPARK-41049
> URL: https://issues.apache.org/jira/browse/SPARK-41049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Guy Boo
>Assignee: Apache Spark
>Priority: Major
>
> h2. Expectation
> For a given row, Nondeterministic expressions are expected to have stable 
> values.
> {code:scala}
> import org.apache.spark.sql.functions._
> val df = sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(1)).cast(IntegerType)
> df.select(v1, v1).collect{code}
> Returns a set like this:
> |8777|8777|
> |1357|1357|
> |3435|3435|
> |9204|9204|
> |3870|3870|
> where both columns always have the same value, but what that value is changes 
> from row to row. This is different from the following:
> {code:scala}
> df.select(rand(), rand()).collect{code}
> In this case, because the rand() calls are distinct, the values in both 
> columns should be different.
> h2. Problem
> This expectation does not appear to be stable in the event that any 
> subsequent expression is a CodegenFallback. This program:
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> val sparkSession = SparkSession.builder().getOrCreate()
> val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(1)).cast(IntegerType)
> val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
> df.select(v1, v1, v2, v2).collect {code}
> produces output like this:
> |8159|8159|8159|{color:#ff}2028{color}|
> |8320|8320|8320|{color:#ff}1640{color}|
> |7937|7937|7937|{color:#ff}769{color}|
> |436|436|436|{color:#ff}8924{color}|
> |8924|8924|2827|{color:#ff}2731{color}|
> Not sure why the first call via the CodegenFallback path should be correct 
> while subsequent calls aren't.
> h2. Workaround
> If the Nondeterministic expression is moved to a separate, earlier select() 
> call, so the CodegenFallback instead only refers to a column reference, then 
> the problem seems to go away. But this workaround may not be reliable if 
> optimization is ever able to restructure adjacent select()s.
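The expectation described above — one nondeterministic expression instance yields a stable value within a row, while distinct instances may differ — can be modeled outside Spark. This is an illustrative sketch only, not Spark internals:

```python
import random

def make_rand_expr(seed):
    """One 'expression instance': evaluated once per row, then reused."""
    rng = random.Random(seed)
    per_row = {}
    def evaluate(row_id):
        # Stable per row: compute once, cache, and reuse (the expected behavior).
        if row_id not in per_row:
            per_row[row_id] = rng.random()
        return per_row[row_id]
    return evaluate

v1 = make_rand_expr(seed=0)  # selecting v1 twice must agree within a row
v2 = make_rand_expr(seed=1)  # a distinct instance may legitimately differ
```

The reported bug corresponds to the CodegenFallback path re-evaluating the same instance instead of reusing the cached per-row value.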






[jira] [Commented] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648792#comment-17648792
 ] 

Apache Spark commented on SPARK-41049:
--

User 'NarekDW' has created a pull request for this issue:
https://github.com/apache/spark/pull/39097

> Nondeterministic expressions have unstable values if they are children of 
> CodegenFallback expressions
> -
>
> Key: SPARK-41049
> URL: https://issues.apache.org/jira/browse/SPARK-41049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Guy Boo
>Priority: Major
>
> h2. Expectation
> For a given row, Nondeterministic expressions are expected to have stable 
> values.
> {code:scala}
> import org.apache.spark.sql.functions._
> val df = sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(1)).cast(IntegerType)
> df.select(v1, v1).collect{code}
> Returns a set like this:
> |8777|8777|
> |1357|1357|
> |3435|3435|
> |9204|9204|
> |3870|3870|
> where both columns always have the same value, but what that value is changes 
> from row to row. This is different from the following:
> {code:scala}
> df.select(rand(), rand()).collect{code}
> In this case, because the rand() calls are distinct, the values in both 
> columns should be different.
> h2. Problem
> This expectation does not appear to be stable in the event that any 
> subsequent expression is a CodegenFallback. This program:
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> val sparkSession = SparkSession.builder().getOrCreate()
> val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(1)).cast(IntegerType)
> val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
> df.select(v1, v1, v2, v2).collect {code}
> produces output like this:
> |8159|8159|8159|{color:#ff}2028{color}|
> |8320|8320|8320|{color:#ff}1640{color}|
> |7937|7937|7937|{color:#ff}769{color}|
> |436|436|436|{color:#ff}8924{color}|
> |8924|8924|2827|{color:#ff}2731{color}|
> Not sure why the first call via the CodegenFallback path should be correct 
> while subsequent calls aren't.
> h2. Workaround
> If the Nondeterministic expression is moved to a separate, earlier select() 
> call, so the CodegenFallback instead only refers to a column reference, then 
> the problem seems to go away. But this workaround may not be reliable if 
> optimization is ever able to restructure adjacent select()s.






[jira] [Assigned] (SPARK-41049) Nondeterministic expressions have unstable values if they are children of CodegenFallback expressions

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41049?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41049:


Assignee: (was: Apache Spark)

> Nondeterministic expressions have unstable values if they are children of 
> CodegenFallback expressions
> -
>
> Key: SPARK-41049
> URL: https://issues.apache.org/jira/browse/SPARK-41049
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.2
>Reporter: Guy Boo
>Priority: Major
>
> h2. Expectation
> For a given row, Nondeterministic expressions are expected to have stable 
> values.
> {code:scala}
> import org.apache.spark.sql.functions._
> val df = sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(1)).cast(IntegerType)
> df.select(v1, v1).collect{code}
> Returns a set like this:
> |8777|8777|
> |1357|1357|
> |3435|3435|
> |9204|9204|
> |3870|3870|
> where both columns always have the same value, but what that value is changes 
> from row to row. This is different from the following:
> {code:scala}
> df.select(rand(), rand()).collect{code}
> In this case, because the rand() calls are distinct, the values in both 
> columns should be different.
> h2. Problem
> This expectation does not appear to be stable in the event that any 
> subsequent expression is a CodegenFallback. This program:
> {code:scala}
> import org.apache.spark.sql._
> import org.apache.spark.sql.types._
> import org.apache.spark.sql.functions._
> val sparkSession = SparkSession.builder().getOrCreate()
> val df = sparkSession.sparkContext.parallelize(1 to 5).toDF("x")
> val v1 = rand().*(lit(1)).cast(IntegerType)
> val v2 = to_csv(struct(v1.as("a"))) // to_csv is CodegenFallback
> df.select(v1, v1, v2, v2).collect {code}
> produces output like this:
> |8159|8159|8159|{color:#ff}2028{color}|
> |8320|8320|8320|{color:#ff}1640{color}|
> |7937|7937|7937|{color:#ff}769{color}|
> |436|436|436|{color:#ff}8924{color}|
> |8924|8924|2827|{color:#ff}2731{color}|
> Not sure why the first call via the CodegenFallback path should be correct 
> while subsequent calls aren't.
> h2. Workaround
> If the Nondeterministic expression is moved to a separate, earlier select() 
> call, so the CodegenFallback instead only refers to a column reference, then 
> the problem seems to go away. But this workaround may not be reliable if 
> optimization is ever able to restructure adjacent select()s.






[jira] [Created] (SPARK-41553) Change num_files to repartition

2022-12-16 Thread Jira
Bjørn Jørgensen created SPARK-41553:
---

 Summary: Change num_files to repartition
 Key: SPARK-41553
 URL: https://issues.apache.org/jira/browse/SPARK-41553
 Project: Spark
  Issue Type: Improvement
  Components: Pandas API on Spark
Affects Versions: 3.4.0
Reporter: Bjørn Jørgensen


The functions have this signature:

 
def to_json(
(..)
num_files: Optional[int] = None,
 
 
.. note:: pandas-on-Spark writes JSON files into the directory, `path`, and 
writes
multiple `part-...` files in the directory when `path` is specified.
This behavior was inherited from Apache Spark. The number of files can
be controlled by `num_files`.
 
 
 
if num_files is not None:
warnings.warn(
"`num_files` has been deprecated and might be removed in a future version. "
"Use `DataFrame.spark.repartition` instead.",
FutureWarning,
)
 
 
I will change num_files to repartition
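The deprecation path described above can be sketched as a plain function (hypothetical standalone signature; the real method is on the pandas-on-Spark DataFrame):

```python
import warnings

def to_json(path=None, num_files=None):
    """Hypothetical sketch of the num_files deprecation warning quoted above."""
    if num_files is not None:
        warnings.warn(
            "`num_files` has been deprecated and might be removed in a future "
            "version. Use `DataFrame.spark.repartition` instead.",
            FutureWarning,
        )
    # ... the real writer would repartition and emit `part-...` files here ...
```

Callers passing `num_files` keep working but see a `FutureWarning` steering them to `DataFrame.spark.repartition`.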






[jira] [Comment Edited] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications

2022-12-16 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648786#comment-17648786
 ] 

Gengliang Wang edited comment on SPARK-41053 at 12/16/22 8:30 PM:
--

[~techaddict] Thanks! Feel free to take any one of the subtasks and leave a 
comment that you are working on it

You can follow the PR for TaskDataWrapper: 
[https://github.com/apache/spark/pull/39048]


was (Author: gengliang.wang):
[~techaddict] Thanks! Feel free to take any one of the subtasks. You can 
follow the PR for TaskDataWrapper: https://github.com/apache/spark/pull/39048

> Better Spark UI scalability and Driver stability for large applications
> ---
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: Better Spark UI scalability and Driver stability for 
> large applications.pdf
>
>
> After SPARK-18085, the Spark history server(SHS) becomes more scalable for 
> processing large applications by supporting a persistent 
> KV-store(LevelDB/RocksDB) as the storage layer.
> As for the live Spark UI, all the data is still stored in memory, which can 
> bring memory pressure to the Spark driver for large applications.
> For better Spark UI scalability and Driver stability, I propose to
>  * {*}Support storing all the UI data in a persistent KV store{*}. 
> RocksDB/LevelDB provide low memory overhead. Their write/read performance is 
> fast enough to serve the write/read workload for live UI. SHS can leverage 
> the persistent KV store to speed up its startup.
>  * *Support a new Protobuf serializer for all the UI data.* The new 
> serializer is supposed to be faster, according to benchmarks. It will be the 
> default serializer for the persistent KV store of live UI. As for event logs, 
> it is optional. The current serializer for UI data is JSON. When writing 
> persistent KV-store, there is GZip compression. Since there is compression 
> support in RocksDB/LevelDB, the new serializer won’t compress the output 
> before writing to the persistent KV store. Here is a benchmark of 
> writing/reading 100,000 SQLExecutionUIData to/from RocksDB:
>  
> |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total 
> Size(MB)*|*Result total size in memory(MB)*|
> |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
> |*Protobuf*|109.9|34.3|858|2105|
> I am also proposing to support only RocksDB, rather than both LevelDB and 
> RocksDB, in the live UI.
> SPIP: 
> [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]
> SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj
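The serializer trade-off behind the benchmark table can be reproduced in miniature with the standard library, using pickle as a stand-in for Protobuf (illustrative only; this is not the Spark benchmark, and the record fields are hypothetical):

```python
import gzip
import json
import pickle

# A stand-in record roughly shaped like a UI entity (hypothetical fields).
record = {
    "executionId": 1,
    "description": "collect at <console>:1",
    "metrics": {f"metric_{i}": i for i in range(64)},
}

# The current KV serializer described above: JSON text plus gzip compression.
json_gzip = gzip.compress(json.dumps(record).encode("utf-8"))

# A binary serializer stand-in (Spark's proposal uses Protobuf instead).
binary = pickle.dumps(record)
```

Both encodings round-trip losslessly; the proposal's motivation is write/read speed and leaving compression to RocksDB/LevelDB, not fidelity.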






[jira] [Commented] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications

2022-12-16 Thread Gengliang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648786#comment-17648786
 ] 

Gengliang Wang commented on SPARK-41053:


[~techaddict] Thanks! Feel free to take any one of the subtasks. You can 
follow the PR for TaskDataWrapper: https://github.com/apache/spark/pull/39048

> Better Spark UI scalability and Driver stability for large applications
> ---
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: Better Spark UI scalability and Driver stability for 
> large applications.pdf
>
>
> After SPARK-18085, the Spark history server(SHS) becomes more scalable for 
> processing large applications by supporting a persistent 
> KV-store(LevelDB/RocksDB) as the storage layer.
> As for the live Spark UI, all the data is still stored in memory, which can 
> bring memory pressure to the Spark driver for large applications.
> For better Spark UI scalability and Driver stability, I propose to
>  * {*}Support storing all the UI data in a persistent KV store{*}. 
> RocksDB/LevelDB provide low memory overhead. Their write/read performance is 
> fast enough to serve the write/read workload for live UI. SHS can leverage 
> the persistent KV store to speed up its startup.
>  * *Support a new Protobuf serializer for all the UI data.* The new 
> serializer is supposed to be faster, according to benchmarks. It will be the 
> default serializer for the persistent KV store of live UI. As for event logs, 
> it is optional. The current serializer for UI data is JSON. When writing 
> persistent KV-store, there is GZip compression. Since there is compression 
> support in RocksDB/LevelDB, the new serializer won’t compress the output 
> before writing to the persistent KV store. Here is a benchmark of 
> writing/reading 100,000 SQLExecutionUIData to/from RocksDB:
>  
> |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total 
> Size(MB)*|*Result total size in memory(MB)*|
> |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
> |*Protobuf*|109.9|34.3|858|2105|
> I am also proposing to support only RocksDB, rather than both LevelDB and 
> RocksDB, in the live UI.
> SPIP: 
> [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]
> SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj






[jira] [Commented] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648785#comment-17648785
 ] 

Apache Spark commented on SPARK-41421:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39096

> Protobuf serializer for ApplicationEnvironmentInfoWrapper
> -
>
> Key: SPARK-41421
> URL: https://issues.apache.org/jira/browse/SPARK-41421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Commented] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648784#comment-17648784
 ] 

Apache Spark commented on SPARK-41421:
--

User 'techaddict' has created a pull request for this issue:
https://github.com/apache/spark/pull/39096

> Protobuf serializer for ApplicationEnvironmentInfoWrapper
> -
>
> Key: SPARK-41421
> URL: https://issues.apache.org/jira/browse/SPARK-41421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Assigned] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41421:


Assignee: Apache Spark

> Protobuf serializer for ApplicationEnvironmentInfoWrapper
> -
>
> Key: SPARK-41421
> URL: https://issues.apache.org/jira/browse/SPARK-41421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Assigned] (SPARK-41421) Protobuf serializer for ApplicationEnvironmentInfoWrapper

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41421:


Assignee: (was: Apache Spark)

> Protobuf serializer for ApplicationEnvironmentInfoWrapper
> -
>
> Key: SPARK-41421
> URL: https://issues.apache.org/jira/browse/SPARK-41421
> Project: Spark
>  Issue Type: Sub-task
>  Components: Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
>







[jira] [Commented] (SPARK-41053) Better Spark UI scalability and Driver stability for large applications

2022-12-16 Thread Sandeep Singh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648782#comment-17648782
 ] 

Sandeep Singh commented on SPARK-41053:
---

[~Gengliang.Wang] I'm willing to take some tasks from this list

> Better Spark UI scalability and Driver stability for large applications
> ---
>
> Key: SPARK-41053
> URL: https://issues.apache.org/jira/browse/SPARK-41053
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 3.4.0
>Reporter: Gengliang Wang
>Priority: Major
> Attachments: Better Spark UI scalability and Driver stability for 
> large applications.pdf
>
>
> After SPARK-18085, the Spark history server(SHS) becomes more scalable for 
> processing large applications by supporting a persistent 
> KV-store(LevelDB/RocksDB) as the storage layer.
> As for the live Spark UI, all the data is still stored in memory, which can 
> bring memory pressure to the Spark driver for large applications.
> For better Spark UI scalability and Driver stability, I propose to
>  * {*}Support storing all the UI data in a persistent KV store{*}. 
> RocksDB/LevelDB provide low memory overhead. Their write/read performance is 
> fast enough to serve the write/read workload for live UI. SHS can leverage 
> the persistent KV store to speed up its startup.
>  * *Support a new Protobuf serializer for all the UI data.* The new 
> serializer is supposed to be faster, according to benchmarks. It will be the 
> default serializer for the persistent KV store of live UI. As for event logs, 
> it is optional. The current serializer for UI data is JSON. When writing 
> persistent KV-store, there is GZip compression. Since there is compression 
> support in RocksDB/LevelDB, the new serializer won’t compress the output 
> before writing to the persistent KV store. Here is a benchmark of 
> writing/reading 100,000 SQLExecutionUIData to/from RocksDB:
>  
> |*Serializer*|*Avg Write time(μs)*|*Avg Read time(μs)*|*RocksDB File Total 
> Size(MB)*|*Result total size in memory(MB)*|
> |*Spark’s KV Serializer(JSON+gzip)*|352.2|119.26|837|868|
> |*Protobuf*|109.9|34.3|858|2105|
> I am also proposing to support only RocksDB, rather than both LevelDB and 
> RocksDB, in the live UI.
> SPIP: 
> [https://docs.google.com/document/d/1cuKnFwlTodyVhUQPMuakq2YDaLH05jaY9FRu_aD1zMo/edit?usp=sharing]
> SPIP vote: https://lists.apache.org/thread/lom4zcob6237q6nnj46jylkzwmmsxvgj






[jira] [Assigned] (SPARK-41365) Stages UI page fails to load for proxy in some yarn versions

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41365:
-

Assignee: miracle

> Stages UI page fails to load for proxy in some yarn versions 
> -
>
> Key: SPARK-41365
> URL: https://issues.apache.org/jira/browse/SPARK-41365
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.3.1
> Environment: as above
>Reporter: Mars
>Assignee: miracle
>Priority: Major
> Fix For: 3.3.2, 3.4.0
>
> Attachments: image-2022-12-02-17-53-03-003.png
>
>
> In my environment (CDH 5.8), after clicking through to the Spark UI from the 
> YARN interface, visiting a stage page fails to load. The URI is
> {code:java}
> http://:8088/proxy/application_1669877165233_0021/stages/stage/?id=0&attempt=0
>  {code}
> !image-2022-12-02-17-53-03-003.png|width=430,height=697!
> Server error stack trace:
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.status.api.v1.StagesResource.$anonfun$doPagination$1(StagesResource.scala:207)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142)
>   at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135)
>   at 
> org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31)
>   at 
> org.apache.spark.status.api.v1.StagesResource.doPagination(StagesResource.scala:206)
>   at 
> org.apache.spark.status.api.v1.StagesResource.$anonfun$taskTable$1(StagesResource.scala:161)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142)
>   at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135)
>   at 
> org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31)
>   at 
> org.apache.spark.status.api.v1.StagesResource.taskTable(StagesResource.scala:145)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){code}
>  
> This issue is similar to the two below, and the final symptom is the same: the 
> query parameter is encoded twice.
> https://issues.apache.org/jira/browse/SPARK-32467
> https://issues.apache.org/jira/browse/SPARK-33611
> Those two issues fixed two scenarios in which double encoding occurs:
> 1. an https redirect proxy
> 2. reverse proxy enabled (spark.ui.reverseProxy) in Nginx
> But when the parameter is encoded twice for another reason, such as the YARN 
> proxy in this issue, the page fails in the same way.
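The double-encoding failure can be demonstrated with `urllib.parse`: once a proxy re-encodes an already-encoded query string, a single decode no longer recovers the original parameters.

```python
from urllib.parse import quote, unquote

param = "id=0&attempt=0"
once = quote(param, safe="")   # what the UI link carries: id%3D0%26attempt%3D0
twice = quote(once, safe="")   # what a re-encoding proxy (e.g. YARN) forwards

assert unquote(once) == param            # one decode recovers the parameters
assert unquote(twice) != param           # after double encoding it does not
assert unquote(unquote(twice)) == param  # a second decode is required
```

A server that decodes only once then sees `id%3D0%26attempt%3D0` instead of `id=0&attempt=0`, so the expected parameters are missing and the handler fails.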






[jira] [Assigned] (SPARK-41365) Stages UI page fails to load for proxy in some yarn versions

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41365:
-

Assignee: Mars  (was: miracle)

> Stages UI page fails to load for proxy in some yarn versions 
> -
>
> Key: SPARK-41365
> URL: https://issues.apache.org/jira/browse/SPARK-41365
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.3.1
> Environment: as above
>Reporter: Mars
>Assignee: Mars
>Priority: Major
> Fix For: 3.3.2, 3.4.0
>
> Attachments: image-2022-12-02-17-53-03-003.png
>
>
> In my environment (CDH 5.8), after clicking through to the Spark UI from the 
> YARN interface, visiting a stage page fails to load. The URI is
> {code:java}
> http://:8088/proxy/application_1669877165233_0021/stages/stage/?id=0&attempt=0
>  {code}
> !image-2022-12-02-17-53-03-003.png|width=430,height=697!
> Server error stack trace:
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.status.api.v1.StagesResource.$anonfun$doPagination$1(StagesResource.scala:207)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142)
>   at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135)
>   at 
> org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31)
>   at 
> org.apache.spark.status.api.v1.StagesResource.doPagination(StagesResource.scala:206)
>   at 
> org.apache.spark.status.api.v1.StagesResource.$anonfun$taskTable$1(StagesResource.scala:161)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142)
>   at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135)
>   at 
> org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31)
>   at 
> org.apache.spark.status.api.v1.StagesResource.taskTable(StagesResource.scala:145)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){code}
>  
> This issue is similar to the two below; the final symptom is the same,
> because the parameter is encoded twice:
> https://issues.apache.org/jira/browse/SPARK-32467
> https://issues.apache.org/jira/browse/SPARK-33611
> Those two issues cover two scenarios in which double encoding is avoided:
> 1. HTTPS redirect proxy
> 2. Reverse proxy enabled (spark.ui.reverseProxy) behind Nginx
> But if the parameters are double-encoded for any other reason, such as the
> YARN proxy in this issue, the page still fails to load.
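The double-encoding symptom described above is easy to sketch outside Spark (illustrative Python, not Spark or YARN code; `parse_qs` stands in for the server-side query parsing): a server that decodes a query string exactly once cannot recover parameters that were encoded twice.

```python
from urllib.parse import quote, unquote, parse_qs

query = "id=0&attempt=0"
once = quote(query, safe="")   # 'id%3D0%26attempt%3D0'
twice = quote(once, safe="")   # '%' is itself re-encoded: 'id%253D0%2526attempt%253D0'

# The server decodes exactly once; a single-encoded query round-trips fine...
print(parse_qs(unquote(once)))   # {'id': ['0'], 'attempt': ['0']}
# ...but after double encoding, one decode still leaves '%3D'/'%26' in place
# of '='/'&', so no parameters are recovered (hence the failure when the
# stage page looks up 'id').
print(parse_qs(unquote(twice)))  # {}
```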






[jira] [Assigned] (SPARK-41365) Stages UI page fails to load for proxy in some yarn versions

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41365:
-

Assignee: (was: miracle)

> Stages UI page fails to load for proxy in some yarn versions 
> -
>
> Key: SPARK-41365
> URL: https://issues.apache.org/jira/browse/SPARK-41365
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.3.1
> Environment: as above
>Reporter: Mars
>Priority: Major
> Fix For: 3.3.2, 3.4.0
>
> Attachments: image-2022-12-02-17-53-03-003.png
>
>
> In my environment (CDH 5.8), I click through to the Spark UI from the YARN
> interface. When visiting a stage page, it fails to load. The URI is:
> {code:java}
> http://:8088/proxy/application_1669877165233_0021/stages/stage/?id=0&attempt=0
>  {code}
> !image-2022-12-02-17-53-03-003.png|width=430,height=697!
> Server error stack trace:
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.status.api.v1.StagesResource.$anonfun$doPagination$1(StagesResource.scala:207)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142)
>   at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135)
>   at 
> org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31)
>   at 
> org.apache.spark.status.api.v1.StagesResource.doPagination(StagesResource.scala:206)
>   at 
> org.apache.spark.status.api.v1.StagesResource.$anonfun$taskTable$1(StagesResource.scala:161)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142)
>   at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135)
>   at 
> org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31)
>   at 
> org.apache.spark.status.api.v1.StagesResource.taskTable(StagesResource.scala:145)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){code}
>  
> This issue is similar to the two below; the final symptom is the same,
> because the parameter is encoded twice:
> https://issues.apache.org/jira/browse/SPARK-32467
> https://issues.apache.org/jira/browse/SPARK-33611
> Those two issues cover two scenarios in which double encoding is avoided:
> 1. HTTPS redirect proxy
> 2. Reverse proxy enabled (spark.ui.reverseProxy) behind Nginx
> But if the parameters are double-encoded for any other reason, such as the
> YARN proxy in this issue, the page still fails to load.






[jira] [Assigned] (SPARK-41365) Stages UI page fails to load for proxy in some yarn versions

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41365:
-

Assignee: miracle

> Stages UI page fails to load for proxy in some yarn versions 
> -
>
> Key: SPARK-41365
> URL: https://issues.apache.org/jira/browse/SPARK-41365
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.3.1
> Environment: as above
>Reporter: Mars
>Assignee: miracle
>Priority: Major
> Fix For: 3.3.2, 3.4.0
>
> Attachments: image-2022-12-02-17-53-03-003.png
>
>
> In my environment (CDH 5.8), I click through to the Spark UI from the YARN
> interface. When visiting a stage page, it fails to load. The URI is:
> {code:java}
> http://:8088/proxy/application_1669877165233_0021/stages/stage/?id=0&attempt=0
>  {code}
> !image-2022-12-02-17-53-03-003.png|width=430,height=697!
> Server error stack trace:
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.status.api.v1.StagesResource.$anonfun$doPagination$1(StagesResource.scala:207)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142)
>   at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135)
>   at 
> org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31)
>   at 
> org.apache.spark.status.api.v1.StagesResource.doPagination(StagesResource.scala:206)
>   at 
> org.apache.spark.status.api.v1.StagesResource.$anonfun$taskTable$1(StagesResource.scala:161)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142)
>   at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135)
>   at 
> org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31)
>   at 
> org.apache.spark.status.api.v1.StagesResource.taskTable(StagesResource.scala:145)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){code}
>  
> This issue is similar to the two below; the final symptom is the same,
> because the parameter is encoded twice:
> https://issues.apache.org/jira/browse/SPARK-32467
> https://issues.apache.org/jira/browse/SPARK-33611
> Those two issues cover two scenarios in which double encoding is avoided:
> 1. HTTPS redirect proxy
> 2. Reverse proxy enabled (spark.ui.reverseProxy) behind Nginx
> But if the parameters are double-encoded for any other reason, such as the
> YARN proxy in this issue, the page still fails to load.






[jira] [Resolved] (SPARK-41365) Stages UI page fails to load for proxy in some yarn versions

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-41365.
---
   Fix Version/s: 3.3.2
  3.4.0
Target Version/s:   (was: 3.3.1)
  Resolution: Fixed

> Stages UI page fails to load for proxy in some yarn versions 
> -
>
> Key: SPARK-41365
> URL: https://issues.apache.org/jira/browse/SPARK-41365
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.3.1
> Environment: as above
>Reporter: Mars
>Priority: Major
> Fix For: 3.3.2, 3.4.0
>
> Attachments: image-2022-12-02-17-53-03-003.png
>
>
> In my environment (CDH 5.8), I click through to the Spark UI from the YARN
> interface. When visiting a stage page, it fails to load. The URI is:
> {code:java}
> http://:8088/proxy/application_1669877165233_0021/stages/stage/?id=0&attempt=0
>  {code}
> !image-2022-12-02-17-53-03-003.png|width=430,height=697!
> Server error stack trace:
> {code:java}
> Caused by: java.lang.NullPointerException
>   at 
> org.apache.spark.status.api.v1.StagesResource.$anonfun$doPagination$1(StagesResource.scala:207)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142)
>   at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135)
>   at 
> org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31)
>   at 
> org.apache.spark.status.api.v1.StagesResource.doPagination(StagesResource.scala:206)
>   at 
> org.apache.spark.status.api.v1.StagesResource.$anonfun$taskTable$1(StagesResource.scala:161)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.$anonfun$withUI$1(ApiRootResource.scala:142)
>   at org.apache.spark.ui.SparkUI.withSparkUI(SparkUI.scala:147)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI(ApiRootResource.scala:137)
>   at 
> org.apache.spark.status.api.v1.BaseAppResource.withUI$(ApiRootResource.scala:135)
>   at 
> org.apache.spark.status.api.v1.StagesResource.withUI(StagesResource.scala:31)
>   at 
> org.apache.spark.status.api.v1.StagesResource.taskTable(StagesResource.scala:145)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62){code}
>  
> This issue is similar to the two below; the final symptom is the same,
> because the parameter is encoded twice:
> https://issues.apache.org/jira/browse/SPARK-32467
> https://issues.apache.org/jira/browse/SPARK-33611
> Those two issues cover two scenarios in which double encoding is avoided:
> 1. HTTPS redirect proxy
> 2. Reverse proxy enabled (spark.ui.reverseProxy) behind Nginx
> But if the parameters are double-encoded for any other reason, such as the
> YARN proxy in this issue, the page still fails to load.






[jira] [Assigned] (SPARK-41552) Upgrade kubernetes-client to 6.3.1

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-41552:
-

Assignee: Dongjoon Hyun

> Upgrade kubernetes-client to 6.3.1
> --
>
> Key: SPARK-41552
> URL: https://issues.apache.org/jira/browse/SPARK-41552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Updated] (SPARK-38062) FallbackStorage shouldn't attempt to resolve arbitrary "remote" hostname

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-38062:
--
Parent: SPARK-41550
Issue Type: Sub-task  (was: Improvement)

> FallbackStorage shouldn't attempt to resolve arbitrary "remote" hostname
> 
>
> Key: SPARK-38062
> URL: https://issues.apache.org/jira/browse/SPARK-38062
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Erik Krogen
>Assignee: Erik Krogen
>Priority: Major
> Fix For: 3.3.0
>
>
> {{FallbackStorage}} uses a placeholder block manager ID:
> {code:scala}
> private[spark] object FallbackStorage extends Logging {
>   /** We use one block manager id as a place holder. */
>   val FALLBACK_BLOCK_MANAGER_ID: BlockManagerId = BlockManagerId("fallback", 
> "remote", 7337)
> {code}
> That second argument is normally interpreted as a hostname, but is passed as 
> the string "remote" in this case.
> {{BlockManager}} will consider this placeholder as one of the peers in some 
> cases:
> {code:language=scala|title=BlockManager.scala}
>   private[storage] def getPeers(forceFetch: Boolean): Seq[BlockManagerId] = {
> peerFetchLock.synchronized {
>   ...
>   if (cachedPeers.isEmpty &&
>   
> conf.get(config.STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).isDefined) {
> Seq(FallbackStorage.FALLBACK_BLOCK_MANAGER_ID)
>   } else {
> cachedPeers
>   }
> }
>   }
> {code}
> {{BlockManagerDecommissioner.ShuffleMigrationRunnable}} will then attempt to 
> perform an upload to this placeholder ID:
> {code:scala}
> try {
>   blocks.foreach { case (blockId, buffer) =>
> logDebug(s"Migrating sub-block ${blockId}")
> bm.blockTransferService.uploadBlockSync(
>   peer.host,
>   peer.port,
>   peer.executorId,
>   blockId,
>   buffer,
>   StorageLevel.DISK_ONLY,
>   null) // class tag, we don't need for shuffle
> logDebug(s"Migrated sub-block $blockId")
>   }
>   logInfo(s"Migrated $shuffleBlockInfo to $peer")
> } catch {
>   case e: IOException =>
> ...
> if 
> (bm.migratableResolver.getMigrationBlocks(shuffleBlockInfo).size < 
> blocks.size) {
>   logWarning(s"Skipping block $shuffleBlockInfo, block 
> deleted.")
> } else if (fallbackStorage.isDefined) {
>   fallbackStorage.foreach(_.copy(shuffleBlockInfo, bm))
> } else {
>   logError(s"Error occurred during migrating 
> $shuffleBlockInfo", e)
>   keepRunning = false
> }
> {code}
> Since "remote" is not expected to be a resolvable hostname, an 
> {{IOException}} occurs, and {{fallbackStorage}} is used. But, we shouldn't 
> try to resolve this. First off, it's completely unnecessary and strange to be 
> treating the placeholder ID as a resolvable hostname, relying on an exception 
> to realize that we need to use the {{fallbackStorage}}.
> To make matters worse, in some network environments, "remote" may be a 
> resolvable hostname, completely breaking this functionality. In the 
> particular environment that I use for running automated tests, there is a DNS 
> entry for "remote" which, when you attempt to connect to it, will hang for a 
> long period of time. This essentially hangs the executor decommission 
> process, and in the case of unit tests, breaks {{FallbackStorageSuite}} as it 
> exceeds its timeouts. I'm not sure, but it's possible this is related to 
> SPARK-35584 as well (if sometimes in the GA environment, it takes a long time 
> for the OS to decide that "remote" is not a valid hostname).
> We shouldn't attempt to treat this placeholder ID as a real hostname.
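One way to express "don't treat the placeholder as a real host" is an explicit sentinel check before any network call, instead of relying on a failed DNS lookup. The sketch below is hypothetical Python, not Spark's actual code; `migrate_block` and its `upload`/`copy_to_fallback` callbacks are stand-ins, while the ID fields mirror the Scala snippet above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockManagerId:
    executor_id: str
    host: str
    port: int

# Placeholder ID, mirroring FallbackStorage.FALLBACK_BLOCK_MANAGER_ID above.
FALLBACK_BLOCK_MANAGER_ID = BlockManagerId("fallback", "remote", 7337)

def migrate_block(peer, block_id, upload, copy_to_fallback):
    """Branch on the sentinel explicitly: no DNS lookup of 'remote',
    no IOException needed to discover that fallback storage was meant."""
    if peer == FALLBACK_BLOCK_MANAGER_ID:
        copy_to_fallback(block_id)
    else:
        upload(peer.host, peer.port, block_id)
```

With this shape, an environment where `remote` happens to resolve no longer changes behavior, because the decision is made on the ID itself rather than on name resolution.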






[jira] [Updated] (SPARK-40060) Add numberDecommissioningExecutors metric

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40060:
--
Parent: SPARK-41550
Issue Type: Sub-task  (was: Improvement)

> Add numberDecommissioningExecutors metric
> -
>
> Key: SPARK-40060
> URL: https://issues.apache.org/jira/browse/SPARK-40060
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.4.0
>
>
> The number of decommissioning executors should be exposed as a metric.






[jira] [Commented] (SPARK-40060) Add numberDecommissioningExecutors metric

2022-12-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648762#comment-17648762
 ] 

Dongjoon Hyun commented on SPARK-40060:
---

I collected this as a subtask of SPARK-41550.

> Add numberDecommissioningExecutors metric
> -
>
> Key: SPARK-40060
> URL: https://issues.apache.org/jira/browse/SPARK-40060
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.4.0
>
>
> The number of decommissioning executors should be exposed as a metric.






[jira] [Commented] (SPARK-40269) Randomize the orders of peer in BlockManagerDecommissioner

2022-12-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648761#comment-17648761
 ] 

Dongjoon Hyun commented on SPARK-40269:
---

I collected this as a subtask of SPARK-41550.

> Randomize the orders of peer in BlockManagerDecommissioner
> --
>
> Key: SPARK-40269
> URL: https://issues.apache.org/jira/browse/SPARK-40269
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.4.0
>
>
> Randomize the order of peers in BlockManagerDecommissioner to avoid migrating
> data to the same set of nodes.






[jira] [Updated] (SPARK-40269) Randomize the orders of peer in BlockManagerDecommissioner

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40269:
--
Parent: SPARK-41550
Issue Type: Sub-task  (was: Improvement)

> Randomize the orders of peer in BlockManagerDecommissioner
> --
>
> Key: SPARK-40269
> URL: https://issues.apache.org/jira/browse/SPARK-40269
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.4.0
>
>
> Randomize the order of peers in BlockManagerDecommissioner to avoid migrating
> data to the same set of nodes.






[jira] [Commented] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner

2022-12-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648760#comment-17648760
 ] 

Dongjoon Hyun commented on SPARK-40636:
---

I collected this as a subtask of SPARK-41550.

> Fix wrong remained shuffles log in BlockManagerDecommissioner
> -
>
> Key: SPARK-40636
> URL: https://issues.apache.org/jira/browse/SPARK-40636
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.3.1, 3.2.3, 3.4.0
>
>
> BlockManagerDecommissioner should log the correct number of remaining shuffles.
> {code:java}
> 4 of 24 local shuffles are added. In total, 24 shuffles are remained.
> 2022-09-30 17:42:15.035 PDT
> 0 of 24 local shuffles are added. In total, 24 shuffles are remained.
> 2022-09-30 17:42:45.069 PDT
> 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code}
>  






[jira] [Commented] (SPARK-40481) Ignore stage fetch failure caused by decommissioned executor

2022-12-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648759#comment-17648759
 ] 

Dongjoon Hyun commented on SPARK-40481:
---

I collected this as a subtask of SPARK-41550.

> Ignore stage fetch failure caused by decommissioned executor
> 
>
> Key: SPARK-40481
> URL: https://issues.apache.org/jira/browse/SPARK-40481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.4.0
>
>
> When executor decommission is enabled, there can be many stage failures
> caused by FetchFailed from decommissioned executors, eventually causing the
> whole job to fail. It would be better not to count such failures against
> `spark.stage.maxConsecutiveAttempts`.






[jira] [Updated] (SPARK-40481) Ignore stage fetch failure caused by decommissioned executor

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40481:
--
Parent: SPARK-41550
Issue Type: Sub-task  (was: Improvement)

> Ignore stage fetch failure caused by decommissioned executor
> 
>
> Key: SPARK-40481
> URL: https://issues.apache.org/jira/browse/SPARK-40481
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.4.0
>
>
> When executor decommission is enabled, there can be many stage failures
> caused by FetchFailed from decommissioned executors, eventually causing the
> whole job to fail. It would be better not to count such failures against
> `spark.stage.maxConsecutiveAttempts`.






[jira] [Updated] (SPARK-40636) Fix wrong remained shuffles log in BlockManagerDecommissioner

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40636:
--
Parent: SPARK-41550
Issue Type: Sub-task  (was: Bug)

> Fix wrong remained shuffles log in BlockManagerDecommissioner
> -
>
> Key: SPARK-40636
> URL: https://issues.apache.org/jira/browse/SPARK-40636
> Project: Spark
>  Issue Type: Sub-task
>  Components: Block Manager
>Affects Versions: 3.3.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Minor
> Fix For: 3.3.1, 3.2.3, 3.4.0
>
>
> BlockManagerDecommissioner should log the correct number of remaining shuffles.
> {code:java}
> 4 of 24 local shuffles are added. In total, 24 shuffles are remained.
> 2022-09-30 17:42:15.035 PDT
> 0 of 24 local shuffles are added. In total, 24 shuffles are remained.
> 2022-09-30 17:42:45.069 PDT
> 0 of 24 local shuffles are added. In total, 24 shuffles are remained.{code}
>  






[jira] [Commented] (SPARK-40596) Populate ExecutorDecommission with more informative messages

2022-12-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648758#comment-17648758
 ] 

Dongjoon Hyun commented on SPARK-40596:
---

I collected this as a subtask of SPARK-41550

> Populate ExecutorDecommission with more informative messages
> 
>
> Key: SPARK-40596
> URL: https://issues.apache.org/jira/browse/SPARK-40596
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
>
> Currently the message in {{ExecutorDecommission}} is a fixed value 
> {{{}"Executor decommission."{}}}, and it is the same for all cases, including 
> spot instance interruptions and auto-scaling down. We should put a detailed 
> message in {{ExecutorDecommission}} to better differentiate those cases.






[jira] [Commented] (SPARK-40979) Keep removed executor info in decommission state

2022-12-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648757#comment-17648757
 ] 

Dongjoon Hyun commented on SPARK-40979:
---

I collected this as a subtask of SPARK-41550

> Keep removed executor info in decommission state
> 
>
> Key: SPARK-40979
> URL: https://issues.apache.org/jira/browse/SPARK-40979
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Major
> Fix For: 3.4.0
>
>
> Executors removed due to decommission should be kept in a separate set. To
> avoid OOM, the set size will be limited to 1K or 10K entries.
> FetchFailed caused by a decommissioned executor falls into 2 categories:
>  # When the FetchFailed reaches the DAGScheduler, the executor is still
> alive, or is lost but the loss info hasn't reached TaskSchedulerImpl yet.
> This is already handled in SPARK-40979.
>  # The FetchFailed is caused by the loss of the decommissioned executor, so
> the decommission info has already been removed from TaskSchedulerImpl.
> Keeping such info for a short period is good enough: even with the removed
> set capped at 10K executors, memory usage is at most about 10MB. In
> practice, clusters of over 10K executors are rare, and the chance that all
> of those executors are decommissioned and lost at the same time is small.
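The size cap described above (remember decommissioned-executor IDs, but bound the set at 1K or 10K entries) amounts to an insertion-ordered set that evicts its oldest entries. A hypothetical Python sketch, not Spark's implementation:

```python
from collections import OrderedDict

class BoundedSet:
    """Set that remembers at most max_size items, evicting the oldest
    first, so driver-side bookkeeping cannot grow without bound."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self._items = OrderedDict()

    def add(self, item) -> None:
        self._items[item] = None
        self._items.move_to_end(item)          # re-adding refreshes recency
        while len(self._items) > self.max_size:
            self._items.popitem(last=False)    # drop the oldest entry

    def __contains__(self, item) -> bool:
        return item in self._items

    def __len__(self) -> int:
        return len(self._items)
```

With roughly 1KB of bookkeeping per executor, a 10K cap matches the ~10MB bound estimated above.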






[jira] [Updated] (SPARK-40596) Populate ExecutorDecommission with more informative messages

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40596?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40596:
--
Parent: SPARK-41550
Issue Type: Sub-task  (was: Improvement)

> Populate ExecutorDecommission with more informative messages
> 
>
> Key: SPARK-40596
> URL: https://issues.apache.org/jira/browse/SPARK-40596
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Bo Zhang
>Assignee: Bo Zhang
>Priority: Major
>
> Currently the message in {{ExecutorDecommission}} is a fixed value 
> {{{}"Executor decommission."{}}}, and it is the same for all cases, including 
> spot instance interruptions and auto-scaling down. We should put a detailed 
> message in {{ExecutorDecommission}} to better differentiate those cases.






[jira] [Updated] (SPARK-40979) Keep removed executor info in decommission state

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40979:
--
Parent: SPARK-41550
Issue Type: Sub-task  (was: Improvement)

> Keep removed executor info in decommission state
> 
>
> Key: SPARK-40979
> URL: https://issues.apache.org/jira/browse/SPARK-40979
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 3.4.0
>Reporter: Zhongwei Zhu
>Assignee: Zhongwei Zhu
>Priority: Major
> Fix For: 3.4.0
>
>
> Executors removed due to decommission should be kept in a separate set. To
> avoid OOM, the set size will be limited to 1K or 10K entries.
> FetchFailed caused by a decommissioned executor falls into 2 categories:
>  # When the FetchFailed reaches the DAGScheduler, the executor is still
> alive, or is lost but the loss info hasn't reached TaskSchedulerImpl yet.
> This is already handled in SPARK-40979.
>  # The FetchFailed is caused by the loss of the decommissioned executor, so
> the decommission info has already been removed from TaskSchedulerImpl.
> Keeping such info for a short period is good enough: even with the removed
> set capped at 10K executors, memory usage is at most about 10MB. In
> practice, clusters of over 10K executors are rare, and the chance that all
> of those executors are decommissioned and lost at the same time is small.






[jira] [Commented] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s

2022-12-16 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-40379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648756#comment-17648756
 ] 

Dongjoon Hyun commented on SPARK-40379:
---

Hi, [~holden]. We want to go GA with `Dynamic Allocation on K8s`. I collected 
this individual task there as a subtask because it is a good fit. Please let 
me know if you would like to move it somewhere else.

> Propagate decommission executor loss reason during onDisconnect in K8s
> --
>
> Key: SPARK-40379
> URL: https://issues.apache.org/jira/browse/SPARK-40379
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
> Fix For: 3.4.0
>
>
> Currently, if an executor has been sent a decommission message and then
> disconnects from the scheduler, we only disable the executor and depend on
> the K8s status events to drive the rest of the state transitions. However,
> the K8s status events can be overwhelmed on large clusters, so we should
> check whether an executor is in a decommissioning state when it disconnects
> and use that reason instead of waiting on the K8s status events, giving us
> more accurate logging information.
>  






[jira] [Updated] (SPARK-40379) Propagate decommission executor loss reason during onDisconnect in K8s

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-40379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-40379:
--
Parent: SPARK-41550
Issue Type: Sub-task  (was: Improvement)

> Propagate decommission executor loss reason during onDisconnect in K8s
> --
>
> Key: SPARK-40379
> URL: https://issues.apache.org/jira/browse/SPARK-40379
> Project: Spark
>  Issue Type: Sub-task
>  Components: Kubernetes, Spark Core
>Affects Versions: 3.4.0
>Reporter: Holden Karau
>Assignee: Holden Karau
>Priority: Minor
> Fix For: 3.4.0
>
>
> Currently, if an executor has been sent a decommission message and then 
> disconnects from the scheduler, we only disable the executor and rely on the 
> K8s status events to drive the rest of the state transitions. However, the 
> K8s status events can become overwhelmed on large clusters, so we should check 
> whether an executor is in a decommissioning state when it disconnects and use 
> that reason instead of waiting on the K8s status events, so that we have more 
> accurate logging information.
>  






[jira] [Assigned] (SPARK-41552) Upgrade kubernetes-client to 6.3.1

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41552:


Assignee: (was: Apache Spark)

> Upgrade kubernetes-client to 6.3.1
> --
>
> Key: SPARK-41552
> URL: https://issues.apache.org/jira/browse/SPARK-41552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Commented] (SPARK-41552) Upgrade kubernetes-client to 6.3.1

2022-12-16 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17648752#comment-17648752
 ] 

Apache Spark commented on SPARK-41552:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/39094

> Upgrade kubernetes-client to 6.3.1
> --
>
> Key: SPARK-41552
> URL: https://issues.apache.org/jira/browse/SPARK-41552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>







[jira] [Assigned] (SPARK-41552) Upgrade kubernetes-client to 6.3.1

2022-12-16 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-41552:


Assignee: Apache Spark

> Upgrade kubernetes-client to 6.3.1
> --
>
> Key: SPARK-41552
> URL: https://issues.apache.org/jira/browse/SPARK-41552
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, Kubernetes
>Affects Versions: 3.4.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>







[jira] [Created] (SPARK-41552) Upgrade kubernetes-client to 6.3.1

2022-12-16 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-41552:
-

 Summary: Upgrade kubernetes-client to 6.3.1
 Key: SPARK-41552
 URL: https://issues.apache.org/jira/browse/SPARK-41552
 Project: Spark
  Issue Type: Improvement
  Components: Build, Kubernetes
Affects Versions: 3.4.0
Reporter: Dongjoon Hyun









[jira] [Updated] (SPARK-39322) Remove `Experimental` from `spark.dynamicAllocation.shuffleTracking.enabled`

2022-12-16 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-39322:
--
Parent: SPARK-41550
Issue Type: Sub-task  (was: Documentation)

> Remove `Experimental` from `spark.dynamicAllocation.shuffleTracking.enabled`
> 
>
> Key: SPARK-39322
> URL: https://issues.apache.org/jira/browse/SPARK-39322
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 3.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.3.0
>
>






