[jira] [Commented] (SPARK-44881) Executor stucked on retrying to fetch shuffle data when `java.lang.OutOfMemoryError. unable to create native thread` exception occurred.

2023-08-19 Thread Ignite TC Bot (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756445#comment-17756445
 ] 

Ignite TC Bot commented on SPARK-44881:
---

User 'hgs19921112' has created a pull request for this issue:
https://github.com/apache/spark/pull/42572

> Executor stucked on retrying to fetch shuffle data when 
> `java.lang.OutOfMemoryError. unable to create native thread` exception 
> occurred.
> 
>
> Key: SPARK-44881
> URL: https://issues.apache.org/jira/browse/SPARK-44881
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: hgs
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44881) Executor stucked on retrying to fetch shuffle data when `java.lang.OutOfMemoryError. unable to create native thread` exception occurred.

2023-08-19 Thread hgs (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

hgs updated SPARK-44881:

Summary: Executor stucked on retrying to fetch shuffle data when 
`java.lang.OutOfMemoryError. unable to create native thread` exception 
occurred.  (was: Executor stucked on retry to fetch shuffle data when 
`java.lang.OutOfMemoryError. unable to create native thread` exception 
occurred.)

> Executor stucked on retrying to fetch shuffle data when 
> `java.lang.OutOfMemoryError. unable to create native thread` exception 
> occurred.
> 
>
> Key: SPARK-44881
> URL: https://issues.apache.org/jira/browse/SPARK-44881
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: hgs
>Priority: Minor
>







[jira] [Created] (SPARK-44881) Executor stucked on retry to fetch shuffle data when `java.lang.OutOfMemoryError. unable to create native thread` exception occurred.

2023-08-19 Thread hgs (Jira)
hgs created SPARK-44881:
---

 Summary: Executor stucked on retry to fetch shuffle data when 
`java.lang.OutOfMemoryError. unable to create native thread` exception occurred.
 Key: SPARK-44881
 URL: https://issues.apache.org/jira/browse/SPARK-44881
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: hgs









[jira] [Assigned] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info

2023-08-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang reassigned SPARK-44880:
---

Assignee: Kent Yao

> Remove unnecessary curly braces at the end of the thread locks info
> ---
>
> Key: SPARK-44880
> URL: https://issues.apache.org/jira/browse/SPARK-44880
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Remove unnecessary curly braces at the end of the thread locks info






[jira] [Resolved] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info

2023-08-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang resolved SPARK-44880.
-
Fix Version/s: 3.5.0
   4.0.0
   Resolution: Fixed

Issue resolved by pull request 42571
[https://github.com/apache/spark/pull/42571]

> Remove unnecessary curly braces at the end of the thread locks info
> ---
>
> Key: SPARK-44880
> URL: https://issues.apache.org/jira/browse/SPARK-44880
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Kent Yao
>Priority: Major
> Fix For: 3.5.0, 4.0.0
>
>
> Remove unnecessary curly braces at the end of the thread locks info






[jira] [Updated] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info

2023-08-19 Thread Yuming Wang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuming Wang updated SPARK-44880:

Fix Version/s: 3.5.1
   (was: 3.5.0)

> Remove unnecessary curly braces at the end of the thread locks info
> ---
>
> Key: SPARK-44880
> URL: https://issues.apache.org/jira/browse/SPARK-44880
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 3.3.2, 3.4.1, 3.5.0, 4.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 4.0.0, 3.5.1
>
>
> Remove unnecessary curly braces at the end of the thread locks info






[jira] [Resolved] (SPARK-44874) Handle unrecognizable exceptions

2023-08-19 Thread Yihong He (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yihong He resolved SPARK-44874.
---
Resolution: Duplicate

> Handle unrecognizable exceptions
> 
>
> Key: SPARK-44874
> URL: https://issues.apache.org/jira/browse/SPARK-44874
> Project: Spark
>  Issue Type: Improvement
>  Components: Connect
>Affects Versions: 3.5.0
>Reporter: Yihong He
>Priority: Major
>







[jira] [Assigned] (SPARK-44862) Adding metric tracking number of jobs running on a cluster

2023-08-19 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44862?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao reassigned SPARK-44862:


Assignee: Kent Yao

> Adding metric tracking number of jobs running on a cluster
> --
>
> Key: SPARK-44862
> URL: https://issues.apache.org/jira/browse/SPARK-44862
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 4.0.0
>Reporter: Ruifeng Zheng
>Assignee: Kent Yao
>Priority: Major
>
> We frequently come across issues wherein a considerable number of 
> jobs/notebooks are running on one interactive cluster. This leads either to 
> degraded performance because the cluster is overloaded, or to driver OOM. 
> It would be helpful to have a metric identifying the number of notebooks/jobs 
> running on a cluster, as this would be an important data point for customers 
> balancing out their workloads.
> We propose to add a metric that keeps track of the number of jobs running on a 
> cluster.
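One way to picture the proposed metric is a gauge that counts jobs which have started but not yet ended. The Python sketch below is a hypothetical, language-neutral model: `RunningJobsGauge` and its methods are invented names, not a Spark API. In Spark itself such a count would presumably be driven by `SparkListener`'s `onJobStart` and `onJobEnd` callbacks, but that wiring is an assumption here, not something stated in the issue.

```python
import threading
from typing import Set


class RunningJobsGauge:
    """Hypothetical gauge: the number of jobs that have started but not ended."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # start/end events may arrive from several threads
        self._running: Set[int] = set()

    def on_job_start(self, job_id: int) -> None:
        with self._lock:
            self._running.add(job_id)

    def on_job_end(self, job_id: int) -> None:
        with self._lock:
            # discard() tolerates an end event for a job that was never recorded
            self._running.discard(job_id)

    def value(self) -> int:
        with self._lock:
            return len(self._running)


gauge = RunningJobsGauge()
for job_id in (1, 2, 3):
    gauge.on_job_start(job_id)
gauge.on_job_end(2)
print(gauge.value())  # 2
```

Tracking job IDs in a set, rather than a bare counter, keeps the value from drifting when an end event is duplicated or arrives for a job the gauge never saw start.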






[jira] [Created] (SPARK-44880) Remove unnecessary curly braces at the end of the thread locks info

2023-08-19 Thread Kent Yao (Jira)
Kent Yao created SPARK-44880:


 Summary: Remove unnecessary curly braces at the end of the thread 
locks info
 Key: SPARK-44880
 URL: https://issues.apache.org/jira/browse/SPARK-44880
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 3.4.1, 3.3.2, 3.5.0, 4.0.0
Reporter: Kent Yao


Remove unnecessary curly braces at the end of the thread locks info






[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2023-08-19 Thread Adam Binford (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17756323#comment-17756323
 ] 

Adam Binford commented on SPARK-22876:
--

Attempting to finally add support for this: 
https://github.com/apache/spark/pull/42570

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>  Labels: bulk-closed
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to keep a long-running 
> application from stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times (n is the minimum of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> For example, the following setup allows the application master to fail 20 
> times:
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> After checking the source code, I found that the source file 
> ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> has a ShutdownHook that checks the attempt ID against maxAppAttempts; 
> if attempt ID >= maxAppAttempts, it tries to unregister the application, 
> and the application finishes.
> Is this expected design or a bug?
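The reporter's observation can be reduced to a small model that contrasts the check described in the issue (the attempt ID alone is compared with the limit) with what a validity interval is expected to do (count only recent failures). This is a simplified Python sketch, not Spark's or YARN's actual code; the function names are hypothetical, and the limit is taken as the minimum of the two settings, as the report describes.

```python
from typing import Sequence


def effective_max_attempts(spark_max_app_attempts: int, client_yarn_max_attempts: int) -> int:
    """Limit as described in the report: the smaller of the two settings."""
    return min(spark_max_app_attempts, client_yarn_max_attempts)


def stops_by_attempt_id(attempt_id: int, max_attempts: int) -> bool:
    """Models the reported ShutdownHook check: only the attempt ID matters."""
    return attempt_id >= max_attempts


def stops_by_windowed_failures(
    failure_times: Sequence[float], now: float, validity_interval: float, max_attempts: int
) -> bool:
    """Hypothetical check honoring a validity interval: only failures that
    happened less than `validity_interval` seconds before `now` are counted."""
    recent = [t for t in failure_times if now - t < validity_interval]
    return len(recent) >= max_attempts


limit = effective_max_attempts(20, 20)
failures = [10.0 * i for i in range(20)]  # 20 failures, 10 s apart (t = 0 .. 190)
print(stops_by_attempt_id(attempt_id=20, max_attempts=limit))  # True
print(stops_by_windowed_failures(failures, now=190.0, validity_interval=1.0, max_attempts=limit))  # False
```

With 20 failures spaced 10 s apart and a 1 s validity interval, the ID-based check stops at attempt 20, while the windowed check sees only one recent failure and would keep the application running.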






[jira] [Reopened] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

2023-08-19 Thread Adam Binford (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam Binford reopened SPARK-22876:
--

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> -
>
> Key: SPARK-22876
> URL: https://issues.apache.org/jira/browse/SPARK-22876
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 2.2.0
> Environment: hadoop version 2.7.3
>Reporter: Jinhan Zhong
>Priority: Minor
>  Labels: bulk-closed
>
> I assume we can use spark.yarn.maxAppAttempts together with 
> spark.yarn.am.attemptFailuresValidityInterval to keep a long-running 
> application from stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n 
> times (n is the minimum of spark.yarn.maxAppAttempts and 
> yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
> For example, the following setup allows the application master to fail 20 
> times:
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> After checking the source code, I found that the source file 
> ApplicationMaster.scala 
> https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293
> has a ShutdownHook that checks the attempt ID against maxAppAttempts; 
> if attempt ID >= maxAppAttempts, it tries to unregister the application, 
> and the application finishes.
> Is this expected design or a bug?


