[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-11-23 Thread Yongjun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448271#comment-17448271
 ] 

Yongjun Zhang edited comment on SPARK-31646 at 11/23/21, 10:47 PM:
---

Thanks [~mauzhang]  What I observed is most of the time 
numRegisteredConenctions is smaller than numActiveConnections.

For example, in one case the gap can be as large as beyond 3k. while 
numRegisteredConnections range between 0 - 2.5k, and numActiveConnections range 
between 0 - 3.6k.


was (Author: yzhangal):
Thanks [~mauzhang]  What I observed is most of the time 
numRegisteredConenctions is smaller than numActiveConnections.

For example, in one case the gap can be as large as beyond 3k. while 
numRegisteredConnections range between 0 - 2.5k, and 
numActiveRegisteredConnections range between 0 - 3.6k.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-11-15 Thread Yongjun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444173#comment-17444173
 ] 

Yongjun Zhang edited comment on SPARK-31646 at 11/15/21, 11:17 PM:
---

HI [~mauzhang] , wonder if you have been monitoring the metrics 
activeConnections and registeredConnections, somehow I observed 
registeredConnections is smaller than activeConnections, I thought it should be 
the opposite. I also asked here:

https://issues.apache.org/jira/browse/SPARK-25642?focusedCommentId=17442924&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17442924

Thanks.


was (Author: yzhangal):
HI [~mauzhang] , wonder if you have been monitoring the metrics 
activeConnections and registeredConnections, somehow I observed 
registeredConnections is smaller than activeConenctions, I thought it should be 
the opposite. I also asked here:

https://issues.apache.org/jira/browse/SPARK-25642?focusedCommentId=17442924&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17442924

Thanks.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-21 Thread Yongjun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17418366#comment-17418366
 ] 

Yongjun Zhang edited comment on SPARK-31646 at 9/22/21, 1:04 AM:
-

HI [~mauzhang],

Thanks a lot for your answers and sorry for late reply. I think I understand it 
better now why you are doing this change: the registeredConnections metrics 
added in ExternalShuffleBlockHandler was not used. 

However, the one added to TransportContext is used, see in 
YarnShuffleService.java:
{code:java}
 // register metrics on the block handler into the Node Manager's metrics 
system.
  blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections",
  shuffleServer.getRegisteredConnections());
  YarnShuffleServiceMetrics serviceMetrics =
  new YarnShuffleServiceMetrics(blockHandler.getAllMetrics());
  MetricsSystemImpl metricsSystem = (MetricsSystemImpl) 
DefaultMetricsSystem.instance();
  metricsSystem.register(
  "sparkShuffleService", "Metrics on the Spark Shuffle Service", 
serviceMetrics);
  logger.info("Registered metrics with Hadoop's DefaultMetricsSystem");
 {code}
The TransportContext version of registeredConnections is retrieved by 
"shuffleServer.getRegisteredConnections())" in the above code. That means both 
the activeConnections and registeredConnections are still available with your 
change. Is that your expectation?

If my understanding is correct, we can either derive "registeredConnections - 
activeConnections" as the backlogged connections, or we can add a new metrics 
as backloggedConnection to have the value of "registeredConnections - 
activeConnections" .

What do you think?

Thanks!


was (Author: yzhangal):
HI [~mauzhang],

Thanks a lot for your answers and sorry for late reply. I think I understand it 
better now why you are doing this change: the registeredConnections metrics 
added in ExternalShuffleBlockHandler was not used. 

However, the one added to TransportContext is used, see in 
YarnShuffleService.java:
{code:java}
 // register metrics on the block handler into the Node Manager's metrics 
system.
  blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections",
  shuffleServer.getRegisteredConnections());
  YarnShuffleServiceMetrics serviceMetrics =
  new YarnShuffleServiceMetrics(blockHandler.getAllMetrics());  
MetricsSystemImpl metricsSystem = (MetricsSystemImpl) 
DefaultMetricsSystem.instance();
  metricsSystem.register(
  "sparkShuffleService", "Metrics on the Spark Shuffle Service", 
serviceMetrics);
  logger.info("Registered metrics with Hadoop's DefaultMetricsSystem");
 {code}
The TransportContext version of registeredConnections is retrieved by 
"shuffleServer.getRegisteredConnections())" in the above code. That means both 
the activeConnections and registeredConnections are still available with your 
change. Is that your expectation?

If my understanding is correct, we can either derive "registeredConnections - 
activeConnections" as the backlogged connections, or we can add a new metrics 
as backloggedConnection to have the value of "registeredConnections - 
activeConnections" .

What do you think?

Thanks!

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-17 Thread Manu Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416656#comment-17416656
 ] 

Manu Zhang edited comment on SPARK-31646 at 9/17/21, 12:40 PM:
---

[~yzhangal],

Please check this comment  
[https://github.com/apache/spark/pull/28416#discussion_r418357988] for more 
background.

The counter reverted in this PR was just never used, or this PR was simply to 
remove some dead codes.

I didn't meant to use registeredConnections for anything different. It's 
eventually registered into ShuffleMetrics here.

[https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248]
{code:java}
  blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", 
 shuffleServer.getRegisteredConnections()); {code}
 

As I understand it, registeredConnections (and IdleConnections) is monitored at 
channel level (TransportChannelHandler) while activeConnections 
(blockTransferRateBytes, etc) at RPC level (ExternalShuffleBlockHandler). 
Hence, these metrics are kept in two places. 

You may register your backloggedConnections in ShuffleMetrics and update it 
with "registeredConenctions - activeConnections" in ShuffleMetrics#getMetrics.

 

Your understanding of executors registering with Shuffle Service is correct but 
I don't see how it's related to your question.


was (Author: mauzhang):
[~yzhangal],

Please check this comment  
[https://github.com/apache/spark/pull/28416#discussion_r418357988] for more 
background.

The counter reverted in this PR was just never used, or this PR was simply to 
remove some dead codes.

I didn't meant to use registeredConnections for anything different. It's 
eventually registered into ShuffleMetrics here.

[https://github.com/apache/spark/blob/master/common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java#L248]
{code:java}
  blockHandler.getAllMetrics().getMetrics().put("numRegisteredConnections", 
 shuffleServer.getRegisteredConnections()); {code}
 

As I understand it, registeredConnections (and IdleConnections) is monitored at 
channel level (TransportChannelHandler) while activeConnections 
(blockTransferRateBytes, etc) at RPC level (ExternalShuffleBlockHandler). 
Hence, these metrics are kept in two places. 

You may register your backloggedConnections in ShuffleMetrics and update it 
with "registeredConenctions - activeConnections" in 

ShuffleMetrics#getMetrics.

 

Your understanding of executors registering with Shuffle Service is correct but 
I don't see how it's related to your question.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-16 Thread Yongjun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416225#comment-17416225
 ] 

Yongjun Zhang edited comment on SPARK-31646 at 9/16/21, 4:36 PM:
-

Thanks a lot for your quick response [~mauzhang]. Sorry my question was a bit 
not clear earlier.

When you made the comment

{quote}
It's {{registeredConnections}} counter created in {{TransportContext}} that's 
really counting the numbers and it's misleading for people who want to add new 
metrics like {{registeredConnections}}.
{quote}

, you meant to use registeredConnections to mean something different than what 
you reverted with this jira.  Would you please explain
# what's your interpretation of the reverted version of registeredConnections 
besides "counting numbers"? is it not useful at all?
# what your definition is for "new metrics like {{registeredConnections}}."?  
is it not counting?  if it's counting, why the reverted implementation doesn't 
serve the purpose?  

BTW, my understanding is that 1, all executors running on a given host would 
register with the remote shuffle service on the same host, 2, executors only 
register with shuffle service on the same host where the executors are running 
at, but not shuffle service running on other hosts. Is this understanding 
correct?

Thanks.


was (Author: yzhangal):
Thanks a lot for your quick response [~mauzhang]. Sorry my question was a bit 
not clear earlier.

When you made the comment

{quote}
It's {{registeredConnections}} counter created in {{TransportContext}} that's 
really counting the numbers and it's misleading for people who want to add new 
metrics like {{registeredConnections}}.
{quote}

, you meant to use registeredConnections to mean something different than what 
you reverted with this jira.  Would you please explain
1. what's your interpretation of the reverted version of registeredConnections 
besides "counting numbers"? is it not useful at all?
2. what your definition is for "new metrics like {{registeredConnections}}."?  
is it not counting?  if it's counting, why the reverted implementation doesn't 
serve the purpose?  

BTW, my understanding is that 1, all executors running on a given host would 
register with the remote shuffle service on the same host, 2, executors only 
register with shuffle service on the same host where the executors are running 
at, but not shuffle service running on other hosts. Is this understanding 
correct?

Thanks.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-16 Thread Yongjun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17416225#comment-17416225
 ] 

Yongjun Zhang edited comment on SPARK-31646 at 9/16/21, 4:35 PM:
-

Thanks a lot for your quick response [~mauzhang]. Sorry my question was a bit 
not clear earlier.

When you made the comment

{quote}
It's {{registeredConnections}} counter created in {{TransportContext}} that's 
really counting the numbers and it's misleading for people who want to add new 
metrics like {{registeredConnections}}.
{quote}

, you meant to use registeredConnections to mean something different than what 
you reverted with this jira.  Would you please explain
1. what's your interpretation of the reverted version of registeredConnections 
besides "counting numbers"? is it not useful at all?
2. what your definition is for "new metrics like {{registeredConnections}}."?  
is it not counting?  if it's counting, why the reverted implementation doesn't 
serve the purpose?  

BTW, my understanding is that 1, all executors running on a given host would 
register with the remote shuffle service on the same host, 2, executors only 
register with shuffle service on the same host where the executors are running 
at, but not shuffle service running on other hosts. Is this understanding 
correct?

Thanks.


was (Author: yzhangal):
Thanks a lot for your quick response [~mauzhang]. Sorry my question was a bit 
not clear earlier.

When you made the comment

{quote}
It's {{registeredConnections}} counter created in {{TransportContext}} that's 
really counting the numbers and it's misleading for people who want to add new 
metrics like {{registeredConnections}}.
{quote}

, you meant to use registeredConnections to mean something different than what 
you reverted with this jira.  Would you please explain
1. what's your interpretation of the reverted version?
2. what your definition is for "new metrics like {{registeredConnections}}."?  
is it not counting?  if it's counting, why the reverted implementation doesn't 
serve the purpose?  

BTW, my understanding is that 1, all executors running on a given host would 
register with the remote shuffle service on the same host, 2, executors only 
register with shuffle service on the same host where the executors are running 
at, but not shuffle service running on other hosts. Is this understanding 
correct?

Thanks.

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-15 Thread Yongjun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415740#comment-17415740
 ] 

Yongjun Zhang edited comment on SPARK-31646 at 9/15/21, 8:27 PM:
-

HI [~mauzhang], [~dongjoon], [~marek.simunek],

Thanks for your work here. SPARK-18364 (Thanks to [~marek.simunek]) intended to 
create the registeredConnections metrics but reverted by this Jira here with 
comment:
{quote}It's {{registeredConnections}} counter created in {{TransportContext}} 
that's really counting the numbers and it's misleading for people who want to 
add new metrics like {{registeredConnections}}.
{quote}
Could you please elaborate why the original one won't work? How would you all 
define {{registeredConnections?}}

I'm looking into adding a metrics to report the backlogged connections to 
shuffle service. If feels "registeredConenctions - activeConnections" would be 
the backlogged connections.

Thanks.


was (Author: yzhangal):
HI [~mauzhang], [~dongjoon], [~marek.simunek],

Thanks for your work here. SPARK-18364 (Thanks to [~marek.simunek]) intended to 
create the registeredConnections metrics but reverted by this Jira here with 
comment:
{quote}It's {{registeredConnections}} counter created in {{TransportContext}} 
that's really counting the numbers and it's misleading for people who want to 
add new metrics like {{registeredConnections}}.
{quote}
Could you please elaborate why the original one won't work? How would you all 
define {{registeredConnections?}}

I'm looking into adding a metrics to report the backlogged connections to 
shuffle service. If feels "registeredConenctions - activeConnections" would be 
the backlogged connections.

Thanks.

 

 

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-31646) Remove unused registeredConnections counter from ShuffleMetrics

2021-09-15 Thread Yongjun Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415740#comment-17415740
 ] 

Yongjun Zhang edited comment on SPARK-31646 at 9/15/21, 8:27 PM:
-

HI [~mauzhang], [~dongjoon], [~marek.simunek],

Thanks for your work here. SPARK-18364 (Thanks to [~marek.simunek]) intended to 
create the registeredConnections metrics but reverted by this Jira here with 
comment:
{quote}It's {{registeredConnections}} counter created in {{TransportContext}} 
that's really counting the numbers and it's misleading for people who want to 
add new metrics like {{registeredConnections}}.
{quote}
Could you please elaborate why the original one won't work? How would you all 
define {{registeredConnections?}}

I'm looking into adding a metrics to report the backlogged connections to 
shuffle service. If feels "registeredConenctions - activeConnections" would be 
the backlogged connections.

Thanks.

 

 


was (Author: yzhangal):
HI [~mauzhang], [~dongjoon], [~marek.simunek],

Thanks for your work here. SPARK-18364 (Thanks to [~marek.simunek]) intended to 
create the registeredConnections metrics but reverted by this Jira here with 
comment:

{quote}

It's {{registeredConnections}} counter created in {{TransportContext}} that's 
really counting the numbers and it's misleading for people who want to add new 
metrics like {{registeredConnections}}.

{quote}

Could you please elaborate why the original one won't work? How would you all 
define {{registeredConnections?}}

I'm looking into adding a metrics to report the backlogged connections to 
shuffle service. If feels "registeredConenctions - activeConnections" would be 
the backlogged connections.

Thanks.

 

 

 

> Remove unused registeredConnections counter from ShuffleMetrics
> ---
>
> Key: SPARK-31646
> URL: https://issues.apache.org/jira/browse/SPARK-31646
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Shuffle, Spark Core
>Affects Versions: 3.0.0
>Reporter: Manu Zhang
>Assignee: Manu Zhang
>Priority: Minor
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org