[jira] [Updated] (HDFS-16902) Add Namenode status to BPServiceActor metrics and improve logging in offerservice

2023-01-31 Thread Viraj Jasani (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani updated HDFS-16902:

Description: 
Recently came across a k8s environment where, at random, some datanode pods are 
not able to stay connected to all namenode pods (e.g., the last heartbeat time 
sometimes stays above 2 hours). When any standby namenode becomes active, any 
datanode that has not been heartbeating to it for quite some time would not be 
able to send any further block reports, leading to missing replicas immediately 
after the namenode failover, which could only be resolved with a datanode pod 
restart.

While the issue seems environment specific, BPServiceActor's offerService could 
use some logging improvements. It would also be useful to expose the namenode 
status through BPServiceActorInfo, to identify any lag on the datanode side in 
recognizing the updated Active namenode status via heartbeats.
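
The proposal can be illustrated with a small, self-contained sketch (plain Java, 
not the actual BPServiceActor change; the class name, the "NamenodeHaState" and 
"LastHeartbeatAgeMs" keys, and the 5-minute threshold are assumptions made here 
for illustration): the actor remembers the HA state the namenode last reported 
in a heartbeat response, exposes it through a per-actor info map mirroring the 
idea behind BPServiceActorInfo, and logs both HA-state transitions and unusually 
long heartbeat gaps from an offerService-style loop.

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/**
 * Standalone illustration only: tracks the HA state a namenode reports in its
 * heartbeat responses, exposes it through a per-actor info map (mirroring the
 * idea behind BPServiceActorInfo), and logs heartbeat gaps and state changes.
 */
public class BpServiceActorStatusSketch {

  /** Simplified HA states; Hadoop has its own HAServiceState type. */
  enum HaState { ACTIVE, STANDBY, OBSERVER, UNKNOWN }

  // Hypothetical threshold; a real datanode would derive this from its config.
  private static final long HEARTBEAT_GAP_WARN_MS = TimeUnit.MINUTES.toMillis(5);

  private final String nnAddress;
  private volatile HaState lastKnownHaState = HaState.UNKNOWN;
  private volatile long lastHeartbeatMs = System.currentTimeMillis();

  BpServiceActorStatusSketch(String nnAddress) {
    this.nnAddress = nnAddress;
  }

  /** Heartbeat path: record the time and log any HA-state transition. */
  void onHeartbeatResponse(HaState reportedState) {
    lastHeartbeatMs = System.currentTimeMillis();
    if (reportedState != lastKnownHaState) {
      // The datanode itself would go through LOG.info(...) instead of stdout.
      System.out.printf("Namenode %s HA state changed: %s -> %s%n",
          nnAddress, lastKnownHaState, reportedState);
      lastKnownHaState = reportedState;
    }
  }

  /** offerService-style loop check: warn when the heartbeat gap grows large. */
  void checkHeartbeatGap() {
    long gapMs = System.currentTimeMillis() - lastHeartbeatMs;
    if (gapMs > HEARTBEAT_GAP_WARN_MS) {
      System.out.printf("No successful heartbeat to %s for %d ms "
          + "(last known HA state: %s)%n", nnAddress, gapMs, lastKnownHaState);
    }
  }

  /** Info map for metrics/JMX; the key names here are illustrative only. */
  Map<String, String> getActorInfoMap() {
    Map<String, String> info = new LinkedHashMap<>();
    info.put("NamenodeAddress", nnAddress);
    info.put("NamenodeHaState", lastKnownHaState.toString());
    info.put("LastHeartbeatAgeMs",
        String.valueOf(System.currentTimeMillis() - lastHeartbeatMs));
    return info;
  }

  public static void main(String[] args) {
    BpServiceActorStatusSketch actor =
        new BpServiceActorStatusSketch("nn-0.example.com:8020");
    actor.onHeartbeatResponse(HaState.STANDBY);
    actor.onHeartbeatResponse(HaState.ACTIVE);  // observed failover
    actor.checkHeartbeatGap();
    System.out.println(actor.getActorInfoMap());
  }
}
{code}

Running main() prints the observed HA-state transition and the resulting info 
map, which is the kind of signal that would make a stale-heartbeat datanode 
visible before a failover turns it into missing replicas.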

  was:
Recently came across a k8s environment where, at random, some datanode pods are 
not able to stay connected to all namenode pods (e.g., the last heartbeat time 
sometimes stays above 2 hours). When a new namenode becomes active, any 
datanode that is not heartbeating to it would not be able to send any further 
block reports, sometimes leading to missing replicas, which would only be 
resolved with a datanode pod restart.

While the issue seems environment specific, BPServiceActor's offerService could 
use some logging improvements. It would also be useful to expose the namenode 
status through BPServiceActorInfo, to identify any lag on the datanode side in 
recognizing the updated Active namenode status via heartbeats.


> Add Namenode status to BPServiceActor metrics and improve logging in 
> offerservice
> -
>
> Key: HDFS-16902
> URL: https://issues.apache.org/jira/browse/HDFS-16902
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>
> Recently came across a k8s environment where, at random, some datanode pods 
> are not able to stay connected to all namenode pods (e.g., the last heartbeat 
> time sometimes stays above 2 hours). When any standby namenode becomes 
> active, any datanode that has not been heartbeating to it for quite some time 
> would not be able to send any further block reports, leading to missing 
> replicas immediately after the namenode failover, which could only be 
> resolved with a datanode pod restart.
> While the issue seems environment specific, BPServiceActor's offerService 
> could use some logging improvements. It would also be useful to expose the 
> namenode status through BPServiceActorInfo, to identify any lag on the 
> datanode side in recognizing the updated Active namenode status via 
> heartbeats.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16902) Add Namenode status to BPServiceActor metrics and improve logging in offerservice

2023-02-01 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-16902:
--
Labels: pull-request-available  (was: )

> Add Namenode status to BPServiceActor metrics and improve logging in 
> offerservice
> -
>
> Key: HDFS-16902
> URL: https://issues.apache.org/jira/browse/HDFS-16902
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
>
> Recently came across a k8s environment where, at random, some datanode pods 
> are not able to stay connected to all namenode pods (e.g., the last heartbeat 
> time sometimes stays above 2 hours). When any standby namenode becomes 
> active, any datanode that has not been heartbeating to it for quite some time 
> would not be able to send any further block reports, leading to missing 
> replicas immediately after the namenode failover, which could only be 
> resolved with a datanode pod restart.
> While the issue seems environment specific, BPServiceActor's offerService 
> could use some logging improvements. It would also be useful to expose the 
> namenode status through BPServiceActorInfo, to identify any lag on the 
> datanode side in recognizing the updated Active namenode status via 
> heartbeats.






[jira] [Updated] (HDFS-16902) Add Namenode status to BPServiceActor metrics and improve logging in offerservice

2023-02-02 Thread Tao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Li updated HDFS-16902:
--
Fix Version/s: 3.2.5

> Add Namenode status to BPServiceActor metrics and improve logging in 
> offerservice
> -
>
> Key: HDFS-16902
> URL: https://issues.apache.org/jira/browse/HDFS-16902
> Project: Hadoop HDFS
>  Issue Type: Task
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.5
>
>
> Recently came across a k8s environment where, at random, some datanode pods 
> are not able to stay connected to all namenode pods (e.g., the last heartbeat 
> time sometimes stays above 2 hours). When any standby namenode becomes 
> active, any datanode that has not been heartbeating to it for quite some time 
> would not be able to send any further block reports, leading to missing 
> replicas immediately after the namenode failover, which could only be 
> resolved with a datanode pod restart.
> While the issue seems environment specific, BPServiceActor's offerService 
> could use some logging improvements. It would also be useful to expose the 
> namenode status through BPServiceActorInfo, to identify any lag on the 
> datanode side in recognizing the updated Active namenode status via 
> heartbeats.






[jira] [Updated] (HDFS-16902) Add Namenode status to BPServiceActor metrics and improve logging in offerservice

2024-01-27 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated HDFS-16902:
--
 Hadoop Flags: Reviewed
 Target Version/s: 3.3.6, 3.4.0
Affects Version/s: 3.3.6, 3.4.0

> Add Namenode status to BPServiceActor metrics and improve logging in 
> offerservice
> -
>
> Key: HDFS-16902
> URL: https://issues.apache.org/jira/browse/HDFS-16902
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: namenode
>Affects Versions: 3.4.0, 3.3.6
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.5, 3.3.6
>
>
> Recently came across a k8s environment where, at random, some datanode pods 
> are not able to stay connected to all namenode pods (e.g., the last heartbeat 
> time sometimes stays above 2 hours). When any standby namenode becomes 
> active, any datanode that has not been heartbeating to it for quite some time 
> would not be able to send any further block reports, leading to missing 
> replicas immediately after the namenode failover, which could only be 
> resolved with a datanode pod restart.
> While the issue seems environment specific, BPServiceActor's offerService 
> could use some logging improvements. It would also be useful to expose the 
> namenode status through BPServiceActorInfo, to identify any lag on the 
> datanode side in recognizing the updated Active namenode status via 
> heartbeats.






[jira] [Updated] (HDFS-16902) Add Namenode status to BPServiceActor metrics and improve logging in offerservice

2024-01-27 Thread Shilun Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16902?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shilun Fan updated HDFS-16902:
--
Component/s: namenode

> Add Namenode status to BPServiceActor metrics and improve logging in 
> offerservice
> -
>
> Key: HDFS-16902
> URL: https://issues.apache.org/jira/browse/HDFS-16902
> Project: Hadoop HDFS
>  Issue Type: Task
>  Components: namenode
>Reporter: Viraj Jasani
>Assignee: Viraj Jasani
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.5, 3.3.6
>
>
> Recently came across a k8s environment where, at random, some datanode pods 
> are not able to stay connected to all namenode pods (e.g., the last heartbeat 
> time sometimes stays above 2 hours). When any standby namenode becomes 
> active, any datanode that has not been heartbeating to it for quite some time 
> would not be able to send any further block reports, leading to missing 
> replicas immediately after the namenode failover, which could only be 
> resolved with a datanode pod restart.
> While the issue seems environment specific, BPServiceActor's offerService 
> could use some logging improvements. It would also be useful to expose the 
> namenode status through BPServiceActorInfo, to identify any lag on the 
> datanode side in recognizing the updated Active namenode status via 
> heartbeats.


