[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
[ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du updated HDFS-7725:
-----------------------------
    Fix Version/s: 2.8.0

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Labels: 2.7.2-candidate
> Fix For: 2.8.0, 2.7.2, 3.0.0-alpha1
>
> Attachments: HDFS-7725-2.patch, HDFS-7725-3.patch, HDFS-7725.patch
>
>
> One of our clusters sometimes couldn't allocate blocks from any DNs.
> BlockPlacementPolicyDefault complains with the following messages for all DNs.
> {noformat}
> the node is too busy (load:x > y)
> {noformat}
> It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed
> incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
> * Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is
> intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of
> events without HDFS-7374.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == -1
> * However, HDFS-7374 introduces another inconsistency when recomm is involved.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == 0
> ** Recomm the node. nodesInService == 1

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
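The two scenarios in the issue description can be replayed with a small toy model. This is only a sketch of the counting logic as the description tells it, not Hadoop code: the class `NodesInServiceModel`, its methods, and the `skipDeadOnDecomm` flag (standing in for the behavior the description attributes to HDFS-7374) are all hypothetical names made up for illustration.

```java
// Toy model of an incrementally maintained "nodes in service" counter.
// Every name in this class is hypothetical; none of it is actual Hadoop code.
class NodesInServiceModel {
    static final class Node {
        boolean alive = true;
        boolean decommissioned = false;
    }

    int nodesInService = 0;

    // When true, decommissioning a dead node leaves the counter alone.
    // This stands in for the behavior the issue attributes to HDFS-7374.
    final boolean skipDeadOnDecomm;

    NodesInServiceModel(boolean skipDeadOnDecomm) {
        this.skipDeadOnDecomm = skipDeadOnDecomm;
    }

    Node register() {
        Node n = new Node();
        nodesInService++;
        return n;
    }

    void markDead(Node n) {
        if (n.alive) {
            n.alive = false;
            if (!n.decommissioned) {
                nodesInService--;
            }
        }
    }

    void decommission(Node n) {
        if (n.decommissioned) {
            return;
        }
        n.decommissioned = true;
        // Without the liveness check, a dead node is subtracted a second time.
        if (n.alive || !skipDeadOnDecomm) {
            nodesInService--;
        }
    }

    // Deliberately buggy: adds the node back without checking liveness.
    void recommission(Node n) {
        if (!n.decommissioned) {
            return;
        }
        n.decommissioned = false;
        nodesInService++;
    }

    // Scenario 1: live -> dead -> decomm, without the HDFS-7374 behavior.
    static int scenarioDecommDead() {
        NodesInServiceModel m = new NodesInServiceModel(false);
        Node n = m.register();
        m.markDead(n);
        m.decommission(n);
        return m.nodesInService;
    }

    // Scenario 2: live -> dead -> decomm -> recomm, with the HDFS-7374 behavior.
    static int scenarioRecommDead() {
        NodesInServiceModel m = new NodesInServiceModel(true);
        Node n = m.register();
        m.markDead(n);
        m.decommission(n);
        m.recommission(n);
        return m.nodesInService;
    }
}
```

In this model `scenarioDecommDead()` ends at -1 and `scenarioRecommDead()` ends at 1, even though in both cases the only node is dead and the count of nodes in service should be 0.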
[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
[ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arpit Agarwal updated HDFS-7725:
--------------------------------
    Fix Version/s: (was: 3.0.0)

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Labels: 2.7.2-candidate
> Fix For: 2.7.2
>
> Attachments: HDFS-7725-2.patch, HDFS-7725-3.patch, HDFS-7725.patch
[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
[ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-7725:
-----------------------------
    Fix Version/s: (was: 2.8.0)
                   2.7.2
                   3.0.0

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Labels: 2.7.2-candidate
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-7725-2.patch, HDFS-7725-3.patch, HDFS-7725.patch
[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
[ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kihwal Lee updated HDFS-7725:
-----------------------------
    Labels: 2.7.2-candidate (was: )

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Labels: 2.7.2-candidate
> Fix For: 2.8.0
>
> Attachments: HDFS-7725-2.patch, HDFS-7725-3.patch, HDFS-7725.patch
[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
[ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrew Wang updated HDFS-7725:
------------------------------
    Resolution: Fixed
    Fix Version/s: 2.8.0
    Status: Resolved (was: Patch Available)

Committed to trunk and branch-2. Thanks for the patch, Ming, and thanks to Zhe for also reviewing.

The failed test also failed twice for me without the patch applied, so it seems independently broken.

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Fix For: 2.8.0
>
> Attachments: HDFS-7725-2.patch, HDFS-7725-3.patch, HDFS-7725.patch
[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
[ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7725:
--------------------------
    Attachment: HDFS-7725-3.patch

Thanks, Andrew and Zhe. The latest patch moves the liveness check to {{HeartbeatManager}} with some minor changes to {{DecommissionManager}}'s {{startDecommission}}. The patch also has the other suggestions you mentioned.

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: HDFS-7725-2.patch, HDFS-7725-3.patch, HDFS-7725.patch
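For readers without the patch at hand, here is a rough sketch of what keeping the liveness check inside the heartbeat-tracking class could look like. It is not the contents of HDFS-7725-3.patch: `HeartbeatManagerSketch`, its `Node` type, and the method bodies are hypothetical, and only the idea (adjust the counter on decomm or recomm only when the node is alive) comes from this thread.

```java
// Hypothetical sketch of a counter owner that performs the liveness check itself.
// Names and structure are illustrative assumptions, not the actual patch.
class HeartbeatManagerSketch {
    static final class Node {
        boolean alive = true;
        boolean decommissioned = false;
    }

    private int nodesInService = 0;

    synchronized Node register() {
        Node n = new Node();
        nodesInService++;
        return n;
    }

    synchronized void markDead(Node n) {
        if (n.alive) {
            n.alive = false;
            if (!n.decommissioned) {
                nodesInService--;
            }
        }
    }

    synchronized void startDecommission(Node n) {
        if (n.decommissioned) {
            return;
        }
        // A dead node already left the count when it died, so skip it here.
        if (n.alive) {
            nodesInService--;
        }
        n.decommissioned = true;
    }

    synchronized void stopDecommission(Node n) {
        if (!n.decommissioned) {
            return;
        }
        n.decommissioned = false;
        // Only a live node goes back into service.
        if (n.alive) {
            nodesInService++;
        }
    }

    synchronized int getNodesInService() {
        return nodesInService;
    }

    // live -> dead -> decomm -> recomm
    static int deadNodeRoundTrip() {
        HeartbeatManagerSketch hm = new HeartbeatManagerSketch();
        Node n = hm.register();
        hm.markDead(n);
        hm.startDecommission(n);
        hm.stopDecommission(n);
        return hm.getNodesInService();
    }

    // live -> decomm -> recomm
    static int liveNodeRoundTrip() {
        HeartbeatManagerSketch hm = new HeartbeatManagerSketch();
        Node n = hm.register();
        hm.startDecommission(n);
        hm.stopDecommission(n);
        return hm.getNodesInService();
    }
}
```

With the check in one place, callers do not need to know whether a node is alive before asking for a decomm or recomm. In this sketch the dead-node round trip leaves the count at 0 and the live-node round trip leaves it at 1.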
[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
[ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7725:
--------------------------
    Attachment: HDFS-7725-2.patch

Thanks, Zhe. Here is the rebase of the patch.

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: HDFS-7725-2.patch, HDFS-7725.patch
[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
[ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7725:
--------------------------
    Description:
One of our clusters sometimes couldn't allocate blocks from any DNs. BlockPlacementPolicyDefault complains with the following messages for all DNs.
{noformat}
the node is too busy (load:x > y)
{noformat}
It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
* Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of events without HDFS-7374.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == -1
* However, HDFS-7374 introduces another inconsistency when recomm is involved.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == 0
** Recomm the node. nodesInService == 1

  was:
`One of our clusters sometimes couldn't allocate blocks from any DNs. BlockPlacementPolicyDefault complains with the following messages for all DNs.
{noformat}
the node is too busy (load:x > y)
{noformat}
It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
* Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of events without HDFS-7374.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == -1
* However, HDFS-7374 introduces another inconsistency when recomm is involved.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == 0
** Recomm the node. nodesInService == 1

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: HDFS-7725.patch
[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
[ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7725:
--------------------------
    Description:
`One of our clusters sometimes couldn't allocate blocks from any DNs. BlockPlacementPolicyDefault complains with the following messages for all DNs.
{noformat}
the node is too busy (load:x > y)
{noformat}
It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
* Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of events without HDFS-7374.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == -1
* However, HDFS-7374 introduces another inconsistency when recomm is involved.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == 0
** Recomm the node. nodesInService == 1

  was:
One of our clusters sometimes couldn't allocate blocks from any DNs. BlockPlacementPolicyDefault complains with the following messages for all DNs.
{noformat}
the node is too busy (load:x > y)
{noformat}
It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
* Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of events without HDFS-7374.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == -1
* However, HDFS-7374 introduces another inconsistency when recomm is involved.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == 0
** Recomm the node. nodesInService == 1

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: HDFS-7725.patch
[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
[ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7725:
--------------------------
    Description:
One of our clusters sometimes couldn't allocate blocks from any DNs. BlockPlacementPolicyDefault complains with the following messages for all DNs.
{noformat}
the node is too busy (load:x > y)
{noformat}
It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
* Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of events without HDFS-7374.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == -1
* However, HDFS-7374 introduces another inconsistency when recomm is involved.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == 0
** Recomm the node. nodesInService == 1

  was:
One of our clusters sometimes couldn't allocate blocks from any DNs. BlockPlacementPolicyDefault complains with the following messages for all DNs.
{noformat}
the node is too busy (load:x > y)
{noformat}
It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
* Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is intentional. Here is the sequence of events without HDFS-7374.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == -1
* However, HDFS-7374 introduces another inconsistency when recomm is involved.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == 0
** Recomm the node. nodesInService == 1

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: HDFS-7725.patch
[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
[ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7725:
--------------------------
    Attachment: HDFS-7725.patch

The patch makes sure the {{nodesInService}} count won't be updated when a dead node is recommissioned.

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Attachments: HDFS-7725.patch
[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail
[ https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated HDFS-7725:
--------------------------
    Assignee: Ming Ma
    Status: Patch Available (was: Open)

> Incorrect "nodes in service" metrics caused all writes to fail
> --------------------------------------------------------------
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Ming Ma
> Assignee: Ming Ma
> Attachments: HDFS-7725.patch
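The thread does not spell out why a wrong {{nodesInService}} leads to "the node is too busy (load:x > y)" on every DN. One plausible reading, sketched below, is that the placement policy compares each node's load against a multiple of the average load over in-service nodes, so an inflated count shrinks the average and with it the threshold. The factor of 2.0, the class `LoadCheckSketch`, and its methods are assumptions for illustration and are not taken from this issue.

```java
// Hypothetical illustration of a load check driven by an in-service node count.
// The factor of 2.0 and every name here are assumptions, not quoted Hadoop code.
class LoadCheckSketch {
    static final double LOAD_FACTOR = 2.0;

    // A node is rejected when its load exceeds LOAD_FACTOR times the average load,
    // where the average divides total load by the (possibly wrong) in-service count.
    static boolean isTooBusy(int nodeLoad, int totalLoad, int nodesInService) {
        double average = nodesInService > 0 ? (double) totalLoad / nodesInService : 0.0;
        return nodeLoad > LOAD_FACTOR * average;
    }

    // Counts how many of the given live nodes would be rejected as too busy.
    static int countRejected(int[] loads, int nodesInService) {
        int total = 0;
        for (int load : loads) {
            total += load;
        }
        int rejected = 0;
        for (int load : loads) {
            if (isTooBusy(load, total, nodesInService)) {
                rejected++;
            }
        }
        return rejected;
    }
}
```

For three live nodes with a load of 10 each, `countRejected(new int[] {10, 10, 10}, 3)` rejects none, since the threshold is 20. If recommissioned dead nodes had inflated the count to 9, the threshold would drop to about 6.7 and all three nodes would be rejected, which matches the symptom of no DN being able to take a block.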