[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2016-12-01 Thread Junping Du (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junping Du updated HDFS-7725:
-
Fix Version/s: 2.8.0

> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
>  Labels: 2.7.2-candidate
> Fix For: 2.8.0, 2.7.2, 3.0.0-alpha1
>
> Attachments: HDFS-7725-2.patch, HDFS-7725-3.patch, HDFS-7725.patch
>
>
> One of our clusters sometimes couldn't allocate blocks from any DNs. 
> BlockPlacementPolicyDefault complains with the following messages for all DNs.
> {noformat}
> the node is too busy (load:x > y)
> {noformat}
> It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed 
> incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.
> * Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is 
> intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of 
> events without HDFS-7374.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == -1
> * However, HDFS-7374 introduces another inconsistency when recomm is involved.
> ** Cluster has one live node. nodesInService == 1
> ** The node becomes dead. nodesInService == 0
> ** Decomm the node. nodesInService == 0
> ** Recomm the node. nodesInService == 1
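
The two sequences above can be replayed with a toy counter. This is a minimal sketch under stated assumptions, not the actual HeartbeatManager code: the class NodesInServiceDrift, its methods, and the skipDeadOnDecomm flag (standing in for the behaviour the report attributes to HDFS-7374) are hypothetical names introduced only for illustration.

```java
/** Toy model of the counter drift described above; NOT the real HeartbeatManager. */
class NodesInServiceDrift {
    /** Hypothetical stand-in for a DataNode's liveness and admin state. */
    static final class Node {
        boolean alive = true;
        boolean decommissioned = false;
    }

    private int nodesInService = 0;

    // Assumption: true models the post-HDFS-7374 behaviour, where
    // decommissioning a dead node leaves the counter untouched.
    private final boolean skipDeadOnDecomm;

    NodesInServiceDrift(boolean skipDeadOnDecomm) {
        this.skipDeadOnDecomm = skipDeadOnDecomm;
    }

    void register(Node n) {
        nodesInService++;
    }

    void markDead(Node n) {
        n.alive = false;
        if (!n.decommissioned) {
            nodesInService--;
        }
    }

    void decommission(Node n) {
        n.decommissioned = true;
        if (n.alive || !skipDeadOnDecomm) {
            nodesInService--;
        }
    }

    // Buggy on purpose: increments even when the node is dead.
    void recommission(Node n) {
        n.decommissioned = false;
        nodesInService++;
    }

    /** First scenario: dead node decommissioned, no liveness check. */
    static int withoutHdfs7374() {
        NodesInServiceDrift stats = new NodesInServiceDrift(false);
        Node n = new Node();
        stats.register(n);      // 1
        stats.markDead(n);      // 0
        stats.decommission(n);  // -1
        return stats.nodesInService;
    }

    /** Second scenario: decommission is guarded, recommission is not. */
    static int withHdfs7374() {
        NodesInServiceDrift stats = new NodesInServiceDrift(true);
        Node n = new Node();
        stats.register(n);      // 1
        stats.markDead(n);      // 0
        stats.decommission(n);  // 0
        stats.recommission(n);  // 1, although the node is still dead
        return stats.nodesInService;
    }

    public static void main(String[] args) {
        System.out.println("without HDFS-7374: " + withoutHdfs7374());
        System.out.println("with HDFS-7374:    " + withHdfs7374());
    }
}
```

In both runs the only node is dead at the end, so a consistent counter would read 0. If the "too busy" threshold is derived from an average over in-service nodes (an assumption here; the report does not spell out the formula), a wrong count would skew that threshold for every DN at once.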



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)




[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-11-02 Thread Arpit Agarwal (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arpit Agarwal updated HDFS-7725:

Fix Version/s: (was: 3.0.0)

> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
>  Labels: 2.7.2-candidate
> Fix For: 2.7.2
>
> Attachments: HDFS-7725-2.patch, HDFS-7725-3.patch, HDFS-7725.patch


[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-10-27 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-7725:
-
Fix Version/s: (was: 2.8.0)
   2.7.2
   3.0.0

> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
>  Labels: 2.7.2-candidate
> Fix For: 3.0.0, 2.7.2
>
> Attachments: HDFS-7725-2.patch, HDFS-7725-3.patch, HDFS-7725.patch


[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-10-27 Thread Kihwal Lee (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kihwal Lee updated HDFS-7725:
-
Labels: 2.7.2-candidate  (was: )

> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
>  Labels: 2.7.2-candidate
> Fix For: 2.8.0
>
> Attachments: HDFS-7725-2.patch, HDFS-7725-3.patch, HDFS-7725.patch


[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-04-08 Thread Andrew Wang (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Wang updated HDFS-7725:
--
   Resolution: Fixed
Fix Version/s: 2.8.0
   Status: Resolved  (was: Patch Available)

Committed to trunk and branch-2. Thanks for the patch, Ming, and thanks to Zhe 
for also reviewing.

The failed test also failed twice for me without the patch applied, so it seems 
to be independently broken.

> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Fix For: 2.8.0
>
> Attachments: HDFS-7725-2.patch, HDFS-7725-3.patch, HDFS-7725.patch


[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-04-03 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-7725:
--
Attachment: HDFS-7725-3.patch

Thanks, Andrew and Zhe. The latest patch moves the liveness check to 
{{HeartbeatManager}} with some minor changes to {{DecommissionManager}}'s 
{{startDecommission}}. The patch also has the other suggestions you mentioned.

> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7725-2.patch, HDFS-7725-3.patch, HDFS-7725.patch


[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-03-30 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-7725:
--
Attachment: HDFS-7725-2.patch

Thanks, Zhe. Here is the rebase of the patch.

> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7725-2.patch, HDFS-7725.patch


[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-03-30 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-7725:
--
Description: 
One of our clusters sometimes couldn't allocate blocks from any DNs. 
BlockPlacementPolicyDefault complains with the following messages for all DNs.

{noformat}
the node is too busy (load:x > y)
{noformat}


It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed 
incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.

* Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is 
intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of event 
without HDFS-7374.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == -1

* However, HDFS-7374 introduces another inconsistency when recomm is involved.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == 0
** Recomm the node. nodesInService == 1

  was:
`One of our clusters sometimes couldn't allocate blocks from any DNs. 
BlockPlacementPolicyDefault complains with the following messages for all DNs.

{noformat}
the node is too busy (load:x > y)
{noformat}


It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed 
incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.

* Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is 
intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of event 
without HDFS-7374.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == -1

* However, HDFS-7374 introduces another inconsistency when recomm is involved.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == 0
** Recomm the node. nodesInService == 1


> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7725.patch


[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-03-30 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-7725:
--
Description: 
`One of our clusters sometimes couldn't allocate blocks from any DNs. 
BlockPlacementPolicyDefault complains with the following messages for all DNs.

{noformat}
the node is too busy (load:x > y)
{noformat}


It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed 
incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.

* Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is 
intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of event 
without HDFS-7374.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == -1

* However, HDFS-7374 introduces another inconsistency when recomm is involved.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == 0
** Recomm the node. nodesInService == 1

  was:
One of our clusters sometimes couldn't allocate blocks from any DNs. 
BlockPlacementPolicyDefault complains with the following messages for all DNs.

{noformat}
the node is too busy (load:x > y)
{noformat}


It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed 
incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.

* Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is 
intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of event 
without HDFS-7374.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == -1

* However, HDFS-7374 introduces another inconsistency when recomm is involved.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == 0
** Recomm the node. nodesInService == 1


> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7725.patch


[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-02-02 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-7725:
--
Description: 
One of our clusters sometimes couldn't allocate blocks from any DNs. 
BlockPlacementPolicyDefault complains with the following messages for all DNs.

{noformat}
the node is too busy (load:x > y)
{noformat}


It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed 
incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.

* Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is 
intentional. cc / [~zhz], [~andrew.wang], [~atm] Here is the sequence of event 
without HDFS-7374.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == -1

* However, HDFS-7374 introduces another inconsistency when recomm is involved.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == 0
** Recomm the node. nodesInService == 1

  was:
One of our clusters sometimes couldn't allocate blocks from any DNs. 
BlockPlacementPolicyDefault complains with the following messages for all DNs.

{noformat}
the node is too busy (load:x > y)
{noformat}


It turns out the {{HeartbeatManager}}'s {{nodesInService}} was computed 
incorrectly when admins decomm or recomm dead nodes. Here are two scenarios.

* Decomm dead nodes. It turns out HDFS-7374 has fixed it; not sure if it is 
intentional. Here is the sequence of event without HDFS-7374.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == -1

* However, HDFS-7374 introduces another inconsistency when recomm is involved.
** Cluster has one live node. nodesInService == 1
** The node becomes dead. nodesInService == 0
** Decomm the node. nodesInService == 0
** Recomm the node. nodesInService == 1


> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7725.patch


[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-02-02 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-7725:
--
Attachment: HDFS-7725.patch

The patch makes sure the {{nodesInService}} count won't be updated when a dead 
node is recommissioned.
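
The idea in this comment can be sketched as a counter that only changes when the node is alive. This is a hedged illustration, not the contents of HDFS-7725.patch: GuardedNodesInService and its methods are hypothetical names, and the real change lives in Hadoop's own classes.

```java
/** Hypothetical liveness-guarded counter; not the actual patch. */
class GuardedNodesInService {
    /** Hypothetical stand-in for a DataNode's liveness and admin state. */
    static final class Node {
        boolean alive = true;
        boolean decommissioned = false;
    }

    private int nodesInService = 0;

    int get() {
        return nodesInService;
    }

    void register(Node n) {
        if (n.alive && !n.decommissioned) {
            nodesInService++;
        }
    }

    void markDead(Node n) {
        // Only a node that was counted can be subtracted.
        if (n.alive && !n.decommissioned) {
            nodesInService--;
        }
        n.alive = false;
    }

    void decommission(Node n) {
        if (n.decommissioned) {
            return;
        }
        // A dead node already left the count when it died.
        if (n.alive) {
            nodesInService--;
        }
        n.decommissioned = true;
    }

    void recommission(Node n) {
        if (!n.decommissioned) {
            return;
        }
        n.decommissioned = false;
        // The guard: a dead node does not re-enter the count.
        if (n.alive) {
            nodesInService++;
        }
    }

    /** Dead node is decommissioned, then recommissioned. */
    static int deadNodeRoundTrip() {
        GuardedNodesInService stats = new GuardedNodesInService();
        Node n = new Node();
        stats.register(n);
        stats.markDead(n);
        stats.decommission(n);
        stats.recommission(n);
        return stats.get();
    }

    /** Live node is decommissioned, then recommissioned. */
    static int liveNodeRoundTrip() {
        GuardedNodesInService stats = new GuardedNodesInService();
        Node n = new Node();
        stats.register(n);
        stats.decommission(n);
        stats.recommission(n);
        return stats.get();
    }
}
```

With the guard in place, the dead-node round trip leaves the count at 0, while a live node that is decommissioned and recommissioned returns the count to 1.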

> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
> Attachments: HDFS-7725.patch


[jira] [Updated] (HDFS-7725) Incorrect "nodes in service" metrics caused all writes to fail

2015-02-02 Thread Ming Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/HDFS-7725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ming Ma updated HDFS-7725:
--
Assignee: Ming Ma
  Status: Patch Available  (was: Open)

> Incorrect "nodes in service" metrics caused all writes to fail
> --
>
> Key: HDFS-7725
> URL: https://issues.apache.org/jira/browse/HDFS-7725
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Ming Ma
>Assignee: Ming Ma
> Attachments: HDFS-7725.patch