[jira] [Commented] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-02 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277120#comment-17277120
 ] 

huhaiyang commented on HDFS-15798:
--

Uploaded the v003 patch according to your suggestions. 
 

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch, 
> HDFS-15798.003.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks can subtract an incorrect value;
>  as a result, the XmitsInProgress counter of the DN can go negative, which affects 
> how the NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
> increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
> decrement,  XmitsInProgress is decremented by the previous value
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
> decrement
> ...
>   }
> }{code}
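
For illustration, here is a minimal, self-contained sketch (a toy model, not DataNode
code) of how the counter can go negative under the scenario discussed in the review
comments later in this archive: if xmitsSubmitted still holds the value from a previous
loop pass when an exception is thrown, the catch block subtracts an amount that was
never added for that task. All names and values below are hypothetical.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the xmitsInProgress bookkeeping; assumes xmitsSubmitted is not
// reset between loop passes, as described in the review discussion.
public class XmitsAccountingSketch {
  public static void main(String[] args) {
    AtomicInteger xmitsInProgress = new AtomicInteger(0);
    int[] taskCosts = {5, 3};   // hypothetical per-task xmits values
    int xmitsSubmitted = 0;     // carries over between passes in this scenario

    for (int i = 0; i < taskCosts.length; i++) {
      try {
        if (i == 1) {
          // Second pass: fails (e.g. during task setup) before xmitsSubmitted
          // is reassigned and before any increment happens.
          throw new IllegalArgumentException("simulated task setup failure");
        }
        xmitsSubmitted = Math.max(taskCosts[i], 1);
        xmitsInProgress.addAndGet(xmitsSubmitted);   // task submitted: +5
      } catch (Throwable e) {
        // Stale value: subtracts 5 although this pass never added anything.
        xmitsInProgress.addAndGet(-xmitsSubmitted);
      }
    }

    // The first task's own completion path later subtracts its 5 as well,
    // leaving the counter negative.
    xmitsInProgress.addAndGet(-5);
    System.out.println("xmitsInProgress = " + xmitsInProgress.get()); // -5
  }
}
{code}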



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-02 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Attachment: HDFS-15798.003.patch

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch, 
> HDFS-15798.003.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks can subtract an incorrect value;
>  as a result, the XmitsInProgress counter of the DN can go negative, which affects 
> how the NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
> increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
> decrement,  XmitsInProgress is decremented by the previous value
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
> decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-02 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277075#comment-17277075
 ] 

huhaiyang commented on HDFS-15798:
--

[~ferhui]    [~sodonnell] Thank you for your advice!

That makes sense to me; I will submit a new patch later.

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks can subtract an incorrect value;
>  as a result, the XmitsInProgress counter of the DN can go negative, which affects 
> how the NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
> increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
> decrement,  XmitsInProgress is decremented by the previous value
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
> decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-01 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276283#comment-17276283
 ] 

huhaiyang edited comment on HDFS-15798 at 2/1/21, 12:13 PM:


[~sodonnell]  We have encountered exceptions like this in our cluster
{code:java}
2020-12-29 07:47:03,409 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Failed to reconstruct striped block: BP-xxx:blk_-xxx
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:93)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
Currently the exception is caught in StripedBlockReconstructor#run -> catch(Throwable e), 
and the finally block then decrements XmitsInProgress.

However, we have not yet come across an exception logged from 
ErasureCodingWorker#processErasureCodingTasks -> catch(Throwable e).


was (Author: haiyang hu):
[~sodonnell]  We have encountered exceptions like this in our cluster
{code:java}
2020-12-29 07:47:03,409 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Failed to reconstruct striped block: BP-xxx:blk_-xxx
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:93)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
Currently the exception is caught in StripedBlockReconstructor#run -> catch(Throwable e), 
and the finally block then decrements XmitsInProgress.

However, no exception has been logged from 
ErasureCodingWorker#processErasureCodingTasks -> catch(Throwable e).

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks can subtract an incorrect value;
>  as a result, the XmitsInProgress counter of the DN can go negative, which affects 
> how the NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmit

[jira] [Comment Edited] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-01 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276283#comment-17276283
 ] 

huhaiyang edited comment on HDFS-15798 at 2/1/21, 12:12 PM:


[~sodonnell]  We have encountered exceptions like this in our cluster
{code:java}
2020-12-29 07:47:03,409 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Failed to reconstruct striped block: BP-xxx:blk_-xxx
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:93)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
Currently the exception is caught in StripedBlockReconstructor#run -> catch(Throwable e), 
and the finally block then decrements XmitsInProgress.

However, no exception has been logged from 
ErasureCodingWorker#processErasureCodingTasks -> catch(Throwable e).


was (Author: haiyang hu):
[~sodonnell]  We have encountered exceptions like this in our cluster
{code:java}
2020-12-29 07:47:03,409 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Failed to reconstruct striped block: BP-xxx:blk_-xxx
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:93)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
Currently the exception is caught in StripedBlockReconstructor#run -> catch(Throwable e), 
and the finally block then decrements XmitsInProgress.

However, no exception has been logged from 
ErasureCodingWorker#processErasureCodingTasks -> catch(Throwable e).

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks can subtract an incorrect value;
>  as a result, the XmitsInProgress counter of the DN can go negative, which affects 
> how the NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task

[jira] [Commented] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-01 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276283#comment-17276283
 ] 

huhaiyang commented on HDFS-15798:
--

[~sodonnell]  We have encountered exceptions like this in our cluster
{code:java}
2020-12-29 07:47:03,409 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
Failed to reconstruct striped block: BP-xxx:blk_-xxx
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:93)
at 
org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}
Currently the exception is caught in StripedBlockReconstructor#run -> catch(Throwable e), 
and the finally block then decrements XmitsInProgress.

However, no exception has been logged from 
ErasureCodingWorker#processErasureCodingTasks -> catch(Throwable e).

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks can subtract an incorrect value;
>  as a result, the XmitsInProgress counter of the DN can go negative, which affects 
> how the NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
> increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
> decrement,  XmitsInProgress is decremented by the previous value
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
> decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-02-01 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276145#comment-17276145
 ] 

huhaiyang commented on HDFS-15798:
--

[~ferhui] Thanks for the reviews!

I have carefully checked the code, and the current logic should be fine.

Thanks [~ferhui] and [~sodonnell] for helping to review.

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks can subtract an incorrect value;
>  as a result, the XmitsInProgress counter of the DN can go negative, which affects 
> how the NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
> increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
> decrement,  XmitsInProgress is decremented by the previous value
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
> decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo

2021-01-29 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275504#comment-17275504
 ] 

huhaiyang edited comment on HDFS-15803 at 1/30/21, 7:28 AM:


Uploaded a simple patch.

Here is the patch to remove it. No new test case is needed.

 


was (Author: haiyang hu):
Uploaded a simple patch. Here is the patch to remove it. No new test case is 
needed.

 

> Remove unnecessary method (getWeight) in StripedReconstructionInfo 
> ---
>
> Key: HDFS-15803
> URL: https://issues.apache.org/jira/browse/HDFS-15803
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: huhaiyang
>Priority: Trivial
> Attachments: HDFS-15803_001.patch
>
>
>  Removing the unused method from StripedReconstructionInfo
> {code:java}
> // StripedReconstructionInfo.java
> /**
>  * Return the weight of this EC reconstruction task.
>  *
>  * DN uses it to coordinate with NN to adjust the speed of scheduling the
>  * reconstructions tasks to this DN.
>  *
>  * @return the weight of this reconstruction task.
>  * @see HDFS-12044
>  */
> int getWeight() {
>   // See HDFS-12044. The weight of a RS(n, k) is calculated by the network
>   // connections it opens.
>   return sources.length + targets.length;
> }
> {code}
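
As a quick worked example of the weight definition in the javadoc above (hypothetical
numbers, not HDFS code): the weight is simply the number of network connections the
reconstruction task opens. For context, the HDFS-15798 snippets in this archive derive
the per-task transfer cost from task.getXmits() * xmitWeight, which is presumably why
getWeight() is no longer referenced.

{code:java}
// Worked example of the weight definition above (hypothetical values).
public class WeightExample {
  public static void main(String[] args) {
    int sources = 6; // e.g. reading 6 healthy internal blocks of an RS(6,3) group
    int targets = 1; // e.g. writing 1 reconstructed internal block
    System.out.println("weight = " + (sources + targets)); // weight = 7
  }
}
{code}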



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo

2021-01-29 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang reassigned HDFS-15803:


Assignee: huhaiyang

> Remove unnecessary method (getWeight) in StripedReconstructionInfo 
> ---
>
> Key: HDFS-15803
> URL: https://issues.apache.org/jira/browse/HDFS-15803
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Trivial
> Attachments: HDFS-15803_001.patch
>
>
>  Removing the unused method from StripedReconstructionInfo
> {code:java}
> // StripedReconstructionInfo.java
> /**
>  * Return the weight of this EC reconstruction task.
>  *
>  * DN uses it to coordinate with NN to adjust the speed of scheduling the
>  * reconstructions tasks to this DN.
>  *
>  * @return the weight of this reconstruction task.
>  * @see HDFS-12044
>  */
> int getWeight() {
>   // See HDFS-12044. The weight of a RS(n, k) is calculated by the network
>   // connections it opens.
>   return sources.length + targets.length;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo

2021-01-29 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275504#comment-17275504
 ] 

huhaiyang edited comment on HDFS-15803 at 1/30/21, 7:28 AM:


Uploaded a simple patch. Here is the patch to remove it. No new test case is 
needed.

 


was (Author: haiyang hu):
Here is the patch to remove it. No need for new test case.

> Remove unnecessary method (getWeight) in StripedReconstructionInfo 
> ---
>
> Key: HDFS-15803
> URL: https://issues.apache.org/jira/browse/HDFS-15803
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: huhaiyang
>Priority: Trivial
> Attachments: HDFS-15803_001.patch
>
>
>  Removing the unused method from StripedReconstructionInfo
> {code:java}
> // StripedReconstructionInfo.java
> /**
>  * Return the weight of this EC reconstruction task.
>  *
>  * DN uses it to coordinate with NN to adjust the speed of scheduling the
>  * reconstructions tasks to this DN.
>  *
>  * @return the weight of this reconstruction task.
>  * @see HDFS-12044
>  */
> int getWeight() {
>   // See HDFS-12044. The weight of a RS(n, k) is calculated by the network
>   // connections it opens.
>   return sources.length + targets.length;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo

2021-01-29 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15803:
-
Description: 
 Removing the unused method from StripedReconstructionInfo
{code:java}
// StripedReconstructionInfo.java
/**
 * Return the weight of this EC reconstruction task.
 *
 * DN uses it to coordinate with NN to adjust the speed of scheduling the
 * reconstructions tasks to this DN.
 *
 * @return the weight of this reconstruction task.
 * @see HDFS-12044
 */
int getWeight() {
  // See HDFS-12044. The weight of a RS(n, k) is calculated by the network
  // connections it opens.
  return sources.length + targets.length;
}
{code}

  was: Removing the unused method from StripedReconstructionInfo


> Remove unnecessary method (getWeight) in StripedReconstructionInfo 
> ---
>
> Key: HDFS-15803
> URL: https://issues.apache.org/jira/browse/HDFS-15803
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: huhaiyang
>Priority: Trivial
> Attachments: HDFS-15803_001.patch
>
>
>  Removing the unused method from StripedReconstructionInfo
> {code:java}
> // StripedReconstructionInfo.java
> /**
>  * Return the weight of this EC reconstruction task.
>  *
>  * DN uses it to coordinate with NN to adjust the speed of scheduling the
>  * reconstructions tasks to this DN.
>  *
>  * @return the weight of this reconstruction task.
>  * @see HDFS-12044
>  */
> int getWeight() {
>   // See HDFS-12044. The weight of a RS(n, k) is calculated by the network
>   // connections it opens.
>   return sources.length + targets.length;
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo

2021-01-29 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275504#comment-17275504
 ] 

huhaiyang commented on HDFS-15803:
--

Here is the patch to remove it. No new test case is needed.

> Remove unnecessary method (getWeight) in StripedReconstructionInfo 
> ---
>
> Key: HDFS-15803
> URL: https://issues.apache.org/jira/browse/HDFS-15803
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: huhaiyang
>Priority: Trivial
> Attachments: HDFS-15803_001.patch
>
>
>  Removing the unused method from StripedReconstructionInfo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo

2021-01-29 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15803:
-
Description:  Removing the unused method from StripedReconstructionInfo

> Remove unnecessary method (getWeight) in StripedReconstructionInfo 
> ---
>
> Key: HDFS-15803
> URL: https://issues.apache.org/jira/browse/HDFS-15803
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: huhaiyang
>Priority: Trivial
> Attachments: HDFS-15803_001.patch
>
>
>  Removing the unused method from StripedReconstructionInfo



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo

2021-01-29 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15803:
-
Attachment: HDFS-15803_001.patch

> Remove unnecessary method (getWeight) in StripedReconstructionInfo 
> ---
>
> Key: HDFS-15803
> URL: https://issues.apache.org/jira/browse/HDFS-15803
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: huhaiyang
>Priority: Trivial
> Attachments: HDFS-15803_001.patch
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo

2021-01-29 Thread huhaiyang (Jira)
huhaiyang created HDFS-15803:


 Summary: Remove unnecessary method (getWeight) in 
StripedReconstructionInfo 
 Key: HDFS-15803
 URL: https://issues.apache.org/jira/browse/HDFS-15803
 Project: Hadoop HDFS
  Issue Type: Improvement
Reporter: huhaiyang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
When an EC reconstruction task fails, the decrementXmitsInProgress call in 
processErasureCodingTasks can subtract an incorrect value;
 as a result, the XmitsInProgress counter of the DN can go negative, which affects 
how the NN chooses pending tasks based on the ratio between the lengths of the 
replication and erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection<BlockECReconstructionInfo> ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
decrement,  XmitsInProgress is decremented by the previous value
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
decrement
...
  }
}{code}

  was:
When an EC reconstruction task fails, the decrementXmitsInProgress call in 
processErasureCodingTasks can subtract an incorrect value;
 as a result, the XmitsInProgress counter of the DN can go negative, which affects 
how the NN chooses pending tasks based on the ratio between the lengths of the 
replication and erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection<BlockECReconstructionInfo> ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
decrement, xmitsSubmitted 
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
decrement
...
  }
}{code}


> EC: 

[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
When an EC reconstruction task fails, the decrementXmitsInProgress call in 
processErasureCodingTasks can subtract an incorrect value;
 as a result, the XmitsInProgress counter of the DN can go negative, which affects 
how the NN chooses pending tasks based on the ratio between the lengths of the 
replication and erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection<BlockECReconstructionInfo> ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  task failed 
decrement, xmitsSubmitted 
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete 
decrement
...
  }
}{code}

  was:
When an EC reconstruction task fails, the decrementXmitsInProgress call in 
processErasureCodingTasks can subtract an incorrect value;
 as a result, the XmitsInProgress counter of the DN can go negative, which affects 
how the NN chooses pending tasks based on the ratio between the lengths of the 
replication and erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection<BlockECReconstructionInfo> ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
complete decrement
...
  }
}{code}


> EC: Reconstruct task failed, and It would be Xm

[jira] [Comment Edited] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-01-28 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274126#comment-17274126
 ] 

huhaiyang edited comment on HDFS-15798 at 1/29/21, 3:07 AM:


Thanks for the review, [~sodonnell].

{quote}

If I understand this correctly, this problem can only occur if there are 
several tasks to process in the loop:

1. First pass around the loop, sets xmitsSubmitted = X, say 5.

2. This is used to increment the DN XmitsInProgress.

3. Next pass around the loop, the exception is thrown. As xmitsSubmitted was 
never reset to zero, the DN XmitsInProgress is decremented by the previous 
value from the first pass (5 in this example).

{quote}

Just as you said, this problem can only occur if there are several tasks to 
process in the loop.

As you suggested, I updated the patch.
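
As a rough illustration of the invariant this fix needs to preserve (a hypothetical
sketch, not HDFS code and not necessarily what the attached patches do): whatever
amount is added to the counter for a task should be the only amount that can later be
subtracted for that same task, no matter which failure path observes it first.

{code:java}
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper pattern: bind the amount that was added for a task to the
// task itself, so any catch/finally path can only subtract that same amount, once.
public class XmitsTicket {
  private final AtomicInteger counter;
  private final int amount;
  private boolean released;

  public XmitsTicket(AtomicInteger counter, int amount) {
    this.counter = counter;
    this.amount = amount;
    counter.addAndGet(amount);          // increment exactly once, on creation
  }

  public synchronized void release() {  // safe to call from catch or finally
    if (!released) {
      released = true;
      counter.addAndGet(-amount);       // decrement by exactly what was added
    }
  }

  public static void main(String[] args) {
    AtomicInteger xmits = new AtomicInteger();
    XmitsTicket ticket = new XmitsTicket(xmits, 5);
    ticket.release();
    ticket.release();                   // second release is a no-op
    System.out.println(xmits.get());    // 0, never negative
  }
}
{code}

Whether the patch takes something like this or a simpler route (for example, resetting
xmitsSubmitted on every pass), the invariant is the same.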

 


was (Author: haiyang hu):
Thanks for the review, [~sodonnell].

As you suggested, I updated the patch.

{quote}

If I understand this correctly, this problem can only occur if there are 
several tasks to process in the loop:

1. First pass around the loop, sets xmitsSubmitted = X, say 5.

2. This is used to increment the DN XmitsInProgress.

3. Next pass around the loop, the exception is thrown. As xmitsSubmitted was 
never reset to zero, the DN XmitsInProgress is decremented by the previous 
value from the first pass (5 in this example).

{quote}

Just as you said, this problem can only occur if there are several tasks to 
process in the loop.

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks can subtract an incorrect value;
>  as a result, the XmitsInProgress counter of the DN can go negative, which affects 
> how the NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task 
> start increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
> failed decrement
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
> complete decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-01-28 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274126#comment-17274126
 ] 

huhaiyang commented on HDFS-15798:
--

Thanks for the review, [~sodonnell].

As you suggested, I updated the patch.

{quote}

If I understand this correctly, this problem can only occur if there are 
several tasks to process in the loop:

1. First pass around the loop, sets xmitsSubmitted = X, say 5.

2. This is used to increment the DN XmitsInProgress.

3. Next pass around the loop, the exception is thrown. As xmitsSubmitted was 
never reset to zero, the DN XmitsInProgress is decremented by the previous 
value from the first pass (5 in this example).

{quote}

Just as you said, this problem can only occur if there are several tasks to 
process in the loop.

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks can subtract an incorrect value;
>  as a result, the XmitsInProgress counter of the DN can go negative, which affects 
> how the NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task 
> start increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
> failed decrement
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
> complete decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Attachment: HDFS-15798.002.patch

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress call in 
> processErasureCodingTasks can subtract an incorrect value;
>  as a result, the XmitsInProgress counter of the DN can go negative, which affects 
> how the NN chooses pending tasks based on the ratio between the lengths of the 
> replication and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task 
> start increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
> failed decrement
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
> complete decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
When an EC reconstruction task fails, the decrementXmitsInProgress call in 
processErasureCodingTasks decrements by an abnormal value.
 As a result, the DN's XmitsInProgress can go negative, which affects how the NN 
chooses pending tasks based on the ratio between the lengths of the replication 
and erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
complete decrement
...
  }
}{code}

  was:
When an EC reconstruction task fails, the decrementXmitsInProgress call in 
processErasureCodingTasks executes abnormally.
 As a result, the DN's XmitsInProgress can go negative, which affects how the NN 
chooses pending tasks based on the ratio between the lengths of the replication 
and erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
complete decrement
...
  }
}{code}


> EC: Reconstruct task failed, and It would be Xmi

[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
When an EC reconstruction task fails, the decrementXmitsInProgress call in 
processErasureCodingTasks executes abnormally.
 As a result, the DN's XmitsInProgress can go negative, which affects how the NN 
chooses pending tasks based on the ratio between the lengths of the replication 
and erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
complete decrement
...
  }
}{code}

  was:
When an EC reconstruction task fails, the decrementXmitsInProgress operation is 
performed twice.
 As a result, the DN's XmitsInProgress can go negative, which affects how the NN 
chooses pending tasks based on the ratio between the lengths of the replication 
and erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
complete decrement
...
  }
}{code}


> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has

[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Summary: EC: Reconstruct task failed, and It would be XmitsInProgress of DN 
has negative number  (was: EC: Reconstruct task failed, and the XmitsInProgress 
operation will be performed twice)

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress operation is 
> performed twice.
>  As a result, the DN's XmitsInProgress can go negative, which affects how the NN 
> chooses pending tasks based on the ratio between the lengths of the replication 
> and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task 
> start increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
> failed decrement
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
> complete decrement
> ...
>   }
> }{code}
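
One way to keep the increments and decrements paired is to let exactly one party own the decrement: the task decrements in its finally block once it has been handed to the executor, and the submitter decrements only when that hand-off never happened. The sketch below only illustrates that idea under made-up helper names; it is not necessarily what the attached patch does.

{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;

final class BalancedXmitsSketch {
  private final AtomicInteger xmitsInProgress = new AtomicInteger();

  /** Submits a reconstruction-like task while keeping the counter balanced. */
  void submitTask(ExecutorService pool, Runnable reconstruction, int xmits) {
    xmitsInProgress.addAndGet(xmits);               // increment once, before submit
    boolean submitted = false;
    try {
      pool.submit(() -> {
        try {
          reconstruction.run();
        } finally {
          xmitsInProgress.addAndGet(-xmits);        // the task owns the decrement...
        }
      });
      submitted = true;
    } finally {
      if (!submitted) {
        xmitsInProgress.addAndGet(-xmits);          // ...unless the hand-off failed
      }
    }
  }

  int current() {
    return xmitsInProgress.get();
  }
}
{code}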



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the XmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Summary: EC: Reconstruct task failed, and the XmitsInProgress operation 
will be performed twice  (was: EC: Reconstruct task failed, and the 
decrementXmitsInProgress operation will be performed twice)

> EC: Reconstruct task failed, and the XmitsInProgress operation will be 
> performed twice
> --
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress operation is 
> performed twice.
>  As a result, the DN's XmitsInProgress can go negative, which affects how the NN 
> chooses pending tasks based on the ratio between the lengths of the replication 
> and erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task 
> start increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
> failed decrement
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
> complete decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
When an EC reconstruction task fails, the decrementXmitsInProgress operation is 
performed twice.
 As a result, the DN's XmitsInProgress can go negative, which affects how the NN 
chooses pending tasks based on the ratio between the lengths of the replication 
and erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
complete decrement
...
  }
}{code}

  was:
When an EC reconstruction task fails, the decrementXmitsInProgress operation is 
performed twice.
 As a result, the DN's XmitsInProgress can go negative, which affects how the NN 
chooses pending tasks based on the ratio between the lengths of the replication 
and erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task failed 
decrement
...
  }
}{code}


> EC: Reconstruct task failed, and the decrementXmitsInProgress operation will 
> be performed twice

[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
When an EC reconstruction task fails, the decrementXmitsInProgress operation is 
performed twice.
 As a result, the DN's XmitsInProgress can go negative, which affects how the NN 
chooses pending tasks based on the ratio between the lengths of the replication 
and erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task failed 
decrement
...
  }
}{code}

  was:
When an EC reconstruction task fails, the decrementXmitsInProgress operation is 
performed twice, so the DN's XmitsInProgress can go negative.

 
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task failed 
decrement
...
  }
}{code}


> EC: Reconstruct task failed, and the decrementXmitsInProgress operation will 
> be performed twice
> ---
>
> Key: 

[jira] [Assigned] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang reassigned HDFS-15798:


Assignee: huhaiyang

> EC: Reconstruct task failed, and the decrementXmitsInProgress operation will 
> be performed twice
> ---
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress operation is 
> performed twice, so the DN's XmitsInProgress can go negative.
>  
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task 
> start increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
> failed decrement
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
> failed decrement
> ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Attachment: HDFS-15798.001.patch

> EC: Reconstruct task failed, and the decrementXmitsInProgress operation will 
> be performed twice
> ---
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Priority: Major
> Attachments: HDFS-15798.001.patch
>
>
> When an EC reconstruction task fails, the decrementXmitsInProgress operation is 
> performed twice, so the DN's XmitsInProgress can go negative.
>  
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   ...
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task 
> start increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
> failed decrement
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
> initDecoderIfNecessary();
>...
>   } catch (Throwable e) {
> LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
> getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
> float xmitWeight = getErasureCodingWorker().getXmitWeight();
> // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
> // because if it set to zero, we cannot to measure the xmits submitted
> int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
> getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
> failed decrement
> ...
>   }
> }{code}
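
For reference, both the increment and the decrement clamp the weighted xmit count to at least 1, so the two sides must apply exactly the same formula or the counter drifts over time. A quick worked example with illustrative values:

{code:java}
public class XmitWeightExample {
  public static void main(String[] args) {
    // Illustrative weight; in HDFS this comes from the EC reconstruction xmits weight setting.
    float xmitWeight = 0.5f;
    int[] taskXmits = {1, 2, 5};   // hypothetical per-task xmit counts

    for (int xmits : taskXmits) {
      int counted = Math.max((int) (xmits * xmitWeight), 1);
      // 1 -> 1 (clamped up), 2 -> 1, 5 -> 2
      System.out.println("xmits=" + xmits + " -> counted as " + counted);
    }
  }
}
{code}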



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
When an EC reconstruction task fails, the decrementXmitsInProgress operation is 
performed twice, so the DN's XmitsInProgress can go negative.

 
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task failed 
decrement
...
  }
}{code}

  was:
The EC reconstruct task failed, and the decrementXmitsInProgress operation will 
be performed twice
 It would be XmitsInProgress of DN has negative number
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 1. task failed 
decrement
...
  }
}{code}


> EC: Reconstruct task failed, and the decrementXmitsInProgress operation will 
> be performed twice
> ---
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Is

[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
The EC reconstruct task failed, and the decrementXmitsInProgress operation will 
be performed twice
 It would be XmitsInProgress of DN has negative number
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 1. task failed 
decrement
...
  }
}{code}

  was:
The EC reconstruct task failed, and the decrementXmitsInProgress operation will 
be performed twice
 It would be XmitsInProgress of DN has negative number
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  1. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2. task failed 
decrement
...
  }
}{code}


> EC: Reconstruct task failed, and the decrementXmitsInProgress operation will 
> be performed twice
> ---
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type:

[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
The EC reconstruct task failed, and the decrementXmitsInProgress operation will 
be performed twice
 It would be XmitsInProgress of DN has negative number
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  ...
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  1. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
float xmitWeight = getErasureCodingWorker().getXmitWeight();
// if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
// because if it set to zero, we cannot to measure the xmits submitted
int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2. task failed 
decrement
...
  }
}{code}

  was:
The EC reconstruct task failed, and the decrementXmitsInProgress operation will 
be performed twice
 It would be XmitsInProgress of DN has negative number
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  StripedReconstructionInfo stripedReconInfo =
  new StripedReconstructionInfo(
  reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(),
  reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(),
  reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(),
  reconInfo.getTargetStorageIDs());
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  1. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}
{code}


> EC: Reconstruct task failed, and the decrementXmitsInProgress operation will 
> be performed twice
> ---
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Priority: Major
>
> The EC reconstruct task failed, and the decrementXmitsInProgress operation 
> will be performed twice
>  It would be XmitsInProgress of DN has negative number
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Col

[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
The EC reconstruct task failed, and the decrementXmitsInProgress operation will 
be performed twice
 It would be XmitsInProgress of DN has negative number
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  StripedReconstructionInfo stripedReconInfo =
  new StripedReconstructionInfo(
  reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(),
  reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(),
  reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(),
  reconInfo.getTargetStorageIDs());
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  task start 
increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  1. task 
failed decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}
{code}

  was:
The EC refactoring task failed, and the decrementXmitsInProgress operation will 
be performed twice
 It would be XmitsInProgress of DN has negative number
{code:java}
// ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  StripedReconstructionInfo stripedReconInfo =
  new StripedReconstructionInfo(
  reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(),
  reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(),
  reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(),
  reconInfo.getTargetStorageIDs());
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); // increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); // if 1.decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


{code}


> EC: Reconstruct task failed, and the decrementXmitsInProgress operation will 
> be performed twice
> ---
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Priority: Major
>
> The EC reconstruct task failed, and the decrementXmitsInProgress operation 
> will be performed twice
>  It would be XmitsInProgress of DN has negative number
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   StripedReconstructionInfo stripedReconInfo =
>   new StripedReconstructionInfo(
>   reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(),
>   reconInfo.getLiveBlockIndices(), reconInfo.get

[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Summary: EC: Reconstruct task failed, and the decrementXmitsInProgress 
operation will be performed twice  (was: EC:Reconstruction task failed, and the 
decrementXmitsInProgress operation will be performed twice)

> EC: Reconstruct task failed, and the decrementXmitsInProgress operation will 
> be performed twice
> ---
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Priority: Major
>
> The EC refactoring task failed, and the decrementXmitsInProgress operation 
> will be performed twice
>  It would be XmitsInProgress of DN has negative number
> {code:java}
> // ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   StripedReconstructionInfo stripedReconInfo =
>   new StripedReconstructionInfo(
>   reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(),
>   reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(),
>   reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(),
>   reconInfo.getTargetStorageIDs());
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); // increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); // if 
> 1.decrement
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15798) EC:Reconstruction task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
The EC refactoring task failed, and the decrementXmitsInProgress operation will 
be performed twice
 It would be XmitsInProgress of DN has negative number
{code:java}
// ErasureCodingWorker.java

public void processErasureCodingTasks(
Collection ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
int xmitsSubmitted = 0;
try {
  StripedReconstructionInfo stripedReconInfo =
  new StripedReconstructionInfo(
  reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(),
  reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(),
  reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(),
  reconInfo.getTargetStorageIDs());
  // It may throw IllegalArgumentException from task#stripedReader
  // constructor.
  final StripedBlockReconstructor task =
  new StripedBlockReconstructor(this, stripedReconInfo);
  if (task.hasValidTargets()) {
// See HDFS-12044. We increase xmitsInProgress even the task is only
// enqueued, so that
//   1) NN will not send more tasks than what DN can execute and
//   2) DN will not throw away reconstruction tasks, and instead keeps
//  an unbounded number of tasks in the executor's task queue.
xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
getDatanode().incrementXmitsInProcess(xmitsSubmitted); // increment
stripedReconstructionPool.submit(task);
  } else {
LOG.warn("No missing internal block. Skip reconstruction for task:{}",
reconInfo);
  }
} catch (Throwable e) {
  getDatanode().decrementXmitsInProgress(xmitsSubmitted); // if 1.decrement
  LOG.warn("Failed to reconstruct striped block {}",
  reconInfo.getExtendedBlock().getLocalBlock(), e);
}
  }
}


{code}

  was:
The EC refactoring task failed, and the decrementXmitsInProgress operation will 
be performed twice
 It would be XmitsInProgress of DN has negative number
{code:java}
// code placeholder
{code}


> EC:Reconstruction task failed, and the decrementXmitsInProgress operation 
> will be performed twice
> -
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Priority: Major
>
> The EC refactoring task failed, and the decrementXmitsInProgress operation 
> will be performed twice
>  It would be XmitsInProgress of DN has negative number
> {code:java}
> // ErasureCodingWorker.java
> public void processErasureCodingTasks(
> Collection ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
> int xmitsSubmitted = 0;
> try {
>   StripedReconstructionInfo stripedReconInfo =
>   new StripedReconstructionInfo(
>   reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(),
>   reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(),
>   reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(),
>   reconInfo.getTargetStorageIDs());
>   // It may throw IllegalArgumentException from task#stripedReader
>   // constructor.
>   final StripedBlockReconstructor task =
>   new StripedBlockReconstructor(this, stripedReconInfo);
>   if (task.hasValidTargets()) {
> // See HDFS-12044. We increase xmitsInProgress even the task is only
> // enqueued, so that
> //   1) NN will not send more tasks than what DN can execute and
> //   2) DN will not throw away reconstruction tasks, and instead keeps
> //  an unbounded number of tasks in the executor's task queue.
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
> getDatanode().incrementXmitsInProcess(xmitsSubmitted); // increment
> stripedReconstructionPool.submit(task);
>   } else {
> LOG.warn("No missing internal block. Skip reconstruction for task:{}",
> reconInfo);
>   }
> } catch (Throwable e) {
>   getDatanode().decrementXmitsInProgress(xmitsSubmitted); // if 
> 1.decrement
>   LOG.warn("Failed to reconstruct striped block {}",
>   reconInfo.getExtendedBlock().getLocalBlock(), e);
> }
>   }
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15798) EC:Reconstruction task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
The EC reconstruction task failed, and the decrementXmitsInProgress operation will 
be performed twice.
 As a result, the XmitsInProgress counter of the DN can become negative.
{code:java}
// code placeholder
{code}

  was:
The EC refactoring task failed, and the decrementXmitsInProgress operation will 
be performed twice
It would be XmitsInProgress of DN has negative number


> EC:Reconstruction task failed, and the decrementXmitsInProgress operation 
> will be performed twice
> -
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Priority: Major
>
> The EC reconstruction task failed, and the decrementXmitsInProgress operation 
> will be performed twice.
>  As a result, the XmitsInProgress counter of the DN can become negative.
> {code:java}
> // code placeholder
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15798) EC:Reconstruction task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)
huhaiyang created HDFS-15798:


 Summary: EC:Reconstruction task failed, and the 
decrementXmitsInProgress operation will be performed twice
 Key: HDFS-15798
 URL: https://issues.apache.org/jira/browse/HDFS-15798
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: huhaiyang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15798) EC:Reconstruction task failed, and the decrementXmitsInProgress operation will be performed twice

2021-01-28 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-
Description: 
The EC reconstruction task failed, and the decrementXmitsInProgress operation will 
be performed twice.
As a result, the XmitsInProgress counter of the DN can become negative.

> EC:Reconstruction task failed, and the decrementXmitsInProgress operation 
> will be performed twice
> -
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: huhaiyang
>Priority: Major
>
> The EC reconstruction task failed, and the decrementXmitsInProgress operation 
> will be performed twice.
> As a result, the XmitsInProgress counter of the DN can become negative.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12861) Track speed in DFSClient

2020-12-15 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249676#comment-17249676
 ] 

huhaiyang commented on HDFS-12861:
--

 
{code:java}
// code placeholder
diff --git 
a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PipelineAck.java
 
b/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PipelineAck.java
index be822d664f8..ea216bc04e3 100644
--- 
a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PipelineAck.java
+++ 
b/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PipelineAck.java
@@ -165,6 +165,19 @@ public long getDownstreamAckTimeNanos() {
 return proto.getDownstreamAckTimeNanos();
   }
 
+  /**
+   * Get packet processing time of datanode at the given index in the pipeline.
+   * @param i - datanode index in the pipeline
+   */
+  public long getPacketProcessingTime(int i) {
+if (proto.getPacketProcessingTimeNanosCount() > i) {
+  return proto.getPacketProcessingTimeNanos(i);
+} else {
+  // Return -1 if datanode at this index didn't send this info
+  return -1;
+}
+  }
+  
   /**
* Check if this ack contains error status
* @return true if all statuses are SUCCESS
diff --git 
a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/proto/datatransfer.proto 
b/hadoop-hdfs-project/hadoop-hdfs-client/src/main/proto/datatransfer.proto
index 2356201f04d..dfededb7619 100644
--- a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/proto/datatransfer.proto
+++ b/hadoop-hdfs-project/hadoop-hdfs-client/src/main/proto/datatransfer.proto
@@ -260,6 +260,7 @@ message PipelineAckProto {
   repeated Status reply = 2;
   optional uint64 downstreamAckTimeNanos = 3 [default = 0];
   repeated uint32 flag = 4 [packed=true];
+  repeated uint64 packetProcessingTimeNanos = 100;
 }
{code}
Hi [~elgoiri], a question: I could not find a method in the current patch that 
actually sets the packetProcessingTimeNanos value. Am I missing something?
Looking forward to your reply, thanks!
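For reference, here is a hedged sketch of the kind of sender-side change the question above refers to. It assumes the classes generated from datatransfer.proto (org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos) and the add-method naming that protobuf-java generates for repeated fields; the helper itself is hypothetical and is not part of the attached patch.
{code:java}
import org.apache.hadoop.hdfs.protocol.proto.DataTransferProtos.PipelineAckProto;

// Hypothetical helper, not from the patch: appends this datanode's packet
// processing time to an ack before forwarding it upstream. Assumes the
// builder method addPacketProcessingTimeNanos is generated from the new
// repeated uint64 field shown in the proto diff above.
final class PipelineAckTimeSketch {
  static PipelineAckProto withProcessingTime(PipelineAckProto received,
      long processingTimeNanos) {
    return PipelineAckProto.newBuilder(received)
        .addPacketProcessingTimeNanos(processingTimeNanos)
        .build();
  }
}
{code}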

 

> Track speed in DFSClient
> 
>
> Key: HDFS-12861
> URL: https://issues.apache.org/jira/browse/HDFS-12861
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: María Fernanda Borge
>Priority: Major
> Attachments: HDFS-12861-10-april-18.patch
>
>
> Sometimes we get slow jobs because of the access to HDFS. However, is hard to 
> tell what is the actual speed. We propose to add a log line with something 
> like:
> {code}
> 2017-11-19 09:55:26,309 INFO [main] hdfs.DFSClient: blk_1107222019_38144502 
> READ 129500B in 7ms 17.6MB/s
> 2017-11-27 19:01:04,141 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792057_86833357 WRITE 131072B in 10ms 12.5MB/s
> 2017-11-27 19:01:14,219 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792069_86833369 WRITE 131072B in 12ms 10.4MB/s
> 2017-11-27 19:01:24,282 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792081_86833381 WRITE 131072B in 11ms 11.4MB/s
> 2017-11-27 19:01:34,330 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792093_86833393 WRITE 131072B in 11ms 11.4MB/s
> 2017-11-27 19:01:44,408 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792105_86833405 WRITE 131072B in 11ms 11.4MB/s
> {code}
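As a side note on the proposed log line, a minimal sketch of how such a rate could be computed from bytes and elapsed milliseconds (a hypothetical helper, not from any attached patch); reading MB as MiB, 129500 bytes in 7 ms works out to roughly 17.6 MB/s, matching the sample above.
{code:java}
// Hypothetical helper for illustration: formats a transfer-rate fragment in
// the style of the proposed log line, using MiB (1024 * 1024 bytes).
static String formatRate(long bytes, long millis) {
  double mbPerSec = (bytes / (1024.0 * 1024.0)) / (millis / 1000.0);
  return String.format("%dB in %dms %.1fMB/s", bytes, millis, mbPerSec);
}

// formatRate(129500, 7) -> "129500B in 7ms 17.6MB/s"
{code}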



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15697) Fast copy support EC for HDFS.

2020-11-27 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15697:
-
Description: Enhance FastCopy to support EC file .  (was: Enhance FastCopy 
to support EC file )

> Fast copy support EC for HDFS.
> --
>
> Key: HDFS-15697
> URL: https://issues.apache.org/jira/browse/HDFS-15697
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
>
> Enhance FastCopy to support EC file .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15697) Fast copy support EC for HDFS.

2020-11-27 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15697:
-
External issue ID:   (was: https://issues.apache.org/jira/browse/HDFS-2139)

> Fast copy support EC for HDFS.
> --
>
> Key: HDFS-15697
> URL: https://issues.apache.org/jira/browse/HDFS-15697
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
>
> Enhance FastCopy to support EC file 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15697) Fast copy support EC for HDFS.

2020-11-27 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15697:
-
External issue ID: https://issues.apache.org/jira/browse/HDFS-2139

> Fast copy support EC for HDFS.
> --
>
> Key: HDFS-15697
> URL: https://issues.apache.org/jira/browse/HDFS-15697
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
>
> Enhance FastCopy to support EC file 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15697) Fast copy support EC for HDFS.

2020-11-27 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15697:
-
Description: Enhance FastCopy to support EC file 

> Fast copy support EC for HDFS.
> --
>
> Key: HDFS-15697
> URL: https://issues.apache.org/jira/browse/HDFS-15697
> Project: Hadoop HDFS
>  Issue Type: New Feature
>Reporter: huhaiyang
>Assignee: huhaiyang
>Priority: Major
>
> Enhance FastCopy to support EC file 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15697) Fast copy support EC for HDFS.

2020-11-27 Thread huhaiyang (Jira)
huhaiyang created HDFS-15697:


 Summary: Fast copy support EC for HDFS.
 Key: HDFS-15697
 URL: https://issues.apache.org/jira/browse/HDFS-15697
 Project: Hadoop HDFS
  Issue Type: New Feature
Reporter: huhaiyang
Assignee: huhaiyang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-12861) Track speed in DFSClient

2020-09-28 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203054#comment-17203054
 ] 

huhaiyang commented on HDFS-12861:
--

[~elgoiri] This looks like very good work. Are there plans to merge it into 
trunk? Thanks.

> Track speed in DFSClient
> 
>
> Key: HDFS-12861
> URL: https://issues.apache.org/jira/browse/HDFS-12861
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: María Fernanda Borge
>Priority: Major
> Attachments: HDFS-12861-10-april-18.patch
>
>
> Sometimes we get slow jobs because of the access to HDFS. However, is hard to 
> tell what is the actual speed. We propose to add a log line with something 
> like:
> {code}
> 2017-11-19 09:55:26,309 INFO [main] hdfs.DFSClient: blk_1107222019_38144502 
> READ 129500B in 7ms 17.6MB/s
> 2017-11-27 19:01:04,141 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792057_86833357 WRITE 131072B in 10ms 12.5MB/s
> 2017-11-27 19:01:14,219 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792069_86833369 WRITE 131072B in 12ms 10.4MB/s
> 2017-11-27 19:01:24,282 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792081_86833381 WRITE 131072B in 11ms 11.4MB/s
> 2017-11-27 19:01:34,330 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792093_86833393 WRITE 131072B in 11ms 11.4MB/s
> 2017-11-27 19:01:44,408 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792105_86833405 WRITE 131072B in 11ms 11.4MB/s
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-12861) Track speed in DFSClient

2020-09-28 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203054#comment-17203054
 ] 

huhaiyang edited comment on HDFS-12861 at 9/28/20, 7:29 AM:


[~elgoiri] This looks like very good work. Are there plans to merge it into 
trunk? Thanks.


was (Author: haiyang hu):
[~elgoiri]It looks like very good work And are there plans to merge into the 
trunk? Thanks.

> Track speed in DFSClient
> 
>
> Key: HDFS-12861
> URL: https://issues.apache.org/jira/browse/HDFS-12861
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Íñigo Goiri
>Assignee: María Fernanda Borge
>Priority: Major
> Attachments: HDFS-12861-10-april-18.patch
>
>
> Sometimes we get slow jobs because of the access to HDFS. However, is hard to 
> tell what is the actual speed. We propose to add a log line with something 
> like:
> {code}
> 2017-11-19 09:55:26,309 INFO [main] hdfs.DFSClient: blk_1107222019_38144502 
> READ 129500B in 7ms 17.6MB/s
> 2017-11-27 19:01:04,141 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792057_86833357 WRITE 131072B in 10ms 12.5MB/s
> 2017-11-27 19:01:14,219 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792069_86833369 WRITE 131072B in 12ms 10.4MB/s
> 2017-11-27 19:01:24,282 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792081_86833381 WRITE 131072B in 11ms 11.4MB/s
> 2017-11-27 19:01:34,330 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792093_86833393 WRITE 131072B in 11ms 11.4MB/s
> 2017-11-27 19:01:44,408 INFO [DataStreamer for file 
> /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: 
> blk_1135792105_86833405 WRITE 131072B in 11ms 11.4MB/s
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-04 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190601#comment-17190601
 ] 

huhaiyang edited comment on HDFS-15556 at 9/4/20, 7:28 AM:
---

The current issue is the same as  [HDFS-14042| 
https://issues.apache.org/jira/browse/HDFS-14042].


was (Author: haiyang hu):
The current issue is the same as[HDFS-14042| 
https://issues.apache.org/jira/browse/HDFS-14042].

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by a DataNode, which causes the NN to compute an incorrect maxLoad.
> Because the DataNode is then identified as busy and no available node can be 
> allocated when choosing DataNodes, the resulting retry loop drives CPU usage 
> high and reduces the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}
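A minimal sketch of a possible guard for the loop quoted above; this only illustrates the null case being described and is not necessarily the approach taken in HDFS-15556.001.patch.
{code:java}
// Sketch of a defensive check (illustration only): skip the report when the
// storage has already been pruned from storageMap, instead of dereferencing
// null in storage.receivedHeartbeat(report).
DatanodeStorageInfo storage;
synchronized (storageMap) {
  storage = storageMap.get(report.getStorage().getStorageID());
}
if (storage == null) {
  // The storage was removed (e.g. by a concurrent re-registration that ran
  // pruneStorageMap); ignore this report rather than hitting the NPE below.
  continue;
}
if (checkFailedStorages) {
  failedStorageInfos.remove(storage);
}
storage.receivedHeartbeat(report);
{code}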



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-04 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190601#comment-17190601
 ] 

huhaiyang commented on HDFS-15556:
--

The current issue is the same as [HDFS-14042|https://issues.apache.org/jira/browse/HDFS-14042].

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode appears NPE when processing lifeline messages 
> sent by the DataNode, which will cause an maxLoad exception calculated by NN.
> because DataNode is identified as busy and unable to allocate available nodes 
> in choose  DataNode, program loop execution results in high CPU and reduces 
> the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: NN_DN.LOG

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode appears NPE when processing lifeline messages 
> sent by the DataNode, which will cause an maxLoad exception calculated by NN.
> because DataNode is identified as busy and unable to allocate available nodes 
> in choose  DataNode, program loop execution results in high CPU and reduces 
> the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: (was: NN_DN.LOG)

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode appears NPE when processing lifeline messages 
> sent by the DataNode, which will cause an maxLoad exception calculated by NN.
> because DataNode is identified as busy and unable to allocate available nodes 
> in choose  DataNode, program loop execution results in high CPU and reduces 
> the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:48 PM:


[~hexiaoqiao] Thanks for your comments.
{quote}
Great catch here. v001 is fair for me, it will be better if add new unit test 
to cover.
{quote}
I'll add a unit test to cover it later.

{quote}
I am interested that why storage is null here. Anywhere not synchronized 
storageMap where should do that?
{quote}

The cause of the problem is:
{quote}
1. One DataNode's heartbeats to the NN timed out; when the service recovered, a 
DNA_REGISTER command was issued: 
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN runs DatanodeDescriptor#pruneStorageMap, 
which removes the storageMap entries for the re-registering DN.
3. The DN re-registration took about a minute; once the heartbeat was more than 9 
seconds overdue, the LifelineSender reported to the NN. At that point the DN's 
storage recorded at the NN had already been removed, so the NPE occurred.
{quote}

detailed execution log
 [^NN_DN.LOG] 

The relevant source code:
HeartbeatManager#updateLifeline
{code:java}
synchronized void updateLifeline(final DatanodeDescriptor node, StorageReport[] 
reports, long cacheCapacity, long cacheUsed, int xceiverCount, int failedVolumes,
  VolumeFailureSummary volumeFailureSummary) {
stats.subtract(node);
   // On every DN heartbeat/lifeline report, this DN's xceiver count is first
   // subtracted from nodesInServiceXceiverCount.
...
node.updateHeartbeatState(reports, cacheCapacity, cacheUsed,  xceiverCount, 
failedVolumes, volumeFailureSummary);
  // The NPE is thrown here.

stats.add(node);  // When the NPE is thrown, this line is never executed.
  }
{code}

BlockPlacementPolicyDefault#excludeNodeByLoad
{code:java}
  boolean excludeNodeByLoad(DatanodeDescriptor node){
final double maxLoad = considerLoadFactor *
stats.getInServiceXceiverAverage(); 
//stats.getInServiceXceiverAverage()= 
heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() 
//the final maxLoad value will be affected
final int nodeLoad = node.getXceiverCount();
if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
  logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
  "(load: " + nodeLoad + " > " + maxLoad + ")");
  return true;
}
return false;
  }
{code}
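To make the effect on maxLoad concrete, here is a small self-contained sketch with hypothetical numbers (this is not HDFS code) of what happens when stats.subtract(node) runs but the exception prevents stats.add(node) from running.
{code:java}
// Hypothetical illustration: if subtract(node) runs but the NPE prevents
// add(node), the aggregate xceiver count stays too low, so the computed
// maxLoad shrinks and healthy nodes start to look "too busy".
public class MaxLoadSkewSketch {
  public static void main(String[] args) {
    int datanodesInService = 10;
    int xceiversPerNode = 100;
    double considerLoadFactor = 2.0;

    // Consistent stats: every node's xceivers are accounted for.
    int inServiceXceiverCount = datanodesInService * xceiversPerNode;
    double maxLoad = considerLoadFactor
        * ((double) inServiceXceiverCount / datanodesInService);
    System.out.println("maxLoad (consistent stats)   = " + maxLoad);  // 200.0

    // Each lifeline that hits the NPE leaves the subtraction in place, so the
    // node's xceivers are repeatedly "lost" from the aggregate count.
    int lostUpdates = 5;
    inServiceXceiverCount -= lostUpdates * xceiversPerNode;
    maxLoad = considerLoadFactor
        * ((double) inServiceXceiverCount / datanodesInService);
    System.out.println("maxLoad (after skipped adds) = " + maxLoad);  // 100.0
    // A node with 150 xceivers would now be rejected as NODE_TOO_BUSY.
    System.out.println("150 > maxLoad ? " + (150 > maxLoad));
  }
}
{code}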




was (Author: haiyang hu):
3.the cause of occurred the problem is:
{quote}
1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be 
occurred when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove 
storageMap) for the registered DN
3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, 
the Lifeline reports to NN,
  But at this point, the storageMap is null of the DN is recorded at the NN 
occurred NPE
{quote}

4. detailed execution log
 [^NN_DN.LOG] 

5.Source code is:
HeartbeatManager#updateLifeline
{code:java}
synchronized void updateLifeline(final DatanodeDescriptor node,StorageReport[] 
reports, long cacheCapacity, long cacheUsed,int xceiverCount, int failedVolumes,
  VolumeFailureSummary volumeFailureSummary) {
stats.subtract(node);
   //Every time DN heartbeat report,nodesInServiceXceiverCount will be minus 
the XceiverCount of the DN of the current 
...
node.updateHeartbeatState(reports, cacheCapacity, cacheUsed,  xceiverCount, 
failedVolumes, volumeFailureSummary);
  //NPE exception occurred here throws

stats.add(node);  //Here logic is never executed
  }
{code}

BlockPlacementPolicyDefault#excludeNodeByLoad
{code:java}
  boolean excludeNodeByLoad(DatanodeDescriptor node){
final double maxLoad = considerLoadFactor *
stats.getInServiceXceiverAverage(); 
//stats.getInServiceXceiverAverage()= 
heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() 
//the final maxLoad value will be affected
final int nodeLoad = node.getXceiverCount();
if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
  logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
  "(load: " + nodeLoad + " > " + maxLoad + ")");
  return true;
}
return false;
  }
{code}



> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode appears NPE when processin

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:43 PM:


3.the cause of occurred the problem is:
{quote}
1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be 
occurred when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove 
storageMap) for the registered DN
3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, 
the Lifeline reports to NN,
  But at this point, the storageMap is null of the DN is recorded at the NN 
occurred NPE
{quote}

4. detailed execution log
 [^NN_DN.LOG] 

5.Source code is:
HeartbeatManager#updateLifeline
{code:java}
synchronized void updateLifeline(final DatanodeDescriptor node,StorageReport[] 
reports, long cacheCapacity, long cacheUsed,int xceiverCount, int failedVolumes,
  VolumeFailureSummary volumeFailureSummary) {
stats.subtract(node);
   //Every time DN heartbeat report,nodesInServiceXceiverCount will be minus 
the XceiverCount of the DN of the current 
...
node.updateHeartbeatState(reports, cacheCapacity, cacheUsed,  xceiverCount, 
failedVolumes, volumeFailureSummary);
  //NPE exception occurred here throws

stats.add(node);  //Here logic is never executed
  }
{code}

BlockPlacementPolicyDefault#excludeNodeByLoad
{code:java}
  boolean excludeNodeByLoad(DatanodeDescriptor node){
final double maxLoad = considerLoadFactor *
stats.getInServiceXceiverAverage(); 
//stats.getInServiceXceiverAverage()= 
heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() 
//the final maxLoad value will be affected
final int nodeLoad = node.getXceiverCount();
if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
  logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
  "(load: " + nodeLoad + " > " + maxLoad + ")");
  return true;
}
return false;
  }
{code}




was (Author: haiyang hu):
3.the cause of occurred the problem is:
{quote}
1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be 
occurred when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove 
storageMap) for the registered DN
3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, 
the Lifeline reports to NN,
  But at this point, the storageMap is null of the DN is recorded at the NN 
occurred NPE
{quote}

4. detailed execution log
 [^NN_DN.LOG] 

5.Source code is:

{code:java}
HeartbeatManager#updateLifeline
synchronized void updateLifeline(final DatanodeDescriptor node,StorageReport[] 
reports, long cacheCapacity, long cacheUsed,int xceiverCount, int failedVolumes,
  VolumeFailureSummary volumeFailureSummary) {
stats.subtract(node);
   //Every time DN heartbeat report,nodesInServiceXceiverCount will be minus 
the XceiverCount of the DN of the current 
...
node.updateHeartbeatState(reports, cacheCapacity, cacheUsed,  xceiverCount, 
failedVolumes, volumeFailureSummary);
  //NPE exception occurred here throws

stats.add(node);  //Here logic is never executed
  }
{code}


{code:java}
BlockPlacementPolicyDefault#excludeNodeByLoad
  boolean excludeNodeByLoad(DatanodeDescriptor node){
final double maxLoad = considerLoadFactor *
stats.getInServiceXceiverAverage(); 
//stats.getInServiceXceiverAverage()= 
heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() 
//the final maxLoad value will be affected
final int nodeLoad = node.getXceiverCount();
if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
  logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
  "(load: " + nodeLoad + " > " + maxLoad + ")");
  return true;
}
return false;
  }
{code}



> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode appears NPE when processing lifeline messages 
> sent by the DataNode, which will cause an maxLoad exception calculated by NN.
> because DataNode is identified as busy and unable to allocate available nodes 
> in choose  DataNode, program loop execution results in high CPU and reduces 
> the processing performance of

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:42 PM:


3.the cause of occurred the problem is:
{quote}
1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be 
occurred when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove 
storageMap) for the registered DN
3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, 
the Lifeline reports to NN,
  But at this point, the storageMap is null of the DN is recorded at the NN 
occurred NPE
{quote}

4. detailed execution log
 [^NN_DN.LOG] 

5.Source code is:

{code:java}
HeartbeatManager#updateLifeline
synchronized void updateLifeline(final DatanodeDescriptor node,StorageReport[] 
reports, long cacheCapacity, long cacheUsed,int xceiverCount, int failedVolumes,
  VolumeFailureSummary volumeFailureSummary) {
stats.subtract(node);
   //Every time DN heartbeat report,nodesInServiceXceiverCount will be minus 
the XceiverCount of the DN of the current 
...
node.updateHeartbeatState(reports, cacheCapacity, cacheUsed,  xceiverCount, 
failedVolumes, volumeFailureSummary);
  //NPE exception occurred here throws

stats.add(node);  //Here logic is never executed
  }
{code}


{code:java}
BlockPlacementPolicyDefault#excludeNodeByLoad
  boolean excludeNodeByLoad(DatanodeDescriptor node){
final double maxLoad = considerLoadFactor *
stats.getInServiceXceiverAverage(); 
//stats.getInServiceXceiverAverage()= 
heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() 
//the final maxLoad value will be affected
final int nodeLoad = node.getXceiverCount();
if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
  logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
  "(load: " + nodeLoad + " > " + maxLoad + ")");
  return true;
}
return false;
  }
{code}




was (Author: haiyang hu):
3.the cause of occurred the problem is:
{quote}
1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be 
occurred when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove 
storageMap) for the registered DN
3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, 
the Lifeline reports to NN,
  But at this point, the storageMap is null of the DN is recorded at the NN 
occurred NPE
{quote}

4. detailed execution log
 [^NN_DN.LOG] 

5.Source code is:

{code:java}
HeartbeatManager#updateLifeline

  synchronized void updateLifeline(final DatanodeDescriptor 
node,StorageReport[] reports, long cacheCapacity, long cacheUsed,
  int xceiverCount, int failedVolumes,
  VolumeFailureSummary volumeFailureSummary) {
stats.subtract(node); //Every time DN heartbeat 
report,nodesInServiceXceiverCount will be minus the XceiverCount of the DN of 
the current 
...
node.updateHeartbeatState(reports, cacheCapacity, cacheUsed,
xceiverCount, failedVolumes, volumeFailureSummary); //NPE exception 
occurred here throws
stats.add(node);  //Here logic is never executed
  }

BlockPlacementPolicyDefault#excludeNodeByLoad
  boolean excludeNodeByLoad(DatanodeDescriptor node){
final double maxLoad = considerLoadFactor *
stats.getInServiceXceiverAverage(); 
//stats.getInServiceXceiverAverage()= 
heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() 
//the final maxLoad value will be affected
final int nodeLoad = node.getXceiverCount();
if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
  logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
  "(load: " + nodeLoad + " > " + maxLoad + ")");
  return true;
}
return false;
  }
{code}


> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode appears NPE when processing lifeline messages 
> sent by the DataNode, which will cause an maxLoad exception calculated by NN.
> because DataNode is identified as busy and unable to allocate available nodes 
> in choose  DataNode, program loop execution results in high CPU and reduces 
> the processing performance of the cl

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:41 PM:


3.the cause of occurred the problem is:
{quote}
1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be 
occurred when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove 
storageMap) for the registered DN
3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, 
the Lifeline reports to NN,
  But at this point, the storageMap is null of the DN is recorded at the NN 
occurred NPE
{quote}

4. detailed execution log
 [^NN_DN.LOG] 

5.Source code is:

{code:java}
HeartbeatManager#updateLifeline

  synchronized void updateLifeline(final DatanodeDescriptor 
node,StorageReport[] reports, long cacheCapacity, long cacheUsed,
  int xceiverCount, int failedVolumes,
  VolumeFailureSummary volumeFailureSummary) {
stats.subtract(node); //Every time DN heartbeat 
report,nodesInServiceXceiverCount will be minus the XceiverCount of the DN of 
the current 
...
node.updateHeartbeatState(reports, cacheCapacity, cacheUsed,
xceiverCount, failedVolumes, volumeFailureSummary); //NPE exception 
occurred here throws
stats.add(node);  //Here logic is never executed
  }

BlockPlacementPolicyDefault#excludeNodeByLoad
  boolean excludeNodeByLoad(DatanodeDescriptor node){
final double maxLoad = considerLoadFactor *
stats.getInServiceXceiverAverage(); 
//stats.getInServiceXceiverAverage()= 
heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() 
//the final maxLoad value will be affected
final int nodeLoad = node.getXceiverCount();
if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
  logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
  "(load: " + nodeLoad + " > " + maxLoad + ")");
  return true;
}
return false;
  }
{code}



was (Author: haiyang hu):
3.the cause of occurred the problem is:
{quote}
1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be 
occurred when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove 
storageMap) for the registered DN
3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, 
the Lifeline reports to NN,
  But at this point, the storageMap is null of the DN is recorded at the NN 
occurred NPE
{quote}

//execution log
 [^NN_DN.LOG] 


> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode appears NPE when processing lifeline messages 
> sent by the DataNode, which will cause an maxLoad exception calculated by NN.
> because DataNode is identified as busy and unable to allocate available nodes 
> in choose  DataNode, program loop execution results in high CPU and reduces 
> the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.cal

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:39 PM:


3.the cause of occurred the problem is:
{quote}
1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be 
occurred when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove 
storageMap) for the registered DN
3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, 
the Lifeline reports to NN,
  But at this point, the storageMap is null of the DN is recorded at the NN 
occurred NPE
{quote}

//execution log
 [^NN_DN.LOG] 



was (Author: haiyang hu):
3.the cause of occurred the problem is:
{quote}
1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be 
occurred when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove 
storageMap) for the registered DN
3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, 
the Lifeline reports to NN,
  But at this point, the storageMap is null of the DN is recorded at the NN 
occurred NPE
{quote}

//execution log



> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode appears NPE when processing lifeline messages 
> sent by the DataNode, which will cause an maxLoad exception calculated by NN.
> because DataNode is identified as busy and unable to allocate available nodes 
> in choose  DataNode, program loop execution results in high CPU and reduces 
> the processing performance of the cluster.
> *NameNode the exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:38 PM:


3.the cause of occurred the problem is:
{quote}
1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be 
occurred when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove 
storageMap) for the registered DN
3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, 
the Lifeline reports to NN,
  But at this point, the storageMap is null of the DN is recorded at the NN 
occurred NPE
{quote}

//execution log




was (Author: haiyang hu):
3.the cause of occurred the problem is:
{quote}
1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be 
occurred when the service is restored:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove 
storageMap) for the registered DN
3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, 
the Lifeline reports to NN,
  But at this point, the storageMap is null of the DN is recorded at the NN 
occurred NPE
{quote}


{code:java}
//execution log
//NameNode LOG:
#registered DN:
2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
registerDatanode: from DatanodeRegistration(xxx:50010,xxx) storage xxx
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Removing a 
node: xxx:50010
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: xx:50010
2020-08-25 00:58:53,977 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: 
[DISK]:NORMAL:xxx:50010 failed.
2020-08-25 00:58:53,978 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed 
storage [DISK]xxx:FAILED:xxx:50010 from DataNode xxx:50010
...

#sendLifeline NPE: from 2020-08-25 00:59:02,977 to 2020-08-25 00:59:45,668, It 
keeps occurred the NPE 
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException

...
2020-08-25 00:59:45,668 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 
on 8022, call Call#67833 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException
...

#DN sendHeartBeat the NN will add storageMap:
2020-08-25 00:59:46,632 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new 
storage ID xxx for DN xxx:50010

DN LOG:
#DN run DNA_REGISTER
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeCommand action : DNA_REGISTER from NN:8021 with active state
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block pool BP-xxx (Datanode Uuid xxx) service to NN:8021 beginning handshake 
with NN
2020-08-25 00:59:02,976 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in LifelineSender for Block pool XXX service to NN:8021
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subje

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: NN_DN.LOG

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG
>
>
> In our cluster, the NameNode hits an NPE while processing lifeline messages 
> sent by a DataNode, which skews the maxLoad value the NN calculates.
> Because the DataNode is then identified as busy, no available node can be 
> chosen during DataNode selection; the placement loop keeps retrying, which 
> drives CPU high and reduces the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:36 PM:


3. The cause of the problem is as follows:
{quote}
1. A DataNode's heartbeats to the NN timed out; when the service recovered, a DNA_REGISTER command was triggered:
BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN ran DatanodeDescriptor#pruneStorageMap and removed the DN's entries from storageMap.
3. The DN's re-registration took about a minute; once the heartbeat interval exceeded 9 seconds, the LifelineSender reported to the NN.
  But at that point the NN no longer had a storageMap entry for this DN, so the NPE occurred.
{quote}


{code:java}
//execution log
//NameNode LOG:
#registered DN:
2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
registerDatanode: from DatanodeRegistration(xxx:50010,xxx) storage xxx
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Removing a 
node: xxx:50010
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: xx:50010
2020-08-25 00:58:53,977 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: 
[DISK]:NORMAL:xxx:50010 failed.
2020-08-25 00:58:53,978 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed 
storage [DISK]xxx:FAILED:xxx:50010 from DataNode xxx:50010
...

#sendLifeline NPE: from 2020-08-25 00:59:02,977 to 2020-08-25 00:59:45,668, 
the NPE kept occurring
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException

...
2020-08-25 00:59:45,668 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 
on 8022, call Call#67833 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException
...

#DN sendHeartBeat the NN will add storageMap:
2020-08-25 00:59:46,632 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new 
storage ID xxx for DN xxx:50010

DN LOG:
#DN run DNA_REGISTER
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeCommand action : DNA_REGISTER from NN:8021 with active state
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block pool BP-xxx (Datanode Uuid xxx) service to NN:8021 beginning handshake 
with NN
2020-08-25 00:59:02,976 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in LifelineSender for Block pool XXX service to NN:8021
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
at org.apache.hadoop.ipc.Client.call(Client.java:1457)
at org.apache.hadoop.ipc.Client.call(Client.java:1367)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:36 PM:


3. The cause of the problem is as follows:
{quote}
1. A DataNode's heartbeats to the NN timed out; when the service recovered, a DNA_REGISTER command was triggered:
BPServiceActor#run-->offerService-->processCommand-->reRegister-->sendHeartBeat
2. While handling registerDatanode, the NN ran DatanodeDescriptor#pruneStorageMap and removed the DN's entries from storageMap.
3. The DN's re-registration took about a minute; once the heartbeat interval exceeded 9 seconds, the LifelineSender reported to the NN.
  But at that point the NN no longer had a storageMap entry for this DN, so the NPE occurred.
{quote}


{code:java}
//execution log
//NameNode LOG:
#registered DN:
2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
registerDatanode: from DatanodeRegistration(xxx:50010,xxx) storage xxx
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Removing a 
node: xxx:50010
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: xx:50010
2020-08-25 00:58:53,977 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: 
[DISK]:NORMAL:xxx:50010 failed.
2020-08-25 00:58:53,978 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed 
storage [DISK]xxx:FAILED:xxx:50010 from DataNode xxx:50010
...

#sendLifeline NPE: from 2020-08-25 00:59:02,977 to 2020-08-25 00:59:45,668, 
the NPE kept occurring
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException

...
2020-08-25 00:59:45,668 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 
on 8022, call Call#67833 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException
...

#DN sendHeartBeat the NN will add storageMap:
2020-08-25 00:59:46,632 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new 
storage ID xxx for DN xxx:50010

DN LOG:
#DN run DNA_REGISTER
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeCommand action : DNA_REGISTER from NN:8021 with active state
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block pool BP-xxx (Datanode Uuid xxx) service to NN:8021 beginning handshake 
with NN
2020-08-25 00:59:02,976 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in LifelineSender for Block pool XXX service to NN:8021
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
at org.apache.hadoop.ipc.Client.call(Client.java:1457)
at org.apache.hadoop.ipc.Client.call(Client.java:1367)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$P

[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108
 ] 

huhaiyang commented on HDFS-15556:
--

3. The cause of the problem is as follows (a sketch of a possible guard follows the quote):
{quote}
1. A DataNode's heartbeats to the NN timed out; when the service recovered, a DNA_REGISTER command was triggered:
BPServiceActor#run-->offerService-->processCommand-->reRegister-->sendHeartBeat
2. While handling registerDatanode, the NN ran DatanodeDescriptor#pruneStorageMap and removed the DN's entries from storageMap.
3. The DN's re-registration took about a minute; once the heartbeat interval exceeded 9 seconds, the LifelineSender reported to the NN.
  But at that point the NN no longer had a storageMap entry for this DN, so the NPE occurred.
{quote}
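
A minimal sketch of one way to guard this path (only my illustration of the idea; it may not match what HDFS-15556.001.patch actually does): treat a missing storageMap entry as a stale report and skip it instead of dereferencing null.

{code:java}
// Illustrative fragment of DatanodeDescriptor#updateStorageStats with a guard;
// names follow the snippet quoted in the issue description, the null check
// itself is the assumed change.
for (StorageReport report : reports) {
  DatanodeStorageInfo storage;
  synchronized (storageMap) {
    storage = storageMap.get(report.getStorage().getStorageID());
  }
  if (storage == null) {
    // The storage was pruned by a concurrent re-registration; skip this
    // report instead of hitting the NPE in receivedHeartbeat().
    continue;
  }
  if (checkFailedStorages) {
    failedStorageInfos.remove(storage);
  }
  storage.receivedHeartbeat(report);
  // skip accounting for capacity of PROVIDED storages!
  if (StorageType.PROVIDED.equals(storage.getStorageType())) {
    continue;
  }
  ...
}
{code}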


{code:java}
//execution log
//NameNode LOG:
#registered DN:
2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
registerDatanode: from DatanodeRegistration(xxx:50010,xxx) storage xxx
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Removing a 
node: xxx:50010
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a 
new node: xx:50010
2020-08-25 00:58:53,977 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: 
[DISK]:NORMAL:xxx:50010 failed.
2020-08-25 00:58:53,978 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed 
storage [DISK]xxx:FAILED:xxx:50010 from DataNode xxx:50010
...

#sendLifeline NPE: from 2020-08-25 00:59:02,977 to 2020-08-25 00:59:45,668, 
the NPE kept occurring
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException

...
2020-08-25 00:59:45,668 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 
on 8022, call Call#67833 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from DN:34766
java.lang.NullPointerException
...

#DN sendHeartBeat the NN will add storageMap:
2020-08-25 00:59:46,632 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new 
storage ID xxx for DN xxx:50010

DN LOG:
#DN run DNA_REGISTER
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
DatanodeCommand action : DNA_REGISTER from NN:8021 with active state
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
Block pool BP-xxx (Datanode Uuid xxx) service to NN:8021 beginning handshake 
with NN
2020-08-25 00:59:02,976 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
IOException in LifelineSender for Block pool XXX service to NN:8021
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): 
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
at org.apache.hadoop.ipc.Client.call(Client.java:1457)
at org.apache.hadoop.ipc.Client.call(Client.java:1367)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy21.sendLifeline(Unknown Source)
at 

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189991#comment-17189991
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 9:30 AM:
---

1. NameNode CPU is high; the thread stack is:

{code:java}
"IPC Server handler 59 on 8020" #244 daemon prio=5 os_prio=0 
tid=0x7f18b0ff7800 nid=0x1c006 runnable [0x7f185cbfc000]
   java.lang.Thread.State: RUNNABLE
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at 
org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
at 
org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
at 
org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
at 
org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:903)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:800)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:768)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseFromNextRack(BlockPlacementPolicyDefault.java:719)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:687)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:534)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:440)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:310)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:149)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:174)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2239)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2828)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:913)
{code}

  
2. A large number of these logs appear, and in extreme cases no DN node in the 
cluster satisfies the allocation:

{code:java}
2020-08-25 01:38:50,370 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not enough 
replicas was chosen. Reason:{NODE_TOO_BUSY=xxx}
2020-08-25 01:38:50,370 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to 
place enough replicas, still in need of 3 to reach 3 (unavailableStoragrom 
storage xxx node DatanodeRegistration(:50010, datanodeUuid=xxx, 
infoPort=50075, infoSecurePor
t=0, ipcPort=50020, storageInfo=lv=-57;cid=xxx;nsid=;c=0), blocks: 2266, 
hasStaleStorage: false, processing time: 7 msecs, invalidatedBlocks: 0
2020-08-25 01:38:50,370 INFO 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not enough 
replicas was chosen. Reason:{NODE_TOO_BUSY=xxx}
{code}
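
For context on why this degrades into a hot loop: with load-based placement enabled, the default policy rejects any node whose active transfer (xceiver) count exceeds a multiple of the cluster average; once lifeline handling keeps failing, the statistics that comparison relies on are no longer maintained correctly and every candidate can end up rejected. Below is a simplified model of that check, not the actual BlockPlacementPolicyDefault source; the 2.0 factor corresponds, as far as I know, to the default of dfs.namenode.redundancy.considerLoad.factor in 3.x.

{code:java}
/**
 * Simplified model of the NODE_TOO_BUSY decision. If a DataNode's heartbeat
 * statistics are stale or skewed (e.g. because lifeline handling keeps
 * throwing), nodes may be rejected on every pass; the chooser then keeps
 * iterating and repeatedly takes the NetworkTopology read lock (visible in
 * the thread stack above), which shows up as high NameNode CPU.
 */
static boolean isNodeTooBusy(int nodeXceiverCount,
                             double clusterAvgXceiverCount,
                             double considerLoadFactor /* default 2.0 */) {
  double maxLoad = considerLoadFactor * clusterAvgXceiverCount;
  return maxLoad > 0 && nodeXceiverCount > maxLoad;
}
{code}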






was (Author: haiyang hu):
1. CPU NameNode high, thread stack is

{code:java}
"IPC Server handler 59 on 8020" #244 daemon prio=5 os_prio=0 
tid=0x7f18b0ff7800 nid=0x1c006 runnable [0x7f185cbfc000]
   java.lang.Thread.State: RUNNABLE
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at 
org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
at 
org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
at 
org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
at 
org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:903)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementP

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189991#comment-17189991
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 9:25 AM:
---

1. NameNode CPU is high; the thread stack is:

{code:java}
"IPC Server handler 59 on 8020" #244 daemon prio=5 os_prio=0 
tid=0x7f18b0ff7800 nid=0x1c006 runnable [0x7f185cbfc000]
   java.lang.Thread.State: RUNNABLE
at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
at 
java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
at 
org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
at 
org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
at 
org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
at 
org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:903)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:800)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:768)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseFromNextRack(BlockPlacementPolicyDefault.java:719)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:687)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:534)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:440)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:310)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:149)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:174)
at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2239)
at 
org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2828)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:913)
{code}

  
2.


was (Author: haiyang hu):
# CPU NameNode high, thread stack is
  
# 

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png
>
>
> In our cluster, the NameNode hits an NPE while processing lifeline messages 
> sent by a DataNode, which skews the maxLoad value the NN calculates.
> Because the DataNode is then identified as busy, no available node can be 
> chosen during DataNode selection; the placement loop keeps retrying, which 
> drives CPU high and reduces the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.D

[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189991#comment-17189991
 ] 

huhaiyang edited comment on HDFS-15556 at 9/3/20, 9:24 AM:
---

# NameNode CPU is high; the thread stack is:
  
# 


was (Author: haiyang hu):
# CPU NameNode high, thread stack is
  !NN-jstack.png! 
# 

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png
>
>
> In our cluster, the NameNode hits an NPE while processing lifeline messages 
> sent by a DataNode, which skews the maxLoad value the NN calculates.
> Because the DataNode is then identified as busy, no available node can be 
> chosen during DataNode selection; the placement loop keeps retrying, which 
> drives CPU high and reduces the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: (was: NN-jstack.png)

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png
>
>
> In our cluster, the NameNode hits an NPE while processing lifeline messages 
> sent by a DataNode, which skews the maxLoad value the NN calculates.
> Because the DataNode is then identified as busy, no available node can be 
> chosen during DataNode selection; the placement loop keeps retrying, which 
> drives CPU high and reduces the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189991#comment-17189991
 ] 

huhaiyang commented on HDFS-15556:
--

# NameNode CPU is high; the thread stack is:
  !NN-jstack.png! 
# 

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN-jstack.png
>
>
> In our cluster, the NameNode hits an NPE while processing lifeline messages 
> sent by a DataNode, which skews the maxLoad value the NN calculates.
> Because the DataNode is then identified as busy, no available node can be 
> chosen during DataNode selection; the placement loop keeps retrying, which 
> drives CPU high and reduces the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: NN-jstack.png

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN-jstack.png
>
>
> In our cluster, the NameNode hits an NPE while processing lifeline messages 
> sent by a DataNode, which skews the maxLoad value the NN calculates.
> Because the DataNode is then identified as busy, no available node can be 
> chosen during DataNode selection; the placement loop keeps retrying, which 
> drives CPU high and reduces the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: screenshot-1.png

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN-jstack.png
>
>
> In our cluster, the NameNode hits an NPE while processing lifeline messages 
> sent by a DataNode, which skews the maxLoad value the NN calculates.
> Because the DataNode is then identified as busy, no available node can be 
> chosen during DataNode selection; the placement loop keeps retrying, which 
> drives CPU high and reduces the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: (was: screenshot-1.png)

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png, NN-jstack.png
>
>
> In our cluster, the NameNode hits an NPE while processing lifeline messages 
> sent by a DataNode, which skews the maxLoad value the NN calculates.
> Because the DataNode is then identified as busy, no available node can be 
> chosen during DataNode selection; the placement loop keeps retrying, which 
> drives CPU high and reduces the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: NN-CPU.png

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch, NN-CPU.png
>
>
> In our cluster, the NameNode hits an NPE while processing lifeline messages 
> sent by a DataNode, which skews the maxLoad value the NN calculates.
> Because the DataNode is then identified as busy, no available node can be 
> chosen during DataNode selection; the placement loop keeps retrying, which 
> drives CPU high and reduces the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: HDFS-15556.001.patch

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch
>
>
> In our cluster, the NameNode throws an NPE while processing lifeline messages 
> sent by a DataNode, which causes the NN to calculate an incorrect maxLoad. 
> Because the DataNode is then treated as busy and no available node can be 
> allocated when choosing DataNodes, the placement loop keeps retrying, driving 
> up CPU usage and reducing the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: (was: NN-CPU.png)

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: HDFS-15556.001.patch
>
>
> In our cluster, the NameNode throws an NPE while processing lifeline messages 
> sent by a DataNode, which causes the NN to calculate an incorrect maxLoad. 
> Because the DataNode is then treated as busy and no available node can be 
> allocated when choosing DataNodes, the placement loop keeps retrying, driving 
> up CPU usage and reducing the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Attachment: NN-CPU.png

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
> Attachments: NN-CPU.png
>
>
> In our cluster, the NameNode throws an NPE while processing lifeline messages 
> sent by a DataNode, which causes the NN to calculate an incorrect maxLoad. 
> Because the DataNode is then treated as busy and no available node can be 
> allocated when choosing DataNodes, the placement loop keeps retrying, driving 
> up CPU usage and reducing the processing performance of the cluster.
> *NameNode exception stack*:
> {code:java}
> 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 5 on 8022, call Call#20535 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from x:34766
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at 
> org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
> {code:java}
> // DatanodeDescriptor#updateStorageStats
> ...
> for (StorageReport report : reports) {
>   DatanodeStorageInfo storage = null;
>   synchronized (storageMap) {
> storage =
> storageMap.get(report.getStorage().getStorageID());
>   }
>   if (checkFailedStorages) {
> failedStorageInfos.remove(storage);
>   }
>   storage.receivedHeartbeat(report);  //  NPE exception occurred here 
>   // skip accounting for capacity of PROVIDED storages!
>   if (StorageType.PROVIDED.equals(storage.getStorageType())) {
> continue;
>   }
> ...
> {code}
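
The high-CPU symptom described above comes from load-aware target selection: once a DataNode's stats stop being updated, the NameNode can keep classifying it as too busy and keep iterating over candidates. Below is a simplified, hypothetical illustration of that maxLoad comparison; names such as considerLoadFactor and LoadCheckSketch are assumptions for the sketch, not the exact HDFS code.
{code:java}
// Simplified illustration (not Hadoop source) of the load check behind the
// "DataNode is identified as busy" symptom: a node is skipped as a placement
// target when its xceiver count exceeds the cluster average scaled by a load
// factor. If heartbeat/lifeline stats stop being applied because of the NPE,
// nodes can keep failing this check and the chooser keeps looping.
public class LoadCheckSketch {
  static boolean isConsideredBusy(int nodeXceiverCount,
                                  double clusterAvgXceiverCount,
                                  double considerLoadFactor) {
    double maxLoad = considerLoadFactor * clusterAvgXceiverCount;
    return maxLoad > 0 && nodeXceiverCount > maxLoad;
  }

  public static void main(String[] args) {
    // A node at 120 active xceivers vs. a cluster average of 40 with factor
    // 2.0 would be treated as busy and excluded from target selection.
    System.out.println(isConsideredBusy(120, 40.0, 2.0)); // prints: true
  }
}
{code}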



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
because DataNode is identified as busy and unable to allocate available nodes 
in choose  DataNode, program loop execution results in high CPU and reduces the 
processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


{code:java}
// DatanodeDescriptor#updateStorageStats
...
for (StorageReport report : reports) {

  DatanodeStorageInfo storage = null;
  synchronized (storageMap) {
storage =
storageMap.get(report.getStorage().getStorageID());
  }
  if (checkFailedStorages) {
failedStorageInfos.remove(storage);
  }

  storage.receivedHeartbeat(report);  //  NPE exception occurred here 
  // skip accounting for capacity of PROVIDED storages!
  if (StorageType.PROVIDED.equals(storage.getStorageType())) {
continue;
  }
...
{code}


  was:
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
because DataNode is identified as busy and unable to allocate available nodes 
in choose  DataNode, program loop execution results in high CPU and reduces the 
processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.r

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
because DataNode is identified as busy and unable to allocate available nodes 
in choose  DataNode, program loop execution results in high CPU and reduces the 
processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


{code:java}
// DatanodeDescriptor#updateStorageStats
...
for (StorageReport report : reports) {

  DatanodeStorageInfo storage = null;
  synchronized (storageMap) {
storage =
storageMap.get(report.getStorage().getStorageID());
  }
  if (checkFailedStorages) {
failedStorageInfos.remove(storage);
  }

  storage.receivedHeartbeat(report);  //  NPE exception occurred here 
  // skip accounting for capacity of PROVIDED storages!
  if (StorageType.PROVIDED.equals(storage.getStorageType())) {
continue;
  }
...
{code}


  was:
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.r

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


{code:java}
// DatanodeDescriptor#updateStorageStats
...
for (StorageReport report : reports) {

  DatanodeStorageInfo storage = null;
  synchronized (storageMap) {
storage =
storageMap.get(report.getStorage().getStorageID());
  }
  if (checkFailedStorages) {
failedStorageInfos.remove(storage);
  }

  storage.receivedHeartbeat(report);  //  NPE exception occurred here 
  // skip accounting for capacity of PROVIDED storages!
  if (StorageType.PROVIDED.equals(storage.getStorageType())) {
continue;
  }
...
{code}


  was:
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.r

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


{code:java}
// DatanodeDescriptor#updateStorageStats
...
for (StorageReport report : reports) {

  DatanodeStorageInfo storage = null;
  synchronized (storageMap) {
storage =
storageMap.get(report.getStorage().getStorageID());
  }
  if (checkFailedStorages) {
failedStorageInfos.remove(storage);
  }

  storage.receivedHeartbeat(report);  // an NPE exception is occur here 
  // skip accounting for capacity of PROVIDED storages!
  if (StorageType.PROVIDED.equals(storage.getStorageType())) {
continue;
  }
...
{code}


  was:
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


{code:java}
// DatanodeDescriptor#updateStorageStats
...
for (StorageReport report : reports) {

  DatanodeStorageInfo storage = null;
  synchronized (storageMap) {
storage =
storageMap.get(report.getStorage().getStorageID());
  }
  if (checkFailedStorages) {
failedStorageInfos.remove(storage);
  }

  storage.receivedHeartbeat(report);  // an NPE exception is raised here 
  // skip accounting for capacity of PROVIDED storages!
  if (StorageType.PROVIDED.equals(storage.getStorageType())) {
continue;
  }
...
{code}


  was:
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCal

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


{code:java}
// DatanodeDescriptor#updateStorageStats
...
for (StorageReport report : reports) {

  DatanodeStorageInfo storage = null;
  synchronized (storageMap) {
storage =
storageMap.get(report.getStorage().getStorageID());
  }
  if (checkFailedStorages) {
failedStorageInfos.remove(storage);
  }

  storage.receivedHeartbeat(report);  /{color:red}/an NPE exception is 
raised here{color}
  // skip accounting for capacity of PROVIDED storages!
  if (StorageType.PROVIDED.equals(storage.getStorageType())) {
continue;
  }
...
{code}


  was:
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


{code:java}
// DatanodeDescriptor#updateStorageStats
...
for (StorageReport report : reports) {

  DatanodeStorageInfo storage = null;
  synchronized (storageMap) {
storage =
storageMap.get(report.getStorage().getStorageID());
  }
  if (checkFailedStorages) {
failedStorageInfos.remove(storage);
  }

  storage.receivedHeartbeat(report);  //{color:red}an NPE exception is 
raised here{color}
  // skip accounting for capacity of PROVIDED storages!
  if (StorageType.PROVIDED.equals(storage.getStorageType())) {
continue;
  }
...
{code}


  was:
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


  was:
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:
460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.jav
a:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}



> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
>   

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 
on 8022, call Call#20535 Retry#0 org.ap
ache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from 
x:34766
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:
460)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.jav
a:390)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


  was:
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 8022, call Call#68269 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from xxx:47138
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}



> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> 

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 8022, call Call#68269 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from xxx:47138
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


  was:
In our cluster, the NameNode appears NPE when processing lifeline messages sent 
by the DataNode, which will cause an maxLoad exception calculated by NN.
In choose  DataNode because DataNode is identified as busy and unable to 
allocate available nodes, program loop execution results in high CPU and 
reduces the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-09-02 11:01:57,043 DEBUG org.apache.hadoop.ipc.Server: Served: 
sendLifeline, queueTime= 2 procesingTime= 0 exception= NullPointerException
2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 8022, call Call#68269 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from xxx:47138
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}



> Fix NPE in DatanodeDescriptor#updateStorageStats 

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.
When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.

*NameNode the exception stack:
{code:java}
2020-09-02 11:01:57,043 DEBUG org.apache.hadoop.ipc.Server: Served: 
sendLifeline, queueTime= 2 procesingTime= 0 exception= NullPointerException
2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 8022, call Call#68269 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from xxx:47138
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


  was:
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.
When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.


{code:java}

2020-09-02 11:01:57,043 DEBUG org.apache.hadoop.ipc.Server: Served: 
sendLifeline, queueTime= 2 procesingTime= 0 exception= NullPointerException
2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 8022, call Call#68269 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from xxx:47138
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.ap

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.
When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.

*NameNode the exception stack*:
{code:java}
2020-09-02 11:01:57,043 DEBUG org.apache.hadoop.ipc.Server: Served: 
sendLifeline, queueTime= 2 procesingTime= 0 exception= NullPointerException
2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 8022, call Call#68269 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from xxx:47138
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


  was:
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.
When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.

*NameNode the exception stack:
{code:java}
2020-09-02 11:01:57,043 DEBUG org.apache.hadoop.ipc.Server: Served: 
sendLifeline, queueTime= 2 procesingTime= 0 exception= NullPointerException
2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 8022, call Call#68269 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from xxx:47138
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformatio

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.
When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.


{code:java}

2020-09-02 11:01:57,043 DEBUG org.apache.hadoop.ipc.Server: Served: 
sendLifeline, queueTime= 2 procesingTime= 0 exception= NullPointerException
2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 8022, call Call#68269 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from xxx:47138
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


  was:
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.
When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.


{code:java}
NameNode the exception stack:
2020-09-02 11:01:57,043 DEBUG org.apache.hadoop.ipc.Server: Served: 
sendLifeline, queueTime= 2 procesingTime= 0 exception= NullPointerException
2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 8022, call Call#68269 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from xxx:47138
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.ap

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.
When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.


{code:java}
NameNode the exception stack:
2020-09-02 11:01:57,043 DEBUG org.apache.hadoop.ipc.Server: Served: 
sendLifeline, queueTime= 2 procesingTime= 0 exception= NullPointerException
2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 
on 8022, call Call#68269 Retry#0 
org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
from xxx:47138
java.lang.NullPointerException
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
at 
org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at 
org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at 
org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at 
org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}


  was:
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.
When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.


> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by the DataNode, which causes the maxLoad value calculated by the NN to be 
> abnormal.
> When choosing a DataNode, every DataNode is then identified as busy and no 
> available node can be allocated, so the chooser loops repeatedly, driving CPU 
> usage up and reducing the processing performance of the cluster.
> {code:java}
> NameNode the exception stack:
> 2020-09-02 11:01:57,043 DEBUG org.apache.hadoop.ipc.Server: Served: 
> sendLifeline, queueTime= 2 procesingTime= 0 exception= NullPointerException
> 2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 
> 0 on 8022, call Call#68269 Retry#0 
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline 
> from xxx:47138
> java.lang.NullPointerException
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at 
> org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at 

[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.
When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.

  was:
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.

When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.


> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by the DataNode, which causes the maxLoad value calculated by the NN to be 
> abnormal.
> When choosing a DataNode, every DataNode is then identified as busy and no 
> available node can be allocated, so the chooser loops repeatedly, driving CPU 
> usage up and reducing the processing performance of the cluster.
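
To make the busy-node exclusion above concrete, here is a minimal, self-contained 
sketch; the class and method names are illustrative, not the actual 
BlockPlacementPolicyDefault code. A DataNode is skipped when its xceiver count 
exceeds a factor of the cluster average, so an abnormal average caused by the 
lifeline NPE can make every candidate look overloaded and leave the chooser 
retrying in a loop.

{code:java}
public class MaxLoadSketch {

  // A node counts as "too busy" when its xceiver count exceeds
  // factor * cluster-average xceivers (the maxLoad mentioned above).
  static boolean isOverloaded(int nodeXceivers, double clusterAvgXceivers, double factor) {
    double maxLoad = factor * clusterAvgXceivers;
    return maxLoad > 0 && nodeXceivers > maxLoad;
  }

  public static void main(String[] args) {
    // With a distorted (too small) average, even a lightly loaded node is rejected.
    System.out.println(isOverloaded(12, 5.0, 2.0));  // true -> this node would be skipped
  }
}
{code}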



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.

When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.

  was:
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.
When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.


> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by the DataNode, which causes the maxLoad value calculated by the NN to be 
> abnormal.
> When choosing a DataNode, every DataNode is then identified as busy and no 
> available node can be allocated, so the chooser loops repeatedly, driving CPU 
> usage up and reducing the processing performance of the cluster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15556:
-
Description: 
In our cluster, the NameNode throws an NPE when processing lifeline messages sent 
by the DataNode, which causes the maxLoad value calculated by the NN to be abnormal.
When choosing a DataNode, every DataNode is then identified as busy and no 
available node can be allocated, so the chooser loops repeatedly, driving CPU 
usage up and reducing the processing performance of the cluster.
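
A minimal, self-contained model of the suspected failure mode follows; the class 
and field names are hypothetical, not the actual DatanodeDescriptor code. The 
assumption is that the heartbeat path registers previously unseen storages before 
updating them, while the lifeline path only looks them up, so an unregistered 
storage yields null and the unguarded update throws the NPE shown in the stack 
trace; the defensive fix is to skip such storages instead of failing the whole 
lifeline RPC.

{code:java}
import java.util.HashMap;
import java.util.Map;

public class LifelineNpeSketch {
  static class StorageInfo { long capacityUsed; }

  static final Map<String, StorageInfo> storageMap = new HashMap<>();

  static void updateStorageStats(String storageId, long used, boolean fromLifeline) {
    StorageInfo storage = storageMap.get(storageId);
    if (storage == null) {
      if (fromLifeline) {
        // Sketch of the defensive fix: skip storages the NN has not registered yet.
        System.out.println("Unknown storage " + storageId + " in lifeline, skipping");
        return;
      }
      // The heartbeat path registers new storages before updating them.
      storage = new StorageInfo();
      storageMap.put(storageId, storage);
    }
    storage.capacityUsed = used;  // without the guard above, this line throws the NPE
  }

  public static void main(String[] args) {
    // A lifeline arrives for a storage the NameNode has never seen via a heartbeat.
    updateStorageStats("DS-unknown", 1024L, true);
  }
}
{code}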

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
> 
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages 
> sent by the DataNode, which causes the maxLoad value calculated by the NN to be 
> abnormal.
> When choosing a DataNode, every DataNode is then identified as busy and no 
> available node can be allocated, so the chooser loops repeatedly, driving CPU 
> usage up and reducing the processing performance of the cluster.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline

2020-09-03 Thread huhaiyang (Jira)
huhaiyang created HDFS-15556:


 Summary: Fix NPE in DatanodeDescriptor#updateStorageStats when 
handle DN Lifeline
 Key: HDFS-15556
 URL: https://issues.apache.org/jira/browse/HDFS-15556
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: namenode
Affects Versions: 3.2.0
Reporter: huhaiyang






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted

2020-06-15 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135712#comment-17135712
 ] 

huhaiyang commented on HDFS-15391:
--

Thanks [~hexiaoqiao] for helping to resolve this.

> Standby NameNode due loads the corruption edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the editlog, resulting in an abnormal exit of the service and 
> failure to restart
> {noformat}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed :
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}
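
A hedged reproduction sketch of that client-side sequence, using only the public 
FileSystem API; the path and sizes below are made up, and the real Flink recovery 
path is more involved.

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CloseTruncateAppendSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path("/tmp/close-truncate-append-demo");  // hypothetical path

    // 1. close file: write some data and close the stream.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write(new byte[4096]);
    }

    // 2-3. open + truncate: truncate() returns false while block recovery is
    // still in progress, so a real client would wait or retry here.
    boolean completed = fs.truncate(file, 1024);
    if (!completed) {
      Thread.sleep(2000);  // crude wait for truncate recovery, for this sketch only
    }

    // 4. append file.
    try (FSDataOutputStream out = fs.append(file)) {
      out.write(new byte[512]);
    }
  }
}
{code}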



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted

2020-06-15 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135698#comment-17135698
 ] 

huhaiyang commented on HDFS-15391:
--

[~liuml07] Thank you for the reply!
The current issue is the same as 
[HDFS-15175|https://issues.apache.org/jira/browse/HDFS-15175], and 
[HDFS-15175|https://issues.apache.org/jira/browse/HDFS-15175] already has a 
submitted patch and is ready for the fix.

> Standby NameNode due loads the corruption edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the editlog, resulting in an abnormal exit of the service and 
> failure to restart
> {noformat}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed :
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog

2020-06-12 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134129#comment-17134129
 ] 

huhaiyang commented on HDFS-15175:
--

Hi [~wanchang], thank you for the reply.
I described the relevant information in 
[HDFS-15175|https://issues.apache.org/jira/browse/HDFS-15175].
Our current code does compatibility handling and skips the exception op.
Let me look at your patch. Thank you again!

> Multiple CloseOp shared block instance causes the standby namenode to crash 
> when rolling editlog
> 
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.9.2
>Reporter: Yicong Cai
>Assignee: Yicong Cai
>Priority: Critical
>
>  
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log 
> tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp 
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, 
> atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845], 
> permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, 
> clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, 
> txid=32625024993]
>  java.io.IOException: File is not under construction: ..
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
>  at 
> org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:360)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
>  at 
> org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
>  at 
> org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>  
> {panel:title=Editlog}
> 
>  OP_REASSIGN_LEASE
>  
>  32625021150
>  DFSClient_NONMAPREDUCE_-969060727_197760
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625023743
>  0
>  0
>  ..
>  3
>  1581816135883
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> ..
> 
>  OP_TRUNCATE
>  
>  32625024049
>  ..
>  DFSClient_NONMAPREDUCE_1000868229_201260
>  ..
>  185818644
>  1581816136336
>  
>  5568434562
>  185818648
>  4495417845
>  
>  
>  
> ..
> 
>  OP_CLOSE
>  
>  32625024993
>  0
>  0
>  ..
>  3
>  1581816138774
>  1581814760398
>  536870912
>  
>  
>  false
>  
>  5568434562
>  185818644
>  4495417845
>  
>  
>  da_music
>  hdfs
>  416
>  
>  
>  
> {panel}
>  
>  
> The block size should be 185818648 in the first CloseOp. When truncate is 
> used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp is 
> synchronized to the JournalNode in the same batch. The block used by the two 
> CloseOps is the same instance, which causes the first CloseOp to record the 
> wrong block size. When the SNN rolls the editlog, TruncateOp does not put the 
> file into the UnderConstruction state. Then, when the second CloseOp is 
> executed, the file is not in the UnderConstruction state, and the SNN crashes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted

2020-06-11 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133910#comment-17133910
 ] 

huhaiyang edited comment on HDFS-15391 at 6/12/20, 6:14 AM:


[~ayushtkn] Thank you for the reply!

Will try to reproduce. However, the problem has not been reproduced in the test 
environment yet.
I will follow up and see if I can reproduce it.

{quote}
{quote}
The block used by CloseOp twice is the same instance, which causes the 
first CloseOp has wrong block size.
{quote}
didn't quite understood this.
{quote}
In the first CloseOp(TXID=126060942290) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480, but in fact in the first CloseOp 
block_11382080753 block size should be 108764672 and GENSTAMP should be  
10354154184.

And in the second CloseOp(TXID= 126060943585) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480.

The block  block_11382080753 used by CloseOp twice is the same instance, the 
first CloseOp has wrong block information.
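
A minimal, self-contained illustration of that aliasing (plain Java, with a 
stand-in Block class rather than org.apache.hadoop.hdfs.protocol.Block): because 
both ops hold a reference to the same mutable block object instead of a snapshot, 
the truncate that happens between them rewrites what the already-built first 
CloseOp will serialize.

{code:java}
public class SharedBlockSketch {
  static class Block {
    long numBytes;
    long genStamp;
    Block(long numBytes, long genStamp) { this.numBytes = numBytes; this.genStamp = genStamp; }
    @Override public String toString() { return "numBytes=" + numBytes + ", genstamp=" + genStamp; }
  }

  public static void main(String[] args) {
    Block last = new Block(108764672L, 10354154184L);  // state the first CloseOp should record

    Block[] firstCloseOp = { last };  // the op keeps a reference, not a copy
    last.numBytes = 63154347L;        // the later truncate mutates the same instance
    last.genStamp = 10354157480L;
    Block[] secondCloseOp = { last };

    // Both ops now carry the post-truncate state, so the first CloseOp in the
    // same batch is serialized with the wrong block size and genstamp.
    System.out.println("first  CloseOp sees: " + firstCloseOp[0]);
    System.out.println("second CloseOp sees: " + secondCloseOp[0]);
  }
}
{code}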



was (Author: haiyang hu):
[~ayushtkn] Thank you for the reply!

Will try to reproduce. However, the problem has not been reproduced in the test 
environment yet.
I will follow up and see if I can reproduce it.

{quote}
{quote}
The block used by CloseOp twice is the same instance, which causes the 
first CloseOp has wrong block size.
{quote}
didn't quite understood this.
{quote}
In the first CloseOp(TXID=126060942290) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480, but in fact in the first CloseOp 
block_11382080753 block size should be 108764672 and GENSTAMP should be  
10354071495.

And in the second CloseOp(TXID= 126060943585) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480.

The block  block_11382080753 used by CloseOp twice is the same instance, the 
first CloseOp has wrong block information.


> Standby NameNode due loads the corruption edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the editlog, resulting in an abnormal exit of the service and 
> failure to restart
> {noformat}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed :
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted

2020-06-11 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133245#comment-17133245
 ] 

huhaiyang edited comment on HDFS-15391 at 6/12/20, 6:13 AM:


Hi [~ayushtkn] could you please take a look at this issue?

{quote}
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
 java.io.IOException: File is not under construction: hdfs://path
{quote}
Related edit log transactions 

{noformat}
1. TXID=126060182153 OP_TRUNCATE time=1591266465492(2020-06-04 18:27:45)

NEWLENGTH=868460715
blocks: ... 
1138208075310336493410354049310

2. TXID=126060182170 OP_REASSIGN_LEASE

3. TXID=126060308267 OP_CLOSE
1591266492080 2020-06-04 18:28:12 1591264800229 
2020-06-04 18:00:00
blocks: 
...113820807536315434710354049316

4. TXID=126060311503 OP_APPEND

5. TXID=126060311717 OP_SET_GENSTAMP_V2

10354071495

6. TXID=126060313001 OP_UPDATE_BLOCKS
blocks: 
...113820807536315434710354071495

7. TXID=126060921400 OP_SET_GENSTAMP_V2 10354154184

8. TXID=126060921401 OP_REASSIGN_LEASE

9. TXID=126060942290 OP_CLOSE
1591266619003 2020-06-04 18:30:19 1591264800229 
2020-06-04 18:00:00
blocks: 
...113820807536315434710354157480

10.TXID=126060942548 OP_SET_GENSTAMP_V2

10354157480

11. TXID=126060942549 OP_TRUNCATE
868460715
1591266619207 2020-06-04 18:30:19
blocks: 
...1138208075310876467210354157480

12. TXID=126060943585 OP_CLOSE
15912666202872020-06-04 18:30:20 
15912648002292020-06-04 18:00:00
blocks: 
...113820807536315434710354157480
{noformat}

The block size should be 108764672 in the first CloseOp (TXID=126060942290).
When truncate is used, the block size becomes 63154347.
The block used by the two CloseOps is the same instance, which causes the first 
CloseOp to record the wrong block size.
When the second CloseOp (TXID=126060943585) is executed, the file is not in the 
UnderConstruction state, and the SNN goes down.


was (Author: haiyang hu):
Hi [~ayushtkn] could you please take a look at this issue?

{quote}
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
 java.io.IOException: File is not under construction: hdfs://path
{quote}
Related edit log transactions 

{noformat}
1. TXID=126060182153 OP_TRUNCATE time=1591266465492(2020-06-04 18:27:45)

NEWLENGTH=868460715
blocks: ... 
1138208075310336493410354049310

2. TXID=126060182170 OP_REASSIGN_LEASE

3. TXID=126060308267 OP_CLOSE
1591266492080 2020-06-04 18:28:12 1591264800229 
2020-06-04 18:00:00
blocks: 
...113820807536315434710354049316

4. TXID=126060311503 OP_APPEND

5. TXID=126060311717 OP_SET_GENSTAMP_V2

10354071495

6. TXID=126060313001 OP_UPDATE_BLOCKS
blocks: 
...113820807536315434710354071495

7. TXID=126060921401 OP_REASSIGN_LEASE

8. TXID=126060942290 OP_CLOSE
1591266619003 2020-06-04 18:30:19 1591264800229 
2020-06-04 18:00:00
blocks: 
...113820807536315434710354157480

9.TXID=126060942548 OP_SET_GENSTAMP_V2

10354157480

10. TXID=126060942549 OP_TRUNCATE
868460715
1591266619207 2020-06-04 18:30:19
blocks: 
...1138208075310876467210354157480

11. TXID=126060943585 OP_CLOSE
15912666202872020-06-04 18:30:20 
15912648002292020-06-04 18:00:00
blocks: 
...113820807536315434710354157480
{noformat}

The block size should be 108764672 in the first CloseOp (TXID=126060942290).
When truncate is used, the block size becomes 63154347.
The block used by the two CloseOps is the same instance, which causes the first 
CloseOp to record the wrong block size.
When the second CloseOp (TXID=126060943585) is executed, the file is not in the 
UnderConstruction state, and the SNN goes down.

> Standby NameNode due loads the corruption edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
>  

[jira] [Comment Edited] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted

2020-06-11 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133910#comment-17133910
 ] 

huhaiyang edited comment on HDFS-15391 at 6/12/20, 4:23 AM:


[~ayushtkn] Thank you for the reply!

Will try to reproduce. However, the problem has not been reproduced in the test 
environment yet.
I will follow up and see if I can reproduce it.

{quote}
{quote}
The block used by CloseOp twice is the same instance, which causes the 
first CloseOp has wrong block size.
{quote}
didn't quite understood this.
{quote}
in the first CloseOp(TXID=126060942290) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480, but in fact in the first CloseOp 
block_11382080753 block size should be 108764672 and GENSTAMP should be  
10354071495.

and in the second CloseOp(TXID= 126060943585) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480.

The block  block_11382080753 used by CloseOp twice is the same instance, the 
first CloseOp has wrong block information.



was (Author: haiyang hu):
[~ayushtkn] Thank you for the reply!

Will try to reproduce. However, the problem has not been reproduced in the test 
environment yet.
I will follow up and see if I can reproduce it.

{quote}
{quote}The block used by CloseOp twice is the same instance, which causes the 
first CloseOp has wrong block size.
{quote}
didn't quite understood this.
{quote}
in the first CloseOp(TXID=126060942290) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480, but in fact in the first CloseOp 
block_11382080753 block size should be 108764672 and GENSTAMP should be  
10354071495.

and in the second CloseOp(TXID= 126060943585) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480.

The block  block_11382080753 used by CloseOp twice is the same instance, the 
first CloseOp has wrong block information.


> Standby NameNode due loads the corruption edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the editlog, resulting in an abnormal exit of the service and 
> failure to restart
> {noformat}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed :
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted

2020-06-11 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133910#comment-17133910
 ] 

huhaiyang edited comment on HDFS-15391 at 6/12/20, 4:24 AM:


[~ayushtkn] Thank you for the reply!

Will try to reproduce. However, the problem has not been reproduced in the test 
environment yet.
I will follow up and see if I can reproduce it.

{quote}
{quote}
The block used by CloseOp twice is the same instance, which causes the 
first CloseOp has wrong block size.
{quote}
didn't quite understood this.
{quote}
In the first CloseOp(TXID=126060942290) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480, but in fact in the first CloseOp 
block_11382080753 block size should be 108764672 and GENSTAMP should be  
10354071495.

And in the second CloseOp(TXID= 126060943585) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480.

The block  block_11382080753 used by CloseOp twice is the same instance, the 
first CloseOp has wrong block information.



was (Author: haiyang hu):
[~ayushtkn] Thank you for the reply!

Will try to reproduce. However, the problem has not been reproduced in the test 
environment yet.
I will follow up and see if I can reproduce it.

{quote}
{quote}
The block used by CloseOp twice is the same instance, which causes the 
first CloseOp has wrong block size.
{quote}
didn't quite understood this.
{quote}
in the first CloseOp(TXID=126060942290) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480, but in fact in the first CloseOp 
block_11382080753 block size should be 108764672 and GENSTAMP should be  
10354071495.

and in the second CloseOp(TXID= 126060943585) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480.

The block  block_11382080753 used by CloseOp twice is the same instance, the 
first CloseOp has wrong block information.


> Standby NameNode due loads the corruption edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the editlog, resulting in an abnormal exit of the service and 
> failure to restart
> {noformat}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed :
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted

2020-06-11 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133910#comment-17133910
 ] 

huhaiyang commented on HDFS-15391:
--

[~ayushtkn] Thank you for the reply!

Will try to reproduce. However, the problem has not been reproduced in the test 
environment yet.
I will follow up and see if I can reproduce it.

{quote}
{quote}The block used by CloseOp twice is the same instance, which causes the 
first CloseOp has wrong block size.
{quote}
didn't quite understood this.
{quote}
in the first CloseOp(TXID=126060942290) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480, but in fact in the first CloseOp 
block_11382080753 block size should be 108764672 and GENSTAMP should be  
10354071495.

and in the second CloseOp(TXID= 126060943585) block_11382080753  block size is 
63154347 and GENSTAMP   is 10354157480.

The block  block_11382080753 used by CloseOp twice is the same instance, the 
first CloseOp has wrong block information.


> Standby NameNode due loads the corruption edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the editlog, resulting in an abnormal exit of the service and 
> failure to restart
> {noformat}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed :
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted

2020-06-11 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133865#comment-17133865
 ] 

huhaiyang commented on HDFS-15391:
--

Hi [~hexiaoqiao], thank you for the reply!
{quote}
Do you enable AsyncEditlog feature? I think it could be related to different 
operations process the same blocks which not sync/return back to client. IIRC, 
we try to fix it using deep copy as HDFS-15175 mentioned in my internal branch.
{quote}

Yes, we enable the AsyncEditlog feature, and I also think it may be related to 
this feature.
In the current scenario, the same file is appended and truncated multiple times.
OK, let's also try to fix it using deep copy, as sketched below …
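
A minimal sketch of that deep-copy idea, under the assumption that snapshotting 
block state when an edit op is built is enough to decouple it from later 
mutations; the real HDFS-15175 change lives in the edit-log path and may look 
different.

{code:java}
import org.apache.hadoop.hdfs.protocol.Block;

public class DeepCopyBlocksSketch {

  // Snapshot the block list so a later truncate/append cannot rewrite an op
  // that has already been queued for the journal.
  static Block[] snapshotBlocks(Block[] blocks) {
    Block[] copy = new Block[blocks.length];
    for (int i = 0; i < blocks.length; i++) {
      copy[i] = new Block(blocks[i]);  // copy constructor clones id, numBytes and genstamp
    }
    return copy;
  }

  public static void main(String[] args) {
    Block shared = new Block(11382080753L, 108764672L, 10354154184L);
    Block[] loggedByFirstClose = snapshotBlocks(new Block[] { shared });

    // The truncate that follows mutates the live instance only.
    shared.setNumBytes(63154347L);
    shared.setGenerationStamp(10354157480L);

    System.out.println("logged: " + loggedByFirstClose[0].getNumBytes()
        + " / live: " + shared.getNumBytes());
  }
}
{code}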


> Standby NameNode due loads the corruption edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
>Reporter: huhaiyang
>Priority: Critical
>
> In the cluster version 3.2.0 production environment,
>  we found that due to edit log corruption, the Standby NameNode could not 
> properly load the editlog, resulting in an abnormal exit of the service and 
> failure to restart
> {noformat}
> The specific scenario is that Flink writes to HDFS(replication file), and in 
> the case of an exception to the write file, the following operations are 
> performed :
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted

2020-06-11 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133245#comment-17133245
 ] 

huhaiyang edited comment on HDFS-15391 at 6/11/20, 1:28 PM:


Hi [~ayushtkn] could you please take a look at this issue?

{quote}
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
 java.io.IOException: File is not under construction: hdfs://path
{quote}
Related edit log transactions 

{noformat}
1. TXID=126060182153 OP_TRUNCATE time=1591266465492(2020-06-04 18:27:45)

NEWLENGTH=868460715
blocks: ... 
1138208075310336493410354049310

2. TXID=126060182170 OP_REASSIGN_LEASE

3. TXID=126060308267 OP_CLOSE
1591266492080 2020-06-04 18:28:12 1591264800229 
2020-06-04 18:00:00
blocks: 
...113820807536315434710354049316

4. TXID=126060311503 OP_APPEND

5. TXID=126060311717 OP_SET_GENSTAMP_V2

10354071495

6. TXID=126060313001 OP_UPDATE_BLOCKS
blocks: 
...113820807536315434710354071495

7. TXID=126060921401 OP_REASSIGN_LEASE

8. TXID=126060942290 OP_CLOSE
1591266619003 2020-06-04 18:30:19 1591264800229 
2020-06-04 18:00:00
blocks: 
...113820807536315434710354157480

9.TXID=126060942548 OP_SET_GENSTAMP_V2

10354157480

10. TXID=126060942549 OP_TRUNCATE
868460715
1591266619207 2020-06-04 18:30:19
blocks: 
...1138208075310876467210354157480

11. TXID=126060943585 OP_CLOSE
15912666202872020-06-04 18:30:20 
15912648002292020-06-04 18:00:00
blocks: 
...113820807536315434710354157480
{noformat}

The block size should be 108764672 in the first CloseOp (TXID=126060942290).
When truncate is used, the block size becomes 63154347.
The block used by the two CloseOps is the same instance, which causes the first 
CloseOp to record the wrong block size.
When the second CloseOp (TXID=126060943585) is executed, the file is not in the 
UnderConstruction state, and the SNN goes down.


was (Author: haiyang hu):
Hi [~ayushtkn] could you please take a look at this issue?

{quote}
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
 java.io.IOException: File is not under construction: hdfs://path
{quote}
Related edit log transactions 

{noformat}
1.  TXID=126060182153 OP_TRUNCATE time=1591266465492 (2020-06-04 18:27:45)
    NEWLENGTH=868460715
    blocks: ... blk_11382080753 (numBytes=103364934, genStamp=10354049310)

2.  TXID=126060182170 OP_REASSIGN_LEASE

3.  TXID=126060308267 OP_CLOSE
    mtime=1591266492080 (2020-06-04 18:28:12), atime=1591264800229 (2020-06-04 18:00:00)
    blocks: ... blk_11382080753 (numBytes=63154347, genStamp=10354049316)

4.  TXID=126060311503 OP_APPEND

5.  TXID=126060313001 OP_UPDATE_BLOCKS
    blocks: ... blk_11382080753 (numBytes=63154347, genStamp=10354071495)

6.  TXID=126060921401 OP_REASSIGN_LEASE

7.  TXID=126060942290 OP_CLOSE
    mtime=1591266619003 (2020-06-04 18:30:19), atime=1591264800229 (2020-06-04 18:00:00)
    blocks: ... blk_11382080753 (numBytes=63154347, genStamp=10354157480)

8.  TXID=126060942548 OP_SET_GENSTAMP_V2
    genStamp=10354157480

9.  TXID=126060942549 OP_TRUNCATE
    NEWLENGTH=868460715, time=1591266619207 (2020-06-04 18:30:19)
    blocks: ... blk_11382080753 (numBytes=108764672, genStamp=10354157480)

10. TXID=126060943585 OP_CLOSE
    mtime=1591266620287 (2020-06-04 18:30:20), atime=1591264800229 (2020-06-04 18:00:00)
    blocks: ... blk_11382080753 (numBytes=63154347, genStamp=10354157480)
{noformat}

The block size in the first CloseOp (TXID=126060942290) should be 108764672, but 
after the truncate the block size is 63154347. Both CloseOps refer to the same 
Block instance, so the first CloseOp ends up recording the wrong block size. 
When the second CloseOp (TXID=126060943585) is replayed, the file is no longer in 
the UnderConstruction state, and the SNN goes down.

> Standby NameNode due loads the corruption edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>

[jira] [Comment Edited] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted

2020-06-11 Thread huhaiyang (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133245#comment-17133245
 ] 

huhaiyang edited comment on HDFS-15391 at 6/11/20, 1:13 PM:


Hi [~ayushtkn] could you please take a look at this issue?

{quote}
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
 java.io.IOException: File is not under construction: hdfs://path
{quote}
Related edit log transactions 

{noformat}
1.  TXID=126060182153 OP_TRUNCATE time=1591266465492 (2020-06-04 18:27:45)
    NEWLENGTH=868460715
    blocks: ... blk_11382080753 (numBytes=103364934, genStamp=10354049310)

2.  TXID=126060182170 OP_REASSIGN_LEASE

3.  TXID=126060308267 OP_CLOSE
    mtime=1591266492080 (2020-06-04 18:28:12), atime=1591264800229 (2020-06-04 18:00:00)
    blocks: ... blk_11382080753 (numBytes=63154347, genStamp=10354049316)

4.  TXID=126060311503 OP_APPEND

5.  TXID=126060313001 OP_UPDATE_BLOCKS
    blocks: ... blk_11382080753 (numBytes=63154347, genStamp=10354071495)

6.  TXID=126060921401 OP_REASSIGN_LEASE

7.  TXID=126060942290 OP_CLOSE
    mtime=1591266619003 (2020-06-04 18:30:19), atime=1591264800229 (2020-06-04 18:00:00)
    blocks: ... blk_11382080753 (numBytes=63154347, genStamp=10354157480)

8.  TXID=126060942548 OP_SET_GENSTAMP_V2
    genStamp=10354157480

9.  TXID=126060942549 OP_TRUNCATE
    NEWLENGTH=868460715, time=1591266619207 (2020-06-04 18:30:19)
    blocks: ... blk_11382080753 (numBytes=108764672, genStamp=10354157480)

10. TXID=126060943585 OP_CLOSE
    mtime=1591266620287 (2020-06-04 18:30:20), atime=1591264800229 (2020-06-04 18:00:00)
    blocks: ... blk_11382080753 (numBytes=63154347, genStamp=10354157480)
{noformat}

The block size in the first CloseOp (TXID=126060942290) should be 108764672, but 
after the truncate the block size is 63154347. Both CloseOps refer to the same 
Block instance, so the first CloseOp ends up recording the wrong block size. 
When the second CloseOp (TXID=126060943585) is replayed, the file is no longer in 
the UnderConstruction state, and the SNN goes down.


was (Author: haiyang hu):
Hi [~ayushtkn] could you please take a look at this issue?

{quote}
2020-06-04 18:32:11,561 ERROR 
org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception 
on operation CloseOp [length=0, inodeId=0, path=path, replication=3, 
mtime=1591266620287, atime=1591264800229, blockSize=134217728, 
blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, 
blk_11382041307_10353383098, blk_11382049845_10353392031, 
blk_11382057341_10353399899, blk_11382071544_10353415171, 
blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, 
aclEntries=null, clientName=, clientMachine=, overwrite=false, 
storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585]
 java.io.IOException: File is not under construction: hdfs://path
{quote}
Related edit log transactions 

{noformat}
1.  TXID=126060182153 OP_TRUNCATE time=1591266465492 (2020-06-04 18:27:45)
    NEWLENGTH=868460715
    blocks: ... blk_11382080753 (numBytes=103364934, genStamp=10354049310)

2.  TXID=126060182170 OP_REASSIGN_LEASE

3.  TXID=126060308267 OP_CLOSE
    mtime=1591266492080 (2020-06-04 18:28:12), atime=1591264800229 (2020-06-04 18:00:00)
    blocks: ... blk_11382080753 (numBytes=63154347, genStamp=10354049316)

4.  TXID=126060311503 OP_APPEND

5.  TXID=126060313001 OP_UPDATE_BLOCKS
    blocks: ... blk_11382080753 (numBytes=63154347, genStamp=10354071495)

6.  TXID=126060921401 OP_REASSIGN_LEASE

7.  TXID=126060942290 OP_CLOSE
    mtime=1591266619003 (2020-06-04 18:30:19), atime=1591264800229 (2020-06-04 18:00:00)
    blocks: ... blk_11382080753 (numBytes=63154347, genStamp=10354157480)

8.  TXID=126060942548 OP_SET_GENSTAMP_V2
    genStamp=10354157480

9.  TXID=126060942549 OP_TRUNCATE
    NEWLENGTH=868460715, time=1591266619207 (2020-06-04 18:30:19)
    blocks: ... blk_11382080753 (numBytes=108764672, genStamp=10354157480)

10. TXID=126060943585 OP_CLOSE
    mtime=1591266620287 (2020-06-04 18:30:20), atime=1591264800229 (2020-06-04 18:00:00)
    blocks: ... blk_11382080753 (numBytes=63154347, genStamp=10354157480)
{noformat}

The block size in the first CloseOp should be 108764672, but after the truncate 
the block size is 63154347. Both CloseOps refer to the same Block instance, so 
the first CloseOp ends up recording the wrong block size. 
When the second CloseOp is replayed, the file is no longer in the UnderConstruction 
state, and the SNN goes down.

> Standby NameNode due loads the corruption edit log, the service exits and 
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.2.0
> 
