[jira] [Commented] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277120#comment-17277120 ] huhaiyang commented on HDFS-15798:

Uploaded the v003 patch according to your suggestions.

> EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
> ----------------------------------------------------------------------------------
>
>                 Key: HDFS-15798
>                 URL: https://issues.apache.org/jira/browse/HDFS-15798
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: huhaiyang
>            Assignee: huhaiyang
>            Priority: Major
>         Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch, HDFS-15798.003.patch
>
> When an EC reconstruction task fails, processErasureCodingTasks decrements XmitsInProgress by an incorrect value. The DN's XmitsInProgress can therefore become negative, which affects how the NN chooses pending tasks, since the NN relies on the ratio between the lengths of the replication and erasure-coded block queues.
> {code:java}
> // 1. ErasureCodingWorker.java
> public void processErasureCodingTasks(
>     Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
>     int xmitsSubmitted = 0;
>     try {
>       ...
>       // It may throw IllegalArgumentException from task#stripedReader
>       // constructor.
>       final StripedBlockReconstructor task =
>           new StripedBlockReconstructor(this, stripedReconInfo);
>       if (task.hasValidTargets()) {
>         // See HDFS-12044. We increase xmitsInProgress even if the task is
>         // only enqueued, so that
>         // 1) NN will not send more tasks than what DN can execute and
>         // 2) DN will not throw away reconstruction tasks, and instead keeps
>         // an unbounded number of tasks in the executor's task queue.
>         xmitsSubmitted = Math.max((int) (task.getXmits() * xmitWeight), 1);
>         getDatanode().incrementXmitsInProcess(xmitsSubmitted); // task start: increment
>         stripedReconstructionPool.submit(task);
>       } else {
>         LOG.warn("No missing internal block. Skip reconstruction for task:{}",
>             reconInfo);
>       }
>     } catch (Throwable e) {
>       // Task failed: decrement. XmitsInProgress is decremented by the
>       // value left over from the previous iteration.
>       getDatanode().decrementXmitsInProgress(xmitsSubmitted);
>       LOG.warn("Failed to reconstruct striped block {}",
>           reconInfo.getExtendedBlock().getLocalBlock(), e);
>     }
>   }
> }
>
> // 2. StripedBlockReconstructor.java
> public void run() {
>   try {
>     initDecoderIfNecessary();
>     ...
>   } catch (Throwable e) {
>     LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
>     getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
>     float xmitWeight = getErasureCodingWorker().getXmitWeight();
>     // If the xmits is smaller than 1, xmitsSubmitted should be set to 1,
>     // because if it were set to zero, we could not measure the xmits
>     // submitted.
>     int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
>     getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete: decrement
>     ...
>   }
> }{code}
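To make the failure mode concrete, here is a minimal, self-contained sketch (hypothetical; it does not use the real DataNode classes) of how a stale xmitsSubmitted drives a shared counter negative when the variable is not reset for each loop iteration:
{code:java}
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class StaleXmitsDemo {
  // Stands in for the DataNode's xmitsInProgress counter.
  private static final AtomicInteger xmitsInProgress = new AtomicInteger();

  public static void main(String[] args) {
    // Declared outside the loop, as in the buggy pattern: it keeps the
    // value assigned in the previous iteration.
    int xmitsSubmitted = 0;
    for (String task : List.of("task-1", "bad-task")) {
      try {
        if (task.startsWith("bad")) {
          // Mimics the IllegalArgumentException from the task constructor.
          throw new IllegalArgumentException("no valid sources");
        }
        xmitsSubmitted = 5; // e.g. Math.max((int) (xmits * weight), 1)
        xmitsInProgress.addAndGet(xmitsSubmitted);
      } catch (Throwable e) {
        // Decrements by the *previous* task's 5, not this task's 0.
        xmitsInProgress.addAndGet(-xmitsSubmitted);
      }
    }
    // task-1 is still "running", yet the counter already dropped back to 0;
    // once task-1 completes and decrements again, the counter reaches -5.
    System.out.println("xmitsInProgress = " + xmitsInProgress.get()); // 0
  }
}
{code}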
[jira] [Updated] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798:

    Attachment: HDFS-15798.003.patch
[jira] [Commented] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17277075#comment-17277075 ] huhaiyang commented on HDFS-15798:

[~ferhui] [~sodonnell] Thank you for your advice! It makes sense to me; I will submit a new patch later.
[jira] [Comment Edited] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276283#comment-17276283 ] huhaiyang edited comment on HDFS-15798 at 2/1/21, 12:13 PM:

[~sodonnell] We have encountered exceptions like this in our cluster:
{code:java}
2020-12-29 07:47:03,409 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Failed to reconstruct striped block: BP-xxx:blk_-xxx
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.util.StripedBlockUtil.getNextCompletedStripedRead(StripedBlockUtil.java:314)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.doReadMinimumSources(StripedReader.java:308)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedReader.readMinimumSources(StripedReader.java:269)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.reconstruct(StripedBlockReconstructor.java:93)
        at org.apache.hadoop.hdfs.server.datanode.erasurecode.StripedBlockReconstructor.run(StripedBlockReconstructor.java:60)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
{code}
The exception is currently caught in StripedBlockReconstructor#run -> catch(Throwable e), and the finally block then decrements XmitsInProgress. However, we have not yet come across an exception logged from ErasureCodingWorker#processErasureCodingTasks -> catch(Throwable e).

was (Author: haiyang hu):
[~sodonnell] We have encountered exceptions like this in our cluster (the same stack trace as above). The exception is currently caught in StripedBlockReconstructor#run -> catch(Throwable e), and the finally block then decrements XmitsInProgress. However, no exception was logged from ErasureCodingWorker#processErasureCodingTasks -> catch(Throwable e).
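The intended accounting, as described in the snippets above, pairs an increment at submit time with exactly one decrement when the task finishes (or fails to submit). A minimal sketch of that pairing with a plain executor — hypothetical names, not the actual DataNode code:
{code:java}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class XmitsPairingDemo {
  private static final AtomicInteger xmitsInProgress = new AtomicInteger();

  public static void main(String[] args) throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(2);
    int xmits = 5; // assumed per-task weight

    xmitsInProgress.addAndGet(xmits); // increment when the task is enqueued
    try {
      pool.submit(() -> {
        try {
          throw new NullPointerException("read failed"); // task fails inside run()
        } catch (Throwable e) {
          // failure handled here, like StripedBlockReconstructor#run
        } finally {
          // exactly one decrement per successfully submitted task
          xmitsInProgress.addAndGet(-xmits);
        }
      });
    } catch (Throwable e) {
      // only if submit itself failed does the caller decrement
      xmitsInProgress.addAndGet(-xmits);
    }

    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
    System.out.println(xmitsInProgress.get()); // 0: balanced even though the task failed
  }
}
{code}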
[jira] [Comment Edited] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276283#comment-17276283 ] huhaiyang edited comment on HDFS-15798 at 2/1/21, 12:12 PM:

[~sodonnell] We have encountered exceptions like this in our cluster (the same NullPointerException stack trace as quoted above). The exception is currently caught in StripedBlockReconstructor#run -> catch(Throwable e), and the finally block then decrements XmitsInProgress. However, no exception was logged from ErasureCodingWorker#processErasureCodingTasks -> catch(Throwable e).
[jira] [Commented] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276283#comment-17276283 ] huhaiyang commented on HDFS-15798:

[~sodonnell] We have encountered exceptions like this in our cluster (the NullPointerException in StripedBlockUtil.getNextCompletedStripedRead quoted above). The exception is currently caught in StripedBlockReconstructor#run -> catch(Throwable e), and the finally block then decrements XmitsInProgress. However, no exception was logged from ErasureCodingWorker#processErasureCodingTasks -> catch(Throwable e).
[jira] [Commented] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276145#comment-17276145 ] huhaiyang commented on HDFS-15798:

[~ferhui] Thanks for the review! I have carefully checked the code, and the current logic should be fine. Thanks [~ferhui] and [~sodonnell] for helping to review.
[jira] [Comment Edited] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo
[ https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275504#comment-17275504 ] huhaiyang edited comment on HDFS-15803 at 1/30/21, 7:28 AM:

Uploaded the simple patch. Here is the patch to remove it. No new test case is needed.

was (Author: haiyang hu):
Upload the simple patch , Here is the patch to remove it. No need for new test case.

> Remove unnecessary method (getWeight) in StripedReconstructionInfo
> -------------------------------------------------------------------
>
>                 Key: HDFS-15803
>                 URL: https://issues.apache.org/jira/browse/HDFS-15803
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: huhaiyang
>            Priority: Trivial
>         Attachments: HDFS-15803_001.patch
>
> Removing the unused method from StripedReconstructionInfo:
> {code:java}
> // StripedReconstructionInfo.java
> /**
>  * Return the weight of this EC reconstruction task.
>  *
>  * DN uses it to coordinate with NN to adjust the speed of scheduling the
>  * reconstruction tasks to this DN.
>  *
>  * @return the weight of this reconstruction task.
>  * @see HDFS-12044
>  */
> int getWeight() {
>   // See HDFS-12044. The weight of a RS(n, k) is calculated by the network
>   // connections it opens.
>   return sources.length + targets.length;
> }
> {code}
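For context, the weight this removed method computed counts one network connection per source read and per target write. A small standalone example, with made-up array sizes, of what the same arithmetic would return for a typical RS(6,3) reconstruction of a single lost internal block:
{code:java}
public class WeightExample {
  public static void main(String[] args) {
    // Assumed scenario: RS(6,3) block group with one lost internal block.
    // Reconstruction reads from 6 source DNs and writes to 1 target DN.
    String[] sources = new String[6];
    String[] targets = new String[1];

    // Same arithmetic as the removed getWeight():
    int weight = sources.length + targets.length;
    System.out.println("weight = " + weight); // 7 network connections
  }
}
{code}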
[jira] [Assigned] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo
[ https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang reassigned HDFS-15803:

    Assignee: huhaiyang
[jira] [Comment Edited] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo
[ https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275504#comment-17275504 ] huhaiyang edited comment on HDFS-15803 at 1/30/21, 7:28 AM:

Upload the simple patch , Here is the patch to remove it. No need for new test case.

was (Author: haiyang hu):
Here is the patch to remove it. No need for new test case.
[jira] [Updated] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo
[ https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15803:

    Description: Removing the unused method from StripedReconstructionInfo, now with the getWeight() code quoted (as shown above)

    was: Removing the unused method from StripedReconstructionInfo
[jira] [Commented] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo
[ https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17275504#comment-17275504 ] huhaiyang commented on HDFS-15803:

Here is the patch to remove it. No need for new test case.
[jira] [Updated] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo
[ https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15803:

    Description: Removing the unused method from StripedReconstructionInfo
[jira] [Updated] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo
[ https://issues.apache.org/jira/browse/HDFS-15803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15803:

    Attachment: HDFS-15803_001.patch
[jira] [Created] (HDFS-15803) Remove unnecessary method (getWeight) in StripedReconstructionInfo
huhaiyang created HDFS-15803:

             Summary: Remove unnecessary method (getWeight) in StripedReconstructionInfo
                 Key: HDFS-15803
                 URL: https://issues.apache.org/jira/browse/HDFS-15803
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: huhaiyang
[jira] [Updated] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798:

    Description: the description quoted in the comments above, with the catch-block annotation expanded to "// task failed decrement, XmitsInProgress is decremented by the previous value"

    was: the same description, with the catch-block annotation reading "// task failed decrement, xmitsSubmitted"
[jira] [Updated] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798:

    Description: the same description, with the inline code annotations reworded to "task start increment", "task failed decrement, xmitsSubmitted", and "task complete decrement"

    was: the same description, with numbered annotations "1.task start increment", "2.2. task failed decrement", and "2.1. task complete decrement"
[jira] [Comment Edited] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274126#comment-17274126 ] huhaiyang edited comment on HDFS-15798 at 1/29/21, 3:07 AM:

Thanks for the review, [~sodonnell]!
{quote}
If I understand this correctly, this problem can only occur if there are several tasks to process in the loop:
1. The first pass around the loop sets xmitsSubmitted = X, say 5.
2. This is used to increment the DN's XmitsInProgress.
3. On the next pass around the loop, the exception is thrown. As xmitsSubmitted was never reset to zero, the DN's XmitsInProgress is decremented by the previous value from the first pass (5 in this example).
{quote}
Just as you said, this problem can only occur if there are several tasks to process in the loop. As you suggested, I updated the patch.

was (Author: haiyang hu):
Thanks for the review, [~sodonnell]. As you suggested, I updated the patch.
{quote}
If I understand this correctly, this problem can only occur if there are several tasks to process in the loop:
1. The first pass around the loop sets xmitsSubmitted = X, say 5.
2. This is used to increment the DN's XmitsInProgress.
3. On the next pass around the loop, the exception is thrown. As xmitsSubmitted was never reset to zero, the DN's XmitsInProgress is decremented by the previous value from the first pass (5 in this example).
{quote}
Just as you said, this problem can only occur if there are several tasks to process in the loop.
[jira] [Commented] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17274126#comment-17274126 ] huhaiyang commented on HDFS-15798:

Thanks for the review, [~sodonnell]. As you suggested, I updated the patch.
{quote}
If I understand this correctly, this problem can only occur if there are several tasks to process in the loop:
1. The first pass around the loop sets xmitsSubmitted = X, say 5.
2. This is used to increment the DN's XmitsInProgress.
3. On the next pass around the loop, the exception is thrown. As xmitsSubmitted was never reset to zero, the DN's XmitsInProgress is decremented by the previous value from the first pass (5 in this example).
{quote}
Just as you said, this problem can only occur if there are several tasks to process in the loop.
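Based on that analysis, the fix direction is simply to ensure the value used for the decrement is the one this iteration actually added. A sketch of the corrected loop shape, reusing the names from the quoted snippet — this illustrates the idea and is not necessarily the exact code in HDFS-15798.002/003.patch:
{code:java}
for (BlockECReconstructionInfo reconInfo : ecTasks) {
  // Declared (and thus reset to 0) inside the loop, so a failure in this
  // iteration can never decrement by a value left over from a previous one.
  int xmitsSubmitted = 0;
  try {
    final StripedBlockReconstructor task =
        new StripedBlockReconstructor(this, stripedReconInfo);
    if (task.hasValidTargets()) {
      xmitsSubmitted = Math.max((int) (task.getXmits() * xmitWeight), 1);
      getDatanode().incrementXmitsInProcess(xmitsSubmitted);
      stripedReconstructionPool.submit(task);
      // Once submitted, the task's own finally block performs the matching
      // decrement, so the caller must not decrement again for this task.
      xmitsSubmitted = 0;
    }
  } catch (Throwable e) {
    // Undo only what this iteration added (0 if nothing was incremented).
    getDatanode().decrementXmitsInProgress(xmitsSubmitted);
    LOG.warn("Failed to reconstruct striped block {}",
        reconInfo.getExtendedBlock().getLocalBlock(), e);
  }
}
{code}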
[jira] [Updated] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798:

    Attachment: HDFS-15798.002.patch
[jira] [Updated] (HDFS-15798) EC: Reconstruction task failed, and XmitsInProgress of the DN may become negative
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798:
- Description:
When an EC reconstruct task fails, the decrementXmitsInProgress call in processErasureCodingTasks can subtract an abnormal value. The XmitsInProgress counter of the DN can then go negative, which skews how the NN chooses pending tasks based on the ratio between the lengths of the replication and erasure-coded block queues.
{code:java}
// 1. ErasureCodingWorker.java
public void processErasureCodingTasks(
    Collection<BlockECReconstructionInfo> ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
    int xmitsSubmitted = 0;
    try {
      ...
      // It may throw IllegalArgumentException from task#stripedReader
      // constructor.
      final StripedBlockReconstructor task =
          new StripedBlockReconstructor(this, stripedReconInfo);
      if (task.hasValidTargets()) {
        // See HDFS-12044. We increase xmitsInProgress even the task is only
        // enqueued, so that
        // 1) NN will not send more tasks than what DN can execute and
        // 2) DN will not throw away reconstruction tasks, and instead keeps
        // an unbounded number of tasks in the executor's task queue.
        xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
        getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1. task start increment
        stripedReconstructionPool.submit(task);
      } else {
        LOG.warn("No missing internal block. Skip reconstruction for task:{}",
            reconInfo);
      }
    } catch (Throwable e) {
      getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task failed decrement
      LOG.warn("Failed to reconstruct striped block {}",
          reconInfo.getExtendedBlock().getLocalBlock(), e);
    }
  }
}

// 2. StripedBlockReconstructor.java
public void run() {
  try {
    initDecoderIfNecessary();
    ...
  } catch (Throwable e) {
    LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
    getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
    float xmitWeight = getErasureCodingWorker().getXmitWeight();
    // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
    // because if it set to zero, we cannot to measure the xmits submitted
    int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
    getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task complete decrement
    ...
  }
}{code}

was:
When an EC reconstruct task fails, the decrementXmitsInProgress call in processErasureCodingTasks executes abnormally. The XmitsInProgress counter of the DN can then go negative, which skews how the NN chooses pending tasks based on the ratio between the lengths of the replication and erasure-coded block queues.
{code:java}
// 1. ErasureCodingWorker.java
public void processErasureCodingTasks(
    Collection<BlockECReconstructionInfo> ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
    int xmitsSubmitted = 0;
    try {
      ...
      // It may throw IllegalArgumentException from task#stripedReader
      // constructor.
      final StripedBlockReconstructor task =
          new StripedBlockReconstructor(this, stripedReconInfo);
      if (task.hasValidTargets()) {
        // See HDFS-12044. We increase xmitsInProgress even the task is only
        // enqueued, so that
        // 1) NN will not send more tasks than what DN can execute and
        // 2) DN will not throw away reconstruction tasks, and instead keeps
        // an unbounded number of tasks in the executor's task queue.
        xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
        getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1. task start increment
        stripedReconstructionPool.submit(task);
      } else {
        LOG.warn("No missing internal block. Skip reconstruction for task:{}",
            reconInfo);
      }
    } catch (Throwable e) {
      getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task failed decrement
      LOG.warn("Failed to reconstruct striped block {}",
          reconInfo.getExtendedBlock().getLocalBlock(), e);
    }
  }
}

// 2. StripedBlockReconstructor.java
public void run() {
  try {
    initDecoderIfNecessary();
    ...
  } catch (Throwable e) {
    LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
    getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
    float xmitWeight = getErasureCodingWorker().getXmitWeight();
    // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
    // because if it set to zero, we cannot to measure the xmits submitted
    int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
    getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task complete decrement
    ...
  }
}{code}

> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has
> negative number
[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Description: The EC reconstruct task failed, and the decrementXmitsInProgress of processErasureCodingTasks operation abnormal execution ; It would be XmitsInProgress of DN has negative number, it affects NN chooses pending tasks based on the ratio between the lengths of replication and erasure-coded block queues. {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1.task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task complete decrement ... } }{code} was: The EC reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number, it affects NN chooses pending tasks based on the ratio between the lengths of replication and erasure-coded block queues. {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1.task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. 
Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task complete decrement ... } }{code} > EC: Reconstruct task failed, and It would be XmitsInProgress of DN has
[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Summary: EC: Reconstruct task failed, and It would be XmitsInProgress of DN has negative number (was: EC: Reconstruct task failed, and the XmitsInProgress operation will be performed twice) > EC: Reconstruct task failed, and It would be XmitsInProgress of DN has > negative number > -- > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: huhaiyang >Assignee: huhaiyang >Priority: Major > Attachments: HDFS-15798.001.patch > > > The EC reconstruct task failed, and the decrementXmitsInProgress operation > will be performed twice > It would be XmitsInProgress of DN has negative number, it affects NN chooses > pending tasks based on the ratio between the lengths of replication and > erasure-coded block queues. > {code:java} > // 1.ErasureCodingWorker.java > public void processErasureCodingTasks( > Collection ecTasks) { > for (BlockECReconstructionInfo reconInfo : ecTasks) { > int xmitsSubmitted = 0; > try { > ... > // It may throw IllegalArgumentException from task#stripedReader > // constructor. > final StripedBlockReconstructor task = > new StripedBlockReconstructor(this, stripedReconInfo); > if (task.hasValidTargets()) { > // See HDFS-12044. We increase xmitsInProgress even the task is only > // enqueued, so that > // 1) NN will not send more tasks than what DN can execute and > // 2) DN will not throw away reconstruction tasks, and instead keeps > // an unbounded number of tasks in the executor's task queue. > xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); > getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1.task > start increment > stripedReconstructionPool.submit(task); > } else { > LOG.warn("No missing internal block. Skip reconstruction for task:{}", > reconInfo); > } > } catch (Throwable e) { > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task > failed decrement > LOG.warn("Failed to reconstruct striped block {}", > reconInfo.getExtendedBlock().getLocalBlock(), e); > } > } > } > // 2.StripedBlockReconstructor.java > public void run() { > try { > initDecoderIfNecessary(); >... > } catch (Throwable e) { > LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); > getDatanode().getMetrics().incrECFailedReconstructionTasks(); > } finally { > float xmitWeight = getErasureCodingWorker().getXmitWeight(); > // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 > // because if it set to zero, we cannot to measure the xmits submitted > int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task > complete decrement > ... > } > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the XmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Summary: EC: Reconstruct task failed, and the XmitsInProgress operation will be performed twice (was: EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice) > EC: Reconstruct task failed, and the XmitsInProgress operation will be > performed twice > -- > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: huhaiyang >Assignee: huhaiyang >Priority: Major > Attachments: HDFS-15798.001.patch > > > The EC reconstruct task failed, and the decrementXmitsInProgress operation > will be performed twice > It would be XmitsInProgress of DN has negative number, it affects NN chooses > pending tasks based on the ratio between the lengths of replication and > erasure-coded block queues. > {code:java} > // 1.ErasureCodingWorker.java > public void processErasureCodingTasks( > Collection ecTasks) { > for (BlockECReconstructionInfo reconInfo : ecTasks) { > int xmitsSubmitted = 0; > try { > ... > // It may throw IllegalArgumentException from task#stripedReader > // constructor. > final StripedBlockReconstructor task = > new StripedBlockReconstructor(this, stripedReconInfo); > if (task.hasValidTargets()) { > // See HDFS-12044. We increase xmitsInProgress even the task is only > // enqueued, so that > // 1) NN will not send more tasks than what DN can execute and > // 2) DN will not throw away reconstruction tasks, and instead keeps > // an unbounded number of tasks in the executor's task queue. > xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); > getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1.task > start increment > stripedReconstructionPool.submit(task); > } else { > LOG.warn("No missing internal block. Skip reconstruction for task:{}", > reconInfo); > } > } catch (Throwable e) { > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task > failed decrement > LOG.warn("Failed to reconstruct striped block {}", > reconInfo.getExtendedBlock().getLocalBlock(), e); > } > } > } > // 2.StripedBlockReconstructor.java > public void run() { > try { > initDecoderIfNecessary(); >... > } catch (Throwable e) { > LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); > getDatanode().getMetrics().incrECFailedReconstructionTasks(); > } finally { > float xmitWeight = getErasureCodingWorker().getXmitWeight(); > // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 > // because if it set to zero, we cannot to measure the xmits submitted > int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task > complete decrement > ... > } > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Description: The EC reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number, it affects NN chooses pending tasks based on the ratio between the lengths of replication and erasure-coded block queues. {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1.task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task complete decrement ... } }{code} was: The EC reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number, it affects NN chooses pending tasks based on the ratio between the lengths of replication and erasure-coded block queues. {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1.task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. 
Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task failed decrement ... } }{code} > EC: Reconstruct task failed, and the decrementXmitsInProgress operation will > be performed twice
[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Description: The EC reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number, it affects NN chooses pending tasks based on the ratio between the lengths of replication and erasure-coded block queues. {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1.task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task failed decrement ... } }{code} was: The EC reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1.task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. 
task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task failed decrement ... } }{code} > EC: Reconstruct task failed, and the decrementXmitsInProgress operation will > be performed twice > --- > > Key: HDFS-15798
[jira] [Assigned] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang reassigned HDFS-15798: Assignee: huhaiyang > EC: Reconstruct task failed, and the decrementXmitsInProgress operation will > be performed twice > --- > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: huhaiyang >Assignee: huhaiyang >Priority: Major > Attachments: HDFS-15798.001.patch > > > The EC reconstruct task failed, and the decrementXmitsInProgress operation > will be performed twice > It would be XmitsInProgress of DN has negative number > > {code:java} > // 1.ErasureCodingWorker.java > public void processErasureCodingTasks( > Collection ecTasks) { > for (BlockECReconstructionInfo reconInfo : ecTasks) { > int xmitsSubmitted = 0; > try { > ... > // It may throw IllegalArgumentException from task#stripedReader > // constructor. > final StripedBlockReconstructor task = > new StripedBlockReconstructor(this, stripedReconInfo); > if (task.hasValidTargets()) { > // See HDFS-12044. We increase xmitsInProgress even the task is only > // enqueued, so that > // 1) NN will not send more tasks than what DN can execute and > // 2) DN will not throw away reconstruction tasks, and instead keeps > // an unbounded number of tasks in the executor's task queue. > xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); > getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1.task > start increment > stripedReconstructionPool.submit(task); > } else { > LOG.warn("No missing internal block. Skip reconstruction for task:{}", > reconInfo); > } > } catch (Throwable e) { > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task > failed decrement > LOG.warn("Failed to reconstruct striped block {}", > reconInfo.getExtendedBlock().getLocalBlock(), e); > } > } > } > // 2.StripedBlockReconstructor.java > public void run() { > try { > initDecoderIfNecessary(); >... > } catch (Throwable e) { > LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); > getDatanode().getMetrics().incrECFailedReconstructionTasks(); > } finally { > float xmitWeight = getErasureCodingWorker().getXmitWeight(); > // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 > // because if it set to zero, we cannot to measure the xmits submitted > int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task > failed decrement > ... > } > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Attachment: HDFS-15798.001.patch > EC: Reconstruct task failed, and the decrementXmitsInProgress operation will > be performed twice > --- > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: huhaiyang >Priority: Major > Attachments: HDFS-15798.001.patch > > > The EC reconstruct task failed, and the decrementXmitsInProgress operation > will be performed twice > It would be XmitsInProgress of DN has negative number > > {code:java} > // 1.ErasureCodingWorker.java > public void processErasureCodingTasks( > Collection ecTasks) { > for (BlockECReconstructionInfo reconInfo : ecTasks) { > int xmitsSubmitted = 0; > try { > ... > // It may throw IllegalArgumentException from task#stripedReader > // constructor. > final StripedBlockReconstructor task = > new StripedBlockReconstructor(this, stripedReconInfo); > if (task.hasValidTargets()) { > // See HDFS-12044. We increase xmitsInProgress even the task is only > // enqueued, so that > // 1) NN will not send more tasks than what DN can execute and > // 2) DN will not throw away reconstruction tasks, and instead keeps > // an unbounded number of tasks in the executor's task queue. > xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); > getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1.task > start increment > stripedReconstructionPool.submit(task); > } else { > LOG.warn("No missing internal block. Skip reconstruction for task:{}", > reconInfo); > } > } catch (Throwable e) { > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task > failed decrement > LOG.warn("Failed to reconstruct striped block {}", > reconInfo.getExtendedBlock().getLocalBlock(), e); > } > } > } > // 2.StripedBlockReconstructor.java > public void run() { > try { > initDecoderIfNecessary(); >... > } catch (Throwable e) { > LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); > getDatanode().getMetrics().incrECFailedReconstructionTasks(); > } finally { > float xmitWeight = getErasureCodingWorker().getXmitWeight(); > // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 > // because if it set to zero, we cannot to measure the xmits submitted > int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task > failed decrement > ... > } > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Description: The EC reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // 1.task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.2. task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task failed decrement ... } }{code} was: The EC reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2. 
task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 1. task failed decrement ... } }{code} > EC: Reconstruct task failed, and the decrementXmitsInProgress operation will > be performed twice > --- > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug
[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Description: The EC reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2. task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 1. task failed decrement ... } }{code} was: The EC reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 1. task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... 
} catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2. task failed decrement ... } }{code} > EC: Reconstruct task failed, and the decrementXmitsInProgress operation will > be performed twice > --- > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug
[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Description: The EC reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 1. task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2. task failed decrement ... } }{code} was: The EC reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { StripedReconstructionInfo stripedReconInfo = new StripedReconstructionInfo( reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(), reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(), reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(), reconInfo.getTargetStorageIDs()); // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. 
Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 1. task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } {code} > EC: Reconstruct task failed, and the decrementXmitsInProgress operation will > be performed twice > --- > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: huhaiyang >Priority: Major > > The EC reconstruct task failed, and the decrementXmitsInProgress operation > will be performed twice > It would be XmitsInProgress of DN has negative number > {code:java} > // 1.ErasureCodingWorker.java > public void processErasureCodingTasks( > Col
[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Description: The EC reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { StripedReconstructionInfo stripedReconInfo = new StripedReconstructionInfo( reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(), reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(), reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(), reconInfo.getTargetStorageIDs()); // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 1. task failed decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } {code} was: The EC refactoring task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number {code:java} // ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { StripedReconstructionInfo stripedReconInfo = new StripedReconstructionInfo( reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(), reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(), reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(), reconInfo.getTargetStorageIDs()); // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. 
Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // if 1.decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } {code} > EC: Reconstruct task failed, and the decrementXmitsInProgress operation will > be performed twice > --- > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: huhaiyang >Priority: Major > > The EC reconstruct task failed, and the decrementXmitsInProgress operation > will be performed twice > It would be XmitsInProgress of DN has negative number > {code:java} > // 1.ErasureCodingWorker.java > public void processErasureCodingTasks( > Collection ecTasks) { > for (BlockECReconstructionInfo reconInfo : ecTasks) { > int xmitsSubmitted = 0; > try { > StripedReconstructionInfo stripedReconInfo = > new StripedReconstructionInfo( > reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(), > reconInfo.getLiveBlockIndices(), reconInfo.get
[jira] [Updated] (HDFS-15798) EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Summary: EC: Reconstruct task failed, and the decrementXmitsInProgress operation will be performed twice (was: EC:Reconstruction task failed, and the decrementXmitsInProgress operation will be performed twice) > EC: Reconstruct task failed, and the decrementXmitsInProgress operation will > be performed twice > --- > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: huhaiyang >Priority: Major > > The EC refactoring task failed, and the decrementXmitsInProgress operation > will be performed twice > It would be XmitsInProgress of DN has negative number > {code:java} > // ErasureCodingWorker.java > public void processErasureCodingTasks( > Collection ecTasks) { > for (BlockECReconstructionInfo reconInfo : ecTasks) { > int xmitsSubmitted = 0; > try { > StripedReconstructionInfo stripedReconInfo = > new StripedReconstructionInfo( > reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(), > reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(), > reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(), > reconInfo.getTargetStorageIDs()); > // It may throw IllegalArgumentException from task#stripedReader > // constructor. > final StripedBlockReconstructor task = > new StripedBlockReconstructor(this, stripedReconInfo); > if (task.hasValidTargets()) { > // See HDFS-12044. We increase xmitsInProgress even the task is only > // enqueued, so that > // 1) NN will not send more tasks than what DN can execute and > // 2) DN will not throw away reconstruction tasks, and instead keeps > // an unbounded number of tasks in the executor's task queue. > xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); > getDatanode().incrementXmitsInProcess(xmitsSubmitted); // increment > stripedReconstructionPool.submit(task); > } else { > LOG.warn("No missing internal block. Skip reconstruction for task:{}", > reconInfo); > } > } catch (Throwable e) { > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // if > 1.decrement > LOG.warn("Failed to reconstruct striped block {}", > reconInfo.getExtendedBlock().getLocalBlock(), e); > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15798) EC:Reconstruction task failed, and the decrementXmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Description: The EC refactoring task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number {code:java} // ErasureCodingWorker.java public void processErasureCodingTasks( Collection ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { StripedReconstructionInfo stripedReconInfo = new StripedReconstructionInfo( reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(), reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(), reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(), reconInfo.getTargetStorageIDs()); // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // if 1.decrement LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } {code} was: The EC refactoring task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number {code:java} // code placeholder {code} > EC:Reconstruction task failed, and the decrementXmitsInProgress operation > will be performed twice > - > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: huhaiyang >Priority: Major > > The EC refactoring task failed, and the decrementXmitsInProgress operation > will be performed twice > It would be XmitsInProgress of DN has negative number > {code:java} > // ErasureCodingWorker.java > public void processErasureCodingTasks( > Collection ecTasks) { > for (BlockECReconstructionInfo reconInfo : ecTasks) { > int xmitsSubmitted = 0; > try { > StripedReconstructionInfo stripedReconInfo = > new StripedReconstructionInfo( > reconInfo.getExtendedBlock(), reconInfo.getErasureCodingPolicy(), > reconInfo.getLiveBlockIndices(), reconInfo.getSourceDnInfos(), > reconInfo.getTargetDnInfos(), reconInfo.getTargetStorageTypes(), > reconInfo.getTargetStorageIDs()); > // It may throw IllegalArgumentException from task#stripedReader > // constructor. > final StripedBlockReconstructor task = > new StripedBlockReconstructor(this, stripedReconInfo); > if (task.hasValidTargets()) { > // See HDFS-12044. We increase xmitsInProgress even the task is only > // enqueued, so that > // 1) NN will not send more tasks than what DN can execute and > // 2) DN will not throw away reconstruction tasks, and instead keeps > // an unbounded number of tasks in the executor's task queue. 
> xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); > getDatanode().incrementXmitsInProcess(xmitsSubmitted); // increment > stripedReconstructionPool.submit(task); > } else { > LOG.warn("No missing internal block. Skip reconstruction for task:{}", > reconInfo); > } > } catch (Throwable e) { > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // if > 1.decrement > LOG.warn("Failed to reconstruct striped block {}", > reconInfo.getExtendedBlock().getLocalBlock(), e); > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15798) EC:Reconstruction task failed, and the decrementXmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798: - Description: The EC refactoring task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number {code:java} // code placeholder {code} was: The EC refactoring task failed, and the decrementXmitsInProgress operation will be performed twice It would be XmitsInProgress of DN has negative number > EC:Reconstruction task failed, and the decrementXmitsInProgress operation > will be performed twice > - > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: huhaiyang >Priority: Major > > The EC refactoring task failed, and the decrementXmitsInProgress operation > will be performed twice > It would be XmitsInProgress of DN has negative number > {code:java} > // code placeholder > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15798) EC:Reconstruction task failed, and the decrementXmitsInProgress operation will be performed twice
huhaiyang created HDFS-15798: Summary: EC:Reconstruction task failed, and the decrementXmitsInProgress operation will be performed twice Key: HDFS-15798 URL: https://issues.apache.org/jira/browse/HDFS-15798 Project: Hadoop HDFS Issue Type: Bug Reporter: huhaiyang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15798) EC:Reconstruction task failed, and the decrementXmitsInProgress operation will be performed twice
[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15798:
- Description:
When an EC reconstruction task fails, the decrementXmitsInProgress operation is performed twice, so the XmitsInProgress counter of the DN can go negative.

> EC:Reconstruction task failed, and the decrementXmitsInProgress operation
> will be performed twice
> -
>
> Key: HDFS-15798
> URL: https://issues.apache.org/jira/browse/HDFS-15798
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: huhaiyang
> Priority: Major
>
> When an EC reconstruction task fails, the decrementXmitsInProgress operation
> is performed twice, so the XmitsInProgress counter of the DN can go negative.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12861) Track speed in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17249676#comment-17249676 ] huhaiyang commented on HDFS-12861:
--
{code:java}
diff --git a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PipelineAck.java b/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PipelineAck.java
index be822d664f8..ea216bc04e3 100644
--- a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PipelineAck.java
+++ b/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/protocol/datatransfer/PipelineAck.java
@@ -165,6 +165,19 @@ public long getDownstreamAckTimeNanos() {
     return proto.getDownstreamAckTimeNanos();
   }

+  /**
+   * Get packet processing time of datanode at the given index in the pipeline.
+   * @param i - datanode index in the pipeline
+   */
+  public long getPacketProcessingTime(int i) {
+    if (proto.getPacketProcessingTimeNanosCount() > i) {
+      return proto.getPacketProcessingTimeNanos(i);
+    } else {
+      // Return -1 if datanode at this index didn't send this info
+      return -1;
+    }
+  }
+
   /**
    * Check if this ack contains error status
    * @return true if all statuses are SUCCESS
diff --git a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/proto/datatransfer.proto b/hadoop-hdfs-project/hadoop-hdfs-client/src/main/proto/datatransfer.proto
index 2356201f04d..dfededb7619 100644
--- a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/proto/datatransfer.proto
+++ b/hadoop-hdfs-project/hadoop-hdfs-client/src/main/proto/datatransfer.proto
@@ -260,6 +260,7 @@ message PipelineAckProto {
   repeated Status reply = 2;
   optional uint64 downstreamAckTimeNanos = 3 [default = 0];
   repeated uint32 flag = 4 [packed=true];
+  repeated uint64 packetProcessingTimeNanos = 100;
 }
{code}
Hi [~elgoiri], a question: I could not find a method in the current patch that sets the packetProcessingTimeNanos value. Am I missing something? Looking forward to your reply, thanks!

> Track speed in DFSClient
>
> Key: HDFS-12861
> URL: https://issues.apache.org/jira/browse/HDFS-12861
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Íñigo Goiri
> Assignee: María Fernanda Borge
> Priority: Major
> Attachments: HDFS-12861-10-april-18.patch
>
> Sometimes we get slow jobs because of the access to HDFS. However, it is hard
> to tell what is the actual speed. We propose to add a log line with something
> like:
> {code}
> 2017-11-19 09:55:26,309 INFO [main] hdfs.DFSClient: blk_1107222019_38144502 READ 129500B in 7ms 17.6MB/s
> 2017-11-27 19:01:04,141 INFO [DataStreamer for file /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: blk_1135792057_86833357 WRITE 131072B in 10ms 12.5MB/s
> 2017-11-27 19:01:14,219 INFO [DataStreamer for file /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: blk_1135792069_86833369 WRITE 131072B in 12ms 10.4MB/s
> 2017-11-27 19:01:24,282 INFO [DataStreamer for file /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: blk_1135792081_86833381 WRITE 131072B in 11ms 11.4MB/s
> 2017-11-27 19:01:34,330 INFO [DataStreamer for file /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: blk_1135792093_86833393 WRITE 131072B in 11ms 11.4MB/s
> 2017-11-27 19:01:44,408 INFO [DataStreamer for file /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: blk_1135792105_86833405 WRITE 131072B in 11ms 11.4MB/s
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
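[Editor's note] The quoted log lines imply a simple throughput formula. The sketch below reproduces the arithmetic; it assumes the "MB/s" in the proposed log is MiB-based (129500 B in 7 ms only works out to 17.6 with a 1,048,576-byte MB), and the ThroughputLog class and format method are invented names for illustration, not part of the HDFS-12861 patch.
{code:java}
import java.util.Locale;

public class ThroughputLog {
  /** Formats one log entry; layout mimics the quoted proposal. */
  static String format(String op, long blockId, long genStamp,
                       long bytes, long millis) {
    // "MB/s" here is MiB-based: (bytes / 2^20) divided by seconds.
    double mbPerSec = (bytes / 1048576.0) / (millis / 1000.0);
    return String.format(Locale.ROOT, "blk_%d_%d %s %dB in %dms %.1fMB/s",
        blockId, genStamp, op, bytes, millis, mbPerSec);
  }

  public static void main(String[] args) {
    // 129500 B in 7 ms -> 17.6MB/s, matching the first quoted line.
    System.out.println(format("READ", 1107222019L, 38144502L, 129500L, 7L));
  }
}
{code}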
[jira] [Updated] (HDFS-15697) Fast copy support EC for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15697: - Description: Enhance FastCopy to support EC file . (was: Enhance FastCopy to support EC file ) > Fast copy support EC for HDFS. > -- > > Key: HDFS-15697 > URL: https://issues.apache.org/jira/browse/HDFS-15697 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: huhaiyang >Assignee: huhaiyang >Priority: Major > > Enhance FastCopy to support EC file . -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15697) Fast copy support EC for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15697: - External issue ID: (was: https://issues.apache.org/jira/browse/HDFS-2139) > Fast copy support EC for HDFS. > -- > > Key: HDFS-15697 > URL: https://issues.apache.org/jira/browse/HDFS-15697 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: huhaiyang >Assignee: huhaiyang >Priority: Major > > Enhance FastCopy to support EC file -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15697) Fast copy support EC for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15697: - External issue ID: https://issues.apache.org/jira/browse/HDFS-2139 > Fast copy support EC for HDFS. > -- > > Key: HDFS-15697 > URL: https://issues.apache.org/jira/browse/HDFS-15697 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: huhaiyang >Assignee: huhaiyang >Priority: Major > > Enhance FastCopy to support EC file -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15697) Fast copy support EC for HDFS.
[ https://issues.apache.org/jira/browse/HDFS-15697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15697: - Description: Enhance FastCopy to support EC file > Fast copy support EC for HDFS. > -- > > Key: HDFS-15697 > URL: https://issues.apache.org/jira/browse/HDFS-15697 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: huhaiyang >Assignee: huhaiyang >Priority: Major > > Enhance FastCopy to support EC file -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15697) Fast copy support EC for HDFS.
huhaiyang created HDFS-15697: Summary: Fast copy support EC for HDFS. Key: HDFS-15697 URL: https://issues.apache.org/jira/browse/HDFS-15697 Project: Hadoop HDFS Issue Type: New Feature Reporter: huhaiyang Assignee: huhaiyang -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-12861) Track speed in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203054#comment-17203054 ] huhaiyang commented on HDFS-12861: -- [~elgoiri]It looks like very good work And are there plans to merge into the trunk? Thanks. > Track speed in DFSClient > > > Key: HDFS-12861 > URL: https://issues.apache.org/jira/browse/HDFS-12861 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Íñigo Goiri >Assignee: María Fernanda Borge >Priority: Major > Attachments: HDFS-12861-10-april-18.patch > > > Sometimes we get slow jobs because of the access to HDFS. However, is hard to > tell what is the actual speed. We propose to add a log line with something > like: > {code} > 2017-11-19 09:55:26,309 INFO [main] hdfs.DFSClient: blk_1107222019_38144502 > READ 129500B in 7ms 17.6MB/s > 2017-11-27 19:01:04,141 INFO [DataStreamer for file > /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: > blk_1135792057_86833357 WRITE 131072B in 10ms 12.5MB/s > 2017-11-27 19:01:14,219 INFO [DataStreamer for file > /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: > blk_1135792069_86833369 WRITE 131072B in 12ms 10.4MB/s > 2017-11-27 19:01:24,282 INFO [DataStreamer for file > /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: > blk_1135792081_86833381 WRITE 131072B in 11ms 11.4MB/s > 2017-11-27 19:01:34,330 INFO [DataStreamer for file > /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: > blk_1135792093_86833393 WRITE 131072B in 11ms 11.4MB/s > 2017-11-27 19:01:44,408 INFO [DataStreamer for file > /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: > blk_1135792105_86833405 WRITE 131072B in 11ms 11.4MB/s > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
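As a sanity check on the figures in the proposed log line: 131072 B in 10 ms is 0.125 MiB / 0.010 s = 12.5 MB/s, matching the WRITE samples in the description. A small self-contained sketch of that computation follows; the class and method names are illustrative, not taken from the attached patch.
{code:java}
public final class ThroughputFormatSketch {
  private static final double BYTES_PER_MIB = 1024.0 * 1024.0;

  // Formats a fragment like "blk_1107222019_38144502 READ 129500B in 7ms 17.6MB/s".
  static String format(String block, String op, long bytes, long millis) {
    double mbPerSec = (bytes / BYTES_PER_MIB) / (millis / 1000.0);
    return String.format("%s %s %dB in %dms %.1fMB/s", block, op, bytes, millis, mbPerSec);
  }

  public static void main(String[] args) {
    // Reproduces the first sample from the description: 17.6MB/s.
    System.out.println(format("blk_1107222019_38144502", "READ", 129500L, 7L));
  }
}
{code}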
[jira] [Comment Edited] (HDFS-12861) Track speed in DFSClient
[ https://issues.apache.org/jira/browse/HDFS-12861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17203054#comment-17203054 ] huhaiyang edited comment on HDFS-12861 at 9/28/20, 7:29 AM: [~elgoiri] It looks like very good work. Are there plans to merge it into trunk? Thanks. was (Author: haiyang hu): [~elgoiri]It looks like very good work And are there plans to merge into the trunk? Thanks. > Track speed in DFSClient > > > Key: HDFS-12861 > URL: https://issues.apache.org/jira/browse/HDFS-12861 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Íñigo Goiri >Assignee: María Fernanda Borge >Priority: Major > Attachments: HDFS-12861-10-april-18.patch > > > Sometimes we get slow jobs because of the access to HDFS. However, it is hard to > tell what the actual speed is. We propose to add a log line with something > like: > {code} > 2017-11-19 09:55:26,309 INFO [main] hdfs.DFSClient: blk_1107222019_38144502 > READ 129500B in 7ms 17.6MB/s > 2017-11-27 19:01:04,141 INFO [DataStreamer for file > /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: > blk_1135792057_86833357 WRITE 131072B in 10ms 12.5MB/s > 2017-11-27 19:01:14,219 INFO [DataStreamer for file > /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: > blk_1135792069_86833369 WRITE 131072B in 12ms 10.4MB/s > 2017-11-27 19:01:24,282 INFO [DataStreamer for file > /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: > blk_1135792081_86833381 WRITE 131072B in 11ms 11.4MB/s > 2017-11-27 19:01:34,330 INFO [DataStreamer for file > /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: > blk_1135792093_86833393 WRITE 131072B in 11ms 11.4MB/s > 2017-11-27 19:01:44,408 INFO [DataStreamer for file > /hdfs-federation/stats/2017/11/27/151183800.json] hdfs.DFSClient: > blk_1135792105_86833405 WRITE 131072B in 11ms 11.4MB/s > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190601#comment-17190601 ] huhaiyang edited comment on HDFS-15556 at 9/4/20, 7:28 AM: --- The current issue is the same as [HDFS-14042| https://issues.apache.org/jira/browse/HDFS-14042]. was (Author: haiyang hu): The current issue is the same as[HDFS-14042| https://issues.apache.org/jira/browse/HDFS-14042]. > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... 
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
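The NPE quoted above fires because storageMap.get() can return null for a storage that pruneStorageMap removed during re-registration, and the result is then dereferenced without a check. Below is a defensive sketch of the direction a fix could take in DatanodeDescriptor#updateStorageStats; it is illustrative only, not the patch attached to this issue nor the one committed for HDFS-14042.
{code:java}
for (StorageReport report : reports) {
  DatanodeStorageInfo storage;
  synchronized (storageMap) {
    storage = storageMap.get(report.getStorage().getStorageID());
  }
  if (storage == null) {
    // The DN re-registered and pruneStorageMap dropped this storage;
    // skip the report instead of hitting the NPE below.
    LOG.info("Skipping storage report for unknown storage {} on {}",
        report.getStorage().getStorageID(), this);
    continue;
  }
  if (checkFailedStorages) {
    failedStorageInfos.remove(storage);
  }
  storage.receivedHeartbeat(report);
  // ... capacity accounting continues as in the original method ...
}
{code}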
[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190601#comment-17190601 ] huhaiyang commented on HDFS-15556: -- The current issue is the same as[HDFS-14042| https://issues.apache.org/jira/browse/HDFS-14042]. > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... 
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: NN_DN.LOG > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: (was: NN_DN.LOG) > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:48 PM: [~hexiaoqiao] Thanks for your comments.
{quote}
Great catch here. v001 is fair for me, it will be better if add new unit test to cover.
{quote}
I will add a unit test to cover this later.
{quote}
I am interested that why storage is null here. Anywhere not synchronized storageMap where should do that?
{quote}
The cause of the problem is:
{quote}
1. One DataNode's heartbeat to the NN timed out. When the service recovered, a DNA_REGISTER was triggered: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat
2. While handling registerDatanode, the NN called DatanodeDescriptor#pruneStorageMap (removing entries from storageMap) for the registering DN
3. The DN re-registration took about a minute; once the heartbeat was more than 9 seconds overdue, the lifeline reported to the NN, but at that point the storage recorded for this DN on the NN was null, so the NPE occurred
{quote}
The detailed execution log is in [^NN_DN.LOG]. The relevant source code:
HeartbeatManager#updateLifeline
{code:java}
synchronized void updateLifeline(final DatanodeDescriptor node, StorageReport[] reports,
    long cacheCapacity, long cacheUsed, int xceiverCount, int failedVolumes,
    VolumeFailureSummary volumeFailureSummary) {
  stats.subtract(node); // on every DN report, nodesInServiceXceiverCount is reduced by this DN's xceiverCount
  ...
  node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount,
      failedVolumes, volumeFailureSummary); // the NPE is thrown here
  stats.add(node); // never executed when the NPE is thrown
}
{code}
BlockPlacementPolicyDefault#excludeNodeByLoad
{code:java}
boolean excludeNodeByLoad(DatanodeDescriptor node) {
  final double maxLoad = considerLoadFactor * stats.getInServiceXceiverAverage();
  // getInServiceXceiverAverage() = heartbeatManager.getInServiceXceiverCount() / getNumDatanodesInService(),
  // so the skewed xceiver count distorts the maxLoad value
  final int nodeLoad = node.getXceiverCount();
  if ((nodeLoad > maxLoad) && (maxLoad > 0)) {
    logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY,
        "(load: " + nodeLoad + " > " + maxLoad + ")");
    return true;
  }
  return false;
}
{code}
was (Author: haiyang hu): 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} 4. detailed execution log [^NN_DN.LOG] 5.Source code is: HeartbeatManager#updateLifeline {code:java} synchronized void updateLifeline(final DatanodeDescriptor node,StorageReport[] reports, long cacheCapacity, long cacheUsed,int xceiverCount, int failedVolumes, VolumeFailureSummary volumeFailureSummary) { stats.subtract(node); //Every time DN heartbeat report,nodesInServiceXceiverCount will be minus the XceiverCount of the DN of the current ... node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount, failedVolumes, volumeFailureSummary); //NPE exception occurred here throws stats.add(node); //Here logic is never executed } {code} BlockPlacementPolicyDefault#excludeNodeByLoad {code:java} boolean excludeNodeByLoad(DatanodeDescriptor node){ final double maxLoad = considerLoadFactor * stats.getInServiceXceiverAverage(); //stats.getInServiceXceiverAverage()= heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() //the final maxLoad value will be affected final int nodeLoad = node.getXceiverCount(); if ((nodeLoad > maxLoad) && (maxLoad > 0)) { logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY, "(load: " + nodeLoad + " > " + maxLoad + ")"); return true; } return false; } {code} > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processin
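The subtract/add imbalance described above is what skews maxLoad: if updateHeartbeatState throws between stats.subtract(node) and stats.add(node), the node's xceiver count stays subtracted from the aggregate. One way to keep the pair balanced, shown as a sketch against the method quoted above (illustrative only, not the attached patch):
{code:java}
synchronized void updateLifeline(final DatanodeDescriptor node,
    StorageReport[] reports, long cacheCapacity, long cacheUsed,
    int xceiverCount, int failedVolumes,
    VolumeFailureSummary volumeFailureSummary) {
  stats.subtract(node);
  try {
    node.updateHeartbeatState(reports, cacheCapacity, cacheUsed,
        xceiverCount, failedVolumes, volumeFailureSummary);
  } finally {
    // Always re-add the node so getInServiceXceiverAverage(), and hence
    // maxLoad in excludeNodeByLoad, is not skewed by a thrown exception.
    stats.add(node);
  }
}
{code}
Note that this only contains the stats skew; the underlying null storage still needs the defensive check shown earlier (or the HDFS-14042 fix) to stop the NPE itself.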
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:43 PM: 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} 4. detailed execution log [^NN_DN.LOG] 5.Source code is: HeartbeatManager#updateLifeline {code:java} synchronized void updateLifeline(final DatanodeDescriptor node,StorageReport[] reports, long cacheCapacity, long cacheUsed,int xceiverCount, int failedVolumes, VolumeFailureSummary volumeFailureSummary) { stats.subtract(node); //Every time DN heartbeat report,nodesInServiceXceiverCount will be minus the XceiverCount of the DN of the current ... node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount, failedVolumes, volumeFailureSummary); //NPE exception occurred here throws stats.add(node); //Here logic is never executed } {code} BlockPlacementPolicyDefault#excludeNodeByLoad {code:java} boolean excludeNodeByLoad(DatanodeDescriptor node){ final double maxLoad = considerLoadFactor * stats.getInServiceXceiverAverage(); //stats.getInServiceXceiverAverage()= heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() //the final maxLoad value will be affected final int nodeLoad = node.getXceiverCount(); if ((nodeLoad > maxLoad) && (maxLoad > 0)) { logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY, "(load: " + nodeLoad + " > " + maxLoad + ")"); return true; } return false; } {code} was (Author: haiyang hu): 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} 4. detailed execution log [^NN_DN.LOG] 5.Source code is: {code:java} HeartbeatManager#updateLifeline synchronized void updateLifeline(final DatanodeDescriptor node,StorageReport[] reports, long cacheCapacity, long cacheUsed,int xceiverCount, int failedVolumes, VolumeFailureSummary volumeFailureSummary) { stats.subtract(node); //Every time DN heartbeat report,nodesInServiceXceiverCount will be minus the XceiverCount of the DN of the current ... 
node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount, failedVolumes, volumeFailureSummary); //NPE exception occurred here throws stats.add(node); //Here logic is never executed } {code} {code:java} BlockPlacementPolicyDefault#excludeNodeByLoad boolean excludeNodeByLoad(DatanodeDescriptor node){ final double maxLoad = considerLoadFactor * stats.getInServiceXceiverAverage(); //stats.getInServiceXceiverAverage()= heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() //the final maxLoad value will be affected final int nodeLoad = node.getXceiverCount(); if ((nodeLoad > maxLoad) && (maxLoad > 0)) { logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY, "(load: " + nodeLoad + " > " + maxLoad + ")"); return true; } return false; } {code} > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:42 PM: 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} 4. detailed execution log [^NN_DN.LOG] 5.Source code is: {code:java} HeartbeatManager#updateLifeline synchronized void updateLifeline(final DatanodeDescriptor node,StorageReport[] reports, long cacheCapacity, long cacheUsed,int xceiverCount, int failedVolumes, VolumeFailureSummary volumeFailureSummary) { stats.subtract(node); //Every time DN heartbeat report,nodesInServiceXceiverCount will be minus the XceiverCount of the DN of the current ... node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount, failedVolumes, volumeFailureSummary); //NPE exception occurred here throws stats.add(node); //Here logic is never executed } {code} {code:java} BlockPlacementPolicyDefault#excludeNodeByLoad boolean excludeNodeByLoad(DatanodeDescriptor node){ final double maxLoad = considerLoadFactor * stats.getInServiceXceiverAverage(); //stats.getInServiceXceiverAverage()= heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() //the final maxLoad value will be affected final int nodeLoad = node.getXceiverCount(); if ((nodeLoad > maxLoad) && (maxLoad > 0)) { logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY, "(load: " + nodeLoad + " > " + maxLoad + ")"); return true; } return false; } {code} was (Author: haiyang hu): 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} 4. detailed execution log [^NN_DN.LOG] 5.Source code is: {code:java} HeartbeatManager#updateLifeline synchronized void updateLifeline(final DatanodeDescriptor node,StorageReport[] reports, long cacheCapacity, long cacheUsed, int xceiverCount, int failedVolumes, VolumeFailureSummary volumeFailureSummary) { stats.subtract(node); //Every time DN heartbeat report,nodesInServiceXceiverCount will be minus the XceiverCount of the DN of the current ... 
node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount, failedVolumes, volumeFailureSummary); //NPE exception occurred here throws stats.add(node); //Here logic is never executed } BlockPlacementPolicyDefault#excludeNodeByLoad boolean excludeNodeByLoad(DatanodeDescriptor node){ final double maxLoad = considerLoadFactor * stats.getInServiceXceiverAverage(); //stats.getInServiceXceiverAverage()= heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() //the final maxLoad value will be affected final int nodeLoad = node.getXceiverCount(); if ((nodeLoad > maxLoad) && (maxLoad > 0)) { logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY, "(load: " + nodeLoad + " > " + maxLoad + ")"); return true; } return false; } {code} > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cl
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:41 PM: 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} 4. detailed execution log [^NN_DN.LOG] 5.Source code is: {code:java} HeartbeatManager#updateLifeline synchronized void updateLifeline(final DatanodeDescriptor node,StorageReport[] reports, long cacheCapacity, long cacheUsed, int xceiverCount, int failedVolumes, VolumeFailureSummary volumeFailureSummary) { stats.subtract(node); //Every time DN heartbeat report,nodesInServiceXceiverCount will be minus the XceiverCount of the DN of the current ... node.updateHeartbeatState(reports, cacheCapacity, cacheUsed, xceiverCount, failedVolumes, volumeFailureSummary); //NPE exception occurred here throws stats.add(node); //Here logic is never executed } BlockPlacementPolicyDefault#excludeNodeByLoad boolean excludeNodeByLoad(DatanodeDescriptor node){ final double maxLoad = considerLoadFactor * stats.getInServiceXceiverAverage(); //stats.getInServiceXceiverAverage()= heartbeatManager.getInServiceXceiverCount()/getNumDatanodesInService() //the final maxLoad value will be affected final int nodeLoad = node.getXceiverCount(); if ((nodeLoad > maxLoad) && (maxLoad > 0)) { logNodeIsNotChosen(node, NodeNotChosenReason.NODE_TOO_BUSY, "(load: " + nodeLoad + " > " + maxLoad + ")"); return true; } return false; } {code} was (Author: haiyang hu): 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} //execution log [^NN_DN.LOG] > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. 
> *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.cal
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:39 PM: 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} //execution log [^NN_DN.LOG] was (Author: haiyang hu): 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} //execution log > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. 
> *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code}
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:38 PM: 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} //execution log was (Author: haiyang hu): 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} {code:java} //execution log //NameNode LOG: #registered DN: 2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(xxx:50010,xxx) storage xxx 2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: xxx:50010 2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: xx:50010 2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: [DISK]:NORMAL:xxx:50010 failed. 2020-08-25 00:58:53,978 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed storage [DISK]xxx:FAILED:xxx:50010 from DataNode xxx:50010 ... #sendLifeline NPE: from 2020-08-25 00:59:02,977 to 2020-08-25 00:59:45,668, It keeps occurred the NPE 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8022, call Call#20535 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from DN:34766 java.lang.NullPointerException ... 2020-08-25 00:59:45,668 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8022, call Call#67833 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from DN:34766 java.lang.NullPointerException ... 
#DN sendHeartBeat the NN will add storageMap: 2020-08-25 00:59:46,632 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new storage ID xxx for DN xxx:50010 DN LOG: #DN run DNA_REGISTER 2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from NN:8021 with active state 2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-xxx (Datanode Uuid xxx) service to NN:8021 beginning handshake with NN 2020-08-25 00:59:02,976 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in LifelineSender for Block pool XXX service to NN:8021 org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) at org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) at org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) at org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subje
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: NN_DN.LOG > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN_DN.LOG > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 12:36 PM: 3.the cause of occurred the problem is: {quote} 1.One DataNode reports heartbeat to NN timed out, The DNA_REGISTER will be occurred when the service is restored: BPServiceActor#run->offerService->processCommand->reRegister->sendHeartBeat 2.NN run registerDatanode will DatanodeDescriptor#pruneStorageMap (remove storageMap) for the registered DN 3.DN reRegister it took about a minute, after the heartbeat exceeds 9 seconds, the Lifeline reports to NN, But at this point, the storageMap is null of the DN is recorded at the NN occurred NPE {quote} {code:java} //execution log //NameNode LOG: #registered DN: 2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(xxx:50010,xxx) storage xxx 2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: xxx:50010 2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: xx:50010 2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: [DISK]:NORMAL:xxx:50010 failed. 2020-08-25 00:58:53,978 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed storage [DISK]xxx:FAILED:xxx:50010 from DataNode xxx:50010 ... #sendLifeline NPE: from 2020-08-25 00:59:02,977 to 2020-08-25 00:59:45,668, It keeps occurred the NPE 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8022, call Call#20535 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from DN:34766 java.lang.NullPointerException ... 2020-08-25 00:59:45,668 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8022, call Call#67833 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from DN:34766 java.lang.NullPointerException ... 
#DN sendHeartBeat the NN will add storageMap: 2020-08-25 00:59:46,632 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new storage ID xxx for DN xxx:50010 DN LOG: #DN run DNA_REGISTER 2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from NN:8021 with active state 2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-xxx (Datanode Uuid xxx) service to NN:8021 beginning handshake with NN 2020-08-25 00:59:02,976 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in LifelineSender for Block pool XXX service to NN:8021 org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) at org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) at org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) at org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511) at org.apache.hadoop.ipc.Client.call(Client.java:1457) at org.apache.hadoop.ipc.Client.call(Client.java:1367) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) at com.sun.proxy.$Proxy
[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17190108#comment-17190108 ] huhaiyang commented on HDFS-15556:
--
3. The cause of the problem is:
{quote}
1. A DataNode's heartbeat to the NN times out; when the service recovers, the NN answers with DNA_REGISTER: BPServiceActor#run -> offerService -> processCommand -> reRegister -> sendHeartBeat
2. While handling registerDatanode, the NN runs DatanodeDescriptor#pruneStorageMap, which removes the re-registering DN's entries from storageMap.
3. The DN's re-registration took about a minute; because the heartbeat was delayed more than 9 seconds, the lifeline thread reported to the NN instead. At that point the DN's storageMap recorded on the NN no longer contained the storage, the lookup returned null, and the NPE occurred.
{quote}
{code:java}
//execution log
//NameNode LOG:
#registered DN:
2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* registerDatanode: from DatanodeRegistration(xxx:50010,xxx) storage xxx
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Removing a node: xxx:50010
2020-08-25 00:58:53,977 INFO org.apache.hadoop.net.NetworkTopology: Adding a new node: xx:50010
2020-08-25 00:58:53,977 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: [DISK]:NORMAL:xxx:50010 failed.
2020-08-25 00:58:53,978 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Removed storage [DISK]xxx:FAILED:xxx:50010 from DataNode xxx:50010
...
#sendLifeline NPE: from 2020-08-25 00:59:02,977 to 2020-08-25 00:59:45,668 the NPE keeps occurring
2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler 5 on 8022, call Call#20535 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from DN:34766
java.lang.NullPointerException
...
2020-08-25 00:59:45,668 WARN org.apache.hadoop.ipc.Server: IPC Server handler 1 on 8022, call Call#67833 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from DN:34766
java.lang.NullPointerException
...
#DN sendHeartBeat, the NN will re-add the storage to storageMap:
2020-08-25 00:59:46,632 INFO org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor: Adding new storage ID xxx for DN xxx:50010

DN LOG:
#DN run DNA_REGISTER
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeCommand action : DNA_REGISTER from NN:8021 with active state
2020-08-25 00:58:53,975 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-xxx (Datanode Uuid xxx) service to NN:8021 beginning handshake with NN
2020-08-25 00:59:02,976 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: IOException in LifelineSender for Block pool XXX service to NN:8021
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460)
	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390)
	at org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
	at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
	at org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
	at org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:422)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
	at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1511)
	at org.apache.hadoop.ipc.Client.call(Client.java:1457)
	at org.apache.hadoop.ipc.Client.call(Client.java:1367)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
	at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
	at com.sun.proxy.$Proxy21.sendLifeline(Unknown Source) at
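To make the race window concrete, below is a tiny self-contained demo (hypothetical code, not HDFS source; the class name and map contents are invented stand-ins) of the same failure pattern: a re-registration prunes a map entry, and a lifeline-style reader then dereferences the missing value.
{code:java}
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the NN-side race described above.
public class LifelineRaceDemo {
  private static final Map<String, StringBuilder> storageMap = new HashMap<>();

  public static void main(String[] args) {
    // NN state before the incident: the DN's storage is registered.
    storageMap.put("DS-xxx", new StringBuilder("storage stats"));

    // Step 2 of the analysis: registerDatanode -> pruneStorageMap
    // removes the entry for the re-registering DN.
    synchronized (storageMap) {
      storageMap.remove("DS-xxx");
    }

    // Step 3: a lifeline report arrives before the next full heartbeat
    // has re-added the storage, so the lookup returns null.
    StringBuilder storage;
    synchronized (storageMap) {
      storage = storageMap.get("DS-xxx");
    }
    storage.append("lifeline report"); // throws NullPointerException here
  }
}
{code}
Running it throws the same java.lang.NullPointerException seen in the NameNode log.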
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189991#comment-17189991 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 9:30 AM:
---
1. NameNode CPU is high; the thread stack is:
{code:java}
"IPC Server handler 59 on 8020" #244 daemon prio=5 os_prio=0 tid=0x7f18b0ff7800 nid=0x1c006 runnable [0x7f185cbfc000]
   java.lang.Thread.State: RUNNABLE
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
	at org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
	at org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
	at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
	at org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:903)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:800)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:768)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseFromNextRack(BlockPlacementPolicyDefault.java:719)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:687)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:534)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:440)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:310)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:149)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:174)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2239)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2828)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:913)
{code}
2. There are a large number of these logs, and in extreme cases none of the cluster's DN nodes satisfies the allocation:
{code:java}
2020-08-25 01:38:50,370 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not enough replicas was chosen.
Reason:{NODE_TOO_BUSY=xxx}
2020-08-25 01:38:50,370 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Failed to place enough replicas, still in need of 3 to reach 3 (unavailableStoragrom storage xxx node DatanodeRegistration(:50010, datanodeUuid=xxx, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-57;cid=xxx;nsid=;c=0), blocks: 2266, hasStaleStorage: false, processing time: 7 msecs, invalidatedBlocks: 0
2020-08-25 01:38:50,370 INFO org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicy: Not enough replicas was chosen. Reason:{NODE_TOO_BUSY=xxx}
{code}

was (Author: haiyang hu):
1. CPU NameNode high, thread stack is
{code:java}
"IPC Server handler 59 on 8020" #244 daemon prio=5 os_prio=0 tid=0x7f18b0ff7800 nid=0x1c006 runnable [0x7f185cbfc000]
   java.lang.Thread.State: RUNNABLE
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
	at org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
	at org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
	at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
	at org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:903)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementP
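For context on the NODE_TOO_BUSY reason above: it comes from the load check in BlockPlacementPolicyDefault. Roughly (a simplified sketch; names are approximations and the exact signatures vary across Hadoop versions), a node is rejected when its xceiver count exceeds a multiple of the cluster-wide in-service average:
{code:java}
// Simplified sketch of the load-based exclusion in
// BlockPlacementPolicyDefault (approximate, not verbatim source).
boolean excludeNodeByLoad(DatanodeDescriptor node) {
  // Average transceiver count over in-service nodes.
  final double inServiceXceiverAvg = stats.getInServiceXceiverAverage();
  final double maxLoad = considerLoadFactor * inServiceXceiverAvg;
  if (maxLoad > 0 && node.getXceiverCount() > maxLoad) {
    // Rejected; surfaces in the log as Reason:{NODE_TOO_BUSY=...}
    return true;
  }
  return false;
}
{code}
Since the NPE keeps the lifeline from updating the DN's stats, the NN's view of node load is skewed, candidates keep failing this check, and chooseTarget retries in a loop, which matches the RUNNABLE NetworkTopology stack and the flood of "Not enough replicas was chosen" logs above.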
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189991#comment-17189991 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 9:25 AM:
---
1. NameNode CPU is high; the thread stack is:
{code:java}
"IPC Server handler 59 on 8020" #244 daemon prio=5 os_prio=0 tid=0x7f18b0ff7800 nid=0x1c006 runnable [0x7f185cbfc000]
   java.lang.Thread.State: RUNNABLE
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1282)
	at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
	at org.apache.hadoop.net.NetworkTopology.getNode(NetworkTopology.java:263)
	at org.apache.hadoop.net.NetworkTopology.countNumOfAvailableNodes(NetworkTopology.java:678)
	at org.apache.hadoop.net.NetworkTopology.chooseRandom(NetworkTopology.java:533)
	at org.apache.hadoop.hdfs.net.DFSNetworkTopology.chooseRandomWithStorageTypeTwoTrial(DFSNetworkTopology.java:122)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseDataNode(BlockPlacementPolicyDefault.java:903)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:800)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:768)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseFromNextRack(BlockPlacementPolicyDefault.java:719)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseLocalRack(BlockPlacementPolicyDefault.java:687)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:534)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:440)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:310)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:149)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:174)
	at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:2239)
	at org.apache.hadoop.hdfs.server.namenode.FSDirWriteFileOp.chooseTargetForNewBlock(FSDirWriteFileOp.java:294)
	at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2828)
	at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:913)
{code}
2.

was (Author: haiyang hu):
# CPU NameNode high, thread stack is
#

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. 
> *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.D
[jira] [Comment Edited] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189991#comment-17189991 ] huhaiyang edited comment on HDFS-15556 at 9/3/20, 9:24 AM: --- # CPU NameNode high, thread stack is # was (Author: haiyang hu): # CPU NameNode high, thread stack is !NN-jstack.png! # > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... 
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: (was: NN-jstack.png) > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17189991#comment-17189991 ] huhaiyang commented on HDFS-15556: -- # CPU NameNode high, thread stack is !NN-jstack.png! # > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN-jstack.png > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... 
> {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: NN-jstack.png > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN-jstack.png > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: screenshot-1.png > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN-jstack.png > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: (was: screenshot-1.png) > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png, NN-jstack.png > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: NN-CPU.png > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch, NN-CPU.png > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: HDFS-15556.001.patch > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: (was: NN-CPU.png) > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: HDFS-15556.001.patch > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556: - Attachment: NN-CPU.png > Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline > > > Key: HDFS-15556 > URL: https://issues.apache.org/jira/browse/HDFS-15556 > Project: Hadoop HDFS > Issue Type: Bug > Components: namenode >Affects Versions: 3.2.0 >Reporter: huhaiyang >Priority: Critical > Attachments: NN-CPU.png > > > In our cluster, the NameNode appears NPE when processing lifeline messages > sent by the DataNode, which will cause an maxLoad exception calculated by NN. > because DataNode is identified as busy and unable to allocate available nodes > in choose DataNode, program loop execution results in high CPU and reduces > the processing performance of the cluster. > *NameNode the exception stack*: > {code:java} > 2020-08-25 00:59:02,977 WARN org.apache.hadoop.ipc.Server: IPC Server handler > 5 on 8022, call Call#20535 Retry#0 > org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline > from x:34766 > java.lang.NullPointerException > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:460) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:390) > at > org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254) > at > org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1805) > at > org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039) > at > org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761) > at > org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62) > at > org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886) > at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717) > {code} > {code:java} > // DatanodeDescriptor#updateStorageStats > ... > for (StorageReport report : reports) { > DatanodeStorageInfo storage = null; > synchronized (storageMap) { > storage = > storageMap.get(report.getStorage().getStorageID()); > } > if (checkFailedStorages) { > failedStorageInfos.remove(storage); > } > storage.receivedHeartbeat(report); // NPE exception occurred here > // skip accounting for capacity of PROVIDED storages! > if (StorageType.PROVIDED.equals(storage.getStorageType())) { > continue; > } > ... > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556:
-
Description:
In our cluster, the NameNode throws an NPE when processing lifeline messages sent by a DataNode, which corrupts the maxLoad value calculated by the NN. During DataNode selection, DataNodes are then identified as busy and no available node can be allocated; the resulting retry loop drives CPU usage high and reduces the processing performance of the cluster.
{code:java}
NameNode the exception stack:
2020-09-02 11:01:57,043 DEBUG org.apache.hadoop.ipc.Server: Served: sendLifeline, queueTime= 2 procesingTime= 0 exception= NullPointerException
2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler 0 on 8022, call Call#68269 Retry#0 org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline from xxx:47138
java.lang.NullPointerException
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
at org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
at org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
at org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
{code}

was:
In our cluster, the NameNode throws an NPE when processing lifeline messages sent by a DataNode, which corrupts the maxLoad value calculated by the NN. During DataNode selection, DataNodes are then identified as busy and no available node can be allocated; the resulting retry loop drives CPU usage high and reduces the processing performance of the cluster.

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
>
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> During DataNode selection, DataNodes are then identified as busy and no
> available node can be allocated; the resulting retry loop drives CPU usage
> high and reduces the processing performance of the cluster.
> {code:java}
> NameNode the exception stack:
> 2020-09-02 11:01:57,043 DEBUG org.apache.hadoop.ipc.Server: Served:
> sendLifeline, queueTime= 2 procesingTime= 0 exception= NullPointerException
> 2020-09-02 11:01:57,044 WARN org.apache.hadoop.ipc.Server: IPC Server handler
> 0 on 8022, call Call#68269 Retry#0
> org.apache.hadoop.hdfs.server.protocol.DatanodeLifelineProtocol.sendLifeline
> from xxx:47138
> java.lang.NullPointerException
> at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateStorageStats(DatanodeDescriptor.java:475)
> at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor.updateHeartbeatState(DatanodeDescriptor.java:391)
> at org.apache.hadoop.hdfs.server.blockmanagement.HeartbeatManager.updateLifeline(HeartbeatManager.java:254)
> at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.handleLifeline(DatanodeManager.java:1825)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.handleLifeline(FSNamesystem.java:4039)
> at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.sendLifeline(NameNodeRpcServer.java:1761)
> at org.apache.hadoop.hdfs.protocolPB.DatanodeLifelineProtocolServerSideTranslatorPB.sendLifeline(DatanodeLifelineProtocolServerSideTranslatorPB.java:62)
> at org.apache.hadoop.hdfs.protocol.proto.DatanodeLifelineProtocolProtos$DatanodeLifelineProtocolService$2.callBlockingMethod(DatanodeLifelineProtocolProtos.java:409)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2717)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556:
-
Description:
In our cluster, the NameNode throws an NPE when processing lifeline messages sent by a DataNode, which corrupts the maxLoad value calculated by the NN. During DataNode selection, DataNodes are then identified as busy and no available node can be allocated; the resulting retry loop drives CPU usage high and reduces the processing performance of the cluster.

was:
In our cluster, the NameNode throws an NPE when processing lifeline messages sent by a DataNode, which corrupts the maxLoad value calculated by the NN. During DataNode selection, DataNodes are then identified as busy and no available node can be allocated; the resulting retry loop drives CPU usage high and reduces the processing performance of the cluster.

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
>
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> During DataNode selection, DataNodes are then identified as busy and no
> available node can be allocated; the resulting retry loop drives CPU usage
> high and reduces the processing performance of the cluster.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
[ https://issues.apache.org/jira/browse/HDFS-15556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] huhaiyang updated HDFS-15556:
-
Description:
In our cluster, the NameNode throws an NPE when processing lifeline messages sent by a DataNode, which corrupts the maxLoad value calculated by the NN. During DataNode selection, DataNodes are then identified as busy and no available node can be allocated; the resulting retry loop drives CPU usage high and reduces the processing performance of the cluster.

> Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
>
>
> Key: HDFS-15556
> URL: https://issues.apache.org/jira/browse/HDFS-15556
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our cluster, the NameNode throws an NPE when processing lifeline messages
> sent by a DataNode, which corrupts the maxLoad value calculated by the NN.
> During DataNode selection, DataNodes are then identified as busy and no
> available node can be allocated; the resulting retry loop drives CPU usage
> high and reduces the processing performance of the cluster.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15556) Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
huhaiyang created HDFS-15556:
Summary: Fix NPE in DatanodeDescriptor#updateStorageStats when handle DN Lifeline
Key: HDFS-15556
URL: https://issues.apache.org/jira/browse/HDFS-15556
Project: Hadoop HDFS
Issue Type: Bug
Components: namenode
Affects Versions: 3.2.0
Reporter: huhaiyang
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135712#comment-17135712 ] huhaiyang commented on HDFS-15391:
--
Thanks [~hexiaoqiao] for helping to solve this.
> Standby NameNode due loads the corruption edit log, the service exits and
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our production environment running cluster version 3.2.0,
> we found that due to edit log corruption, the Standby NameNode could not
> properly load the edit log, resulting in abnormal exit of the service and
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replicated file), and
> when an exception occurs while writing the file, the following operations
> are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17135698#comment-17135698 ] huhaiyang commented on HDFS-15391:
--
[~liuml07] Thank you for the reply! The current issue is the same as [HDFS-15175|https://issues.apache.org/jira/browse/HDFS-15175], and [HDFS-15175|https://issues.apache.org/jira/browse/HDFS-15175] already has a submitted patch that is ready to fix it.
> Standby NameNode due loads the corruption edit log, the service exits and
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our production environment running cluster version 3.2.0,
> we found that due to edit log corruption, the Standby NameNode could not
> properly load the edit log, resulting in abnormal exit of the service and
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replicated file), and
> when an exception occurs while writing the file, the following operations
> are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15175) Multiple CloseOp shared block instance causes the standby namenode to crash when rolling editlog
[ https://issues.apache.org/jira/browse/HDFS-15175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17134129#comment-17134129 ] huhaiyang commented on HDFS-15175:
--
Hi [~wanchang], thank you for the reply. I described the relevant information in [HDFS-15175|https://issues.apache.org/jira/browse/HDFS-15175]. Our current code does compatibility handling and skips the exception op. Let me take a look at your patch. Thank you again!
> Multiple CloseOp shared block instance causes the standby namenode to crash
> when rolling editlog
>
>
> Key: HDFS-15175
> URL: https://issues.apache.org/jira/browse/HDFS-15175
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.9.2
> Reporter: Yicong Cai
> Assignee: Yicong Cai
> Priority: Critical
>
> {panel:title=Crash exception}
> 2020-02-16 09:24:46,426 [507844305] - ERROR [Edit log tailer:FSEditLogLoader@245] - Encountered exception on operation CloseOp
> [length=0, inodeId=0, path=..., replication=3, mtime=1581816138774, atime=1581814760398, blockSize=536870912, blocks=[blk_5568434562_4495417845],
> permissions=da_music:hdfs:rw-r-, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, opCode=OP_CLOSE, txid=32625024993]
> java.io.IOException: File is not under construction: ..
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:442)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:237)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:146)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:891)
> at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:872)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer.doTailEdits(EditLogTailer.java:262)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.doWork(EditLogTailer.java:395)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.access$300(EditLogTailer.java:348)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread$1.run(EditLogTailer.java:365)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:360)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1873)
> at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:479)
> at org.apache.hadoop.hdfs.server.namenode.ha.EditLogTailer$EditLogTailerThread.run(EditLogTailer.java:361)
> {panel}
>
> {panel:title=Editlog}
> OP_REASSIGN_LEASE TXID=32625021150 LEASEHOLDER=DFSClient_NONMAPREDUCE_-969060727_197760 PATH=.. NEWHOLDER=DFSClient_NONMAPREDUCE_1000868229_201260
> OP_CLOSE TXID=32625023743 LENGTH=0 INODEID=0 PATH=.. REPLICATION=3 MTIME=1581816135883 ATIME=1581814760398 BLOCKSIZE=536870912 OVERWRITE=false BLOCK: 5568434562 185818644 4495417845 PERMISSION_STATUS=da_music:hdfs:416
> OP_TRUNCATE TXID=32625024049 SRC=.. CLIENTNAME=DFSClient_NONMAPREDUCE_1000868229_201260 CLIENTMACHINE=.. NEWLENGTH=185818644 TIMESTAMP=1581816136336 TRUNCATEBLOCK: 5568434562 185818648 4495417845
> OP_CLOSE TXID=32625024993 LENGTH=0 INODEID=0 PATH=.. REPLICATION=3 MTIME=1581816138774 ATIME=1581814760398 BLOCKSIZE=536870912 OVERWRITE=false BLOCK: 5568434562 185818644 4495417845 PERMISSION_STATUS=da_music:hdfs:416
> {panel}
>
> The block size should be 185818648 in the first CloseOp. When truncate is
> used, the block size becomes 185818644. The CloseOp/TruncateOp/CloseOp sequence
> is synchronized to the JournalNode in the same batch. The block used by both
> CloseOps is the same instance, which causes the first CloseOp to record the
> wrong block size. When the SNN rolls the edit log, TruncateOp does not put the
> file back into the UnderConstruction state. Then, when the second CloseOp is
> executed, the file is not in the UnderConstruction state, and the SNN crashes.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
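The aliasing described in that report is easy to reproduce outside Hadoop. The sketch below uses hypothetical types, not the real FSEditLogOp classes: two logged ops capture the same mutable Block instance, the block is truncated between them, and when serialization happens later (as in a rolling edit-log batch) the first op shows the post-truncate size. A deep copy at capture time preserves the close-time size.
{code:java}
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch (hypothetical names) of the shared-Block aliasing bug: ops that
 * hold a reference to a live, mutable Block see later mutations when they
 * are finally serialized; ops that hold a deep copy do not.
 */
public class SharedBlockSketch {

  static final class Block {
    final long id;
    long numBytes;
    Block(long id, long numBytes) { this.id = id; this.numBytes = numBytes; }
    Block copy() { return new Block(id, numBytes); } // defensive deep copy
    @Override public String toString() { return "blk_" + id + " size=" + numBytes; }
  }

  public static void main(String[] args) {
    Block block = new Block(5568434562L, 185818648L);
    List<Block> loggedOps = new ArrayList<>();

    loggedOps.add(block);          // first CloseOp: shares the live instance
    block.numBytes = 185818644L;   // TruncateOp mutates the same object
    loggedOps.add(block);          // second CloseOp

    // Deferred serialization: both entries now show the truncated size.
    System.out.println("aliased: " + loggedOps);

    loggedOps.clear();
    block = new Block(5568434562L, 185818648L);
    loggedOps.add(block.copy());   // first CloseOp: snapshot at close time
    block.numBytes = 185818644L;   // truncate no longer affects the snapshot
    loggedOps.add(block.copy());
    System.out.println("copied:  " + loggedOps); // sizes differ correctly
  }
}
{code}
The aliased run prints size 185818644 for both ops, which is exactly the wrong-first-CloseOp symptom in the crash above; the copied run keeps 185818648 in the first op.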
[jira] [Comment Edited] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133910#comment-17133910 ] huhaiyang edited comment on HDFS-15391 at 6/12/20, 6:14 AM:
--
[~ayushtkn] Thank you for the reply! I will try to reproduce it; however, the problem has not yet been reproduced in the test environment. I will follow up and see if I can reproduce it.
{quote}
{quote} The block used by CloseOp twice is the same instance, which causes the first CloseOp to record the wrong block size. {quote}
didn't quite understand this.
{quote}
In the first CloseOp (TXID=126060942290), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480, but in fact in the first CloseOp block_11382080753's block size should be 108764672 and its GENSTAMP should be 10354154184. And in the second CloseOp (TXID=126060943585), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480. The block_11382080753 used by both CloseOps is the same instance, so the first CloseOp has wrong block information.

was (Author: haiyang hu):
[~ayushtkn] Thank you for the reply! I will try to reproduce it; however, the problem has not yet been reproduced in the test environment. I will follow up and see if I can reproduce it.
{quote}
{quote} The block used by CloseOp twice is the same instance, which causes the first CloseOp to record the wrong block size. {quote}
didn't quite understand this.
{quote}
In the first CloseOp (TXID=126060942290), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480, but in fact in the first CloseOp block_11382080753's block size should be 108764672 and its GENSTAMP should be 10354071495. And in the second CloseOp (TXID=126060943585), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480. The block_11382080753 used by both CloseOps is the same instance, so the first CloseOp has wrong block information.

> Standby NameNode due loads the corruption edit log, the service exits and
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our production environment running cluster version 3.2.0,
> we found that due to edit log corruption, the Standby NameNode could not
> properly load the edit log, resulting in abnormal exit of the service and
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replicated file), and
> when an exception occurs while writing the file, the following operations
> are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133245#comment-17133245 ] huhaiyang edited comment on HDFS-15391 at 6/12/20, 6:13 AM:
--
Hi [~ayushtkn], could you please take a look at this issue?
{quote}
2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: hdfs://path
{quote}
Related edit log transactions
{noformat}
1. TXID=126060182153 OP_TRUNCATE time=1591266465492 (2020-06-04 18:27:45) NEWLENGTH=868460715 blocks: ... 11382080753 103364934 10354049310
2. TXID=126060182170 OP_REASSIGN_LEASE
3. TXID=126060308267 OP_CLOSE 1591266492080 (2020-06-04 18:28:12) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354049316
4. TXID=126060311503 OP_APPEND
5. TXID=126060311717 OP_SET_GENSTAMP_V2 10354071495
6. TXID=126060313001 OP_UPDATE_BLOCKS blocks: ... 11382080753 63154347 10354071495
7. TXID=126060921400 OP_SET_GENSTAMP_V2 10354154184
8. TXID=126060921401 OP_REASSIGN_LEASE
9. TXID=126060942290 OP_CLOSE 1591266619003 (2020-06-04 18:30:19) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354157480
10. TXID=126060942548 OP_SET_GENSTAMP_V2 10354157480
11. TXID=126060942549 OP_TRUNCATE 868460715 1591266619207 (2020-06-04 18:30:19) blocks: ... 11382080753 108764672 10354157480
12. TXID=126060943585 OP_CLOSE 1591266620287 (2020-06-04 18:30:20) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354157480
{noformat}
The block size should be 108764672 in the first CloseOp (TXID=126060942290). When truncate is used, the block size is 63154347. The block used by both CloseOps is the same instance, which causes the first CloseOp to record the wrong block size. When the second CloseOp (TXID=126060943585) is executed, the file is not in the UnderConstruction state, and the SNN goes down.

was (Author: haiyang hu):
Hi [~ayushtkn], could you please take a look at this issue?
{quote}
2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: hdfs://path
{quote}
Related edit log transactions
{noformat}
1. TXID=126060182153 OP_TRUNCATE time=1591266465492 (2020-06-04 18:27:45) NEWLENGTH=868460715 blocks: ... 11382080753 103364934 10354049310
2. TXID=126060182170 OP_REASSIGN_LEASE
3. TXID=126060308267 OP_CLOSE 1591266492080 (2020-06-04 18:28:12) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354049316
4. TXID=126060311503 OP_APPEND
5. TXID=126060311717 OP_SET_GENSTAMP_V2 10354071495
6. TXID=126060313001 OP_UPDATE_BLOCKS blocks: ... 11382080753 63154347 10354071495
7. TXID=126060921401 OP_REASSIGN_LEASE
8. TXID=126060942290 OP_CLOSE 1591266619003 (2020-06-04 18:30:19) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354157480
9. TXID=126060942548 OP_SET_GENSTAMP_V2 10354157480
10. TXID=126060942549 OP_TRUNCATE 868460715 1591266619207 (2020-06-04 18:30:19) blocks: ... 11382080753 108764672 10354157480
11. TXID=126060943585 OP_CLOSE 1591266620287 (2020-06-04 18:30:20) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354157480
{noformat}
The block size should be 108764672 in the first CloseOp (TXID=126060942290). When truncate is used, the block size is 63154347. The block used by both CloseOps is the same instance, which causes the first CloseOp to record the wrong block size. When the second CloseOp (TXID=126060943585) is executed, the file is not in the UnderConstruction state, and the SNN goes down.

> Standby NameNode due loads the corruption edit log, the service exits and
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our production environment running cluster version 3.2.0,
> we found that due to edit log corruption, the Standby NameNode could not
> properly load the edit log, resulting in abnormal exit of the service and
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replicated file), and
> when an exception occurs while writing the file, the following operations
> are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133910#comment-17133910 ] huhaiyang edited comment on HDFS-15391 at 6/12/20, 4:23 AM:
--
[~ayushtkn] Thank you for the reply! I will try to reproduce it; however, the problem has not yet been reproduced in the test environment. I will follow up and see if I can reproduce it.
{quote}
{quote} The block used by CloseOp twice is the same instance, which causes the first CloseOp to record the wrong block size. {quote}
didn't quite understand this.
{quote}
In the first CloseOp (TXID=126060942290), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480, but in fact in the first CloseOp block_11382080753's block size should be 108764672 and its GENSTAMP should be 10354071495. And in the second CloseOp (TXID=126060943585), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480. The block_11382080753 used by both CloseOps is the same instance, so the first CloseOp has wrong block information.

was (Author: haiyang hu):
[~ayushtkn] Thank you for the reply! I will try to reproduce it; however, the problem has not yet been reproduced in the test environment. I will follow up and see if I can reproduce it.
{quote}
{quote}The block used by CloseOp twice is the same instance, which causes the first CloseOp to record the wrong block size. {quote}
didn't quite understand this.
{quote}
In the first CloseOp (TXID=126060942290), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480, but in fact in the first CloseOp block_11382080753's block size should be 108764672 and its GENSTAMP should be 10354071495. And in the second CloseOp (TXID=126060943585), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480. The block_11382080753 used by both CloseOps is the same instance, so the first CloseOp has wrong block information.

> Standby NameNode due loads the corruption edit log, the service exits and
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our production environment running cluster version 3.2.0,
> we found that due to edit log corruption, the Standby NameNode could not
> properly load the edit log, resulting in abnormal exit of the service and
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replicated file), and
> when an exception occurs while writing the file, the following operations
> are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133910#comment-17133910 ] huhaiyang edited comment on HDFS-15391 at 6/12/20, 4:24 AM:
--
[~ayushtkn] Thank you for the reply! I will try to reproduce it; however, the problem has not yet been reproduced in the test environment. I will follow up and see if I can reproduce it.
{quote}
{quote} The block used by CloseOp twice is the same instance, which causes the first CloseOp to record the wrong block size. {quote}
didn't quite understand this.
{quote}
In the first CloseOp (TXID=126060942290), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480, but in fact in the first CloseOp block_11382080753's block size should be 108764672 and its GENSTAMP should be 10354071495. And in the second CloseOp (TXID=126060943585), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480. The block_11382080753 used by both CloseOps is the same instance, so the first CloseOp has wrong block information.

was (Author: haiyang hu):
[~ayushtkn] Thank you for the reply! I will try to reproduce it; however, the problem has not yet been reproduced in the test environment. I will follow up and see if I can reproduce it.
{quote}
{quote} The block used by CloseOp twice is the same instance, which causes the first CloseOp to record the wrong block size. {quote}
didn't quite understand this.
{quote}
in the first CloseOp (TXID=126060942290), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480, but in fact in the first CloseOp block_11382080753's block size should be 108764672 and its GENSTAMP should be 10354071495. and in the second CloseOp (TXID=126060943585), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480. The block_11382080753 used by both CloseOps is the same instance, so the first CloseOp has wrong block information.

> Standby NameNode due loads the corruption edit log, the service exits and
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our production environment running cluster version 3.2.0,
> we found that due to edit log corruption, the Standby NameNode could not
> properly load the edit log, resulting in abnormal exit of the service and
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replicated file), and
> when an exception occurs while writing the file, the following operations
> are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133910#comment-17133910 ] huhaiyang commented on HDFS-15391:
--
[~ayushtkn] Thank you for the reply! I will try to reproduce it; however, the problem has not yet been reproduced in the test environment. I will follow up and see if I can reproduce it.
{quote}
{quote} The block used by CloseOp twice is the same instance, which causes the first CloseOp to record the wrong block size. {quote}
didn't quite understand this.
{quote}
In the first CloseOp (TXID=126060942290), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480, but in fact in the first CloseOp block_11382080753's block size should be 108764672 and its GENSTAMP should be 10354071495. And in the second CloseOp (TXID=126060943585), block_11382080753's block size is 63154347 and its GENSTAMP is 10354157480. The block_11382080753 used by both CloseOps is the same instance, so the first CloseOp has wrong block information.

> Standby NameNode due loads the corruption edit log, the service exits and
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our production environment running cluster version 3.2.0,
> we found that due to edit log corruption, the Standby NameNode could not
> properly load the edit log, resulting in abnormal exit of the service and
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replicated file), and
> when an exception occurs while writing the file, the following operations
> are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133865#comment-17133865 ] huhaiyang commented on HDFS-15391:
--
Hi [~hexiaoqiao], thank you for the reply!
{quote}
Do you enable AsyncEditlog feature? I think it could be related to different operations process the same blocks which not sync/return back to client. IIRC, we try to fix it using deep copy as HDFS-15175 mentioned in my internal branch.
{quote}
Yes, we have the AsyncEditlog feature enabled, and I also think it may be related to this feature; the current scenario performs append and truncate operations on the same file multiple times. OK, we will also try to fix it using deep copy …
> Standby NameNode due loads the corruption edit log, the service exits and
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our production environment running cluster version 3.2.0,
> we found that due to edit log corruption, the Standby NameNode could not
> properly load the edit log, resulting in abnormal exit of the service and
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replicated file), and
> when an exception occurs while writing the file, the following operations
> are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
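The sketch below illustrates why deferred (async) edit-log serialization widens the window for the aliasing discussed in this exchange; the names are hypothetical, not the real FSEditLogAsync internals. An op is enqueued holding a reference to the live block, a truncate mutates it before the background sync thread drains the queue, and the serialized record shows the later value; snapshotting the fields at enqueue time (the deep-copy approach mentioned above) closes the window.
{code:java}
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/**
 * Sketch (hypothetical names) of the async edit-log race: an enqueued op
 * that aliases a live mutable block is serialized after a later mutation,
 * while an op that captured a snapshot at enqueue time is not affected.
 */
public class AsyncEditLogSketch {

  static final class MutableBlock {
    volatile long numBytes = 185818648L; // size at close time
  }

  public static void main(String[] args) throws InterruptedException {
    MutableBlock block = new MutableBlock();
    BlockingQueue<Runnable> queue = new ArrayBlockingQueue<>(16);

    // Buggy enqueue: the "serializer" reads the live object later.
    queue.put(() -> System.out.println("aliased CloseOp size=" + block.numBytes));

    // Fixed enqueue: snapshot the field now (stand-in for a deep copy).
    final long snapshot = block.numBytes;
    queue.put(() -> System.out.println("copied  CloseOp size=" + snapshot));

    block.numBytes = 185818644L; // truncate runs before the sync thread

    Thread syncer = new Thread(() -> {
      for (Runnable op; (op = queue.poll()) != null; ) op.run();
    });
    syncer.start();
    syncer.join();
  }
}
{code}
The aliased op prints the post-truncate size while the copied op prints the close-time size, which is the difference the deep-copy fix is meant to guarantee regardless of when the sync thread runs.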
[jira] [Comment Edited] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133245#comment-17133245 ] huhaiyang edited comment on HDFS-15391 at 6/11/20, 1:28 PM:
--
Hi [~ayushtkn], could you please take a look at this issue?
{quote}
2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: hdfs://path
{quote}
Related edit log transactions
{noformat}
1. TXID=126060182153 OP_TRUNCATE time=1591266465492 (2020-06-04 18:27:45) NEWLENGTH=868460715 blocks: ... 11382080753 103364934 10354049310
2. TXID=126060182170 OP_REASSIGN_LEASE
3. TXID=126060308267 OP_CLOSE 1591266492080 (2020-06-04 18:28:12) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354049316
4. TXID=126060311503 OP_APPEND
5. TXID=126060311717 OP_SET_GENSTAMP_V2 10354071495
6. TXID=126060313001 OP_UPDATE_BLOCKS blocks: ... 11382080753 63154347 10354071495
7. TXID=126060921401 OP_REASSIGN_LEASE
8. TXID=126060942290 OP_CLOSE 1591266619003 (2020-06-04 18:30:19) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354157480
9. TXID=126060942548 OP_SET_GENSTAMP_V2 10354157480
10. TXID=126060942549 OP_TRUNCATE 868460715 1591266619207 (2020-06-04 18:30:19) blocks: ... 11382080753 108764672 10354157480
11. TXID=126060943585 OP_CLOSE 1591266620287 (2020-06-04 18:30:20) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354157480
{noformat}
The block size should be 108764672 in the first CloseOp (TXID=126060942290). When truncate is used, the block size is 63154347. The block used by both CloseOps is the same instance, which causes the first CloseOp to record the wrong block size. When the second CloseOp (TXID=126060943585) is executed, the file is not in the UnderConstruction state, and the SNN goes down.

was (Author: haiyang hu):
Hi [~ayushtkn], could you please take a look at this issue?
{quote}
2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: hdfs://path
{quote}
Related edit log transactions
{noformat}
1. TXID=126060182153 OP_TRUNCATE time=1591266465492 (2020-06-04 18:27:45) NEWLENGTH=868460715 blocks: ... 11382080753 103364934 10354049310
2. TXID=126060182170 OP_REASSIGN_LEASE
3. TXID=126060308267 OP_CLOSE 1591266492080 (2020-06-04 18:28:12) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354049316
4. TXID=126060311503 OP_APPEND
5. TXID=126060313001 OP_UPDATE_BLOCKS blocks: ... 11382080753 63154347 10354071495
6. TXID=126060921401 OP_REASSIGN_LEASE
7. TXID=126060942290 OP_CLOSE 1591266619003 (2020-06-04 18:30:19) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354157480
8. TXID=126060942548 OP_SET_GENSTAMP_V2 10354157480
9. TXID=126060942549 OP_TRUNCATE 868460715 1591266619207 (2020-06-04 18:30:19) blocks: ... 11382080753 108764672 10354157480
10. TXID=126060943585 OP_CLOSE 1591266620287 (2020-06-04 18:30:20) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354157480
{noformat}
The block size should be 108764672 in the first CloseOp (TXID=126060942290). When truncate is used, the block size is 63154347. The block used by both CloseOps is the same instance, which causes the first CloseOp to record the wrong block size. When the second CloseOp (TXID=126060943585) is executed, the file is not in the UnderConstruction state, and the SNN goes down.

> Standby NameNode due loads the corruption edit log, the service exits and
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our production environment running cluster version 3.2.0,
> we found that due to edit log corruption, the Standby NameNode could not
> properly load the edit log, resulting in abnormal exit of the service and
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replicated file), and
> when an exception occurs while writing the file, the following operations
> are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15391) Standby NameNode due loads the corruption edit log, the service exits and cannot be restarted
[ https://issues.apache.org/jira/browse/HDFS-15391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17133245#comment-17133245 ] huhaiyang edited comment on HDFS-15391 at 6/11/20, 1:13 PM:
--
Hi [~ayushtkn], could you please take a look at this issue?
{quote}
2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: hdfs://path
{quote}
Related edit log transactions
{noformat}
1. TXID=126060182153 OP_TRUNCATE time=1591266465492 (2020-06-04 18:27:45) NEWLENGTH=868460715 blocks: ... 11382080753 103364934 10354049310
2. TXID=126060182170 OP_REASSIGN_LEASE
3. TXID=126060308267 OP_CLOSE 1591266492080 (2020-06-04 18:28:12) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354049316
4. TXID=126060311503 OP_APPEND
5. TXID=126060311717 OP_SET_GENSTAMP_V2 10354071495
6. TXID=126060313001 OP_UPDATE_BLOCKS blocks: ... 11382080753 63154347 10354071495
7. TXID=126060921401 OP_REASSIGN_LEASE
8. TXID=126060942290 OP_CLOSE 1591266619003 (2020-06-04 18:30:19) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354157480
9. TXID=126060942548 OP_SET_GENSTAMP_V2 10354157480
10. TXID=126060942549 OP_TRUNCATE 868460715 1591266619207 (2020-06-04 18:30:19) blocks: ... 11382080753 108764672 10354157480
11. TXID=126060943585 OP_CLOSE 1591266620287 (2020-06-04 18:30:20) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354157480
{noformat}
The block size should be 108764672 in the first CloseOp (TXID=126060942290). When truncate is used, the block size is 63154347. The block used by both CloseOps is the same instance, which causes the first CloseOp to record the wrong block size. When the second CloseOp (TXID=126060943585) is executed, the file is not in the UnderConstruction state, and the SNN goes down.

was (Author: haiyang hu):
Hi [~ayushtkn], could you please take a look at this issue?
{quote}
2020-06-04 18:32:11,561 ERROR org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader: Encountered exception on operation CloseOp [length=0, inodeId=0, path=path, replication=3, mtime=1591266620287, atime=1591264800229, blockSize=134217728, blocks=[blk_11382006007_10353346830, blk_11382023760_10353365201, blk_11382041307_10353383098, blk_11382049845_10353392031, blk_11382057341_10353399899, blk_11382071544_10353415171, blk_11382080753_10354157480], permissions=dw_water:rd:rw-r--r--, aclEntries=null, clientName=, clientMachine=, overwrite=false, storagePolicyId=0, erasureCodingPolicyId=0, opCode=OP_CLOSE, txid=126060943585] java.io.IOException: File is not under construction: hdfs://path
{quote}
Related edit log transactions
{noformat}
1. TXID=126060182153 OP_TRUNCATE time=1591266465492 (2020-06-04 18:27:45) NEWLENGTH=868460715 blocks: ... 11382080753 103364934 10354049310
2. TXID=126060182170 OP_REASSIGN_LEASE
3. TXID=126060308267 OP_CLOSE 1591266492080 (2020-06-04 18:28:12) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354049316
4. TXID=126060311503 OP_APPEND
5. TXID=126060313001 OP_UPDATE_BLOCKS blocks: ... 11382080753 63154347 10354071495
6. TXID=126060921401 OP_REASSIGN_LEASE
7. TXID=126060942290 OP_CLOSE 1591266619003 (2020-06-04 18:30:19) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354157480
8. TXID=126060942548 OP_SET_GENSTAMP_V2 10354157480
9. TXID=126060942549 OP_TRUNCATE 868460715 1591266619207 (2020-06-04 18:30:19) blocks: ... 11382080753 108764672 10354157480
10. TXID=126060943585 OP_CLOSE 1591266620287 (2020-06-04 18:30:20) 1591264800229 (2020-06-04 18:00:00) blocks: ... 11382080753 63154347 10354157480
{noformat}
The block size should be 108764672 in the first CloseOp. When truncate is used, the block size is 63154347. The block used by both CloseOps is the same instance, which causes the first CloseOp to record the wrong block size. When the second CloseOp is executed, the file is not in the UnderConstruction state, and the SNN goes down.

> Standby NameNode due loads the corruption edit log, the service exits and
> cannot be restarted
> -
>
> Key: HDFS-15391
> URL: https://issues.apache.org/jira/browse/HDFS-15391
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 3.2.0
> Reporter: huhaiyang
> Priority: Critical
>
> In our production environment running cluster version 3.2.0,
> we found that due to edit log corruption, the Standby NameNode could not
> properly load the edit log, resulting in abnormal exit of the service and
> failure to restart.
> {noformat}
> The specific scenario is that Flink writes to HDFS (a replicated file), and
> when an exception occurs while writing the file, the following operations
> are performed:
> 1.close file
> 2.open file
> 3.truncate file
> 4.append file
> {noformat}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org