[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
huhaiyang updated HDFS-15798: ----------------------------- Description: The EC reconstruct task failed, and the decrementXmitsInProgress of processErasureCodingTasks operation abnormal value ; It would be XmitsInProgress of DN has negative number, it affects NN chooses pending tasks based on the ratio between the lengths of replication and erasure-coded block queues. {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection<BlockECReconstructionInfo> ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task failed decrement, XmitsInProgress is decremented by the previous value LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete decrement ... } }{code} was: The EC reconstruct task failed, and the decrementXmitsInProgress of processErasureCodingTasks operation abnormal value ; It would be XmitsInProgress of DN has negative number, it affects NN chooses pending tasks based on the ratio between the lengths of replication and erasure-coded block queues. {code:java} // 1.ErasureCodingWorker.java public void processErasureCodingTasks( Collection<BlockECReconstructionInfo> ecTasks) { for (BlockECReconstructionInfo reconInfo : ecTasks) { int xmitsSubmitted = 0; try { ... // It may throw IllegalArgumentException from task#stripedReader // constructor. final StripedBlockReconstructor task = new StripedBlockReconstructor(this, stripedReconInfo); if (task.hasValidTargets()) { // See HDFS-12044. We increase xmitsInProgress even the task is only // enqueued, so that // 1) NN will not send more tasks than what DN can execute and // 2) DN will not throw away reconstruction tasks, and instead keeps // an unbounded number of tasks in the executor's task queue. xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); getDatanode().incrementXmitsInProcess(xmitsSubmitted); // task start increment stripedReconstructionPool.submit(task); } else { LOG.warn("No missing internal block. Skip reconstruction for task:{}", reconInfo); } } catch (Throwable e) { getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task failed decrement, xmitsSubmitted LOG.warn("Failed to reconstruct striped block {}", reconInfo.getExtendedBlock().getLocalBlock(), e); } } } // 2.StripedBlockReconstructor.java public void run() { try { initDecoderIfNecessary(); ... } catch (Throwable e) { LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); getDatanode().getMetrics().incrECFailedReconstructionTasks(); } finally { float xmitWeight = getErasureCodingWorker().getXmitWeight(); // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 // because if it set to zero, we cannot to measure the xmits submitted int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete decrement ... } }{code} > EC: Reconstruct task failed, and It would be XmitsInProgress of DN has > negative number > -------------------------------------------------------------------------------------- > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: huhaiyang > Assignee: huhaiyang > Priority: Major > Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch > > > The EC reconstruct task failed, and the decrementXmitsInProgress of > processErasureCodingTasks operation abnormal value ; > It would be XmitsInProgress of DN has negative number, it affects NN chooses > pending tasks based on the ratio between the lengths of replication and > erasure-coded block queues. > {code:java} > // 1.ErasureCodingWorker.java > public void processErasureCodingTasks( > Collection<BlockECReconstructionInfo> ecTasks) { > for (BlockECReconstructionInfo reconInfo : ecTasks) { > int xmitsSubmitted = 0; > try { > ... > // It may throw IllegalArgumentException from task#stripedReader > // constructor. > final StripedBlockReconstructor task = > new StripedBlockReconstructor(this, stripedReconInfo); > if (task.hasValidTargets()) { > // See HDFS-12044. We increase xmitsInProgress even the task is only > // enqueued, so that > // 1) NN will not send more tasks than what DN can execute and > // 2) DN will not throw away reconstruction tasks, and instead keeps > // an unbounded number of tasks in the executor's task queue. > xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); > getDatanode().incrementXmitsInProcess(xmitsSubmitted); // task start > increment > stripedReconstructionPool.submit(task); > } else { > LOG.warn("No missing internal block. Skip reconstruction for task:{}", > reconInfo); > } > } catch (Throwable e) { > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task failed > decrement, XmitsInProgress is decremented by the previous value > LOG.warn("Failed to reconstruct striped block {}", > reconInfo.getExtendedBlock().getLocalBlock(), e); > } > } > } > // 2.StripedBlockReconstructor.java > public void run() { > try { > initDecoderIfNecessary(); > ... > } catch (Throwable e) { > LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); > getDatanode().getMetrics().incrECFailedReconstructionTasks(); > } finally { > float xmitWeight = getErasureCodingWorker().getXmitWeight(); > // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 > // because if it set to zero, we cannot to measure the xmits submitted > int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete > decrement > ... > } > }{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org