[ https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shilun Fan updated HDFS-15798: ------------------------------ Affects Version/s: 3.3.1 3.4.0 > EC: Reconstruct task failed, and It would be XmitsInProgress of DN has > negative number > -------------------------------------------------------------------------------------- > > Key: HDFS-15798 > URL: https://issues.apache.org/jira/browse/HDFS-15798 > Project: Hadoop HDFS > Issue Type: Bug > Components: erasure-coding > Affects Versions: 3.3.1, 3.4.0 > Reporter: Haiyang Hu > Assignee: Haiyang Hu > Priority: Major > Fix For: 3.3.1, 3.4.0, 3.2.3 > > Attachments: HDFS-15798.001.patch, HDFS-15798.002.patch, > HDFS-15798.003.patch > > > The EC reconstruct task failed, and the decrementXmitsInProgress of > processErasureCodingTasks operation abnormal value ; > It would be XmitsInProgress of DN has negative number, it affects NN chooses > pending tasks based on the ratio between the lengths of replication and > erasure-coded block queues. > {code:java} > // 1.ErasureCodingWorker.java > public void processErasureCodingTasks( > Collection<BlockECReconstructionInfo> ecTasks) { > for (BlockECReconstructionInfo reconInfo : ecTasks) { > int xmitsSubmitted = 0; > try { > ... > // It may throw IllegalArgumentException from task#stripedReader > // constructor. > final StripedBlockReconstructor task = > new StripedBlockReconstructor(this, stripedReconInfo); > if (task.hasValidTargets()) { > // See HDFS-12044. We increase xmitsInProgress even the task is only > // enqueued, so that > // 1) NN will not send more tasks than what DN can execute and > // 2) DN will not throw away reconstruction tasks, and instead keeps > // an unbounded number of tasks in the executor's task queue. > xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1); > getDatanode().incrementXmitsInProcess(xmitsSubmitted); // task start > increment > stripedReconstructionPool.submit(task); > } else { > LOG.warn("No missing internal block. Skip reconstruction for task:{}", > reconInfo); > } > } catch (Throwable e) { > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task failed > decrement, XmitsInProgress is decremented by the previous value > LOG.warn("Failed to reconstruct striped block {}", > reconInfo.getExtendedBlock().getLocalBlock(), e); > } > } > } > // 2.StripedBlockReconstructor.java > public void run() { > try { > initDecoderIfNecessary(); > ... > } catch (Throwable e) { > LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e); > getDatanode().getMetrics().incrECFailedReconstructionTasks(); > } finally { > float xmitWeight = getErasureCodingWorker().getXmitWeight(); > // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1 > // because if it set to zero, we cannot to measure the xmits submitted > int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1); > getDatanode().decrementXmitsInProgress(xmitsSubmitted); // task complete > decrement > ... > } > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org