[ 
https://issues.apache.org/jira/browse/HDFS-15798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

huhaiyang updated HDFS-15798:
-----------------------------
    Description: 
The EC reconstruct task failed, and the decrementXmitsInProgress operation in 
processErasureCodingTasks executed abnormally; as a result, the XmitsInProgress 
of the DN can become a negative number, which affects how the NN chooses 
pending tasks based on the ratio between the lengths of the replication and 
erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
    Collection<BlockECReconstructionInfo> ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
    int xmitsSubmitted = 0;
    try {
      ...
      // It may throw IllegalArgumentException from task#stripedReader
      // constructor.
      final StripedBlockReconstructor task =
          new StripedBlockReconstructor(this, stripedReconInfo);
      if (task.hasValidTargets()) {
        // See HDFS-12044. We increase xmitsInProgress even the task is only
        // enqueued, so that
        //   1) NN will not send more tasks than what DN can execute and
        //   2) DN will not throw away reconstruction tasks, and instead keeps
        //      an unbounded number of tasks in the executor's task queue.
        xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
        getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task start 
increment
        stripedReconstructionPool.submit(task);
      } else {
        LOG.warn("No missing internal block. Skip reconstruction for task:{}",
            reconInfo);
      }
    } catch (Throwable e) {
      getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
failed decrement
      LOG.warn("Failed to reconstruct striped block {}",
          reconInfo.getExtendedBlock().getLocalBlock(), e);
    }
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
    initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
    LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
    getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
    float xmitWeight = getErasureCodingWorker().getXmitWeight();
    // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
    // because if it is set to zero, we cannot measure the xmits submitted
    int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
    getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
complete decrement
    ...
  }
}{code}

  was:
The EC reconstruct task failed, and the decrementXmitsInProgress operation will 
be performed twice
 It would be XmitsInProgress of DN has negative number, it affects NN chooses 
pending tasks based on the ratio between the lengths of replication and 
erasure-coded block queues.
{code:java}
// 1.ErasureCodingWorker.java

public void processErasureCodingTasks(
    Collection<BlockECReconstructionInfo> ecTasks) {
  for (BlockECReconstructionInfo reconInfo : ecTasks) {
    int xmitsSubmitted = 0;
    try {
      ...
      // It may throw IllegalArgumentException from task#stripedReader
      // constructor.
      final StripedBlockReconstructor task =
          new StripedBlockReconstructor(this, stripedReconInfo);
      if (task.hasValidTargets()) {
        // See HDFS-12044. We increase xmitsInProgress even the task is only
        // enqueued, so that
        //   1) NN will not send more tasks than what DN can execute and
        //   2) DN will not throw away reconstruction tasks, and instead keeps
        //      an unbounded number of tasks in the executor's task queue.
        xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
        getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task start 
increment
        stripedReconstructionPool.submit(task);
      } else {
        LOG.warn("No missing internal block. Skip reconstruction for task:{}",
            reconInfo);
      }
    } catch (Throwable e) {
      getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
failed decrement
      LOG.warn("Failed to reconstruct striped block {}",
          reconInfo.getExtendedBlock().getLocalBlock(), e);
    }
  }
}


// 2.StripedBlockReconstructor.java
public void run() {
  try {
    initDecoderIfNecessary();
   ...
  } catch (Throwable e) {
    LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
    getDatanode().getMetrics().incrECFailedReconstructionTasks();
  } finally {
    float xmitWeight = getErasureCodingWorker().getXmitWeight();
    // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
    // because if it is set to zero, we cannot measure the xmits submitted
    int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
    getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
complete decrement
    ...
  }
}{code}


> EC: Reconstruct task failed, and It would be XmitsInProgress of DN has 
> negative number
> --------------------------------------------------------------------------------------
>
>                 Key: HDFS-15798
>                 URL: https://issues.apache.org/jira/browse/HDFS-15798
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: huhaiyang
>            Assignee: huhaiyang
>            Priority: Major
>         Attachments: HDFS-15798.001.patch
>
>
> The EC reconstruct task failed, and the decrementXmitsInProgress operation in 
> processErasureCodingTasks executed abnormally; as a result, the XmitsInProgress 
> of the DN can become a negative number, which affects how the NN chooses 
> pending tasks based on the ratio between the lengths of the replication and 
> erasure-coded block queues.
> {code:java}
> // 1.ErasureCodingWorker.java
> public void processErasureCodingTasks(
>     Collection<BlockECReconstructionInfo> ecTasks) {
>   for (BlockECReconstructionInfo reconInfo : ecTasks) {
>     int xmitsSubmitted = 0;
>     try {
>       ...
>       // It may throw IllegalArgumentException from task#stripedReader
>       // constructor.
>       final StripedBlockReconstructor task =
>           new StripedBlockReconstructor(this, stripedReconInfo);
>       if (task.hasValidTargets()) {
>         // See HDFS-12044. We increase xmitsInProgress even the task is only
>         // enqueued, so that
>         //   1) NN will not send more tasks than what DN can execute and
>         //   2) DN will not throw away reconstruction tasks, and instead keeps
>         //      an unbounded number of tasks in the executor's task queue.
>         xmitsSubmitted = Math.max((int)(task.getXmits() * xmitWeight), 1);
>         getDatanode().incrementXmitsInProcess(xmitsSubmitted); //  1.task 
> start increment
>         stripedReconstructionPool.submit(task);
>       } else {
>         LOG.warn("No missing internal block. Skip reconstruction for task:{}",
>             reconInfo);
>       }
>     } catch (Throwable e) {
>       getDatanode().decrementXmitsInProgress(xmitsSubmitted); //  2.2. task 
> failed decrement
>       LOG.warn("Failed to reconstruct striped block {}",
>           reconInfo.getExtendedBlock().getLocalBlock(), e);
>     }
>   }
> }
> // 2.StripedBlockReconstructor.java
> public void run() {
>   try {
>     initDecoderIfNecessary();
>    ...
>   } catch (Throwable e) {
>     LOG.warn("Failed to reconstruct striped block: {}", getBlockGroup(), e);
>     getDatanode().getMetrics().incrECFailedReconstructionTasks();
>   } finally {
>     float xmitWeight = getErasureCodingWorker().getXmitWeight();
>     // if the xmits is smaller than 1, the xmitsSubmitted should be set to 1
>     // because if it is set to zero, we cannot measure the xmits submitted
>     int xmitsSubmitted = Math.max((int) (getXmits() * xmitWeight), 1);
>     getDatanode().decrementXmitsInProgress(xmitsSubmitted); // 2.1. task 
> complete decrement
>     ...
>   }
> }{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org

Reply via email to