[jira] [Commented] (HDFS-15315) IOException on close() when using Erasure Coding

Ayush Saxena (Jira) Thu, 30 Apr 2020 14:21:15 -0700


    [ https://issues.apache.org/jira/browse/HDFS-15315?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17096986#comment-17096986 ]


Ayush Saxena commented on HDFS-15315:
-------------------------------------

Could this be an issue with the DataNode IBRs (incremental block reports)? It seems the DataNodes aren't sending IBRs, or are slow to send them; you can check the DataNode logs.
I got the same logs when I paused IBRs on 2 DataNodes.
Exception:
{code:java}
java.io.IOException: Unable to close file because the last block BP-1842875897-127.0.1.1-1588280195530:blk_-9223372036854775792_1001 does not have enough number of replicas.

        at org.apache.hadoop.hdfs.DFSOutputStream.completeFile(DFSOutputStream.java:971)
        at org.apache.hadoop.hdfs.DFSStripedOutputStream.closeImpl(DFSStripedOutputStream.java:1265)
{code}
Logs at the NameNode:
{noformat}
2020-05-01 02:26:41,867 [IPC Server handler 2 on default port 42203] INFO  namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3176)) - BLOCK* blk_-9223372036854775792_1001 is COMMITTED but not COMPLETE(numNodes= 0 <  minimum = 2) in file /dir/file
2020-05-01 02:26:42,281 [IPC Server handler 8 on default port 42203] INFO  namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3176)) - BLOCK* blk_-9223372036854775792_1001 is COMMITTED but not COMPLETE(numNodes= 1 <  minimum = 2) in file /dir/file
2020-05-01 02:26:43,085 [IPC Server handler 1 on default port 42203] INFO  namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3176)) - BLOCK* blk_-9223372036854775792_1001 is COMMITTED but not COMPLETE(numNodes= 1 <  minimum = 2) in file /dir/file
2020-05-01 02:26:44,688 [IPC Server handler 6 on default port 42203] INFO  namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3176)) - BLOCK* blk_-9223372036854775792_1001 is COMMITTED but not COMPLETE(numNodes= 1 <  minimum = 2) in file /dir/file
2020-05-01 02:26:47,891 [IPC Server handler 0 on default port 42203] INFO  namenode.FSNamesystem (FSNamesystem.java:checkBlocksComplete(3176)) - BLOCK* blk_-9223372036854775792_1001 is COMMITTED but not COMPLETE(numNodes= 1 <  minimum = 2) in file /dir/file
{noformat}
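The spacing of these five log lines is also telling. The small sketch below only subtracts the logged timestamps (copied verbatim from the log); the class name {{IbrRetryGaps}} is made up for illustration. The gaps come out at roughly 0.4 s, 0.8 s, 1.6 s and 3.2 s, which looks consistent with the client retrying {{complete()}} with a doubling back-off before giving up with the IOException. That back-off reading is my interpretation of the timestamps, not something the log states.

{code:java}
import java.time.Duration;
import java.time.LocalTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: measures the gaps between the checkBlocksComplete log lines.
public class IbrRetryGaps {
  // Time-of-day part of the five NameNode log lines, copied from the log.
  static final String[] TIMESTAMPS = {
      "02:26:41,867", "02:26:42,281", "02:26:43,085", "02:26:44,688", "02:26:47,891"
  };

  // Returns the gap in milliseconds between each pair of consecutive timestamps.
  static List<Long> gapsMillis(String[] stamps) {
    DateTimeFormatter fmt = DateTimeFormatter.ofPattern("HH:mm:ss,SSS");
    List<Long> gaps = new ArrayList<>();
    for (int i = 1; i < stamps.length; i++) {
      LocalTime prev = LocalTime.parse(stamps[i - 1], fmt);
      LocalTime cur = LocalTime.parse(stamps[i], fmt);
      gaps.add(Duration.between(prev, cur).toMillis());
    }
    return gaps;
  }

  public static void main(String[] args) {
    // Each gap is roughly double the previous one.
    System.out.println(gapsMillis(TIMESTAMPS)); // prints [414, 804, 1603, 3203]
  }
}
{code}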
Test code:
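{code:java}
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.server.datanode.DataNodeTestUtils;
import org.junit.Test;

// Enclosing class name is illustrative; the test method is the reproduction.
public class TestECCloseIssue {
  @Test
  public void testECIssue() throws IOException {
    HdfsConfiguration conf = new HdfsConfiguration();
    // XOR-2-1 needs 3 DataNodes: 2 data units + 1 parity unit.
    try (MiniDFSCluster cluster =
        new MiniDFSCluster.Builder(conf).numDataNodes(3).build()) {
      cluster.waitActive();
      final DistributedFileSystem dfs = cluster.getFileSystem();
      Path dir = new Path("/dir");
      dfs.mkdirs(dir);
      dfs.enableErasureCodingPolicy("XOR-2-1-1024k");
      dfs.setErasureCodingPolicy(dir, "XOR-2-1-1024k");

      FSDataOutputStream str = dfs.create(new Path("/dir/file"));
      // Write 4 MiB, one byte per call (write(int) keeps only the low 8 bits).
      for (int i = 0; i < 1024 * 1024 * 4; i++) {
        str.write(i);
      }
      // Stop incremental block reports from 2 of the 3 DataNodes.
      DataNodeTestUtils.pauseIBR(cluster.getDataNodes().get(0));
      DataNodeTestUtils.pauseIBR(cluster.getDataNodes().get(1));
      // Throws "Unable to close file because the last block ... does not
      // have enough number of replicas."
      str.close();
    }
  }
}
{code}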
For normal replicated files, {{dfs.namenode.file.close.num-committed-allowed}} can usually be used to counter this, since it allows a file to be closed while its last block(s) are still COMMITTED. IIRC we have it configured in our production setups too.
But for EC files this setting doesn't seem to take effect:
{code:java}
if (b.isStriped() || i < blocks.length - numCommittedAllowed) {
  return b + " is " + state + " but not COMPLETE";
}
{code}
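Because of the {{b.isStriped()}} condition, the setting applies only to replicated files: a striped block that is not COMPLETE is always rejected, whatever {{numCommittedAllowed}} is set to.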
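To make the difference concrete, here is a heavily simplified model of the close-time check. It is a sketch, not the NameNode code: the names {{BlockState}}, {{SimpleBlock}}, {{canClose}} and {{CloseCheckModel}} are hypothetical, the COMMITTED to COMPLETE transition is reduced to "reported nodes >= minimum" (for the XOR-2-1 block above the log shows minimum = 2), any further checks the real method performs are omitted, and only the {{b.isStriped() ||}} branch mirrors the quoted code. With {{numCommittedAllowed = 1}}, a replicated file whose last block is still COMMITTED can be closed, while a striped block group in the same state cannot.

{code:java}
import java.util.List;

// Hypothetical, simplified model of the close-time block check (not HDFS code).
public class CloseCheckModel {
  enum BlockState { UNDER_CONSTRUCTION, COMMITTED, COMPLETE }

  // Simplified stand-in for a block as seen by the NameNode.
  record SimpleBlock(boolean striped, BlockState state, int reportedNodes, int minimum) {
    // Simplification: a COMMITTED block counts as COMPLETE once enough
    // DataNodes have reported it via IBRs (numNodes >= minimum).
    BlockState effectiveState() {
      if (state == BlockState.COMMITTED && reportedNodes >= minimum) {
        return BlockState.COMPLETE;
      }
      return state;
    }
  }

  // Only the last numCommittedAllowed blocks of a replicated file may still be
  // COMMITTED; a striped block must always be COMPLETE.
  static boolean canClose(List<SimpleBlock> blocks, int numCommittedAllowed) {
    for (int i = 0; i < blocks.size(); i++) {
      SimpleBlock b = blocks.get(i);
      BlockState s = b.effectiveState();
      if (s == BlockState.COMPLETE) {
        continue;
      }
      if (b.striped() || i < blocks.size() - numCommittedAllowed) {
        return false; // "... is COMMITTED but not COMPLETE"
      }
      if (s != BlockState.COMMITTED) {
        return false; // neither COMPLETE nor COMMITTED
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Replicated last block: committed, 0 of 1 required replicas reported.
    SimpleBlock replicated = new SimpleBlock(false, BlockState.COMMITTED, 0, 1);
    // Striped XOR-2-1 block group: committed, 1 of 2 required internal blocks reported.
    SimpleBlock striped = new SimpleBlock(true, BlockState.COMMITTED, 1, 2);

    System.out.println(canClose(List.of(replicated), 0)); // false
    System.out.println(canClose(List.of(replicated), 1)); // true
    System.out.println(canClose(List.of(striped), 1));    // false
  }
}
{code}

On a real cluster the knob is {{dfs.namenode.file.close.num-committed-allowed}} in hdfs-site.xml (default 0); as the model suggests, raising it would not help the XOR-2-1 file until the striped branch changes.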

> IOException on close() when using Erasure Coding
> ------------------------------------------------
>
>                 Key: HDFS-15315
>                 URL: https://issues.apache.org/jira/browse/HDFS-15315
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: 3.1.1, ec, hdfs
>    Affects Versions: 3.1.1
>         Environment: XOR-2-1-1024k policy on hadoop 3.1.1 with 3 datanodes
>            Reporter: Anshuman Singh
>            Priority: Major
>
> When using an Erasure Coding policy on a directory, the replication factor is 
> set to 1. Solr fails to index documents with the error _java.io.IOException: 
> Unable to close file because the last block does not have enough number of 
> replicas._ It works fine without EC (with replication factor 3). It seems 
> to be identical to this issue: [HDFS-11486|https://issues.apache.org/jira/browse/HDFS-11486]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
