[ https://issues.apache.org/jira/browse/HBASE-24625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17152610#comment-17152610 ]
chenglei edited comment on HBASE-24625 at 7/7/20, 10:41 AM:
------------------------------------------------------------

[~ndimiduk] [~stack] [~zhangduo], I think the {{TestWALEntryStream}} failure on branch-2.x is caused by the following {{AsyncFSWAL.doShutdown}} method of branch-2.x, which differs from master:
{code:java}
711   @Override
712   protected void doShutdown() throws IOException {
713     waitForSafePoint();
714     closeWriter();
715     closeExecutor.shutdown();
{code}
After we invoke {{closeWriter()}}, we do not set {{writer}} to null, so {{AbstractFSWAL.getLogFileSizeIfBeingWritten}} continues to invoke {{AsyncProtobufLogWriter.getSyncedLength}} at line 1017 below even after the {{AsyncProtobufLogWriter}} is closed:
{code:java}
1010   @Override
1011   public OptionalLong getLogFileSizeIfBeingWritten(Path path) {
1012     rollWriterLock.lock();
1013     try {
1014       Path currentPath = getOldPath();
1015       if (path.equals(currentPath)) {
1016         W writer = this.writer;
1017         return writer != null ? OptionalLong.of(writer.getSyncedLength()) : OptionalLong.empty();
1018       } else {
1019         return OptionalLong.empty();
1020       }
1021     } finally {
1022       rollWriterLock.unlock();
1023     }
1024   }
{code}
Why does {{TestWALEntryStream}} pass on master? Because the {{AsyncFSWAL.doShutdown}} code on master sets {{writer}} to null after {{closeWriter()}}, at line 780 below:
{code:java}
776   @Override
777   protected void doShutdown() throws IOException {
778     waitForSafePoint();
779     closeWriter(this.writer);
780     this.writer = null;
781     closeExecutor.shutdown();
{code}
This difference in {{AsyncFSWAL.doShutdown}} between master and branch-2.x comes from HBASE-20456, a synchronous replication patch that was not applied to branch-2.x. BTW:
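The shutdown behavior described above can be sketched with a minimal stand-alone model (class and method names below are simplified stand-ins for {{AbstractFSWAL}} and its writer, not the real HBase API): clearing the {{writer}} reference makes the length query return empty, while keeping a stale reference lets callers touch an already-closed writer.

```java
import java.util.OptionalLong;
import java.util.concurrent.locks.ReentrantLock;

// Simplified stand-in for AbstractFSWAL, illustrating why doShutdown must
// null out the writer so getLogFileSizeIfBeingWritten stops reporting a
// length for a writer that has already been closed.
class WalShutdownSketch {
  static class Writer {
    private boolean closed;
    long getSyncedLength() {
      if (closed) {
        throw new IllegalStateException("writer already closed");
      }
      return 42L; // stand-in for the real synced length
    }
    void close() { closed = true; }
  }

  private final ReentrantLock rollWriterLock = new ReentrantLock();
  private volatile Writer writer = new Writer();

  // branch-2.x behavior: close the writer but keep the stale reference
  void doShutdownBuggy() {
    writer.close();
  }

  // master behavior (HBASE-20456): close the writer and clear the reference
  void doShutdownFixed() {
    writer.close();
    writer = null;
  }

  OptionalLong getLogFileSizeIfBeingWritten() {
    rollWriterLock.lock();
    try {
      Writer w = this.writer;
      // With the buggy shutdown, w is non-null but closed, so this throws;
      // with the fixed shutdown, w is null and we return empty().
      return w != null ? OptionalLong.of(w.getSyncedLength()) : OptionalLong.empty();
    } finally {
      rollWriterLock.unlock();
    }
  }

  public static void main(String[] args) {
    WalShutdownSketch fixed = new WalShutdownSketch();
    fixed.doShutdownFixed();
    System.out.println(fixed.getLogFileSizeIfBeingWritten().isPresent()); // false

    WalShutdownSketch buggy = new WalShutdownSketch();
    buggy.doShutdownBuggy();
    try {
      buggy.getLogFileSizeIfBeingWritten();
      System.out.println("no error");
    } catch (IllegalStateException e) {
      System.out.println("stale writer used: " + e.getMessage());
    }
  }
}
```

This is only a model of the lifecycle bug, but it shows why the one-line {{this.writer = null}} on master is enough to make the length query safe after shutdown.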
After we set {{writer}} to null after {{closeWriter()}}, the end of {{TestWALEntryStream.testReplicationSourceWALReaderRecovered}} should also be modified to match master (as done by HBASE-20456):
{code:java}
@@ -413,9 +413,7 @@ public class TestWALEntryStream {
     batch = reader.take();
     assertEquals(walPath, batch.getLastWalPath());
     assertEquals(5, batch.getNbEntries());
-    // Actually this should be true but we haven't handled this yet since for a normal queue the
-    // last one is always open... Not a big deal for now.
-    assertFalse(batch.isEndOfFile());
+    assertTrue(batch.isEndOfFile());
     assertSame(WALEntryBatch.NO_MORE_DATA, reader.take());
   }
{code}
> AsyncFSWAL.getLogFileSizeIfBeingWritten does not return the expected synced
> file length.
> ----------------------------------------------------------------------------------------
>
>                 Key: HBASE-24625
>                 URL: https://issues.apache.org/jira/browse/HBASE-24625
>             Project: HBase
>          Issue Type: Bug
>          Components: Replication, wal
>    Affects Versions: 2.1.0, 2.0.0, 2.2.0, 2.3.0
>            Reporter: chenglei
>            Assignee: chenglei
>            Priority: Critical
>             Fix For: 3.0.0-alpha-1, 2.3.1, 2.2.6
>
>
> By HBASE-14004, we introduced the {{WALFileLengthProvider}} interface to track the
> length of the WAL file currently being written ourselves; {{WALEntryStream}}, used by
> {{ReplicationSourceWALReader}}, may only read WAL bytes up to
> {{WALFileLengthProvider.getLogFileSizeIfBeingWritten}} if the WAL file is
> currently being written on the same RegionServer.
> {{AsyncFSWAL}} implements {{WALFileLengthProvider}} via
> {{AbstractFSWAL.getLogFileSizeIfBeingWritten}}, as follows:
> {code:java}
> public OptionalLong getLogFileSizeIfBeingWritten(Path path) {
>   rollWriterLock.lock();
>   try {
>     Path currentPath = getOldPath();
>     if (path.equals(currentPath)) {
>       W writer = this.writer;
>       return writer != null ? OptionalLong.of(writer.getLength()) : OptionalLong.empty();
>     } else {
>       return OptionalLong.empty();
>     }
>   } finally {
>     rollWriterLock.unlock();
>   }
> }
> {code}
> For {{AsyncFSWAL}}, the above {{AsyncFSWAL.writer}} is an
> {{AsyncProtobufLogWriter}}, and {{AsyncProtobufLogWriter.getLength}} is as follows:
> {code:java}
> public long getLength() {
>   return length.get();
> }
> {code}
> But for {{AsyncProtobufLogWriter}}, any append may increase the above
> {{AsyncProtobufLogWriter.length}}; in particular, the following
> {{AsyncProtobufLogWriter.append}} method just appends the {{WALEntry}} to
> {{FanOutOneBlockAsyncDFSOutput.buf}}:
> {code:java}
> public void append(Entry entry) {
>   int buffered = output.buffered();
>   try {
>     entry.getKey().getBuilder(compressor).setFollowingKvCount(entry.getEdit().size()).build()
>       .writeDelimitedTo(asyncOutputWrapper);
>   } catch (IOException e) {
>     throw new AssertionError("should not happen", e);
>   }
>   try {
>     for (Cell cell : entry.getEdit().getCells()) {
>       cellEncoder.write(cell);
>     }
>   } catch (IOException e) {
>     throw new AssertionError("should not happen", e);
>   }
>   length.addAndGet(output.buffered() - buffered);
> }
> {code}
> That is to say, {{AsyncFSWAL.getLogFileSizeIfBeingWritten}} does not reflect
> the file length that was successfully synced to the underlying HDFS, which is not
> as expected.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
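The buffered-vs-synced distinction the issue description turns on can be sketched with a minimal stand-alone model (class and field names below are simplified stand-ins, not the real {{AsyncProtobufLogWriter}} API): {{append()}} grows the in-memory length immediately, while the synced length only advances when a sync completes, so a length query bounded by the former can overrun what is actually durable on HDFS.

```java
import java.util.concurrent.atomic.AtomicLong;

// Minimal model (hypothetical names) of the bug described above: append()
// grows the in-memory length immediately, while the synced length only
// advances when sync() completes.
class SyncedLengthSketch {
  private final AtomicLong length = new AtomicLong();       // bytes appended (possibly still buffered)
  private final AtomicLong syncedLength = new AtomicLong(); // bytes durably synced to HDFS

  void append(int entryBytes) {
    length.addAndGet(entryBytes);
  }

  void sync() {
    // after a successful sync, everything appended so far is durable
    syncedLength.set(length.get());
  }

  long getLength() { return length.get(); }
  long getSyncedLength() { return syncedLength.get(); }

  public static void main(String[] args) {
    SyncedLengthSketch w = new SyncedLengthSketch();
    w.append(100);
    w.append(50);
    // A replication reader bounded by getLength() could read 150 bytes,
    // none of which are guaranteed to be on HDFS yet:
    System.out.println(w.getLength());       // 150
    System.out.println(w.getSyncedLength()); // 0
    w.sync();
    System.out.println(w.getSyncedLength()); // 150
  }
}
```

This is why the fix has {{getLogFileSizeIfBeingWritten}} report the synced length rather than the raw appended length.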