[ https://issues.apache.org/jira/browse/HBASE-25932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Work on HBASE-25932 stopped by Bharath Vissapragada. ---------------------------------------------------- > TestWALEntryStream#testCleanClosedWALs test is failing. > ------------------------------------------------------- > > Key: HBASE-25932 > URL: https://issues.apache.org/jira/browse/HBASE-25932 > Project: HBase > Issue Type: Bug > Components: Replication > Affects Versions: 3.0.0-alpha-1, 2.5.0, 2.3.6, 2.4.4 > Reporter: Rushabh Shah > Assignee: Bharath Vissapragada > Priority: Major > Attachments: HBASE-25932-test-approach.patch > > > We are seeing the following test failure. > TestWALEntryStream#testCleanClosedWALs > This test was added in HBASE-25924. I don't think the test failure has > anything to do with the patch in HBASE-25924. > Before HBASE-25924, we were *not* monitoring _uncleanlyClosedWAL_ metric. In > all the branches, we were not parsing the wal trailer when we close the wal > reader inside ReplicationSourceWALReader thread. The root cause was when we > add active WAL to ReplicationSourceWALReader, we cache the file size when the > wal was being actively written and once the wal was closed and replicated and > removed from WALEntryStream, we did reset the ProtobufLogReader object but we > didn't update the length of the wal and that was causing EOF errors since it > can't find the WALTrailer with the stale wal file length. > The fix applied nicely to branch-1 since we use FSHlog implementation which > closes the WAL file sychronously. > But in branch-2 and master, we use _AsyncFSWAL_ implementation and the > closing of wal file is done asynchronously (as the name suggests). This is > causing the test to fail. Below is the test. > {code:java} > @Test > public void testCleanClosedWALs() throws Exception { > try (WALEntryStream entryStream = new WALEntryStream( > logQueue, CONF, 0, log, null, logQueue.getMetrics(), fakeWalGroupId)) { > assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs()); > appendToLogAndSync(); > assertNotNull(entryStream.next()); > log.rollWriter(); =======> This does an asynchronous close of wal. > appendToLogAndSync(); > assertNotNull(entryStream.next()); > assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs()); > } > } > {code} > In the above code, when we roll writer, we don't close the old wal file > immediately so the ReplicationReader thread is not able to get the updated > wal file size and that is throwing EOF errors. > If I add a sleep of few milliseconds (1 ms in my local env) between > rollWriter and appendToLogAndSync statement then the test passes but this is > *not* a proper fix since we are working around the race between > ReplicationSourceWALReaderThread and closing of WAL file. -- This message was sent by Atlassian Jira (v8.3.4#803005)