[ https://issues.apache.org/jira/browse/HBASE-25596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17373170#comment-17373170 ]
Anoop Sam John commented on HBASE-25596:
----------------------------------------

Thanks [~zhangduo]. It's clear now.

> Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated data due to EOFException from WAL
> ---------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-25596
> URL: https://issues.apache.org/jira/browse/HBASE-25596
> Project: HBase
> Issue Type: Bug
> Components: Replication
> Reporter: Sandeep Pal
> Assignee: Sandeep Pal
> Priority: Critical
> Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.2
>
>
> There seems to be a major issue with how we handle the EOFException from WALEntryStream.
> Problem:
> When we see an EOFException, we try to handle it and remove the WAL from the log queue, but we never try to ship the existing batch of entries. *This is a permanent data loss in replication.*
>
> Secondly, we do not stop the reader on encountering the EOFException, so if the EOFException was on the last WAL, we still try to process the WALEntry stream and ship an empty batch with lastWALPath set to null. This is the cause of the NPE below, which *crashes* the region server.
> {code:java}
> 2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] regionserver.ReplicationSource - Unexpected exception in ReplicationSourceWorkerThread, currentPath=null
> java.lang.NullPointerException
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)
> 2021-02-16 15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> {code}
>
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
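For illustration only, here is a minimal Java sketch of the two fixes the description asks for: ship the entries already read before dropping a WAL that hit EOFException, and never ship a batch whose WAL path was never set. Everything in it is hypothetical. EofHandlingSketch, Batch, WalReader, readAll, readAndShip and ship are made-up names, not HBase classes or methods. The model also assumes that reading a WAL is a single call that either returns all of its entries or throws EOFException. It is not the actual patch for this issue.

```java
import java.io.EOFException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical, simplified model of the WAL reader / shipper hand-off described
 * in HBASE-25596. None of these names are real HBase APIs.
 */
public class EofHandlingSketch {

  /** Entries read so far, plus the WAL they were last read from. */
  static final class Batch {
    final List<String> entries = new ArrayList<>();
    String lastWalPath; // stays null until some WAL has been read successfully
  }

  /** Hypothetical reader: returns every entry of a WAL, or throws EOFException. */
  interface WalReader {
    List<String> readAll(String walPath) throws EOFException;
  }

  private final Deque<String> logQueue;
  private final WalReader reader;
  final List<Batch> shipped = new ArrayList<>();

  EofHandlingSketch(Deque<String> logQueue, WalReader reader) {
    this.logQueue = logQueue;
    this.reader = reader;
  }

  /** Reads queued WALs in order and ships batches without losing entries on EOF. */
  void readAndShip() {
    Batch batch = new Batch();
    while (!logQueue.isEmpty()) {
      String wal = logQueue.peek();
      try {
        batch.entries.addAll(reader.readAll(wal));
        batch.lastWalPath = wal;
      } catch (EOFException e) {
        // Fix 1 (sketch): ship what was already read before dropping the bad WAL,
        // otherwise those entries are never replicated.
        ship(batch);
        batch = new Batch();
      }
      logQueue.poll(); // finished with this WAL either way
    }
    ship(batch);
  }

  private void ship(Batch batch) {
    // Fix 2 (sketch): an EOF on the last WAL leaves a batch with no WAL path.
    // Shipping it and then recording its position is the null dereference
    // that the reported stack trace shows.
    if (batch.lastWalPath == null) {
      return;
    }
    shipped.add(batch);
  }

  public static void main(String[] args) {
    // wal2 is absent from the map, so the reader treats it as empty/truncated.
    Map<String, List<String>> wals = Map.of(
        "wal1", Arrays.asList("a", "b"),
        "wal3", Arrays.asList("c"));
    WalReader reader = path -> {
      List<String> entries = wals.get(path);
      if (entries == null) {
        throw new EOFException(path + " is empty or truncated");
      }
      return entries;
    };
    EofHandlingSketch sketch = new EofHandlingSketch(
        new ArrayDeque<>(Arrays.asList("wal1", "wal2", "wal3")), reader);
    sketch.readAndShip();
    for (Batch b : sketch.shipped) {
      System.out.println(b.lastWalPath + " " + b.entries);
    }
  }
}
```

Running main prints "wal1 [a, b]" followed by "wal3 [c]". The entries from wal1 are shipped when wal2 hits EOF instead of being discarded. If the EOF instead lands on the last queued WAL, the trailing empty batch is skipped rather than shipped with a null path.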