[ 
https://issues.apache.org/jira/browse/HBASE-25596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360525#comment-17360525
 ] 

Sandeep Pal commented on HBASE-25596:
-------------------------------------

[~zhangduo] This is where I think we fail to replicate entries that were already read into the batch.

 

 
{code:java}
            while (hasNext) {
              Entry entry = entryStream.next(); // <-- we hit an exception here
              entry = filterEntry(entry);
              if (entry != null) {
                WALEdit edit = entry.getEdit();
                if (edit != null && !edit.isEmpty()) {
                  long entrySize = getEntrySizeIncludeBulkLoad(entry);
                  long entrySizeExcludeBulkLoad = getEntrySizeExcludeBulkLoad(entry);
                  batch.addEntry(entry, entrySize); // <-- we add the entries to the batch
                  updateBatchStats(batch, entry, entryStream.getPosition(), entrySize);
                  boolean totalBufferTooLarge = acquireBufferQuota(entrySizeExcludeBulkLoad);
                  // Stop if too many entries or too big
                  if (totalBufferTooLarge || batch.getHeapSize() >= replicationBatchSizeCapacity
                    || batch.getNbEntries() >= replicationBatchCountCapacity) {
                    break;
                  }
                }
              }
              hasNext = entryStream.hasNext();
{code}
 

While reading WALs, we add entries to the batch. If we then hit an exception 
partway through, say because the next WAL file is empty, we never replicate the 
existing batch, even though it may already hold entries from the previous WAL 
file. I am referring to the branch-1 code 
[here|https://github.com/apache/hbase/blob/branch-1/hbase-server/src/main/java/org/apache/hadoop/hbase/replication/regionserver/ReplicationSourceWALReaderThread.java#L165].
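
To illustrate the behaviour described above, here is a minimal, self-contained sketch of a read loop that keeps the partially filled batch when an EOFException interrupts it, so the caller can still ship those entries. It is not HBase code: `EntrySource`, `Batch`, `readBatch` and `failingAfter` are hypothetical stand-ins for the WAL entry stream and entry batch, and the actual patch may be structured differently.

```java
import java.io.EOFException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the reader loop; none of these types are HBase classes. */
final class PartialBatchSketch {

  /** Hypothetical entry source that may fail with EOFException, like an empty WAL file. */
  interface EntrySource {
    boolean hasNext() throws EOFException;
    String next() throws EOFException;
  }

  /** Result of one read pass: the entries gathered and whether EOF cut the pass short. */
  static final class Batch {
    final List<String> entries = new ArrayList<>();
    boolean hitEof;
  }

  /** Reads up to maxEntries; on EOFException it returns what was read instead of dropping it. */
  static Batch readBatch(EntrySource source, int maxEntries) {
    Batch batch = new Batch();
    try {
      while (batch.entries.size() < maxEntries && source.hasNext()) {
        batch.entries.add(source.next());
      }
    } catch (EOFException e) {
      // Keep the entries already added so the caller can ship them
      // before it deals with the broken file.
      batch.hitEof = true;
    }
    return batch;
  }

  /** Yields the given entries, then throws EOFException to simulate a following empty WAL. */
  static EntrySource failingAfter(List<String> entries) {
    final Iterator<String> it = entries.iterator();
    return new EntrySource() {
      @Override
      public boolean hasNext() throws EOFException {
        if (!it.hasNext()) {
          throw new EOFException("simulated empty WAL");
        }
        return true;
      }

      @Override
      public String next() {
        return it.next();
      }
    };
  }

  public static void main(String[] args) {
    Batch batch = readBatch(failingAfter(Arrays.asList("edit-1", "edit-2")), 10);
    System.out.println(batch.entries + " hitEof=" + batch.hitEof);
  }
}
```

In this sketch the caller would ship `batch.entries` first and only then act on `hitEof`, for example by dequeuing the bad file. A caller that sees an empty batch together with `hitEof` has nothing to ship and can skip the shipping step.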
 

 

 

> Fix NPE in ReplicationSourceManager as well as avoid permanently unreplicated 
> data due to EOFException from WAL
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-25596
>                 URL: https://issues.apache.org/jira/browse/HBASE-25596
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Sandeep Pal
>            Assignee: Sandeep Pal
>            Priority: Critical
>             Fix For: 3.0.0-alpha-1, 1.7.0, 2.5.0, 2.4.2
>
>
> There seems to be a major issue with how we handle the EOFException from 
> WALEntryStream. 
> Problem:
> When we see an EOFException, we try to handle it and remove the WAL from the log 
> queue, but we never try to ship the existing batch of entries. *This is 
> permanent data loss in replication.*
>  
> Secondly, we do not stop the reader on encountering the EOFException, so 
> if the EOFException was on the last WAL, we still try to process the WALEntry 
> stream and ship an empty batch with lastWALPath set to null. This is the 
> cause of the NPE below, which *crashes* the region server. 
> {code:java}
> 2021-02-16 15:33:21,293 ERROR [,60020,1613262147968] regionserver.ReplicationSource - Unexpected exception in ReplicationSourceWorkerThread, currentPath=null
> java.lang.NullPointerException
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.logPositionAndCleanOldLogs(ReplicationSourceManager.java:193)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.updateLogPosition(ReplicationSource.java:831)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.shipEdits(ReplicationSource.java:746)
>     at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceShipperThread.run(ReplicationSource.java:650)
> 2021-02-16 15:33:21,294 INFO [,60020,1613262147968] regionserver.HRegionServer - STOPPED: Unexpected exception in ReplicationSourceWorkerThread
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
