Marton Elek created RATIS-804:
---------------------------------

             Summary: Race condition between cache evict and load in LogSegment
                 Key: RATIS-804
                 URL: https://issues.apache.org/jira/browse/RATIS-804
             Project: Ratis
          Issue Type: Bug
            Reporter: Marton Elek


I am stress testing Ozone. I start one Datanode in FOLLOWER mode, and the
load generator (Freon) behaves like a LEADER.

I am sending a huge number of AppendLogEntries to the FOLLOWER without any
throttling.

As a result I got an NPE:
{code:java}
2020-01-28 15:08:20 ERROR StateMachineUpdater:184 - 3fda0c39-ce3c-4540-a804-44d9ac1f4853@group-E1B13B4CA5C0-StateMachineUpdater: the StateMachineUpdater hits Throwable
org.apache.ratis.server.raftlog.RaftLogIOException: java.lang.NullPointerException
        at org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:320)
        at org.apache.ratis.server.raftlog.segmented.SegmentedRaftLog.get(SegmentedRaftLog.java:293)
        at org.apache.ratis.server.impl.StateMachineUpdater.applyLog(StateMachineUpdater.java:218)
        at org.apache.ratis.server.impl.StateMachineUpdater.run(StateMachineUpdater.java:167)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
        at java.util.Objects.requireNonNull(Objects.java:203)
        at org.apache.ratis.server.raftlog.segmented.LogSegment$LogEntryLoader.load(LogSegment.java:214)
        at org.apache.ratis.server.raftlog.segmented.LogSegment.loadCache(LogSegment.java:318)
        ... 4 more
{code}
It seems to be a race condition between LogSegment.evictCache() and
LogSegment.loadCache():
 # StateMachineUpdater tries to update the StateMachine with the next log entry.
 # The entry can't be found in the cache, so LogSegment.loadCache() is called.
 # LogSegment.LogEntryLoader.load() reads the segment files from the disk.
 # After loading, it returns the loaded entry.

If the gRPC thread evicts the cache between steps 3 and 4 (this is possible
because the log segment may already be flushed and is therefore eligible for
eviction), an NPE is thrown.
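
The interleaving can be replayed deterministically with a small sketch. This is a simplified, hypothetical model and not the Ratis code: the class SegmentCacheSketch, its String entries and the afterFill hook are invented for illustration. It assumes the loader fills the shared cache and then reads the requested entry back from that cache, which would match the requireNonNull frame in the stack trace. loadSafe shows one possible way to avoid the problem, by keeping a local reference to the requested entry while loading.

```java
import java.util.Map;
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;

// Simplified, hypothetical model of a log segment with an evictable entry cache.
final class SegmentCacheSketch {
  // Stand-in for the segment file on disk: log index -> entry.
  private final Map<Long, String> disk;
  private final Map<Long, String> cache = new ConcurrentHashMap<>();
  // Hook that runs between steps 3 and 4, so the eviction can be replayed
  // deterministically instead of relying on thread timing.
  Runnable afterFill = () -> { };

  SegmentCacheSketch(Map<Long, String> disk) {
    this.disk = disk;
  }

  void evictCache() {
    cache.clear();
  }

  // Racy variant: fill the shared cache, then read the entry back from it.
  String loadRacy(long index) {
    cache.putAll(disk);
    afterFill.run(); // a concurrent evictCache() may land here
    return Objects.requireNonNull(cache.get(index));
  }

  // Safer variant: remember the requested entry locally while loading,
  // so a later eviction cannot make the result disappear.
  String loadSafe(long index) {
    String wanted = null;
    for (Map.Entry<Long, String> e : disk.entrySet()) {
      cache.put(e.getKey(), e.getValue());
      if (e.getKey() == index) {
        wanted = e.getValue();
      }
    }
    afterFill.run();
    return Objects.requireNonNull(wanted, "entry not in segment");
  }

  public static void main(String[] args) {
    SegmentCacheSketch segment =
        new SegmentCacheSketch(Map.of(1L, "entry-1", 2L, "entry-2"));
    // Simulate the gRPC thread evicting the cache right after the load.
    segment.afterFill = segment::evictCache;
    try {
      segment.loadRacy(1L);
    } catch (NullPointerException e) {
      System.out.println("racy load failed with NPE");
    }
    System.out.println("safe load returned " + segment.loadSafe(1L));
  }
}
```

Holding a lock across both load and evict would be another option. The local-reference approach is shown only because it keeps the sketch short; whether it fits the real LogSegment is an open question.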



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
