[ https://issues.apache.org/jira/browse/FLUME-2118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841375#comment-13841375 ]
Brock Noland commented on FLUME-2118:
-------------------------------------

Never mind my last comment. I do think this scenario occurs most often when dual checkpoint is not enabled, because the slow remove() code is hit much more often during full replay. We'll take this forward in FLUME-2155.

TL;DR: Enable dual checkpoint and you'll see this less often.

> Occasional multi-hour pauses in file channel replay
> ---------------------------------------------------
>
>                 Key: FLUME-2118
>                 URL: https://issues.apache.org/jira/browse/FLUME-2118
>             Project: Flume
>          Issue Type: Bug
>          Components: File Channel
>    Affects Versions: v1.5.0
>            Reporter: Juhani Connolly
>         Attachments: flume-log, flume-thread-dump, gc-flume.log.20130702
>
>
> Sometimes during replay, immediately after an EOF of one log, the replay will pause for a long time.
> Here are two samples from this morning, when we restarted our 3 aggregators and 2 of them hit this issue.
>
> 02 7 2013 03:06:30,089 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2200000 records
> 02 7 2013 03:06:30,179 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2210000 records
> 02 7 2013 03:06:30,241 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.LogFile$SequentialReader.next:505) - Encountered EOF at 1623195625 in /data2/flume-data/log-1184
> 02 7 2013 06:23:27,629 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2220000 records
> 02 7 2013 06:23:28,641 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2230000 records
> 02 7 2013 06:23:29,162 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2240000 records
> 02 7 2013 06:23:30,118 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2250000 records
> 02 7 2013 06:23:30,750 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2260000 records
>
> 02 7 2013 08:03:00,942 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2160000 records
> 02 7 2013 08:03:01,055 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2170000 records
> 02 7 2013 08:03:01,168 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2180000 records
> 02 7 2013 08:03:01,181 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.LogFile$SequentialReader.next:505) - Encountered EOF at 1623195640 in /data2/flume-data/log-1182
> 02 7 2013 14:45:55,302 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2190000 records
> 02 7 2013 14:45:56,282 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2200000 records
> 02 7 2013 14:45:57,084 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2210000 records
> 02 7 2013 14:45:59,043 INFO [lifecycleSupervisor-1-0] (org.apache.flume.channel.file.ReplayHandler.replayLog:292) - Read 2220000 records
>
> I've spent a bit over an hour trying to track down the cause of this. Nothing suspicious is turning up on ganglia, and a cursory review of the code didn't turn up anything overly suspicious either. Owing to time limitations I can't dig further at this time.
> We run a version of flume from somewhat before the current 1.4 release candidate (hash is eefefa941a60c0982f0957804be0cafb4d83e46e); there don't appear to be any replay patches since then.

--
This message was sent by Atlassian JIRA
(v6.1#6144)
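
For readers who want to try the dual checkpoint suggestion from the comment above, a minimal configuration sketch follows. It assumes a file channel on a Flume version that supports the `useDualCheckpoints` and `backupCheckpointDir` properties (check the user guide for your release). The agent name `agent1`, the channel name `fc1`, and the checkpoint paths are hypothetical placeholders; only `/data2/flume-data` comes from the logs in the report.

```properties
# Hypothetical agent and channel names; substitute your own.
agent1.channels = fc1
agent1.channels.fc1.type = file

# Data directory as seen in the reported logs; checkpoint paths are assumptions.
agent1.channels.fc1.dataDirs = /data2/flume-data
agent1.channels.fc1.checkpointDir = /data1/flume-checkpoint

# Keep a backup copy of the checkpoint, so that a damaged primary
# checkpoint does not necessarily force a full replay of the data logs.
agent1.channels.fc1.useDualCheckpoints = true
# Must differ from the checkpoint directory and from every data directory.
agent1.channels.fc1.backupCheckpointDir = /data1/flume-checkpoint-backup
```

This does not remove the slow path discussed in the comment; it only makes a full replay, where that path is hit most often, less likely.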