> On June 21, 2012, 7:23 a.m., Hari Shreedharan wrote:
> > +1. Looks good! I haven't review the code thoroughly, since Brock would be 
> > the best for that. I did test it out though - like Will mentioned, works 
> > well. Responds to Ctrl-C, phew! And restarts after kills also. I am not 
> > sure of edge cases or more subtle problems, but looks good in general. (I'm 
> > not going to mark Ship It - since I'd expect a full review from Brock to 
> > decide). 
> > 
> > Thanks Arvind! Great work!
> 
> Hari Shreedharan wrote:
>     I have one comment though. Would it be possible to use the agent name, 
> pid and channel name to make sure multiple channels will work even if 
> multiple flume agents and/or multiple flows within the same agent work 
> properly? As of now it looks like only one agent can run using the default. 
> It might be a good idea to do something about it. 
>     
>     Right now, if I use the exact same config on 2 different agents only one 
> of them will make any progress, not the other. I didn't check with multiple 
> flows within the same agent though.
> 
> Arvind Prabhakar wrote:
>     Hari - thanks for taking the time to review and test the patch. Regarding 
> your comment - can you please elaborate more? Specifically - are you able to 
> run multiple agents when you specify different (non-default) data and 
> checkpoint directories? If so, then this is not really an issue as it was 
> discussed as part of FLUME-1085 patch and we decided against it. Please see 
> https://reviews.apache.org/r/4661/ for the discussion.

Btw, the above situation also causes the logs to be filled with IOException :
java.lang.RuntimeException: java.io.IOException: Cannot lock 
/Users/hshreedharan/.flume/file-channel/checkpoint. The directory is already 
locked.
2012-06-21 00:43:27,426 INFO nodemanager.DefaultLogicalNodeManager: Waiting for 
channel: fileChannel to start. Sleeping for 500 ms
2012-06-21 00:43:27,928 INFO nodemanager.DefaultLogicalNodeManager: Waiting for 
channel: fileChannel to start. Sleeping for 500 ms


At this point, the channel causes the agent to not respond to SIGINT and a kill 
-9 is needed to kill the agent.


- Hari


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/5436/#review8444
-----------------------------------------------------------


On June 21, 2012, 1:29 a.m., Arvind Prabhakar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/5436/
> -----------------------------------------------------------
> 
> (Updated June 21, 2012, 1:29 a.m.)
> 
> 
> Review request for Flume and Brock Noland.
> 
> 
> Description
> -------
> 
> The file channel uses a disk-serialized in-memory checkpointing mechanism. 
> When the channel is full and the capacity is large, these checkpoints take a 
> long time to serialize and deserialize. For example, a channel with 1M 
> entries could take many minutes to boot up. Similarly, a boot up of a largely 
> full channel would require the replay of all log events to reconstruct the 
> correct state. Due to this latency issues and the failure interaction of the 
> channel with the LifeCycleSupervisor, the system could get into an unusable 
> state easily as evident from the FLUME-1232 issue.
> 
> This patch modifies the checkpointing mechanism as follows:
> * The FlumeEventQueue itself represents a checkpoint that is maintained as a 
> memory mapped file.
> * During checkpointing, a marker is introduced in active logs which is used 
> to skip records during display. 
> 
> In order to ensure correctness, a reader/writer lock is used where the reader 
> lock is used by consumers operating against the channel while the writer lock 
> is used to facilitate checkpointing. Some limitations of this approach are:
> 
> * The total number of active log files is now limited to a maximum of 1024. 
> * Dynamic resizing of the channel capacity is no longer allowed unless the 
> checkpoint is rebuild from scratch which can cause significant delay in 
> startup.
> 
> 
> This addresses bug FLUME-1232.
>     https://issues.apache.org/jira/browse/FLUME-1232
> 
> 
> Diffs
> -----
> 
>   
> /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/Checkpoint.java
>  1351988 
>   
> /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/FileChannel.java
>  1351988 
>   
> /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/FileChannelConfiguration.java
>  1351988 
>   
> /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/FlumeEventQueue.java
>  1351988 
>   
> /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/Log.java
>  1351988 
>   
> /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/LogFile.java
>  1351988 
>   
> /trunk/flume-ng-channels/flume-file-channel/src/main/java/org/apache/flume/channel/file/ReplayHandler.java
>  1351988 
>   
> /trunk/flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestCheckpoint.java
>  1351988 
>   
> /trunk/flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestFileChannel.java
>  1351988 
>   
> /trunk/flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestFlumeEventQueue.java
>  1351988 
>   
> /trunk/flume-ng-channels/flume-file-channel/src/test/java/org/apache/flume/channel/file/TestLog.java
>  1351988 
>   /trunk/flume-ng-core/src/main/java/org/apache/flume/sink/LoggerSink.java 
> 1351988 
> 
> Diff: https://reviews.apache.org/r/5436/diff/
> 
> 
> Testing
> -------
> 
> Ran all tests. Did some manual testing. Will be doing more manual testing and 
> cleanup as necessary while the review is underway.
> 
> 
> Thanks,
> 
> Arvind Prabhakar
> 
>

Reply via email to