My team uses Flume 1.4.0, packaged with CDH5.0.2, through an embedded agent that writes to a file channel. In an earlier thread started by my colleague, "FileChannel Replays consistently take a long time", and the associated issue, https://issues.apache.org/jira/browse/FLUME-2450, it was suggested that we use a backup checkpoint directory to avoid lengthy replays. After enabling the backup checkpoint directory, I observed via iotop that my application, which hosts the embedded agent, drives IO to nearly 100%. This level of IO persists for about 30 seconds and renders the application unusable for that period.
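For reference, here is a minimal sketch of the kind of embedded-agent channel configuration I am describing. The class name, the build helper, and the paths are hypothetical placeholders rather than our actual code; the property keys are the documented file channel settings (checkpointDir, dataDirs, useDualCheckpoints, backupCheckpointDir) with the "channel." prefix that the embedded agent expects. The sink and processor settings that a working agent also needs are omitted.

```java
import java.util.HashMap;
import java.util.Map;

public class FileChannelBackupConfig {

    // Builds the channel portion of an embedded-agent configuration with a
    // backup checkpoint enabled. All three paths are placeholders supplied
    // by the caller; the backup directory must differ from the checkpoint
    // directory and from the data directories.
    static Map<String, String> build(String checkpointDir, String backupDir, String dataDirs) {
        Map<String, String> props = new HashMap<String, String>();
        props.put("channel.type", "file");
        props.put("channel.checkpointDir", checkpointDir);
        props.put("channel.dataDirs", dataDirs);
        // These two settings turn on the backup (dual) checkpoint.
        props.put("channel.useDualCheckpoints", "true");
        props.put("channel.backupCheckpointDir", backupDir);
        return props;
    }

    public static void main(String[] args) {
        // Hypothetical paths, for illustration only.
        Map<String, String> props =
            build("/var/flume/checkpoint", "/var/flume/checkpoint-backup", "/var/flume/data");
        for (Map.Entry<String, String> e : props.entrySet()) {
            System.out.println(e.getKey() + " = " + e.getValue());
        }
        // In the real application, this map (plus sink settings) would be
        // passed to EmbeddedAgent.configure(...) before calling start().
    }
}
```

Disabling the backup checkpoint for the comparison amounts to leaving out the last two properties, since useDualCheckpoints defaults to false.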
For comparison, I also monitored with iotop while the backup checkpoint was disabled. In that case IO activity lasts at most several seconds, so enabling the backup checkpoint directory makes a qualitative difference. I also tried deleting the existing checkpoint and data directories to start with a clean slate; the results of those experiments are in line with the observations above.
Is this expected behavior when using a backup checkpoint directory? Is there any way to reduce the amount of IO? I would appreciate feedback and insights, because the current behavior is untenable for a production environment.
Thank you,
Michael
