[ https://issues.apache.org/jira/browse/FLUME-1425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13491133#comment-13491133 ]
Mike Percy commented on FLUME-1425:
-----------------------------------

Rather than @Ignore the test on commit, I posted a small patch to FLUME-1681 to disable the unit test for now.

> Create a SpoolDirectory Source and Client
> -----------------------------------------
>
>                 Key: FLUME-1425
>                 URL: https://issues.apache.org/jira/browse/FLUME-1425
>             Project: Flume
>          Issue Type: Improvement
>            Reporter: Patrick Wendell
>            Assignee: Patrick Wendell
>             Fix For: v1.3.0
>
>         Attachments: FileProcessingSource.java,
> FLUME-1425.avro-conf-file.txt, FLUME-1425.patch.v1.txt,
> FLUME-1425.v5.patch.txt, FLUME-1425.v6.patch.txt, FLUME-1425.v6.patch.txt,
> FLUME-1425.v7.patch.txt, FLUME-1425.v8.patch.txt
>
>
> The proposal is to create a small executable client which reads logs from a
> spooling directory and sends them to a flume sink, then performs cleanup on
> the directory (either by deleting or moving the logs). It would make the
> following assumptions:
> - Files placed in the directory are uniquely named
> - Files placed in the directory are immutable
> The problem this is trying to solve is that there is currently no way to do
> guaranteed event delivery across flume agent restarts when the data is being
> collected through an asynchronous source (and not directly from the client
> API). Say, for instance, you are using an exec("tail -F") source. If the agent
> restarts due to error or intentionally, tail may pick up at a new location
> and you lose the intermediate data.
> At the same time, there are users who want at-least-once semantics, and
> expect those to apply as soon as the data is written to disk from the initial
> logger process (e.g. apache logs), not just once it has reached a flume
> agent. This idea would bridge that gap, assuming the user is able to copy
> immutable logs to a spooling directory through a cron script or something.
> The basic internal logic of such a client would be as follows:
> - Scan the directory for files
> - Choose a file and read through, while sending events to an agent
> - Close the file and delete it (or rename, or otherwise mark completed)
> That's about it. We could add sync-points to make recovery more efficient in
> the case of failure.
> A key question is whether this should be implemented as a standalone client
> or as a source. My instinct is actually to do this as a source, but there
> could be some benefit to not requiring an entire agent in order to run this,
> specifically that it would become platform independent and you could stick it
> on Windows machines. Others I have talked to have also sided with a standalone
> executable.
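
The three steps in the quoted description (scan, read and send, mark completed) can be sketched in Java roughly as follows. This is an illustration only, not the code from the attached patches: the class name SpoolSketch, the ".COMPLETED" suffix, and the plain Consumer<String> standing in for whatever actually delivers events to an agent are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import java.util.function.Consumer;
import java.util.stream.Stream;

/** Hypothetical sketch of the scan / read-and-send / mark-completed loop. */
public class SpoolSketch {
    // Assumed marker for finished files; the proposal only says
    // "delete it (or rename, or otherwise mark completed)".
    static final String DONE_SUFFIX = ".COMPLETED";

    /** Step 1: scan the directory and pick one file that is not yet marked. */
    static Optional<Path> nextFile(Path spoolDir) throws IOException {
        try (Stream<Path> entries = Files.list(spoolDir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> !p.getFileName().toString().endsWith(DONE_SUFFIX))
                    .sorted() // deterministic choice: lexicographically first path
                    .findFirst();
        }
    }

    /**
     * Steps 2 and 3: read the chosen file line by line, hand each line to the
     * sink, then rename the file. Returns false when nothing is left to process.
     * The sink is a stand-in for the code that sends events to an agent.
     */
    static boolean processOne(Path spoolDir, Consumer<String> sink) throws IOException {
        Optional<Path> next = nextFile(spoolDir);
        if (!next.isPresent()) {
            return false;
        }
        Path file = next.get();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                sink.accept(line);
            }
        }
        // Reached only if every line was handed to the sink without an exception.
        // Relies on the "uniquely named" assumption: the target must not exist.
        Files.move(file, file.resolveSibling(file.getFileName() + DONE_SUFFIX));
        return true;
    }
}
```

Because the rename happens only after the whole file has been handed to the sink, a crash or a sink failure part-way through leaves the file unmarked, and it is read again from the start on the next pass. That gives at-least-once behaviour at the cost of possible duplicates, which is the inefficiency the sync-points mentioned in the description would reduce. The sketch also depends on both stated assumptions: files must be immutable once placed in the directory, and uniquely named so the rename never collides.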