[jira] [Commented] (FLUME-1685) ExecSource shouldn't die if the channel is full

Steve Hoffman (JIRA) Wed, 16 Jan 2013 18:32:23 -0800

    [ https://issues.apache.org/jira/browse/FLUME-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13555766#comment-13555766 ]


Steve Hoffman commented on FLUME-1685:
--------------------------------------

I would argue that the behavior is broken.  I'm assuming that the rule doesn't 
apply if the functionality is broken.  Should this situation occur, the source 
stops processing data and never recovers, even after the channel empties.  I 
can't see this as desired behavior under any situation.  You'd have to call 
the flag "stay.broken" with a default of "true" ;)

I can't create a review in RB.  It seems to want to know which repo the patch 
is committed in - which it isn't (since I'm not an official committer).  
Thoughts?
                
> ExecSource shouldn't die if the channel is full
> -----------------------------------------------
>
>                 Key: FLUME-1685
>                 URL: https://issues.apache.org/jira/browse/FLUME-1685
>             Project: Flume
>          Issue Type: Bug
>          Components: Sinks+Sources
>    Affects Versions: v1.2.0, v1.3.0, v1.4.0
>            Reporter: Steve Hoffman
>         Attachments: 
> 0001-FLUME-1685-don-t-kill-ExecSource-if-channel-is-full-branch-v1.2.0.patch, 
> 0001-FLUME-1685-don-t-kill-ExecSource-if-channel-is-full-trunk.patch
>
>
> Imagine this scenario.  You are using the ExecSource to tail a file and send 
> to a file channel.  When the file channel fills due to a temporary issue 
> downstream, the source gets a ChannelException which kills the source.
> {code}
> 2012-10-31 20:45:57,872 ERROR source.ExecSource: Failed while running command: tail -F /tmp/test.log
> org.apache.flume.ChannelException: Unable to put batch on required channel: FileChannel test { dataDirs: [/tmp/test/data] }
>         at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:195)
>         at org.apache.flume.source.ExecSource$ExecRunnable.run(ExecSource.java:275)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>         at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>         at java.lang.Thread.run(Thread.java:662)
> Caused by: org.apache.flume.ChannelException: Cannot acquire capacity. [channel=hbasejson]
>         at org.apache.flume.channel.file.FileChannel$FileBackedTransaction.doPut(FileChannel.java:346)
>         at org.apache.flume.channel.BasicTransactionSemantics.put(BasicTransactionSemantics.java:93)
>         at org.apache.flume.channel.BasicChannelSemantics.put(BasicChannelSemantics.java:76)
>         at org.apache.flume.channel.ChannelProcessor.processEventBatch(ChannelProcessor.java:184)
>         ... 7 more
> {code}
> The situation where the command being 'exec'ed fails/exits is already handled 
> with the existing retry logic.
> I suggest that when the source gets a ChannelException it throw the event 
> away (since there is nowhere to put it), sleep for one second, and loop 
> again for another event.  If the channel is still throwing an exception 
> (still full), the event is dropped, the sleep time is doubled, and we repeat 
> again.  There should be an upper bound on the retry time (say 128 seconds -- 
> about 2 minutes) for the next attempt.  When the putEvent no longer throws a 
> ChannelException, the "fallback" mode is reset and we read records at full 
> speed again.
> Clearly in a situation where the channel is full, data loss will happen.  But 
> in this case, we wouldn't have to restart the agent.  At scale this is an 
> administrative pain.  Even detecting this is difficult as the flume agent 
> itself is still running.  In this case (running a 'tail'), the tail will 
> eventually result in data loss should the file being tailed rotate.  
> Something has to give somewhere.
> I've got a patch I'm working on for this, but wanted to get the JIRA rolling 
> first.
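
The retry schedule proposed in the description (sleep one second, double on 
each further failure, cap at 128 seconds, reset on the first successful put) 
can be sketched roughly as below.  This is only an illustration of the 
arithmetic, not the attached patch: ChannelBackoff and its method names are 
hypothetical and are not part of Flume, and the class deliberately has no 
Flume dependencies.

```java
/**
 * Hypothetical helper (not part of Flume, not the attached patch): tracks the
 * sleep interval for the "drop, sleep, double, cap, reset" scheme described
 * in the issue.
 */
final class ChannelBackoff {
    private final long initialMillis;
    private final long maxMillis;
    private long currentMillis;

    ChannelBackoff(long initialMillis, long maxMillis) {
        if (initialMillis <= 0 || maxMillis < initialMillis) {
            throw new IllegalArgumentException(
                "need 0 < initialMillis <= maxMillis");
        }
        this.initialMillis = initialMillis;
        this.maxMillis = maxMillis;
        this.currentMillis = initialMillis;
    }

    /**
     * Call after a put fails (for example, the channel is full).
     * Returns how long to sleep now, then doubles the interval that the
     * next failure will use, never exceeding the cap.
     */
    long onFailure() {
        long sleepNow = currentMillis;
        // Compare against max/2 first so the doubling cannot overflow.
        currentMillis = (currentMillis > maxMillis / 2)
            ? maxMillis
            : currentMillis * 2;
        return sleepNow;
    }

    /** Call after a put succeeds: the next failure starts from the initial interval. */
    void onSuccess() {
        currentMillis = initialMillis;
    }
}
```

In a source loop, the assumed usage would be: on a ChannelException, drop the 
batch and call Thread.sleep(backoff.onFailure()); after any successful put, 
call backoff.onSuccess() so reading resumes at full speed.  With 
new ChannelBackoff(1000L, 128000L) the sleeps run 1, 2, 4, ... 64, 128 
seconds and then stay at 128.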

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
