Hi Malte,

How often are the exceptions being thrown?  I believe this is the expected
behavior when a sink fails: Flume retries the failed sink at increasing
intervals, up to the limit set by the "maxpenalty" property.  I don't
believe there is any logic to permanently remove a failed sink.

The default max penalty (i.e., the maximum time a failed sink will wait
before being retried) is 30 seconds (30000 ms).  You can increase this
penalty if you want to wait longer between retries.  Note that Flume uses
an exponential cool-down period: the wait time between retries doubles on
each failure, so when a sink first fails you will see several attempts in
fairly quick succession before they start spacing out.

The actual computation for this is:

Math.min(maxPenalty, (1 << sequentialFailures) * FAILURE_PENALTY)


Where "FAILURE_PENALTY" is 1000 ms, "maxPenalty" defaults to 30000 ms, and
I think "sequentialFailures" increments with each failure starting at 1
(although it might start at 0).  With the default values you should see
retries at the following intervals:

1 second (if sequentialFailures starts at 0)
2 seconds
4 seconds
8 seconds
16 seconds
30 seconds
(repeat retry every 30 seconds)
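The schedule above can be sketched as a small standalone program (a minimal
sketch of the computation, not Flume's actual code; the class and method
names are mine, and it assumes sequentialFailures starts at 0):

```java
// Sketch of the failover sink processor's backoff computation:
// wait = min(maxPenalty, (1 << sequentialFailures) * FAILURE_PENALTY)
public class BackoffSketch {
    static final long FAILURE_PENALTY = 1000L; // base penalty in ms

    static long penalty(int sequentialFailures, long maxPenalty) {
        // Doubles with each consecutive failure, capped at maxPenalty.
        return Math.min(maxPenalty, (1L << sequentialFailures) * FAILURE_PENALTY);
    }

    public static void main(String[] args) {
        long maxPenalty = 30000L; // default "maxpenalty" of 30 seconds
        for (int n = 0; n <= 6; n++) {
            System.out.println("failure " + n + ": wait " + penalty(n, maxPenalty) + " ms");
        }
    }
}
```

Running it shows the doubling intervals capping out at 30000 ms once the
shifted value exceeds the max penalty.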

If you want to reduce the number of exceptions, you could increase
maxpenalty so that Flume doesn't retry the failed sink as often.
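For example, in the sink group config you posted, raising the cap to five
minutes might look like this (a sketch based on your group name; I haven't
tested this against your setup):

```properties
# Wait up to 5 minutes (300000 ms) before retrying a failed sink
agent.sinkgroups.groupOne.processor.maxpenalty = 300000
```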

Best,

Ed


On Thu, Feb 20, 2014 at 9:25 PM, Malte Pickhan <[email protected]> wrote:

> Hi there,
>
> we are currently trying to use Flume to stream log files from a host to
> the HDFS of a Cloudera cluster.
>
> Since we want a reliable system, we've set up Flume with a failover
> sink group, so that if one of the sinks fails the other one takes over.
> Each sink is configured to connect to a separate namenode.
>
> This is what our config looks like:
>
> # Defining a sinkgroup for failover
> agent.sinkgroups = groupOne
> agent.sinkgroups.groupOne.sinks = hdfsSink1 hdfsSink2
> agent.sinkgroups.groupOne.processor.type = failover
> agent.sinkgroups.groupOne.processor.priority.hdfsSink1 = 10
> agent.sinkgroups.groupOne.processor.priority.hdfsSink2 = 5
>
> agent.sources = tailSrc
> agent.channels = memoryChannel
> agent.sinks = hdfsSink1 hdfsSink2
>
> # For each one of the sources, the type is defined
> agent.sources.tailSrc.type = exec
> agent.sources.tailSrc.command = tail -F /var/log/events.log
> agent.sources.tailSrc.channels = memoryChannel
>
> # Definition of first sink
> agent.sinks.hdfsSink1.type = hdfs
> agent.sinks.hdfsSink1.hdfs.useLocalTimeStamp = True
> agent.sinks.hdfsSink1.hdfs.path = hdfs://host1.com:8020/events/%y-%m-%d/%H
> agent.sinks.hdfsSink1.hdfs.filePrefix = %M-events
> #Specify the channel the sink should use
> agent.sinks.hdfsSink1.channel = memoryChannel
>
> # Each sink's type must be defined
> agent.sinks.hdfsSink2.type = hdfs
> agent.sinks.hdfsSink2.hdfs.useLocalTimeStamp = True
> agent.sinks.hdfsSink2.hdfs.path = hdfs://host2.com:8020/events/%y-%m-%d/%H
> agent.sinks.hdfsSink2.hdfs.filePrefix = %M-events
> #Specify the channel the sink should use
> agent.sinks.hdfsSink2.channel = memoryChannel
>
>
> # Each channel's type is defined.
> agent.channels.memoryChannel.type = memory
>
> # Other config values specific to each type of channel(sink or source)
> # can be defined as well
> # In this case, it specifies the capacity of the memory channel
> agent.channels.memoryChannel.capacity = 1000
>
> In the first run we used Flume 1.4.0 and tested the switch to the backup
> sink by doing a manual failover of the namenode. This didn't work at all;
> Flume got stuck throwing exceptions.
> A bit of research turned up a known bug
> (https://issues.apache.org/jira/browse/FLUME-1779), so we patched Flume
> 1.5.0 ourselves. At least the failover is working now.
>
> Nevertheless, Flume keeps throwing exceptions for the sink that has
> been disconnected. Does anyone have an idea how to tackle this issue?
>
> Failed to renew lease for [DFSClient_NONMAPREDUCE_-1223354028_45] for 3598
> seconds.  Will retry shortly ...
> org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException):
> Operation category WRITE is not supported in state standby
>
> Best Regards,
>
> Malte
>
