Cameron,
FLUME-808 addresses a different race condition; I'm not sure it helps
fix the RollSink issue. However, I'm glad you tried it.
Can you post your RollSink fix to FLUME-798 (this one, right?)?
Thanks,
Mingjie
On 10/27/2011 12:15 PM, Cameron Gandevia wrote:
Hey
We were having problems with our collectors dying (we always had errors
in the logs). We recently applied the patch
https://issues.apache.org/jira/browse/Flume-808 and modified the
RollSink TriggerThread so that it does not interrupt the append job when
acquiring its lock. Our collectors have now been up for a few days
without problems.
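For anyone hitting the same thing, here is a minimal sketch of the idea
(class and member names are ours, not Flume's actual source): the
trigger thread queues up on the same lock that append() holds, instead
of interrupting a blocked append.

import java.util.concurrent.locks.ReentrantLock;

// Sketch only -- illustrates the non-interrupting rotation described
// above. RollSinkSketch is a made-up name, not Flume's real class.
public class RollSinkSketch {
  private final ReentrantLock lock = new ReentrantLock();

  // Called by the driver thread for every incoming event.
  public void append(String event) {
    lock.lock();
    try {
      // ... write the event to the currently open file ...
    } finally {
      lock.unlock();
    }
  }

  // Called periodically by the trigger thread. Rather than interrupting
  // a blocked append(), it waits its turn on the same lock, so an
  // in-flight append always completes before the file is rolled.
  public void rotate() {
    lock.lock();
    try {
      // ... close the current file and open the next one ...
    } finally {
      lock.unlock();
    }
  }
}

The trade-off is that a roll can be delayed by a slow append, which we
found preferable to the whole connector exiting mid-write.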
On Thu, Oct 27, 2011 at 8:10 AM, Eran Kutner <e...@gigya.com
<mailto:e...@gigya.com>> wrote:
Just grepped a few days of logs and I don't see this error. It seems
to be correlated with higher load on the HDFS servers (like when
map/reduce jobs are running).
When it happens, the agents fail to connect to the collectors, but I
don't see any errors in the collector logs. They just hang, while
other virtual collectors on the same server continue to work.
-eran
On Thu, Oct 27, 2011 at 06:39, Eric Sammer <esam...@cloudera.com
<mailto:esam...@cloudera.com>> wrote:
It's almost certainly the issue Mingjie mentioned. There's a race
condition in the rolling that has plagued a few people. I'm heads down
on NG, but I think someone (probably Mingjie :)) was working on this.
On Oct 26, 2011, at 1:59 PM, Mingjie Lai <mjla...@gmail.com
<mailto:mjla...@gmail.com>> wrote:
>
> Quite a few people have mentioned on the list recently that the
> combination of RollSink + escapedCustomDfs causes issues. You may
> have seen logs like these:
>
> 2011-10-17 17:30:07,190 [logicalNode collector0_log_dir-19] INFO com.cloudera.flume.core.connector.DirectDriver - Connector logicalNode collector0_log_dir-19 exited with error: Blocked append interrupted by rotation event
> java.lang.InterruptedException: Blocked append interrupted by rotation event
>     at com.cloudera.flume.handlers.rolling.RollSink.append(RollSink.java:209)
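>
> Roughly what's happening (a sketch of the mechanism, not Flume's
> actual code): the roll trigger interrupts the driver thread while an
> append is blocked on the DFS write, and DirectDriver treats the
> resulting InterruptedException as fatal, so the logical node exits.
> A self-contained toy version of that interaction:
>
>   // Sketch only: one thread blocks in a stand-in for append(), and
>   // a second thread interrupts it, the way the roll trigger does.
>   public class InterruptDemo {
>     public static void main(String[] args) throws Exception {
>       final Object lock = new Object();
>       Thread driver = new Thread(new Runnable() {
>         public void run() {
>           synchronized (lock) {
>             try {
>               lock.wait(); // stand-in for an append blocked on HDFS
>             } catch (InterruptedException e) {
>               // DirectDriver treats this as a fatal error and shuts
>               // the node down -- the "exited with error" above.
>               System.out.println("append interrupted: node exits");
>             }
>           }
>         }
>       });
>       driver.start();
>       Thread.sleep(100);   // let the driver block first
>       driver.interrupt();  // what the roll trigger does on rotation
>       driver.join();
>     }
>   }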
>
>
> > 1500-2000 events per second
>
> It's not really a huge amount of data. Flume is expected to be able
> to handle it.
>
> Not sure anyone is looking at it. Sorry.
>
> Mingjie
>
> On 10/23/2011 09:07 AM, Eran Kutner wrote:
>> Hi,
>> I'm having a problem where Flume collectors occasionally stop
>> working under heavy load.
>> I'm writing something like 1500-2000 events per second to my
>> collectors, and occasionally they will just stop working. Nothing is
>> written to the log; the only indication that this is happening is
>> that I see 0 messages being delivered on the Flume stats web page,
>> and events start piling up in the agents. Restarting the service
>> solves the problem for a while (anything from a few minutes to a few
>> days).
>> An interesting thing to note is that this seems to be load related.
>> It used to happen a lot more, but then I split the collector into
>> three virtual nodes and balanced the traffic across them, and now it
>> happens a lot less. Also, while one virtual collector stops working,
>> the others on the same machine continue to work fine.
>>
>> My collector configuration looks like this:
>>
>> collectorSource(54001) | collector(600000) {
>>     escapedFormatDfs("hdfs://hadoop1-m1:8020/raw-events/%Y-%m-%d/",
>>         "events-%{rolltag}-f01-c1.snappy", seqfile("SnappyCodec")) };
>>
>> I'm using a 0.9.5 build from a few weeks ago.
>>
>> Any ideas what can be causing it?
>>
>> -eran
>>
--
Thanks
Cameron Gandevia