Cameron,
FLUME-808 addresses a different race condition; I'm not sure it helps
fix the RollSink issue. However, I'm glad you tried it.
Can you post your RollSink fix to FLUME-798 (this one, right?)?
Thanks,
Mingjie
On 10/27/2011 12:15 PM, Cameron Gandevia wrote:
Hey
We were having problems with our collectors dying (we always had errors
in the logs). We recently applied the patch
https://issues.apache.org/jira/browse/Flume-808 and modified the
RollSink TriggerThread so that it does not interrupt the append job when
acquiring its lock. Our collectors have now been up for a few days
without problems.
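For anyone hitting the same thing, here is a minimal sketch of the idea
(class and member names are ours, not Flume's actual source): the
trigger thread queues up on the same lock that append() holds, instead
of interrupting a blocked append.

import java.util.concurrent.locks.ReentrantLock;

// Sketch only -- illustrates the non-interrupting rotation described
// above. RollSinkSketch is a made-up name, not Flume's real class.
public class RollSinkSketch {
  private final ReentrantLock lock = new ReentrantLock();

  // Called by the driver thread for every incoming event.
  public void append(String event) {
    lock.lock();
    try {
      // ... write the event to the currently open file ...
    } finally {
      lock.unlock();
    }
  }

  // Called periodically by the trigger thread. Rather than interrupting
  // a blocked append(), it waits its turn on the same lock, so an
  // in-flight append always completes before the file is rolled.
  public void rotate() {
    lock.lock();
    try {
      // ... close the current file and open the next one ...
    } finally {
      lock.unlock();
    }
  }
}

The trade-off is that a roll can be delayed by a slow append, which we
found preferable to the whole connector exiting mid-write.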
On Thu, Oct 27, 2011 at 8:10 AM, Eran Kutner <e...@gigya.com
<mailto:e...@gigya.com>> wrote:
Just grepped a few days of logs and I don't see this error. It seems
to be correlated with higher load on the HDFS servers (like when
map/reduce jobs are running).
When it happens, the agents fail to connect to the collectors, but I
don't see any errors in the collector logs. They just hang, while
other virtual collectors on the same server continue to work.
-eran
On Thu, Oct 27, 2011 at 06:39, Eric Sammer <esam...@cloudera.com
<mailto:esam...@cloudera.com>> wrote:
It's almost certainly the issue Mingjie mentioned. There's a race
condition in the rolling that has plagued a few people. I'm heads down
on NG, but I think someone (probably Mingjie :)) was working on this.
On Oct 26, 2011, at 1:59 PM, Mingjie Lai <mjla...@gmail.com
<mailto:mjla...@gmail.com>> wrote:
>
> Quite a few people have mentioned on the list recently that the
> combination of RollSink + escapedCustomDfs causes issues. You may
> have seen logs like these:
>
> 2011-10-17 17:30:07,190 [logicalNode collector0_log_dir-19] INFO com.cloudera.flume.core.connector.DirectDriver - Connector logicalNode collector0_log_dir-19 exited with error: Blocked append interrupted by rotation event
> java.lang.InterruptedException: Blocked append interrupted by rotation event
>     at com.cloudera.flume.handlers.rolling.RollSink.append(RollSink.java:209)
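>
> Roughly what's happening (a sketch of the mechanism, not Flume's
> actual code): the roll trigger interrupts the driver thread while an
> append is blocked on the DFS write, and DirectDriver treats the
> resulting InterruptedException as fatal, so the logical node exits.
> A self-contained toy version of that interaction:
>
>   // Sketch only: one thread blocks in a stand-in for append(), and
>   // a second thread interrupts it, the way the roll trigger does.
>   public class InterruptDemo {
>     public static void main(String[] args) throws Exception {
>       final Object lock = new Object();
>       Thread driver = new Thread(new Runnable() {
>         public void run() {
>           synchronized (lock) {
>             try {
>               lock.wait(); // stand-in for an append blocked on HDFS
>             } catch (InterruptedException e) {
>               // DirectDriver treats this as a fatal error and shuts
>               // the node down -- the "exited with error" above.
>               System.out.println("append interrupted: node exits");
>             }
>           }
>         }
>       });
>       driver.start();
>       Thread.sleep(100);   // let the driver block first
>       driver.interrupt();  // what the roll trigger does on rotation
>       driver.join();
>     }
>   }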
>
>
> > 1500-2000 events per second
>
> It's not really a huge amount of data. Flume is expected to be able
> to handle it.
>
> Not sure anyone is looking at it. Sorry.
>
> Mingjie
>
> On 10/23/2011 09:07 AM, Eran Kutner wrote:
>> Hi,
>> I'm having a problem where Flume collectors occasionally stop
>> working under heavy load.
>> I'm writing something like 1500-2000 events per second to my
>> collectors, and occasionally they will just stop working. Nothing is
>> written to the log; the only indication that this is happening is
>> that I see 0 messages being delivered on the Flume stats web page,
>> and events start piling up in the agents. Restarting the service
>> solves the problem for a while (anything from a few minutes to a few
>> days).
>> An interesting thing to note is that this seems to be load related.
>> It used to happen a lot more, but then I split the collector into
>> three virtual nodes and balanced the traffic across them, and now it
>> happens a lot less. Also, while one virtual collector stops working,
>> the others on the same machine continue to work fine.
>>
>> My collector configuration looks like this:
>>
>> collectorSource(54001) | collector(600000) {
>>     escapedFormatDfs("hdfs://hadoop1-m1:8020/raw-events/%Y-%m-%d/",
>>         "events-%{rolltag}-f01-c1.snappy", seqfile("SnappyCodec")) };
>>
>> I'm using a 0.9.5 build from a few weeks ago.
>>
>> Any ideas what can be causing it?
>>
>> -eran
>>
--
Thanks
Cameron Gandevia