Hi Andy,
1. "startFromEnd=true" in your source configuration means data missing
can happen at restart in tail side because flume will ignore any data
event generated during restart and start at the end all the time.
2. With agentSink, data duplication can happen due to delayed acks from
the master or an agent restart.
I think that is why Flume NG no longer supports tail and instead leaves
it to the user to handle with a script or program; tailing is a tricky job.
My suggestion is to use agentBEChain in the agent tier and DFO in the
collector tier; you can still lose some data during failover when a
failure occurs.
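For example, something like this (an untested sketch; hadoop49 stands in
for a second, hypothetical collector, and the diskFailover decorator is
one way to get DFO behavior on the collector side):

    config [ag1, tail("H:/game.log", startFromEnd=true),
            agentBEChain("hadoop48:35853", "hadoop49:35853")]
    config [co1, collectorSource(35853),
            { diskFailover => collectorSink("hdfs://hadoop48:54310/user/flume/%y%m/%d", "%{host}-") }]

agentBEChain fails over to the next collector in the list when the
current one is down, and diskFailover buffers events to local disk while
the HDFS sink is unavailable.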
To minimize loss and duplication, implementing a checkpoint function in
your tailer (recording the offset of the last event sent, and resuming
from it after a restart) can also help.
Having a monitoring system that detects failures is very important as
well, so that you notice a failure and can react to recover quickly.
-JS
On 2/4/13 4:27 PM, 周梦想 wrote:
Hi JS,
We can't accept agentBESink. Because these logs are important for data
analysis, we can't tolerate any errors in the data; neither loss nor
duplication is acceptable.
One agent's configuration is:
tail("H:/game.log", startFromEnd=true) agentSink("hadoop48", 35853)
Every time this Windows agent restarts, it resends all the data to the
collector server.
If for some reason we restart the agent node, we have no marker
recording how far into the log the agent had already sent.
2013/1/29 Jeong-shik Jang <[email protected]>
Hi Andy,
Since you set the startFromEnd option to true, the resend is probably
caused by the DFO mechanism (agentDFOSink): when you restart a Flume
node in DFO mode, all events in intermediate stages (logged, writing,
sending, and so on) roll back to the logged stage, which means
resending and duplication.
Also, for better performance, you may want to use agentBESink
instead of agentDFOSink.
If you have multiple collectors, I recommend using agentBEChain for
failover in case of a failure in the collector tier.
-JS
On 1/29/13 4:29 PM, Alexander Alten-Lorenz wrote:
Hi,
You could use tail -F, but that depends on the external source,
which Flume has no control over. You can write your own script
and plug it in.
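For example (an untested sketch, assuming your Flume build has the
execStream source; the path is from your earlier mail):

    config [ag1, execStream("tail -F /home/zhouhh/game.log"),
            agentDFOSink("hadoop48", 35853)]

execStream runs the command and turns each line it prints into an
event, so tail -F's own file tracking replaces Flume's tail source.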
What's the content of the /tmp/flume/agent/agent*.*/ directories?
Are the sending and sent subdirectories clean?
- Alex
On Jan 29, 2013, at 8:24 AM, 周梦想 <[email protected]> wrote:
Hello,
1. I want to tail a log source and write it to HDFS. Below is the
configuration:
config [ag1, tail("/home/zhouhh/game.log", startFromEnd=true),
        agentDFOSink("hadoop48", 35853)]
config [ag2, tail("/home/zhouhh/game.log", startFromEnd=true),
        agentDFOSink("hadoop48", 35853)]
config [co1, collectorSource(35853),
        [collectorSink("hdfs://hadoop48:54310/user/flume/%y%m/%d", "%{host}-", 5000, raw),
         collectorSink("hdfs://hadoop48:54310/user/flume/%y%m", "%{host}-", 10000, raw)]]
I found that if I restart the agent node, it resends the whole content
of game.log to the collector. Is there a way to resume sending from
the point it had already reached? Or do I have to keep a marker myself,
or remove the logs manually, whenever I restart the agent node?
2. I tested the performance of Flume and found it a bit slow.
With the configuration above I get only 50 MB/minute, so I changed
the configuration to the following:
ag1: tail("/home/zhouhh/game.log", startFromEnd=true) | batch(1000) gzip agentDFOSink("hadoop48", 35853);
config [co1, collectorSource(35853),
        [collectorSink("hdfs://hadoop48:54310/user/flume/%y%m/%d", "%{host}-", 5000, raw),
         collectorSink("hdfs://hadoop48:54310/user/flume/%y%m", "%{host}-", 10000, raw)]]
I sent a 300 MB log and it took about 3 minutes, so that is roughly
100 MB/minute, while sending the same log from ag1 to co1 via scp
runs at about 30 MB/second. Can anyone give me any ideas?
thanks!
Andy
--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF
--
Jeong-shik Jang / [email protected] <mailto:[email protected]>
Gruter, Inc., R&D Team Leader
www.gruter.com <http://www.gruter.com>
Enjoy Connecting
--
Jeong-shik Jang / [email protected]
Gruter, Inc., R&D Team Leader
www.gruter.com
Enjoy Connecting