Yes, you can; the Flume plugin framework provides an easy way to
implement and plug in your own sources, decorators, and sinks.
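For example, a custom source does not need much code. Below is a
minimal sketch against the Flume 0.9 (OG) plugin API, following the
shape of the HelloWorld plugin example from the user guide; the exact
builder signature is from memory, so verify it against your Flume
version. You register the class through the flume.plugin.classes
property, after which the configuration language can refer to it as
myLogSource("/path/to/file").

import com.cloudera.flume.conf.Context;
import com.cloudera.flume.conf.SourceFactory.SourceBuilder;
import com.cloudera.flume.core.Event;
import com.cloudera.flume.core.EventImpl;
import com.cloudera.flume.core.EventSource;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch of a custom source: reads lines from a file and turns each
// line into a Flume event. Error handling and waiting for new data
// are left out to keep the example short.
public class MyLogSource extends EventSource.Base {
  private final String path;
  private BufferedReader reader;

  public MyLogSource(String path) {
    this.path = path;
  }

  @Override
  public void open() throws IOException {
    reader = new BufferedReader(new FileReader(path));
  }

  @Override
  public Event next() throws IOException {
    String line = reader.readLine();
    if (line == null) {
      return null; // a real source would block until more data arrives
    }
    return new EventImpl(line.getBytes());
  }

  @Override
  public void close() throws IOException {
    reader.close();
  }

  // Lets the config language call myLogSource("/path/to/file").
  public static SourceBuilder builder() {
    return new SourceBuilder() {
      @Override
      public EventSource build(Context ctx, String... argv) {
        if (argv.length != 1) {
          throw new IllegalArgumentException("usage: myLogSource(path)");
        }
        return new MyLogSource(argv[0]);
      }
    };
  }
}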
-JS

On 2/4/13 5:07 PM, 周梦想 wrote:
> Hi JS,
>
> Thank you for your reply. So tailing logs with Flume has serious
> shortcomings. Can I write my own agent that sends logs via the Thrift
> protocol directly to the collector server?
>
> Best Regards,
> Andy Zhou
>
>
> 2013/2/4 Jeong-shik Jang <[email protected]>
>
> Hi Andy,
>
> 1. "startFromEnd=true" in your source configuration means data can be
> missed across a restart on the tail side, because Flume ignores any
> events generated while the node is restarting and always starts again
> from the end of the file.
> 2. With agentSink, data duplication can happen because of ack delays
> from the master, or when the agent restarts.
>
> I think this is why Flume-NG no longer supports tail and instead lets
> the user handle it with a script or program; tailing is a tricky job.
>
> My suggestion is to use agentBEChain in the agent tier and DFO in the
> collector tier; you can still lose some data during failover when a
> failure occurs.
> To minimize loss and duplication, implementing a checkpoint function
> in tail can also help.
>
> Having a monitoring system that detects failures is just as important,
> so that you notice a failure and can recover quickly.
>
> -JS
>
>
> On 2/4/13 4:27 PM, 周梦想 wrote:
>> Hi JS,
>> We can't accept agentBESink, because these logs are important for
>> data analysis and we can't tolerate any errors in the data; neither
>> losing data nor duplicating it is acceptable.
>> One agent's configuration is:
>>
>> tail("H:/game.log", startFromEnd=true) agentSink("hadoop48", 35853)
>>
>> Every time this Windows agent restarts, it resends all the data to
>> the collector server.
>> If we restart the agent node for some reason, we have no mark of how
>> far into the log the agent has already sent.
>>
>>
>> 2013/1/29 Jeong-shik Jang <[email protected]>
>>
>> Hi Andy,
>>
>> As you set the startFromEnd option to true, the resend is probably
>> caused by the DFO mechanism (agentDFOSink): when you restart a Flume
>> node in DFO mode, all events in intermediate stages (logged, writing,
>> sending and so on) roll back to the logged stage, which means
>> resending and duplication.
>>
>> Also, for better performance, you may want to use agentBESink instead
>> of agentDFOSink.
>> If you have multiple collectors, I recommend agentBEChain for
>> failover in case a collector fails.
>>
>> -JS
>>
>>
>> On 1/29/13 4:29 PM, Alexander Alten-Lorenz wrote:
>>
>> Hi,
>>
>> You could use tail -F, but that depends on the external source, which
>> Flume has no control over. You can also write your own script and
>> plug it in.
>>
>> What is in the /tmp/flume/agent/agent*.*/ directories? Are "sent" and
>> "sending" clean?
>>
>> - Alex
>>
>> On Jan 29, 2013, at 8:24 AM, 周梦想 <[email protected]> wrote:
>>
>> hello,
>> 1. I want to tail a log source and write it to HDFS. Below is the
>> configuration:
>>
>> config [ag1, tail("/home/zhouhh/game.log", startFromEnd=true), agentDFOSink("hadoop48", 35853);]
>> config [ag2, tail("/home/zhouhh/game.log", startFromEnd=true), agentDFOSink("hadoop48", 35853);]
>> config [co1, collectorSource(35853), [collectorSink("hdfs://hadoop48:54310/user/flume/%y%m/%d", "%{host}-", 5000, raw), collectorSink("hdfs://hadoop48:54310/user/flume/%y%m", "%{host}-", 10000, raw)]]
>>
>> I found that if I restart the agent node, it resends the whole
>> content of game.log to the collector. Is there a way to send only the
>> logs that have not been sent yet, or do I have to mark the position
>> myself, or remove the logs manually, whenever I restart the agent
>> node?
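To make the checkpoint idea above concrete: the tail side can persist
the byte offset it has read up to, and seek back to that offset after
a restart. Below is a rough sketch in plain Java; it is a hypothetical
helper, not Flume code, and it leaves out file rotation and truncation
handling. Note that checkpointing after each send still allows
duplicates if the process dies between the send and the checkpoint
write, so this gives at-least-once rather than exactly-once delivery.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;

// Tails a file while persisting the read offset, so a restart resumes
// where the previous run stopped instead of replaying the whole file
// or jumping to the end.
public class CheckpointedTail {
  private final File log;
  private final File checkpoint;

  public CheckpointedTail(File log, File checkpoint) {
    this.log = log;
    this.checkpoint = checkpoint;
  }

  public void run() throws IOException, InterruptedException {
    try (RandomAccessFile raf = new RandomAccessFile(log, "r")) {
      raf.seek(loadOffset());
      while (true) {
        String line = raf.readLine();
        if (line == null) {
          Thread.sleep(1000); // wait for the application to append more
          continue;
        }
        send(line);                       // hand off to the agent/sink
        saveOffset(raf.getFilePointer()); // checkpoint after the send
      }
    }
  }

  private long loadOffset() throws IOException {
    if (!checkpoint.exists()) {
      return 0L;
    }
    return Long.parseLong(
        new String(Files.readAllBytes(checkpoint.toPath())).trim());
  }

  private void saveOffset(long offset) throws IOException {
    Files.write(checkpoint.toPath(), Long.toString(offset).getBytes());
  }

  protected void send(String line) {
    System.out.println(line); // placeholder for the real transport
  }
}

Only advancing the checkpoint after the collector acknowledges receipt
would remove duplicates as well, at the cost of tracking acks in the
tailer.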
>> 2. I tested the performance of Flume and found it a bit slow.
>> With the configuration above I get only about 50 MB/minute.
>> I changed the configuration to this:
>>
>> ag1: tail("/home/zhouhh/game.log", startFromEnd=true) | batch(1000) gzip agentDFOSink("hadoop48", 35853);
>>
>> config [co1, collectorSource(35853), [collectorSink("hdfs://hadoop48:54310/user/flume/%y%m/%d", "%{host}-", 5000, raw), collectorSink("hdfs://hadoop48:54310/user/flume/%y%m", "%{host}-", 10000, raw)]]
>>
>> I sent 300 MB of logs and it took about 3 minutes, so that is about
>> 100 MB/minute.
>>
>> Meanwhile, sending the same log from ag1 to co1 via scp runs at about
>> 30 MB/second.
>>
>> Can anyone give me some ideas?
>>
>> Thanks!
>>
>> Andy
>>
>> --
>> Alexander Alten-Lorenz
>> http://mapredit.blogspot.com
>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF

--
Jeong-shik Jang / [email protected]
Gruter, Inc., R&D Team Leader
www.gruter.com
Enjoy Connecting
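P.S. On the throughput question above: per-event overhead (RPC framing
and an ack for each small event) is usually what makes the naive
configuration slow, which is why batch(1000) and gzip roughly doubled
the rate. The sketch below shows the sender-side idea behind those
decorators; the Transport interface is made up for illustration and is
not a Flume API.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Buffers lines and ships them as one gzip-compressed payload per
// batch, so the per-event cost is paid once per 1000 events instead
// of once per event.
public class BatchingSender {
  // Stand-in for the real wire transport (e.g. a Thrift client).
  interface Transport {
    void write(byte[] payload) throws IOException;
  }

  private final Transport transport;
  private final int batchSize;
  private final List<String> buffer = new ArrayList<String>();

  public BatchingSender(Transport transport, int batchSize) {
    this.transport = transport;
    this.batchSize = batchSize;
  }

  public void append(String line) throws IOException {
    buffer.add(line);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  public void flush() throws IOException {
    if (buffer.isEmpty()) {
      return;
    }
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(bytes);
    for (String line : buffer) {
      gzip.write((line + "\n").getBytes());
    }
    gzip.close(); // write the gzip trailer before taking the bytes
    transport.write(bytes.toByteArray()); // one call per batch
    buffer.clear();
  }
}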
