Yes, you can; the Flume plugin framework provides an easy way to
implement and plug in your own sources, decorators, and sinks.
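For example, a custom source does not need much code. Below is a
minimal sketch against the Flume 0.9 (OG) plugin API, following the
shape of the HelloWorld plugin example from the user guide; the exact
builder signature is from memory, so verify it against your Flume
version. You register the class through the flume.plugin.classes
property, after which the configuration language can refer to it as
myLogSource("/path/to/file").

import com.cloudera.flume.conf.Context;
import com.cloudera.flume.conf.SourceFactory.SourceBuilder;
import com.cloudera.flume.core.Event;
import com.cloudera.flume.core.EventImpl;
import com.cloudera.flume.core.EventSource;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Sketch of a custom source: reads lines from a file and turns each
// line into a Flume event. Error handling and waiting for new data
// are left out to keep the example short.
public class MyLogSource extends EventSource.Base {
  private final String path;
  private BufferedReader reader;

  public MyLogSource(String path) {
    this.path = path;
  }

  @Override
  public void open() throws IOException {
    reader = new BufferedReader(new FileReader(path));
  }

  @Override
  public Event next() throws IOException {
    String line = reader.readLine();
    if (line == null) {
      return null; // a real source would block until more data arrives
    }
    return new EventImpl(line.getBytes());
  }

  @Override
  public void close() throws IOException {
    reader.close();
  }

  // Lets the config language call myLogSource("/path/to/file").
  public static SourceBuilder builder() {
    return new SourceBuilder() {
      @Override
      public EventSource build(Context ctx, String... argv) {
        if (argv.length != 1) {
          throw new IllegalArgumentException("usage: myLogSource(path)");
        }
        return new MyLogSource(argv[0]);
      }
    };
  }
}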
-JS

On 2/4/13 5:07 PM, 周梦想 wrote:
> Hi JS,
>
> Thank you for your reply. So tailing logs with Flume has serious
> shortcomings. Can I write my own agent that sends logs via the Thrift
> protocol directly to the collector server?
>
> Best Regards,
> Andy Zhou
>
>
> 2013/2/4 Jeong-shik Jang <[email protected]>
>
> Hi Andy,
>
> 1. "startFromEnd=true" in your source configuration means data can be
> missed across a restart on the tail side, because Flume ignores any
> events generated while the node is restarting and always starts again
> from the end of the file.
> 2. With agentSink, data duplication can happen because of ack delays
> from the master, or when the agent restarts.
>
> I think this is why Flume-NG no longer supports tail and instead lets
> the user handle it with a script or program; tailing is a tricky job.
>
> My suggestion is to use agentBEChain in the agent tier and DFO in the
> collector tier; you can still lose some data during failover when a
> failure occurs.
> To minimize loss and duplication, implementing a checkpoint function
> in tail can also help.
>
> Having a monitoring system that detects failures is just as important,
> so that you notice a failure and can recover quickly.
>
> -JS
>
>
> On 2/4/13 4:27 PM, 周梦想 wrote:
>> Hi JS,
>> We can't accept agentBESink, because these logs are important for
>> data analysis and we can't tolerate any errors in the data; neither
>> losing data nor duplicating it is acceptable.
>> One agent's configuration is:
>>
>> tail("H:/game.log", startFromEnd=true) agentSink("hadoop48", 35853)
>>
>> Every time this Windows agent restarts, it resends all the data to
>> the collector server.
>> If we restart the agent node for some reason, we have no mark of how
>> far into the log the agent has already sent.
>>
>>
>> 2013/1/29 Jeong-shik Jang <[email protected]>
>>
>> Hi Andy,
>>
>> As you set the startFromEnd option to true, the resend is probably
>> caused by the DFO mechanism (agentDFOSink): when you restart a Flume
>> node in DFO mode, all events in intermediate stages (logged, writing,
>> sending and so on) roll back to the logged stage, which means
>> resending and duplication.
>>
>> Also, for better performance, you may want to use agentBESink instead
>> of agentDFOSink.
>> If you have multiple collectors, I recommend agentBEChain for
>> failover in case a collector fails.
>>
>> -JS
>>
>>
>> On 1/29/13 4:29 PM, Alexander Alten-Lorenz wrote:
>>
>> Hi,
>>
>> You could use tail -F, but that depends on the external source, which
>> Flume has no control over. You can also write your own script and
>> plug it in.
>>
>> What is in the /tmp/flume/agent/agent*.*/ directories? Are "sent" and
>> "sending" clean?
>>
>> - Alex
>>
>> On Jan 29, 2013, at 8:24 AM, 周梦想 <[email protected]> wrote:
>>
>> hello,
>> 1. I want to tail a log source and write it to HDFS. Below is the
>> configuration:
>>
>> config [ag1, tail("/home/zhouhh/game.log", startFromEnd=true), agentDFOSink("hadoop48", 35853);]
>> config [ag2, tail("/home/zhouhh/game.log", startFromEnd=true), agentDFOSink("hadoop48", 35853);]
>> config [co1, collectorSource(35853), [collectorSink("hdfs://hadoop48:54310/user/flume/%y%m/%d", "%{host}-", 5000, raw), collectorSink("hdfs://hadoop48:54310/user/flume/%y%m", "%{host}-", 10000, raw)]]
>>
>> I found that if I restart the agent node, it resends the whole
>> content of game.log to the collector. Is there a way to send only the
>> logs that have not been sent yet, or do I have to mark the position
>> myself, or remove the logs manually, whenever I restart the agent
>> node?
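To make the checkpoint idea above concrete: the tail side can persist
the byte offset it has read up to, and seek back to that offset after
a restart. Below is a rough sketch in plain Java; it is a hypothetical
helper, not Flume code, and it leaves out file rotation and truncation
handling. Note that checkpointing after each send still allows
duplicates if the process dies between the send and the checkpoint
write, so this gives at-least-once rather than exactly-once delivery.

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;

// Tails a file while persisting the read offset, so a restart resumes
// where the previous run stopped instead of replaying the whole file
// or jumping to the end.
public class CheckpointedTail {
  private final File log;
  private final File checkpoint;

  public CheckpointedTail(File log, File checkpoint) {
    this.log = log;
    this.checkpoint = checkpoint;
  }

  public void run() throws IOException, InterruptedException {
    try (RandomAccessFile raf = new RandomAccessFile(log, "r")) {
      raf.seek(loadOffset());
      while (true) {
        String line = raf.readLine();
        if (line == null) {
          Thread.sleep(1000); // wait for the application to append more
          continue;
        }
        send(line);                       // hand off to the agent/sink
        saveOffset(raf.getFilePointer()); // checkpoint after the send
      }
    }
  }

  private long loadOffset() throws IOException {
    if (!checkpoint.exists()) {
      return 0L;
    }
    return Long.parseLong(
        new String(Files.readAllBytes(checkpoint.toPath())).trim());
  }

  private void saveOffset(long offset) throws IOException {
    Files.write(checkpoint.toPath(), Long.toString(offset).getBytes());
  }

  protected void send(String line) {
    System.out.println(line); // placeholder for the real transport
  }
}

Only advancing the checkpoint after the collector acknowledges receipt
would remove duplicates as well, at the cost of tracking acks in the
tailer.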
>> 2. I tested the performance of Flume and found it a bit slow.
>> With the configuration above I get only about 50 MB/minute.
>> I changed the configuration to this:
>>
>> ag1: tail("/home/zhouhh/game.log", startFromEnd=true) | batch(1000) gzip agentDFOSink("hadoop48", 35853);
>>
>> config [co1, collectorSource(35853), [collectorSink("hdfs://hadoop48:54310/user/flume/%y%m/%d", "%{host}-", 5000, raw), collectorSink("hdfs://hadoop48:54310/user/flume/%y%m", "%{host}-", 10000, raw)]]
>>
>> I sent 300 MB of logs and it took about 3 minutes, so that is about
>> 100 MB/minute.
>>
>> Meanwhile, sending the same log from ag1 to co1 via scp runs at about
>> 30 MB/second.
>>
>> Can anyone give me some ideas?
>>
>> Thanks!
>>
>> Andy
>>
>> --
>> Alexander Alten-Lorenz
>> http://mapredit.blogspot.com
>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF

--
Jeong-shik Jang / [email protected]
Gruter, Inc., R&D Team Leader
www.gruter.com
Enjoy Connecting
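P.S. On the throughput question above: per-event overhead (RPC framing
and an ack for each small event) is usually what makes the naive
configuration slow, which is why batch(1000) and gzip roughly doubled
the rate. The sketch below shows the sender-side idea behind those
decorators; the Transport interface is made up for illustration and is
not a Flume API.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

// Buffers lines and ships them as one gzip-compressed payload per
// batch, so the per-event cost is paid once per 1000 events instead
// of once per event.
public class BatchingSender {
  // Stand-in for the real wire transport (e.g. a Thrift client).
  interface Transport {
    void write(byte[] payload) throws IOException;
  }

  private final Transport transport;
  private final int batchSize;
  private final List<String> buffer = new ArrayList<String>();

  public BatchingSender(Transport transport, int batchSize) {
    this.transport = transport;
    this.batchSize = batchSize;
  }

  public void append(String line) throws IOException {
    buffer.add(line);
    if (buffer.size() >= batchSize) {
      flush();
    }
  }

  public void flush() throws IOException {
    if (buffer.isEmpty()) {
      return;
    }
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    GZIPOutputStream gzip = new GZIPOutputStream(bytes);
    for (String line : buffer) {
      gzip.write((line + "\n").getBytes());
    }
    gzip.close(); // write the gzip trailer before taking the bytes
    transport.write(bytes.toByteArray()); // one call per batch
    buffer.clear();
  }
}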
