Thank you guys for the responses. I was actually able to get around this problem by tinkering with my settings. I finally ended up with a capacity of 10000 and commented out transactionCapacity (I originally set it to 10), and it started working. Thanks for the insight. It took me a bit of time to figure out the inner workings of Avro to get it to send data in the correct format, so I got over that hump :-). Here is my flow for the POC:
Host A agent --> exec tail source --> JDBC channel --> Avro client sink
  (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>> --conf ../conf/)
Host B agent --> exec tail source --> JDBC channel --> Avro client sink
  (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>> --conf ../conf/)
Host C agent --> Avro collector source --> memory channel --> file sink to a local directory

The issue I am running into is that I am unable to uniquely identify the source of the log at the sink (the log events from Host A and Host B end up combined and mixed together in the same log file on disk). Is there a way to attach a unique identifier at the source so that we can track the origin of each log event? I am hoping to see something like this in my sink log:

Host A -- some log entry
Host B -- some log entry
etc.

Is this feasible, or are there alternative mechanisms to achieve this? (One possible approach is sketched just after this message.) I am putting together a newbie guide that might help answer some of these questions for others as I explore this architecture.

As always, thanks for your assistance,
Bhaskar
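One possible way to address the host-tagging question above -- a minimal sketch only, assuming the host interceptor described in the Flume 1.x user guide is available in this build, and using an illustrative header name "origin" -- is to add an interceptor to the tail source on each sending agent (agent3 in the configuration quoted below):

# Stamp every event from this source with the sending agent's hostname
agent3.sources.tail.interceptors = hostint
agent3.sources.tail.interceptors.hostint.type = host
agent3.sources.tail.interceptors.hostint.hostHeader = origin
agent3.sources.tail.interceptors.hostint.useIP = false

Note that this puts the hostname into an event header, and the FILE_ROLL sink writes only the event body by default, so surfacing the header in the rolled file may still need a different serializer or some downstream processing that reads the header.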
On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly <juhani_conno...@cyberagent.co.jp> wrote:

> Hello Bhaskar,
>
> Using Avro is generally the recommended method to handle multi-hop flows, so no concerns there.
>
> Have you tried this setup using memory channels instead of jdbc? Last time I tested it, the JDBC channel had poor throughput, so you may be getting a logjam somewhere. How much data is getting entered into your logfile? Try raising the capacity on your jdbc channel by a lot (10000?). With a capacity of 10, if the reading side (Host B) isn't polling frequently enough, there are going to be problems. This is probably why you get the "failed to persist event". As far as FLUME-1259 is concerned, that should only be happening if bad data is being sent. You're not sending anything else to the same port, are you? Make sure that only the source and sink are set to that port and that nothing else is.
>
> If the problem continues, please post a chunk of the logs leading up to the OOM error (the full trace for the cause should be enough).
>
> On 06/16/2012 12:01 AM, Bhaskar wrote:
>
> Sorry to be a pest with a stream of questions. I think I am going two steps forward and four steps back :-). After my first successful attempt, I tried running Flume with the following flow:
>
> 1. HostA
>    -- Source is tail of the web server log
>    -- channel is jdbc
>    -- sink is Avro collection on Host B
>
> Configuration:
> agent3.sources = tail
> agent3.sinks = avro-forward-sink
> agent3.channels = jdbc-channel
>
> # Define source flow
> agent3.sources.tail.type = exec
> agent3.sources.tail.command = tail -f /common/log/access.log
> agent3.sources.tail.channels = jdbc-channel
>
> # define the flow
> agent3.sinks.avro-forward-sink.channel = jdbc-channel
>
> # avro sink properties
> agent3.sinks.avro-forward-sink.type = avro
> agent3.sinks.avro-forward-sink.hostname = <<IP Address>>
> agent3.sinks.avro-forward-sink.port = <<PORT>>
>
> # Define channels
> agent3.channels.jdbc-channel.type = jdbc
> agent3.channels.jdbc-channel.maximum.capacity = 10
> agent3.channels.jdbc-channel.maximum.connections = 2
>
> 2. HostB
>    -- Source is Avro collection
>    -- channel is memory
>    -- sink is the local file system
>
> Configuration:
> # list sources, sinks and channels in the agent4
> agent4.sources = avro-collection-source
> agent4.sinks = svc_0_sink
> agent4.channels = MemoryChannel-2
>
> # define the flow
> agent4.sources.avro-collection-source.channels = MemoryChannel-2
> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>
> # avro source properties
> agent4.sources.avro-collection-source.type = avro
> agent4.sources.avro-collection-source.bind = <<IP Address>>
> agent4.sources.avro-collection-source.port = <<PORT>>
>
> agent4.sinks.svc_0_sink.type = FILE_ROLL
> agent4.sinks.svc_0_sink.sink.directory = /logs/agent4
> agent4.sinks.svc_0_sink.rollInterval = 600
> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>
> agent4.channels.MemoryChannel-2.type = memory
> agent4.channels.MemoryChannel-2.capacity = 100
> agent4.channels.MemoryChannel-2.transactionCapacity = 10
>
> Basically I am trying to tail a file on one host and stream it to another host running the sink. During the trial run, the configuration loads fine and I can see the channels being created. I first see an exception from the jdbc channel ("Failed to persist event"), and then I get a Java heap space OOM exception on Host B when Host A attempts to write:
>
> 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception from downstream.
> java.lang.OutOfMemoryError: Java heap space
>
> This issue was already reported in https://issues.apache.org/jira/browse/FLUME-1259, but I am not sure whether there is a workaround. I have a couple of questions:
>
> 1. Am I force-fitting the wrong solution here by using Avro?
> 2. If so, what would be the right solution for streaming data from Host A to Host B (or through intermediaries)?
>
> Thanks,
> Bhaskar
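The fix Bhaskar describes in the latest message at the top of the thread is essentially what Juhani suggests here: give the channels far more headroom than 10 events. As a rough sketch only (the values are illustrative; what is actually needed depends on the log volume and how quickly the downstream side drains the channel):

# JDBC channel on Host A -- raise the capacity well above 10
agent3.channels.jdbc-channel.maximum.capacity = 10000

# Memory channel on Host B -- larger capacity, transactionCapacity left at its default
agent4.channels.MemoryChannel-2.capacity = 10000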
> On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>
>> Since you are thinking of using a multi-hop flow, I would suggest going with the "JDBC Channel", as there is a higher chance of error than in a single-hop flow, and in the JDBC channel events are stored in persistent storage backed by a database. For detailed guidelines you can refer to the Flume 1.x User Guide at:
>>
>> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
>>
>> Regards,
>> Mohammad Tariq
>>
>> On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bmar...@gmail.com> wrote:
>> > Hi Mohammad,
>> > Thanks for the pointer there. Do you think a message queue (like RabbitMQ) would be a good choice of communication channel between each hop? I am struggling to get a handle on how I need to configure my sink in the intermediary hops of a multi-hop flow. I appreciate any guidance/examples.
>> >
>> > Thanks,
>> > Bhaskar
>> >
>> > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>> >>
>> >> Hello Bhaskar,
>> >>
>> >> That's great. The best approach to streaming logs depends on the type of source you want to watch. Looking at your use case, I would suggest going with "multi-hop" flows, where events travel through multiple agents before reaching the final destination.
>> >>
>> >> Regards,
>> >> Mohammad Tariq
>> >>
>> >> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bmar...@gmail.com> wrote:
>> >> > I know what I am missing :-) I missed connecting the sink with the channel. My small POC works now and I am able to view the streamed logs. Thank you all for the guidance and patience in answering all my questions. So, what is the best approach to streaming logs from other hosts? Basically, my next task would be to set up a collector (of sorts) model to stream logs to an intermediary and then stream from the collector to a sink location. I'd appreciate any thoughts/guidance in this regard.
>> >> >
>> >> > Bhaskar
>> >> >
>> >> > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bmar...@gmail.com> wrote:
>> >> >>
>> >> >> For testing purposes, I tried the following configuration without much luck. I see that the process starts fine, but it just does not write anything to the sink. I guess I am missing something here. Can one of you gurus take a look and suggest what I am doing wrong?
>> >> >>
>> >> >> Thanks,
>> >> >> Bhaskar
>> >> >>
>> >> >> agent1.sources = tail
>> >> >> agent1.channels = MemoryChannel-2
>> >> >> agent1.sinks = svc_0_sink
>> >> >>
>> >> >> agent1.sources.tail.type = exec
>> >> >> agent1.sources.tail.command = tail -f /var/log/access.log
>> >> >> agent1.sources.tail.channels = MemoryChannel-2
>> >> >>
>> >> >> agent1.sinks.svc_0_sink.type = FILE_ROLL
>> >> >> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
>> >> >> agent1.sinks.svc_0_sink.rollInterval=0
>> >> >>
>> >> >> agent1.channels.MemoryChannel-2.type = memory
>> >> >>
>> >> >> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert <gpola...@cyres.fr> wrote:
>> >> >>>
>> >> >>> Hi Bhaskar,
>> >> >>>
>> >> >>> This is the flume.conf (http://pastebin.com/WULgUuaf) that I'm using. I have an Avro server on the hadoop-m host and one agent per node (the slave hosts). Each agent sends the output of an exec command to the Avro server.
>> >> >>>
>> >> >>> Host1 : exec -> memory -> avro (sink)
>> >> >>> Host2 : exec -> memory -> avro   >>>>>   MainHost : avro (source) -> memory -> rolling file (local FS)
>> >> >>> Host3 : exec -> memory -> avro
>> >> >>> ...
>> >> >>>
>> >> >>> Use your own exec command to read the Apache log.
>> >> >>>
>> >> >>> Guillaume Polaert | Cyrès Conseil
>> >> >>>
>> >> >>> From: Bhaskar [mailto:bmar...@gmail.com]
>> >> >>> Sent: Wednesday, June 13, 2012 19:16
>> >> >>> To: flume-user@incubator.apache.org
>> >> >>> Subject: Newbee question about flume 1.2 set up
>> >> >>>
>> >> >>> Good afternoon,
>> >> >>> I am a newbie to Flume and have read through the limited documentation available. I would like to set up the following to test things out:
>> >> >>>
>> >> >>> 1. Read Apache access logs (as the source)
>> >> >>> 2. Use a memory channel
>> >> >>> 3. Write to an NFS (or even local) file system
>> >> >>>
>> >> >>> Can someone help me with the necessary configuration? I am having a difficult time gleaning that information from the available documentation. I am sure someone has done such a test before, and I'd appreciate it if you could pass on that information. Secondly, I would also like to stream the logs to a remote server. Is that a log4j configuration, or do I need to run an agent on each host to do so? Any configuration examples would be of great help.
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Bhaskar
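For anyone arriving at this thread with the same starting question: a minimal single-agent sketch of the original request (tail the Apache access log, buffer in a memory channel, roll files onto a local or NFS-mounted directory), assembled from the configurations discussed above. Paths, names, and the capacity value are illustrative; note the sink-to-channel binding that was missing from the first attempt.

agent1.sources = tail
agent1.channels = MemoryChannel-2
agent1.sinks = svc_0_sink

# Source: tail the Apache access log
agent1.sources.tail.type = exec
agent1.sources.tail.command = tail -f /var/log/access.log
agent1.sources.tail.channels = MemoryChannel-2

# Channel: in-memory buffer between the source and the sink
agent1.channels.MemoryChannel-2.type = memory
agent1.channels.MemoryChannel-2.capacity = 10000

# Sink: roll a new file in the target directory every 600 seconds
agent1.sinks.svc_0_sink.type = FILE_ROLL
agent1.sinks.svc_0_sink.sink.directory = /flume_runtime/logs
agent1.sinks.svc_0_sink.rollInterval = 600
# The binding that was missing in the earlier attempt:
agent1.sinks.svc_0_sink.channel = MemoryChannel-2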