Thank you guys for the responses. I was actually able to get around this problem by tinkering with my settings. I finally ended up with a capacity of 10000 and commented out transactionCapacity (I originally set it to 10), and it started working. Thanks for the insight. It took me a bit of time to figure out the inner workings of Avro to get it to send data in the correct format, so I got over that hump :-). Here is my flow for the POC:
Host A agent --> exec tail source --> JDBC channel --> Avro client sink
  (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>> --conf ../conf/)
Host B agent --> exec tail source --> JDBC channel --> Avro client sink
  (flume-ng avro-client -H <<Host>> -p <<port>> -F <<file to read>> --conf ../conf/)
Host C agent --> Avro collector source --> memory channel --> file sink to a local directory

The issue I am running into is that I am unable to uniquely identify the source of the log at the sink (the log events from Host A and Host B end up combined and mixed together in the same log file on disk). Is there a way to attach a unique identifier at the source so that we can track the origin of each log event? I am hoping to see something like this in my sink log:

Host A -- some log entry
Host B -- some log entry
etc.

Is this feasible, or are there alternative mechanisms to achieve this? (One possible approach is sketched just after this message.) I am putting together a newbie guide that might help answer some of these questions for others as I explore this architecture.

As always, thanks for your assistance,
Bhaskar
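One possible way to address the host-tagging question above -- a minimal sketch only, assuming the host interceptor described in the Flume 1.x user guide is available in this build, and using an illustrative header name "origin" -- is to add an interceptor to the tail source on each sending agent (agent3 in the configuration quoted below):

# Stamp every event from this source with the sending agent's hostname
agent3.sources.tail.interceptors = hostint
agent3.sources.tail.interceptors.hostint.type = host
agent3.sources.tail.interceptors.hostint.hostHeader = origin
agent3.sources.tail.interceptors.hostint.useIP = false

Note that this puts the hostname into an event header, and the FILE_ROLL sink writes only the event body by default, so surfacing the header in the rolled file may still need a different serializer or some downstream processing that reads the header.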
On Tue, Jun 19, 2012 at 2:59 AM, Juhani Connolly <juhani_conno...@cyberagent.co.jp> wrote:

> Hello Bhaskar,
>
> Using Avro is generally the recommended method to handle multi-hop flows, so no concerns there.
>
> Have you tried this setup using memory channels instead of jdbc? Last time I tested it, the JDBC channel had poor throughput, so you may be getting a logjam somewhere. How much data is getting entered into your logfile? Try raising the capacity on your jdbc channel by a lot (10000?). With a capacity of 10, if the reading side (Host B) isn't polling frequently enough, there are going to be problems. This is probably why you get the "failed to persist event". As far as FLUME-1259 is concerned, that should only be happening if bad data is being sent. You're not sending anything else to the same port, are you? Make sure that only the source and sink are set to that port and that nothing else is.
>
> If the problem continues, please post a chunk of the logs leading up to the OOM error (the full trace for the cause should be enough).
>
> On 06/16/2012 12:01 AM, Bhaskar wrote:
>
> Sorry to be a pest with a stream of questions. I think I am going two steps forward and four steps back :-). After my first successful attempt, I tried running Flume with the following flow:
>
> 1. HostA
>    -- Source is tail of the web server log
>    -- channel is jdbc
>    -- sink is Avro collection on Host B
>
> Configuration:
> agent3.sources = tail
> agent3.sinks = avro-forward-sink
> agent3.channels = jdbc-channel
>
> # Define source flow
> agent3.sources.tail.type = exec
> agent3.sources.tail.command = tail -f /common/log/access.log
> agent3.sources.tail.channels = jdbc-channel
>
> # define the flow
> agent3.sinks.avro-forward-sink.channel = jdbc-channel
>
> # avro sink properties
> agent3.sinks.avro-forward-sink.type = avro
> agent3.sinks.avro-forward-sink.hostname = <<IP Address>>
> agent3.sinks.avro-forward-sink.port = <<PORT>>
>
> # Define channels
> agent3.channels.jdbc-channel.type = jdbc
> agent3.channels.jdbc-channel.maximum.capacity = 10
> agent3.channels.jdbc-channel.maximum.connections = 2
>
> 2. HostB
>    -- Source is Avro collection
>    -- channel is memory
>    -- sink is the local file system
>
> Configuration:
> # list sources, sinks and channels in the agent4
> agent4.sources = avro-collection-source
> agent4.sinks = svc_0_sink
> agent4.channels = MemoryChannel-2
>
> # define the flow
> agent4.sources.avro-collection-source.channels = MemoryChannel-2
> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>
> # avro source properties
> agent4.sources.avro-collection-source.type = avro
> agent4.sources.avro-collection-source.bind = <<IP Address>>
> agent4.sources.avro-collection-source.port = <<PORT>>
>
> agent4.sinks.svc_0_sink.type = FILE_ROLL
> agent4.sinks.svc_0_sink.sink.directory = /logs/agent4
> agent4.sinks.svc_0_sink.rollInterval = 600
> agent4.sinks.svc_0_sink.channel = MemoryChannel-2
>
> agent4.channels.MemoryChannel-2.type = memory
> agent4.channels.MemoryChannel-2.capacity = 100
> agent4.channels.MemoryChannel-2.transactionCapacity = 10
>
> Basically I am trying to tail a file on one host and stream it to another host running the sink. During the trial run, the configuration loads fine and I can see the channels being created. I first see an exception from the jdbc channel ("Failed to persist event"), and then I get a Java heap space OOM exception on Host B when Host A attempts to write:
>
> 2012-06-15 10:31:44,503 WARN ipc.NettyServer: Unexpected exception from downstream.
> java.lang.OutOfMemoryError: Java heap space
>
> This issue was already reported in https://issues.apache.org/jira/browse/FLUME-1259, but I am not sure whether there is a workaround. I have a couple of questions:
>
> 1. Am I force-fitting the wrong solution here by using Avro?
> 2. If so, what would be the right solution for streaming data from Host A to Host B (or through intermediaries)?
>
> Thanks,
> Bhaskar
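The fix Bhaskar describes in the latest message at the top of the thread is essentially what Juhani suggests here: give the channels far more headroom than 10 events. As a rough sketch only (the values are illustrative; what is actually needed depends on the log volume and how quickly the downstream side drains the channel):

# JDBC channel on Host A -- raise the capacity well above 10
agent3.channels.jdbc-channel.maximum.capacity = 10000

# Memory channel on Host B -- larger capacity, transactionCapacity left at its default
agent4.channels.MemoryChannel-2.capacity = 10000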
> On Thu, Jun 14, 2012 at 4:31 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>
>> Since you are thinking of using a multi-hop flow, I would suggest going with the "JDBC Channel", as there is a higher chance of error than in a single-hop flow, and in the JDBC channel events are stored in persistent storage backed by a database. For detailed guidelines you can refer to the Flume 1.x User Guide at:
>>
>> https://people.apache.org/~mpercy/flume/flume-1.2.0-incubating-SNAPSHOT/docs/FlumeUserGuide.html
>>
>> Regards,
>> Mohammad Tariq
>>
>> On Fri, Jun 15, 2012 at 12:46 AM, Bhaskar <bmar...@gmail.com> wrote:
>> > Hi Mohammad,
>> > Thanks for the pointer there. Do you think a message queue (like RabbitMQ) would be a good choice of communication channel between each hop? I am struggling to get a handle on how I need to configure my sink in the intermediary hops of a multi-hop flow. I appreciate any guidance/examples.
>> >
>> > Thanks,
>> > Bhaskar
>> >
>> > On Thu, Jun 14, 2012 at 1:57 PM, Mohammad Tariq <donta...@gmail.com> wrote:
>> >>
>> >> Hello Bhaskar,
>> >>
>> >> That's great. The best approach to streaming logs depends on the type of source you want to watch. Looking at your use case, I would suggest going with "multi-hop" flows, where events travel through multiple agents before reaching the final destination.
>> >>
>> >> Regards,
>> >> Mohammad Tariq
>> >>
>> >> On Thu, Jun 14, 2012 at 10:48 PM, Bhaskar <bmar...@gmail.com> wrote:
>> >> > I know what I am missing :-) I missed connecting the sink with the channel. My small POC works now and I am able to view the streamed logs. Thank you all for the guidance and patience in answering all my questions. So, what is the best approach to streaming logs from other hosts? Basically, my next task would be to set up a collector (of sorts) model to stream logs to an intermediary and then stream from the collector to a sink location. I'd appreciate any thoughts/guidance in this regard.
>> >> >
>> >> > Bhaskar
>> >> >
>> >> > On Thu, Jun 14, 2012 at 12:52 PM, Bhaskar <bmar...@gmail.com> wrote:
>> >> >>
>> >> >> For testing purposes, I tried the following configuration without much luck. I see that the process starts fine, but it just does not write anything to the sink. I guess I am missing something here. Can one of you gurus take a look and suggest what I am doing wrong?
>> >> >>
>> >> >> Thanks,
>> >> >> Bhaskar
>> >> >>
>> >> >> agent1.sources = tail
>> >> >> agent1.channels = MemoryChannel-2
>> >> >> agent1.sinks = svc_0_sink
>> >> >>
>> >> >> agent1.sources.tail.type = exec
>> >> >> agent1.sources.tail.command = tail -f /var/log/access.log
>> >> >> agent1.sources.tail.channels = MemoryChannel-2
>> >> >>
>> >> >> agent1.sinks.svc_0_sink.type = FILE_ROLL
>> >> >> agent1.sinks.svc_0_sink.sink.directory=/flume_runtime/logs
>> >> >> agent1.sinks.svc_0_sink.rollInterval=0
>> >> >>
>> >> >> agent1.channels.MemoryChannel-2.type = memory
>> >> >>
>> >> >> On Thu, Jun 14, 2012 at 4:26 AM, Guillaume Polaert <gpola...@cyres.fr> wrote:
>> >> >>>
>> >> >>> Hi Bhaskar,
>> >> >>>
>> >> >>> This is the flume.conf (http://pastebin.com/WULgUuaf) that I'm using. I have an Avro server on the hadoop-m host and one agent per node (the slave hosts). Each agent sends the output of an exec command to the Avro server.
>> >> >>>
>> >> >>> Host1 : exec -> memory -> avro (sink)
>> >> >>> Host2 : exec -> memory -> avro   >>>>>   MainHost : avro (source) -> memory -> rolling file (local FS)
>> >> >>> Host3 : exec -> memory -> avro
>> >> >>> ...
>> >> >>>
>> >> >>> Use your own exec command to read the Apache log.
>> >> >>>
>> >> >>> Guillaume Polaert | Cyrès Conseil
>> >> >>>
>> >> >>> From: Bhaskar [mailto:bmar...@gmail.com]
>> >> >>> Sent: Wednesday, June 13, 2012 19:16
>> >> >>> To: flume-user@incubator.apache.org
>> >> >>> Subject: Newbee question about flume 1.2 set up
>> >> >>>
>> >> >>> Good afternoon,
>> >> >>> I am a newbie to Flume and have read through the limited documentation available. I would like to set up the following to test things out:
>> >> >>>
>> >> >>> 1. Read Apache access logs (as the source)
>> >> >>> 2. Use a memory channel
>> >> >>> 3. Write to an NFS (or even local) file system
>> >> >>>
>> >> >>> Can someone help me with the necessary configuration? I am having a difficult time gleaning that information from the available documentation. I am sure someone has done such a test before, and I'd appreciate it if you could pass on that information. Secondly, I would also like to stream the logs to a remote server. Is that a log4j configuration, or do I need to run an agent on each host to do so? Any configuration examples would be of great help.
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Bhaskar
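For anyone arriving at this thread with the same starting question: a minimal single-agent sketch of the original request (tail the Apache access log, buffer in a memory channel, roll files onto a local or NFS-mounted directory), assembled from the configurations discussed above. Paths, names, and the capacity value are illustrative; note the sink-to-channel binding that was missing from the first attempt.

agent1.sources = tail
agent1.channels = MemoryChannel-2
agent1.sinks = svc_0_sink

# Source: tail the Apache access log
agent1.sources.tail.type = exec
agent1.sources.tail.command = tail -f /var/log/access.log
agent1.sources.tail.channels = MemoryChannel-2

# Channel: in-memory buffer between the source and the sink
agent1.channels.MemoryChannel-2.type = memory
agent1.channels.MemoryChannel-2.capacity = 10000

# Sink: roll a new file in the target directory every 600 seconds
agent1.sinks.svc_0_sink.type = FILE_ROLL
agent1.sinks.svc_0_sink.sink.directory = /flume_runtime/logs
agent1.sinks.svc_0_sink.rollInterval = 600
# The binding that was missing in the earlier attempt:
agent1.sinks.svc_0_sink.channel = MemoryChannel-2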