Hi Asim,

You would not do AvroSink to HDFS, as both of these are sinks.  You would
only use AvroSource and AvroSink if you need to create a multi-hop topology
where you have to forward logs from one Flume agent to another.  If you're
using syslogd to do the forwarding then you would not need AvroSource and
AvroSink.  You can store the data in HDFS using any format you want.  Avro
is nice because Avro containers are splittable, the binary format is very
compact, the data is more portable, and it's faster to serialize and
deserialize than, say, reading in a line of text, splitting it, and type
casting everything to the correct type within your MR job.  As a side
benefit, HUE will display deflate-compressed Avro files for you in a
human-readable way via the File Browser plugin, which is pretty darn nice
when you're working with your data.

I think the naming of AvroSource and AvroSink is actually a little
confusing.  They don't convert your data to Avro for you, other than the
intermediate format they use to pass data between themselves via the Avro
RPC mechanism.  If you want to convert your data to Avro for storage in
HDFS you would do this using an event serializer in the HDFS sink.  By
default I think the HDFS sink writes your data out as SequenceFiles.
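
For example, the HDFS sink portion of a config might look something like
this (just a sketch, not tested against your setup; the agent, sink,
channel, and path names are placeholders):

  agent.sinks.toHdfs.type = hdfs
  agent.sinks.toHdfs.channel = mem
  agent.sinks.toHdfs.hdfs.path = hdfs://namenode/flume/weblogs
  agent.sinks.toHdfs.hdfs.fileSuffix = .avro
  # DataStream lets the serializer control the on-disk format
  agent.sinks.toHdfs.hdfs.fileType = DataStream
  # write Avro container files instead of the default SequenceFiles
  agent.sinks.toHdfs.serializer = avro_event
  agent.sinks.toHdfs.serializer.compressionCodec = deflate

Note that avro_event writes Flume's generic event schema (a headers map
plus a body byte array), which ties into the next point.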

Flume has a built-in AvroEventSerializer, but you will need to write your
own EventSerializer if you have a specific Avro schema you want to use
(which you probably will).  Otherwise, SequenceFiles work well and have
been around a long time, so they are supported by pretty much everything
in the Hadoop ecosystem.

I've never used Impala, but Hive works just fine with deflate-compressed
Avro files using the AvroSerDe, and I think Snappy should work too.

~Ed



On Thu, Feb 6, 2014 at 10:29 AM, Asim Zafir <[email protected]> wrote:

> Ed,
>
> Thanks for the response!  I was wondering: if we do use the Avro sink to
> HDFS, I assume the resident file format in HDFS will be Avro?  The reason I
> am asking is that Hive/Impala and MapReduce are supposed to have a
> dependency on file format and compression, as stated here:
> http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_file_formats.html
>
> I will be really interested to see your response as to how you have
> handled, or suggest handling, these issues.
>
> thanks
>
> Asim
>
>
> On Wed, Feb 5, 2014 at 4:09 PM, ed <[email protected]> wrote:
>
>> Hi Asim,
>>
>>
>> Here's some information that might be helpful based on my relatively new
>> experience with Flume:
>>
>>
>> *1) Do all the webservers in our case need to run a Flume agent?*
>>
>>
>> They could, but they don't necessarily have to.  For example, if you don't
>> want to put a Flume agent on all your web servers, you could forward the
>> logs using syslog to another server running a Flume agent that listens for
>> them with the syslog source.  If you do want to put a Flume agent on your
>> web servers, then you could send the logs to a local syslog source which
>> would use the Avro sink to pass the logs to the Flume collection server,
>> which would do the actual writing to HDFS.  Or you could use a file spooler
>> source to read the logs from disk and then forward them to the collector
>> (again using the Avro source and sink).
>>
>>
>> *Not Using Flume on the Webservers:*
>>
>>
>> [webserver1: apache -> syslogd] ==>
>>
>> [webserver2: apache -> syslogd] ==> [flume collection server: flume
>> syslog source --> flume hdfs sink]
>>
>> [webserver3: apache -> syslogd] ==>
>>
>>
>> *Using Flume on the Webservers, Option 1:*
>>
>>
>> [webserver1: apache -> syslogd -> flume syslog source -> flume avro sink]
>> ==>
>>
>> [webserver2: apache -> syslogd -> flume syslog source -> flume avro sink]
>> ==>  [flume collection server: flume avro source --> flume hdfs sink]
>>
>> [webserver3: apache -> syslogd -> flume syslog source -> flume avro sink]
>> ==>
>>
>>
>> *Using Flume on the Webservers, Option 2:*
>>
>>
>> [webserver1: apache -> filesystem -> flume file spooler source -> flume
>> avro sink] ==>
>>
>> [webserver2: apache -> filesystem -> flume file spooler source -> flume
>> avro sink] ==> [flume collection server: flume avro source --> flume hdfs
>> sink]
>>
>> [webserver3: apache -> filesystem -> flume file spooler source -> flume
>> avro sink] ==>
>>
>>
>> (By the way, there are probably other ways to do this, and you could even
>> split out the collection tier from the storage tier, which is currently
>> done by the same final agent.)
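>>
>> To make Option 1 concrete, the configs might look roughly like this
>> (untested sketch; agent names, hostnames, ports, and paths are
>> placeholders you'd adapt to your environment).  On each webserver:
>>
>>   web.sources = sys
>>   web.channels = mem
>>   web.sinks = toCollector
>>   # local syslogd forwards apache logs to this port
>>   web.sources.sys.type = syslogtcp
>>   web.sources.sys.host = 127.0.0.1
>>   web.sources.sys.port = 5140
>>   web.sources.sys.channels = mem
>>   web.channels.mem.type = memory
>>   # ship events to the collection server over Avro RPC
>>   web.sinks.toCollector.type = avro
>>   web.sinks.toCollector.hostname = flume-collector.example.com
>>   web.sinks.toCollector.port = 4545
>>   web.sinks.toCollector.channel = mem
>>
>> On the collection server:
>>
>>   coll.sources = fromWeb
>>   coll.channels = mem
>>   coll.sinks = toHdfs
>>   # receives the Avro-sink traffic from the webservers
>>   coll.sources.fromWeb.type = avro
>>   coll.sources.fromWeb.bind = 0.0.0.0
>>   coll.sources.fromWeb.port = 4545
>>   coll.sources.fromWeb.channels = mem
>>   coll.channels.mem.type = memory
>>   coll.sinks.toHdfs.type = hdfs
>>   coll.sinks.toHdfs.hdfs.path = hdfs://namenode/flume/weblogs
>>   coll.sinks.toHdfs.channel = mem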
>>
>>
>> *2) Will all the webservers be acting as a source in our setup?*
>>
>>
>> They will be acting as a source in the general sense that you want to
>> ingest their logs.  However, they don't necessarily have to run a flume
>> agent if you have some other way to ship the logs to a listening flume
>> agent somewhere (most likely using syslog but we've also had success with
>> receiving logs via the netcat source).
>>
>>
>> *3) Can we sync webserver logs directly to the HDFS store, bypassing
>> channels?*
>>
>>
>> Not sure what you mean here, but every Flume flow needs a source, a
>> channel, and a sink running (in this case an HDFS sink).  You can't skip
>> the channel, and you can't get the logs into HDFS using only a channel
>> either.
>>
>>
>> *4) Do we have the choice of syncing the weblogs directly to the HDFS
>> store and not letting the webserver write locally? What is the best
>> practice?*
>>
>>
>> If, for example, you're using Apache, you could configure Apache to send
>> the logs directly to syslog, which would forward them to a listening Flume
>> syslog source on a remote server, which would then write the logs to HDFS
>> using the HDFS sink over a memory channel.  In this case you avoid having
>> the logs written to disk, but if one part of the data flow goes down
>> (e.g., the Flume agent crashes) you will lose log data.  You could switch
>> to a file channel, which is durable and would help minimize the risk of
>> data loss.  If you don't care about potential data loss, then the memory
>> channel is much faster and a bit easier to set up.
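>>
>> For reference, here's roughly what the two channel options look like in
>> the config (sketch only; the names, capacities, and paths are
>> placeholders):
>>
>>   # fast but volatile: events are lost if the agent dies
>>   coll.channels.mem.type = memory
>>   coll.channels.mem.capacity = 100000
>>   coll.channels.mem.transactionCapacity = 1000
>>
>>   # durable: events are checkpointed and written to local disk
>>   coll.channels.durable.type = file
>>   coll.channels.durable.checkpointDir = /flume/file-channel/checkpoint
>>   coll.channels.durable.dataDirs = /flume/file-channel/data
>>   coll.channels.durable.capacity = 1000000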
>>
>>
>> *5) What setup would that be where I let Flume watch a local data
>> directory of weblogs and sync it as soon as the data arrives in that
>> directory?*
>>
>>
>> You would want to use a file spooler (spooling directory) source to read
>> the log directory and then send the events to a collector using the Avro
>> source/sink.
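>>
>> Something like this on the webserver side would do it (sketch only; the
>> spool directory and collector hostname are placeholders):
>>
>>   web.sources = spool
>>   web.channels = mem
>>   web.sinks = toCollector
>>   web.sources.spool.type = spooldir
>>   # Apache (or logrotate) should drop completed log files here
>>   web.sources.spool.spoolDir = /var/log/apache2/flume-spool
>>   web.sources.spool.channels = mem
>>   web.channels.mem.type = memory
>>   web.sinks.toCollector.type = avro
>>   web.sinks.toCollector.hostname = flume-collector.example.com
>>   web.sinks.toCollector.port = 4545
>>   web.sinks.toCollector.channel = mem
>>
>> Note that the spooling directory source expects files to be complete and
>> immutable once they land in the directory, so you'd rotate the Apache logs
>> into the spool directory rather than pointing it at a live log file.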
>>
>>
>> *6) Do I need a dedicated Flume server for this setup?*
>>
>>
>> It depends on what else the Flume server is doing.  Personally, I think
>> it's much easier if you dedicate a box to the task, as you don't have to
>> worry about resource contention and monitoring becomes easier.  In
>> addition, if you use the file channel you will want dedicated disks for
>> that purpose.  Note that I'm referring to your collector/storage tier.
>> Obviously, if you use a Flume agent on the webserver it will not be a
>> dedicated box, but this shouldn't be an issue as that agent is only
>> responsible for collecting logs off a single machine and forwarding them
>> on.  (This blog post has some good info on tuning and topology design:
>> https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1)
>>
>>
>> *7) If I do use a memory-based channel and then do the HDFS sync, do I
>> need a dedicated server, or can I run those agents on the webservers
>> themselves, provided there is enough memory? Or would it be recommended to
>> position my config on a centralized Flume server and establish the sync
>> there?*
>>
>>
>> I would not recommend running Flume agents with an HDFS sink on all the
>> webservers.  It seems much better to funnel the logs to one or more agents
>> that write to HDFS rather than have all 50 webservers writing to HDFS
>> themselves.
>>
>>
>> *8) How should we do the capacity planning for a memory-based channel?*
>>
>>
>> You have to decide how long you want to be able to hold data in the
>> memory channel in the event a downstream agent goes down (or the HDFS sink
>> gets backed up).  Once you have that value, you need to figure out what
>> your average event size is and the rate at which you are collecting
>> events.  This will give you a rough idea.  I'm sure there is some
>> per-event memory overhead as well (but I don't know the exact value for
>> that).  If you're using Cloudera Manager you can monitor the memory
>> channel usage directly from the Cloudera Manager interface, which is very
>> useful.
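>>
>> As a rough worked example using your numbers (200GB/day across 50
>> webservers) and an assumed average event size of 500 bytes: 200GB/day is
>> about 2.3MB/s, or very roughly 4,600 events/s, arriving at the collector.
>> If you want the memory channel to absorb, say, a 10 minute outage of the
>> HDFS sink, that's about 2.8 million events (~1.4GB of event data, plus the
>> per-event overhead) that the agent's heap has to hold, so you might end up
>> with something like:
>>
>>   coll.channels.mem.type = memory
>>   # ~10 minutes of buffer at ~4,600 events/s (assumed 500 byte events)
>>   coll.channels.mem.capacity = 3000000
>>   coll.channels.mem.transactionCapacity = 1000
>>
>> and give the JVM a heap comfortably larger than ~1.4GB.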
>>
>>
>> *9) How should we do the capacity planning for a file-based channel?*
>>
>>
>> Assuming you're referring to heap memory, I think I saw in a different
>> thread that you need 32 bytes per event you want to store (the channel
>> capacity) + whatever Flume core will use. So if your channel capacity is 1
>> million events you will need ~32MB of heap space + 100-500MB for Flume
>> core.  You will of course need enough disk space to store the actual logs
>> themselves.
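>>
>> Carrying over the numbers above: with a channel capacity of 1,000,000
>> events, that's roughly 1,000,000 x 32 bytes = ~32MB of heap on top of
>> Flume core, and (at an assumed average of 500 bytes per event) roughly
>> 500MB of disk in the file channel's data directories when the channel is
>> full.  So for the file channel it's mostly a disk-sizing question.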
>>
>>
>> Best,
>>
>>
>> Ed
>>
>>
>>
>>
>>
>> On Thu, Feb 6, 2014 at 6:22 AM, Asim Zafir <[email protected]> wrote:
>>
>>> Flume Users,
>>>
>>>
>>> Here is the problem statement; I will be very much interested to have
>>> your valuable input and feedback on the following:
>>>
>>>
>>> *Assuming that we generate 200GB of logs PER DAY from 50 webservers*
>>>
>>>
>>>
>>> The goal is to sync that to an HDFS repository.
>>>
>>>
>>>
>>>
>>>
>>> 1) Do all the webservers in our case need to run a Flume agent?
>>>
>>> 2) Will all the webservers be acting as a source in our setup?
>>>
>>> 3) Can we sync webserver logs directly to the HDFS store, bypassing
>>> channels?
>>>
>>> 4) Do we have the choice of syncing the weblogs directly to the HDFS store
>>> and not letting the webserver write locally? What is the best practice?
>>>
>>> 5) What setup would that be where I let Flume watch a local data directory
>>> of weblogs and sync it as soon as the data arrives in that directory?
>>>
>>> 6) Do I need a dedicated Flume server for this setup?
>>>
>>> 7) If I do use a memory-based channel and then do the HDFS sync, do I need
>>> a dedicated server, or can I run those agents on the webservers themselves,
>>> provided there is enough memory? Or would it be recommended to position my
>>> config on a centralized Flume server and establish the sync there?
>>>
>>> 8) How should we do the capacity planning for a memory-based channel?
>>>
>>> 9) How should we do the capacity planning for a file-based channel?
>>>
>>>
>>>
>>> sincerely,
>>>
>>> AZ
>>>
>>
>>
>
