No problem. I am glad you started this email discussion and as I said earlier, thank you for using our software! :)
On Wed, Dec 18, 2013 at 2:02 PM, Devin Suiter RDX <[email protected]> wrote:
> Yes, excellent - I was a little muddy on some of the finer points, and I
> am glad you clarified for the sake of other mailing list users - I forgot
> I have the whole context in my head, but other readers might not.
>
> Thanks again!
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Wed, Dec 18, 2013 at 2:23 PM, Brock Noland <[email protected]> wrote:
>
>> Hi Devin,
>>
>> Please find my responses below.
>>
>> On Wed, Dec 18, 2013 at 12:24 PM, Devin Suiter RDX <[email protected]> wrote:
>>
>>> So, if I understand your position on sizing the source properly, you are
>>> saying that the "fsync" operation is the costly part - it locks the
>>> device it is flushing to until the operation completes, and takes some
>>> time, so if you are committing small batches to the channel frequently,
>>> you are monopolizing the device frequently
>>
>> Correct - when using the file channel, small batches spend most of their
>> time actually performing fsyncs.
>>
>>> , but if you set the batch size at the source large enough,
>>
>> The language here is troublesome because "source" is overloaded. The
>> term "source" could refer to the Flume source or to the "source of
>> events" in a tiered architecture. Additionally, some Flume sources
>> cannot control batch size (Avro source, HTTP source, syslog sources),
>> and some have a batch size plus a configured timeout (exec source) that
>> still results in small batches most of the time.
>>
>> When using the file channel, the upstream "source" should send large
>> batches of events. This might be the source connected directly to the
>> file channel, or, in a tiered architecture, say n application servers
>> each running a local agent that uses a memory channel and forwards
>> events to a "collector" tier that uses the file channel. In either case
>> the upstream "sources" should use a large batch size.
>>
>>> you will "take" from the source less frequently, with more data
>>> committed in every operation.
>>
>> The concept here is correct - larger batch sizes result in a larger
>> number of I/Os per fsync, thus increasing the throughput of the system.
>>
>>> Reading goes much faster, and HDFS will manage disk scheduling through
>>> RecordWriter in the HDFS sink, so those are not as problematic - is
>>> that accurate?
>>
>> Just to level set for anyone reading this: File Channel doesn't use
>> HDFS, HDFS is not aware of File Channel, and the disks we are referring
>> to are the disks used by the File Channel, not HDFS.
>>
>>> So, if you are using a syslog source, which doesn't really offer a
>>> batch size parameter, would you set up a tiered flow with an Avro hop
>>> in the middle to aggregate log streams?
>>
>> Yes, that is a common and recommended configuration. Large setups will
>> have a local agent using a memory channel, a first tier using a memory
>> channel, and then a second tier using the file channel.
>>
>>> Something like syslog source >--memory channel--> Avro sink > Avro
>>> source (large batch) >--file channel--> HDFS sink(s), for example?
>>
>> Avro Source doesn't have a batch size parameter - here you need to set
>> a large batch size at the Avro Sink layer.
>>
>>> I appreciate the help you've given on this topic. It's also good to
>>> know that the best practices are going into the doc; that will push
>>> everything forward. I've read the Packt Publishing book on Flume, but
>>> it didn't get into as much detail as I would like. The Cloudera blogs
>>> have been really helpful too.
>>>
>>> Thanks so much!
>>
>> No problem! Thank you for using our software!
>>
>> Brock

--
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
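P.P.S. The tiered flow discussed above (syslog source > memory channel > Avro sink, then Avro source > file channel > HDFS sink) might look roughly like the agent configuration below. This is only a sketch: the agent names, hostname, ports, and directories are illustrative assumptions, and the capacities and batch sizes should be tuned for your hardware.

```
# Edge agent (one per application/log host) - illustrative names/ports
edge.sources = sys1
edge.channels = mem1
edge.sinks = avro1

edge.sources.sys1.type = syslogtcp
edge.sources.sys1.port = 5140
edge.sources.sys1.channels = mem1

edge.channels.mem1.type = memory
edge.channels.mem1.capacity = 100000

# The Avro *sink* is where the large batch is configured,
# since the Avro source has no batch size parameter.
edge.sinks.avro1.type = avro
edge.sinks.avro1.hostname = collector.example.com
edge.sinks.avro1.port = 4141
edge.sinks.avro1.batch-size = 1000
edge.sinks.avro1.channel = mem1

# Collector agent (file channel tier, feeding HDFS)
coll.sources = avro1
coll.channels = fc1
coll.sinks = hdfs1

coll.sources.avro1.type = avro
coll.sources.avro1.bind = 0.0.0.0
coll.sources.avro1.port = 4141
coll.sources.avro1.channels = fc1

# These are file channel disks, not HDFS disks.
coll.channels.fc1.type = file
coll.channels.fc1.checkpointDir = /data1/flume/checkpoint
coll.channels.fc1.dataDirs = /data2/flume/data,/data3/flume/data

coll.sinks.hdfs1.type = hdfs
coll.sinks.hdfs1.hdfs.path = /flume/events/%Y-%m-%d
coll.sinks.hdfs1.hdfs.batchSize = 10000
coll.sinks.hdfs1.channel = fc1
```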
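P.S. for readers skimming the archive: the fsync-amortization point in the thread can be sketched outside of Flume entirely. The micro-benchmark below is illustrative, not Flume code - the `write_events` helper and the 64-byte record size are assumptions for demonstration. It only counts fsync calls; with batch size 1 every event pays the full fsync cost, while a large batch amortizes one fsync across the whole batch.

```python
import os
import tempfile

def write_events(num_events, batch_size):
    """Write num_events fixed-size records, fsyncing once per batch.
    Returns the number of fsync calls performed."""
    fsyncs = 0
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            for i in range(0, num_events, batch_size):
                batch = min(batch_size, num_events - i)
                f.write(b"x" * 64 * batch)   # one 64-byte record per event
                f.flush()
                os.fsync(f.fileno())         # the costly durability barrier
                fsyncs += 1
    finally:
        os.remove(path)
    return fsyncs

# With batch size 1, every event forces its own fsync;
# with batch size 1000, one fsync covers all 1000 events.
print(write_events(1000, 1))     # 1000 fsyncs
print(write_events(1000, 1000))  # 1 fsync
```

Timing the two calls on a real (non-tmpfs) disk makes the throughput difference obvious, which is exactly why the upstream tier should commit in large batches.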
