Brock, I saw your reply on this come through the other day, and meant to respond but my day got away from me.

So, if I understand your position on sizing the source properly, you are saying that the "fsync" operation is the costly part: it locks the device it is flushing to until the operation completes, and it takes some time. If you commit small batches to the channel frequently, you monopolize the device frequently, but if you set the batch size at the source large enough, you "take" from the source less often, with more data committed in each operation. Reading goes much faster, and HDFS will manage disk scheduling through the RecordWriter in the HDFS sink, so those parts are not as problematic. Is that accurate?

So, if you are using a syslog source, which doesn't really offer a batch size parameter, would you set up a tiered flow with an Avro hop in the middle to aggregate the log streams? Something like syslog source --> memory channel --> Avro sink ==> Avro source (large batch) --> file channel --> HDFS sink(s), for example?
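
To make that concrete, here is a rough sketch of the two agent configs I am picturing. The agent names, hostnames, ports, directories, and batch sizes are placeholders I made up, so please read this as the shape of the flow rather than a tested setup:

# Tier 1 (edge): syslog in, memory channel, Avro out.
tier1.sources = syslog-src
tier1.channels = mem-ch
tier1.sinks = avro-sink

tier1.sources.syslog-src.type = syslogtcp
tier1.sources.syslog-src.host = 0.0.0.0
tier1.sources.syslog-src.port = 5140
tier1.sources.syslog-src.channels = mem-ch

tier1.channels.mem-ch.type = memory
tier1.channels.mem-ch.capacity = 100000
tier1.channels.mem-ch.transactionCapacity = 10000

# The Avro sink's batch-size is what turns the syslog trickle into
# large batches for the next hop.
tier1.sinks.avro-sink.type = avro
tier1.sinks.avro-sink.channel = mem-ch
tier1.sinks.avro-sink.hostname = collector.example.com
tier1.sinks.avro-sink.port = 4141
tier1.sinks.avro-sink.batch-size = 10000

# Tier 2 (collector): Avro in, file channel, HDFS out.
tier2.sources = avro-src
tier2.channels = file-ch
tier2.sinks = hdfs-sink

tier2.sources.avro-src.type = avro
tier2.sources.avro-src.bind = 0.0.0.0
tier2.sources.avro-src.port = 4141
tier2.sources.avro-src.channels = file-ch

tier2.channels.file-ch.type = file
tier2.channels.file-ch.checkpointDir = /flume/checkpoint
tier2.channels.file-ch.dataDirs = /flume/data
# transactionCapacity has to be at least as large as the incoming batch.
tier2.channels.file-ch.transactionCapacity = 10000

tier2.sinks.hdfs-sink.type = hdfs
tier2.sinks.hdfs-sink.channel = file-ch
tier2.sinks.hdfs-sink.hdfs.path = /flume/events/syslog
tier2.sinks.hdfs-sink.hdfs.batchSize = 10000

If I'm reading the docs right, the Avro source commits each batch it receives in a single transaction, so the batch-size on the tier 1 Avro sink effectively determines how much data each file channel fsync on tier 2 covers. Is that the right way to think about it?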

I appreciate the help you've given on this topic. It's also good to know that the best practices are going into the doc; that will push everything forward. I've read the Packt Publishing book on Flume, but it didn't get into as much detail as I would like. The Cloudera blogs have been really helpful too. Thanks so much!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Wed, Dec 18, 2013 at 12:51 PM, Brock Noland <[email protected]> wrote:

> FYI I am trying to capture some of the best practices in the Flume doc
> itself:
>
> https://issues.apache.org/jira/browse/FLUME-2277
>
>
> On Tue, Dec 17, 2013 at 12:17 PM, Brock Noland <[email protected]> wrote:
>
>> Hi,
>>
>> I'd also add that the biggest issue I see with the file channel is
>> batch size at the source. Long story short, the file channel was written
>> to guarantee no data loss. In order to do that, when a transaction is
>> committed we need to perform an "fsync" on the disk the transaction was
>> written to. fsyncs are very expensive, so in order to obtain good
>> performance the source must have written a large batch of data. Here is
>> some more information on this topic:
>>
>> http://blog.cloudera.com/blog/2012/09/about-apache-flume-filechannel/
>>
>> http://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
>>
>> Brock
>>
>>
>> On Tue, Dec 17, 2013 at 11:50 AM, iain wright <[email protected]> wrote:
>>
>>> I've been meaning to try ZFS with an SSD-based SLOG/ZIL (intent log)
>>> for this, as it seems like a good use case.
>>>
>>> Something like:
>>>
>>> pool
>>>   sdaN - ZIL (enterprise-grade SSD with capacitor/battery for
>>>   persisting buffers in the event of sudden power loss)
>>>   mirror
>>>     sda1
>>>     sda2
>>>   mirror
>>>     sda3
>>>     sda4
>>>
>>> There is probably further tuning that can be done within ZFS as well,
>>> but I believe the ZIL will allow for immediate responses to Flume's
>>> checkpoint/data fsyncs while the "actual data" is flushed asynchronously
>>> to the spindles.
>>>
>>> Haven't tried this and YMMV. Some good reading available here:
>>> https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/
>>>
>>> Cheers
>>>
>>>
>>> On Dec 17, 2013 8:30 AM, "Devin Suiter RDX" <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> There has been a lot of discussion about file channel speed today,
>>>> and I have had a dilemma I was hoping for some feedback on, since the
>>>> topic is hot.
>>>>
>>>> Regarding this:
>>>>
>>>> "Hi,
>>>>
>>>> 1) You are only using a single disk for the file channel, and it looks
>>>> like a single disk for both checkpoint and data directories, therefore
>>>> throughput is going to be extremely slow."
>>>>
>>>> How do you solve, in a practical sense, the requirement for the file
>>>> channel to have a range of disks for the best R/W speed, while still
>>>> keeping network visibility to the data sources and the Hadoop cluster
>>>> at the same time?
>>>>
>>>> It seems like for a production file channel implementation, the best
>>>> solution is to give Flume a dedicated server somewhere near the edge,
>>>> with a JBOD pile properly mounted and partitioned. But that adds to the
>>>> implementation cost.
>>>>
>>>> The alternative seems to be to run Flume on a physical Cloudera Manager
>>>> SCM server that has some extra disks, or to run Flume agents concurrent
>>>> with datanode processes on worker nodes, but neither of those seems like
>>>> a good idea, especially piggybacking on worker nodes, where file
>>>> channel -> HDFS will compound the issue...
>>>>
>>>> I know the namenode should definitely not be involved.
>>>>
>>>> I suppose you could virtualize a few servers on a properly networked
>>>> host with a fast SAN/NAS connection and get by OK, but that will merge
>>>> your parallelization at some point...
>>>>
>>>> Any ideas on the subject?
>>>>
>>>> *Devin Suiter*
>>>> Jr. Data Solutions Software Engineer
>>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>
>>
>> --
>> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>>
>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>
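
P.S. On the "single disk for both checkpoint and data directories" comment quoted above: am I right that the practical fix, once the hardware is there, is simply to point checkpointDir and dataDirs at separate physical disks? Something like the following, where the mount points are placeholders:

agent.channels = file-ch
agent.channels.file-ch.type = file
# Checkpoint metadata on its own disk...
agent.channels.file-ch.checkpointDir = /disk1/flume/checkpoint
# ...and the data logs spread across the others (comma-separated list).
agent.channels.file-ch.dataDirs = /disk2/flume/data,/disk3/flume/data,/disk4/flume/data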
