Brock, I saw your reply on this come through the other day, and meant to respond but my day got away from me.

So, if I understand your position on sizing the source properly, you are saying that the "fsync" operation is the costly part: it locks the device it is flushing to until the operation completes, and it takes some time. If you commit small batches to the channel frequently, you monopolize the device frequently, but if you set the batch size at the source large enough, you "take" from the source less often, with more data committed in each operation. Reading goes much faster, and HDFS will manage disk scheduling through the RecordWriter in the HDFS sink, so those parts are not as problematic. Is that accurate?

So, if you are using a syslog source, which doesn't really offer a batch size parameter, would you set up a tiered flow with an Avro hop in the middle to aggregate the log streams? Something like syslog source --> memory channel --> Avro sink ==> Avro source (large batch) --> file channel --> HDFS sink(s), for example?
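
To make that concrete, here is a rough sketch of the two agent configs I am picturing. The agent names, hostnames, ports, directories, and batch sizes are placeholders I made up, so please read this as the shape of the flow rather than a tested setup:

# Tier 1 (edge): syslog in, memory channel, Avro out.
tier1.sources = syslog-src
tier1.channels = mem-ch
tier1.sinks = avro-sink

tier1.sources.syslog-src.type = syslogtcp
tier1.sources.syslog-src.host = 0.0.0.0
tier1.sources.syslog-src.port = 5140
tier1.sources.syslog-src.channels = mem-ch

tier1.channels.mem-ch.type = memory
tier1.channels.mem-ch.capacity = 100000
tier1.channels.mem-ch.transactionCapacity = 10000

# The Avro sink's batch-size is what turns the syslog trickle into
# large batches for the next hop.
tier1.sinks.avro-sink.type = avro
tier1.sinks.avro-sink.channel = mem-ch
tier1.sinks.avro-sink.hostname = collector.example.com
tier1.sinks.avro-sink.port = 4141
tier1.sinks.avro-sink.batch-size = 10000

# Tier 2 (collector): Avro in, file channel, HDFS out.
tier2.sources = avro-src
tier2.channels = file-ch
tier2.sinks = hdfs-sink

tier2.sources.avro-src.type = avro
tier2.sources.avro-src.bind = 0.0.0.0
tier2.sources.avro-src.port = 4141
tier2.sources.avro-src.channels = file-ch

tier2.channels.file-ch.type = file
tier2.channels.file-ch.checkpointDir = /flume/checkpoint
tier2.channels.file-ch.dataDirs = /flume/data
# transactionCapacity has to be at least as large as the incoming batch.
tier2.channels.file-ch.transactionCapacity = 10000

tier2.sinks.hdfs-sink.type = hdfs
tier2.sinks.hdfs-sink.channel = file-ch
tier2.sinks.hdfs-sink.hdfs.path = /flume/events/syslog
tier2.sinks.hdfs-sink.hdfs.batchSize = 10000

If I'm reading the docs right, the Avro source commits each batch it receives in a single transaction, so the batch-size on the tier 1 Avro sink effectively determines how much data each file channel fsync on tier 2 covers. Is that the right way to think about it?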

I appreciate the help you've given on this topic. It's also good to know that the best practices are going into the doc; that will push everything forward. I've read the Packt Publishing book on Flume, but it didn't get into as much detail as I would like. The Cloudera blogs have been really helpful too. Thanks so much!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Wed, Dec 18, 2013 at 12:51 PM, Brock Noland <[email protected]> wrote:

> FYI I am trying to capture some of the best practices in the Flume doc
> itself:
>
> https://issues.apache.org/jira/browse/FLUME-2277
>
>
> On Tue, Dec 17, 2013 at 12:17 PM, Brock Noland <[email protected]> wrote:
>
>> Hi,
>>
>> I'd also add that the biggest issue I see with the file channel is
>> batch size at the source. Long story short, the file channel was written
>> to guarantee no data loss. In order to do that, when a transaction is
>> committed we need to perform an "fsync" on the disk the transaction was
>> written to. fsyncs are very expensive, so in order to obtain good
>> performance the source must have written a large batch of data. Here is
>> some more information on this topic:
>>
>> http://blog.cloudera.com/blog/2012/09/about-apache-flume-filechannel/
>>
>> http://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
>>
>> Brock
>>
>>
>> On Tue, Dec 17, 2013 at 11:50 AM, iain wright <[email protected]> wrote:
>>
>>> I've been meaning to try ZFS with an SSD-based SLOG/ZIL (intent log)
>>> for this, as it seems like a good use case.
>>>
>>> Something like:
>>>
>>> pool
>>>   sdaN - ZIL (enterprise-grade SSD with capacitor/battery for
>>>   persisting buffers in the event of sudden power loss)
>>>   mirror
>>>     sda1
>>>     sda2
>>>   mirror
>>>     sda3
>>>     sda4
>>>
>>> There is probably further tuning that can be done within ZFS as well,
>>> but I believe the ZIL will allow for immediate responses to Flume's
>>> checkpoint/data fsyncs while the "actual data" is flushed asynchronously
>>> to the spindles.
>>>
>>> Haven't tried this and YMMV. Some good reading available here:
>>> https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/
>>>
>>> Cheers
>>>
>>>
>>> On Dec 17, 2013 8:30 AM, "Devin Suiter RDX" <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> There has been a lot of discussion about file channel speed today,
>>>> and I have had a dilemma I was hoping for some feedback on, since the
>>>> topic is hot.
>>>>
>>>> Regarding this:
>>>>
>>>> "Hi,
>>>>
>>>> 1) You are only using a single disk for the file channel, and it looks
>>>> like a single disk for both checkpoint and data directories, therefore
>>>> throughput is going to be extremely slow."
>>>>
>>>> How do you solve, in a practical sense, the requirement for the file
>>>> channel to have a range of disks for the best R/W speed, while still
>>>> keeping network visibility to the data sources and the Hadoop cluster
>>>> at the same time?
>>>>
>>>> It seems like for a production file channel implementation, the best
>>>> solution is to give Flume a dedicated server somewhere near the edge,
>>>> with a JBOD pile properly mounted and partitioned. But that adds to the
>>>> implementation cost.
>>>>
>>>> The alternative seems to be to run Flume on a physical Cloudera Manager
>>>> SCM server that has some extra disks, or to run Flume agents concurrent
>>>> with datanode processes on worker nodes, but neither of those seems like
>>>> a good idea, especially piggybacking on worker nodes, where file
>>>> channel -> HDFS will compound the issue...
>>>>
>>>> I know the namenode should definitely not be involved.
>>>>
>>>> I suppose you could virtualize a few servers on a properly networked
>>>> host with a fast SAN/NAS connection and get by OK, but that will merge
>>>> your parallelization at some point...
>>>>
>>>> Any ideas on the subject?
>>>>
>>>> *Devin Suiter*
>>>> Jr. Data Solutions Software Engineer
>>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>
>>
>> --
>> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>>
>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
>
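
P.S. On the "single disk for both checkpoint and data directories" comment quoted above: am I right that the practical fix, once the hardware is there, is simply to point checkpointDir and dataDirs at separate physical disks? Something like the following, where the mount points are placeholders:

agent.channels = file-ch
agent.channels.file-ch.type = file
# Checkpoint metadata on its own disk...
agent.channels.file-ch.checkpointDir = /disk1/flume/checkpoint
# ...and the data logs spread across the others (comma-separated list).
agent.channels.file-ch.dataDirs = /disk2/flume/data,/disk3/flume/data,/disk4/flume/data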
