FYI I am trying to capture some of the best practices in the Flume doc itself:
https://issues.apache.org/jira/browse/FLUME-2277

On Tue, Dec 17, 2013 at 12:17 PM, Brock Noland <[email protected]> wrote:

> Hi,
>
> I'd also add that the biggest issue I see with the file channel is batch
> size at the source. Long story short: the file channel was written to
> guarantee no data loss. To do that, when a transaction is committed we
> need to perform an fsync on the disk the transaction was written to.
> fsyncs are very expensive, so to get good performance the source must
> write large batches of data. Here is some more information on this topic:
>
> http://blog.cloudera.com/blog/2012/09/about-apache-flume-filechannel/
> http://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
>
> Brock
>
> On Tue, Dec 17, 2013 at 11:50 AM, iain wright <[email protected]> wrote:
>
>> I've been meaning to try ZFS with an SSD-based SLOG/ZIL (intent log) for
>> this, as it seems like a good use case.
>>
>> Something like:
>>
>> pool
>>   sdaN - ZIL (enterprise-grade SSD with capacitor/battery for persisting
>>          buffers in the event of sudden power loss)
>>   mirror
>>     sda1
>>     sda2
>>   mirror
>>     sda3
>>     sda4
>>
>> There's probably further tuning that can be done within ZFS as well, but
>> I believe the ZIL will allow immediate responses to Flume's
>> checkpoint/data fsyncs while the actual data is flushed asynchronously
>> to the spindles.
>>
>> Haven't tried this and YMMV. Some good reading is available here:
>> https://pthree.org/2013/04/19/zfs-administration-appendix-a-visualizing-the-zfs-intent-log/
>>
>> Cheers
>>
>> On Dec 17, 2013 8:30 AM, "Devin Suiter RDX" <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> There has been a lot of discussion about file channel speed today, and
>>> I have a dilemma I was hoping for some feedback on, since the topic is
>>> hot.
>>> Regarding this:
>>>
>>> "Hi,
>>>
>>> 1) You are only using a single disk for the file channel, and it looks
>>> like a single disk for both the checkpoint and data directories, so
>>> throughput is going to be extremely slow."
>>>
>>> How do you solve, in a practical sense, the file channel's requirement
>>> for a range of disks for best R/W speed while still having network
>>> visibility to both the source data sources and the Hadoop cluster?
>>>
>>> It seems like for a production file channel implementation, the best
>>> solution is to give Flume a dedicated server somewhere near the edge
>>> with a JBOD pile properly mounted and partitioned. But that adds to
>>> implementation cost.
>>>
>>> The alternative seems to be to run Flume on a physical Cloudera Manager
>>> SCM server that has some extra disks, or to run Flume agents alongside
>>> datanode processes on worker nodes. Neither seems like a good idea,
>>> especially piggybacking on worker nodes, where a file channel feeding
>>> HDFS will compound the issue...
>>>
>>> I know the namenode should definitely not be involved.
>>>
>>> I suppose you could virtualize a few servers on a properly networked
>>> host with a fast SAN/NAS connection and get by OK, but that will merge
>>> your parallelization at some point...
>>>
>>> Any ideas on the subject?
>>>
>>> Devin Suiter
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com

--
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
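For the multi-disk layout the first quoted reply asks about, the file channel itself supports splitting checkpoint and data directories across mount points. A sketch of the relevant agent configuration, assuming a hypothetical agent `a1` with components `src1`/`ch1`/`snk1` and placeholder mount points for dedicated disks (`checkpointDir` and `dataDirs` are standard file channel properties; `batchSize` is supported by several sources, such as the spooling directory source, and by the HDFS sink):

```properties
# File channel with checkpoint and data on separate dedicated disks
a1.channels.ch1.type = file
a1.channels.ch1.checkpointDir = /disk1/flume/checkpoint
a1.channels.ch1.dataDirs = /disk2/flume/data,/disk3/flume/data

# Larger batches amortize the per-commit fsync
a1.sources.src1.batchSize = 1000
a1.sinks.snk1.hdfs.batchSize = 1000
```

Pointing `dataDirs` at multiple physical disks lets the channel stripe writes across spindles, which is the cheaper middle ground between a single shared disk and a dedicated JBOD edge server.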
