No problem. I am glad you started this email discussion and as I said earlier, thank you for using our software! :)
On Wed, Dec 18, 2013 at 2:02 PM, Devin Suiter RDX <[email protected]> wrote:
> Yes, excellent - I was a little muddy on some of the finer points, and I
> am glad you clarified for the sake of other mailing list users - I forgot
> I have the whole context in my head, but other readers might not.
>
> Thanks again!
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Wed, Dec 18, 2013 at 2:23 PM, Brock Noland <[email protected]> wrote:
>
>> Hi Devin,
>>
>> Please find my responses below.
>>
>> On Wed, Dec 18, 2013 at 12:24 PM, Devin Suiter RDX <[email protected]> wrote:
>>
>>> So, if I understand your position on sizing the source properly, you are
>>> saying that the "fsync" operation is the costly part - it locks the
>>> device it is flushing to until the operation completes, and takes some
>>> time, so if you are committing small batches to the channel frequently,
>>> you are monopolizing the device frequently
>>
>> Correct - when using the file channel, small batches spend most of their
>> time actually performing fsyncs.
>>
>>> , but if you set the batch size at the source large enough,
>>
>> The language here is troublesome because "source" is overloaded. The
>> term "source" could refer to the Flume source or to the "source of
>> events" in a tiered architecture. Additionally, some Flume sources
>> cannot control batch size (Avro source, HTTP source, syslog sources),
>> and some have a batch size plus a configured timeout (exec source) that
>> still results in small batches most of the time.
>>
>> When using the file channel, the upstream "source" should send large
>> batches of events. This might be the source connected directly to the
>> file channel, or, in a tiered architecture, say n application servers
>> each running a local agent that uses a memory channel and forwards
>> events to a "collector" tier that uses the file channel. In either case
>> the upstream "sources" should use a large batch size.
>>
>>> you will "take" from the source less frequently, with more data
>>> committed in every operation.
>>
>> The concept here is correct - larger batch sizes result in a larger
>> number of I/Os per fsync, thus increasing the throughput of the system.
>>
>>> Reading goes much faster, and HDFS will manage disk scheduling through
>>> RecordWriter in the HDFS sink, so those are not as problematic - is
>>> that accurate?
>>
>> Just to level set for anyone reading this: File Channel doesn't use
>> HDFS, HDFS is not aware of File Channel, and the disks we are referring
>> to are the disks used by the File Channel, not HDFS.
>>
>>> So, if you are using a syslog source, which doesn't really offer a
>>> batch size parameter, would you set up a tiered flow with an Avro hop
>>> in the middle to aggregate log streams?
>>
>> Yes, that is a common and recommended configuration. Large setups will
>> have a local agent using a memory channel, a first tier using a memory
>> channel, and then a second tier using the file channel.
>>
>>> Something like syslog source >--memory channel--> Avro sink > Avro
>>> source (large batch) >--file channel--> HDFS sink(s), for example?
>>
>> Avro Source doesn't have a batch size parameter - here you need to set
>> a large batch size at the Avro Sink layer.
>>
>>> I appreciate the help you've given on this topic. It's also good to
>>> know that the best practices are going into the doc; that will push
>>> everything forward. I've read the Packt Publishing book on Flume, but
>>> it didn't get into as much detail as I would like. The Cloudera blogs
>>> have been really helpful too.
>>>
>>> Thanks so much!
>>
>> No problem! Thank you for using our software!
>>
>> Brock

--
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
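P.P.S. The tiered flow discussed above (syslog source > memory channel > Avro sink, then Avro source > file channel > HDFS sink) might look roughly like the agent configuration below. This is only a sketch: the agent names, hostname, ports, and directories are illustrative assumptions, and the capacities and batch sizes should be tuned for your hardware.

```
# Edge agent (one per application/log host) - illustrative names/ports
edge.sources = sys1
edge.channels = mem1
edge.sinks = avro1

edge.sources.sys1.type = syslogtcp
edge.sources.sys1.port = 5140
edge.sources.sys1.channels = mem1

edge.channels.mem1.type = memory
edge.channels.mem1.capacity = 100000

# The Avro *sink* is where the large batch is configured,
# since the Avro source has no batch size parameter.
edge.sinks.avro1.type = avro
edge.sinks.avro1.hostname = collector.example.com
edge.sinks.avro1.port = 4141
edge.sinks.avro1.batch-size = 1000
edge.sinks.avro1.channel = mem1

# Collector agent (file channel tier, feeding HDFS)
coll.sources = avro1
coll.channels = fc1
coll.sinks = hdfs1

coll.sources.avro1.type = avro
coll.sources.avro1.bind = 0.0.0.0
coll.sources.avro1.port = 4141
coll.sources.avro1.channels = fc1

# These are file channel disks, not HDFS disks.
coll.channels.fc1.type = file
coll.channels.fc1.checkpointDir = /data1/flume/checkpoint
coll.channels.fc1.dataDirs = /data2/flume/data,/data3/flume/data

coll.sinks.hdfs1.type = hdfs
coll.sinks.hdfs1.hdfs.path = /flume/events/%Y-%m-%d
coll.sinks.hdfs1.hdfs.batchSize = 10000
coll.sinks.hdfs1.channel = fc1
```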
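P.S. for readers skimming the archive: the fsync-amortization point in the thread can be sketched outside of Flume entirely. The micro-benchmark below is illustrative, not Flume code - the `write_events` helper and the 64-byte record size are assumptions for demonstration. It only counts fsync calls; with batch size 1 every event pays the full fsync cost, while a large batch amortizes one fsync across the whole batch.

```python
import os
import tempfile

def write_events(num_events, batch_size):
    """Write num_events fixed-size records, fsyncing once per batch.
    Returns the number of fsync calls performed."""
    fsyncs = 0
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            for i in range(0, num_events, batch_size):
                batch = min(batch_size, num_events - i)
                f.write(b"x" * 64 * batch)   # one 64-byte record per event
                f.flush()
                os.fsync(f.fileno())         # the costly durability barrier
                fsyncs += 1
    finally:
        os.remove(path)
    return fsyncs

# With batch size 1, every event forces its own fsync;
# with batch size 1000, one fsync covers all 1000 events.
print(write_events(1000, 1))     # 1000 fsyncs
print(write_events(1000, 1000))  # 1 fsync
```

Timing the two calls on a real (non-tmpfs) disk makes the throughput difference obvious, which is exactly why the upstream tier should commit in large batches.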
