Yes, excellent - I was a little muddy on some of the finer points, and I am glad you clarified for the sake of other mailing list users - I forgot I have the whole context in my head, but other readers might not.
Thanks again!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Wed, Dec 18, 2013 at 2:23 PM, Brock Noland <[email protected]> wrote:

> Hi Devin,
>
> Please find my response below.
>
> On Wed, Dec 18, 2013 at 12:24 PM, Devin Suiter RDX <[email protected]> wrote:
>
>> So, if I understand your position on sizing the source properly, you are
>> saying that the "fsync" operation is the costly part - it locks the device
>> it is flushing to until the operation completes, and takes some time, so if
>> you are committing small batches to the channel frequently, you are
>> monopolizing the device frequently,
>>
> Correct - when using the file channel, small batches spend most of their
> time actually performing fsyncs.
>
>> but if you set the batch size at the source large enough,
>>
> The language here is troublesome because "source" is overloaded. The term
> "source" could refer to the Flume source or to the "source of events" in a
> tiered architecture. Additionally, some Flume sources cannot control batch
> size (Avro source, HTTP source, syslog), and some have a batch size plus a
> configured timeout (exec source) that still results in small batches most
> of the time.
>
> When using the file channel, the upstream "source" should send large
> batches of events. This might be the source connected directly to the file
> channel, or, in a tiered architecture, say n application servers each
> running a local agent that uses a memory channel and forwards events to a
> "collector" tier that uses the file channel. In either case the upstream
> "sources" should use a large batch size.
>
>> you will "take" from the source less frequently, with more data committed
>> in every operation.
>>
> The concept here is correct - larger batch sizes result in a larger number
> of I/Os per fsync, thus increasing the throughput of the system.
>
>> Reading goes much faster, and HDFS will manage disk scheduling through
>> RecordWriter in the HDFS sink, so those are not as problematic - is that
>> accurate?
>>
> Just to level-set for anyone reading this: the file channel doesn't use
> HDFS, HDFS is not aware of the file channel, and the disks we are referring
> to are the disks used by the file channel, not HDFS.
>
>> So, if you are using a syslog source, which doesn't really offer a batch
>> size parameter, would you set up a tiered flow with an Avro hop in the
>> middle to aggregate log streams?
>>
> Yes, that is a common and recommended configuration. Large setups will
> have a local agent using a memory channel, a first tier using a memory
> channel, and then a second tier using the file channel.
>
>> Something like syslog source >--memory channel--> Avro sink > Avro source
>> (large batch) >--file channel--> HDFS sink(s), for example?
>>
> The Avro source doesn't have a batch size parameter - here you need to set
> a large batch size at the Avro sink layer.
>
>> I appreciate the help you've given on this topic. It's also good to know
>> that the best practices are going into the docs; that will push everything
>> forward. I've read the Packt Publishing book on Flume, but it didn't get
>> into as much detail as I would like. The Cloudera blogs have been really
>> helpful too.
>>
>> Thanks so much!
>>
> No problem! Thank you for using our software!
>
> Brock
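
For readers who want to see what the tiered flow discussed above looks like on disk, here is a minimal configuration sketch of the syslog > memory channel > Avro sink > Avro source > file channel > HDFS flow. The agent names, hosts, ports, directories, and batch sizes are illustrative placeholders, not values from this thread; the point to take away is that the large batch size is set on the Avro sink (tier 1) and the HDFS sink (tier 2), since the Avro source itself has no batch size parameter.

# --- Tier 1: local agent on each application server (names are hypothetical) ---
tier1.sources  = syslog-src
tier1.channels = mem-ch
tier1.sinks    = avro-fwd

tier1.sources.syslog-src.type = syslogtcp
tier1.sources.syslog-src.host = 0.0.0.0
tier1.sources.syslog-src.port = 5140
tier1.sources.syslog-src.channels = mem-ch

tier1.channels.mem-ch.type = memory
tier1.channels.mem-ch.capacity = 10000
tier1.channels.mem-ch.transactionCapacity = 1000

# Large batch size here - the downstream Avro source cannot set one itself
tier1.sinks.avro-fwd.type = avro
tier1.sinks.avro-fwd.hostname = collector.example.com
tier1.sinks.avro-fwd.port = 4141
tier1.sinks.avro-fwd.batch-size = 1000
tier1.sinks.avro-fwd.channel = mem-ch

# --- Tier 2: collector agent with file channel in front of HDFS ---
collector.sources  = avro-src
collector.channels = file-ch
collector.sinks    = hdfs-sink

collector.sources.avro-src.type = avro
collector.sources.avro-src.bind = 0.0.0.0
collector.sources.avro-src.port = 4141
collector.sources.avro-src.channels = file-ch

# transactionCapacity must be at least as large as the incoming/outgoing batch sizes
collector.channels.file-ch.type = file
collector.channels.file-ch.checkpointDir = /flume/file-channel/checkpoint
collector.channels.file-ch.dataDirs = /flume/file-channel/data
collector.channels.file-ch.transactionCapacity = 1000

collector.sinks.hdfs-sink.type = hdfs
collector.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/events
collector.sinks.hdfs-sink.hdfs.batchSize = 1000
collector.sinks.hdfs-sink.channel = file-ch

With a layout along these lines, the collector's file channel commits roughly 1,000 events per transaction from the Avro sink and drains 1,000 events per HDFS sink batch, so each fsync covers a large batch rather than a handful of events - which is the behaviour Brock recommends above.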
