Thanks for all the great replies. My specific situation is a bit more complex than I let on initially.

Flume running multiple agents will absolutely scale to the size we need for production. But our system is time-based: it waits for real-world measurements to arrive, so for development and demos we have a simulation layer that generates realistic data and pushes it in (i.e., it creates events at 1000x accelerated time, so we can see the effects of a change without waiting weeks).

For development we run HDFS and Flume in a VM (Vagrant + VirtualBox) on our laptops. I suppose the memory channel is fine in this case, since it's all test data, but I need maximum single-agent speed to support the higher time accelerations I want.

Unfortunately, our production system demands horizontal scaling (where Flume is great), while our dev environment would be best served by vertical scaling (not as much Flume's goal, from what I can tell).

Are there any tricks or tweaks that can get single-agent speeds up? What's the fastest (maybe not 100% safe) source type? Can we minimize the cost of ACKing messages in the source?

On Thu, Mar 27, 2014 at 12:10 PM, Mike Keane <[email protected]> wrote:

> I tried to do a proof of concept with Netcat source with 1.3.0 or 1.3.1
> and it failed miserably - I was able to make a change to improve its
> performance, arguably a bug fix (I think socket acknowledgement it was
> expecting) but Netcat source was still my bottleneck.
>
> Have you read the blog posts on performance tuning? I'm not sure where
> you are in your flume implementation but I found them helpful.
> https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1 &
> https://blogs.apache.org/flume/entry/apache_flume_filechannel
>
> Since you need persistent storage I believe your only option still is the
> file channel. To get the performance you need you'll need dedicated disks
> for the queue and write ahead log - I had good luck with a solid state
> drive. With a single disk drive performance was awful.
>
> To get the throughput I wanted with compression I had one source tied to 6
> file channels with compression on each channel. Perhaps there is a better
> way but that is how I got it working.
>
> We also configured Forced Write Back on CentOS boxes serving as flume
> agents. That was an optimization our IT Operations team made that helped
> throughput. That is a skill I don't have but I believe it does put you at
> risk of data loss if the server fails because it does more caching before
> flushing to disk.
>
> We are currently fluming between 40 and 50 billion log lines per day
> (10-12TB of data) from 14 servers in the "collector tier" sinking the data
> to 8 servers in the "storage tier" that write to HDFS (MapR's
> implementation) without problem. We had no problem with 1/2 the servers
> however we configured fail over and paired up the servers for this
> purpose. Which by the way works flawlessly - able to pull one server out
> for maintenance and add back in no problem.
>
> Here are some high level points to our implementation.
>
> 1. Instead of netcat source I made use of the Embedded Agent - When I
>    created an event to flume (EventBuilder.withBody(payload, hdrs)) I put a
>    configurable number of log lines in the payload, usually 200 lines of
>    log data. Ultimately I went away from text data altogether and
>    serialized 200 avro "log objects" as an avro data file byte array and
>    that was my payload.
>
> 2. Keep your batch size large. I set mine to 50 - so 10,000 log lines
>    (or objects) in a single batch.
>
> 3. You will get duplicates so be prepared to either customize flume to
>    prevent duplicates (our solution) or write map reduce jobs to remove
>    duplicates.
>
> Regards,
>
> Mike
>
> ________________________________
> From: Andrew Ehrlich [[email protected]]
> Sent: Thursday, March 27, 2014 1:07 PM
> To: [email protected]
> Subject: Re: Fastest way to get data into flume?
>
> What about having more than one flume agent?
>
> You could have two agents that read the small messages and sink to HDFS,
> or two agents that read the messages, serialize them, and send them to a
> third agent which sinks them into HDFS.
>
> On Thu, Mar 27, 2014 at 9:43 AM, Chris Schneider <[email protected]> wrote:
>
> I have a fair bit of data continually being created in the form of
> smallish messages (a few hundred bytes), which needs to enter flume, and
> eventually sink into HDFS.
>
> I need to be sure that the data lands in persistent storage and won't be
> lost, but otherwise throughput isn't important. It just needs to be fast
> enough to not back up.
>
> I'm running into a bottleneck in the initial ingestion of data.
>
> I've tried the netcat source, and the thrift source but both have capped
> out at a thousand or so records per second.
>
> Batching up the thrift api items into sets of 10 and using appendBatch is
> a pretty large speedup, but still not enough.
>
> Here's a gist of my ruby test script, and some example runs, and my config.
>
> https://gist.github.com/cschneid/9792305
>
> 1. Are there any obvious performance changes I can make to speed up
>    ingestion?
> 2. How fast can flume reasonably go? Should I switch my source to be
>    something else that's faster? What?
> 3. Is there a better tool for this kind of task? (rapid, safe ingestion
>    of small messages).
>
> Thanks!
> Chris
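
Mike's points 1 and 2 in the quoted thread describe two levels of batching: many log lines packed into one event body, and many events sent per batch, so each acknowledgement covers lines_per_event x events_per_batch lines (200 x 50 = 10,000 in his setup). Below is a minimal Python sketch of that grouping only. It is not tied to any Flume API: the function names (chunked, pack_events, make_batches) and the newline-joined payload format are illustrative assumptions, and a real sender would hand each batch to whatever client it uses, for example the Thrift appendBatch call mentioned in the original question.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(items: Iterable, size: int) -> Iterator[List]:
    """Yield consecutive lists of up to `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def pack_events(lines: Iterable[str], lines_per_event: int = 200) -> Iterator[bytes]:
    """Join several log lines into one newline-delimited event payload.

    The payload format is an assumption for illustration; the thread
    describes serialized Avro objects as the eventual payload.
    """
    for group in chunked(lines, lines_per_event):
        yield "\n".join(group).encode("utf-8")


def make_batches(
    lines: Iterable[str],
    lines_per_event: int = 200,
    events_per_batch: int = 50,
) -> Iterator[List[bytes]]:
    """Group packed payloads into batches; each batch is one send and one ack."""
    return chunked(pack_events(lines, lines_per_event), events_per_batch)


if __name__ == "__main__":
    # 25,000 simulated lines -> 125 events -> batches of 50, 50 and 25 events.
    simulated = (f"simulated reading {i}" for i in range(25_000))
    batches = list(make_batches(simulated))
    print(len(batches), [len(b) for b in batches])  # prints: 3 [50, 50, 25]
```

With the defaults, a full batch carries 10,000 lines per round trip, which is the figure in point 2. Note that larger batches also mean more lines are re-sent if a batch is retried, which ties in with the duplicates caveat in point 3.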
