Increase your batch sizes
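As a rough, hedged sketch of what a bigger batch looks like on the sending side, here is a minimal example using Flume's Java Avro RPC client instead of the Ruby Thrift script from the gist quoted below; the hostname, port, batch size, and class name are illustrative assumptions, not values from this thread.

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class BatchedSender {
    public static void main(String[] args) throws Exception {
        int batchSize = 1000;  // much larger than the 10-event batches in the gist
        // Connect to an Avro source assumed to be listening on localhost:4141.
        RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 4141, batchSize);
        try {
            List<Event> batch = new ArrayList<>(batchSize);
            for (int i = 0; i < 100_000; i++) {
                batch.add(EventBuilder.withBody("message " + i, StandardCharsets.UTF_8));
                if (batch.size() == batchSize) {
                    client.appendBatch(batch);  // one RPC / channel transaction per batch
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                client.appendBatch(batch);  // flush the final partial batch
            }
        } finally {
            client.close();
        }
    }
}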
On Thu, Mar 27, 2014 at 12:29 PM, Chris Schneider <[email protected]> wrote:

> Thanks for all the great replies.
>
> My specific situation is a bit more complex than I let on initially.
>
> Flume running multiple agents will absolutely be able to scale to the size we need for production. But since our system is time-based, waiting for real-world measurements to arrive, we have a simulation layer producing convincing real data to push in for development & demos (i.e., creating events at 1000x accelerated time, so we can see the effects of our changes without waiting weeks).
>
> So we have a VM (Vagrant + VirtualBox) running HDFS & Flume on our laptops as we're doing development. I suppose the memory channel is fine in this case, since it's all test data, but maximum single-agent speed is needed to support the higher time accelerations I want.
>
> Unfortunately, our production system demands a horizontally scaling system (Flume is great for that), while our dev environment would be best served by a vertically scaling system (not as much Flume's goal, from what I can tell).
>
> Are there any tricks / tweaks that can get single-agent speeds up? What's the fastest (maybe not 100% safe?) source type? Can we minimize the cost of ACKing messages in the source?
>
> On Thu, Mar 27, 2014 at 12:10 PM, Mike Keane <[email protected]> wrote:
>
>> I tried to do a proof of concept with the netcat source on 1.3.0 or 1.3.1 and it failed miserably. I was able to make a change to improve its performance, arguably a bug fix (I think it was expecting a socket acknowledgement), but the netcat source was still my bottleneck.
>>
>> Have you read the blog posts on performance tuning? I'm not sure where you are in your Flume implementation, but I found them helpful:
>> https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1 and
>> https://blogs.apache.org/flume/entry/apache_flume_filechannel
>>
>> Since you need persistent storage, I believe your only option is still the file channel. To get the performance you need, you'll need dedicated disks for the queue and the write-ahead log. I had good luck with a solid-state drive; with a single disk drive, performance was awful.
>>
>> To get the throughput I wanted with compression, I had one source tied to 6 file channels with compression on each channel. Perhaps there is a better way, but that is how I got it working.
>>
>> We also configured forced write-back on the CentOS boxes serving as Flume agents. That was an optimization our IT Operations team made that helped throughput. It's a skill I don't have, but I believe it does put you at risk of data loss if the server fails, because it does more caching before flushing to disk.
>>
>> We are currently fluming between 40 and 50 billion log lines per day (10-12 TB of data) from 14 servers in the "collector tier", sinking the data to 8 servers in the "storage tier" that write to HDFS (MapR's implementation) without problems. We had no problem with half the servers; however, we configured failover and paired up the servers for this purpose, which, by the way, works flawlessly: we're able to pull one server out for maintenance and add it back in with no problem.
>>
>> Here are some high-level points about our implementation:
>>
>> 1. Instead of the netcat source I made use of the Embedded Agent. When I created an event for Flume (EventBuilder.withBody(payload, hdrs)), I put a configurable number of log lines in the payload, usually 200 lines of log data. Ultimately I went away from text data altogether and serialized 200 Avro "log objects" as an Avro data file byte array, and that was my payload.
>>
>> 2. Keep your batch size large. I set mine to 50 events, so 10,000 log lines (or objects) go in a single batch.
>>
>> 3. You will get duplicates, so be prepared to either customize Flume to prevent duplicates (our solution) or write MapReduce jobs to remove them.
>>
>> Regards,
>>
>> Mike
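As a hedged sketch of the embedded-agent approach in points 1 and 2 above: pack a couple hundred log lines into each event body and hand the events to the agent in batches. The configuration keys follow Flume's embedded agent, but the hostnames, sizes, and class name are illustrative assumptions, not Mike's actual setup (his payloads were Avro data file byte arrays rather than plain text).

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flume.Event;
import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class BatchedEmbeddedAgent {
    private static final int LINES_PER_EVENT = 200;  // log lines packed into one event body
    private static final int EVENTS_PER_BATCH = 50;  // events handed to putAll() at a time

    public static void main(String[] args) throws Exception {
        // Minimal embedded-agent config: one memory channel feeding one Avro sink.
        Map<String, String> conf = new HashMap<>();
        conf.put("channel.type", "memory");
        conf.put("channel.capacity", "100000");
        conf.put("sinks", "sink1");
        conf.put("sink1.type", "avro");
        conf.put("sink1.hostname", "collector-host");  // illustrative hostname
        conf.put("sink1.port", "4141");
        conf.put("processor.type", "default");

        EmbeddedAgent agent = new EmbeddedAgent("demo-agent");
        agent.configure(conf);
        agent.start();

        List<Event> batch = new ArrayList<>(EVENTS_PER_BATCH);
        StringBuilder payload = new StringBuilder();
        for (int line = 0, packed = 0; line < 1_000_000; line++) {
            payload.append("log line ").append(line).append('\n');
            if (++packed == LINES_PER_EVENT) {
                Map<String, String> hdrs = new HashMap<>();
                hdrs.put("lineCount", String.valueOf(packed));
                batch.add(EventBuilder.withBody(
                        payload.toString().getBytes(StandardCharsets.UTF_8), hdrs));
                payload.setLength(0);
                packed = 0;
                if (batch.size() == EVENTS_PER_BATCH) {
                    agent.putAll(batch);  // one channel transaction for the whole batch
                    batch.clear();
                }
            }
        }
        if (!batch.isEmpty()) {
            agent.putAll(batch);
        }
        agent.stop();
    }
}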
>> ________________________________
>> From: Andrew Ehrlich [[email protected]]
>> Sent: Thursday, March 27, 2014 1:07 PM
>> To: [email protected]
>> Subject: Re: Fastest way to get data into flume?
>>
>> What about having more than one Flume agent?
>>
>> You could have two agents that read the small messages and sink to HDFS, or two agents that read the messages, serialize them, and send them to a third agent which sinks them into HDFS.
>>
>> On Thu, Mar 27, 2014 at 9:43 AM, Chris Schneider <[email protected]> wrote:
>>
>> I have a fair bit of data continually being created in the form of smallish messages (a few hundred bytes), which needs to enter Flume and eventually sink into HDFS.
>>
>> I need to be sure that the data lands in persistent storage and won't be lost, but otherwise throughput isn't important. It just needs to be fast enough to not back up.
>>
>> I'm running into a bottleneck in the initial ingestion of data.
>>
>> I've tried the netcat source and the thrift source, but both have capped out at a thousand or so records per second.
>>
>> Batching up the Thrift API items into sets of 10 and using appendBatch is a pretty large speedup, but still not enough.
>>
>> Here's a gist of my Ruby test script, some example runs, and my config:
>>
>> https://gist.github.com/cschneid/9792305
>>
>> 1. Are there any obvious performance changes I can make to speed up ingestion?
>> 2. How fast can Flume reasonably go? Should I switch my source to something else that's faster? What?
>> 3. Is there a better tool for this kind of task (rapid, safe ingestion of small messages)?
>>
>> Thanks!
>> Chris
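Pulling the thread's suggestions together, here is a hedged sketch of what a single-agent config along these lines might look like: an Avro source, a file channel with its checkpoint and data directories on separate (ideally dedicated or SSD) disks, and large batch sizes end to end. The agent name, paths, ports, and capacities are illustrative assumptions, not taken from the gist or from anyone's production setup.

a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Avro source so clients can use appendBatch with large batches
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

# File channel with checkpoint and data dirs on separate disks
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /mnt/disk1/flume/checkpoint
a1.channels.c1.dataDirs = /mnt/disk2/flume/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 10000

# HDFS sink with a large batch size and size/count rolling disabled
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /flume/events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.batchSize = 10000
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0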
