Hey guys, I'm trying to figure out an approach to what I personally feel is a bit of an abuse of the system. I'd appreciate your input, and perhaps we can come up with a solution to share with others.

Short version, see below for detail:
Lots of files are being written simultaneously. The number of HDFS connections from each data path is roughly (files per path) * (sinks per path), so with a lot of files, even going from 2 to 3 HDFS sinks doesn't scale throughput, even though nothing (Flume aggregators or HDFS datanodes) is fully loaded. Experimenting with alternate configs (increasing the number of data paths, thus reducing files per path, and giving each path 2 sinks) has shown that the number of open connections is what's causing this, but what exactly might be the bottleneck, and is there a way to get rid of it?
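Rough numbers, assuming the ~2,500 hourly files mentioned below spread evenly across paths and that every sink on a path ends up with an open writer for every active file on that path:

  single path, 3 sinks:   2,500 files x 3 sinks        ~ 7,500 open HDFS writers
  5 paths, 2 sinks each:  5 x (~500 files x 2 sinks)   ~ 5,000 open HDFS writers

So adding a sink to a single path multiplies the connection count for every file, while adding paths splits the files up first.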

Long version:
We've lately been running into some interesting issues where performance seems to be capped by the number of HDFS connections that are open. This is due to the very large number of different files being streamed simultaneously.

I have serious doubts about the viability of the approach and have suggested storing files together and post-processing, but at the moment that doesn't appear to be an option, so I'm looking for alternative ways to scale throughput.

Some basic stats: on each aggregator node we have roughly 2,500 files being written to every hour, and a bit over 25,000,000 lines in that period of time. Approx 10k events per second. There are multiple aggregators writing to the same HDFS, though each one writes to its own set of files.
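To put that in per-file terms (just dividing the numbers above): 25,000,000 lines / 2,500 files is about 10,000 lines per file per hour, i.e. roughly 3 events/sec per file, so each individual file's write pipeline sits nearly idle even though the aggregate rate is high.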

Originally we were scaling individual aggregator nodes by increasing the HDFS sink count, but this wasn't giving any increase beyond the second sink, despite the results from Mike Percy's tests (https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Syslog+Performance+Test+2012-04-30). By increasing sinks we were also increasing the number of HDFS connections for every single file, hitting some kind of bottleneck (datanode transfer threads? Each datanode has 4k, and there are a lot more datanodes than aggregators, so there should be plenty available).

Afterwards, splitting the incoming data into multiple paths (separate sources and sinks) allowed each path to have two sinks and thus scaled our throughput by increasing the total number of sinks without increasing the connections per file (because each sink only handles a fraction of the file paths). Now a single node has 5 Avro sources, each with a file channel and 2 sinks attached. With all of the transfer threads (resulting from all the different files), each node has about 4,000 threads running, though at peak it can be double that.
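For reference, here's a stripped-down sketch of that layout (agent/source/sink names, ports and the header-based HDFS path are placeholders, not our actual config); only two of the five source/channel groups are shown:

# Sketch of the split-path layout; five of these source+channel groups
# exist per node, two shown here.
agg1.sources  = src1 src2
agg1.channels = ch1 ch2
agg1.sinks    = hdfs1a hdfs1b hdfs2a hdfs2b

agg1.sources.src1.type = avro
agg1.sources.src1.bind = 0.0.0.0
agg1.sources.src1.port = 4141
agg1.sources.src1.channels = ch1

agg1.channels.ch1.type = file
agg1.channels.ch1.checkpointDir = /flume/ch1/checkpoint
agg1.channels.ch1.dataDirs = /flume/ch1/data

# Two HDFS sinks drain ch1, so they only ever open writers for the
# subset of file paths that arrives on src1.
agg1.sinks.hdfs1a.type = hdfs
agg1.sinks.hdfs1a.channel = ch1
agg1.sinks.hdfs1a.hdfs.path = hdfs://namenode/flume/%{category}
agg1.sinks.hdfs1b.type = hdfs
agg1.sinks.hdfs1b.channel = ch1
agg1.sinks.hdfs1b.hdfs.path = hdfs://namenode/flume/%{category}

# src2/ch2/hdfs2a/hdfs2b follow the same pattern on a different port.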

So we have a rough idea of what is wrong (too many connections hitting some kind of bottleneck) but can't pin down the exact cause (neither the aggregators nor the datanodes are at full load, and the only blocked/waiting threads in Flume are those waiting on HDFS). I personally feel we just need to reduce the number of files being simultaneously written, since HDFS isn't really made to deal with so many small files, and batches aren't being processed efficiently (each commit waits on dozens, possibly hundreds, of small transfers). That being said, can anyone provide specific insight into what may cause the bottleneck at high connection counts, and whether there are ways around it (other than reducing file counts and the proportion of files handled by each sink, which I'm already pushing for)?
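For context, the per-sink HDFS settings that seem most relevant to the open-file/batching side look like this (illustrative values only, not what we actually run):

# Illustrative HDFS sink tuning (per sink); values are examples only.
agg1.sinks.hdfs1a.hdfs.batchSize       = 1000   # events written per flush/commit
agg1.sinks.hdfs1a.hdfs.idleTimeout     = 60     # close files idle for 60s, freeing connections
agg1.sinks.hdfs1a.hdfs.callTimeout     = 30000  # ms allowed for HDFS open/write/flush/close calls
agg1.sinks.hdfs1a.hdfs.threadsPoolSize = 10     # IO worker threads per sink for HDFS operations
agg1.sinks.hdfs1a.hdfs.rollInterval    = 3600   # roll files hourly
agg1.sinks.hdfs1a.hdfs.rollSize        = 0      # disable size-based rolling
agg1.sinks.hdfs1a.hdfs.rollCount       = 0      # disable count-based rolling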
