Hey guys, I'm trying to figure out an approach to what I personally feel is a bit of an abuse of the system. I'd appreciate your input, and perhaps we can come up with a solution to share with others.

Short version, see below for detail:
Lots of files are being written simultaneously. The number of HDFS connections from each data path is roughly (files per path) * (sinks per path), so with a lot of files, even going from 2 to 3 HDFS sinks doesn't scale throughput, even though nothing (Flume aggregators or HDFS datanodes) is fully loaded. Experimenting with alternate configs (increasing the number of data paths, thus reducing files per path, and giving each path 2 sinks) has shown that the number of open connections is what's causing this, but what exactly might be the bottleneck, and is there a way to get rid of it?
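Rough numbers, assuming the ~2,500 hourly files mentioned below spread evenly across paths and that every sink on a path ends up with an open writer for every active file on that path:

  single path, 3 sinks:   2,500 files x 3 sinks        ~ 7,500 open HDFS writers
  5 paths, 2 sinks each:  5 x (~500 files x 2 sinks)   ~ 5,000 open HDFS writers

So adding a sink to a single path multiplies the connection count for every file, while adding paths splits the files up first.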

Long version:
We've lately been running into some interesting issues where performance seems to be capped by the number of HDFS connections that are open. This is due to the very large number of different files being streamed simultaneously.

I have serious doubts about the viability of the approach and have suggested storing files together and post-processing, but at the moment that doesn't appear to be an option, so I'm looking for alternative ways to scale throughput.

Some basic stats: on each aggregator node we have roughly 2,500 files being written to every hour, and a bit over 25,000,000 lines in that period of time. Approx 10k events per second. There are multiple aggregators writing to the same HDFS, though each one writes to its own set of files.
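To put that in per-file terms (just dividing the numbers above): 25,000,000 lines / 2,500 files is about 10,000 lines per file per hour, i.e. roughly 3 events/sec per file, so each individual file's write pipeline sits nearly idle even though the aggregate rate is high.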

Originally we were scaling individual aggregator nodes by increasing the HDFS sink count, but this wasn't giving any increase beyond the second sink, despite the results from Mike Percy's tests (https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Syslog+Performance+Test+2012-04-30). By increasing sinks we were also increasing the number of HDFS connections for every single file, hitting some kind of bottleneck (datanode transfer threads? Each datanode has 4k, and there are a lot more datanodes than aggregators, so there should be plenty available).

Afterwards, splitting the incoming data into multiple paths (separate sources and sinks) allowed each path to have two sinks and thus scaled our throughput by increasing the total number of sinks without increasing the connections per file (because each sink only handles a fraction of the file paths). Now a single node has 5 Avro sources, each with a file channel and 2 sinks attached. With all of the transfer threads (resulting from all the different files), each node has about 4,000 threads running, though at peak it can be double that.
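For reference, here's a stripped-down sketch of that layout (agent/source/sink names, ports and the header-based HDFS path are placeholders, not our actual config); only two of the five source/channel groups are shown:

# Sketch of the split-path layout; five of these source+channel groups
# exist per node, two shown here.
agg1.sources  = src1 src2
agg1.channels = ch1 ch2
agg1.sinks    = hdfs1a hdfs1b hdfs2a hdfs2b

agg1.sources.src1.type = avro
agg1.sources.src1.bind = 0.0.0.0
agg1.sources.src1.port = 4141
agg1.sources.src1.channels = ch1

agg1.channels.ch1.type = file
agg1.channels.ch1.checkpointDir = /flume/ch1/checkpoint
agg1.channels.ch1.dataDirs = /flume/ch1/data

# Two HDFS sinks drain ch1, so they only ever open writers for the
# subset of file paths that arrives on src1.
agg1.sinks.hdfs1a.type = hdfs
agg1.sinks.hdfs1a.channel = ch1
agg1.sinks.hdfs1a.hdfs.path = hdfs://namenode/flume/%{category}
agg1.sinks.hdfs1b.type = hdfs
agg1.sinks.hdfs1b.channel = ch1
agg1.sinks.hdfs1b.hdfs.path = hdfs://namenode/flume/%{category}

# src2/ch2/hdfs2a/hdfs2b follow the same pattern on a different port.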

So we have a rough idea of what is wrong (too many connections hitting some kind of bottleneck) but can't pin down the exact cause (neither the aggregators nor the datanodes are at full load, and the only blocked/waiting threads in Flume are those waiting on HDFS). I personally feel we just need to reduce the number of files being simultaneously written, since HDFS isn't really made to deal with so many small files, and batches aren't being processed efficiently (each commit waits on dozens, possibly hundreds, of small transfers). That being said, can anyone provide specific insight into what may cause the bottleneck at high connection counts, and whether there are ways around it (other than reducing file counts and the proportion of files handled by each sink, which I'm already pushing for)?
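For context, the per-sink HDFS settings that seem most relevant to the open-file/batching side look like this (illustrative values only, not what we actually run):

# Illustrative HDFS sink tuning (per sink); values are examples only.
agg1.sinks.hdfs1a.hdfs.batchSize       = 1000   # events written per flush/commit
agg1.sinks.hdfs1a.hdfs.idleTimeout     = 60     # close files idle for 60s, freeing connections
agg1.sinks.hdfs1a.hdfs.callTimeout     = 30000  # ms allowed for HDFS open/write/flush/close calls
agg1.sinks.hdfs1a.hdfs.threadsPoolSize = 10     # IO worker threads per sink for HDFS operations
agg1.sinks.hdfs1a.hdfs.rollInterval    = 3600   # roll files hourly
agg1.sinks.hdfs1a.hdfs.rollSize        = 0      # disable size-based rolling
agg1.sinks.hdfs1a.hdfs.rollCount       = 0      # disable count-based rolling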
