Hello, we're running some tests with Kafka and Flume.
We have four Kafka and Flume instances installed, and 8 DataNodes installed on other machines. We have developed an injector that writes to Kafka, and we want to read the messages with Flume. We have been trying these two configurations:

  1. Injector --> Kafka --> Flume Source --> Memory Channel --> HDFS Sink
  2. Injector --> Kafka Channel --> HDFS Sink

We start Flume once our injector has finished injecting 1M messages of 1024 bytes each, and we measure how many messages are processed per second. That is, we measure the time from reading the messages from Kafka until they are written to HDFS.

Kafka --> Flume Source --> Memory Channel --> HDFS Sink

  A. 1 agent, one topic with 4 partitions: 1 min 53 s, 8849 msg/s
  B. 1 agent, one topic with 8 partitions: 1 min 47 s, 9345 msg/s
  C. 4 agents, one topic with 4 partitions, one agent per partition: 1 min 12 s, 13888 msg/s
  D. 4 agents, one topic with 8 partitions, one agent for every two partitions: 46 s, 21739 msg/s
  E. 4 agents, one topic with 12 partitions, one agent for every three partitions: 50 s, 20000 msg/s

Kafka Channel --> HDFS Sink

  F. 1 agent, one topic with one partition: 2 min 50 s, 5882 msg/s
  G. 1 agent, one topic with 4 partitions: 3 min, 5555 msg/s
  H. 4 agents, 4 partitions, one agent per partition (Kafka channel, no source): 46 s, 21739 msg/s
  K. 4 agents, 8 partitions, one agent for every two partitions (Kafka channel, no source): 69 s, 14925 msg/s

I'm confused by H and K. I guess the sink is single-threaded, so you need at least as many HDFS sinks as there are partitions in Kafka. That would explain why H is about four times faster than G.

The difference between D and K is also strange. Could someone tell me the reason? Is the KafkaSource single-threaded?

On the other hand, the number of messages per second seems pretty low. We'll try tuning Flume with a bigger batchSize and other parameters to improve performance. Any advice about that? I have also thought of trying the Null Sink to isolate Flume from HDFS.
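For anyone who wants to reproduce the msg/s figures, they are just the 1M injected messages divided by the elapsed time, truncated to an integer. Here is a minimal Python sketch of that arithmetic; the function name `throughput` and the `RUNS` table are only illustrative names, and only a few of the runs above are included.

```python
MESSAGES = 1_000_000  # messages injected before Flume is started


def throughput(messages: int, minutes: int = 0, seconds: int = 0) -> int:
    """Return whole messages per second for a run lasting minutes + seconds."""
    elapsed = minutes * 60 + seconds
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    # Integer division truncates, as in the figures quoted in the post.
    return messages // elapsed


# (minutes, seconds) for a few of the runs listed in the post.
RUNS = {
    "A": (1, 53),
    "C": (1, 12),
    "D": (0, 46),
    "F": (2, 50),
    "G": (3, 0),
}

for label, (minutes, seconds) in RUNS.items():
    print(f"{label}: {throughput(MESSAGES, minutes, seconds)} msg/s")
```

The same function can be reused for later runs, for example after changing batchSize or swapping in the Null Sink, so the numbers stay comparable.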
