Hi, I have Kafka brokers and HDFS DataNodes running on different machines. I want to use Flume to pull the data from Kafka and store it in HDFS. What is the best architecture? Assume that all the machines have network access to each other.
Scenario 1: Cluster1 (Kafka + Flume) ---> Cluster2 (HDFS)
There is a Flume agent on each machine where Kafka is installed, and its sink writes to HDFS directly. Compression options, etc., can be configured on the sink (see the first sketch below).

Scenario 2: Cluster1 (Kafka + Flume + Avro) ---> Cluster2 (Flume + Avro + HDFS)
There is a Flume agent on each machine where Kafka is installed. Each agent sends data over Avro to a second Flume agent installed on the DataNode, which writes the data to HDFS (see the second sketch below).

Scenario 3: Cluster1 (Kafka) ---> Cluster2 (Flume + HDFS)
Flume is installed only on the DataNodes.

I would rather not install Flume on the DataNodes, because those machines already run processes such as Spark, Hive, Impala, and MapReduce, and those tasks consume a lot of their resources. On the other hand, that is where the data has to end up. I could also configure more than one source to read from Kafka, and run more than one Flume agent so the load is spread across more than one VM. Could someone comment on the advantages and disadvantages they see in each scenario?
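For reference, here is roughly what the agent for scenario 1 could look like: a Kafka source feeding an HDFS sink with compression enabled. This is a minimal sketch assuming Flume 1.7+ (which uses kafka.bootstrap.servers); the broker list, topic name, NameNode address, and paths are placeholders:

    # One agent per Kafka machine: Kafka source -> channel -> HDFS sink
    tier1.sources  = kafka-src
    tier1.channels = ch1
    tier1.sinks    = hdfs-sink

    tier1.sources.kafka-src.type = org.apache.flume.source.kafka.KafkaSource
    tier1.sources.kafka-src.kafka.bootstrap.servers = kafka1:9092,kafka2:9092
    tier1.sources.kafka-src.kafka.topics = my-topic
    tier1.sources.kafka-src.channels = ch1

    # Memory channel for brevity; a file channel survives agent restarts
    tier1.channels.ch1.type = memory
    tier1.channels.ch1.capacity = 10000

    tier1.sinks.hdfs-sink.type = hdfs
    tier1.sinks.hdfs-sink.channel = ch1
    tier1.sinks.hdfs-sink.hdfs.path = hdfs://namenode:8020/flume/%Y-%m-%d
    tier1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
    # Compression is configured on the sink, as mentioned above
    tier1.sinks.hdfs-sink.hdfs.fileType = CompressedStream
    tier1.sinks.hdfs-sink.hdfs.codeC = snappy

Scenario 2 splits this into two agents connected by an Avro hop. Again just a sketch, with a placeholder collector host and port:

    # Agent on each Kafka machine: same Kafka source as above, Avro sink instead
    a1.sinks.avro-sink.type = avro
    a1.sinks.avro-sink.channel = ch1
    a1.sinks.avro-sink.hostname = collector-host
    a1.sinks.avro-sink.port = 4545

    # Agent on the HDFS side: Avro source feeding the HDFS sink shown above
    a2.sources.avro-src.type = avro
    a2.sources.avro-src.bind = 0.0.0.0
    a2.sources.avro-src.port = 4545
    a2.sources.avro-src.channels = ch1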
