Hi, I am trying to set up Flume for high availability. Rsyslog sends the same feed to two different servers, s1 and s2. On both servers, Flume agents are configured to listen for the feed from rsyslog, and both agents write it to HDFS. What I end up with in HDFS is separate files with duplicated content.
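
For reference, here is a minimal sketch of what the agent on each server looks like (the agent name, port, and HDFS path are placeholders, not my exact config):

  a1.sources = r1
  a1.channels = c1
  a1.sinks = k1

  # listen for syslog over TCP (port is a placeholder)
  a1.sources.r1.type = syslogtcp
  a1.sources.r1.host = 0.0.0.0
  a1.sources.r1.port = 5140
  a1.sources.r1.channels = c1

  a1.channels.c1.type = memory
  a1.channels.c1.capacity = 10000

  # write events to a dated HDFS directory (path is a placeholder)
  a1.sinks.k1.type = hdfs
  a1.sinks.k1.channel = c1
  a1.sinks.k1.hdfs.path = hdfs://namenode/flume/syslog/%Y-%m-%d
  a1.sinks.k1.hdfs.fileType = DataStream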
Is there a best-practice architecture for using Flume in a situation like this? What I am trying to achieve is failover: syslog is forwarded to two servers so that when one server is down, at least one agent can still transport events to HDFS, but without every event being written twice.
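
One idea I am considering is to do the failover on the rsyslog side instead of duplicating the feed, roughly like this (legacy rsyslog syntax; hostnames and port are placeholders, and I have not tested this):

  # forward everything to s1
  *.* @@s1.example.com:5140
  # only fire the next action when the previous one (s1) is suspended
  $ActionExecOnlyWhenPreviousIsSuspended on
  & @@s2.example.com:5140
  $ActionExecOnlyWhenPreviousIsSuspended off

That way s2 only receives events while s1 is unreachable, so duplicates should be limited to the switchover window. Is that a reasonable approach, or is there a better way to do this inside Flume itself?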
At the moment my fallback plan is to keep both feeds and clean out the duplicates after some time, before Hive starts using the directory.
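
For the cleanup step, I was thinking of something along these lines in Hive (table, column, and partition names are made up for illustration):

  -- rewrite a partition keeping only distinct rows
  INSERT OVERWRITE TABLE syslog_clean PARTITION (dt='2014-03-20')
  SELECT DISTINCT line FROM syslog_raw WHERE dt='2014-03-20';

The caveat being that genuinely repeated log lines would get collapsed as well, which is why I would prefer to avoid the duplicates in the first place.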
--
Margus (margusja) Roo
http://margus.roo.ee
skype: margusja
+372 51 48 780
