We use Flume 1.4 to pass logs into HDFS as well as ElasticSearch for storage. The pipeline looks roughly like this:

Client to Server Flow...
  (local_app -> local_host_flume_agent) ---- AVRO/SSL ----> (remote_flume_agent)...

Agent Server Flow ...
  (inbound avro -> FC1 -> ElasticSearch)
  (inbound avro -> FC2 -> S3/HDFS)

In the last week we've made a few changes, and now we're seeing a problem. We've seen 3 separate occurrences of a single Flume agent server node beginning to back up its FC1 channel indefinitely until we log in and restart Flume entirely. The data just stops flowing, and we can't find any errors in the logs on either the ES or the Flume side. A simple restart of Flume fixes it.

Our sink config looks like this:

> agent.sinks.elasticsearch.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
> agent.sinks.elasticsearch.hostNames = xxx:9300
> agent.sinks.elasticsearch.indexName = flume
> agent.sinks.elasticsearch.clusterName = flume-elasticsearch-production-useast1
> agent.sinks.elasticsearch.batchSize = 1000
> agent.sinks.elasticsearch.ttl = 30
> agent.sinks.elasticsearch.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer
> agent.sinks.elasticsearch.channel = fc-unstructured-es

This ONLY happens at midnight, and only on one Flume server at a time. I'm wondering whether it has to do with the time it takes our ES nodes to create a new index. Could the first Flume agent that triggers index creation be getting blocked or stuck?

Matt Wise
Sr. Systems Architect
Nextdoor.com
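
P.S. One way to test the index-creation theory would be to create the next day's index shortly before midnight and see whether the stall still occurs. Below is a minimal Python sketch of that idea, not something we run today. It assumes the sink writes to a per-day index named `<indexName>-yyyy-MM-dd` using UTC dates (worth checking against the index names actually present in the cluster), and that the ES HTTP API is reachable at a URL such as http://xxx:9200, which is hypothetical since the config above only lists the 9300 transport port. The function names are illustrative.

```python
from datetime import datetime, timedelta, timezone
import urllib.request

# Matches agent.sinks.elasticsearch.indexName in the sink config.
INDEX_PREFIX = "flume"


def daily_index_name(when, prefix=INDEX_PREFIX):
    """Return the ASSUMED per-day index name for a timezone-aware datetime.

    Assumption: the sink appends the UTC date as yyyy-MM-dd to the prefix.
    """
    return "%s-%s" % (prefix, when.astimezone(timezone.utc).strftime("%Y-%m-%d"))


def next_index_name(now, prefix=INDEX_PREFIX):
    """Return the assumed name of the index that will be needed after midnight UTC."""
    return daily_index_name(now + timedelta(days=1), prefix)


def precreate_index(es_http_url, name):
    """Create an empty index with an HTTP PUT and return the response status.

    es_http_url is a hypothetical base URL such as "http://xxx:9200".
    urlopen raises urllib.error.HTTPError if ES rejects the request,
    for example because the index already exists.
    """
    url = "%s/%s" % (es_http_url.rstrip("/"), name)
    req = urllib.request.Request(url, data=b"", method="PUT")
    with urllib.request.urlopen(req, timeout=30) as resp:
        return resp.status


if __name__ == "__main__":
    # Intended to run from cron a few minutes before midnight UTC.
    name = next_index_name(datetime.now(timezone.utc))
    print("would pre-create index:", name)
    # precreate_index("http://xxx:9200", name)  # hypothetical endpoint, left disabled
```

If the FC1 backlog stops appearing on nights when the index already exists, that would point at index creation. If it still appears, the cause is probably elsewhere.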
