We have four Flume agents with RabbitMQ sources, and we occasionally see
missed-heartbeat errors in the Flume logs when the RMQ queues are under load.
We think this is probably because RMQ is overloaded and not responding to
Flume fast enough. Is there any way to handle this from the Flume end, for
example by increasing the heartbeat timeout setting?
This happens only when RMQ is under load, and RMQ seems to stabilize after
some time. If we kill the Flume agent and restart it, it works fine and
consumes the messages.
Any input on handling this scenario? Also, the Flume agent continues to
run after this error but doesn't consume any more messages. Is there a
way to have Flume abort when this happens?
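
To illustrate the kind of abort behaviour I mean, here is a minimal sketch
using only the JDK. It is not something Flume or the RabbitMQ source
provides as far as I know: the class name, the abortingHandler helper, and
the thread-name prefix are hypothetical, and how such a handler would be
installed inside an agent is left open. Since the log line starts with
"Exception in thread", the consumer thread appears to die with an uncaught
exception, which is what the handler reacts to.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: react when a consumer thread dies with an uncaught exception.
final class ConsumerAbortSketch {

    // Returns a handler that runs abortAction when a thread whose name starts
    // with threadNamePrefix terminates because of an uncaught exception.
    static Thread.UncaughtExceptionHandler abortingHandler(String threadNamePrefix,
                                                           Runnable abortAction) {
        return (thread, error) -> {
            System.err.println("Thread " + thread.getName() + " died: " + error);
            if (thread.getName().startsWith(threadNamePrefix)) {
                abortAction.run();
            }
        };
    }

    public static void main(String[] args) throws InterruptedException {
        // In a real process the action could be () -> Runtime.getRuntime().halt(1);
        // here it only sets a flag so the demo can finish normally.
        AtomicBoolean aborted = new AtomicBoolean(false);

        Thread consumer = new Thread(() -> {
            // Stand-in for the consumer failing with a connection error.
            throw new IllegalStateException("connection error");
        }, "RabbitMQ Consumer #0");
        consumer.setUncaughtExceptionHandler(
                abortingHandler("RabbitMQ Consumer", () -> aborted.set(true)));

        consumer.start();
        consumer.join();
        System.out.println("abort requested: " + aborted.get());
    }
}
```

With an action that halts the JVM, a process supervisor could then restart
the agent, which is what we currently do by hand.
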
Thank you for any help with this. The error from the logs is below.
Suresh.
Exception in thread "RabbitMQ Consumer #0" com.rabbitmq.client.ShutdownSignalException: connection error
    at com.rabbitmq.client.QueueingConsumer.handle(QueueingConsumer.java:198)
    at com.rabbitmq.client.QueueingConsumer.nextDelivery(QueueingConsumer.java:215)
    at com.aweber.flume.source.rabbitmq.Consumer.run(Consumer.java:164)
    at java.lang.Thread.run(Thread.java:745)
Caused by: com.rabbitmq.client.MissedHeartbeatException: Heartbeat missing with heartbeat = 60 seconds
    at com.rabbitmq.client.impl.AMQConnection.handleSocketTimeout(AMQConnection.java:597)
    at com.rabbitmq.client.impl.AMQConnection.access$600(AMQConnection.java:65)
    at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:560)
    ... 1 more