Folks,

I have NiFi 1.0 deployed in a non-secure cluster across 3 nodes.
I have a flow pipeline that reads from a Kafka topic using ConsumeKafka and kicks off an ExecuteStreamCommand-mediated job based on attributes included in the notification message.

What I observe is that jobs are kicked off and complete successfully on 2 of the nodes. The 3rd node, however, never seems to make progress on any of the jobs scheduled on it. I do see that node receiving the notification messages (based on PutRiemann events posted when a message is received by ConsumeKafka), but after that there is no progress at all. The consequence is that the queue in front of the ExecuteStreamCommand processor keeps growing whenever a job is scheduled on the 'stuck' node.

I don't see anything obvious in the nifi-app logs on any of the nodes that gives me insight into what is afoot.

I figured that some state was out of sync on the stuck node and decided to restart it. When that node went down, the queue in front of the ExecuteStreamCommand processor immediately went to 0 (I happened to be watching in the UI on one of the other nodes). When that node came back up, the queue was restored to the value it had prior to the restart.

I am looking for debugging hints or ideas to help get insight into what is really going on.

Thanks,
A.B.
