Thanks for the information! I did collect some of those diagnostics files,
but nothing really jumped out at me - the system in question is not under
heavy load or doing anything exciting. Is there anything specific I should
be looking for?

I am running ZooKeeper as an external process (version 3.5.5, I think)
across three nodes (one more than the NiFi cluster, so the ensemble has an
odd number of nodes and can maintain a quorum, as per the admin guide). I
can update it if you think it could be a source of issues.
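
In case it helps, the ensemble config is essentially the stock three-node
layout from the ZooKeeper docs - roughly like the sketch below (the
hostnames and paths here are placeholders rather than my actual values):

# zoo.cfg - minimal three-node ensemble (placeholder hosts/paths)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.local:2888:3888
server.2=zk2.example.local:2888:3888
server.3=zk3.example.local:2888:3888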

On Thu, 3 Jun 2021 at 23:09, Mark Bean <[email protected]> wrote:

> In addition to Mark Payne's suggestion - which you can run and evaluate
> locally, just not upload the results - you can also look at several aspects
> of ZooKeeper. First, are you running embedded or external ZooKeeper? We
> have found that external is far more reliable, especially for "busy"
> clusters, likely because the external ZooKeeper has its own JVM and
> associated resources.
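>
> For reference, pointing NiFi at an external ensemble is mostly a matter of
> two nifi.properties entries (the hostnames below are placeholders):
>
> nifi.state.management.embedded.zookeeper.start=false
> nifi.zookeeper.connect.string=zk1:2181,zk2:2181,zk3:2181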
>
> Also, take a look at the following settings in nifi.properties. Tuning them
> is more art than science; there is no one right answer, and the values
> depend on your environment. But you can begin by experimenting with a
> reasonable number of threads and relaxing some of the timeouts (see the
> example starting values after the list).
>
> nifi.cluster.protocol.heartbeat.interval=
> nifi.cluster.node.protocol.threads=
> nifi.cluster.node.protocol.max.threads=
> nifi.cluster.node.connection.timeout=
> nifi.cluster.node.read.timeout=
> nifi.cluster.node.max.concurrent.requests=
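>
> Purely as an illustrative starting point (these values are a guess, not a
> recommendation - adjust them for your environment), something like:
>
> nifi.cluster.protocol.heartbeat.interval=15 sec
> nifi.cluster.node.protocol.threads=20
> nifi.cluster.node.protocol.max.threads=50
> nifi.cluster.node.connection.timeout=30 sec
> nifi.cluster.node.read.timeout=30 sec
> nifi.cluster.node.max.concurrent.requests=100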
>
> -Mark
>
> On Wed, Jun 2, 2021 at 8:30 PM Phil H <[email protected]> wrote:
>
> > Thanks for getting back to me Mark.
> >
> > Unfortunately it is running on an intranet so I can’t get logs and so
> > forth off the system. Is there anything in particular I can look out for?
> >
> > I am running ZooKeeper as a separate service (not embedded) on three
> > nodes. The NiFi cluster currently has two nodes.
> >
> > Cheers,
> > Phil
> >
> > On Thu, 3 Jun 2021 at 10:17, Mark Payne <[email protected]> wrote:
> >
> > > Hey Phil,
> > >
> > > Can you grab a diagnostics dump from one of the nodes (preferably the
> > > cluster coordinator)? Ideally grab 3 of them, with about 5 mins in
> > > between.
> > >
> > > To do that, run:
> > >
> > > bin/nifi.sh diagnostics <filename>
> > >
> > > So run something like:
> > >
> > > bin/nifi.sh diagnostics diagnostics1.txt
> > > <wait 3-5 mins>
> > > bin/nifi.sh diagnostics diagnostics2.txt
> > > <wait 3-5 mins>
> > > bin/nifi.sh diagnostics diagnostics3.txt
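> > >
> > > Or, if it's easier, a rough shell sketch to grab all three in one pass
> > > (assuming bash, run from the NiFi install directory; the sleep is just
> > > the ~5 minute gap):
> > >
> > > for i in 1 2 3; do
> > >   bin/nifi.sh diagnostics "diagnostics${i}.txt"   # dump from this node
> > >   [ "$i" -lt 3 ] && sleep 300                     # wait ~5 mins between dumps
> > > done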
> > >
> > > And then upload those diagnostics text files?
> > > They should not contain any sensitive information, aside from maybe
> > > file paths (which most don’t consider sensitive, but you may). I'd
> > > recommend you glance through them to make sure you leave out any
> > > sensitive information.
> > >
> > > Those dumps should help in understanding the problem, or at least
> > > zeroing in on it.
> > >
> > > Also, is NiFi using its own dedicated ZooKeeper, or is it shared with
> > > other services? How many nodes are in the ZooKeeper ensemble?
> > >
> > >
> > >
> > > > On Jun 2, 2021, at 7:54 PM, Phil H <[email protected]> wrote:
> > > >
> > > > Hi there,
> > > >
> > > > I am getting a lot of these both in the web interface to my servers,
> > > > and in the cluster communication between the nodes. All other aspects
> > > > of the servers are fine. TCP connections to NiFi, as well as SSH
> > > > connections to the servers, are stable (running for days at a time).
> > > > I’m lucky if I go 5 minutes without the web UI dropping out or a
> > > > cluster re-election due to a heartbeat being missed.
> > > >
> > > > Running 1.13.2, recently upgraded from 1.9.2. I was getting the same
> > > > issues with the old version, but they seem to be much more frequent
> > > > now.
> > > >
> > > > Help!
> > >
> > >
> >
>
