Denis,

>> The node availability check is based on the fact that each node receives
>> fresh metrics once in metricsUpdateFreq ms.

I see the problem here.
Node availability should be checked using a small and fast ping message
instead of the huge and slow metrics message.
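
Roughly what I have in mind (just an illustration; the class and the wire
format below are made up, this is not the actual TcpDiscoverySpi ping):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.net.Socket;

    /** Hypothetical sketch: liveness is verified with a tiny ping/pong exchange. */
    class NodePinger {
        private static final int PING = 1;
        private static final int PONG = 2;

        boolean isAlive(InetSocketAddress addr, int timeoutMs) {
            try (Socket sock = new Socket()) {
                sock.connect(addr, timeoutMs);
                sock.setSoTimeout(timeoutMs);

                sock.getOutputStream().write(PING);          // single-byte request
                sock.getOutputStream().flush();

                return sock.getInputStream().read() == PONG; // single-byte response expected
            }
            catch (IOException ignored) {
                return false; // no answer within the timeout -> treat the node as unreachable
            }
        }
    }

Such a message is cheap to serialize and to process, so it stays meaningful
even when heavy metrics messages pile up in the queue.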

On Wed, Jan 30, 2019 at 4:08 PM Denis Mekhanikov <dmekhani...@gmail.com>
wrote:

> Yakov,
>
> > You can put a hard limit and process an enqueued MetricsUpdate message
> > if the last one of the kind was processed more than metricsUpdFreq
> > milliseconds ago.
> Makes sense. I'll try implementing it.
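>
> Roughly, the rule I'm going to try looks like this (field and method names
> are hypothetical, this is not the actual ServerImpl code):
>
>     private long lastMetricsProcessedTs;
>
>     boolean shouldProcess(TcpDiscoveryAbstractMessage msg, long metricsUpdateFreq) {
>         if (!(msg instanceof TcpDiscoveryMetricsUpdateMessage))
>             return true; // only metrics updates are subject to skipping
>
>         long now = System.currentTimeMillis();
>
>         // Drop the duplicate only if a metrics update was processed recently;
>         // if the last one is older than metricsUpdateFreq, process this one
>         // even when the queue is flooded.
>         if (now - lastMetricsProcessedTs < metricsUpdateFreq)
>             return false;
>
>         lastMetricsProcessedTs = now;
>
>         return true;
>     }
>
> This keeps the check O(1) per message, without scanning the queue.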
>
> > I would suggest we allow queue overflow for 1 minute, but if the situation
> > does not go back to normal, then the node should fire a special event and
> > then kill itself.
> Let's start with a warning in the log and see how such situations correlate
> with network/GC problems.
> I'd like to make sure we don't kill innocent nodes.
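>
> As a first step, something like this should be enough (the threshold and the
> names below are made up, just to show the shape of the check):
>
>     // Hypothetical watchdog: warn if the discovery worker queue stays above
>     // the threshold for more than a minute. Firing an event and stopping the
>     // node can be added later, once we see how often this actually happens.
>     private long overflowSince = -1;
>
>     void checkQueueSize(int queueSize, int threshold, IgniteLogger log) {
>         if (queueSize <= threshold) {
>             overflowSince = -1; // back to normal
>             return;
>         }
>
>         long now = System.currentTimeMillis();
>
>         if (overflowSince == -1)
>             overflowSince = now;
>         else if (now - overflowSince > 60_000)
>             log.warning("Discovery worker queue has been overflowed for more than " +
>                 "1 minute [size=" + queueSize + ", threshold=" + threshold + ']');
>     }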
>
> Anton,
>
> > Maybe a better option is to have a special "discovery-like" channel (with
> > a ring or an analog) for metrics-like messages
> I don't think that creating another data channel is reasonable. It would
> require additional network connections and a more complex configuration.
> But splitting pings and metrics into different types of messages, as it was
> before, and moving metrics distribution to the communication layer
> makes sense to me. Some kind of gossip protocol could be used for it.
>
> > Anyway, why are we fighting with duplicates inside the queue instead of
> > preventing the creation of a new message while the previous one has not
> > yet been processed on the cluster?
>
> A situation when multiple metrics update messages exist in the cluster is
> normal.
> The node availability check is based on the fact that each node receives
> fresh metrics once in metricsUpdateFreq ms.
> If you make the coordinator wait for the previous metrics update message to
> be delivered before issuing a new one,
> then this frequency will depend on the number of nodes in the cluster,
> since the time of one round-trip will differ on different topologies.
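>
> Just to put rough numbers on it (purely illustrative values):
>
>     // On a ring, one full round-trip visits every node once.
>     int nodes = 1000;                     // assumed cluster size
>     long perHopMs = 5;                    // assumed per-hop latency
>     long roundTripMs = nodes * perHopMs;  // ~5 seconds for the whole ring
>     // Waiting for delivery before issuing the next update would cap the
>     // effective metrics frequency at one update per ~5 seconds, regardless
>     // of the configured metricsUpdateFreq.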
>
> Alex,
>
> I haven't checked it yet. Theoretically, nodes will fail a bit more often
> when their discovery worker queues are flooded with messages.
> This change definitely requires extensive testing.
>
> I think you can make metrics update messages have regular priority
> separately from fixing the issue that I described.
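>
> In terms of code that would roughly mean enqueueing them at the tail instead
> of the head of the worker queue, something like this (the method and the
> priority check are hypothetical, not the actual ServerImpl code):
>
>     void addMessage(TcpDiscoveryAbstractMessage msg) {
>         if (isHighPriority(msg) && !(msg instanceof TcpDiscoveryMetricsUpdateMessage))
>             queue.addFirst(msg); // only truly urgent messages go to the head
>         else
>             queue.addLast(msg);  // metrics updates are processed in arrival order
>     }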
>
> Denis
>
> On Tue, Jan 29, 2019 at 8:44 PM Alexey Goncharuk <alexey.goncha...@gmail.com>
> wrote:
>
> > Folks,
> >
> > Have we already checked that omitting heartbeat priority does not break
> > discovery? I am currently working on another discovery issue, and
> > skipping heartbeat priority would help a lot in my case.
> >
> > --AG
> >
> > On Fri, Jan 11, 2019 at 11:21 PM Yakov Zhdanov <yzhda...@apache.org> wrote:
> >
> > > > How big may the message worker's queue grow before it becomes a
> > > > problem?
> > >
> > > Denis, you never know. Imagine a node being flooded with messages
> > > because of increased timeouts and network problems. I remember cases
> > > with hundreds of messages in the queue on large topologies. Please, no
> > > O(n) approaches =)
> > >
> > > > So, we may never come to a point when an actual
> > > > TcpDiscoveryMetricsUpdateMessage is processed.
> > >
> > > Good catch! You can put a hard limit and process an enqueued
> > > MetricsUpdate message if the last one of the kind was processed more
> > > than metricsUpdFreq milliseconds ago.
> > >
> > > Denis, also note: the initial problem is message queue growth. When we
> > > choose to skip messages, it means that the node cannot process certain
> > > messages and is most probably experiencing problems. We need to think of
> > > killing such nodes. I would suggest we allow queue overflow for 1
> > > minute, but if the situation does not go back to normal, then the node
> > > should fire a special event and then kill itself. Thoughts?
> > >
> > > --Yakov
> > >
> >
>
