Denis,

>> Node availability check is based on the fact that it receives fresh
>> metrics once in metricsUpdateFreq ms.

I see the problem here. Node availability should be checked using some ping
(fast and small) message instead of a huge and slow metrics message.

On Wed, Jan 30, 2019 at 4:08 PM Denis Mekhanikov <[email protected]> wrote:

> Yakov,
>
> > You can put hard limit and process enqueued MetricsUpdate message
> > if the last one of the kind was processed more than metricsUpdFreq
> > millisecs ago.
> Makes sense. I'll try implementing it.
>
> > I would suggest we allow queue overflow for 1 min, but if the situation
> > does not go back to normal then the node should fire a special event
> > and then kill itself.
> Let's start with a warning in the log and see how these warnings correlate
> with network/GC problems.
> I'd like to make sure we don't kill innocents.
>
> Anton,
>
> > Maybe a better option is to have a special "discovery like" channel
> > (with a ring or an analog) for metrics-like messages.
> I don't think that creating another data channel is reasonable. It will
> require additional network connections and more complex configuration.
> But splitting pings and metrics into different types of messages, as it
> was before, and moving metrics distribution to communication makes sense
> to me. Some kind of a gossip protocol could be used for it.
>
> > Anyway, why are we fighting with duplicates inside the queue instead of
> > fighting with the initial creation of a new message while the previous
> > one is not yet processed on the cluster?
>
> A situation when multiple metrics update messages exist in the cluster is
> normal.
> Node availability check is based on the fact that it receives fresh
> metrics once in metricsUpdateFreq ms.
> If you make the coordinator wait for a previous metrics update message to
> be delivered before issuing a new one, then this frequency will depend on
> the number of nodes in the cluster, since the time of one round-trip will
> differ on different topologies.
>
> Alex,
>
> I didn't check it yet. Theoretically, nodes will fail a bit more often
> when their discovery worker queues are flooded with messages.
> This change definitely requires extensive testing.
>
> I think you can make metrics update messages have a regular priority
> separately from fixing the issue that I described.
>
> Denis
>
> On Tue, Jan 29, 2019 at 20:44, Alexey Goncharuk <[email protected]> wrote:
>
> > Folks,
> >
> > Did we already check that omitting heartbeat priority does not break
> > discovery? I am currently working on another issue with discovery, and
> > skipping heartbeat priority would help a lot in my case.
> >
> > --AG
> >
> > On Fri, Jan 11, 2019 at 23:21, Yakov Zhdanov <[email protected]> wrote:
> >
> > > > How big may the message worker's queue grow until it becomes a
> > > > problem?
> > >
> > > Denis, you never know. Imagine a node may be flooded with messages
> > > because of increased timeouts and network problems. I remember some
> > > cases with hundreds of messages in the queue on large topologies.
> > > Please, no O(n) approaches =)
> > >
> > > > So, we may never come to a point when an actual
> > > > TcpDiscoveryMetricsUpdateMessage is processed.
> > >
> > > Good catch! You can put a hard limit and process an enqueued
> > > MetricsUpdate message if the last one of the kind was processed more
> > > than metricsUpdFreq millisecs ago.
> > >
> > > Denis, also note - the initial problem is message queue growth. When
> > > we choose to skip messages, it means that the node cannot process
> > > certain messages and is most probably experiencing problems. We need
> > > to think of killing such nodes. I would suggest we allow queue
> > > overflow for 1 min, but if the situation does not go back to normal
> > > then the node should fire a special event and then kill itself.
> > > Thoughts?
> > >
> > > --Yakov
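For illustration, here is a minimal sketch of the "hard limit" rule discussed
above: process an enqueued metrics update only if the last one of the kind was
handled more than metricsUpdFreq ms ago, otherwise skip it as a duplicate. The
class and member names below are made up for the example and are not the actual
TcpDiscoverySpi internals:

    /**
     * Sketch only: decides whether an enqueued metrics update message should
     * be processed or dropped as a duplicate. Names are illustrative.
     */
    class MetricsUpdateLimiter {
        /** Expected metrics update frequency, ms (hypothetical parameter). */
        private final long metricsUpdateFreq;

        /** Timestamp of the last processed metrics update message. */
        private long lastProcessedTs;

        MetricsUpdateLimiter(long metricsUpdateFreq) {
            this.metricsUpdateFreq = metricsUpdateFreq;
        }

        /** @return {@code true} to process the message, {@code false} to skip it. */
        synchronized boolean shouldProcess() {
            long now = System.currentTimeMillis();

            if (now - lastProcessedTs >= metricsUpdateFreq) {
                lastProcessedTs = now;

                return true;
            }

            return false;
        }
    }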

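Similarly, the "allow overflow for 1 min, then fire a special event and kill the
node" proposal could look roughly like the watchdog below. Again, the names, the
threshold, and the way the node is stopped are assumptions made for this sketch,
not an actual Ignite implementation:

    /**
     * Sketch only: tolerates an oversized discovery worker queue for up to one
     * minute; if it does not shrink back, fires a special event and stops the
     * local node. All names are illustrative.
     */
    class QueueOverflowWatchdog {
        /** How long queue overflow is tolerated, ms. */
        private static final long OVERFLOW_TIMEOUT = 60_000L;

        /** Queue size considered an overflow (hypothetical threshold). */
        private final int maxQueueSize;

        /** When the current overflow started, or -1 if the queue is within the limit. */
        private long overflowSince = -1;

        QueueOverflowWatchdog(int maxQueueSize) {
            this.maxQueueSize = maxQueueSize;
        }

        /** Called periodically with the current discovery worker queue size. */
        void onQueueSize(int size) {
            long now = System.currentTimeMillis();

            if (size <= maxQueueSize) {
                overflowSince = -1; // Situation went back to normal.

                return;
            }

            if (overflowSince == -1)
                overflowSince = now; // Overflow just started, begin counting.
            else if (now - overflowSince > OVERFLOW_TIMEOUT) {
                fireOverflowEvent(); // Record/notify a special event first.
                stopLocalNode();     // Then kill the node.
            }
        }

        private void fireOverflowEvent() {
            // E.g. log a warning and record a custom local event for monitoring.
        }

        private void stopLocalNode() {
            // E.g. request node stop from a separate thread.
        }
    }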