1) It is.
2a) Ignite has retry mechanics for all messages, including PME-related ones.
2b) In this situation PME will hang, but it isn't a "deadlock".
3) Sorry, I didn't understand your question. If a node is down but DiscoverySpi
doesn't detect it, that isn't a PME-related problem.
4) How can you ensure that the partition maps on the coordinator are the
*latest* without "freezing" cluster state for some time?
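For readers following along: the setup discussed in this thread (Zookeeper discovery plus native persistence) can be sketched as an Ignite Spring XML fragment. This is a minimal, illustrative sketch, not taken from the original poster's configuration; the connection string and timeout values are placeholders.

```xml
<!-- Minimal sketch of an IgniteConfiguration with Zookeeper discovery
     and native persistence. All values are illustrative placeholders. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Bounds how long the cluster waits before declaring a node failed,
         which also affects how quickly a hung PME participant is noticed. -->
    <property name="failureDetectionTimeout" value="10000"/>
    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi">
            <property name="zkConnectionString" value="zk1:2181,zk2:2181,zk3:2181"/>
            <property name="sessionTimeout" value="30000"/>
        </bean>
    </property>
    <!-- Persistence enabled, as in the original poster's settings. -->
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="defaultDataRegionConfiguration">
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="persistenceEnabled" value="true"/>
                </bean>
            </property>
        </bean>
    </property>
</bean>
```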
On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <eugene.miret...@gmail.com> wrote:

> Thanks!
>
> We are using persistence, so I am not sure if shutting down nodes would be
> the desired outcome for us, since we would need to modify the baseline
> topology.
>
> A couple more follow-up questions:
>
> 1) Is PME triggered when client nodes join as well? We are using the Spark
> client, so new nodes are created/destroyed all the time.
> 2) It sounds to me like there is a potential for the cluster to get into
> a deadlock if
>    a) a single PME message is lost (PME never finishes, there are no
> retries, and all future operations are blocked on the pending PME)
>    b) one of the nodes has a long-running/stuck pending operation
> 3) Under what circumstances can PME fail while DiscoverySpi fails to
> detect the node being down? We are using ZookeeperSpi, so I would expect
> the split-brain resolver to shut down the node.
> 4) Why is PME needed? Doesn't the coordinator know the latest
> topology/partition map of the cluster through regular gossip?
>
> Cheers,
> Eugene
>
> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <ilant...@gridgain.com> wrote:
>
>> Hi Eugene,
>>
>> 1) PME happens when topology is modified (TopologyVersion is
>> incremented). The most common events that trigger it are: node
>> start/stop/fail, cluster activation/deactivation, dynamic cache
>> start/stop.
>> 2) It is done by a separate ExchangeWorker. Events that trigger PME are
>> transferred using DiscoverySpi instead of CommunicationSpi.
>> 3) All nodes wait for all pending cache operations to finish and then
>> send their local partition maps to the coordinator (the oldest node).
>> The coordinator then calculates new global partition maps and sends them
>> to every node.
>> 4) All cache operations.
>> 5) Exchange is never retried.
>> The Ignite community is currently working on PME failure handling that
>> should kick all problematic nodes out after a timeout is reached (see
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>> for details), but it isn't done yet.
>> 6) You shouldn't consider a PME failure as an error by itself, but rather
>> as a result of some other error. The most common reason for a PME hang-up
>> is a pending cache operation that couldn't finish. Check your logs - they
>> should list pending transactions and atomic updates. Search for the
>> "Found long running" substring.
>>
>> Hope this helps.
>>
>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <
>> eugene.miret...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Our cluster occasionally fails with "partition map exchange failure"
>>> errors. I have searched around, and it seems that a lot of people have
>>> had a similar issue in the past. My high-level understanding is that
>>> when one of the nodes fails (out of memory, exception, GC, etc.), nodes
>>> fail to exchange partition maps. However, I have a few questions:
>>> 1) When does partition map exchange happen? Periodically, when a node
>>> joins, etc.?
>>> 2) Is it done in the same thread as the communication SPI, or in a
>>> separate worker?
>>> 3) How does the exchange happen? Via a coordinator, peer to peer, etc.?
>>> 4) What does the exchange block?
>>> 5) When is the exchange retried?
>>> 6) How to resolve the error? The only thing I have seen online is to
>>> decrease failureDetectionTimeout.
>>>
>>> Our settings are:
>>> - Zookeeper SPI
>>> - Persistence enabled
>>>
>>> Cheers,
>>> Eugene
>>>
>>
>> --
>> Best regards,
>> Ilya
>

--
Best regards,
Ilya
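The log search Ilya recommends (his point 6) can be done with a simple grep over the Ignite work-directory logs. The snippet below builds a small sample log first so it is self-contained; the file path and log lines are hypothetical, only the "Found long running" marker comes from the thread.

```shell
# Hypothetical sample of Ignite diagnostic output (illustrative lines only).
cat > /tmp/ignite-sample.log <<'EOF'
[03:20:11] WARN diagnostic - Found long running transaction [startTime=03:18:02]
[03:20:11] WARN diagnostic - Found long running cache future [fut=AtomicUpdateFuture]
[03:20:12] INFO exchange - Finished exchange init [topVer=42]
EOF

# Pending operations that may be blocking PME show up under this marker.
grep "Found long running" /tmp/ignite-sample.log
```

On a real cluster you would point the grep at every node's log, since the stuck operation may live on any participant of the exchange.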