Pavel K., can you please answer about Zookeeper discovery?
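[For reference, a minimal sketch of the ZooKeeper discovery setup discussed below, assuming the Ignite 2.x Java API; the connection string, root path, and timeout values are placeholders, not taken from the thread.]

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

    public class ZkDiscoveryNode {
        public static void main(String[] args) {
            // ZooKeeper-based discovery (requires the ignite-zookeeper module).
            ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
            zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181"); // placeholder hosts
            zkSpi.setZkRootPath("/ignite");
            zkSpi.setSessionTimeout(30_000);
            zkSpi.setJoinTimeout(10_000);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDiscoverySpi(zkSpi);
            // Controls how quickly an unresponsive node is dropped, which also
            // affects how long a PME can wait on a dead node.
            cfg.setFailureDetectionTimeout(10_000);

            Ignition.start(cfg);
        }
    }
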
On Wed, Sep 12, 2018 at 5:49 PM, eugene miretsky <eugene.miret...@gmail.com> wrote:

> Thanks for the patience with my questions - just trying to understand the
> system better.
>
> 3) I was referring to
> https://apacheignite.readme.io/docs/zookeeper-discovery#section-failures-and-split-brain-handling.
> How come it doesn't get the node to shut down?
> 4) Are there any docs/JIRAs that explain how counters are used, and why
> they are required in the state?
>
> Cheers,
> Eugene
>
> On Wed, Sep 12, 2018 at 10:04 AM Ilya Lantukh <ilant...@gridgain.com> wrote:
>
>> 3) Such mechanics will be implemented in IEP-25 (linked above).
>> 4) Partition map states include update counters, which are incremented on
>> every cache update and play an important role in new state calculation. So,
>> technically, every cache operation can lead to a partition map change, and
>> for obvious reasons we can't route them through the coordinator. Ignite is
>> a more complex system than Akka or Kafka, and such simple solutions won't
>> work here (in the general case). However, it is true that PME could be
>> simplified or completely avoided for certain cases, and the community is
>> currently working on such optimizations
>> (https://issues.apache.org/jira/browse/IGNITE-9558 for example).
>>
>> On Wed, Sep 12, 2018 at 9:08 AM, eugene miretsky <eugene.miret...@gmail.com> wrote:
>>
>>> 2b) I had a few situations where the cluster went into a state where PME
>>> constantly failed and could never recover. I think the root cause was that
>>> a transaction got stuck and didn't time out/roll back. I will try to
>>> reproduce it again and get back to you.
>>> 3) If a node is down, I would expect it to get detected and the node to
>>> be removed from the cluster. In such a case, PME should not even be
>>> attempted with that node. Hence you would expect PME to fail very rarely
>>> (any faulty node will be removed before it has a chance to fail PME).
>>> 4) Don't all partition map changes go through the coordinator? I believe
>>> a lot of distributed systems work this way (all decisions are made by the
>>> coordinator/leader) - in Akka the leader is responsible for making all
>>> cluster membership changes, and in Kafka the controller does the leader
>>> election.
>>>
>>> On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh <ilant...@gridgain.com> wrote:
>>>
>>>> 1) It is.
>>>> 2a) Ignite has retry mechanics for all messages, including PME-related
>>>> ones.
>>>> 2b) In this situation PME will hang, but it isn't a "deadlock".
>>>> 3) Sorry, I didn't understand your question. If a node is down, but
>>>> DiscoverySpi doesn't detect it, it isn't a PME-related problem.
>>>> 4) How can you ensure that partition maps on the coordinator are *latest*
>>>> without "freezing" cluster state for some time?
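[Aside on the stuck-transaction scenario in 2b above: a minimal sketch, assuming Ignite 2.5+ and placeholder timeout values, of bounding transaction lifetimes so that a hung transaction cannot block PME indefinitely.]

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.configuration.TransactionConfiguration;

    public class TxTimeoutConfig {
        public static void main(String[] args) {
            TransactionConfiguration txCfg = new TransactionConfiguration();
            // Upper bound on any transaction's lifetime (placeholder value).
            txCfg.setDefaultTxTimeout(30_000);
            // Transactions still pending when PME starts are rolled back after
            // this timeout so the exchange can proceed (available since Ignite 2.5).
            txCfg.setTxTimeoutOnPartitionMapExchange(20_000);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setTransactionConfiguration(txCfg);

            Ignition.start(cfg);
        }
    }
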
>>>> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <eugene.miret...@gmail.com> wrote:
>>>>
>>>>> Thanks!
>>>>>
>>>>> We are using persistence, so I am not sure if shutting down nodes will
>>>>> be the desired outcome for us, since we would need to modify the baseline
>>>>> topology.
>>>>>
>>>>> A couple more follow-up questions:
>>>>>
>>>>> 1) Is PME triggered when client nodes join as well? We are using the
>>>>> Spark client, so new nodes are created/destroyed every time.
>>>>> 2) It sounds to me like there is a potential for the cluster to get
>>>>> into a deadlock if
>>>>> a) a single PME message is lost (PME never finishes, there are no
>>>>> retries, and all future operations are blocked on the pending PME), or
>>>>> b) one of the nodes has a long-running/stuck pending operation.
>>>>> 3) Under what circumstances can PME fail while DiscoverySpi fails to
>>>>> detect the node being down? We are using ZookeeperSpi, so I would expect
>>>>> the split-brain resolver to shut down the node.
>>>>> 4) Why is PME needed? Doesn't the coordinator know the latest
>>>>> topology/partition map of the cluster through regular gossip?
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <ilant...@gridgain.com> wrote:
>>>>>
>>>>>> Hi Eugene,
>>>>>>
>>>>>> 1) PME happens when the topology is modified (TopologyVersion is
>>>>>> incremented). The most common events that trigger it are: node
>>>>>> start/stop/fail, cluster activation/deactivation, dynamic cache
>>>>>> start/stop.
>>>>>> 2) It is done by a separate ExchangeWorker. Events that trigger PME
>>>>>> are transferred using DiscoverySpi instead of CommunicationSpi.
>>>>>> 3) All nodes wait for all pending cache operations to finish and then
>>>>>> send their local partition maps to the coordinator (the oldest node).
>>>>>> The coordinator then calculates new global partition maps and sends
>>>>>> them to every node.
>>>>>> 4) All cache operations.
>>>>>> 5) Exchange is never retried. The Ignite community is currently working
>>>>>> on PME failure handling that should kick all problematic nodes out after
>>>>>> a timeout is reached (see
>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>>>>>> for details), but it isn't done yet.
>>>>>> 6) You shouldn't consider a PME failure an error by itself, but rather
>>>>>> the result of some other error. The most common reason for a PME
>>>>>> hang-up is a pending cache operation that couldn't finish. Check your
>>>>>> logs - they should list pending transactions and atomic updates. Search
>>>>>> for the "Found long running" substring.
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <eugene.miret...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Our cluster occasionally fails with "partition map exchange failure"
>>>>>>> errors. I have searched around and it seems that a lot of people have
>>>>>>> had a similar issue in the past. My high-level understanding is that
>>>>>>> when one of the nodes fails (out of memory, exception, GC, etc.), nodes
>>>>>>> fail to exchange partition maps. However, I have a few questions:
>>>>>>> 1) When does partition map exchange happen? Periodically, when a node
>>>>>>> joins, etc.?
>>>>>>> 2) Is it done in the same thread as the communication SPI, or in a
>>>>>>> separate worker?
>>>>>>> 3) How does the exchange happen? Via a coordinator, peer to peer, etc.?
>>>>>>> 4) What does the exchange block?
>>>>>>> 5) When is the exchange retried?
>>>>>>> 6) How to resolve the error? The only thing I have seen online is to
>>>>>>> decrease failureDetectionTimeout.
>>>>>>>
>>>>>>> Our settings are:
>>>>>>> - Zookeeper SPI
>>>>>>> - Persistence enabled
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Ilya
>>>>
>>>> --
>>>> Best regards,
>>>> Ilya
>>
>> --
>> Best regards,
>> Ilya

--
Best regards,
Ilya
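[Aside on the persistence setup mentioned above: a minimal sketch, assuming the Ignite 2.x Java API, of enabling native persistence and resetting the baseline topology after a server node permanently joins or leaves; the class name and values are illustrative.]

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class PersistentNode {
        public static void main(String[] args) {
            // Enable native persistence for the default data region.
            DataRegionConfiguration regionCfg = new DataRegionConfiguration();
            regionCfg.setPersistenceEnabled(true);

            DataStorageConfiguration storageCfg = new DataStorageConfiguration();
            storageCfg.setDefaultDataRegionConfiguration(regionCfg);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDataStorageConfiguration(storageCfg);

            Ignite ignite = Ignition.start(cfg);

            // With persistence the cluster starts inactive; activation fixes the
            // initial baseline topology from the current set of server nodes.
            ignite.cluster().active(true);

            // After a server node permanently leaves or joins, the baseline can be
            // reset to the current topology version so data is rebalanced.
            ignite.cluster().setBaselineTopology(ignite.cluster().topologyVersion());
        }
    }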