Pavel K., can you please answer about Zookeeper discovery?
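[For reference, a minimal sketch of the ZooKeeper discovery setup discussed below, assuming the Ignite 2.x Java API; the connection string, root path, and timeout values are placeholders, not taken from the thread.]

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi;

    public class ZkDiscoveryNode {
        public static void main(String[] args) {
            // ZooKeeper-based discovery (requires the ignite-zookeeper module).
            ZookeeperDiscoverySpi zkSpi = new ZookeeperDiscoverySpi();
            zkSpi.setZkConnectionString("zk1:2181,zk2:2181,zk3:2181"); // placeholder hosts
            zkSpi.setZkRootPath("/ignite");
            zkSpi.setSessionTimeout(30_000);
            zkSpi.setJoinTimeout(10_000);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDiscoverySpi(zkSpi);
            // Controls how quickly an unresponsive node is dropped, which also
            // affects how long a PME can wait on a dead node.
            cfg.setFailureDetectionTimeout(10_000);

            Ignition.start(cfg);
        }
    }
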
On Wed, Sep 12, 2018 at 5:49 PM, eugene miretsky <eugene.miret...@gmail.com> wrote:

> Thanks for the patience with my questions - just trying to understand the
> system better.
>
> 3) I was referring to
> https://apacheignite.readme.io/docs/zookeeper-discovery#section-failures-and-split-brain-handling.
> How come it doesn't get the node to shut down?
> 4) Are there any docs/JIRAs that explain how counters are used, and why
> they are required in the state?
>
> Cheers,
> Eugene
>
> On Wed, Sep 12, 2018 at 10:04 AM Ilya Lantukh <ilant...@gridgain.com> wrote:
>
>> 3) Such mechanics will be implemented in IEP-25 (linked above).
>> 4) Partition map states include update counters, which are incremented on
>> every cache update and play an important role in new state calculation. So,
>> technically, every cache operation can lead to a partition map change, and
>> for obvious reasons we can't route them through the coordinator. Ignite is
>> a more complex system than Akka or Kafka, and such simple solutions won't
>> work here (in the general case). However, it is true that PME could be
>> simplified or completely avoided for certain cases, and the community is
>> currently working on such optimizations
>> (https://issues.apache.org/jira/browse/IGNITE-9558 for example).
>>
>> On Wed, Sep 12, 2018 at 9:08 AM, eugene miretsky <eugene.miret...@gmail.com> wrote:
>>
>>> 2b) I had a few situations where the cluster went into a state where PME
>>> constantly failed and could never recover. I think the root cause was that
>>> a transaction got stuck and didn't time out/roll back. I will try to
>>> reproduce it again and get back to you.
>>> 3) If a node is down, I would expect it to get detected and the node to
>>> be removed from the cluster. In such a case, PME should not even be
>>> attempted with that node. Hence you would expect PME to fail very rarely
>>> (any faulty node will be removed before it has a chance to fail PME).
>>> 4) Don't all partition map changes go through the coordinator? I believe
>>> a lot of distributed systems work this way (all decisions are made by the
>>> coordinator/leader) - in Akka the leader is responsible for making all
>>> cluster membership changes, and in Kafka the controller does the leader
>>> election.
>>>
>>> On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh <ilant...@gridgain.com> wrote:
>>>
>>>> 1) It is.
>>>> 2a) Ignite has retry mechanics for all messages, including PME-related
>>>> ones.
>>>> 2b) In this situation PME will hang, but it isn't a "deadlock".
>>>> 3) Sorry, I didn't understand your question. If a node is down, but
>>>> DiscoverySpi doesn't detect it, it isn't a PME-related problem.
>>>> 4) How can you ensure that partition maps on the coordinator are *latest*
>>>> without "freezing" cluster state for some time?
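[Aside on the stuck-transaction scenario in 2b above: a minimal sketch, assuming Ignite 2.5+ and placeholder timeout values, of bounding transaction lifetimes so that a hung transaction cannot block PME indefinitely.]

    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.configuration.TransactionConfiguration;

    public class TxTimeoutConfig {
        public static void main(String[] args) {
            TransactionConfiguration txCfg = new TransactionConfiguration();
            // Upper bound on any transaction's lifetime (placeholder value).
            txCfg.setDefaultTxTimeout(30_000);
            // Transactions still pending when PME starts are rolled back after
            // this timeout so the exchange can proceed (available since Ignite 2.5).
            txCfg.setTxTimeoutOnPartitionMapExchange(20_000);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setTransactionConfiguration(txCfg);

            Ignition.start(cfg);
        }
    }
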
>>>> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <eugene.miret...@gmail.com> wrote:
>>>>
>>>>> Thanks!
>>>>>
>>>>> We are using persistence, so I am not sure if shutting down nodes will
>>>>> be the desired outcome for us, since we would need to modify the baseline
>>>>> topology.
>>>>>
>>>>> A couple more follow-up questions:
>>>>>
>>>>> 1) Is PME triggered when client nodes join as well? We are using the
>>>>> Spark client, so new nodes are created/destroyed every time.
>>>>> 2) It sounds to me like there is a potential for the cluster to get
>>>>> into a deadlock if
>>>>> a) a single PME message is lost (PME never finishes, there are no
>>>>> retries, and all future operations are blocked on the pending PME), or
>>>>> b) one of the nodes has a long-running/stuck pending operation.
>>>>> 3) Under what circumstances can PME fail while DiscoverySpi fails to
>>>>> detect the node being down? We are using ZookeeperSpi, so I would expect
>>>>> the split-brain resolver to shut down the node.
>>>>> 4) Why is PME needed? Doesn't the coordinator know the latest
>>>>> topology/partition map of the cluster through regular gossip?
>>>>>
>>>>> Cheers,
>>>>> Eugene
>>>>>
>>>>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <ilant...@gridgain.com> wrote:
>>>>>
>>>>>> Hi Eugene,
>>>>>>
>>>>>> 1) PME happens when the topology is modified (TopologyVersion is
>>>>>> incremented). The most common events that trigger it are: node
>>>>>> start/stop/fail, cluster activation/deactivation, dynamic cache
>>>>>> start/stop.
>>>>>> 2) It is done by a separate ExchangeWorker. Events that trigger PME
>>>>>> are transferred using DiscoverySpi instead of CommunicationSpi.
>>>>>> 3) All nodes wait for all pending cache operations to finish and then
>>>>>> send their local partition maps to the coordinator (the oldest node).
>>>>>> The coordinator then calculates new global partition maps and sends
>>>>>> them to every node.
>>>>>> 4) All cache operations.
>>>>>> 5) Exchange is never retried. The Ignite community is currently working
>>>>>> on PME failure handling that should kick all problematic nodes out after
>>>>>> a timeout is reached (see
>>>>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>>>>>> for details), but it isn't done yet.
>>>>>> 6) You shouldn't consider a PME failure an error by itself, but rather
>>>>>> the result of some other error. The most common reason for a PME
>>>>>> hang-up is a pending cache operation that couldn't finish. Check your
>>>>>> logs - they should list pending transactions and atomic updates. Search
>>>>>> for the "Found long running" substring.
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <eugene.miret...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> Our cluster occasionally fails with "partition map exchange failure"
>>>>>>> errors. I have searched around and it seems that a lot of people have
>>>>>>> had a similar issue in the past. My high-level understanding is that
>>>>>>> when one of the nodes fails (out of memory, exception, GC, etc.), nodes
>>>>>>> fail to exchange partition maps. However, I have a few questions:
>>>>>>> 1) When does partition map exchange happen? Periodically, when a node
>>>>>>> joins, etc.?
>>>>>>> 2) Is it done in the same thread as the communication SPI, or in a
>>>>>>> separate worker?
>>>>>>> 3) How does the exchange happen? Via a coordinator, peer to peer, etc.?
>>>>>>> 4) What does the exchange block?
>>>>>>> 5) When is the exchange retried?
>>>>>>> 6) How to resolve the error? The only thing I have seen online is to
>>>>>>> decrease failureDetectionTimeout.
>>>>>>>
>>>>>>> Our settings are:
>>>>>>> - Zookeeper SPI
>>>>>>> - Persistence enabled
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Eugene
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Ilya
>>>>
>>>> --
>>>> Best regards,
>>>> Ilya
>>
>> --
>> Best regards,
>> Ilya

--
Best regards,
Ilya
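[Aside on the persistence setup mentioned above: a minimal sketch, assuming the Ignite 2.x Java API, of enabling native persistence and resetting the baseline topology after a server node permanently joins or leaves; the class name and values are illustrative.]

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.DataRegionConfiguration;
    import org.apache.ignite.configuration.DataStorageConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class PersistentNode {
        public static void main(String[] args) {
            // Enable native persistence for the default data region.
            DataRegionConfiguration regionCfg = new DataRegionConfiguration();
            regionCfg.setPersistenceEnabled(true);

            DataStorageConfiguration storageCfg = new DataStorageConfiguration();
            storageCfg.setDefaultDataRegionConfiguration(regionCfg);

            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setDataStorageConfiguration(storageCfg);

            Ignite ignite = Ignition.start(cfg);

            // With persistence the cluster starts inactive; activation fixes the
            // initial baseline topology from the current set of server nodes.
            ignite.cluster().active(true);

            // After a server node permanently leaves or joins, the baseline can be
            // reset to the current topology version so data is rebalanced.
            ignite.cluster().setBaselineTopology(ignite.cluster().topologyVersion());
        }
    }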