2b) I had a few situations where the cluster went into a state where PME
constantly failed and could never recover. I think the root cause was that
a transaction got stuck and never timed out or rolled back (see the
configuration sketch after this list). I will try to reproduce it again and
get back to you.
3) If a node is down, I would expect it to be detected and removed from the
cluster. In that case, PME should not even be attempted with that node.
Hence PME should fail very rarely (any faulty node would be removed before
it has a chance to fail PME).
4) Don't all partition map changes go through the coordinator? I believe a
lot of distributed systems work this way (all decisions are made by the
coordinator/leader): in Akka the leader is responsible for all cluster
membership changes, and in Kafka the controller performs leader election.
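
For 2b, the kind of transaction timeouts I have in mind looks roughly like
the following (a rough sketch only: it assumes Ignite 2.5+ where
txTimeoutOnPartitionMapExchange is available, and the class name and timeout
values are placeholders, not recommendations):

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.configuration.TransactionConfiguration;

    public class TxTimeoutSketch {
        public static void main(String[] args) {
            TransactionConfiguration txCfg = new TransactionConfiguration()
                // Roll back any transaction running longer than 30s (placeholder value).
                .setDefaultTxTimeout(30_000)
                // Assumed semantics: roll back transactions that keep a started
                // exchange waiting for longer than 20s (placeholder value).
                .setTxTimeoutOnPartitionMapExchange(20_000);

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setTransactionConfiguration(txCfg);

            // Start the node with the timeouts above so a stuck transaction
            // cannot hold up PME indefinitely.
            Ignite ignite = Ignition.start(cfg);
        }
    }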

On Tue, Sep 11, 2018 at 11:11 AM Ilya Lantukh <ilant...@gridgain.com> wrote:

> 1) It is.
> 2a) Ignite has retry mechanics for all messages, including PME-related
> ones.
> 2b) In this situation PME will hang, but it isn't a "deadlock".
> 3) Sorry, I didn't understand your question. If a node is down but
> DiscoverySpi doesn't detect it, that isn't a PME-related problem.
> 4) How can you ensure that the partition maps on the coordinator are the
> *latest* without "freezing" the cluster state for some time?
>
> On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <eugene.miret...@gmail.com
> > wrote:
>
>> Thanks!
>>
>> We are using persistence, so I am not sure shutting down nodes would be
>> the desired outcome for us, since we would need to modify the baseline
>> topology.
>>
>> A couple more follow-up questions:
>>
>> 1) Is PME triggered when client nodes join as well? We are using the Spark
>> client, so new nodes are created/destroyed all the time.
>> 2) It sounds to me like there is a potential for the cluster to get into
>> a deadlock if
>>    a) a single PME message is lost (PME never finishes, there are no
>> retries, and all future operations are blocked on the pending PME)
>>    b) one of the nodes has a long-running/stuck pending operation
>> 3) Under what circumstances can PME fail while DiscoverySpi fails to
>> detect that the node is down? We are using ZookeeperSpi, so I would expect
>> the split-brain resolver to shut down the node.
>> 4) Why is PME needed? Doesn't the coordinator know the latest
>> topology/partition map of the cluster through regular gossip?
>>
>> Cheers,
>> Eugene
>>
>> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <ilant...@gridgain.com>
>> wrote:
>>
>>> Hi Eugene,
>>>
>>> 1) PME happens when the topology is modified (TopologyVersion is
>>> incremented). The most common events that trigger it are node
>>> start/stop/fail, cluster activation/deactivation, and dynamic cache
>>> start/stop (see the sketch after this list).
>>> 2) It is done by a separate ExchangeWorker. Events that trigger PME are
>>> transferred using DiscoverySpi instead of CommunicationSpi.
>>> 3) All nodes wait for all pending cache operations to finish and then
>>> send their local partition maps to the coordinator (the oldest node). The
>>> coordinator then calculates the new global partition maps and sends them
>>> to every node.
>>> 4) All cache operations.
>>> 5) The exchange is never retried. The Ignite community is currently
>>> working on PME failure handling that should kick out problematic nodes
>>> after a timeout is reached (see
>>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>>> for details), but it isn't done yet.
>>> 6) You shouldn't consider a PME failure as an error by itself, but rather
>>> as a result of some other error. The most common reason for a PME hang-up
>>> is a pending cache operation that couldn't finish. Check your logs: they
>>> should list pending transactions and atomic updates. Search for the
>>> "Found long running" substring.
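>>>
>>> To make (1) and (4) more concrete, here is a rough sketch against the
>>> public API (not the internal exchange code; the class and cache names are
>>> made up) of which calls start a new exchange and which ones merely wait
>>> for it:
>>>
>>>     import org.apache.ignite.Ignite;
>>>     import org.apache.ignite.IgniteCache;
>>>     import org.apache.ignite.Ignition;
>>>
>>>     public class PmeTriggerSketch {
>>>         public static void main(String[] args) {
>>>             // Starting (or stopping) a node changes the topology
>>>             // version, which triggers PME.
>>>             Ignite ignite = Ignition.start();
>>>
>>>             // Cluster activation triggers PME (relevant with
>>>             // persistence, where the cluster starts inactive).
>>>             ignite.cluster().active(true);
>>>
>>>             // Starting a cache that does not exist yet triggers PME.
>>>             IgniteCache<Integer, String> cache = ignite.getOrCreateCache("newCache");
>>>
>>>             // A plain cache operation does not trigger PME, but it is
>>>             // blocked while an exchange is in progress.
>>>             cache.put(1, "value");
>>>         }
>>>     }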
>>>
>>> Hope this helps.
>>>
>>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <
>>> eugene.miret...@gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> Our cluster occasionally fails with "partition map exchange failure"
>>>> errors. I have searched around, and it seems that a lot of people have
>>>> had similar issues in the past. My high-level understanding is that when
>>>> one of the nodes fails (out of memory, exception, GC, etc.), the
>>>> remaining nodes fail to exchange partition maps. However, I have a few
>>>> questions:
>>>> 1) When does partition map exchange happen? Periodically, when a node
>>>> joins, etc.?
>>>> 2) Is it done in the same thread as the communication SPI, or by a
>>>> separate worker?
>>>> 3) How does the exchange happen? Via a coordinator, peer-to-peer, etc.?
>>>> 4) What does the exchange block?
>>>> 5) When is the exchange retried?
>>>> 6) How do we resolve the error? The only suggestion I have seen online
>>>> is to decrease failureDetectionTimeout
>>>>
>>>> Our settings are
>>>> - Zookeeper SPI
>>>> - Persistence enabled
>>>>
>>>> Cheers,
>>>> Eugene
>>>>
>>>
>>>
>>>
>>> --
>>> Best regards,
>>> Ilya
>>>
>>
>
>
> --
> Best regards,
> Ilya
>
