1) It is.
2a) Ignite has retry mechanics for all messages, including PME-related ones.
2b) In this situation PME will hang, but it isn't a "deadlock".
3) Sorry, I didn't understand your question. If a node is down but DiscoverySpi
doesn't detect it, that isn't a PME-related problem.
4) How can you ensure that the partition maps on the coordinator are the
*latest* without "freezing" cluster state for some time?
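For readers following along: the setup discussed in this thread (Zookeeper discovery plus native persistence) can be sketched as an Ignite Spring XML fragment. This is a minimal, illustrative sketch, not taken from the original poster's configuration; the connection string and timeout values are placeholders.

```xml
<!-- Minimal sketch of an IgniteConfiguration with Zookeeper discovery
     and native persistence. All values are illustrative placeholders. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <!-- Bounds how long the cluster waits before declaring a node failed,
         which also affects how quickly a hung PME participant is noticed. -->
    <property name="failureDetectionTimeout" value="10000"/>
    <property name="discoverySpi">
        <bean class="org.apache.ignite.spi.discovery.zk.ZookeeperDiscoverySpi">
            <property name="zkConnectionString" value="zk1:2181,zk2:2181,zk3:2181"/>
            <property name="sessionTimeout" value="30000"/>
        </bean>
    </property>
    <!-- Persistence enabled, as in the original poster's settings. -->
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="defaultDataRegionConfiguration">
                <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                    <property name="persistenceEnabled" value="true"/>
                </bean>
            </property>
        </bean>
    </property>
</bean>
```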
On Sat, Sep 8, 2018 at 3:21 AM, eugene miretsky <eugene.miret...@gmail.com> wrote:

> Thanks!
>
> We are using persistence, so I am not sure if shutting down nodes would be
> the desired outcome for us, since we would need to modify the baseline
> topology.
>
> A couple more follow-up questions:
>
> 1) Is PME triggered when client nodes join as well? We are using the Spark
> client, so new nodes are created/destroyed all the time.
> 2) It sounds to me like there is a potential for the cluster to get into
> a deadlock if
>    a) a single PME message is lost (PME never finishes, there are no
> retries, and all future operations are blocked on the pending PME)
>    b) one of the nodes has a long-running/stuck pending operation
> 3) Under what circumstances can PME fail while DiscoverySpi fails to
> detect the node being down? We are using ZookeeperSpi, so I would expect
> the split-brain resolver to shut down the node.
> 4) Why is PME needed? Doesn't the coordinator know the latest
> topology/partition map of the cluster through regular gossip?
>
> Cheers,
> Eugene
>
> On Fri, Sep 7, 2018 at 5:18 PM Ilya Lantukh <ilant...@gridgain.com> wrote:
>
>> Hi Eugene,
>>
>> 1) PME happens when topology is modified (TopologyVersion is
>> incremented). The most common events that trigger it are: node
>> start/stop/fail, cluster activation/deactivation, dynamic cache
>> start/stop.
>> 2) It is done by a separate ExchangeWorker. Events that trigger PME are
>> transferred using DiscoverySpi instead of CommunicationSpi.
>> 3) All nodes wait for all pending cache operations to finish and then
>> send their local partition maps to the coordinator (the oldest node).
>> The coordinator then calculates new global partition maps and sends them
>> to every node.
>> 4) All cache operations.
>> 5) Exchange is never retried.
>> The Ignite community is currently working on PME failure handling that
>> should kick all problematic nodes out after a timeout is reached (see
>> https://cwiki.apache.org/confluence/display/IGNITE/IEP-25%3A+Partition+Map+Exchange+hangs+resolving
>> for details), but it isn't done yet.
>> 6) You shouldn't consider a PME failure as an error by itself, but rather
>> as a result of some other error. The most common reason for a PME hang-up
>> is a pending cache operation that couldn't finish. Check your logs - they
>> should list pending transactions and atomic updates. Search for the
>> "Found long running" substring.
>>
>> Hope this helps.
>>
>> On Fri, Sep 7, 2018 at 11:45 PM, eugene miretsky <
>> eugene.miret...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> Our cluster occasionally fails with "partition map exchange failure"
>>> errors. I have searched around, and it seems that a lot of people have
>>> had a similar issue in the past. My high-level understanding is that
>>> when one of the nodes fails (out of memory, exception, GC, etc.), nodes
>>> fail to exchange partition maps. However, I have a few questions:
>>> 1) When does partition map exchange happen? Periodically, when a node
>>> joins, etc.?
>>> 2) Is it done in the same thread as the communication SPI, or in a
>>> separate worker?
>>> 3) How does the exchange happen? Via a coordinator, peer to peer, etc.?
>>> 4) What does the exchange block?
>>> 5) When is the exchange retried?
>>> 6) How to resolve the error? The only thing I have seen online is to
>>> decrease failureDetectionTimeout.
>>>
>>> Our settings are:
>>> - Zookeeper SPI
>>> - Persistence enabled
>>>
>>> Cheers,
>>> Eugene
>>>
>>
>> --
>> Best regards,
>> Ilya
>

--
Best regards,
Ilya
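The log search Ilya recommends (his point 6) can be done with a simple grep over the Ignite work-directory logs. The snippet below builds a small sample log first so it is self-contained; the file path and log lines are hypothetical, only the "Found long running" marker comes from the thread.

```shell
# Hypothetical sample of Ignite diagnostic output (illustrative lines only).
cat > /tmp/ignite-sample.log <<'EOF'
[03:20:11] WARN diagnostic - Found long running transaction [startTime=03:18:02]
[03:20:11] WARN diagnostic - Found long running cache future [fut=AtomicUpdateFuture]
[03:20:12] INFO exchange - Finished exchange init [topVer=42]
EOF

# Pending operations that may be blocking PME show up under this marker.
grep "Found long running" /tmp/ignite-sample.log
```

On a real cluster you would point the grep at every node's log, since the stuck operation may live on any participant of the exchange.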