Yes. It seems safe to use. It looks like some race causes a failure or deadlock on the first attempt of the exchange, and after the timeout Ignite runs a second attempt, which succeeds. As far as I can see in my tests (60 Ignite activations per build), the second attempt is stable. But it causes an unnecessary 10-15 second delay on each activation when the tests run on a portable drive.
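
To make the setup concrete, below is a minimal sketch of the kind of test I run: three persistent server nodes started in one JVM over localhost, then a timed cluster activation. It is only an illustration and not the actual DemoTest.java from the reproducer archive; the class name, discovery ports and the timing wrapper are my own simplifications.

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class ActivationTimingSketch {
    public static void main(String[] args) {
        Ignite first = null;

        // Start three server nodes with native persistence in one JVM,
        // discovering each other over localhost (ports 47500..47502).
        for (int i = 0; i < 3; i++) {
            IgniteConfiguration cfg = new IgniteConfiguration();
            cfg.setIgniteInstanceName("TestNode-" + i);

            TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
            ipFinder.setAddresses(Collections.singletonList("127.0.0.1:47500..47502"));
            cfg.setDiscoverySpi(new TcpDiscoverySpi().setIpFinder(ipFinder));

            // Enable persistence in the default data region
            // (work directories / consistent IDs left at defaults for brevity).
            DataStorageConfiguration storage = new DataStorageConfiguration();
            storage.getDefaultDataRegionConfiguration().setPersistenceEnabled(true);
            cfg.setDataStorageConfiguration(storage);

            Ignite node = Ignition.start(cfg);

            if (first == null)
                first = node;
        }

        // Time the activation: this is the step where the extra delay and the
        // "Failed to wait for partition map exchange" warning show up in my runs.
        long start = System.currentTimeMillis();

        first.cluster().active(true);

        System.out.println("Activation took " + (System.currentTimeMillis() - start) + " ms");
    }
}

On the SSD the activation completes quickly; when the project sits on the USB drive, this is where the 10-15 second pause appears before the exchange finishes.
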
On Thursday, January 17, 2019, Ilya Kasnacheev <[email protected]> wrote:
> Hello!
>
>> As I investigated during reproduction, everything works fine when I run the test on a local drive (an SSD in the case of my laptop), but if I copy the project to an external HDD mounted over USB 3.0, the problem is present. So it may be a race condition triggered by slow I/O. Initially I got the problem when I built my project on the external drive.
>
> "Failed to wait for PME" is not a problem but a symptom. It is a simple timeout. If you are hitting this timeout due to slow I/O you will see PME messages which will then go away. This is by design.
>
> Nevertheless I will try to run your reproducer.
>
> Regards,
> --
> Ilya Kasnacheev
>
>
> On Wed, Jan 16, 2019 at 21:28, Andrey Davydov <[email protected]> wrote:
>>
>> I created a small reproducer project. It is available at:
>> https://drive.google.com/file/d/1A2_i1YBI7OGtJM0b8bxuIJrTI61ZQXoR/view?usp=sharing
>>
>> The archive contains the project plus some logs and dumps.
>> There is only one DemoTest.java to run to reproduce the issue.
>> The test runs 3 Ignite nodes with a configuration very similar to our real project, without any logic or additional services.
>> IMPORTANT. As I investigated during reproduction, everything works fine when I run the test on a local drive (an SSD in the case of my laptop), but if I copy the project to an external HDD mounted over USB 3.0, the problem is present. So it may be a race condition triggered by slow I/O. Initially I got the problem when I built my project on the external drive.
>> Andrey.
>>
>>
>> On Wed, Jan 9, 2019 at 8:43 PM Andrey Davydov <[email protected]> wrote:
>>>
>>> Hello,
>>>
>>> I found in the test logs of my project that Ignite warns about a failed partition map exchange. In the test environment, 3 Ignite 2.7 server nodes run in the same JVM (Java 8) on Windows 10, using localhost networking.
>>>
>>> 2019-01-09 20:15:27,719 [sys-#164%TestNode-2%] INFO org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture:102 - Affinity changes applied in 10 ms.
>>> 2019-01-09 20:15:27,719 [sys-#163%TestNode-1%] INFO org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture:102 - Affinity changes applied in 10 ms.
>>> 2019-01-09 20:15:27,724 [sys-#164%TestNode-2%] INFO org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture:102 - Full map updating for 5 groups performed in 4 ms.
>>> 2019-01-09 20:15:27,724 [sys-#163%TestNode-1%] INFO org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture:102 - Full map updating for 5 groups performed in 5 ms.
>>> 2019-01-09 20:15:27,725 [sys-#163%TestNode-1%] INFO org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture:102 - Finish exchange future [startVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], resVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], err=null]
>>> 2019-01-09 20:15:27,725 [sys-#164%TestNode-2%] INFO org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture:102 - Finish exchange future [startVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], resVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], err=null]
>>> 2019-01-09 20:15:28,710 [db-checkpoint-thread-#157%TestNode-1%] INFO org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:102 - Checkpoint started [checkpointId=443748a9-c1a5-4b3b-96e4-04a0862829ec, startPtr=FileWALPointer [idx=0, fileOff=929726, len=31143], checkpointLockWait=0ms, checkpointLockHoldTime=6ms, walCpRecordFsyncDuration=248ms, pages=204, reason='node started']
>>> 2019-01-09 20:15:28,713 [db-checkpoint-thread-#151%TestNode-0%] INFO org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:102 - Checkpoint started [checkpointId=cbc928e1-4ecd-40ae-9791-c6ba20c3669b, startPtr=FileWALPointer [idx=0, fileOff=929726, len=31143], checkpointLockWait=0ms, checkpointLockHoldTime=8ms, walCpRecordFsyncDuration=257ms, pages=204, reason='node started']
>>> 2019-01-09 20:15:28,715 [db-checkpoint-thread-#146%TestNode-2%] INFO org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:102 - Checkpoint started [checkpointId=ef4c3d02-ca01-4d67-8128-48d4dc99aabc, startPtr=FileWALPointer [idx=0, fileOff=929726, len=31143], checkpointLockWait=0ms, checkpointLockHoldTime=22ms, walCpRecordFsyncDuration=289ms, pages=204, reason='node started']
>>> 2019-01-09 20:15:30,788 [db-checkpoint-thread-#157%TestNode-1%] INFO org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:102 - Checkpoint finished [cpId=443748a9-c1a5-4b3b-96e4-04a0862829ec, pages=204, markPos=FileWALPointer [idx=0, fileOff=929726, len=31143], walSegmentsCleared=0, walSegmentsCovered=[], markDuration=1103ms, pagesWrite=84ms, fsync=1992ms, total=3179ms]
>>> 2019-01-09 20:15:30,858 [db-checkpoint-thread-#151%TestNode-0%] INFO org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:102 - Checkpoint finished [cpId=cbc928e1-4ecd-40ae-9791-c6ba20c3669b, pages=204, markPos=FileWALPointer [idx=0, fileOff=929726, len=31143], walSegmentsCleared=0, walSegmentsCovered=[], markDuration=1213ms, pagesWrite=79ms, fsync=2066ms, total=3358ms]
>>> 2019-01-09 20:15:30,998 [db-checkpoint-thread-#146%TestNode-2%] INFO org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:102 - Checkpoint finished [cpId=ef4c3d02-ca01-4d67-8128-48d4dc99aabc, pages=204, markPos=FileWALPointer [idx=0, fileOff=929726, len=31143], walSegmentsCleared=0, walSegmentsCovered=[], markDuration=1262ms, pagesWrite=79ms, fsync=2203ms, total=3544ms]
>>> 2019-01-09 20:15:37,510 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.diagnostic:118 - Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], node=454d2051-cea6-4f2c-99a7-7c5698494175]. Dumping pending objects that might be the cause:
>>> 2019-01-09 20:15:37,510 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.diagnostic:118 - Ready affinity version: AffinityTopologyVersion [topVer=-1, minorTopVer=0]
>>> 2019-01-09 20:15:37,515 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.diagnostic:118 - Last exchange future: …
>>> 2019-01-09 20:15:37,515 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager:118 - First 10 pending exchange futures [total=0]
>>> 2019-01-09 20:15:37,518 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.diagnostic:118 - Last 10 exchange futures (total: 1):
>>> 2019-01-09 20:15:37,519 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.diagnostic:118 - >>> GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], evt=DISCOVERY_CUSTOM_EVT, evtNode=TcpDiscoveryNode [id=454d2051-cea6-4f2c-99a7-7c5698494175, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1547054137330, loc=true, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=false]
>>> 2019-01-09 20:15:37,520 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.diagnostic:118 - Latch manager state: ExchangeLatchManager [serverLatches={}, clientLatches={}]
>>> 2019-01-09 20:15:37,535 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.diagnostic:118 - Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], node=a8e212dc-303e-4bc6-8bb6-fa2bc4c0478d]. Dumping pending objects that might be the cause:
>>> 2019-01-09 20:15:37,535 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.diagnostic:118 - Ready affinity version: AffinityTopologyVersion [topVer=-1, minorTopVer=0]
>>> 2019-01-09 20:15:37,536 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.diagnostic:118 - Last exchange future:
>>> 2019-01-09 20:15:37,537 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager:118 - First 10 pending exchange futures [total=0]
>>> 2019-01-09 20:15:37,538 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.diagnostic:118 - Last 10 exchange futures (total: 1):
>>> 2019-01-09 20:15:37,539 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.diagnostic:118 - >>> GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], evt=DISCOVERY_CUSTOM_EVT, evtNode=TcpDiscoveryNode [id=454d2051-cea6-4f2c-99a7-7c5698494175, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1547054124462, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=false]
>>> 2019-01-09 20:15:37,540 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.diagnostic:118 - Latch manager state: ExchangeLatchManager [serverLatches={}, clientLatches={}]
>>>
>>> After that, everything works fine, but it causes every test to take more than 10 seconds to run. And I am afraid to use it in production before I understand the problem.
>>>
>>> Andrey.
