Hello! It's not a deadlock, it's just a (warning) timeout: all the work is expected to be done by this point, but due to slow I/O the cluster is still busy.
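For the test itself, here is a minimal sketch of what I mean (my own illustration, not your DemoTest): activation blocks until the exchange completes, so if the call returns, the dump was only diagnostic noise. The networkTimeout part is an assumption on my side: I have not verified that this particular dump interval is derived from it in 2.7, and raising it would only push the warning out, not speed anything up.

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class ActivationCheckSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setIgniteInstanceName("TestNode-0")
            // Assumption: a larger network timeout should only postpone the
            // "Failed to wait for partition map exchange" dump on a slow drive;
            // it does not change the exchange itself.
            .setNetworkTimeout(30_000);

        try (Ignite ignite = Ignition.start(cfg)) {
            // active(true) blocks until activation (and its partition map
            // exchange) has finished.
            ignite.cluster().active(true);

            // Reaching this point means the exchange completed; any earlier
            // WARN messages were only a timeout diagnostic.
            System.out.println("Cluster active: " + ignite.cluster().active());
        }
    }
}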
Regards,
--
Ilya Kasnacheev

Sat, 19 Jan 2019 at 21:51, Andrey Davydov <[email protected]>:

> Yes. It seems to be safe to use. It looks like some race causes a failure or
> deadlock on the first try of the exchange, and after the timeout Ignite
> runs a second try, which is successful. As I see in my tests (60 Ignite
> activations per build), the second try is stable.
> But it causes an unnecessary 10-15 second delay on each activation if the
> tests are run on a portable drive.
>
> On Thursday, January 17, 2019, Ilya Kasnacheev <[email protected]>
> wrote:
> > Hello!
> >> As I investigated during reproduction, everything works fine when I run
> >> the test on a local drive (an SSD in the case of my laptop), but if I
> >> copy the project to an external HDD mounted over USB 3.0, the problem is
> >> present. So it may be a race condition triggered by slow I/O. Initially
> >> I hit the problem when I built my project on an external drive.
> > "Failed to wait for PME" is not a problem but a symptom. It is a simple
> > timeout. If you are hitting this timeout due to slow I/O you will see PME
> > messages which will then go away. This is by design.
> > Nevertheless I will try to run your reproducer.
> >
> > Regards,
> > --
> > Ilya Kasnacheev
> >
> > Wed, 16 Jan 2019 at 21:28, Andrey Davydov <[email protected]>:
> >>
> >> I created a small reproducer project. It is available at:
> >> https://drive.google.com/file/d/1A2_i1YBI7OGtJM0b8bxuIJrTI61ZQXoR/view?usp=sharing
> >>
> >> The archive contains the project plus some logs and dumps.
> >> There is only one DemoTest.java to run to reproduce.
> >> The test runs 3 Ignite nodes with a configuration very similar to our
> >> real project, without any logic or additional services.
> >> IMPORTANT. As I investigated during reproduction, everything works fine
> >> when I run the test on a local drive (an SSD in the case of my laptop),
> >> but if I copy the project to an external HDD mounted over USB 3.0, the
> >> problem is present. So it may be a race condition triggered by slow I/O.
> >> Initially I hit the problem when I built my project on an external drive.
> >> Andrey.
> >>
> >>
> >> On Wed, Jan 9, 2019 at 8:43 PM Andrey Davydov <[email protected]>
> >> wrote:
> >>>
> >>> Hello,
> >>>
> >>> I found in the test logs of my project that Ignite warns about a failed
> >>> partition map exchange. In the test environment, 3 Ignite 2.7 server
> >>> nodes run in the same JVM 8 on Win10, using localhost networking.
> >>>
> >>> 2019-01-09 20:15:27,719 [sys-#164%TestNode-2%] INFO org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture:102 - Affinity changes applied in 10 ms.
> >>> 2019-01-09 20:15:27,719 [sys-#163%TestNode-1%] INFO org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture:102 - Affinity changes applied in 10 ms.
> >>> 2019-01-09 20:15:27,724 [sys-#164%TestNode-2%] INFO org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture:102 - Full map updating for 5 groups performed in 4 ms.
> >>> 2019-01-09 20:15:27,724 [sys-#163%TestNode-1%] INFO org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture:102 - Full map updating for 5 groups performed in 5 ms.
> >>> 2019-01-09 20:15:27,725 [sys-#163%TestNode-1%] INFO org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture:102 - Finish exchange future [startVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], resVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], err=null]
> >>> 2019-01-09 20:15:27,725 [sys-#164%TestNode-2%] INFO org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture:102 - Finish exchange future [startVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], resVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], err=null]
> >>> 2019-01-09 20:15:28,710 [db-checkpoint-thread-#157%TestNode-1%] INFO org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:102 - Checkpoint started [checkpointId=443748a9-c1a5-4b3b-96e4-04a0862829ec, startPtr=FileWALPointer [idx=0, fileOff=929726, len=31143], checkpointLockWait=0ms, checkpointLockHoldTime=6ms, walCpRecordFsyncDuration=248ms, pages=204, reason='node started']
> >>> 2019-01-09 20:15:28,713 [db-checkpoint-thread-#151%TestNode-0%] INFO org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:102 - Checkpoint started [checkpointId=cbc928e1-4ecd-40ae-9791-c6ba20c3669b, startPtr=FileWALPointer [idx=0, fileOff=929726, len=31143], checkpointLockWait=0ms, checkpointLockHoldTime=8ms, walCpRecordFsyncDuration=257ms, pages=204, reason='node started']
> >>> 2019-01-09 20:15:28,715 [db-checkpoint-thread-#146%TestNode-2%] INFO org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:102 - Checkpoint started [checkpointId=ef4c3d02-ca01-4d67-8128-48d4dc99aabc, startPtr=FileWALPointer [idx=0, fileOff=929726, len=31143], checkpointLockWait=0ms, checkpointLockHoldTime=22ms, walCpRecordFsyncDuration=289ms, pages=204, reason='node started']
> >>> 2019-01-09 20:15:30,788 [db-checkpoint-thread-#157%TestNode-1%] INFO org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:102 - Checkpoint finished [cpId=443748a9-c1a5-4b3b-96e4-04a0862829ec, pages=204, markPos=FileWALPointer [idx=0, fileOff=929726, len=31143], walSegmentsCleared=0, walSegmentsCovered=[], markDuration=1103ms, pagesWrite=84ms, fsync=1992ms, total=3179ms]
> >>> 2019-01-09 20:15:30,858 [db-checkpoint-thread-#151%TestNode-0%] INFO org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:102 - Checkpoint finished [cpId=cbc928e1-4ecd-40ae-9791-c6ba20c3669b, pages=204, markPos=FileWALPointer [idx=0, fileOff=929726, len=31143], walSegmentsCleared=0, walSegmentsCovered=[], markDuration=1213ms, pagesWrite=79ms, fsync=2066ms, total=3358ms]
> >>> 2019-01-09 20:15:30,998 [db-checkpoint-thread-#146%TestNode-2%] INFO org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager:102 - Checkpoint finished [cpId=ef4c3d02-ca01-4d67-8128-48d4dc99aabc, pages=204, markPos=FileWALPointer [idx=0, fileOff=929726, len=31143], walSegmentsCleared=0, walSegmentsCovered=[], markDuration=1262ms, pagesWrite=79ms, fsync=2203ms, total=3544ms]
> >>> 2019-01-09 20:15:37,510 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.diagnostic:118 - Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], node=454d2051-cea6-4f2c-99a7-7c5698494175]. Dumping pending objects that might be the cause:
> >>> 2019-01-09 20:15:37,510 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.diagnostic:118 - Ready affinity version: AffinityTopologyVersion [topVer=-1, minorTopVer=0]
> >>> 2019-01-09 20:15:37,515 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.diagnostic:118 - Last exchange future: …
> >>> 2019-01-09 20:15:37,515 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager:118 - First 10 pending exchange futures [total=0]
> >>> 2019-01-09 20:15:37,518 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.diagnostic:118 - Last 10 exchange futures (total: 1):
> >>> 2019-01-09 20:15:37,519 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.diagnostic:118 - >>> GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], evt=DISCOVERY_CUSTOM_EVT, evtNode=TcpDiscoveryNode [id=454d2051-cea6-4f2c-99a7-7c5698494175, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1547054137330, loc=true, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=false]
> >>> 2019-01-09 20:15:37,520 [exchange-worker-#44%TestNode-0%] WARN org.apache.ignite.internal.diagnostic:118 - Latch manager state: ExchangeLatchManager [serverLatches={}, clientLatches={}]
> >>> 2019-01-09 20:15:37,535 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.diagnostic:118 - Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], node=a8e212dc-303e-4bc6-8bb6-fa2bc4c0478d]. Dumping pending objects that might be the cause:
> >>> 2019-01-09 20:15:37,535 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.diagnostic:118 - Ready affinity version: AffinityTopologyVersion [topVer=-1, minorTopVer=0]
> >>> 2019-01-09 20:15:37,536 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.diagnostic:118 - Last exchange future:
> >>> 2019-01-09 20:15:37,537 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager:118 - First 10 pending exchange futures [total=0]
> >>> 2019-01-09 20:15:37,538 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.diagnostic:118 - Last 10 exchange futures (total: 1):
> >>> 2019-01-09 20:15:37,539 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.diagnostic:118 - >>> GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], evt=DISCOVERY_CUSTOM_EVT, evtNode=TcpDiscoveryNode [id=454d2051-cea6-4f2c-99a7-7c5698494175, addrs=[127.0.0.1], sockAddrs=[/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1547054124462, loc=false, ver=2.7.0#20181130-sha1:256ae401, isClient=false], done=false]
> >>> 2019-01-09 20:15:37,540 [exchange-worker-#132%TestNode-2%] WARN org.apache.ignite.internal.diagnostic:118 - Latch manager state: ExchangeLatchManager [serverLatches={}, clientLatches={}]
> >>>
> >>> After that, everything works fine, but it causes every test to take more
> >>> than 10 seconds to run. And I am afraid to use it in production before I
> >>> understand the problem.
> >>>
> >>> Andrey.
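P.S. For anyone else reproducing this, below is roughly the kind of setup the quoted thread describes (three persistent server nodes in one JVM over localhost, then activation). This is my own sketch, not the actual DemoTest.java; the instance names, work directory layout and discovery port range are placeholders.

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class ThreeNodeActivationSketch {
    public static void main(String[] args) {
        Ignite first = null;

        // Three server nodes in the same JVM, discovering each other over localhost.
        for (int i = 0; i < 3; i++) {
            TcpDiscoverySpi discovery = new TcpDiscoverySpi()
                .setIpFinder(new TcpDiscoveryVmIpFinder()
                    .setAddresses(Collections.singletonList("127.0.0.1:47500..47502")));

            IgniteConfiguration cfg = new IgniteConfiguration()
                .setIgniteInstanceName("TestNode-" + i)
                // Separate work directory per node; when this directory sits on a
                // slow (e.g. USB) drive, the checkpoint/WAL fsync appears to be what
                // stretches activation and triggers the PME warning.
                .setWorkDirectory(System.getProperty("java.io.tmpdir") + "/ignite-demo/node-" + i)
                .setDataStorageConfiguration(new DataStorageConfiguration()
                    .setDefaultDataRegionConfiguration(new DataRegionConfiguration()
                        .setPersistenceEnabled(true)))
                .setDiscoverySpi(discovery);

            Ignite node = Ignition.start(cfg);

            if (first == null)
                first = node;
        }

        // With native persistence the cluster starts inactive; activation is what
        // kicks off the partition map exchange seen in the logs above.
        first.cluster().active(true);

        Ignition.stopAll(true);
    }
}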
