[jira] [Commented] (IGNITE-14197) Checkpoint thread can't take checkpoint write lock because it waits for parked threads to complete their work
[ https://issues.apache.org/jira/browse/IGNITE-14197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17507667#comment-17507667 ] Anton Kalashnikov commented on IGNITE-14197: It is actually a good question. I remember that we discussed that but I can't find the decision about closing it. Perhaps, we expected to fix this problem in a different ticket but I don't see a linked ticket here as well. [~sergey-chugunov] or [~ibessonov] can you check how relevant is this task? and if it is we can reopen the PR > Checkpoint thread can't take checkpoint write lock because it waits for > parked threads to complete their work > - > > Key: IGNITE-14197 > URL: https://issues.apache.org/jira/browse/IGNITE-14197 > Project: Ignite > Issue Type: Bug >Reporter: Anton Kalashnikov >Assignee: Anton Kalashnikov >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > In case of enabled write throttling, when, for example, node parks data > streamer thread, it still holds checkpoint read lock and it leads to the long > pauses on waiting for checkpoint lock: > [2020-07-23 07:09:21,614][INFO > ][db-checkpoint-thread-#371][GridCacheDatabaseSharedManager] Checkpoint > started [checkpointId=f964c8f2-daa5-41b2-80ef-944326f26f8a, > startPtr=FileWALPointer [idx=56913, fileOff=10362905, len=41972], > checkpointBeforeLockTime=1983ms, *checkpointLockWait=812117ms*, > checkpointListenersExecuteTime=90ms, checkpointLockHoldTime=93ms, > walCpRecordFsyncDuration=123ms, writeCheckpointEntryDuration=4ms, > splitAndSortCpPagesDuration=4155ms, pages=10516815, reason='too big size of > WAL without checkpoint'] > All operations at this moment are blocked. 
> Sometimes, it can lead to a complete disaster: > Parking thread=data-streamer-stripe-47-#144 for timeout(ms)=*21278855* > {quote}“data-streamer-stripe-78-#175” #209 prio=5 os_prio=0 > tid=0x7f6161d6a800 nid=0xf932 waiting on condition [0x7f5c292d1000] > java.lang.Thread.State: TIMED_WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338) > at > org.apache.ignite.internal.processors.cache.persistence.pagemem.PagesWriteSpeedBasedThrottle.doPark(PagesWriteSpeedBasedThrottle.java:244) > at > org.apache.ignite.internal.processors.cache.persistence.pagemem.PagesWriteSpeedBasedThrottle.onMarkDirty(PagesWriteSpeedBasedThrottle.java:227) > at > org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlockPage(PageMemoryImpl.java:1730) > at > org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:491) > at > org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:483) > at > org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writeUnlock(PageHandler.java:394) > at > org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writePage(PageHandler.java:369) > at > org.apache.ignite.internal.processors.cache.persistence.DataStructure.write(DataStructure.java:296) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$11300(BPlusTree.java:98) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Put.tryInsert(BPlusTree.java:3864) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Put.access$7100(BPlusTree.java:3544) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.onNotFound(BPlusTree.java:4103) > at > 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$5800(BPlusTree.java:3894) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2022) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1997) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1904) > at > org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1662) > at > org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1645) > at > org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:2473) > at > org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:436) > at > org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:4306) > at
[jira] [Created] (IGNITE-14197) Checkpoint thread can't take checkpoint write lock because it waits for parked threads to complete their work
Anton Kalashnikov created IGNITE-14197: -- Summary: Checkpoint thread can't take checkpoint write lock because it waits for parked threads to complete their work Key: IGNITE-14197 URL: https://issues.apache.org/jira/browse/IGNITE-14197 Project: Ignite Issue Type: Bug Reporter: Anton Kalashnikov Assignee: Anton Kalashnikov In case of enabled write throttling, when, for example, node parks data streamer thread, it still holds checkpoint read lock and it leads to the long pauses on waiting for checkpoint lock: [2020-07-23 07:09:21,614][INFO ][db-checkpoint-thread-#371][GridCacheDatabaseSharedManager] Checkpoint started [checkpointId=f964c8f2-daa5-41b2-80ef-944326f26f8a, startPtr=FileWALPointer [idx=56913, fileOff=10362905, len=41972], checkpointBeforeLockTime=1983ms, *checkpointLockWait=812117ms*, checkpointListenersExecuteTime=90ms, checkpointLockHoldTime=93ms, walCpRecordFsyncDuration=123ms, writeCheckpointEntryDuration=4ms, splitAndSortCpPagesDuration=4155ms, pages=10516815, reason='too big size of WAL without checkpoint'] All operations at this moment are blocked. 
Sometimes, it can lead to a complete disaster: Parking thread=data-streamer-stripe-47-#144 for timeout(ms)=*21278855* {quote}“data-streamer-stripe-78-#175” #209 prio=5 os_prio=0 tid=0x7f6161d6a800 nid=0xf932 waiting on condition [0x7f5c292d1000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338) at org.apache.ignite.internal.processors.cache.persistence.pagemem.PagesWriteSpeedBasedThrottle.doPark(PagesWriteSpeedBasedThrottle.java:244) at org.apache.ignite.internal.processors.cache.persistence.pagemem.PagesWriteSpeedBasedThrottle.onMarkDirty(PagesWriteSpeedBasedThrottle.java:227) at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlockPage(PageMemoryImpl.java:1730) at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:491) at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.writeUnlock(PageMemoryImpl.java:483) at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writeUnlock(PageHandler.java:394) at org.apache.ignite.internal.processors.cache.persistence.tree.util.PageHandler.writePage(PageHandler.java:369) at org.apache.ignite.internal.processors.cache.persistence.DataStructure.write(DataStructure.java:296) at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$11300(BPlusTree.java:98) at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Put.tryInsert(BPlusTree.java:3864) at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Put.access$7100(BPlusTree.java:3544) at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.onNotFound(BPlusTree.java:4103) at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$5800(BPlusTree.java:3894) at 
org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2022) at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:1997) at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1904) at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1662) at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1645) at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.invoke(GridCacheOffheapManager.java:2473) at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:436) at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.storeValue(GridCacheMapEntry.java:4306) at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.initialValue(GridCacheMapEntry.java:3441) at org.apache.ignite.internal.processors.cache.GridCacheEntryEx.initialValue(GridCacheEntryEx.java:770) at org.apache.ignite.internal.processors.datastreamer.DataStreamerImpl$IsolatedUpdater.receive(DataStreamerImpl.java:2278) at org.apache.ignite.internal.processors.datastreamer.DataStreamerUpdateJob.call(DataStreamerUpdateJob.java:139) at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:7104) at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:966) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119) at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:559) at
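The effect described in IGNITE-14197 can be reproduced in miniature with plain `java.util.concurrent` primitives. The following is only a sketch under stated assumptions, not Ignite code: a `ReentrantReadWriteLock` stands in for the checkpoint lock, and the class `CheckpointLockSketch`, its method and its parameters are hypothetical. It shows that a "checkpointer" cannot take the write lock while a throttled thread is parked with the read lock held, and that it can once the read lock is released around the park. Whether the real fix takes this route is not stated in the ticket.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for the checkpoint lock; this is not Ignite code. */
final class CheckpointLockSketch {
    private final ReentrantReadWriteLock checkpointLock = new ReentrantReadWriteLock();

    /**
     * Starts a "data streamer" thread that is throttled (parked) for up to parkMillis and
     * reports whether a "checkpointer" could take the write lock within waitMillis.
     */
    boolean writerAcquiresWhileThrottled(boolean releaseBeforePark, long parkMillis, long waitMillis) {
        CountDownLatch parked = new CountDownLatch(1);
        AtomicBoolean wakeUp = new AtomicBoolean();

        Thread streamer = new Thread(() -> {
            checkpointLock.readLock().lock();
            try {
                if (releaseBeforePark)
                    checkpointLock.readLock().unlock(); // let the checkpointer in while throttled
                try {
                    parked.countDown();
                    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(parkMillis);
                    long left;
                    // Loop because parkNanos may return spuriously.
                    while (!wakeUp.get() && (left = deadline - System.nanoTime()) > 0)
                        LockSupport.parkNanos(left);
                }
                finally {
                    if (releaseBeforePark)
                        checkpointLock.readLock().lock(); // re-acquire before continuing the update
                }
            }
            finally {
                checkpointLock.readLock().unlock();
            }
        }, "data-streamer-sketch");

        streamer.start();

        try {
            parked.await();
            boolean acquired = checkpointLock.writeLock().tryLock(waitMillis, TimeUnit.MILLISECONDS);
            if (acquired)
                checkpointLock.writeLock().unlock();
            return acquired;
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        finally {
            // Wake the throttled thread so the sketch terminates quickly.
            wakeUp.set(true);
            LockSupport.unpark(streamer);
            try {
                streamer.join();
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
```

Calling `writerAcquiresWhileThrottled(false, 2000, 200)` models the reported behaviour (the write lock times out), while passing `true` models a throttle that does not hold the read lock during the pause.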
[jira] [Commented] (IGNITE-13761) Implement Segmented-LRU and CLOCK page replacement algorithms
[ https://issues.apache.org/jira/browse/IGNITE-13761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17285289#comment-17285289 ] Anton Kalashnikov commented on IGNITE-13761: [~alex_pl] thanks for your changes. It looks good to me too. > Implement Segmented-LRU and CLOCK page replacement algorithms > - > > Key: IGNITE-13761 > URL: https://issues.apache.org/jira/browse/IGNITE-13761 > Project: Ignite > Issue Type: Improvement >Reporter: Aleksey Plekhanov >Assignee: Aleksey Plekhanov >Priority: Major > Labels: iep-62 > Attachments: GetBenchmark.zip, PutBenchmark.zip > > Time Spent: 2h 40m > Remaining Estimate: 0h > > See IEP-62 for more details. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-14139) Incorrect initialize checkpoint-runner-cpu thread pool
[ https://issues.apache.org/jira/browse/IGNITE-14139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283024#comment-17283024 ]

Anton Kalashnikov commented on IGNITE-14139:

[~v.pyatkov] it looks good to me. [~ibessonov] can you help with the merge, please?

> Incorrect initialize checkpoint-runner-cpu thread pool
>
> Key: IGNITE-14139
> URL: https://issues.apache.org/jira/browse/IGNITE-14139
> Project: Ignite
> Issue Type: Bug
> Reporter: Vladislav Pyatkov
> Assignee: Vladislav Pyatkov
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The first initialization of the checkpoint thread pool for CPU is incorrect.
> Look at the constructor of {{CheckpointWorkflow}}:
> At the start, we initialize the pool:
> {code:java}
> this.checkpointCollectPagesInfoPool = initializeCheckpointPool();
> {code}
> and only afterwards do we set the size of the pool:
> {code:java}
> this.checkpointCollectInfoThreads = checkpointCollectInfoThreads;
> {code}
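The ordering problem in IGNITE-14139 can be illustrated outside Ignite. This is a hypothetical reduction, assuming that `initializeCheckpointPool()` reads the `checkpointCollectInfoThreads` field; the classes `Buggy` and `Fixed` and the `poolSize` field are invented for the sketch. In Java, a field that has not been assigned yet still holds its default value (0 for `int`) when an instance method reads it from the constructor.

```java
/** Hypothetical reduction of the ordering problem; the names only mirror the ticket. */
final class InitOrderSketch {
    /** Order from the ticket: the pool is sized before the thread count is stored. */
    static final class Buggy {
        private final int checkpointCollectInfoThreads;
        final int poolSize;

        Buggy(int checkpointCollectInfoThreads) {
            this.poolSize = initializeCheckpointPool(); // field is still 0 here
            this.checkpointCollectInfoThreads = checkpointCollectInfoThreads;
        }

        /** Stands in for pool creation; returns the size the pool would get. */
        private int initializeCheckpointPool() {
            return checkpointCollectInfoThreads;
        }
    }

    /** Corrected order: store the thread count first, then size the pool. */
    static final class Fixed {
        private final int checkpointCollectInfoThreads;
        final int poolSize;

        Fixed(int checkpointCollectInfoThreads) {
            this.checkpointCollectInfoThreads = checkpointCollectInfoThreads;
            this.poolSize = initializeCheckpointPool();
        }

        private int initializeCheckpointPool() {
            return checkpointCollectInfoThreads;
        }
    }
}
```

With an argument of 4, `Buggy` ends up with a pool size of 0 and `Fixed` with 4. What the real method does with a zero thread count is not described in the ticket.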
[jira] [Commented] (IGNITE-14055) Deadlock in timeoutObjectProcessor between 'send message' & 'handshake timeout'
[ https://issues.apache.org/jira/browse/IGNITE-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17280843#comment-17280843 ] Anton Kalashnikov commented on IGNITE-14055: [~ibessonov] can you take a look and merge please? > Deadlock in timeoutObjectProcessor between 'send message' & 'handshake > timeout' > --- > > Key: IGNITE-14055 > URL: https://issues.apache.org/jira/browse/IGNITE-14055 > Project: Ignite > Issue Type: Bug >Reporter: Anton Kalashnikov >Assignee: Anton Kalashnikov >Priority: Major > Attachments: StartServerWithTxPuts (1).java, freeze (1).sh > > Time Spent: 20m > Remaining Estimate: 0h > > Cluster hangs after jvm pauses on one of server nodes. > Scenario: > 1. Start three server nodes with put operations using StartServerWithTxPuts. > 2. Emulate jvm freezes on one server node by running the attached script: > {{*sh freeze.sh *}} > 3. Wait until the script has finished. > Result: > The cluster hangs on tx put operations. > The first server node continuously prints: > {noformat} > [2020-11-03 09:36:01,719][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57714][2020-11-03 09:36:01,720][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:01,922][INFO > ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57716][2020-11-03 09:36:01,922][INFO > ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting 
[locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,124][INFO > ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57718][2020-11-03 09:36:02,125][INFO > ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,326][INFO > ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57720][2020-11-03 09:36:02,327][INFO > ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,528][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57722][2020-11-03 09:36:02,529][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3]}} > {noformat} > The second node prints long running transactions in prepared state ignoring > the default tx timeout: > > {noformat} > [2020-11-03 09:36:46,199][WARN > ][sys-#83%56b4f715-82d6-4d63-ba99-441ffcd673b4%][diagnostic] >>> Future > 
[startTime=09:33:08.496, curTime=09:36:46.181, fut=GridNearTxFinishFuture > [futId=425decc8571-4ce98554-8c56-4daf-a7a9-5b9bff52fa08, tx=GridNearTxLocal > [mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping > [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey > [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], > cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl > [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], > oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, > ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, > dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, > filtersSet=true,
[jira] [Updated] (IGNITE-14110) Create networking module
[ https://issues.apache.org/jira/browse/IGNITE-14110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-14110:
---
Labels: iep-66 (was: )

> Create networking module
>
> Key: IGNITE-14110
> URL: https://issues.apache.org/jira/browse/IGNITE-14110
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Assignee: Anton Kalashnikov
> Priority: Major
> Labels: iep-66
>
> A networking module needs to be created, with an initial API and a simple
> implementation as a basis for further improvement.
[jira] [Created] (IGNITE-14110) Create networking module
Anton Kalashnikov created IGNITE-14110:
--

Summary: Create networking module
Key: IGNITE-14110
URL: https://issues.apache.org/jira/browse/IGNITE-14110
Project: Ignite
Issue Type: Sub-task
Reporter: Anton Kalashnikov
Assignee: Anton Kalashnikov

A networking module needs to be created, with an initial API and a simple implementation as a basis for further improvement.
[jira] [Updated] (IGNITE-14091) Implement messaging service
[ https://issues.apache.org/jira/browse/IGNITE-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-14091:
---
Labels: iep-66 ignite-3 (was: )

> Implement messaging service
>
> Key: IGNITE-14091
> URL: https://issues.apache.org/jira/browse/IGNITE-14091
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Priority: Major
> Labels: iep-66, ignite-3
>
> The ability to send messages to and receive messages from network members
> needs to be implemented:
> * there is a requirement to be able to send idempotent messages with very
> weak guarantees:
> ** no delivery guarantees required;
> ** multiple copies of the same message might be sent;
> ** no need to have any kind of acknowledgement;
> * there is another requirement for common use:
> ** a message must be sent exactly once, with an acknowledgement that it has
> actually been received (not necessarily processed);
> ** messages must be received in the same order they were sent.
> These types of messages might utilize the current recovery protocol with acks
> every 32 (or so) messages. This setting must be flexible enough so that we
> won't get OOM in big topologies.
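To make the "acks every 32 (or so) messages" idea from IGNITE-14091 concrete, here is a minimal sketch. It is not the existing Ignite recovery protocol: the classes `Sender` and `Receiver`, the `ackThreshold` parameter and the use of `-1` for "no ack due" are all assumptions made for illustration. The sender keeps each guaranteed message until it is acknowledged, so the acknowledgement interval bounds how much is retained per connection, which is the memory concern the ticket raises for big topologies.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of batched acknowledgements; not the existing Ignite protocol. */
final class AckBatchingSketch {
    /** Sender side: retains every guaranteed message until the receiver confirms it. */
    static final class Sender {
        private final Deque<String> unacked = new ArrayDeque<>();
        private long ackedCnt;

        void send(String msg) {
            unacked.addLast(msg); // a real sender would also write msg to the wire here
        }

        /** Drops messages covered by the cumulative count reported by the receiver. */
        void onAck(long receivedCnt) {
            while (ackedCnt < receivedCnt && !unacked.isEmpty()) {
                unacked.removeFirst();
                ackedCnt++;
            }
        }

        int unackedCount() {
            return unacked.size();
        }
    }

    /** Receiver side: acknowledges once per ackThreshold messages. */
    static final class Receiver {
        private final int ackThreshold;
        private long receivedCnt;

        Receiver(int ackThreshold) {
            if (ackThreshold <= 0)
                throw new IllegalArgumentException("ackThreshold must be positive");
            this.ackThreshold = ackThreshold;
        }

        /** Returns the cumulative count to acknowledge, or -1 if no ack is due yet. */
        long onMessage(String msg) {
            receivedCnt++;
            return receivedCnt % ackThreshold == 0 ? receivedCnt : -1;
        }
    }
}
```

A smaller `ackThreshold` reduces the number of retained messages per connection at the cost of more acknowledgement traffic, which is the trade-off the configurable setting would expose.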
[jira] [Updated] (IGNITE-14092) Design network address resolver
[ https://issues.apache.org/jira/browse/IGNITE-14092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-14092:
---
Labels: iep-66 ignite-3 (was: )

> Design network address resolver
>
> Key: IGNITE-14092
> URL: https://issues.apache.org/jira/browse/IGNITE-14092
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Priority: Major
> Labels: iep-66, ignite-3
>
> A network address resolver/IP finder/discovery needs to be designed to help
> choose the right IP/port for a connection. Perhaps we don't need such a
> service at all, but that should be explicitly agreed.
[jira] [Updated] (IGNITE-14081) Networking module
[ https://issues.apache.org/jira/browse/IGNITE-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov updated IGNITE-14081: --- Labels: iep-66 ignite-3 (was: ignite-3) > Networking module > - > > Key: IGNITE-14081 > URL: https://issues.apache.org/jira/browse/IGNITE-14081 > Project: Ignite > Issue Type: New Feature >Reporter: Anton Kalashnikov >Priority: Major > Labels: iep-66, ignite-3 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (IGNITE-14090) Networking API
[ https://issues.apache.org/jira/browse/IGNITE-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-14090:
---
Labels: iep-66 ignite-3 (was: )

> Networking API
>
> Key: IGNITE-14090
> URL: https://issues.apache.org/jira/browse/IGNITE-14090
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Priority: Major
> Labels: iep-66, ignite-3
>
> A convenient public API needs to be designed for the networking module,
> allowing callers to get information about network members and to send
> messages to and receive messages from them.
> Draft:
> {noformat}
> public interface NetworkService {
>     static NetworkService create(NetworkConfiguration cfg);
>     void shutdown() throws ???;
>     NetworkMember localMember();
>     Collection remoteMembers();
>     void weakSend(NetworkMember member, Message msg);
>     Future guaranteedSend(NetworkMember member, Message msg);
>     void listenMembers(MembershipListener lsnr);
>     void listenMessages(Consumer lsnr);
> }
> public interface MembershipListener {
>     void onAppeared(NetworkMember member);
>     void onDisappeared(NetworkMember member);
>     void onAcceptedByGroup(List remoteMembers);
> }
> public interface NetworkMember {
>     UUID id();
> }
> {noformat}
[jira] [Updated] (IGNITE-14089) Override scalecube internal message by custom one
[ https://issues.apache.org/jira/browse/IGNITE-14089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-14089:
---
Labels: iep-66 ignite-3 (was: )

> Override scalecube internal message by custom one
>
> Key: IGNITE-14089
> URL: https://issues.apache.org/jira/browse/IGNITE-14089
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Priority: Major
> Labels: iep-66, ignite-3
>
> The networking module has some custom logic, such as a specific handshake
> and message recovery, which requires its own message types, while the
> default scalecube behaviour must keep working correctly. So one layer of
> logic needs to be implemented on top of the other.
[jira] [Updated] (IGNITE-14088) Implement scalecube transport API over netty
[ https://issues.apache.org/jira/browse/IGNITE-14088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-14088:
---
Labels: iep-66 ignite-3 (was: )

> Implement scalecube transport API over netty
>
> Key: IGNITE-14088
> URL: https://issues.apache.org/jira/browse/IGNITE-14088
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Priority: Major
> Labels: iep-66, ignite-3
>
> scalecube has its own netty inside, but the idea is to integrate our
> extended netty into it. This will help us support more features, such as
> our own handshake, marshalling, etc.
[jira] [Updated] (IGNITE-14085) Implement message recovery protocol over handshake
[ https://issues.apache.org/jira/browse/IGNITE-14085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-14085:
---
Labels: iep-66 ignite-3 (was: )

> Implement message recovery protocol over handshake
>
> Key: IGNITE-14085
> URL: https://issues.apache.org/jira/browse/IGNITE-14085
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Priority: Major
> Labels: iep-66, ignite-3
>
> The central idea of the recovery protocol is the same as in the current
> implementation, so a similar approach based on the recovery descriptor
> needs to be implemented. This means that information about the last
> sent/received messages should be exchanged during the handshake and,
> according to this information, messages which were not received should be
> sent one more time.
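The resend-on-handshake idea from IGNITE-14085 can be sketched as follows. This is an illustration under assumptions, not the Ignite 2.x recovery descriptor: the class `RecoveryDescriptorSketch`, its methods and the choice of a cumulative "received count" as the handshake payload are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical recovery descriptor for one connection; not the Ignite 2.x class. */
final class RecoveryDescriptorSketch {
    /** Messages sent but not yet confirmed, in send order. */
    private final Deque<String> unacked = new ArrayDeque<>();

    /** Number of messages the remote side is known to have received. */
    private long ackedCnt;

    /** Remembers a message so that it can be resent after a reconnect. */
    void onSend(String msg) {
        unacked.addLast(msg);
    }

    /**
     * Called when a new connection is established and the remote side has reported,
     * in its handshake, how many messages it received on the previous connection.
     *
     * @return messages that must be sent again, in their original order.
     */
    List<String> onHandshake(long remoteReceivedCnt) {
        while (ackedCnt < remoteReceivedCnt && !unacked.isEmpty()) {
            unacked.removeFirst();
            ackedCnt++;
        }
        return new ArrayList<>(unacked);
    }
}
```

For example, if five messages were sent and the handshake reports that three were received, the last two are returned for resending. Resending in the original order also preserves the ordering requirement stated for guaranteed messages.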
[jira] [Updated] (IGNITE-14084) Integrate direct marshalling to networking
[ https://issues.apache.org/jira/browse/IGNITE-14084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-14084:
---
Labels: iep-66 ignite-3 (was: )

> Integrate direct marshalling to networking
>
> Key: IGNITE-14084
> URL: https://issues.apache.org/jira/browse/IGNITE-14084
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Priority: Major
> Labels: iep-66, ignite-3
>
> Direct marshalling can be extracted from Ignite 2.x and integrated into
> Ignite 3.0. It helps to avoid extra data copying when sending/receiving
> messages.
[jira] [Updated] (IGNITE-14086) Implement retry of establishing connection if it was lost
[ https://issues.apache.org/jira/browse/IGNITE-14086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-14086:
---
Labels: iep-66 ignite-3 (was: )

> Implement retry of establishing connection if it was lost
>
> Key: IGNITE-14086
> URL: https://issues.apache.org/jira/browse/IGNITE-14086
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Priority: Major
> Labels: iep-66, ignite-3
>
> Retrying connection establishment needs to be implemented. It is not clear
> which way is best, because the current implementation is too difficult to
> configure (number of retries, several retry-time properties). So a better
> way to configure it needs to be devised and then implemented.
> Perhaps scalecube (gossip protocol) already does all this work and nothing
> needs to be done here. This needs to be rechecked.
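One possible way to reduce the configuration surface mentioned in IGNITE-14086 is to derive the retry schedule from a small number of values. The sketch below is only one option and is not taken from the ticket: the class `ReconnectPolicySketch`, its three parameters and the doubling backoff are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reconnect policy: exponential backoff bounded by a total time budget. */
final class ReconnectPolicySketch {
    private final long initialDelayMs;
    private final long maxDelayMs;
    private final long totalTimeoutMs;

    ReconnectPolicySketch(long initialDelayMs, long maxDelayMs, long totalTimeoutMs) {
        if (initialDelayMs <= 0 || maxDelayMs < initialDelayMs || totalTimeoutMs < 0)
            throw new IllegalArgumentException("Invalid reconnect policy parameters");
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.totalTimeoutMs = totalTimeoutMs;
    }

    /**
     * Delays to wait before each reconnect attempt. The delay doubles after every attempt,
     * is capped at maxDelayMs, and attempts stop once the total budget would be exceeded.
     */
    List<Long> delays() {
        List<Long> res = new ArrayList<>();
        long delay = initialDelayMs;
        long spent = 0;

        while (spent + delay <= totalTimeoutMs) {
            res.add(delay);
            spent += delay;
            delay = Math.min(delay * 2, maxDelayMs);
        }

        return res;
    }
}
```

With this shape, the number of retries is not configured directly; it follows from the time budget, which may be easier to reason about than separate retry-count and retry-time properties.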
[jira] [Updated] (IGNITE-14083) Add SSL support to networking
[ https://issues.apache.org/jira/browse/IGNITE-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-14083:
---
Labels: iep-66 ignite-3 (was: )

> Add SSL support to networking
>
> Key: IGNITE-14083
> URL: https://issues.apache.org/jira/browse/IGNITE-14083
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Priority: Major
> Labels: iep-66, ignite-3
>
> The ability to establish an SSL connection needs to be added. It looks like
> this should not be a problem, but at the very least a configuration for
> managing SSL (path to certificate, password, etc.) needs to be designed.
[jira] [Updated] (IGNITE-14082) Implementation of handshake for new connection
[ https://issues.apache.org/jira/browse/IGNITE-14082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-14082:
---
Labels: iep-66 ignite-3 (was: )

> Implementation of handshake for new connection
>
> Key: IGNITE-14082
> URL: https://issues.apache.org/jira/browse/IGNITE-14082
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Priority: Major
> Labels: iep-66, ignite-3
>
> A handshake needs to be implemented for the moment after netty establishes
> the connection. Perhaps it makes sense to use netty handlers. During the
> handshake, the instanceId needs to be exchanged between the endpoints.
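The instanceId exchange from IGNITE-14082 could look roughly like this at the byte level. The frame layout (a 4-byte marker followed by the two halves of a `UUID`), the `MAGIC` constant and the class `HandshakeSketch` are assumptions made for illustration; the ticket does not define a wire format, and a real implementation would presumably sit inside netty handlers rather than use `ByteBuffer` directly.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

/** Hypothetical handshake frame carrying the sender's instanceId. */
final class HandshakeSketch {
    /** Arbitrary marker chosen for this sketch (ASCII "IGNT"). */
    static final int MAGIC = 0x49474E54;

    /** Marker plus the two 64-bit halves of the UUID. */
    static final int FRAME_SIZE = Integer.BYTES + 2 * Long.BYTES;

    /** Encodes the local instanceId into a frame that is ready to be written. */
    static ByteBuffer encode(UUID instanceId) {
        ByteBuffer buf = ByteBuffer.allocate(FRAME_SIZE);
        buf.putInt(MAGIC)
            .putLong(instanceId.getMostSignificantBits())
            .putLong(instanceId.getLeastSignificantBits());
        buf.flip();
        return buf;
    }

    /** Decodes the remote instanceId, rejecting incomplete or foreign frames. */
    static UUID decode(ByteBuffer buf) {
        if (buf.remaining() < FRAME_SIZE)
            throw new IllegalArgumentException("Incomplete handshake frame");
        if (buf.getInt() != MAGIC)
            throw new IllegalArgumentException("Not a handshake frame");
        return new UUID(buf.getLong(), buf.getLong());
    }
}
```

Each endpoint would send `encode(localInstanceId)` once the connection is up and treat the connection as usable only after `decode` succeeds on the peer's frame.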
[jira] [Updated] (IGNITE-14081) Networking module
[ https://issues.apache.org/jira/browse/IGNITE-14081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov updated IGNITE-14081: --- Labels: ignite-3 (was: ) > Networking module > - > > Key: IGNITE-14081 > URL: https://issues.apache.org/jira/browse/IGNITE-14081 > Project: Ignite > Issue Type: New Feature >Reporter: Anton Kalashnikov >Priority: Major > Labels: ignite-3 > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14092) Design network address resolver
Anton Kalashnikov created IGNITE-14092:
--

Summary: Design network address resolver
Key: IGNITE-14092
URL: https://issues.apache.org/jira/browse/IGNITE-14092
Project: Ignite
Issue Type: Sub-task
Reporter: Anton Kalashnikov

A network address resolver/IP finder/discovery needs to be designed to help choose the right IP/port for a connection. Perhaps we don't need such a service at all, but that should be explicitly agreed.
[jira] [Created] (IGNITE-14091) Implement messaging service
Anton Kalashnikov created IGNITE-14091:
--

Summary: Implement messaging service
Key: IGNITE-14091
URL: https://issues.apache.org/jira/browse/IGNITE-14091
Project: Ignite
Issue Type: Sub-task
Reporter: Anton Kalashnikov

The ability to send messages to and receive messages from network members needs to be implemented:
* there is a requirement to be able to send idempotent messages with very weak guarantees:
** no delivery guarantees required;
** multiple copies of the same message might be sent;
** no need to have any kind of acknowledgement;
* there is another requirement for common use:
** a message must be sent exactly once, with an acknowledgement that it has actually been received (not necessarily processed);
** messages must be received in the same order they were sent.

These types of messages might utilize the current recovery protocol with acks every 32 (or so) messages. This setting must be flexible enough so that we won't get OOM in big topologies.
[jira] [Updated] (IGNITE-14090) Networking API
[ https://issues.apache.org/jira/browse/IGNITE-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov updated IGNITE-14090: --- Description: It needs to design convinient public API for networking module which allow to get information about network members and send/receive messages from them. Draft: {noformat} public interface NetworkService { static NetworkService create(NetworkConfiguration cfg); void shutdown() throws ???;NetworkMember localMember(); Collection remoteMembers(); void weakSend(NetworkMember member, Message msg); Future guaranteedSend(NetworkMember member, Message msg); void listenMembers(MembershipListener lsnr); void listenMessages(Consumer lsnr); } public interface MembershipListener { void onAppeared(NetworkMember member); void onDisappeared(NetworkMember member); void onAcceptedByGroup(List remoteMembers); } public interface NetworkMember { UUID id(); } {noformat} was: It needs to design convinient public API for networking module which allow to get information about network members and send/receive messages from them. 
Draft: {noformat} public interface NetworkService { static NetworkService create(NetworkConfiguration cfg);void shutdown() throws ???;NetworkMember localMember(); Collection remoteMembers(); void weakSend(NetworkMember member, Message msg);Future guaranteedSend(NetworkMember member, Message msg); void listenMembers(MembershipListener lsnr); void listenMessages(Consumer lsnr); } public interface MembershipListener { void onAppeared(NetworkMember member); void onDisappeared(NetworkMember member); void onAcceptedByGroup(List remoteMembers); } public interface NetworkMember { UUID id(); } {noformat} > Networking API > -- > > Key: IGNITE-14090 > URL: https://issues.apache.org/jira/browse/IGNITE-14090 > Project: Ignite > Issue Type: Sub-task >Reporter: Anton Kalashnikov >Priority: Major > > It needs to design convinient public API for networking module which allow to > get information about network members and send/receive messages from them. > Draft: > {noformat} > public interface NetworkService { > static NetworkService create(NetworkConfiguration cfg); > void shutdown() throws ???;NetworkMember localMember(); > > Collection remoteMembers(); > > void weakSend(NetworkMember member, Message msg); > Future guaranteedSend(NetworkMember member, Message msg); > > void listenMembers(MembershipListener lsnr); > > void listenMessages(Consumer lsnr); > } > public interface MembershipListener { > void onAppeared(NetworkMember member); > void onDisappeared(NetworkMember member); > void onAcceptedByGroup(List remoteMembers); > } > public interface NetworkMember { > UUID id(); > } > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (IGNITE-14090) Networking API
[ https://issues.apache.org/jira/browse/IGNITE-14090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov updated IGNITE-14090: --- Description: We need to design a convenient public API for the networking module that allows getting information about network members and sending messages to and receiving messages from them. Draft: {noformat} public interface NetworkService { static NetworkService create(NetworkConfiguration cfg);void shutdown() throws ???;NetworkMember localMember(); Collection remoteMembers(); void weakSend(NetworkMember member, Message msg);Future guaranteedSend(NetworkMember member, Message msg); void listenMembers(MembershipListener lsnr); void listenMessages(Consumer lsnr); } public interface MembershipListener { void onAppeared(NetworkMember member); void onDisappeared(NetworkMember member); void onAcceptedByGroup(List remoteMembers); } public interface NetworkMember { UUID id(); } {noformat} was: We need to design a convenient public API for the networking module that allows getting information about network members and sending messages to and receiving messages from them. 
Draft: {noformat} public interface NetworkService { static NetworkService create(NetworkConfiguration cfg); void shutdown() throws ???; NetworkMember localMember(); Collection remoteMembers(); void weakSend(NetworkMember member, Message msg); Future guaranteedSend(NetworkMember member, Message msg); void listenMembers(MembershipListener lsnr); void listenMessages(Consumer lsnr); } public interface MembershipListener { void onAppeared(NetworkMember member); void onDisappeared(NetworkMember member); void onAcceptedByGroup(List remoteMembers); } public interface NetworkMember { UUID id(); } {noformat} > Networking API > -- > > Key: IGNITE-14090 > URL: https://issues.apache.org/jira/browse/IGNITE-14090 > Project: Ignite > Issue Type: Sub-task >Reporter: Anton Kalashnikov >Priority: Major > > We need to design a convenient public API for the networking module that allows > getting information about network members and sending messages to and receiving > messages from them. > Draft: > {noformat} > public interface NetworkService { > static NetworkService create(NetworkConfiguration cfg);void > shutdown() throws ???;NetworkMember localMember(); > > Collection remoteMembers(); > > void weakSend(NetworkMember member, Message msg);Future > guaranteedSend(NetworkMember member, Message msg); > > void listenMembers(MembershipListener lsnr); > > void listenMessages(Consumer lsnr); > } > public interface MembershipListener { > void onAppeared(NetworkMember member); > void onDisappeared(NetworkMember member); > void onAcceptedByGroup(List remoteMembers); > } > public interface NetworkMember { > UUID id(); > } > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14090) Networking API
Anton Kalashnikov created IGNITE-14090: -- Summary: Networking API Key: IGNITE-14090 URL: https://issues.apache.org/jira/browse/IGNITE-14090 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov We need to design a convenient public API for the networking module that allows getting information about network members and sending messages to and receiving messages from them. Draft: {noformat} public interface NetworkService { static NetworkService create(NetworkConfiguration cfg); void shutdown() throws ???; NetworkMember localMember(); Collection remoteMembers(); void weakSend(NetworkMember member, Message msg); Future guaranteedSend(NetworkMember member, Message msg); void listenMembers(MembershipListener lsnr); void listenMessages(Consumer lsnr); } public interface MembershipListener { void onAppeared(NetworkMember member); void onDisappeared(NetworkMember member); void onAcceptedByGroup(List remoteMembers); } public interface NetworkMember { UUID id(); } {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
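To make the difference between weakSend and guaranteedSend in the draft easier to picture, here is a minimal in-memory sketch. It is not the proposed implementation: LoopbackNetworkService is a hypothetical stand-in, the type parameters (Consumer<Message>, Future<Void>) are assumptions because the draft shows raw types, and only the messaging part of the draft is covered.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Mirrors NetworkMember from the draft.
interface NetworkMember {
    UUID id();
}

// Marker type; the draft does not define Message, so this is an assumption.
interface Message {
}

// Hypothetical loopback service: every message is delivered to local listeners.
class LoopbackNetworkService {
    private final NetworkMember local;
    private final List<Consumer<Message>> listeners = new ArrayList<>();

    LoopbackNetworkService() {
        final UUID id = UUID.randomUUID();
        this.local = () -> id;
    }

    NetworkMember localMember() {
        return local;
    }

    // Fire-and-forget: the caller gets no delivery confirmation.
    void weakSend(NetworkMember member, Message msg) {
        deliver(msg);
    }

    // The returned future completes once the message was handed to the listeners.
    Future<Void> guaranteedSend(NetworkMember member, Message msg) {
        deliver(msg);
        return CompletableFuture.completedFuture(null);
    }

    void listenMessages(Consumer<Message> lsnr) {
        listeners.add(lsnr);
    }

    private void deliver(Message msg) {
        for (Consumer<Message> lsnr : listeners)
            lsnr.accept(msg);
    }
}
```

In a real transport the future returned by guaranteedSend would presumably complete only after the remote side acknowledges the message; the loopback version completes immediately because delivery is local.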
[jira] [Created] (IGNITE-14089) Override scalecube internal message by custom one
Anton Kalashnikov created IGNITE-14089: -- Summary: Override scalecube internal message by custom one Key: IGNITE-14089 URL: https://issues.apache.org/jira/browse/IGNITE-14089 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov The networking module contains custom logic, such as a specific handshake and message recovery, that requires its own specific messages, while the default scalecube behaviour must continue to work correctly. So one layer of logic needs to be implemented on top of the other. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14088) Implement scalecube transport API over netty
Anton Kalashnikov created IGNITE-14088: -- Summary: Implement scalecube transport API over netty Key: IGNITE-14088 URL: https://issues.apache.org/jira/browse/IGNITE-14088 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov scalecube has its own netty inside, but the idea is to integrate our extended netty into it. This will help us support more features, such as our own handshake, marshalling, etc. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14086) Implement retry of establishing connection if it was lost
Anton Kalashnikov created IGNITE-14086: -- Summary: Implement retry of establishing connection if it was lost Key: IGNITE-14086 URL: https://issues.apache.org/jira/browse/IGNITE-14086 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov We need to implement retrying of connection establishment. It is not clear which approach is best, because the current implementation is too difficult to configure (number of retries, several retry-time properties). So we need to think of a better way to configure it, and then implement it. Perhaps scalecube (gossip protocol) already does all of this work and nothing needs to be done here. This needs to be rechecked. -- This message was sent by Atlassian Jira (v8.3.4#803005)
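As one possible answer to the configuration question above, here is a small sketch of a retry loop driven by three settings only: initial delay, maximum delay, and maximum number of attempts, with the delay doubling between attempts. This is a generic exponential backoff, not what the ticket decided; RetryPolicy and Reconnector are hypothetical names and are not part of Ignite or scalecube.

```java
import java.util.concurrent.Callable;

// Hypothetical retry settings: three values instead of many separate properties.
class RetryPolicy {
    final long initialDelayMs;
    final long maxDelayMs;
    final int maxAttempts;

    RetryPolicy(long initialDelayMs, long maxDelayMs, int maxAttempts) {
        if (initialDelayMs < 0 || maxDelayMs < initialDelayMs || maxAttempts < 1)
            throw new IllegalArgumentException("Invalid retry settings");

        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
        this.maxAttempts = maxAttempts;
    }

    // Delay before the given retry (1-based): doubles each time, capped at maxDelayMs.
    long delayBeforeRetry(int retry) {
        long delay = initialDelayMs;

        for (int i = 1; i < retry && delay < maxDelayMs; i++)
            delay *= 2;

        return Math.min(delay, maxDelayMs);
    }
}

class Reconnector {
    // Calls connect until it succeeds or maxAttempts is exhausted,
    // then rethrows the last failure.
    static <T> T connectWithRetry(Callable<T> connect, RetryPolicy policy) throws Exception {
        Exception last = null;

        for (int attempt = 0; attempt < policy.maxAttempts; attempt++) {
            if (attempt > 0)
                Thread.sleep(policy.delayBeforeRetry(attempt));

            try {
                return connect.call();
            }
            catch (Exception e) {
                last = e; // Remember the failure and try again.
            }
        }

        throw last;
    }
}
```

A caller would pass the connection attempt as the Callable, for example a lambda that opens a channel to the remote address. Whether this logic is needed at all still depends on the recheck of scalecube mentioned in the ticket.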
[jira] [Created] (IGNITE-14085) Implement message recovery protocol over handshake
Anton Kalashnikov created IGNITE-14085: -- Summary: Implement message recovery protocol over handshake Key: IGNITE-14085 URL: https://issues.apache.org/jira/browse/IGNITE-14085 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov The central idea of the recovery protocol is the same as in the current implementation, so we need to implement a similar approach based on the recovery descriptor. This means that information about the last sent/received messages should be exchanged during the handshake and, according to this information, messages that were not received should be sent again. -- This message was sent by Atlassian Jira (v8.3.4#803005)
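The recovery idea described above can be sketched roughly as follows. This is only an illustration under assumptions: the class and method names are hypothetical and not taken from the Ignite code base, payloads are plain strings, and the handshake is reduced to a single number, the count of messages the peer has received.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical per-connection recovery state.
class RecoveryDescriptor {
    private static final class Entry {
        final long seq;
        final String payload;

        Entry(long seq, String payload) {
            this.seq = seq;
            this.payload = payload;
        }
    }

    // Messages sent but not yet confirmed by the peer, in send order.
    private final Deque<Entry> unacked = new ArrayDeque<>();

    private long sentCnt;     // Sequence number of the last message sent.
    private long receivedCnt; // Number of messages received from the peer.

    // Registers an outgoing message and returns its sequence number.
    long onSend(String payload) {
        unacked.addLast(new Entry(++sentCnt, payload));
        return sentCnt;
    }

    // Counts an incoming message.
    void onReceive() {
        receivedCnt++;
    }

    // Value to report to the peer during the handshake.
    long received() {
        return receivedCnt;
    }

    // Applies the peer's reported count: drops confirmed messages
    // and returns the ones that must be sent again, in order.
    List<String> onHandshake(long peerReceived) {
        while (!unacked.isEmpty() && unacked.peekFirst().seq <= peerReceived)
            unacked.pollFirst();

        List<String> resend = new ArrayList<>();

        for (Entry e : unacked)
            resend.add(e.payload);

        return resend;
    }
}
```

For example, if three messages were sent and the peer reports 1 during the handshake after a reconnect, the second and third messages are returned for resending.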
[jira] [Created] (IGNITE-14084) Integrate direct marshalling to networking
Anton Kalashnikov created IGNITE-14084: -- Summary: Integrate direct marshalling to networking Key: IGNITE-14084 URL: https://issues.apache.org/jira/browse/IGNITE-14084 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov Direct marshalling can be extracted from Ignite 2.x and integrated into Ignite 3.0. It helps to avoid extra data copying when sending and receiving messages. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14083) Add SSL support to networking
Anton Kalashnikov created IGNITE-14083: -- Summary: Add SSL support to networking Key: IGNITE-14083 URL: https://issues.apache.org/jira/browse/IGNITE-14083 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov We need to add the ability to establish an SSL connection. It looks like this should not be a problem, but at the very least we need to design a configuration that allows managing SSL (path to certificate, password, etc.). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-14082) Implementation of handshake for new connection
Anton Kalashnikov created IGNITE-14082: -- Summary: Implementation of handshake for new connection Key: IGNITE-14082 URL: https://issues.apache.org/jira/browse/IGNITE-14082 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov We need to implement a handshake that runs after netty establishes the connection. Perhaps it makes sense to use netty handlers. During the handshake, the endpoints need to exchange their instanceId. -- This message was sent by Atlassian Jira (v8.3.4#803005)
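A minimal sketch of what the handshake payload could look like, assuming instanceId is a UUID (the ticket does not state its type). The wire format here, a 4-byte marker followed by the 16-byte id, is an assumption made up for illustration, and HandshakeCodec is a hypothetical name. The sketch uses only java.nio; in a netty pipeline the same encode/decode logic would presumably live in a handler that is removed once the handshake completes.

```java
import java.nio.ByteBuffer;
import java.util.UUID;

// Hypothetical codec for a handshake message that carries the sender's instanceId.
class HandshakeCodec {
    // Assumed marker, ASCII "IGNT"; not an actual Ignite constant.
    static final int MAGIC = 0x49474E54;

    // 4 bytes of marker + 16 bytes of UUID.
    static final int SIZE = 4 + 16;

    // Writes the marker and the id into a new buffer that is ready for reading.
    static ByteBuffer encode(UUID instanceId) {
        ByteBuffer buf = ByteBuffer.allocate(SIZE);

        buf.putInt(MAGIC)
            .putLong(instanceId.getMostSignificantBits())
            .putLong(instanceId.getLeastSignificantBits());

        buf.flip();

        return buf;
    }

    // Reads the peer's instanceId, rejecting short or unexpected input.
    static UUID decode(ByteBuffer buf) {
        if (buf.remaining() < SIZE)
            throw new IllegalArgumentException("Incomplete handshake message");

        if (buf.getInt() != MAGIC)
            throw new IllegalArgumentException("Unexpected handshake marker");

        return new UUID(buf.getLong(), buf.getLong());
    }
}
```

Each endpoint would send encode(localInstanceId) right after the connection is established and call decode on the first bytes it receives from the other side.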
[jira] [Created] (IGNITE-14081) Networking module
Anton Kalashnikov created IGNITE-14081: -- Summary: Networking module Key: IGNITE-14081 URL: https://issues.apache.org/jira/browse/IGNITE-14081 Project: Ignite Issue Type: New Feature Reporter: Anton Kalashnikov -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (IGNITE-14055) Deadlock in timeoutObjectProcessor between 'send message' & 'handshake timeout'
[ https://issues.apache.org/jira/browse/IGNITE-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov updated IGNITE-14055: --- Summary: Deadlock in timeoutObjectProcessor between 'send message' & 'handshake timeout' (was: Deadlock in timeoutObjectProcessor between 'send messag'e & 'handshake timeout') > Deadlock in timeoutObjectProcessor between 'send message' & 'handshake > timeout' > --- > > Key: IGNITE-14055 > URL: https://issues.apache.org/jira/browse/IGNITE-14055 > Project: Ignite > Issue Type: Bug >Reporter: Anton Kalashnikov >Assignee: Anton Kalashnikov >Priority: Major > Attachments: StartServerWithTxPuts (1).java, freeze (1).sh > > > Cluster hangs after jvm pauses on one of server nodes. > Scenario: > 1. Start three server nodes with put operations using StartServerWithTxPuts. > 2. Emulate jvm freezes on one server node by running the attached script: > {{*sh freeze.sh *}} > 3. Wait until the script has finished. > Result: > The cluster hangs on tx put operations. 
> The first server node continuously prints: > {noformat} > [2020-11-03 09:36:01,719][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57714][2020-11-03 09:36:01,720][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:01,922][INFO > ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57716][2020-11-03 09:36:01,922][INFO > ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,124][INFO > ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57718][2020-11-03 09:36:02,125][INFO > ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,326][INFO > ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57720][2020-11-03 09:36:02,327][INFO > ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] > Received 
incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,528][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57722][2020-11-03 09:36:02,529][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3]}} > {noformat} > The second node prints long running transactions in prepared state ignoring > the default tx timeout: > > {noformat} > [2020-11-03 09:36:46,199][WARN > ][sys-#83%56b4f715-82d6-4d63-ba99-441ffcd673b4%][diagnostic] >>> Future > [startTime=09:33:08.496, curTime=09:36:46.181, fut=GridNearTxFinishFuture > [futId=425decc8571-4ce98554-8c56-4daf-a7a9-5b9bff52fa08, tx=GridNearTxLocal > [mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping > [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey > [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], > cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl > [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], > oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, > ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, > dhtVer=null, filters=CacheEntryPredicate[] [],
[jira] [Updated] (IGNITE-14055) Deadlock in timeoutObjectProcessor between 'send messag'e & 'handshake timeout'
[ https://issues.apache.org/jira/browse/IGNITE-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov updated IGNITE-14055: --- Attachment: StartServerWithTxPuts (1).java > Deadlock in timeoutObjectProcessor between 'send messag'e & 'handshake > timeout' > --- > > Key: IGNITE-14055 > URL: https://issues.apache.org/jira/browse/IGNITE-14055 > Project: Ignite > Issue Type: Bug >Reporter: Anton Kalashnikov >Assignee: Anton Kalashnikov >Priority: Major > Attachments: StartServerWithTxPuts (1).java, freeze (1).sh > > > Cluster hangs after jvm pauses on one of server nodes. > Scenario: > 1. Start three server nodes with put operations using StartServerWithTxPuts. > 2. Emulate jvm freezes on one server node by running the attached script: > {{*sh freeze.sh *}} > 3. Wait until the script has finished. > Result: > The cluster hangs on tx put operations. > The first server node continuously prints: > {noformat} > [2020-11-03 09:36:01,719][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57714][2020-11-03 09:36:01,720][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:01,922][INFO > ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57716][2020-11-03 09:36:01,922][INFO > ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > 
rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,124][INFO > ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57718][2020-11-03 09:36:02,125][INFO > ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,326][INFO > ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57720][2020-11-03 09:36:02,327][INFO > ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,528][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57722][2020-11-03 09:36:02,529][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3]}} > {noformat} > The second node prints long running transactions in prepared state ignoring > the default tx timeout: > > {noformat} > [2020-11-03 09:36:46,199][WARN > ][sys-#83%56b4f715-82d6-4d63-ba99-441ffcd673b4%][diagnostic] >>> Future > [startTime=09:33:08.496, curTime=09:36:46.181, fut=GridNearTxFinishFuture > 
[futId=425decc8571-4ce98554-8c56-4daf-a7a9-5b9bff52fa08, tx=GridNearTxLocal > [mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping > [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey > [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], > cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl > [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], > oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, > ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, > dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, > filtersSet=true, entry=GridDhtDetachedCacheEntry > [super=GridDistributedCacheEntry [super=GridCacheMapEntry >
[jira] [Updated] (IGNITE-14055) Deadlock in timeoutObjectProcessor between 'send messag'e & 'handshake timeout'
[ https://issues.apache.org/jira/browse/IGNITE-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov updated IGNITE-14055: --- Attachment: freeze (1).sh > Deadlock in timeoutObjectProcessor between 'send messag'e & 'handshake > timeout' > --- > > Key: IGNITE-14055 > URL: https://issues.apache.org/jira/browse/IGNITE-14055 > Project: Ignite > Issue Type: Bug >Reporter: Anton Kalashnikov >Assignee: Anton Kalashnikov >Priority: Major > Attachments: StartServerWithTxPuts (1).java, freeze (1).sh > > > Cluster hangs after jvm pauses on one of server nodes. > Scenario: > 1. Start three server nodes with put operations using StartServerWithTxPuts. > 2. Emulate jvm freezes on one server node by running the attached script: > {{*sh freeze.sh *}} > 3. Wait until the script has finished. > Result: > The cluster hangs on tx put operations. > The first server node continuously prints: > {noformat} > [2020-11-03 09:36:01,719][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57714][2020-11-03 09:36:01,720][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:01,922][INFO > ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57716][2020-11-03 09:36:01,922][INFO > ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, 
rmtNodeOrder=3][2020-11-03 > 09:36:02,124][INFO > ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57718][2020-11-03 09:36:02,125][INFO > ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,326][INFO > ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57720][2020-11-03 09:36:02,327][INFO > ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 > 09:36:02,528][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Accepted incoming communication connection [locAddr=/127.0.0.1:47100, > rmtAddr=/127.0.0.1:57722][2020-11-03 09:36:02,529][INFO > ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] > Received incoming connection from remote node while connecting to this node, > rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, > rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3]}} > {noformat} > The second node prints long running transactions in prepared state ignoring > the default tx timeout: > > {noformat} > [2020-11-03 09:36:46,199][WARN > ][sys-#83%56b4f715-82d6-4d63-ba99-441ffcd673b4%][diagnostic] >>> Future > [startTime=09:33:08.496, curTime=09:36:46.181, fut=GridNearTxFinishFuture > [futId=425decc8571-4ce98554-8c56-4daf-a7a9-5b9bff52fa08, 
tx=GridNearTxLocal > [mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping > [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey > [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], > cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl > [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], > oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, > ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, > dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, > filtersSet=true, entry=GridDhtDetachedCacheEntry > [super=GridDistributedCacheEntry [super=GridCacheMapEntry >
[jira] [Updated] (IGNITE-14055) Deadlock in timeoutObjectProcessor between 'send messag'e & 'handshake timeout'
[ https://issues.apache.org/jira/browse/IGNITE-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov updated IGNITE-14055: --- Description: Cluster hangs after jvm pauses on one of server nodes. Scenario: 1. Start three server nodes with put operations using StartServerWithTxPuts. 2. Emulate jvm freezes on one server node by running the attached script: {{*sh freeze.sh *}} 3. Wait until the script has finished. Result: The cluster hangs on tx put operations. The first server node continuously prints: {noformat} [2020-11-03 09:36:01,719][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57714][2020-11-03 09:36:01,720][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 09:36:01,922][INFO ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57716][2020-11-03 09:36:01,922][INFO ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 09:36:02,124][INFO ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57718][2020-11-03 09:36:02,125][INFO ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting 
[locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 09:36:02,326][INFO ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57720][2020-11-03 09:36:02,327][INFO ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 09:36:02,528][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57722][2020-11-03 09:36:02,529][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3]}} {noformat} The second node prints long running transactions in prepared state ignoring the default tx timeout: {noformat} [2020-11-03 09:36:46,199][WARN ][sys-#83%56b4f715-82d6-4d63-ba99-441ffcd673b4%][diagnostic] >>> Future [startTime=09:33:08.496, curTime=09:36:46.181, fut=GridNearTxFinishFuture [futId=425decc8571-4ce98554-8c56-4daf-a7a9-5b9bff52fa08, tx=GridNearTxLocal [mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, 
conflictVer=null, explicitVer=null, dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=true, entry=GridDhtDetachedCacheEntry [super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=833, extras=null, flags=0]]], prepared=0, locked=false, nodeId=07583a9d-36c8-4100-a69c-8cbd26ca82c9, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion [topVer=215865159, order=1604385188157, nodeOrder=2]]], explicitLock=false, queryUpdate=false, dhtVer=null, last=false, nearEntries=0, clientFirst=false, node=07583a9d-36c8-4100-a69c-8cbd26ca82c9]], nearLocallyMapped=false, colocatedLocallyMapped=false, needCheckBackup=null, hasRemoteLocks=false, trackTimeout=false,
[jira] [Created] (IGNITE-14055) Deadlock in timeoutObjectProcessor between 'send messag'e & 'handshake timeout'
Anton Kalashnikov created IGNITE-14055: -- Summary: Deadlock in timeoutObjectProcessor between 'send messag'e & 'handshake timeout' Key: IGNITE-14055 URL: https://issues.apache.org/jira/browse/IGNITE-14055 Project: Ignite Issue Type: Bug Reporter: Anton Kalashnikov Assignee: Anton Kalashnikov Cluster hangs after jvm pauses on one of server nodes. Scenario: 1. Start three server nodes with put operations using StartServerWithTxPuts. 2. Emulate jvm freezes on one server node by running the attached script: {{*sh freeze.sh *}} 3. Wait until the script has finished. Result: The cluster hangs on tx put operations. The first server node continuously prints: {{{noformat}}} {{}}{{[2020-11-03 09:36:01,719][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57714][2020-11-03 09:36:01,720][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 09:36:01,922][INFO ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57716][2020-11-03 09:36:01,922][INFO ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 09:36:02,124][INFO ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57718][2020-11-03 09:36:02,125][INFO 
][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 09:36:02,326][INFO ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57720][2020-11-03 09:36:02,327][INFO ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 09:36:02,528][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57722][2020-11-03 09:36:02,529][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3]}} {{{noformat}}}{{}} The second node prints long running transactions in prepared state ignoring the default tx timeout: {{{noformat}}} {{1}}{{[2020-11-03 09:36:46,199][WARN ][sys-#83%56b4f715-82d6-4d63-ba99-441ffcd673b4%][diagnostic] >>> Future [startTime=09:33:08.496, curTime=09:36:46.181, fut=GridNearTxFinishFuture [futId=425decc8571-4ce98554-8c56-4daf-a7a9-5b9bff52fa08, tx=GridNearTxLocal [mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl 
[arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=true, entry=GridDhtDetachedCacheEntry [super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=833, extras=null, flags=0]]], prepared=0, locked=false, nodeId=07583a9d-36c8-4100-a69c-8cbd26ca82c9, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion [topVer=215865159, order=1604385188157, nodeOrder=2]]], explicitLock=false, queryUpdate=false,
[jira] [Commented] (IGNITE-13836) Multiple property roots support
[ https://issues.apache.org/jira/browse/IGNITE-13836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17270095#comment-17270095 ] Anton Kalashnikov commented on IGNITE-13836: [~sergeychugunov], it looks good to me. > Multiple property roots support > --- > > Key: IGNITE-13836 > URL: https://issues.apache.org/jira/browse/IGNITE-13836 > Project: Ignite > Issue Type: Sub-task >Reporter: Anton Kalashnikov >Assignee: Sergey Chugunov >Priority: Major > Fix For: 3.0 > > Time Spent: 1h 10m > Remaining Estimate: 0h > > Right now, Configurator is able to manage only one root. It looks like it is > not enough. The current idea is to provide the ability to maintain multiple > property roots, which allows other modules to create their own roots as > needed. > ex.: > * indexing.query.bufferSize > * persistence.pageSize > NB! There is not any local/cluster root because it looks like local/cluster > shouldn't be there at all. Perhaps it should be a storage-specific feature > rather than a property path specific. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-13912) Incorrect calculation of WAL segments that should be deleted from WAL archive
[ https://issues.apache.org/jira/browse/IGNITE-13912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17267807#comment-17267807 ] Anton Kalashnikov commented on IGNITE-13912: [~ktkale...@gridgain.com] thanks for the changes, they look good to me. [~sergeychugunov] can you help with the merge, please? > Incorrect calculation of WAL segments that should be deleted from WAL archive > - > > Key: IGNITE-13912 > URL: https://issues.apache.org/jira/browse/IGNITE-13912 > Project: Ignite > Issue Type: Bug > Components: persistence >Reporter: Kirill Tkalenko >Assignee: Kirill Tkalenko >Priority: Critical > Fix For: 2.10 > > Attachments: wal_usage_dec12.PNG, wal_usage_dec22nd_binary.PNG > > Time Spent: 10m > Remaining Estimate: 0h > > The WAL segments that should be deleted from the WAL archive are currently calculated incorrectly. We delete only those segments whose total size does not exceed *DataStorageConfiguration#maxWalArchiveSize * IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE*, whereas the archive should be cleaned down to *DataStorageConfiguration#maxWalArchiveSize * IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE*. As a result, *DataStorageConfiguration#maxWalArchiveSize* is exceeded.
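One possible reading of the calculation described in IGNITE-13912 can be sketched as follows. This is only an illustration under stated assumptions, not the code from the pull request: the class and method names are hypothetical, segment sizes are passed in as a plain array, and the threshold is a simple multiplier standing in for IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE.

```java
/** Hypothetical sketch: how many of the oldest WAL segments to delete from the archive. */
final class WalArchiveCleanupSketch {
    /**
     * @param segmentSizesOldestFirst sizes of archived segments in bytes, oldest first (assumed input).
     * @param maxWalArchiveSize stand-in for DataStorageConfiguration#maxWalArchiveSize.
     * @param threshold stand-in for IGNITE_THRESHOLD_WAL_ARCHIVE_SIZE_PERCENTAGE, e.g. 0.5.
     * @return number of oldest segments to delete so that the rest fits into max * threshold.
     */
    static int segmentsToDelete(long[] segmentSizesOldestFirst, long maxWalArchiveSize, double threshold) {
        long total = 0;

        for (long size : segmentSizesOldestFirst)
            total += size;

        // Assumption: cleanup starts only once the configured maximum is reached.
        if (total < maxWalArchiveSize)
            return 0;

        long target = (long)(maxWalArchiveSize * threshold);
        int cnt = 0;

        // Delete strictly from the oldest end until the remaining size fits the target.
        while (cnt < segmentSizesOldestFirst.length && total > target) {
            total -= segmentSizesOldestFirst[cnt];
            cnt++;
        }

        return cnt;
    }
}
```

With four 64-byte segments, a maximum of 256 and a threshold of 0.5, the sketch deletes the two oldest segments, leaving 128 bytes in the archive.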
[jira] [Created] (IGNITE-13972) Clear the item id before moving the page to the reuse bucket
Anton Kalashnikov created IGNITE-13972: -- Summary: Clear the item id before moving the page to the reuse bucket Key: IGNITE-13972 URL: https://issues.apache.org/jira/browse/IGNITE-13972 Project: Ignite Issue Type: Task Reporter: Anton Kalashnikov There is an assert, 'Incorrectly recycled pageId in reuse bucket:' (org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList#takeEmptyPage), that sometimes fails. The reason is not clear, because the same condition is checked before the page is put into the reuse bucket. (Perhaps we have more than one link to this page?) The idea is to reset the item id to 1 before putting the page into the reuse bucket, in order to reduce the number of states that can break this assert. This already holds for all data pages, but the item id can still be greater than 1 if the page is not a data page (e.g. an inner page). After that, we can change this assert from a range check to an equality check against 1, which in theory will help us detect the problem faster. It may also be a good idea to set itemId to an impossible value (e.g. 0 or 255). Then we can add an assert on every take from the free list that checks that itemId is greater than 0; if it is not, we have a link to a reuse bucket page from a bucket that is not a reuse bucket, which is a bug.
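The first idea from IGNITE-13972 (reset the item id to 1, then assert equality) might look roughly like the sketch below. The bit layout here is an assumption made for illustration only: the top 8 bits of a 64-bit page id are treated as the item id. It does not reproduce Ignite's actual page id utilities, and all names are hypothetical.

```java
/** Hypothetical sketch of resetting and checking the item id of a recycled page id. */
final class ReuseBucketItemIdSketch {
    /** Assumed layout: the item id occupies the highest 8 bits of the page id. */
    private static final int ITEM_ID_SHIFT = 56;

    private static final long ITEM_ID_MASK = 0xFFL << ITEM_ID_SHIFT;

    /** Extracts the item id (0..255) from a page id. */
    static int itemId(long pageId) {
        return (int)((pageId & ITEM_ID_MASK) >>> ITEM_ID_SHIFT);
    }

    /** Returns the same page id with the item id replaced. */
    static long withItemId(long pageId, int itemId) {
        if (itemId < 0 || itemId > 0xFF)
            throw new IllegalArgumentException("Item id out of range: " + itemId);

        return (pageId & ~ITEM_ID_MASK) | ((long)itemId << ITEM_ID_SHIFT);
    }

    /** Called before the page goes to the reuse bucket: the item id is always reset to 1. */
    static long prepareForReuse(long pageId) {
        return withItemId(pageId, 1);
    }

    /** Equality check instead of a range check when the page is taken back. */
    static void checkTakenFromReuse(long pageId) {
        if (itemId(pageId) != 1)
            throw new AssertionError("Incorrectly recycled pageId in reuse bucket: " + Long.toHexString(pageId));
    }
}
```

The stricter equality check means that any page entering the reuse bucket through a path that skips `prepareForReuse` is caught on the next take, which is the point of narrowing the set of valid states.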
[jira] [Commented] (IGNITE-13831) Move WAL archive cleanup from checkpoint to rollover
[ https://issues.apache.org/jira/browse/IGNITE-13831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254078#comment-17254078 ] Anton Kalashnikov commented on IGNITE-13831: [~ktkale...@gridgain.com] thanks for your changes. It looks good to me. > Move WAL archive cleanup from checkpoint to rollover > > > Key: IGNITE-13831 > URL: https://issues.apache.org/jira/browse/IGNITE-13831 > Project: Ignite > Issue Type: Improvement > Components: persistence >Reporter: Kirill Tkalenko >Assignee: Kirill Tkalenko >Priority: Major > Fix For: 2.10 > > Time Spent: 20m > Remaining Estimate: 0h > > Users expect *DataStorageConfiguration#maxWalArchiveSize* to mean that the WAL archive will never exceed this value, but this is not the case. > It seems that the chance of exceeding the WAL archive size limit is lower if we clean the archive when switching to a new segment rather than at the end of the checkpoint. It is proposed to move the archive cleanup to *FileWriteAheadLogManager#rollOver*, to be performed when *DataStorageConfiguration#maxWalArchiveSize* is reached.
[jira] [Commented] (IGNITE-13856) Superlinear performance of DirectByteBufferStreamImplV2.writeMessage(msg, writer)
[ https://issues.apache.org/jira/browse/IGNITE-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17254073#comment-17254073 ] Anton Kalashnikov commented on IGNITE-13856: [~kazakov], thanks for your effort. Now it looks good to me. > Superlinear performance of DirectByteBufferStreamImplV2.writeMessage(msg, writer) > - > > Key: IGNITE-13856 > URL: https://issues.apache.org/jira/browse/IGNITE-13856 > Project: Ignite > Issue Type: Improvement > Components: binary >Affects Versions: 2.9 >Reporter: Ilya Kazakov >Assignee: Ilya Kazakov >Priority: Major > Attachments: LongStringSQL.java > > Time Spent: 20m > Remaining Estimate: 0h >
> {code:java}
> @Override public void writeMessage(Message msg, MessageWriter writer) {
>     if (msg != null) {
>         if (buf.hasRemaining()) {
>             try {
>                 writer.beforeInnerMessageWrite();
>                 writer.setCurrentWriteClass(msg.getClass());
>                 lastFinished = msg.writeTo(buf, writer);
>             }
>             finally {
>                 writer.afterInnerMessageWrite(lastFinished);
>             }
>         }
>     }
> }
> {code}
> This results in multiple invocations of msg.writeTo(). If msg is a GridH2String, it will call val.getBytes() on every invocation of writeTo(), leading to spikes in CPU and RAM usage. > We should change this module to make sure that all serialization happens only once. > > A reproducer is attached. If we increase the string size 10 times, the execution time increases by more than 10 times.
[jira] [Commented] (IGNITE-13720) Defragmentation parallelism implementation
[ https://issues.apache.org/jira/browse/IGNITE-13720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17253565#comment-17253565 ] Anton Kalashnikov commented on IGNITE-13720: [~sergeychugunov] can you take a look, please? > Defragmentation parallelism implementation > -- > > Key: IGNITE-13720 > URL: https://issues.apache.org/jira/browse/IGNITE-13720 > Project: Ignite > Issue Type: Sub-task > Components: persistence >Reporter: Anton Kalashnikov >Assignee: Anton Kalashnikov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Defragmentation is executed in a single thread right now. It makes sense to > execute the defragmentation of partitions of one group in parallel. > Several parameters will be added to the defragmentation configuration: > * checkpointThreadPoolSize - the size of thread pool which would be used by > checkpointer for writing defragmented pages to disk. > * executionThreadPoolSize - the size of the thread pool which shows how many > partitions maximum can be defragmented at the same time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-13856) Superlinear performance of DirectByteBufferStreamImplV2.writeMessage(msg, writer)
[ https://issues.apache.org/jira/browse/IGNITE-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17252778#comment-17252778 ] Anton Kalashnikov commented on IGNITE-13856: [~kazakov], could you take a look at one more comment in the PR? > Superlinear performance of DirectByteBufferStreamImplV2.writeMessage(msg, writer) > - > > Key: IGNITE-13856 > URL: https://issues.apache.org/jira/browse/IGNITE-13856 > Project: Ignite > Issue Type: Improvement > Components: binary >Affects Versions: 2.9 >Reporter: Ilya Kazakov >Assignee: Ilya Kazakov >Priority: Major > Attachments: LongStringSQL.java > > Time Spent: 20m > Remaining Estimate: 0h >
> {code:java}
> @Override public void writeMessage(Message msg, MessageWriter writer) {
>     if (msg != null) {
>         if (buf.hasRemaining()) {
>             try {
>                 writer.beforeInnerMessageWrite();
>                 writer.setCurrentWriteClass(msg.getClass());
>                 lastFinished = msg.writeTo(buf, writer);
>             }
>             finally {
>                 writer.afterInnerMessageWrite(lastFinished);
>             }
>         }
>     }
> }
> {code}
> This results in multiple invocations of msg.writeTo(). If msg is a GridH2String, it will call val.getBytes() on every invocation of writeTo(), leading to spikes in CPU and RAM usage. > We should change this module to make sure that all serialization happens only once. > > A reproducer is attached. If we increase the string size 10 times, the execution time increases by more than 10 times.
[jira] [Commented] (IGNITE-13856) Superlinear performance of DirectByteBufferStreamImplV2.writeMessage(msg, writer)
[ https://issues.apache.org/jira/browse/IGNITE-13856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17251827#comment-17251827 ] Anton Kalashnikov commented on IGNITE-13856: [~kazakov] I left a couple of comments in the PR. The major one is about using the map for caching. I believe you can use a simple byte array instead (you can look at the usage of arrOff). > Superlinear performance of DirectByteBufferStreamImplV2.writeMessage(msg, writer) > - > > Key: IGNITE-13856 > URL: https://issues.apache.org/jira/browse/IGNITE-13856 > Project: Ignite > Issue Type: Improvement > Components: binary >Affects Versions: 2.9 >Reporter: Ilya Kazakov >Assignee: Ilya Kazakov >Priority: Major > Attachments: LongStringSQL.java > > Time Spent: 10m > Remaining Estimate: 0h >
> {code:java}
> @Override public void writeMessage(Message msg, MessageWriter writer) {
>     if (msg != null) {
>         if (buf.hasRemaining()) {
>             try {
>                 writer.beforeInnerMessageWrite();
>                 writer.setCurrentWriteClass(msg.getClass());
>                 lastFinished = msg.writeTo(buf, writer);
>             }
>             finally {
>                 writer.afterInnerMessageWrite(lastFinished);
>             }
>         }
>     }
> }
> {code}
> This results in multiple invocations of msg.writeTo(). If msg is a GridH2String, it will call val.getBytes() on every invocation of writeTo(), leading to spikes in CPU and RAM usage. > We should change this module to make sure that all serialization happens only once. > > A reproducer is attached. If we increase the string size 10 times, the execution time increases by more than 10 times.
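The "simple byte array" suggestion in the comment on IGNITE-13856 can be illustrated with a standalone sketch. It is not Ignite code: the class below is hypothetical and only shows the general technique of encoding the string once and resuming from a saved offset (similar in spirit to an array offset such as arrOff) when the destination buffer fills up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical message that encodes its string once and writes it in chunks. */
final class CachedBytesStringMessage {
    private final String val;

    /** Encoded form, created lazily on the first write and then reused. */
    private byte[] bytes;

    /** Offset of the first byte that has not been written yet. */
    private int off;

    /** Counts how many times the string was actually encoded. */
    private int encodeCalls;

    CachedBytesStringMessage(String val) {
        this.val = val;
    }

    /**
     * Writes as many bytes as fit into the buffer.
     *
     * @return true once the whole string has been written.
     */
    boolean writeTo(ByteBuffer buf) {
        if (bytes == null) {
            bytes = val.getBytes(StandardCharsets.UTF_8);
            encodeCalls++;
        }

        int len = Math.min(buf.remaining(), bytes.length - off);

        buf.put(bytes, off, len);
        off += len;

        return off == bytes.length;
    }

    int encodeCalls() {
        return encodeCalls;
    }
}
```

With this shape, the cost of each repeated writeTo() call is proportional to the chunk copied, so the total work grows linearly with the string size instead of re-encoding the whole string for every chunk.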
[jira] [Commented] (IGNITE-13190) Core defragmentation functions
[ https://issues.apache.org/jira/browse/IGNITE-13190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250972#comment-17250972 ] Anton Kalashnikov commented on IGNITE-13190: [~timonin.maksim] thanks for pointing this out. It really looks suspicious; perhaps we lost some changes during the merge. I'll check this out. > Core defragmentation functions > -- > > Key: IGNITE-13190 > URL: https://issues.apache.org/jira/browse/IGNITE-13190 > Project: Ignite > Issue Type: Sub-task >Reporter: Sergey Chugunov >Assignee: Ivan Bessonov >Priority: Major > Labels: IEP-47 > Fix For: 2.10 > > Time Spent: 20h 50m > Remaining Estimate: 0h > > The following set of functions covering the defragmentation happy case is needed: > * Initialization of the defragmentation manager when a node is started in maintenance mode. > * Information about partition files is gathered by the defragmentation manager. > * For each partition file, a corresponding file of the defragmented partition is created and initialized. > * Keys are transferred from old partitions to new partitions. > * The checkpointer is aware of new partition files and flushes defragmented memory to new partition files. > > No fault-tolerance code or index defragmentation mappings are needed in this task.
[jira] [Commented] (IGNITE-13848) Premature update SegmentReservationStorage#minReserveIdx during truncate of segments
[ https://issues.apache.org/jira/browse/IGNITE-13848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250967#comment-17250967 ] Anton Kalashnikov commented on IGNITE-13848: [~ktkale...@gridgain.com] LGTM > Premature update SegmentReservationStorage#minReserveIdx during truncate of segments > - > > Key: IGNITE-13848 > URL: https://issues.apache.org/jira/browse/IGNITE-13848 > Project: Ignite > Issue Type: Bug > Components: persistence >Reporter: Kirill Tkalenko >Assignee: Kirill Tkalenko >Priority: Major > Fix For: 2.10 > > Time Spent: 10m > Remaining Estimate: 0h > > A premature update of *SegmentReservationStorage#minReserveIdx* was found in *FileWriteAheadLogManager#truncate*, which leaves the segments in the archive in an incorrect state.
[jira] [Commented] (IGNITE-13847) Make GridEncryptionManager#onWalSegmentRemoved async
[ https://issues.apache.org/jira/browse/IGNITE-13847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250944#comment-17250944 ] Anton Kalashnikov commented on IGNITE-13847: [~ktkale...@gridgain.com] LGTM > Make GridEncryptionManager#onWalSegmentRemoved async > > > Key: IGNITE-13847 > URL: https://issues.apache.org/jira/browse/IGNITE-13847 > Project: Ignite > Issue Type: Improvement > Components: persistence >Reporter: Kirill Tkalenko >Assignee: Kirill Tkalenko >Priority: Major > Labels: IEP-18 > Fix For: 2.10 > > Time Spent: 10m > Remaining Estimate: 0h > > While implementing IGNITE-13831 I ran into a deadlock. > When *FileWriteAheadLogManager#rollOver* is executed and *DataStorageConfiguration#maxWalArchiveSize* has been reached, we begin cleaning the WAL archive. After a segment is deleted, *GridEncryptionManager#onWalSegmentRemoved* is executed; it wants to write to the metastore but cannot succeed, since it waits for *FileWriteAheadLogManager#rollOver*. > I suggest making *GridEncryptionManager#onWalSegmentRemoved* asynchronous and executing it in a separate pool, for example one like *CacheGroupPageScanner#singleExecSvc*.
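The suggestion in IGNITE-13847 can be sketched in isolation as follows. This is a simplified model, not Ignite code: all names are hypothetical, a ReentrantLock stands in for "the metastore write has to wait for the rollover", and a single-thread executor plays the role of the separate pool.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: the segment-removed callback is handed off to a separate thread. */
final class AsyncSegmentRemovedSketch {
    /** Stand-in for the state that is busy while a rollover is in progress. */
    private final ReentrantLock rollOverLock = new ReentrantLock();

    /** Separate single-thread pool, so callbacks keep their order. */
    private final ExecutorService singleExecSvc = Executors.newSingleThreadExecutor();

    private final List<Long> metastoreWrites = new CopyOnWriteArrayList<>();

    /** Rollover that cleans the archive and notifies about a removed segment. */
    void rollOver(long removedSegmentIdx) {
        rollOverLock.lock();

        try {
            // Only enqueues the work; the rollover never waits for the metastore write.
            onWalSegmentRemoved(removedSegmentIdx);
        }
        finally {
            rollOverLock.unlock();
        }
    }

    void onWalSegmentRemoved(long segmentIdx) {
        singleExecSvc.execute(() -> writeToMetastore(segmentIdx));
    }

    private void writeToMetastore(long segmentIdx) {
        // Blocks until the rollover has finished, which is harmless on a separate thread.
        rollOverLock.lock();

        try {
            metastoreWrites.add(segmentIdx);
        }
        finally {
            rollOverLock.unlock();
        }
    }

    List<Long> metastoreWrites() {
        return metastoreWrites;
    }

    /** Stops the pool and waits for queued callbacks; returns false on timeout or interruption. */
    boolean stop(long timeoutMs) {
        singleExecSvc.shutdown();

        try {
            return singleExecSvc.awaitTermination(timeoutMs, TimeUnit.MILLISECONDS);
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();

            return false;
        }
    }
}
```

The design point is that the rollover thread and the metastore writer no longer wait on each other in a cycle: the writer may still wait for the rollover, but the rollover only submits a task and returns.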
[jira] [Created] (IGNITE-13843) Wrapper/Converter for primitive configuration
Anton Kalashnikov created IGNITE-13843: -- Summary: Wrapper/Converter for primitive configuration Key: IGNITE-13843 URL: https://issues.apache.org/jira/browse/IGNITE-13843 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov Do we need the ability to use a complex type such as InternetAddress as a wrapper for a string property?
[jira] [Created] (IGNITE-13842) Creating the new configuration on old cluster
Anton Kalashnikov created IGNITE-13842: -- Summary: Creating the new configuration on old cluster Key: IGNITE-13842 URL: https://issues.apache.org/jira/browse/IGNITE-13842 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov Do we need the ability to create a new configuration/property on the working cluster? -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13841) Cluster bootstrapping
Anton Kalashnikov created IGNITE-13841: -- Summary: Cluster bootstrapping Key: IGNITE-13841 URL: https://issues.apache.org/jira/browse/IGNITE-13841 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov What should cluster bootstrapping look like? What is the format of the files? What is the right moment for applying the configuration? What is the state of the cluster before it is applied?
[jira] [Created] (IGNITE-13840) Rethink API of Init*, change* classes
Anton Kalashnikov created IGNITE-13840: -- Summary: Rethink API of Init*, change* classes Key: IGNITE-13840 URL: https://issues.apache.org/jira/browse/IGNITE-13840 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov Right now, the API of the Init*, change* classes looks too heavy and contains a lot of boilerplate code. We need to think about how to simplify it.
[jira] [Commented] (IGNITE-13815) Remove ability to delete segments from the middle of WAL archive
[ https://issues.apache.org/jira/browse/IGNITE-13815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247819#comment-17247819 ] Anton Kalashnikov commented on IGNITE-13815: [~ktkale...@gridgain.com], it looks good to me. I just want to propose renaming the two methods incMinReserveIndex and incMinLockIndex to something without 'inc', because 'inc' is associated with increment, so the parameter is expected to be a delta (or to be absent). In your case the parameter is not a delta but an absolute value, which means the prefix should be 'set' rather than 'inc'. So it may be better to rename them to setMinReserveIndex or just minReserveIndex. In my opinion, it is fine if any other restriction, such as 'only a value greater than the current one can be set', is described in the Javadoc rather than in the name, because it is impossible to make the name that informative anyway. > Remove ability to delete segments from the middle of WAL archive > > > Key: IGNITE-13815 > URL: https://issues.apache.org/jira/browse/IGNITE-13815 > Project: Ignite > Issue Type: Improvement > Components: persistence >Reporter: Kirill Tkalenko >Assignee: Kirill Tkalenko >Priority: Major > Fix For: 2.10 > > Time Spent: 10m > Remaining Estimate: 0h > > At the moment we have the option to delete segments from the middle of the archive via *FileWriteAheadLogManager#truncate*. This creates gaps in the archive and makes it invalid. > It should only be possible to delete segments sequentially up to the upper boundary. It has also been found that there is no protection against deleting segments that may be needed for binary recovery. > We also need to get rid of the physical check when reserving segments through *FileWriteAheadLogManager#reserve*.
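The naming suggestion from the review comment on IGNITE-13815 might translate into something like the sketch below. It is a hypothetical class, not the one from the pull request, and it assumes that a value not greater than the current one is silently ignored; the ticket does not say how such a value should be handled.

```java
/** Hypothetical sketch of a setter that takes an absolute index, with the restriction kept in Javadoc. */
final class SegmentReservationSketch {
    /** Minimum reserved segment index, or -1 if nothing is reserved yet. */
    private long minReserveIdx = -1;

    /**
     * Sets the minimum reserved segment index.
     * <p>
     * The argument is an absolute index, not a delta. Only a value greater than
     * the current one takes effect; any other value is ignored (an assumption of this sketch).
     *
     * @param absIdx Absolute segment index.
     */
    synchronized void minReserveIndex(long absIdx) {
        if (absIdx > minReserveIdx)
            minReserveIdx = absIdx;
    }

    /** @return Current minimum reserved segment index. */
    synchronized long minReserveIndex() {
        return minReserveIdx;
    }
}
```

Calling `minReserveIndex(5)` followed by `minReserveIndex(3)` leaves the value at 5, and the name no longer suggests that the argument is an increment.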
[jira] [Created] (IGNITE-13837) Configuration initialization
Anton Kalashnikov created IGNITE-13837: -- Summary: Configuration initialization Key: IGNITE-13837 URL: https://issues.apache.org/jira/browse/IGNITE-13837 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov We need to think about what the first initialization of a node/cluster should look like. What is the format of the initial properties (JSON, HOCON, etc.)? How should they be handled?
[jira] [Created] (IGNITE-13836) Multiple property roots support
Anton Kalashnikov created IGNITE-13836: -- Summary: Multiple property roots support Key: IGNITE-13836 URL: https://issues.apache.org/jira/browse/IGNITE-13836 Project: Ignite Issue Type: Sub-task Reporter: Anton Kalashnikov Right now, Configurator is able to manage only one root. It looks like it is not enough. The current idea is to provide the ability to maintain multiple property roots, which allows other modules to create their own roots as needed. ex.: * indexing.query.bufferSize * persistence.pageSize NB! There is not any local/cluster root because it looks like local/cluster shouldn't be there at all. Perhaps it should be a storage-specific feature rather than a property path specific. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-13786) PDS defragmentation can inflate index size
[ https://issues.apache.org/jira/browse/IGNITE-13786?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17247220#comment-17247220 ] Anton Kalashnikov commented on IGNITE-13786: [~ibessonov] the changes look good to me. [~agoncharuk] can you also take a look at the changes and then merge them (if everything is OK)? > PDS defragmentation can inflate index size > -- > > Key: IGNITE-13786 > URL: https://issues.apache.org/jira/browse/IGNITE-13786 > Project: Ignite > Issue Type: Sub-task >Reporter: Ivan Bessonov >Assignee: Ivan Bessonov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > For huge caches, defragmentation can lead to a bigger index size. > The reason is that we only append new data to index trees and never insert into the middle, which leads to under-utilization of B+Tree page space.
[jira] [Commented] (IGNITE-13709) Control.sh API - status
[ https://issues.apache.org/jira/browse/IGNITE-13709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17245960#comment-17245960 ] Anton Kalashnikov commented on IGNITE-13709: [~ibessonov] it looks good to me. [~sergeychugunov] can you help with merge please? > Control.sh API - status > --- > > Key: IGNITE-13709 > URL: https://issues.apache.org/jira/browse/IGNITE-13709 > Project: Ignite > Issue Type: Sub-task >Reporter: Ivan Bessonov >Assignee: Ivan Bessonov >Priority: Major > Labels: IEP-47 > Time Spent: 10m > Remaining Estimate: 0h > > _Prerequisites:_ command can be sent to nodes in maintenance mode and in > normal operations as well. > > _Command output:_ > # For node in normal operations: > defragmentation is scheduled for caches: > # For node in maintenance mode executing defragmentation: > defragmentation is completed for the caches: > cache0 - size before/after: 200GB/150GB, time took: 15 mins 42 secs > defragmentation is in progress for cache: > cache1 - partitions processed/all: 177/512, time elapsed: 7 mins 11 secs > awaiting defragmentation: cache2, cache3, cache4. > # For node in maintenance mode for other reason: > no defragmentation is scheduled for the node, the node is in maintenance to > perform tasks: -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-13775) U.ReentrantReadWriteLockTracer improper realization.
[ https://issues.apache.org/jira/browse/IGNITE-13775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243944#comment-17243944 ] Anton Kalashnikov commented on IGNITE-13775: [~zstan], now the changes look good to me. Waiting for TC... > U.ReentrantReadWriteLockTracer improper realization. > > > Key: IGNITE-13775 > URL: https://issues.apache.org/jira/browse/IGNITE-13775 > Project: Ignite > Issue Type: Improvement > Components: general >Affects Versions: 2.9 >Reporter: Stanilovsky Evgeny >Assignee: Stanilovsky Evgeny >Priority: Major > Attachments: image-2020-12-01-13-51-39-048.png, screenshot-1.png > > Time Spent: 0.5h > Remaining Estimate: 0h > > ReentrantReadWriteLockTracer accepts a ReentrantReadWriteLock as a delegate and stores delegates for readLock and writeLock. But ReentrantReadWriteLock#isWriteLockedByCurrentThread uses the sync object to evaluate the result instead of writeLock, and ReentrantReadWriteLockTracer has its own sync object. > As a result, if ReentrantReadWriteLockTracer is used to create the checkpoint lock (when IGNITE_PDS_LOG_CP_READ_LOCK_HOLDERS=true), GridCacheDatabaseSharedManager#checkpointLockIsHeldByThread doesn't work correctly: it returns false when the checkpoint lock is acquired.
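The problem described in IGNITE-13775 can be reproduced with plain JDK classes. The two classes below are simplified, hypothetical stand-ins for the tracer, not the Ignite implementation, and the override in the second class is just one possible way to fix it, not necessarily the one taken in the pull request.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical wrapper that delegates the locks but not the state queries. */
class BrokenLockTracer extends ReentrantReadWriteLock {
    protected final ReentrantReadWriteLock delegate;

    BrokenLockTracer(ReentrantReadWriteLock delegate) {
        this.delegate = delegate;
    }

    @Override public ReentrantReadWriteLock.ReadLock readLock() {
        return delegate.readLock();
    }

    @Override public ReentrantReadWriteLock.WriteLock writeLock() {
        return delegate.writeLock();
    }

    // isWriteLockedByCurrentThread() is inherited and consults this object's own
    // synchronizer, which is never locked because all locking goes to the delegate.
}

/** Same wrapper, with the state query forwarded to the delegate as well. */
final class FixedLockTracer extends BrokenLockTracer {
    FixedLockTracer(ReentrantReadWriteLock delegate) {
        super(delegate);
    }

    @Override public boolean isWriteLockedByCurrentThread() {
        return delegate.isWriteLockedByCurrentThread();
    }
}
```

After `writeLock().lock()` on either wrapper, the delegate holds the write lock. The inherited query in BrokenLockTracer still reports false, which mirrors the checkpointLockIsHeldByThread symptom from the ticket, while FixedLockTracer reports true.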
[jira] [Commented] (IGNITE-13697) Control.sh API - schedule & cancel
[ https://issues.apache.org/jira/browse/IGNITE-13697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17243288#comment-17243288 ] Anton Kalashnikov commented on IGNITE-13697: [~ibessonov] changes look good to me. > Control.sh API - schedule & cancel > -- > > Key: IGNITE-13697 > URL: https://issues.apache.org/jira/browse/IGNITE-13697 > Project: Ignite > Issue Type: Sub-task >Reporter: Ivan Bessonov >Assignee: Ivan Bessonov >Priority: Major > Labels: IEP-47 > Time Spent: 10m > Remaining Estimate: 0h > > > From original draft by [~sergeychugunov]: > > Schedule > *control.sh defragmentation schedule nodes > nodeConsistentId0[,nodeConsistentId1] [caches > cacheName0,cacheName1,cacheName2]* > > Optional list of caches is passed to perform defragmentation for a > particular set of caches. By default all caches are defragmented. > > _Prerequisites_: command is sent to node in normal operations, node in > maintenance mode should not accept it > _Command output:_ > Defragmentation is successfully scheduled on nodes , on next > restart the following caches will be defragmented: . > Cancel > *control.sh defragmentation cancel nodeHost nodePort [cache cacheName0]* > _Prerequisites_: command is sent to node in maintenance mode or in normal mode > _Command output:_ > Defragmentation is already completed for caches: > Defragmentation is cancelled for caches: ; all intermediate > files are cleaned up. > > *Note:* Caches list for cancel command will not be implemented here. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13720) Defragmentation parallelism implementation
Anton Kalashnikov created IGNITE-13720: -- Summary: Defragmentation parallelism implementation Key: IGNITE-13720 URL: https://issues.apache.org/jira/browse/IGNITE-13720 Project: Ignite Issue Type: Sub-task Components: persistence Reporter: Anton Kalashnikov Assignee: Anton Kalashnikov Defragmentation is executed in a single thread right now. It makes sense to execute the defragmentation of partitions of one group in parallel. Several parameters will be added to the defragmentation configuration: * checkpointThreadPoolSize - the size of thread pool which would be used by checkpointer for writing defragmented pages to disk. * executionThreadPoolSize - the size of the thread pool which shows how many partitions maximum can be defragmented at the same time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
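The parallelism proposed in IGNITE-13720 could be organized along these lines. The sketch covers only the execution pool (the checkpointer pool controlled by checkpointThreadPoolSize is not modelled), and the class and method names are hypothetical; the per-partition work is replaced by a placeholder that just records the partition id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch: partitions of one cache group are defragmented in parallel. */
final class ParallelDefragSketch {
    /**
     * @param partIds partitions of a single cache group.
     * @param executionThreadPoolSize maximum number of partitions processed at the same time.
     * @return ids of the partitions that were processed.
     */
    static Set<Integer> defragmentGroup(List<Integer> partIds, int executionThreadPoolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(executionThreadPoolSize);
        Set<Integer> done = ConcurrentHashMap.newKeySet();

        try {
            List<Future<?>> futs = new ArrayList<>();

            for (Integer partId : partIds)
                futs.add(pool.submit(() -> defragmentPartition(partId, done)));

            // The group is finished only when every partition has been processed.
            for (Future<?> fut : futs)
                fut.get();

            return done;
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();

            throw new IllegalStateException("Defragmentation was interrupted", e);
        }
        catch (ExecutionException e) {
            throw new IllegalStateException("Partition defragmentation failed", e.getCause());
        }
        finally {
            pool.shutdown();
        }
    }

    /** Placeholder for the real per-partition work. */
    private static void defragmentPartition(int partId, Set<Integer> done) {
        done.add(partId);
    }
}
```

Waiting on all futures before returning keeps the group-level semantics the same as in the single-threaded version: the caller sees a group as defragmented only after all of its partitions are done.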
[jira] [Commented] (IGNITE-13681) Non markers checkpoint implementation
[ https://issues.apache.org/jira/browse/IGNITE-13681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229319#comment-17229319 ] Anton Kalashnikov commented on IGNITE-13681: [~sergey-chugunov] can you, please, help with review and merge? > Non markers checkpoint implementation > - > > Key: IGNITE-13681 > URL: https://issues.apache.org/jira/browse/IGNITE-13681 > Project: Ignite > Issue Type: Sub-task >Reporter: Anton Kalashnikov >Assignee: Anton Kalashnikov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > It's needed to implement a new version of checkpoint which will be simpler > than the current one. The main differences compared to the current checkpoint: > * It doesn't contain any write operation to WAL. > * It doesn't create checkpoint markers. > * It should be possible to configure checkpoint listener only on the exact > data region > This checkpoint will be helpful for defragmentation and for recovery(it is > not possible to use the current checkpoint during recovery right now) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-13684) Prepare PageStore/B+Tree to usage outside of standard lifecycle
[ https://issues.apache.org/jira/browse/IGNITE-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229318#comment-17229318 ] Anton Kalashnikov commented on IGNITE-13684: [~ibessonov] the changes look good to me. Could you just remove the useless TODOs that I pointed out in the pull request? [~sergey-chugunov] can you help with the merge, please? > Prepare PageStore/B+Tree to usage outside of standard lifecycle > --- > > Key: IGNITE-13684 > URL: https://issues.apache.org/jira/browse/IGNITE-13684 > Project: Ignite > Issue Type: Sub-task >Reporter: Anton Kalashnikov >Assignee: Ivan Bessonov >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > Right now, PageStore and some other classes responsible for persistence are too tightly coupled with many other dependencies, which prevents using them under different initial conditions (e.g. defragmentation). So some places need to be refactored to improve this situation. > Changes are: > * static constant for cache group meta page; > * PageStore allocation tracker replaced with a more generic LongConsumer to decouple it from the metrics framework; > * PageReadWriteManager added to basically allow having the same cache group in different data regions; > * several methods and fields exposed as internally public/protected API; > * several inner classes refactored so that they become static classes; > * PageIOResolver interface created and used to make the data structure more flexible; > * InsertLast interface for B+Tree added that will optimize comparisons on inserts. Unused for now; > * All this code doesn't affect existing behavior.
[jira] [Commented] (IGNITE-13684) Prepare PageStore/B+Tree to usage outside of standard lifecycle
[ https://issues.apache.org/jira/browse/IGNITE-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17229316#comment-17229316 ] Anton Kalashnikov commented on IGNITE-13684: Benchmarks look good:
||Benchmark||master - operation||this branch - operation||diff %||
|tx_put|4.00|43723.30|0.90%|
|atomic_put_all_bs_10|67685.50|67149.10|-0.79%|
|atomic_put_get|79930.80|78607.00|-1.66%|
|tx_put_get|25010.20|24600.10|-1.64%|
|sql_query|58618.50|59074.10|0.78%|
|atomic_put_random_value|160666.00|157411.00|-2.03%|
|sql_query_put|104990.00|103457.00|-1.46%|
> Prepare PageStore/B+Tree to usage outside of standard lifecycle > --- > > Key: IGNITE-13684 > URL: https://issues.apache.org/jira/browse/IGNITE-13684 > Project: Ignite > Issue Type: Sub-task >Reporter: Anton Kalashnikov >Assignee: Ivan Bessonov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Right now, PageStore and some other classes responsible for persistence are too tightly coupled with many other dependencies, which prevents using them under different initial conditions (e.g. defragmentation). So some places need to be refactored to improve this situation. > Changes are: > * static constant for cache group meta page; > * PageStore allocation tracker replaced with a more generic LongConsumer to decouple it from the metrics framework; > * PageReadWriteManager added to basically allow having the same cache group in different data regions; > * several methods and fields exposed as internally public/protected API; > * several inner classes refactored so that they become static classes; > * PageIOResolver interface created and used to make the data structure more flexible; > * InsertLast interface for B+Tree added that will optimize comparisons on inserts. Unused for now; > * All this code doesn't affect existing behavior.
[jira] [Updated] (IGNITE-13684) Prepare PageStore/B+Tree to usage outside of standart lifecycle
[ https://issues.apache.org/jira/browse/IGNITE-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-13684:
---
Description:
Right now, PageStore and some other classes that are responsible for persistence are too tightly coupled with many other dependencies, which does not allow using them under different initial conditions (e.g. defragmentation). So some places need to be refactored to improve this situation.

Changes are:
* static constant for cache group meta page;
* PageStore allocation tracker replaced with a more generic LongConsumer to decouple it from the metrics framework;
* PageReadWriteManager added to basically allow having the same cache group in different data regions;
* several methods and fields exposed as internally public/protected API;
* several inner classes refactored so that they become static classes;
* PageIOResolver interface created and used to make data structures more flexible;
* InsertLast interface for B+Tree added that will optimize comparisons on inserts. Unused for now;
* All this code doesn't affect existing behavior.

was:
Right now, ignite has a static pageIo resolver which not allow substituting the different implementation if needed. So it is needed to rewrite the current implementation in order of this target.

Changes are:
* static constant for cache group meta page;
* PageStore allocation tracker replaced with a more generic LongConsumer do decouple it from metrics framework;
* PageReadWriteManager added to basically allow having same cache group in different data regions;
* several methods and fields exposed as internally public/protected API;
* several inner classes refactored so that they become static classes;
* PageIOResolver interface created and used to make data structure more flexible;
* InsertLast interface for B+Tree added that will optimize comparisons on inserts. Unused for now;
* All this code doesn't affect existing behavior.
> Prepare PageStore/B+Tree to usage outside of standart lifecycle > --- > > Key: IGNITE-13684 > URL: https://issues.apache.org/jira/browse/IGNITE-13684 > Project: Ignite > Issue Type: Sub-task >Reporter: Anton Kalashnikov >Assignee: Ivan Bessonov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Right now, PageStore and some other classes which responsible for persistent > too couple with many other dependencies which not allow to use it in > different initial conditions(ex. defragmentation). So it is needed to > refactor some places in order to improve this situation. > Changes are: > * static constant for cache group meta page; > * PageStore allocation tracker replaced with a more generic LongConsumer do > decouple it from metrics framework; > * PageReadWriteManager added to basically allow having same cache group in > different data regions; > * several methods and fields exposed as internally public/protected API; > * several inner classes refactored so that they become static classes; > * PageIOResolver interface created and used to make data structure more > flexible; > * InsertLast interface for B+Tree added that will optimize comparisons on > inserts. Unused for now; > * All this code doesn't affect existing behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (IGNITE-13684) Prepare PageStore/B+Tree to usage outside of standart lifecycle
[ https://issues.apache.org/jira/browse/IGNITE-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov updated IGNITE-13684: --- Summary: Prepare PageStore/B+Tree to usage outside of standart lifecycle (was: Rewrite PageIo resolver from static to explicit dependency) > Prepare PageStore/B+Tree to usage outside of standart lifecycle > --- > > Key: IGNITE-13684 > URL: https://issues.apache.org/jira/browse/IGNITE-13684 > Project: Ignite > Issue Type: Sub-task >Reporter: Anton Kalashnikov >Assignee: Ivan Bessonov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Right now, ignite has a static pageIo resolver which not allow substituting > the different implementation if needed. So it is needed to rewrite the > current implementation in order of this target. > Changes are: > * static constant for cache group meta page; > * PageStore allocation tracker replaced with a more generic LongConsumer do > decouple it from metrics framework; > * PageReadWriteManager added to basically allow having same cache group in > different data regions; > * several methods and fields exposed as internally public/protected API; > * several inner classes refactored so that they become static classes; > * PageIOResolver interface created and used to make data structure more > flexible; > * InsertLast interface for B+Tree added that will optimize comparisons on > inserts. Unused for now; > * All this code doesn't affect existing behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (IGNITE-13684) Rewrite PageIo resolver from static to explicit dependency
[ https://issues.apache.org/jira/browse/IGNITE-13684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov updated IGNITE-13684: --- Description: Right now, ignite has a static pageIo resolver which not allow substituting the different implementation if needed. So it is needed to rewrite the current implementation in order of this target. Changes are: * static constant for cache group meta page; * PageStore allocation tracker replaced with a more generic LongConsumer do decouple it from metrics framework; * PageReadWriteManager added to basically allow having same cache group in different data regions; * several methods and fields exposed as internally public/protected API; * several inner classes refactored so that they become static classes; * PageIOResolver interface created and used to make data structure more flexible; * InsertLast interface for B+Tree added that will optimize comparisons on inserts. Unused for now; * All this code doesn't affect existing behavior. was:Right now, ignite has a static pageIo resolver which not allow substituting the different implementation if needed. So it is needed to rewrite the current implementation in order of this target. > Rewrite PageIo resolver from static to explicit dependency > -- > > Key: IGNITE-13684 > URL: https://issues.apache.org/jira/browse/IGNITE-13684 > Project: Ignite > Issue Type: Sub-task >Reporter: Anton Kalashnikov >Assignee: Ivan Bessonov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > Right now, ignite has a static pageIo resolver which not allow substituting > the different implementation if needed. So it is needed to rewrite the > current implementation in order of this target. 
> Changes are: > * static constant for cache group meta page; > * PageStore allocation tracker replaced with a more generic LongConsumer do > decouple it from metrics framework; > * PageReadWriteManager added to basically allow having same cache group in > different data regions; > * several methods and fields exposed as internally public/protected API; > * several inner classes refactored so that they become static classes; > * PageIOResolver interface created and used to make data structure more > flexible; > * InsertLast interface for B+Tree added that will optimize comparisons on > inserts. Unused for now; > * All this code doesn't affect existing behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-13682) Add generic to maintenance mode feature
[ https://issues.apache.org/jira/browse/IGNITE-13682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17228459#comment-17228459 ]

Anton Kalashnikov commented on IGNITE-13682:

[~sergey-chugunov] Could you please review and merge this?

> Add generic to maintenance mode feature
> ---
>
> Key: IGNITE-13682
> URL: https://issues.apache.org/jira/browse/IGNITE-13682
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Assignee: Anton Kalashnikov
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> MaintenanceAction has no generic type parameter right now, which leads to parameterization problems.
[jira] [Updated] (IGNITE-13682) Add generic to maintenance mode feature
[ https://issues.apache.org/jira/browse/IGNITE-13682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov updated IGNITE-13682: --- Summary: Add generic to maintenance mode feature (was: Added generic to maintenance mode feature) > Add generic to maintenance mode feature > --- > > Key: IGNITE-13682 > URL: https://issues.apache.org/jira/browse/IGNITE-13682 > Project: Ignite > Issue Type: Sub-task >Reporter: Anton Kalashnikov >Assignee: Anton Kalashnikov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > MaintenanceAction has no generic right now which lead to parametirezed problem -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13684) Rewrite PageIo resolver from static to explicit dependency
Anton Kalashnikov created IGNITE-13684:
--
Summary: Rewrite PageIo resolver from static to explicit dependency
Key: IGNITE-13684
URL: https://issues.apache.org/jira/browse/IGNITE-13684
Project: Ignite
Issue Type: Sub-task
Reporter: Anton Kalashnikov
Assignee: Ivan Bessonov

Right now, Ignite has a static pageIo resolver, which does not allow substituting a different implementation if needed. So the current implementation needs to be rewritten so that the resolver becomes an explicit dependency.
[jira] [Created] (IGNITE-13683) Added MVCC validation to ValidateIndexesClosure
Anton Kalashnikov created IGNITE-13683:
--
Summary: Added MVCC validation to ValidateIndexesClosure
Key: IGNITE-13683
URL: https://issues.apache.org/jira/browse/IGNITE-13683
Project: Ignite
Issue Type: Sub-task
Reporter: Anton Kalashnikov
Assignee: Semyon Danilov

Validation of MVCC indexes should be added to ValidateIndexesClosure.
[jira] [Created] (IGNITE-13682) Added generic to maintenance mode feature
Anton Kalashnikov created IGNITE-13682:
--
Summary: Added generic to maintenance mode feature
Key: IGNITE-13682
URL: https://issues.apache.org/jira/browse/IGNITE-13682
Project: Ignite
Issue Type: Sub-task
Reporter: Anton Kalashnikov
Assignee: Anton Kalashnikov

MaintenanceAction has no generic type parameter right now, which leads to parameterization problems.
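The ticket does not spell out the problem, so the following is only one plausible reading: without a type parameter, an action can only return `Object` and callers have to cast. The interfaces `UntypedAction` and `TypedAction` below are hypothetical and are not Ignite's `MaintenanceAction`; they only contrast the two shapes.

```java
/** Hypothetical action contract without a type parameter: the result type is lost. */
interface UntypedAction {
    Object execute();
}

/** Hypothetical generic variant: the result type is part of the contract. */
interface TypedAction<T> {
    T execute();
}

final class GenericActionSketch {
    /** Runs a typed action; the caller gets T back without a cast. */
    static <T> T run(TypedAction<T> action) {
        return action.execute();
    }

    public static void main(String[] args) {
        UntypedAction untyped = () -> Boolean.TRUE;

        // Without a type parameter the caller must cast, and a wrong cast fails only at runtime.
        Boolean viaCast = (Boolean) untyped.execute();

        TypedAction<Boolean> typed = () -> Boolean.TRUE;

        // With a type parameter the compiler checks the result type.
        Boolean viaGeneric = run(typed);

        System.out.println(viaCast + " " + viaGeneric);
    }
}
```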
[jira] [Created] (IGNITE-13681) Non markers checkpoint implementation
Anton Kalashnikov created IGNITE-13681:
--
Summary: Non markers checkpoint implementation
Key: IGNITE-13681
URL: https://issues.apache.org/jira/browse/IGNITE-13681
Project: Ignite
Issue Type: Sub-task
Reporter: Anton Kalashnikov
Assignee: Anton Kalashnikov

A new version of the checkpoint needs to be implemented that is simpler than the current one. The main differences compared to the current checkpoint:
* It doesn't perform any write operations to the WAL.
* It doesn't create checkpoint markers.
* It should be possible to configure a checkpoint listener for a specific data region only.

This checkpoint will be helpful for defragmentation and for recovery (it is not possible to use the current checkpoint during recovery right now).
[jira] [Commented] (IGNITE-13366) Special mode for maintenance of Ignite node. Employing Maintenance Mode for clearing corrupted PDS files.
[ https://issues.apache.org/jira/browse/IGNITE-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213750#comment-17213750 ] Anton Kalashnikov commented on IGNITE-13366: [~sergeychugunov] ok, let's implement it a little later. LGTM. > Special mode for maintenance of Ignite node. Employing Maintenance Mode for > clearing corrupted PDS files. > - > > Key: IGNITE-13366 > URL: https://issues.apache.org/jira/browse/IGNITE-13366 > Project: Ignite > Issue Type: New Feature > Components: persistence >Affects Versions: 2.8.1 >Reporter: Sergey Chugunov >Assignee: Sergey Chugunov >Priority: Critical > Labels: IEP-53 > Fix For: 2.10 > > Original Estimate: 168h > Time Spent: 1h 40m > Remaining Estimate: 166h 20m > > If node with persistence is stopped when WAL was disabled for a cache (no > matters because of rebalancing in progress or by explicit user request) on > next node start all data files of that cache are removed automatically and > unconditionally. > This behavior may be unexpected for users as they may not understand all > consequences of disabling WAL locally (for rebalancing) or globally (via > IgniteCluster API call). Also it is not smart enough as there is no point in > deleting consistent data files. > We should change this behavior to the following list: no automatic deletions > whatsoever. If data files are consistent (equivalent to: no checkpoint was > running when node was stopped) start up normally. If data files are > corrupted, don't let the node start. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (IGNITE-12489) Error during purges by expiration: Unknown page type
[ https://issues.apache.org/jira/browse/IGNITE-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov reassigned IGNITE-12489: -- Assignee: (was: Anton Kalashnikov) > Error during purges by expiration: Unknown page type > > > Key: IGNITE-12489 > URL: https://issues.apache.org/jira/browse/IGNITE-12489 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.7, 2.7.6 >Reporter: Ruslan Kamashev >Priority: Blocker > Fix For: 2.10 > > > {{*logger*}} > {code:java} > org.apache.ignite.internal.processors.cache.GridCacheIoManager > {code} > {{*message*}} > {code:java} > Failed to process message [senderId=969d56ba-4b46-40cf-886e-ac445cf6a95d, > messageType=class > o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicUpdateRequest]{code} > {{*thread*}} > {code:java} > sys-stripe-19-#20{code} > {{*trace*}} > {code:java} > java.lang.IllegalStateException: Unknown page type: 1 pageId: 00010303117d > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.io(BPlusTree.java:5058) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$200(BPlusTree.java:90) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.nextPage(BPlusTree.java:5330) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.next(BPlusTree.java:5566) > at > org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:2232) > at > org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:2157) > at > org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:845) > at > org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:207) > at > 
org.apache.ignite.internal.processors.cache.GridCacheUtils.unwindEvicts(GridCacheUtils.java:888) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessageProcessed(GridCacheIoManager.java:1103) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1076) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295) > at > org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569) > at > org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197) > at > org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:127) > at > org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1093) > at > org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:505) > at > org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) > at java.lang.Thread.run(Thread.java:748) > Dec 23, 2019 @ 18:28:28.457 {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-13366) Special mode for maintenance of Ignite node. Employing Maintenance Mode for clearing corrupted PDS files.
[ https://issues.apache.org/jira/browse/IGNITE-13366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17212401#comment-17212401 ]

Anton Kalashnikov commented on IGNITE-13366:

In general, it looks good to me. But I have several questions:
* I noticed that you rewrite the file when a new record is added. Have you considered a copy-on-write approach with a temp file?
* Your maintenanceId is a UUID right now. Maybe it would be better to use something more human-readable?
* You start the auto action (mntcProcessor.prepareAndExecuteMaintenance();) before discovery is started. I don't have the right answer here, but are you sure it is the right place for it? Don't we want to call this method in another thread (not the starting one) after the node has fully started?
* Do we want to add a version for the maintenance record store file? Maybe we should add it to the name of the file?

> Special mode for maintenance of Ignite node. Employing Maintenance Mode for clearing corrupted PDS files.
> -
>
> Key: IGNITE-13366
> URL: https://issues.apache.org/jira/browse/IGNITE-13366
> Project: Ignite
> Issue Type: New Feature
> Components: persistence
> Affects Versions: 2.8.1
> Reporter: Sergey Chugunov
> Assignee: Sergey Chugunov
> Priority: Critical
> Labels: IEP-53
> Fix For: 2.10
>
> Original Estimate: 168h
> Time Spent: 1h 40m
> Remaining Estimate: 166h 20m
>
> If a node with persistence is stopped while WAL was disabled for a cache (no matter whether because of rebalancing in progress or by explicit user request), on the next node start all data files of that cache are removed automatically and unconditionally.
> This behavior may be unexpected for users as they may not understand all consequences of disabling WAL locally (for rebalancing) or globally (via IgniteCluster API call). Also it is not smart enough, as there is no point in deleting consistent data files.
> We should change this behavior as follows: no automatic deletions whatsoever. If data files are consistent (equivalent to: no checkpoint was running when the node was stopped), start up normally. If data files are corrupted, don't let the node start.
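The copy-on-write question in the comment above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the pull request: the class `RecordFileWriter`, the method `writeAtomically` and the `.tmp` suffix are hypothetical. The full new content is written to a sibling temp file, which then replaces the target through `Files.move` with `StandardCopyOption.ATOMIC_MOVE`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.AtomicMoveNotSupportedException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.List;

/** Hypothetical helper; names are illustrative and not taken from Ignite. */
final class RecordFileWriter {
    /** Writes all records to a temp file next to the target, then swaps it into place. */
    static void writeAtomically(Path target, List<String> records) {
        // The temp file lives in the same directory so that the move stays on one file system.
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");

        try {
            Files.write(tmp, records, StandardCharsets.UTF_8);

            try {
                // Readers see either the old file or the new one, never a half-written file.
                Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
            }
            catch (AtomicMoveNotSupportedException e) {
                // Fallback for file systems without atomic moves; this path is not atomic.
                Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
            }
        }
        catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Two caveats: the `Files.move` documentation leaves it implementation-specific whether `ATOMIC_MOVE` replaces an existing target, and the sketch does not force the data to disk before the move, so a real implementation would need to consider both.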
[jira] [Commented] (IGNITE-13500) Checkpoint read lock fail if it is taking under write lock during the stopping node
[ https://issues.apache.org/jira/browse/IGNITE-13500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17211041#comment-17211041 ] Anton Kalashnikov commented on IGNITE-13500: [~sergeychugunov] please, take a look at these changes. They should fix the problem with BasicIndexTest#testInlineSizeChange > Checkpoint read lock fail if it is taking under write lock during the > stopping node > --- > > Key: IGNITE-13500 > URL: https://issues.apache.org/jira/browse/IGNITE-13500 > Project: Ignite > Issue Type: Bug >Reporter: Anton Kalashnikov >Assignee: Anton Kalashnikov >Priority: Major > Time Spent: 20m > Remaining Estimate: 0h > > org.apache.ignite.internal.processors.cache.index.BasicIndexTest#testDynamicIndexesDropWithPersistence > {noformat} > [2020-09-30 > 15:09:26,085][ERROR][db-checkpoint-thread-#371%index.BasicIndexTest0%][Checkpointer] > Runtime error caught during grid runnable execution: GridWorker > [name=db-checkpoint-thread, igniteInstanceName=index.BasicIndexTest0, > finished=false, heartbeatTs=1601467766063, hashCode=963964001, > interrupted=false, runner=db-checkpoint-thread-#371%index.BasicIndexTest0%] > class org.apache.ignite.IgniteException: Failed to perform cache update: node > is stopping. > at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:396) > at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.body(Checkpointer.java:263) > at > org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) > at java.lang.Thread.run(Thread.java:748) > Caused by: class org.apache.ignite.IgniteException: Failed to perform cache > update: node is stopping. 
> at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock.checkpointReadLock(CheckpointTimeoutLock.java:128) > at > org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1298) > at > org.apache.ignite.internal.processors.localtask.DurableBackgroundTasksProcessor.removeDurableBackgroundTask(DurableBackgroundTasksProcessor.java:245) > at > org.apache.ignite.internal.processors.localtask.DurableBackgroundTasksProcessor.onMarkCheckpointBegin(DurableBackgroundTasksProcessor.java:277) > at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointWorkflow.markCheckpointBegin(CheckpointWorkflow.java:274) > at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:387) > ... 3 more > Caused by: class org.apache.ignite.internal.NodeStoppingException: Failed to > perform cache update: node is stopping. > ... 9 more > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13569) disable archiving + walCompactionEnabled probably broke reading from wal on server restart
Anton Kalashnikov created IGNITE-13569:
--
Summary: disable archiving + walCompactionEnabled probably broke reading from wal on server restart
Key: IGNITE-13569
URL: https://issues.apache.org/jira/browse/IGNITE-13569
Project: Ignite
Issue Type: Bug
Reporter: Anton Kalashnikov
Assignee: Anton Kalashnikov

* Start a cluster with 4 server nodes
* Preload
* Start 4 clients
* Start transactional loading
* Wait 10 sec

While loading:
For each node in server nodes:
  Kill -9 node
  Wait 20 sec
  Return node back
  Wait 20 sec

Wal + Wal_archive - lab40, lab41 - /storage/hdd/aromantsov/GG-18739

It looks like the node can't read all the WAL files that were generated before the node was started again.

{noformat}
[12:50:27,001][SEVERE][wal-file-compressor-%null%-1-#71][FileWriteAheadLogManager] Compression of WAL segment [idx=0] was skipped due to unexpected error
class org.apache.ignite.IgniteCheckedException: WAL archive segment is missing: /storage/ssd/aromantsov/tiden/snapshots-190514-121520/test_pitr/node_1_1/.wal
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileCompressorWorker.body0(FileWriteAheadLogManager.java:2076)
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileCompressorWorker.body(FileWriteAheadLogManager.java:2054)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.lang.Thread.run(Thread.java:748)
[12:50:27,001][SEVERE][wal-file-compressor-%null%-0-#69][FileWriteAheadLogManager] Compression of WAL segment [idx=2] was skipped due to unexpected error
class org.apache.ignite.IgniteCheckedException: WAL archive segment is missing: /storage/ssd/aromantsov/tiden/snapshots-190514-121520/test_pitr/node_1_1/0002.wal
    at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileCompressorWorker.body0(FileWriteAheadLogManager.java:2076)
    at
org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileCompressorWorker.access$4800(FileWriteAheadLogManager.java:2019) at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileCompressor.body(FileWriteAheadLogManager.java:1995) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) at java.lang.Thread.run(Thread.java:748) [12:50:27,001][SEVERE][wal-file-compressor-%null%-3-#73][FileWriteAheadLogManager] Compression of WAL segment [idx=3] was skipped due to unexpected error class org.apache.ignite.IgniteCheckedException: WAL archive segment is missing: /storage/ssd/aromantsov/tiden/snapshots-190514-121520/test_pitr/node_1_1/0003.wal at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileCompressorWorker.body0(FileWriteAheadLogManager.java:2076) at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileCompressorWorker.body(FileWriteAheadLogManager.java:2054) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) at java.lang.Thread.run(Thread.java:748) [12:50:27,001][SEVERE][wal-file-compressor-%null%-2-#72][FileWriteAheadLogManager] Compression of WAL segment [idx=1] was skipped due to unexpected error class org.apache.ignite.IgniteCheckedException: WAL archive segment is missing: /storage/ssd/aromantsov/tiden/snapshots-190514-121520/test_pitr/node_1_1/0001.wal at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileCompressorWorker.body0(FileWriteAheadLogManager.java:2076) at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileCompressorWorker.body(FileWriteAheadLogManager.java:2054) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) at java.lang.Thread.run(Thread.java:748) [12:50:27,002][SEVERE][wal-file-compressor-%null%-1-#71][FileWriteAheadLogManager] Compression of WAL segment 
[idx=4] was skipped due to unexpected error class org.apache.ignite.IgniteCheckedException: WAL archive segment is missing: /storage/ssd/aromantsov/tiden/snapshots-190514-121520/test_pitr/node_1_1/0004.wal at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileCompressorWorker.body0(FileWriteAheadLogManager.java:2076) at org.apache.ignite.internal.processors.cache.persistence.wal.FileWriteAheadLogManager$FileCompressorWorker.body(FileWriteAheadLogManager.java:2054) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) at java.lang.Thread.run(Thread.java:748) [12:50:27,002][SEVERE][wal-file-compressor-%null%-0-#69][FileWriteAheadLogManager]
[jira] [Commented] (IGNITE-13565) Potential further bugs with DurableBackgroundTasks.
[ https://issues.apache.org/jira/browse/IGNITE-13565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17210785#comment-17210785 ]

Anton Kalashnikov commented on IGNITE-13565:

In my opinion, it is not a potential bug; it is already a bug. It looks like data corruption occurs if a DurableBackgroundTask has finished but its status has not been updated in the metastore. Finishing a DurableBackgroundTask and changing its status in the metastore are not one atomic operation, so nobody can guarantee that the node doesn't fail between these two actions. Perhaps we need to add some atomic operation for detecting that a DurableBackgroundTask has finished (maybe we should write something to the WAL).

> Potential further bugs with DurableBackgroundTasks.
> ---
>
> Key: IGNITE-13565
> URL: https://issues.apache.org/jira/browse/IGNITE-13565
> Project: Ignite
> Issue Type: Bug
> Components: general
> Affects Versions: 2.8.1
> Reporter: Stanilovsky Evgeny
> Priority: Major
>
> After some code refactoring [1] we got a problem with a simple test:
> org.apache.ignite.internal.processors.cache.index.BasicIndexTest#testInlineSizeChange
> between
> {noformat}
> execSql(cache, "drop index \"idx1\"");
> {noformat}
> and
> {noformat}
> ig0 = startGrid(0);
> {noformat}
> operations. It seems [2] will fix it, but the problem could potentially happen again (check attached stacks). In a few words: an already completed durable task has not updated its
> {noformat}
> DurableBackgroundTask#complete
> {noformat}
> status in the metastore, so once the cluster is running this task can still try to run once more, with undefined behavior. [~Denis Chudov], [~makedonskaya], please pay attention.
> [1] https://issues.apache.org/jira/browse/IGNITE-13207 > [2] https://issues.apache.org/jira/browse/IGNITE-13500 > {noformat} > 2020-10-09 11:42:41,982][INFO ][test-runner-#1%index.BasicIndexTest%][root] > >>> Stopping grid [name=index.BasicIndexTest0, > id=161e62a2-1a5d-46b0-892d-2e0274e0] > [2020-10-09 > 11:42:41,999][ERROR][db-checkpoint-thread-#61%index.BasicIndexTest0%][root] > Critical system error detected. Will be handled accordingly to configured > handler [hnd=StopNodeFailureHandler [super=AbstractFailureHandler > [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, > SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext > [type=CRITICAL_ERROR, err=class o.a.i.IgniteException: Failed to perform > cache update: node is stopping.]] > class org.apache.ignite.IgniteException: Failed to perform cache update: node > is stopping. > at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock.checkpointReadLock(CheckpointTimeoutLock.java:125) > at > org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1297) > at > org.apache.ignite.internal.processors.localtask.DurableBackgroundTasksProcessor.removeDurableBackgroundTask(DurableBackgroundTasksProcessor.java:245) > at > org.apache.ignite.internal.processors.localtask.DurableBackgroundTasksProcessor.onMarkCheckpointBegin(DurableBackgroundTasksProcessor.java:277) > at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointWorkflow.markCheckpointBegin(CheckpointWorkflow.java:274) > at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:387) > at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.body(Checkpointer.java:263) > at > org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) > at java.lang.Thread.run(Thread.java:748) > ... 
> starting grid and ... > java.lang.AssertionError: calculatedOffset=49152, allocated=45056, > headerSize=4096, > cfgFile=/work/repo/apache-ignite/work/db/index_BasicIndexTest0/cache-default/index.bin > >>> +---+ > >>> Ignite ver. 2.10.0-SNAPSHOT#20201009-sha1:DEV > >>> +---+ > at > org.apache.ignite.internal.processors.cache.persistence.file.FilePageStore.read(FilePageStore.java:492) > at > org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:554) > at > org.apache.ignite.internal.processors.cache.persistence.file.FilePageStoreManager.read(FilePageStoreManager.java:538) > at > org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:884) > at > org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:710) > at >
[jira] [Created] (IGNITE-13562) Prototype dynamic configuration
Anton Kalashnikov created IGNITE-13562:
--
Summary: Prototype dynamic configuration
Key: IGNITE-13562
URL: https://issues.apache.org/jira/browse/IGNITE-13562
Project: Ignite
Issue Type: Sub-task
Reporter: Anton Kalashnikov
Assignee: Semyon Danilov

The main goal is to add a new additional configuration module with a framework that allows us to create dynamic properties (node-local and cluster-wide?). The framework should provide the following:
* A description of the schema rules by which public and private property classes would be generated
* Generation of public and private classes from the schema
* A description of the public POJO view (update/insert/get) used to interact with properties in a type-safe way
* Conversion of properties from HOCON to the internal representation
[jira] [Created] (IGNITE-13511) Unified configuration
Anton Kalashnikov created IGNITE-13511: -- Summary: Unified configuration Key: IGNITE-13511 URL: https://issues.apache.org/jira/browse/IGNITE-13511 Project: Ignite Issue Type: New Feature Reporter: Anton Kalashnikov https://cwiki.apache.org/confluence/display/IGNITE/IEP-55+Unified+Configuration
[jira] [Commented] (IGNITE-12489) Error during purges by expiration: Unknown page type
[ https://issues.apache.org/jira/browse/IGNITE-12489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17204722#comment-17204722 ] Anton Kalashnikov commented on IGNITE-12489: [~xdang] Can you please provide more details about your configuration (mostly CacheConfiguration) and your load profile (what types of requests you have)? Also, it would help a lot if you were able to write a reproducer for this scenario (though I suppose that is not easy to do). > Error during purges by expiration: Unknown page type > > > Key: IGNITE-12489 > URL: https://issues.apache.org/jira/browse/IGNITE-12489 > Project: Ignite > Issue Type: Bug >Affects Versions: 2.7, 2.7.6 >Reporter: Ruslan Kamashev >Assignee: Anton Kalashnikov >Priority: Blocker > Fix For: 2.10 > > > {{*logger*}} > {code:java} > org.apache.ignite.internal.processors.cache.GridCacheIoManager > {code} > {{*message*}} > {code:java} > Failed to process message [senderId=969d56ba-4b46-40cf-886e-ac445cf6a95d, > messageType=class > o.a.i.i.processors.cache.distributed.dht.atomic.GridDhtAtomicUpdateRequest]{code} > {{*thread*}} > {code:java} > sys-stripe-19-#20{code} > {{*trace*}} > {code:java} > java.lang.IllegalStateException: Unknown page type: 1 pageId: 00010303117d > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.io(BPlusTree.java:5058) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$200(BPlusTree.java:90) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.nextPage(BPlusTree.java:5330) > at > org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$ForwardCursor.next(BPlusTree.java:5566) > at > org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:2232) > at > 
org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:2157) > at > org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:845) > at > org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:207) > at > org.apache.ignite.internal.processors.cache.GridCacheUtils.unwindEvicts(GridCacheUtils.java:888) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessageProcessed(GridCacheIoManager.java:1103) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1076) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:581) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:380) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:306) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:101) > at > org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:295) > at > org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1569) > at > org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1197) > at > org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:127) > at > org.apache.ignite.internal.managers.communication.GridIoManager$9.run(GridIoManager.java:1093) > at > org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:505) > at > org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) > at java.lang.Thread.run(Thread.java:748) > Dec 23, 2019 @ 18:28:28.457 {code} -- This message was sent by 
Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (IGNITE-13500) Checkpoint read lock fail if it is taking under write lock during the stopping node
[ https://issues.apache.org/jira/browse/IGNITE-13500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Anton Kalashnikov reassigned IGNITE-13500: -- Assignee: Anton Kalashnikov > Checkpoint read lock fail if it is taking under write lock during the > stopping node > --- > > Key: IGNITE-13500 > URL: https://issues.apache.org/jira/browse/IGNITE-13500 > Project: Ignite > Issue Type: Bug >Reporter: Anton Kalashnikov >Assignee: Anton Kalashnikov >Priority: Major > > org.apache.ignite.internal.processors.cache.index.BasicIndexTest#testDynamicIndexesDropWithPersistence > {noformat} > [2020-09-30 > 15:09:26,085][ERROR][db-checkpoint-thread-#371%index.BasicIndexTest0%][Checkpointer] > Runtime error caught during grid runnable execution: GridWorker > [name=db-checkpoint-thread, igniteInstanceName=index.BasicIndexTest0, > finished=false, heartbeatTs=1601467766063, hashCode=963964001, > interrupted=false, runner=db-checkpoint-thread-#371%index.BasicIndexTest0%] > class org.apache.ignite.IgniteException: Failed to perform cache update: node > is stopping. > at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:396) > at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.body(Checkpointer.java:263) > at > org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) > at java.lang.Thread.run(Thread.java:748) > Caused by: class org.apache.ignite.IgniteException: Failed to perform cache > update: node is stopping. 
> at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock.checkpointReadLock(CheckpointTimeoutLock.java:128) > at > org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1298) > at > org.apache.ignite.internal.processors.localtask.DurableBackgroundTasksProcessor.removeDurableBackgroundTask(DurableBackgroundTasksProcessor.java:245) > at > org.apache.ignite.internal.processors.localtask.DurableBackgroundTasksProcessor.onMarkCheckpointBegin(DurableBackgroundTasksProcessor.java:277) > at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointWorkflow.markCheckpointBegin(CheckpointWorkflow.java:274) > at > org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:387) > ... 3 more > Caused by: class org.apache.ignite.internal.NodeStoppingException: Failed to > perform cache update: node is stopping. > ... 9 more > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13500) Checkpoint read lock fail if it is taking under write lock during the stopping node
Anton Kalashnikov created IGNITE-13500: -- Summary: Checkpoint read lock fail if it is taking under write lock during the stopping node Key: IGNITE-13500 URL: https://issues.apache.org/jira/browse/IGNITE-13500 Project: Ignite Issue Type: Bug Reporter: Anton Kalashnikov org.apache.ignite.internal.processors.cache.index.BasicIndexTest#testDynamicIndexesDropWithPersistence {noformat} [2020-09-30 15:09:26,085][ERROR][db-checkpoint-thread-#371%index.BasicIndexTest0%][Checkpointer] Runtime error caught during grid runnable execution: GridWorker [name=db-checkpoint-thread, igniteInstanceName=index.BasicIndexTest0, finished=false, heartbeatTs=1601467766063, hashCode=963964001, interrupted=false, runner=db-checkpoint-thread-#371%index.BasicIndexTest0%] class org.apache.ignite.IgniteException: Failed to perform cache update: node is stopping. at org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:396) at org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.body(Checkpointer.java:263) at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) at java.lang.Thread.run(Thread.java:748) Caused by: class org.apache.ignite.IgniteException: Failed to perform cache update: node is stopping. 
at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointTimeoutLock.checkpointReadLock(CheckpointTimeoutLock.java:128) at org.apache.ignite.internal.processors.cache.persistence.GridCacheDatabaseSharedManager.checkpointReadLock(GridCacheDatabaseSharedManager.java:1298) at org.apache.ignite.internal.processors.localtask.DurableBackgroundTasksProcessor.removeDurableBackgroundTask(DurableBackgroundTasksProcessor.java:245) at org.apache.ignite.internal.processors.localtask.DurableBackgroundTasksProcessor.onMarkCheckpointBegin(DurableBackgroundTasksProcessor.java:277) at org.apache.ignite.internal.processors.cache.persistence.checkpoint.CheckpointWorkflow.markCheckpointBegin(CheckpointWorkflow.java:274) at org.apache.ignite.internal.processors.cache.persistence.checkpoint.Checkpointer.doCheckpoint(Checkpointer.java:387) ... 3 more Caused by: class org.apache.ignite.internal.NodeStoppingException: Failed to perform cache update: node is stopping. ... 9 more {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
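The trace above shows the checkpointer thread, which already holds the checkpoint write lock, being rejected when a checkpoint listener asks for the read lock while the node is stopping. Below is a minimal sketch of that interaction and of one possible guard, built on the JDK's ReentrantReadWriteLock. The class StoppingAwareLockSketch and all of its methods are hypothetical; this is not Ignite's CheckpointTimeoutLock and not necessarily the fix that was adopted for this ticket.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a checkpoint lock; not Ignite's CheckpointTimeoutLock. */
class StoppingAwareLockSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    /** Set once the node begins to stop. */
    private volatile boolean stopping;

    void markStopping() {
        stopping = true;
    }

    void writeLock() {
        lock.writeLock().lock();
    }

    void writeUnlock() {
        lock.writeLock().unlock();
    }

    /**
     * Takes the read lock. A thread that already holds the write lock may re-enter
     * even while the node is stopping; any other thread is rejected once stopping is set.
     */
    void readLock() {
        if (!lock.isWriteLockedByCurrentThread() && stopping)
            throw new IllegalStateException("Failed to perform cache update: node is stopping.");

        lock.readLock().lock();
    }

    void readUnlock() {
        lock.readLock().unlock();
    }

    public static void main(String[] args) {
        StoppingAwareLockSketch cpLock = new StoppingAwareLockSketch();

        cpLock.writeLock();      // checkpointer takes the write lock
        cpLock.markStopping();   // node stop begins concurrently

        try {
            cpLock.readLock();   // listener running on the checkpointer thread
            cpLock.readUnlock();
            System.out.println("read lock taken under write lock");
        }
        finally {
            cpLock.writeUnlock();
        }
    }
}
```

The guard relies on the fact that ReentrantReadWriteLock lets the write-lock holder also take the read lock, so only the stopping check has to be relaxed for that thread.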
[jira] [Updated] (IGNITE-13207) Checkpointer code refactoring: Splitting GridCacheDatabaseSharedManager and Checkpointer
[ https://issues.apache.org/jira/browse/IGNITE-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-13207:
---
Description:
The main goal of this ticket is to make it possible to reuse all or part of the checkpoint classes in a different way (e.g. a lightweight checkpoint during defragmentation).

What was done in this ticket:

New classes:
* CheckpointReadWriteLock - encapsulates the checkpoint-specific lock logic
* CheckpointTimeoutLock - a read lock with a timeout that can trigger a new checkpoint if needed
* CheckpointMarkersStorage - encapsulates the work with checkpoint markers: writing to/reading from disk and caching the current markers
* CheckpointWorkflow - encapsulates the checkpoint steps such as checkpoint begin and checkpoint end
* CheckpointManager - the entry point of the checkpoint. It is responsible for consistent initialization of all checkpoint-related components and provides an API for interacting with them.
* WorkProgressDispatcher - an interface for managing a worker's heartbeat

Renamed classes:
* DbCheckpointListener -> CheckpointListener (also moved to the checkpoint package)
* WriteCheckpointPages -> CheckpointPagesWriter
* DbCheckpointContextImpl -> CheckpointContextImpl

Logical changes:
* asyncRunner (the checkpoint runner thread pool) was replaced by two pools: checkpointWritePagesPool (an IO-bound thread pool for writing dirty pages to disk) and checkpointCollectPagesInfoPool (a CPU-bound thread pool for collecting dirty pages from memory)
* The method afterCheckpointEnd was added to CheckpointListener

was:
The main goal of this ticket is to make it possible to reuse all or part of the checkpoint classes in a different way (e.g. a lightweight checkpoint during defragmentation).

What was done in this ticket:

New classes:
* CheckpointReadWriteLock - encapsulates the checkpoint-specific lock logic
* CheckpointTimeoutLock - a read lock with a timeout that can trigger a new checkpoint if needed
* CheckpointStorage - encapsulates the work with checkpoint markers: writing to/reading from disk and caching the current markers
* CheckpointProcess - encapsulates the checkpoint steps such as checkpoint begin and checkpoint end
* CheckpointManager - the entry point of the checkpoint. It is responsible for consistent initialization of all checkpoint-related components and provides an API for interacting with them.
* WorkProgressDispatcher - an interface for managing a worker's heartbeat

Renamed classes:
* DbCheckpointListener -> CheckpointListener (also moved to the checkpoint package)
* WriteCheckpointPages -> CheckpointPagesWriter
* DbCheckpointContextImpl -> CheckpointContextImpl

Logical changes:
* asyncRunner (the checkpoint runner thread pool) was replaced by two pools: checkpointWritePagesPool (an IO-bound thread pool for writing dirty pages to disk) and checkpointCollectPagesInfoPool (a CPU-bound thread pool for collecting dirty pages from memory)
* The method afterCheckpointEnd was added to CheckpointListener

> Checkpointer code refactoring: Splitting GridCacheDatabaseSharedManager and Checkpointer
>
> Key: IGNITE-13207
> URL: https://issues.apache.org/jira/browse/IGNITE-13207
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Assignee: Anton Kalashnikov
> Priority: Major
> Labels: IEP-47
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The main goal of this ticket is to make it possible to reuse all or part of the checkpoint classes in a different way (e.g. a lightweight checkpoint during defragmentation).
>
> What was done in this ticket:
>
> New classes:
> * CheckpointReadWriteLock - encapsulates the checkpoint-specific lock logic
> * CheckpointTimeoutLock - a read lock with a timeout that can trigger a new checkpoint if needed
> * CheckpointMarkersStorage - encapsulates the work with checkpoint markers: writing to/reading from disk and caching the current markers
> * CheckpointWorkflow - encapsulates the checkpoint steps such as checkpoint begin and checkpoint end
> * CheckpointManager - the entry point of the checkpoint. It is responsible for consistent initialization of all checkpoint-related components and provides an API for interacting with them.
> * WorkProgressDispatcher - an interface for managing a worker's heartbeat
>
> Renamed classes:
> * DbCheckpointListener -> CheckpointListener (also moved to the checkpoint package)
> * WriteCheckpointPages -> CheckpointPagesWriter
> * DbCheckpointContextImpl -> CheckpointContextImpl
>
> Logical changes:
> * asyncRunner (the checkpoint runner thread pool) was replaced by two pools: checkpointWritePagesPool (an IO-bound thread pool for writing dirty pages to disk) and
[jira] [Commented] (IGNITE-13435) Fixing some unrecorded issues command warm-up control.sh
[ https://issues.apache.org/jira/browse/IGNITE-13435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17197571#comment-17197571 ] Anton Kalashnikov commented on IGNITE-13435: [~ktkale...@gridgain.com] LGTM. > Fixing some unrecorded issues command warm-up control.sh > > > Key: IGNITE-13435 > URL: https://issues.apache.org/jira/browse/IGNITE-13435 > Project: Ignite > Issue Type: Bug > Components: control.sh >Reporter: Kirill Tkalenko >Assignee: Kirill Tkalenko >Priority: Major > Labels: IEP-40 > Fix For: 2.10 > > Time Spent: 0.5h > Remaining Estimate: 0h > > Unrecorded problems: > * When parsing arguments for the warm-up command, subsequent arguments may be > skipped, such as auto-confirmation "--yes"; > * Processing requests for jetty; > * Authorization.
[jira] [Updated] (IGNITE-13207) Checkpointer code refactoring: Splitting GridCacheDatabaseSharedManager and Checkpointer
[ https://issues.apache.org/jira/browse/IGNITE-13207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-13207:
---
Description:
The main goal of this ticket is to make it possible to reuse all or part of the checkpoint classes in a different way (e.g. a lightweight checkpoint during defragmentation).

What was done in this ticket:

New classes:
* CheckpointReadWriteLock - encapsulates the checkpoint-specific lock logic
* CheckpointTimeoutLock - a read lock with a timeout that can trigger a new checkpoint if needed
* CheckpointStorage - encapsulates the work with checkpoint markers: writing to/reading from disk and caching the current markers
* CheckpointProcess - encapsulates the checkpoint steps such as checkpoint begin and checkpoint end
* CheckpointManager - the entry point of the checkpoint. It is responsible for consistent initialization of all checkpoint-related components and provides an API for interacting with them.
* WorkProgressDispatcher - an interface for managing a worker's heartbeat

Renamed classes:
* DbCheckpointListener -> CheckpointListener (also moved to the checkpoint package)
* WriteCheckpointPages -> CheckpointPagesWriter
* DbCheckpointContextImpl -> CheckpointContextImpl

Logical changes:
* asyncRunner (the checkpoint runner thread pool) was replaced by two pools: checkpointWritePagesPool (an IO-bound thread pool for writing dirty pages to disk) and checkpointCollectPagesInfoPool (a CPU-bound thread pool for collecting dirty pages from memory)
* The method afterCheckpointEnd was added to CheckpointListener

> Checkpointer code refactoring: Splitting GridCacheDatabaseSharedManager and Checkpointer
>
> Key: IGNITE-13207
> URL: https://issues.apache.org/jira/browse/IGNITE-13207
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Kalashnikov
> Assignee: Anton Kalashnikov
> Priority: Major
> Labels: IEP-47
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The main goal of this ticket is to make it possible to reuse all or part of the checkpoint classes in a different way (e.g. a lightweight checkpoint during defragmentation).
>
> What was done in this ticket:
>
> New classes:
> * CheckpointReadWriteLock - encapsulates the checkpoint-specific lock logic
> * CheckpointTimeoutLock - a read lock with a timeout that can trigger a new checkpoint if needed
> * CheckpointStorage - encapsulates the work with checkpoint markers: writing to/reading from disk and caching the current markers
> * CheckpointProcess - encapsulates the checkpoint steps such as checkpoint begin and checkpoint end
> * CheckpointManager - the entry point of the checkpoint. It is responsible for consistent initialization of all checkpoint-related components and provides an API for interacting with them.
> * WorkProgressDispatcher - an interface for managing a worker's heartbeat
>
> Renamed classes:
> * DbCheckpointListener -> CheckpointListener (also moved to the checkpoint package)
> * WriteCheckpointPages -> CheckpointPagesWriter
> * DbCheckpointContextImpl -> CheckpointContextImpl
>
> Logical changes:
> * asyncRunner (the checkpoint runner thread pool) was replaced by two pools: checkpointWritePagesPool (an IO-bound thread pool for writing dirty pages to disk) and checkpointCollectPagesInfoPool (a CPU-bound thread pool for collecting dirty pages from memory)
> * The method afterCheckpointEnd was added to CheckpointListener
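The replacement of asyncRunner by a CPU-bound pool and an IO-bound pool, as described in the ticket above, can be sketched roughly as follows. This is only an illustration under stated assumptions: the class TwoPoolCheckpointSketch, the pool sizes, and the in-memory list standing in for the disk are all hypothetical, and Ignite's real checkpoint code is considerably more involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical two-stage checkpoint: collect dirty pages on one pool, write them on another. */
class TwoPoolCheckpointSketch {
    /**
     * @param dirtyPagesBySegment Dirty page ids grouped by memory segment.
     * @return Page ids that were "written" (an in-memory list stands in for the disk).
     */
    static List<Long> checkpoint(List<List<Long>> dirtyPagesBySegment) {
        // CPU-bound stage: sized by the number of cores (assumption).
        ExecutorService collectPool = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
        // IO-bound stage: sized independently of the core count (assumption).
        ExecutorService writePool = Executors.newFixedThreadPool(4);

        try {
            // Stage 1: copy and sort each segment's dirty pages in parallel.
            List<Future<List<Long>>> collected = new ArrayList<>();

            for (List<Long> segment : dirtyPagesBySegment) {
                collected.add(collectPool.submit(() -> {
                    List<Long> sorted = new ArrayList<>(segment);
                    Collections.sort(sorted);
                    return sorted;
                }));
            }

            // Stage 2: hand each sorted batch to the write pool.
            List<Long> written = Collections.synchronizedList(new ArrayList<>());
            List<Future<?>> writes = new ArrayList<>();

            for (Future<List<Long>> fut : collected) {
                List<Long> pages = fut.get();
                writes.add(writePool.submit(() -> { written.addAll(pages); }));
            }

            for (Future<?> write : writes)
                write.get();

            return written;
        }
        catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("Checkpoint sketch failed", e);
        }
        finally {
            collectPool.shutdown();
            writePool.shutdown();
        }
    }
}
```

Keeping the two pools separate means a slow disk only backs up the write stage, while collection can still use every core.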
[jira] [Commented] (IGNITE-13362) Stop warm-up via control.sh
[ https://issues.apache.org/jira/browse/IGNITE-13362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17193618#comment-17193618 ] Anton Kalashnikov commented on IGNITE-13362: [~ktkale...@gridgain.com] thanks for your changes. The code looks good to me. > Stop warm-up via control.sh > --- > > Key: IGNITE-13362 > URL: https://issues.apache.org/jira/browse/IGNITE-13362 > Project: Ignite > Issue Type: New Feature >Reporter: Kirill Tkalenko >Assignee: Kirill Tkalenko >Priority: Major > Labels: IEP-40 > Fix For: 2.10 > > Time Spent: 10m > Remaining Estimate: 0h > > At the moment, stop warm-up via "control.sh" is not possible due to fact that > processing messages from "control.sh" occurs after "discovery" and warm-up > goes before it. > It is necessary to do processing of messages from "control.sh" before warming > up and implement command for "control.sh".
[jira] [Commented] (IGNITE-13345) Warming up strategy
[ https://issues.apache.org/jira/browse/IGNITE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183251#comment-17183251 ]

Anton Kalashnikov commented on IGNITE-13345:

[~sergey-chugunov] can you help with the merge, please?

> Warming up strategy
> ---
>
> Key: IGNITE-13345
> URL: https://issues.apache.org/jira/browse/IGNITE-13345
> Project: Ignite
> Issue Type: New Feature
> Reporter: Kirill Tkalenko
> Assignee: Kirill Tkalenko
> Priority: Major
> Labels: IEP-40
> Fix For: 2.10
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Summary of [Dev-list|http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Cache-warmup-td48582.html]
> # Adding a marker interface *org.apache.ignite.configuration.WarmUpConfiguration*;
> # Adding a configuration to
> ## *org.apache.ignite.configuration.DataRegionConfiguration#setWarmUpConfiguration*
> ## *org.apache.ignite.configuration.DataStorageConfiguration#setDefaultWarmUpConfiguration*
> # Adding an internal warm-up interface that will start in [1] after [2] (after recovery);
> {code:java}
> package org.apache.ignite.internal.processors.cache.warmup;
>
> import org.apache.ignite.IgniteCheckedException;
> import org.apache.ignite.configuration.WarmUpConfiguration;
> import org.apache.ignite.internal.GridKernalContext;
> import org.apache.ignite.internal.processors.cache.persistence.DataRegion;
>
> /**
>  * Interface for warming up.
>  */
> public interface WarmUpStrategy<T extends WarmUpConfiguration> {
>     /**
>      * Returns configuration class for mapping to strategy.
>      *
>      * @return Configuration class.
>      */
>     Class<T> configClass();
>
>     /**
>      * Warm up.
>      *
>      * @param kernalCtx Kernal context.
>      * @param cfg Warm-up configuration.
>      * @param region Data region.
>      * @throws IgniteCheckedException If failed.
>      */
>     void warmUp(GridKernalContext kernalCtx, T cfg, DataRegion region) throws IgniteCheckedException;
>
>     /**
>      * Closing warm up.
>      *
>      * @throws IgniteCheckedException If failed.
>      */
>     void close() throws IgniteCheckedException;
> }
> {code}
> # Adding an internal plugin extension for adding custom strategies;
> {code:java}
> package org.apache.ignite.internal.processors.cache.warmup;
>
> import java.util.Collection;
> import org.apache.ignite.plugin.Extension;
>
> /**
>  * Interface for getting warm-up strategies from plugins.
>  */
> public interface WarmUpStrategySupplier extends Extension {
>     /**
>      * Getting warm-up strategies.
>      *
>      * @return Warm-up strategies.
>      */
>     Collection<WarmUpStrategy<?>> strategies();
> }
> {code}
> # Adding strategies:
> ## Without implementation, to make it possible to disable the warm-up: NoOP
> ## Loading everything while there is free RAM, with priority given to indexes: LoadAll
> # Adding a command to "control.sh" to stop the current warm-up and cancel all others: --warm-up stop in IGNITE-13362
>
> [1] - org.apache.ignite.internal.processors.cache.GridCacheProcessor.CacheRecoveryLifecycle#afterLogicalUpdatesApplied
> [2] - org.apache.ignite.internal.processors.cache.GridCacheProcessor.CacheRecoveryLifecycle#restorePartitionStates
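To illustrate how configClass() can map a configuration to its strategy, here is a small self-contained sketch. It does not use the Ignite classes: WarmUpConfigSketch, WarmUpStrategySketch, NoOpConfigSketch, NoOpStrategySketch and WarmUpRegistrySketch are hypothetical stand-ins, and the kernal context and DataRegion arguments are replaced by a plain region name. How Ignite actually resolves strategies may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the WarmUpConfiguration marker interface. */
interface WarmUpConfigSketch { }

/** Simplified strategy interface: a region name replaces the kernal context and DataRegion. */
interface WarmUpStrategySketch<T extends WarmUpConfigSketch> {
    Class<T> configClass();

    void warmUp(T cfg, String regionName);
}

/** Configuration that selects the no-op strategy. */
final class NoOpConfigSketch implements WarmUpConfigSketch { }

/** Strategy without implementation, used to disable the warm-up. */
final class NoOpStrategySketch implements WarmUpStrategySketch<NoOpConfigSketch> {
    @Override public Class<NoOpConfigSketch> configClass() {
        return NoOpConfigSketch.class;
    }

    @Override public void warmUp(NoOpConfigSketch cfg, String regionName) {
        // Intentionally does nothing.
    }
}

/** Looks up a strategy by the class of the supplied configuration. */
final class WarmUpRegistrySketch {
    private final Map<Class<?>, WarmUpStrategySketch<?>> strategies = new HashMap<>();

    void register(WarmUpStrategySketch<?> strategy) {
        strategies.put(strategy.configClass(), strategy);
    }

    boolean supports(WarmUpConfigSketch cfg) {
        return strategies.containsKey(cfg.getClass());
    }

    @SuppressWarnings("unchecked")
    <T extends WarmUpConfigSketch> void warmUp(T cfg, String regionName) {
        // The cast is safe because strategies are keyed by their own configClass().
        WarmUpStrategySketch<T> strategy = (WarmUpStrategySketch<T>)strategies.get(cfg.getClass());

        if (strategy == null)
            throw new IllegalArgumentException("No warm-up strategy for " + cfg.getClass().getName());

        strategy.warmUp(cfg, regionName);
    }
}
```

A plugin-provided supplier would then only need to hand its strategies to register(), and the configuration class alone decides which one runs for a data region.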
[jira] [Commented] (IGNITE-13345) Warming up strategy
[ https://issues.apache.org/jira/browse/IGNITE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17183243#comment-17183243 ]

Anton Kalashnikov commented on IGNITE-13345:

[~ktkale...@gridgain.com] LGTM.

> Warming up strategy
> ---
>
> Key: IGNITE-13345
> URL: https://issues.apache.org/jira/browse/IGNITE-13345
> Project: Ignite
> Issue Type: New Feature
> Reporter: Kirill Tkalenko
> Assignee: Kirill Tkalenko
> Priority: Major
> Labels: IEP-40
> Fix For: 2.10
> Time Spent: 40m
> Remaining Estimate: 0h
>
> Summary of [Dev-list|http://apache-ignite-developers.2346864.n4.nabble.com/DISCUSSION-Cache-warmup-td48582.html]
> # Adding a marker interface *org.apache.ignite.configuration.WarmUpConfiguration*;
> # Adding a configuration to
> ## *org.apache.ignite.configuration.DataRegionConfiguration#setWarmUpConfiguration*
> ## *org.apache.ignite.configuration.DataStorageConfiguration#setDefaultWarmUpConfiguration*
> # Adding an internal warm-up interface that will start in [1] after [2] (after recovery);
> {code:java}
> package org.apache.ignite.internal.processors.cache.warmup;
>
> import org.apache.ignite.IgniteCheckedException;
> import org.apache.ignite.configuration.WarmUpConfiguration;
> import org.apache.ignite.internal.GridKernalContext;
> import org.apache.ignite.internal.processors.cache.persistence.DataRegion;
>
> /**
>  * Interface for warming up.
>  */
> public interface WarmUpStrategy<T extends WarmUpConfiguration> {
>     /**
>      * Returns configuration class for mapping to strategy.
>      *
>      * @return Configuration class.
>      */
>     Class<T> configClass();
>
>     /**
>      * Warm up.
>      *
>      * @param kernalCtx Kernal context.
>      * @param cfg Warm-up configuration.
>      * @param region Data region.
>      * @throws IgniteCheckedException If failed.
>      */
>     void warmUp(GridKernalContext kernalCtx, T cfg, DataRegion region) throws IgniteCheckedException;
>
>     /**
>      * Closing warm up.
>      *
>      * @throws IgniteCheckedException If failed.
>      */
>     void close() throws IgniteCheckedException;
> }
> {code}
> # Adding an internal plugin extension for adding custom strategies;
> {code:java}
> package org.apache.ignite.internal.processors.cache.warmup;
>
> import java.util.Collection;
> import org.apache.ignite.plugin.Extension;
>
> /**
>  * Interface for getting warm-up strategies from plugins.
>  */
> public interface WarmUpStrategySupplier extends Extension {
>     /**
>      * Getting warm-up strategies.
>      *
>      * @return Warm-up strategies.
>      */
>     Collection<WarmUpStrategy<?>> strategies();
> }
> {code}
> # Adding strategies:
> ## Without implementation, to make it possible to disable the warm-up: NoOP
> ## Loading everything while there is free RAM, with priority given to indexes: LoadAll
> # Adding a command to "control.sh" to stop the current warm-up and cancel all others: --warm-up stop in IGNITE-13362
>
> [1] - org.apache.ignite.internal.processors.cache.GridCacheProcessor.CacheRecoveryLifecycle#afterLogicalUpdatesApplied
> [2] - org.apache.ignite.internal.processors.cache.GridCacheProcessor.CacheRecoveryLifecycle#restorePartitionStates
[jira] [Commented] (IGNITE-13367) meta --remove command usage improvements
[ https://issues.apache.org/jira/browse/IGNITE-13367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17181119#comment-17181119 ] Anton Kalashnikov commented on IGNITE-13367: [~sergeychugunov] yes, I took a look at it already. It looks good to me. > meta --remove command usage improvements > > > Key: IGNITE-13367 > URL: https://issues.apache.org/jira/browse/IGNITE-13367 > Project: Ignite > Issue Type: Improvement > Components: control.sh >Reporter: Sergey Chugunov >Assignee: Sergey Chugunov >Priority: Major > Fix For: 2.10 > > Time Spent: 10m > Remaining Estimate: 0h > > Command for removing metadata has the following issues: > # In 'Type not found' scenario it prints long stack traces to console instead > of short information about requested type. > # When used it registers some internal classes which are not supposed to go > through binary metadata registration protocol.
[jira] [Commented] (IGNITE-13368) Speed base throttling unexpectedly degraded to zero
[ https://issues.apache.org/jira/browse/IGNITE-13368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17180371#comment-17180371 ] Anton Kalashnikov commented on IGNITE-13368: [~sergey-chugunov] can you take a look at it? > Speed base throttling unexpectedly degraded to zero > --- > > Key: IGNITE-13368 > URL: https://issues.apache.org/jira/browse/IGNITE-13368 > Project: Ignite > Issue Type: Bug >Reporter: Anton Kalashnikov >Assignee: Anton Kalashnikov >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > New test failure in master PagesWriteThrottleSmokeTest.testThrottle > https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=2808794487465215609=%3Cdefault%3E=testDetails > Throttling degraded to zero. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (IGNITE-13368) Speed base throttling unexpectedly degraded to zero
Anton Kalashnikov created IGNITE-13368: -- Summary: Speed base throttling unexpectedly degraded to zero Key: IGNITE-13368 URL: https://issues.apache.org/jira/browse/IGNITE-13368 Project: Ignite Issue Type: Bug Reporter: Anton Kalashnikov Assignee: Anton Kalashnikov New test failure in master PagesWriteThrottleSmokeTest.testThrottle https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8=2808794487465215609=%3Cdefault%3E=testDetails Throttling degraded to zero. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-13151) Checkpointer code refactoring: extracting classes from GridCacheDatabaseSharedManager
[ https://issues.apache.org/jira/browse/IGNITE-13151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17177015#comment-17177015 ]

Anton Kalashnikov commented on IGNITE-13151:

[~agura] Thanks for your comment, but in fact there are no new classes: all of these classes were extracted mostly from GridCacheDatabaseSharedManager with minimal changes. I agree that the javadocs are not perfect, and I have improved them a little, but I suggest doing further improvements in my next task because these classes will change. The same goes for naming: I kept the old names to make the review easier, but it is highly probable that I will find more suitable names for them later.

[~sergey-chugunov] can you recheck these changes (there have not been many changes since the last time) and merge them to master?

> Checkpointer code refactoring: extracting classes from GridCacheDatabaseSharedManager
> -
>
> Key: IGNITE-13151
> URL: https://issues.apache.org/jira/browse/IGNITE-13151
> Project: Ignite
> Issue Type: Sub-task
> Components: persistence
> Reporter: Sergey Chugunov
> Assignee: Anton Kalashnikov
> Priority: Major
> Labels: IEP-47
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The checkpointer is at the center of the Ignite persistence subsystem, and the more people from the community understand it, the more stable and efficient it becomes.
> However, for now the checkpointer code sits inside the GridCacheDatabaseSharedManager class and is entangled with this higher-level and more general component.
> To take a step towards a more modular checkpointer, we need to do two things:
> # Move the checkpointer code out of the database manager into a separate class. (That's what this ticket is about.)
> # Create a well-defined checkpointer API that will allow us to create new checkpointer implementations in the future. An example of this is the new checkpointer implementation needed for the defragmentation feature. (Should be done in a separate ticket)
[jira] [Commented] (IGNITE-13013) Thick client must not open server sockets when used by serverless functions
[ https://issues.apache.org/jira/browse/IGNITE-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17172149#comment-17172149 ]

Anton Kalashnikov commented on IGNITE-13013:

[~ibessonov] LGTM. [~agoncharuk] can you help with the merge, please?

> Thick client must not open server sockets when used by serverless functions
> ---
>
> Key: IGNITE-13013
> URL: https://issues.apache.org/jira/browse/IGNITE-13013
> Project: Ignite
> Issue Type: Improvement
> Components: networking
> Affects Versions: 2.8
> Reporter: Denis A. Magda
> Assignee: Ivan Bessonov
> Priority: Critical
> Fix For: 2.10
> Attachments: image-2020-07-30-18-42-01-266.png
> Time Spent: 10m
> Remaining Estimate: 0h
>
> A thick client fails to start if it is used inside a serverless function such as AWS Lambda or Azure Functions. Cloud providers prohibit opening network ports to accept connections on the function's end. In short, the function can only connect to a remote address.
> To reproduce, you can follow this tutorial and swap the thin client (used in the tutorial) with the thick one:
> https://www.gridgain.com/docs/tutorials/serverless/azure_functions_tutorial
> The thick client needs to support a mode in which the communication SPI doesn't create a server socket if the client is used for serverless computing. This improvement looks like an extra task of this initiative:
> https://issues.apache.org/jira/browse/IGNITE-12438
[jira] [Commented] (IGNITE-13013) Thick client must not open server sockets when used by serverless functions
[ https://issues.apache.org/jira/browse/IGNITE-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17170832#comment-17170832 ] Anton Kalashnikov commented on IGNITE-13013: [~ibessonov] the new changes also LGTM, but I have one note: instead of using the magic number 0 (meaning the socket is not needed), it may be better to use a well-named constant (like PORT_DISABLED or RECEIVED_SOCKET_DISABLE), or at least to add a comment next to this number, because when I see a port equal to 0 it is not obvious which behaviour I should expect. -- This message was sent by Atlassian Jira (v8.3.4#803005)
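The review note above about replacing the magic number can be illustrated with a minimal sketch. It assumes the value 0 and the name PORT_DISABLED from the comment; the class and method names are hypothetical and are not Ignite's actual API.

```java
// Hypothetical sketch: class and method names are illustrative, not Ignite code.
final class CommunicationPortSketch {
    /** Sentinel meaning "do not open a server socket" (value 0 assumed from the review comment). */
    static final int PORT_DISABLED = 0;

    private CommunicationPortSketch() {
        // Utility holder, not meant to be instantiated.
    }

    /** Returns true when the configured port asks for a listening socket. */
    static boolean serverSocketRequired(int localPort) {
        return localPort != PORT_DISABLED;
    }

    /** Human-readable description, so logs do not show a bare 0. */
    static String describe(int localPort) {
        return serverSocketRequired(localPort)
            ? "listen on port " + localPort
            : "server socket disabled";
    }
}
```

With a named constant, a check such as `serverSocketRequired(cfgPort)` states its intent at the call site, whereas `cfgPort != 0` leaves the reader guessing.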
[jira] [Commented] (IGNITE-13098) TcpCommunicationSpi split to independent classes
[ https://issues.apache.org/jira/browse/IGNITE-13098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168963#comment-17168963 ] Anton Kalashnikov commented on IGNITE-13098: [~ivan.glukos][~mstepachev], I took a look at it. LGTM. > TcpCommunicationSpi split to independent classes > > > Key: IGNITE-13098 > URL: https://issues.apache.org/jira/browse/IGNITE-13098 > Project: Ignite > Issue Type: Bug > Environment: TcpCommunicationSpi split to independent classes >Reporter: Stepachev Maksim >Assignee: Stepachev Maksim >Priority: Major > Fix For: 2.10 > > Time Spent: 10m > Remaining Estimate: 0h > > h2. Description > This ticket describes requirements for TcpCommunicationSpi refactoring. The > main goal is to split the class without changing behavior and public API. > *Actual problem:* > Currently TcpCommunicationSpi has over 5K lines and includes 15+ inner > classes, such as: > # ShmemAcceptWorker > # SHMemHandshakeClosure > # ShmemWorker > # CommunicationDiscoveryEventListener > # CommunicationWorker > # ConnectFuture > # ConnectGateway > # ConnectionKey > # ConnectionPolicy > # DisconnectedSessionInfo > # FirstConnectionPolicy > # HandshakeTimeoutObject > # RoundRobinConnectionPolicy > # TcpCommunicationConnectionCheckFuture > # TcpCommunicationSpiMBeanImpl > In addition, it contains the logic of the client connection life cycle, the nio server > handler, and the handshake handler. > The classes above have cyclic dependencies and high coupling. The whole > mechanism works because classes have access to each other via parent class > references. As a result, class initialization isn't consistent. By > consistent I mean that a class created via its constructor is ready to be used. All > of the classes work with context and share properties everywhere. > Many methods of TcpCommunicationSpi don’t have a single responsibility. > An example is getNodeAttribute: it makes a client reservation, takes the IP > address of the node, and provides attributes. 
> It works fine and we usually don’t have reasons to change anything. But if > you want to create a test whose behavior differs even a little from > blocking a message, you can't mock or change the behavior of inner classes. An > example is a test covering a change in the handshake process. Some people add test > methods to the public API, like "closeConnections" or "openSocketChannel", because > the current design isn't suited to it. Test development for minor changes also > takes a lot of time. > *Solution:* > The scope of work is big, and the communication SPI is a place that should be > changed carefully. I recommend doing this refactoring step by step. > * The first idea is to split the parent class into independent classes and > move them to the internal package. We should achieve SOLID when it’s done. > * Extract spread logic to appropriate classes like ClientPool, > HandshakeHandler, etc. > * Make a common transfer object for TCSpi configuration. > * Make dependencies direct if it is possible. > * Initialize all dependencies in one place. > * Make child classes context-free. > * Try to make classes more testable. > * Use the idea of dependency injection without a framework for it. > *Benefits:* > With the ability to write truly jUnit-style tests and cover functionality > with better testing, we get an easier way to develop new features and > optimizations needed in such low-level components as TcpCommunicationSpi. > Examples of features that improve usability of Apache Ignite a lot are: > inverse communication connection with optimizations and connection > multiplexing. Both of the features could be used in environments with > restricted network connectivity (e.g. when connections between nodes could be > established only in one direction). -- This message was sent by Atlassian Jira (v8.3.4#803005)
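The "dependency injection without a framework" and "class created via constructor is ready to be used" points in the ticket can be sketched roughly as follows. This is only an illustration under assumptions: ClientPool and HandshakeHandler are named in the ticket, but the types, methods, and signatures below (ClientPoolSketch, HandshakeHandlerSketch, connect, connectedNodes) are invented here and do not describe the actual refactored code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these types only illustrate constructor injection,
// they are not the classes produced by the actual refactoring.
interface HandshakeHandlerSketch {
    /** Returns true if the handshake with the given node succeeded. */
    boolean handshake(String nodeId);
}

final class ClientPoolSketch {
    private final HandshakeHandlerSketch handshake;
    private final List<String> connected = new ArrayList<>();

    /** All dependencies arrive through the constructor, so the object is usable right away. */
    ClientPoolSketch(HandshakeHandlerSketch handshake) {
        this.handshake = handshake;
    }

    /** Registers the node only if the injected handshake handler accepts it. */
    boolean connect(String nodeId) {
        if (!handshake.handshake(nodeId))
            return false;

        connected.add(nodeId);
        return true;
    }

    /** Defensive copy of the currently connected node ids. */
    List<String> connectedNodes() {
        return new ArrayList<>(connected);
    }
}
```

Because the handshake logic is passed in rather than reached through a parent-class reference, a test can supply its own handler, for example `new ClientPoolSketch(nodeId -> false)`, without adding test-only methods to the public API.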
[jira] [Commented] (IGNITE-13013) Thick client must not open server sockets when used by serverless functions
[ https://issues.apache.org/jira/browse/IGNITE-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17168832#comment-17168832 ] Anton Kalashnikov commented on IGNITE-13013: [~ibessonov] thanks for your changes. LGTM. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-13269) Waiting for completion of operations on indexes before cache stop
[ https://issues.apache.org/jira/browse/IGNITE-13269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17162841#comment-17162841 ] Anton Kalashnikov commented on IGNITE-13269: [~ktkale...@gridgain.com] LGTM. [~sergey-chugunov] Can you help with merge, please? > Waiting for completion of operations on indexes before cache stop > - > > Key: IGNITE-13269 > URL: https://issues.apache.org/jira/browse/IGNITE-13269 > Project: Ignite > Issue Type: Improvement >Reporter: Kirill Tkalenko >Assignee: Kirill Tkalenko >Priority: Major > Fix For: 2.10 > > Time Spent: 10m > Remaining Estimate: 0h > > Currently, there is no waiting for completion of operation on indexes when > cache is stopped. Because of this, there may be errors, for example, when > restarting the node: > {code:java} > Suppressed: java.lang.AssertionError: Release pinned page: FullPageId > [pageId=000206bfc352, effectivePageId=06bfc352, grpId=-782612924] > at > org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl$PagePool.releaseFreePage(PageMemoryImpl.java:1902) > at > org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl$PagePool.access$2100(PageMemoryImpl.java:1773) > at > org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl$ClearSegmentRunnable.run(PageMemoryImpl.java:2878) > ... 3 common frames omitted > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
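One generic way to make a stop wait for in-flight operations is a read-write "busy lock". The sketch below is an assumption-laden illustration of that general technique, not the approach taken in the IGNITE-13269 patch; the class IndexOpsGuardSketch and its methods are hypothetical.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch of a busy lock; not taken from the Ignite code base.
final class IndexOpsGuardSketch {
    private final ReentrantReadWriteLock busyLock = new ReentrantReadWriteLock();

    /** Guarded by busyLock. */
    private boolean stopped;

    /** Runs an index operation under the read lock; returns false if the cache is already stopped. */
    boolean runIndexOperation(Runnable op) {
        busyLock.readLock().lock();
        try {
            if (stopped)
                return false;

            op.run();
            return true;
        }
        finally {
            busyLock.readLock().unlock();
        }
    }

    /** Blocks until all in-flight operations release the read lock, then rejects new ones. */
    void stop() {
        busyLock.writeLock().lock();
        try {
            stopped = true;
        }
        finally {
            busyLock.writeLock().unlock();
        }
    }
}
```

Many operations can hold the read lock concurrently, while `stop()` needs the write lock and therefore cannot proceed until they have all finished. Pages would be released only after `stop()` returns.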
[jira] [Commented] (IGNITE-11942) IGFS and Hadoop Accelerator Discontinuation
[ https://issues.apache.org/jira/browse/IGNITE-11942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17154421#comment-17154421 ] Anton Kalashnikov commented on IGNITE-11942: [~agoncharuk] can you take a look at these changes? [~kuaw26], [~vsisko] can you take a look at the web-console part? > IGFS and Hadoop Accelerator Discontinuation > --- > > Key: IGNITE-11942 > URL: https://issues.apache.org/jira/browse/IGNITE-11942 > Project: Ignite > Issue Type: Task >Reporter: Denis A. Magda >Assignee: Anton Kalashnikov >Priority: Blocker > Fix For: 2.9 > > Time Spent: 10m > Remaining Estimate: 0h > > The community has voted for the following decision: > * IGFS and In-Memory Hadoop Accelerator components are to be discontinued and > no longer supported by the community > * The existing source code of IGFS and In-Memory Hadoop Accelerator is to be > removed from Ignite master. Before that, a special branch like > "ignite-igfs-and-hadoop-accelerator" to be forked off the master in order to > preserve the sources in Git history for those who might need it. > The voting thread: > http://apache-ignite-developers.2346864.n4.nabble.com/VOTE-Complete-Discontinuation-of-IGFS-and-Hadoop-Accelerator-td42405.html > Once the changes are made for Ignite 2.8, please contact Denis Magda to > update a public documentation. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (IGNITE-13013) Thick client must not open server sockets when used by serverless functions
[ https://issues.apache.org/jira/browse/IGNITE-13013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17153620#comment-17153620 ] Anton Kalashnikov commented on IGNITE-13013: [~dmagda], I agree that client-to-client connectivity is not that useful in Ignite. So I have another solution that looks fairly easy to implement. We can allow the communication port to be set to -1 (meaning the server socket shouldn't be opened). When the user sets the port to -1, we also set forceClientToServer to true. We can also add validation when establishing a communication connection: if it is a client-to-client connection and the remote client doesn't support such connections, we notify the user by throwing an exception. As I understand it, this scenario mostly applies to compute. In summary, the expected changes are: * Setting the communication port to -1 is allowed. * If the communication port is set to -1, forceClientToServer is set to true. * If a client tries to establish a connection with another client whose port equals -1, an exception is thrown. > Thick client must not open server sockets when used by serverless functions > --- > > Key: IGNITE-13013 > URL: https://issues.apache.org/jira/browse/IGNITE-13013 > Project: Ignite > Issue Type: Improvement > Components: networking >Affects Versions: 2.8 >Reporter: Denis A. Magda >Priority: Critical > Fix For: 2.9 > > > A thick client fails to start when used inside a serverless function > such as AWS Lambda or Azure Functions. Cloud providers prohibit opening > network ports to accept connections on the function's end. In short, the > function can only connect to a remote address. 
> To reproduce, you can follow this tutorial and swap the thin client (used in > the tutorial) with the thick one: > https://www.gridgain.com/docs/tutorials/serverless/azure_functions_tutorial > The thick client needs to support a mode when the communication SPI doesn't > create a server socket if the client is used for serverless computing. This > improvement looks like an extra task of this initiative: > https://issues.apache.org/jira/browse/IGNITE-12438 -- This message was sent by Atlassian Jira (v8.3.4#803005)
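The three changes proposed in the comment on this ticket (port -1 is allowed, it implies forceClientToServer, and a client-to-client connection to such a client fails) can be sketched as below. Only the value -1 and the forceClientToServer name come from the comment; the class ClientCommSettings, its fields, and the exception types are assumptions made for illustration and are not Ignite's API.

```java
// Hypothetical sketch of the three proposed rules; none of these types are Ignite classes.
final class ClientCommSettings {
    /** Port value meaning "do not open a server socket" (from the proposal). */
    static final int NO_SERVER_SOCKET = -1;

    final int localPort;
    final boolean forceClientToServer;

    ClientCommSettings(int localPort, boolean forceClientToServer) {
        // Rule 1: -1 is accepted in addition to the usual port range.
        if (localPort < NO_SERVER_SOCKET || localPort > 65535)
            throw new IllegalArgumentException("Invalid port: " + localPort);

        this.localPort = localPort;
        // Rule 2: a disabled server socket forces client-to-server connections.
        this.forceClientToServer = forceClientToServer || localPort == NO_SERVER_SOCKET;
    }

    /** A client without a server socket cannot accept incoming connections. */
    boolean acceptsIncomingConnections() {
        return localPort != NO_SERVER_SOCKET;
    }

    /** Rule 3: connecting to a client whose port is -1 fails with an exception. */
    static void validateClientToClient(ClientCommSettings remote) {
        if (!remote.acceptsIncomingConnections())
            throw new IllegalStateException("Remote client does not accept incoming connections (port is -1)");
    }
}
```

For example, `new ClientCommSettings(-1, false)` ends up with `forceClientToServer == true`, and passing it to `validateClientToClient` throws, which is how the caller would be notified.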