[ https://issues.apache.org/jira/browse/IGNITE-23212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kirill Tkalenko updated IGNITE-23212:
-------------------------------------
    Component/s: persistence

> Page replacement doesn't work sometimes
> ---------------------------------------
>
>                 Key: IGNITE-23212
>                 URL: https://issues.apache.org/jira/browse/IGNITE-23212
>             Project: Ignite
>          Issue Type: Bug
>          Components: persistence
>            Reporter: Ivan Bessonov
>            Assignee: Kirill Tkalenko
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0
>
>          Time Spent: 7h 40m
>  Remaining Estimate: 0h
>
> Under a sophisticated load, we sometimes see the following exception:
> {noformat}
> org.apache.ignite.lang.IgniteException: Error while executing addWriteCommitted: [rowId=RowId [partitionId=13, uuid=00000191-eb5c-824c-7a07-fa1210b49ed8], tableId=10, partitionId=13]
>     at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710) ~[?:?]
>     at org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:789) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:723) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:525) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.util.ViewUtils.copyExceptionWithCauseIfPossible(ViewUtils.java:91) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.util.ViewUtils.ensurePublicException(ViewUtils.java:71) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.util.ViewUtils.sync(ViewUtils.java:54) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.client.table.ClientKeyValueBinaryView.put(ClientKeyValueBinaryView.java:207) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.client.table.ClientKeyValueBinaryView.put(ClientKeyValueBinaryView.java:60) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>     at site.ycsb.db.ignite3.IgniteClient.insert(IgniteClient.java:49) [ignite3-binding-2024.15.jar:?]
>     at site.ycsb.DBWrapper.insert(DBWrapper.java:284) [core-2024.15.jar:?]
>     at site.ycsb.workloads.CoreWorkload.doInsert(CoreWorkload.java:657) [core-2024.15.jar:?]
>     at site.ycsb.ClientThread.run(ClientThread.java:181) [core-2024.15.jar:?]
>     at java.lang.Thread.run(Thread.java:829) [?:?]
> Caused by: org.apache.ignite.lang.IgniteException: Error while executing addWriteCommitted: [rowId=RowId [partitionId=13, uuid=00000191-eb5c-824c-7a07-fa1210b49ed8], tableId=10, partitionId=13]
>     at java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710) ~[?:?]
>     at org.apache.ignite.internal.util.ExceptionUtils$1.copy(ExceptionUtils.java:789) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.util.ExceptionUtils$ExceptionFactory.createCopy(ExceptionUtils.java:723) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.util.ExceptionUtils.copyExceptionWithCause(ExceptionUtils.java:525) ~[ignite-core-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.client.TcpClientChannel.readError(TcpClientChannel.java:549) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.client.TcpClientChannel.processNextMessage(TcpClientChannel.java:435) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>     at org.apache.ignite.internal.client.TcpClientChannel.lambda$onMessage$3(TcpClientChannel.java:277) ~[ignite-client-3.0.0-SNAPSHOT.jar:?]
>     at java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426) ~[?:?]
>     at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:290) ~[?:?]
>     at java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1020) ~[?:?]
>     at java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1656) ~[?:?]
>     at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1594) ~[?:?]
>     at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183) ~[?:?]
> Caused by: org.apache.ignite.lang.IgniteException: org.apache.ignite.lang.IgniteException: IGN-CMN-65535 TraceId:60b4295a-8c18-4cdb-93e8-266bc9aaed88 Error while executing addWriteCommitted: [rowId=RowId [partitionId=13, uuid=00000191-eb5c-824c-7a07-fa1210b49ed8], tableId=10, partitionId=13]
>     at org.apache.ignite.internal.lang.IgniteExceptionMapperUtil.lambda$mapToPublicException$2(IgniteExceptionMapperUtil.java:88)
>     at org.apache.ignite.internal.lang.IgniteExceptionMapperUtil.mapCheckingResultIsPublic(IgniteExceptionMapperUtil.java:141)
>     at org.apache.ignite.internal.lang.IgniteExceptionMapperUtil.mapToPublicException(IgniteExceptionMapperUtil.java:137)
>     at org.apache.ignite.internal.lang.IgniteExceptionMapperUtil.mapToPublicException(IgniteExceptionMapperUtil.java:88)
>     at org.apache.ignite.internal.lang.IgniteExceptionMapperUtil.lambda$convertToPublicFuture$3(IgniteExceptionMapperUtil.java:178)
>     at java.base/java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930)
>     at java.base/java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
>     at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>     at org.apache.ignite.internal.replicator.ReplicaService.lambda$sendToReplica$7(ReplicaService.java:249)
>     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>     at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
>     at org.apache.ignite.internal.network.DefaultMessagingService.onInvokeResponse(DefaultMessagingService.java:586)
>     at org.apache.ignite.internal.network.DefaultMessagingService.send0(DefaultMessagingService.java:272)
>     at org.apache.ignite.internal.network.DefaultMessagingService.respond(DefaultMessagingService.java:215)
>     at org.apache.ignite.internal.network.DefaultMessagingService.respond(DefaultMessagingService.java:228)
>     at org.apache.ignite.internal.network.MessagingService.respond(MessagingService.java:142)
>     at org.apache.ignite.internal.replicator.ReplicaManager.lambda$handleReplicaRequest$5(ReplicaManager.java:460)
>     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>     at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>     at org.apache.ignite.internal.table.distributed.replicator.PartitionReplicaListener.lambda$applyCmdWithRetryOnSafeTimeReorderException$117(PartitionReplicaListener.java:2680)
>     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>     at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at java.base/java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
>     at org.apache.ignite.internal.raft.RaftGroupServiceImpl.handleSmErrorResponse(RaftGroupServiceImpl.java:749)
>     at org.apache.ignite.internal.raft.RaftGroupServiceImpl.lambda$sendWithRetry$40(RaftGroupServiceImpl.java:613)
>     at java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
>     at java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
>     at java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506)
>     at java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2073)
>     at org.apache.ignite.internal.network.DefaultMessagingService.onInvokeResponse(DefaultMessagingService.java:586)
>     at org.apache.ignite.internal.network.DefaultMessagingService.send0(DefaultMessagingService.java:272)
>     at org.apache.ignite.internal.network.DefaultMessagingService.respond(DefaultMessagingService.java:215)
>     at org.apache.ignite.internal.network.MessagingService.respond(MessagingService.java:115)
>     at org.apache.ignite.raft.jraft.rpc.impl.IgniteRpcServer$NetworkRpcContext.sendResponse(IgniteRpcServer.java:245)
>     at org.apache.ignite.raft.jraft.rpc.impl.ActionRequestProcessor.sendSMError(ActionRequestProcessor.java:302)
>     at org.apache.ignite.raft.jraft.rpc.impl.ActionRequestProcessor.sendResponse(ActionRequestProcessor.java:257)
>     at org.apache.ignite.raft.jraft.rpc.impl.ActionRequestProcessor$1.result(ActionRequestProcessor.java:187)
>     at org.apache.ignite.internal.raft.server.impl.JraftServerImpl$DelegatingStateMachine$1$1.result(JraftServerImpl.java:770)
>     at org.apache.ignite.internal.table.distributed.raft.PartitionListener.lambda$onWriteBusy$2(PartitionListener.java:280)
>     at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
>     at org.apache.ignite.internal.table.distributed.raft.PartitionListener.onWriteBusy(PartitionListener.java:199)
>     at org.apache.ignite.internal.table.distributed.raft.PartitionListener.onWrite(PartitionListener.java:192)
>     at org.apache.ignite.internal.raft.server.impl.JraftServerImpl$DelegatingStateMachine.onApply(JraftServerImpl.java:731)
>     at org.apache.ignite.raft.jraft.core.FSMCallerImpl.doApplyTasks(FSMCallerImpl.java:571)
>     at org.apache.ignite.raft.jraft.core.FSMCallerImpl.doCommitted(FSMCallerImpl.java:539)
>     at org.apache.ignite.raft.jraft.core.FSMCallerImpl.runApplyTask(FSMCallerImpl.java:458)
>     at org.apache.ignite.raft.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:131)
>     at org.apache.ignite.raft.jraft.core.FSMCallerImpl$ApplyTaskHandler.onEvent(FSMCallerImpl.java:125)
>     at org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:326)
>     at org.apache.ignite.raft.jraft.disruptor.StripedDisruptor$StripeEntryHandler.onEvent(StripedDisruptor.java:283)
>     at com.lmax.disruptor.BatchEventProcessor.processEvents(BatchEventProcessor.java:167)
>     at com.lmax.disruptor.BatchEventProcessor.run(BatchEventProcessor.java:122)
>     at java.base/java.lang.Thread.run(Thread.java:829)
> Caused by: org.apache.ignite.internal.storage.StorageException: IGN-STORAGE-1 TraceId:60b4295a-8c18-4cdb-93e8-266bc9aaed88 Error while executing addWriteCommitted: [rowId=RowId [partitionId=13, uuid=00000191-eb5c-824c-7a07-fa1210b49ed8], tableId=10, partitionId=13]
>     at org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.lambda$addWriteCommitted$13(AbstractPageMemoryMvPartitionStorage.java:532)
>     at org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.busy(AbstractPageMemoryMvPartitionStorage.java:659)
>     at org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.addWriteCommitted(AbstractPageMemoryMvPartitionStorage.java:515)
>     at org.apache.ignite.internal.table.distributed.raft.snapshot.outgoing.SnapshotAwarePartitionDataStorage.addWriteCommitted(SnapshotAwarePartitionDataStorage.java:135)
>     at org.apache.ignite.internal.table.distributed.StorageUpdateHandler.tryProcessRow(StorageUpdateHandler.java:162)
>     at org.apache.ignite.internal.table.distributed.StorageUpdateHandler.lambda$handleUpdate$0(StorageUpdateHandler.java:114)
>     at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.lambda$runConsistently$0(PersistentPageMemoryMvPartitionStorage.java:180)
>     at org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.busy(AbstractPageMemoryMvPartitionStorage.java:659)
>     at org.apache.ignite.internal.storage.pagememory.mv.PersistentPageMemoryMvPartitionStorage.runConsistently(PersistentPageMemoryMvPartitionStorage.java:170)
>     at org.apache.ignite.internal.table.distributed.raft.snapshot.outgoing.SnapshotAwarePartitionDataStorage.runConsistently(SnapshotAwarePartitionDataStorage.java:76)
>     at org.apache.ignite.internal.table.distributed.StorageUpdateHandler.handleUpdate(StorageUpdateHandler.java:109)
>     at org.apache.ignite.internal.table.distributed.raft.PartitionListener.handleUpdateCommand(PartitionListener.java:327)
>     at org.apache.ignite.internal.table.distributed.raft.PartitionListener.lambda$onWriteBusy$2(PartitionListener.java:242)
>     ... 14 more
> Caused by: org.apache.ignite.internal.pagememory.tree.CorruptedTreeException: IGN-CMN-65535 TraceId:60b4295a-8c18-4cdb-93e8-266bc9aaed88 B+Tree is corrupted [groupId=10, pageIds=[563005788186369], groupName=10, msg=Runtime failure on search row: org.apache.ignite.internal.storage.pagememory.mv.VersionChainKey@7cda74e0]
>     at org.apache.ignite.internal.pagememory.tree.BplusTree.corruptedTreeException(BplusTree.java:6670)
>     at org.apache.ignite.internal.pagememory.tree.BplusTree.invoke(BplusTree.java:2119)
>     at org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.lambda$addWriteCommitted$13(AbstractPageMemoryMvPartitionStorage.java:524)
>     ... 26 more
> Caused by: java.lang.AssertionError: Checkpoint must be in progress
>     at org.apache.ignite.internal.pagememory.persistence.checkpoint.CheckpointManager.writePageToDeltaFilePageStore(CheckpointManager.java:312)
>     at org.apache.ignite.internal.storage.pagememory.PersistentPageMemoryDataRegion.lambda$start$0(PersistentPageMemoryDataRegion.java:110)
>     at org.apache.ignite.internal.pagememory.persistence.replacement.DelayedDirtyPageWrite.finishReplacement(DelayedDirtyPageWrite.java:104)
>     at org.apache.ignite.internal.pagememory.persistence.PersistentPageMemory.allocatePageNoReuse(PersistentPageMemory.java:593)
>     at org.apache.ignite.internal.pagememory.PageIdAllocator.allocatePage(PageIdAllocator.java:111)
>     at org.apache.ignite.internal.pagememory.PageIdAllocator.allocatePage(PageIdAllocator.java:70)
>     at org.apache.ignite.internal.pagememory.freelist.FreeListImpl.allocateDataPage(FreeListImpl.java:491)
>     at org.apache.ignite.internal.pagememory.freelist.FreeListImpl.writeSinglePage(FreeListImpl.java:621)
>     at org.apache.ignite.internal.pagememory.freelist.FreeListImpl.insertDataRow(FreeListImpl.java:500)
>     at org.apache.ignite.internal.storage.pagememory.mv.AbstractPageMemoryMvPartitionStorage.insertRowVersion(AbstractPageMemoryMvPartitionStorage.java:412)
>     at org.apache.ignite.internal.storage.pagememory.mv.AddWriteCommittedInvokeClosure.insertCommittedRowVersion(AddWriteCommittedInvokeClosure.java:137)
>     at org.apache.ignite.internal.storage.pagememory.mv.AddWriteCommittedInvokeClosure.call(AddWriteCommittedInvokeClosure.java:99)
>     at org.apache.ignite.internal.storage.pagememory.mv.AddWriteCommittedInvokeClosure.call(AddWriteCommittedInvokeClosure.java:43)
>     at org.apache.ignite.internal.pagememory.tree.BplusTree$Invoke.invokeClosure(BplusTree.java:4268)
>     at org.apache.ignite.internal.pagememory.tree.BplusTree.invokeDown(BplusTree.java:2187)
>     at org.apache.ignite.internal.pagememory.tree.BplusTree.invokeDown(BplusTree.java:2171)
>     at org.apache.ignite.internal.pagememory.tree.BplusTree.invokeDown(BplusTree.java:2171)
>     at org.apache.ignite.internal.pagememory.tree.BplusTree.invokeDown(BplusTree.java:2171)
>     at org.apache.ignite.internal.pagememory.tree.BplusTree.invoke(BplusTree.java:2093)
>     ... 27 more
> {noformat}
> h2. {color:#DE350B}Update:{color}
> The problem is an incorrect implementation of page replacement: first, it
> can happen after the checkpoint has completed; second, it can happen after
> the files have been fsynced; third, it can disrupt the operation of delta
> files. This needs to be fixed.
> What needs to be done:
> # Get rid of delayed page replacement, since its implementation and use can
> lead to bugs.
> # Perform page replacement only during a checkpoint and before the fsync of
> delta files.
> # Start the fsync of delta files only after all page replacements have
> completed, and prohibit page replacement after that.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
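
The ordering described in items 2 and 3 of the list above can be sketched as a small gate. This is only an illustration and not the actual fix: PageReplacementGate and all of its method names are hypothetical and do not exist in Ignite. It assumes that page replacement is allowed only between the start of a checkpoint and the start of the delta-file fsync, and that the fsync waits for replacements that are still in flight.

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/**
 * Hypothetical gate illustrating the ordering rules from the ticket.
 * Not part of Apache Ignite; every name here is made up for the sketch.
 */
final class PageReplacementGate {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition noActiveReplacements = lock.newCondition();

    /** True only between checkpoint start and the start of the delta-file fsync. */
    private boolean replacementAllowed;

    /** Number of page replacements currently in flight. */
    private int activeReplacements;

    /** Called when a checkpoint starts: page replacement becomes legal. */
    public void onCheckpointStart() {
        lock.lock();
        try {
            replacementAllowed = true;
        } finally {
            lock.unlock();
        }
    }

    /** Registers a replacement and returns true, or returns false if replacement is prohibited. */
    public boolean tryBeginReplacement() {
        lock.lock();
        try {
            if (!replacementAllowed) {
                return false;
            }
            activeReplacements++;
            return true;
        } finally {
            lock.unlock();
        }
    }

    /** Marks one in-flight replacement as finished and wakes up a waiting fsync if it was the last one. */
    public void endReplacement() {
        lock.lock();
        try {
            if (activeReplacements == 0) {
                throw new IllegalStateException("No replacement in flight");
            }
            if (--activeReplacements == 0) {
                noActiveReplacements.signalAll();
            }
        } finally {
            lock.unlock();
        }
    }

    /** Call right before fsync of delta files: prohibits new replacements, then waits for running ones. */
    public void blockReplacementsBeforeFsync() {
        lock.lock();
        try {
            replacementAllowed = false;
            while (activeReplacements > 0) {
                noActiveReplacements.awaitUninterruptibly();
            }
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        PageReplacementGate gate = new PageReplacementGate();
        System.out.println("before checkpoint: " + gate.tryBeginReplacement()); // false
        gate.onCheckpointStart();
        System.out.println("during checkpoint: " + gate.tryBeginReplacement()); // true
        gate.endReplacement();
        gate.blockReplacementsBeforeFsync(); // the delta-file fsync would start after this returns
        System.out.println("after fsync started: " + gate.tryBeginReplacement()); // false
    }
}
```

In this sketch a caller that gets false from tryBeginReplacement() has to obtain a free page some other way or wait for the next checkpoint; how the real implementation should handle that case is not specified in the ticket.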