[ https://issues.apache.org/jira/browse/ASTERIXDB-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219598#comment-16219598 ]
Michael J. Carey commented on ASTERIXDB-2145:
---------------------------------------------

This was a good workaround and will actually be fine for Cloudberry - but the
smaller the components, the worse the write amplification will be. As has been
discussed on a related e-mail thread, the right solution is for recovery not to
try to have all of the datasets active simultaneously - that's a broken
approach to recovery - we should be able to recover independently of the number
of datasets. AsterixDB users should be setting their component size parameter
based on their expectations for the data - how much there will be, how they
want to trade off write amplification vs. query efficiency, etc.

> Recovery process fails on 100 datasets
> --------------------------------------
>
>                 Key: ASTERIXDB-2145
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2145
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>
> On the Cloudberry DB, currently, there are 112 datasets on a dataverse. When
> restarting that instance, the NC showed the following error and stopped.
> java.lang.IllegalStateException: Failed to redo
>     at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:712)
>     at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:378)
>     at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:187)
>     at org.apache.asterix.app.nc.RecoveryManager.startLocalRecovery(RecoveryManager.java:179)
>     at org.apache.asterix.app.nc.task.LocalRecoveryTask.perform(LocalRecoveryTask.java:43)
>     at org.apache.asterix.app.replication.message.StartupTaskResponseMessage.handle(StartupTaskResponseMessage.java:56)
>     at org.apache.asterix.messaging.NCMessageBroker.receivedMessage(NCMessageBroker.java:92)
>     at org.apache.hyracks.control.nc.work.ApplicationMessageWork.run(ApplicationMessageWork.java:51)
>     at org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127)
> Caused by: org.apache.hyracks.api.exceptions.HyracksDataException:
> Cannot allocate dataset 191 memory since memory budget would be exceeded.
>     at org.apache.asterix.common.context.DatasetLifecycleManager.allocateMemory(DatasetLifecycleManager.java:568)
>     at org.apache.hyracks.storage.common.buffercache.ResourceHeapBufferAllocator.reserveAllocation(ResourceHeapBufferAllocator.java:53)
>     at org.apache.hyracks.storage.am.lsm.common.impls.VirtualBufferCache.open(VirtualBufferCache.java:307)
>     at org.apache.hyracks.storage.am.lsm.common.impls.MultitenantVirtualBufferCache.open(MultitenantVirtualBufferCache.java:119)
>     at org.apache.hyracks.storage.am.lsm.btree.impls.LSMBTree.allocateMemoryComponent(LSMBTree.java:611)
>     at org.apache.hyracks.storage.am.lsm.common.impls.AbstractLSMIndex.allocateMemoryComponents(AbstractLSMIndex.java:389)
>     at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.modify(LSMHarness.java:421)
>     at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.forceModify(LSMHarness.java:368)
>     at org.apache.hyracks.storage.am.lsm.common.impls.LSMTreeIndexAccessor.forceUpsert(LSMTreeIndexAccessor.java:181)
>     at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:707)
>     ... 8 more
>
> So, I increased the storage.memorycomponent.globalbudget parameter from 3GB
> to 5GB. Still, the NC showed the following error and the recovery process
> could not finish.
>
> ... similar log records ...
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadDataverse
> INFO: Loading dataverse:berry
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
> INFO: Loading index:meta_idx_meta
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
> INFO: Resource loaded 161:storage/partition_1/berry/meta_idx_meta
> Oct 25, 2017 9:34:09 AM org.apache.hyracks.util.ExitUtil$ExitThread run
> INFO: JVM exiting with status 2; bye!
>
> So, I checked the parameter information page and found that the default
> value of storage.memorycomponent.numpages is 1/16 of the global component
> budget. Therefore, I decreased this parameter so that more datasets could
> fit in memory at once, and the instance was finally able to start. So, it
> seems that the recovery process tries to load all datasets into memory and
> keep them there, and this needs to be checked.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
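
The budget arithmetic behind the workaround can be sketched as follows. This is a rough model rather than AsterixDB's actual accounting: it assumes each dataset reserves exactly numpages * pagesize bytes from the global budget during recovery, and it assumes a 128KB memory component page size, which is not stated in the report. The function names are hypothetical. Real usage will be higher, since datasets may carry secondary indexes and the metadata datasets draw on the same budget.

```python
GIB = 1024 ** 3
KIB = 1024

# Assumption: page size is not given in the report; 128 KiB is used for illustration.
PAGE_SIZE = 128 * KIB


def default_num_pages(global_budget, page_size=PAGE_SIZE):
    """Pages per dataset when numpages defaults to 1/16 of the global budget."""
    return global_budget // 16 // page_size


def max_resident_datasets(global_budget, num_pages, page_size=PAGE_SIZE):
    """Datasets that fit if each reserves num_pages * page_size bytes."""
    return global_budget // (num_pages * page_size)


def num_pages_for(global_budget, dataset_count, page_size=PAGE_SIZE):
    """Largest per-dataset page count that lets dataset_count datasets fit."""
    return global_budget // (dataset_count * page_size)


if __name__ == "__main__":
    for budget_gib in (3, 5):
        budget = budget_gib * GIB
        pages = default_num_pages(budget)
        print(budget_gib, "GiB budget, default numpages", pages,
              "-> datasets in memory:", max_resident_datasets(budget, pages))

    # Page count needed to hold the 112 datasets from the report in a 3 GiB budget.
    pages = num_pages_for(3 * GIB, 112)
    print("numpages for 112 datasets:", pages,
          "-> datasets in memory:", max_resident_datasets(3 * GIB, pages))
```

Under these assumptions, a default that scales with the budget caps the number of resident datasets at 16 for either budget, which would be consistent with the 3GB to 5GB increase not helping on its own; only lowering numpages raises the count.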