[jira] [Commented] (ASTERIXDB-2145) Recovery process fails on 100 datasets

Michael J. Carey (JIRA) Wed, 25 Oct 2017 14:46:19 -0700

    [ 
https://issues.apache.org/jira/browse/ASTERIXDB-2145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219598#comment-16219598
 ]


Michael J. Carey commented on ASTERIXDB-2145:
---------------------------------------------

This was a good workaround and will actually be fine for Cloudberry, but the 
smaller the components, the worse the write amplification will be.  As has been 
discussed on a related e-mail thread, the right solution is for recovery not to 
try to have all of the datasets active simultaneously. That is a broken 
approach to recovery: we should be able to recover independently of the number 
of datasets.  AsterixDB users should be setting their component size parameter 
based on their expectations for the data: how much there will be, how they 
want to trade off write amplification vs. query efficiency, etc. 
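
To illustrate the idea of recovering independently of the number of datasets, 
here is a minimal sketch of a log replay that keeps only a bounded number of 
datasets active and evicts the least recently used one when the bound is 
reached. It is not AsterixDB code: `replay_with_bounded_datasets`, the 
`(dataset_id, update)` record shape, and the eviction policy are all 
assumptions made for illustration, and a real system would flush the evicted 
dataset's memory component to disk instead of discarding it.

```python
from collections import OrderedDict


def replay_with_bounded_datasets(log_records, max_active):
    """Replay (dataset_id, update) records with at most max_active datasets open.

    Hypothetical illustration only. Returns (peak_active, evictions).
    """
    if max_active < 1:
        raise ValueError("max_active must be at least 1")

    active = OrderedDict()  # dataset_id -> updates buffered in memory
    peak_active = 0
    evictions = 0

    for dataset_id, update in log_records:
        if dataset_id in active:
            # Mark this dataset as most recently used.
            active.move_to_end(dataset_id)
        else:
            if len(active) >= max_active:
                # Evict the least recently used dataset. A real system would
                # flush its memory component to disk at this point.
                active.popitem(last=False)
                evictions += 1
            active[dataset_id] = []
        active[dataset_id].append(update)
        peak_active = max(peak_active, len(active))

    return peak_active, evictions


# 112 datasets, one redo record each, with room for only 16 at a time.
records = [(dataset_id, "redo") for dataset_id in range(112)]
peak, evicted = replay_with_bounded_datasets(records, max_active=16)
print(peak, evicted)
```

With a bound like this, the memory needed during replay depends on 
`max_active` and not on how many datasets exist, at the cost of extra flushes 
when the log interleaves many datasets.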

> Recovery process fails on 100 datasets
> --------------------------------------
>
>                 Key: ASTERIXDB-2145
>                 URL: https://issues.apache.org/jira/browse/ASTERIXDB-2145
>             Project: Apache AsterixDB
>          Issue Type: Bug
>            Reporter: Taewoo Kim
>
> On the Cloudberry DB, there are currently 112 datasets in one dataverse. When 
> restarting that instance, the NC showed the following error and stopped. 
> java.lang.IllegalStateException: Failed to redo
> at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:712)
> at org.apache.asterix.app.nc.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:378)
> at org.apache.asterix.app.nc.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:187)
> at org.apache.asterix.app.nc.RecoveryManager.startLocalRecovery(RecoveryManager.java:179)
> at org.apache.asterix.app.nc.task.LocalRecoveryTask.perform(LocalRecoveryTask.java:43)
> at org.apache.asterix.app.replication.message.StartupTaskResponseMessage.handle(StartupTaskResponseMessage.java:56)
> at org.apache.asterix.messaging.NCMessageBroker.receivedMessage(NCMessageBroker.java:92)
> at org.apache.hyracks.control.nc.work.ApplicationMessageWork.run(ApplicationMessageWork.java:51)
> at org.apache.hyracks.control.common.work.WorkQueue$WorkerThread.run(WorkQueue.java:127)
> Caused by: org.apache.hyracks.api.exceptions.HyracksDataException: Cannot allocate dataset 191 memory since memory budget would be exceeded.
> at org.apache.asterix.common.context.DatasetLifecycleManager.allocateMemory(DatasetLifecycleManager.java:568)
> at org.apache.hyracks.storage.common.buffercache.ResourceHeapBufferAllocator.reserveAllocation(ResourceHeapBufferAllocator.java:53)
> at org.apache.hyracks.storage.am.lsm.common.impls.VirtualBufferCache.open(VirtualBufferCache.java:307)
> at org.apache.hyracks.storage.am.lsm.common.impls.MultitenantVirtualBufferCache.open(MultitenantVirtualBufferCache.java:119)
> at org.apache.hyracks.storage.am.lsm.btree.impls.LSMBTree.allocateMemoryComponent(LSMBTree.java:611)
> at org.apache.hyracks.storage.am.lsm.common.impls.AbstractLSMIndex.allocateMemoryComponents(AbstractLSMIndex.java:389)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.modify(LSMHarness.java:421)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness.forceModify(LSMHarness.java:368)
> at org.apache.hyracks.storage.am.lsm.common.impls.LSMTreeIndexAccessor.forceUpsert(LSMTreeIndexAccessor.java:181)
> at org.apache.asterix.app.nc.RecoveryManager.redo(RecoveryManager.java:707)
> ... 8 more
> So, I increased the storage.memorycomponent.globalbudget parameter from 3GB 
> to 5GB. Still, the NC showed the following error and the recovery process 
> could not finish. 
> ... similar log records ...
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadDataverse
> INFO: Loading dataverse:berry
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
> INFO: Loading index:meta_idx_meta
> Oct 25, 2017 9:33:44 AM org.apache.asterix.transaction.management.resource.PersistentLocalResourceRepository loadIndex
> INFO: Resource loaded 161:storage/partition_1/berry/meta_idx_meta
> Oct 25, 2017 9:34:09 AM org.apache.hyracks.util.ExitUtil$ExitThread run
> INFO: JVM exiting with status 2; bye!
> So, I checked the parameter information page and found that the default 
> value of storage.memorycomponent.numpages is 1/16 of the global memory 
> component budget. Therefore, I decreased this parameter to increase the 
> number of datasets that fit in memory, and the instance was finally able to 
> start. So, it seems that the recovery process tries to load and keep all 
> datasets in memory, and this needs to be checked.  
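
The budget arithmetic described in the report can be sketched as follows. This 
is a simplified model, not AsterixDB's actual accounting: it assumes each 
active dataset reserves exactly one memory component share out of 
storage.memorycomponent.globalbudget, and it ignores secondary indexes, 
multiple memory components per index, and metadata datasets. The function 
names are hypothetical.

```python
GIB = 1024 ** 3


def max_active_datasets(global_budget_bytes, per_dataset_bytes):
    """How many datasets fit if each one reserves per_dataset_bytes (simplified)."""
    return global_budget_bytes // per_dataset_bytes


def per_dataset_budget_for(global_budget_bytes, dataset_count):
    """Largest per-dataset reservation that lets dataset_count datasets fit."""
    return global_budget_bytes // dataset_count


for budget_gib in (3, 5):
    budget = budget_gib * GIB
    # Default from the report: the per-dataset share is 1/16 of the global budget.
    default_share = budget // 16
    print(budget_gib, "GiB budget ->",
          max_active_datasets(budget, default_share), "datasets at the default share")

# Reservation small enough for all 112 datasets under a 5 GiB budget.
share_for_112 = per_dataset_budget_for(5 * GIB, 112)
print("per-dataset share for 112 datasets:", share_for_112, "bytes")
```

Under this model, a per-dataset share that is a fixed fraction of the global 
budget allows the same number of active datasets whatever the budget, which 
would be consistent with the increase from 3GB to 5GB not helping while 
shrinking the per-dataset component did.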



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
