I am able to reproduce the issue. The thread dump is pasted here:
http://pastebin.com/KKvLVVg9
It looks like the main thread is busy in loadPartitionToMemory:
"main" #1 prio=5 os_prio=0 tid=0x00007f33f400d800 nid=0x14 runnable
[0x00007f33fc27a000]
java.lang.Thread.State: RUNNABLE
at java.io.FileInputStream.read(Native Method)
at
org.apache.asterix.transaction.management.service.recovery.RecoveryManager$JobEntityCommits.loadPartitionToMemory(RecoveryManager.java:1033)
at
org.apache.asterix.transaction.management.service.recovery.RecoveryManager$JobEntityCommits.serachPartition(RecoveryManager.java:983)
at
org.apache.asterix.transaction.management.service.recovery.RecoveryManager$JobEntityCommits.containsEntityCommitForTxnId(RecoveryManager.java:947)
at
org.apache.asterix.transaction.management.service.recovery.RecoveryManager.startRecoveryRedoPhase(RecoveryManager.java:334)
- locked <0x00000006c03dc1c0> (a
org.apache.asterix.transaction.management.service.recovery.RecoveryManager)
at
org.apache.asterix.transaction.management.service.recovery.RecoveryManager.replayPartitionsLogs(RecoveryManager.java:203)
- locked <0x00000006c03dc1c0> (a
org.apache.asterix.transaction.management.service.recovery.RecoveryManager)
at
org.apache.asterix.transaction.management.service.recovery.RecoveryManager.startRecovery(RecoveryManager.java:195)
at
org.apache.asterix.hyracks.bootstrap.NCApplicationEntryPoint.start(NCApplicationEntryPoint.java:142)
at
org.apache.hyracks.control.nc.NodeControllerService.startApplication(NodeControllerService.java:324)
at
org.apache.hyracks.control.nc.NodeControllerService.start(NodeControllerService.java:259)
at org.apache.hyracks.control.nc.NCDriver.main(NCDriver.java:48)
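
For future reference, a dump like the one above can also be captured in-process with the standard ThreadMXBean API instead of running jstack; a minimal sketch:

    // Minimal sketch: capture a thread dump from inside the JVM using the
    // standard java.lang.management API (an alternative to running jstack).
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class ThreadDumper {
        public static void main(String[] args) {
            ThreadMXBean bean = ManagementFactory.getThreadMXBean();
            // Ask for lock/synchronizer details so held monitors (like the
            // RecoveryManager lock above) show up in the output.
            for (ThreadInfo info : bean.dumpAllThreads(true, true)) {
                // Note: ThreadInfo.toString() caps the printed stack depth.
                System.out.print(info);
            }
        }
    }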
> On May 17, 2016, at 12:37 AM, Jianfeng Jia <[email protected]> wrote:
>
> I’m trying to reproduce the issue and will capture a thread dump then.
>
>> On May 16, 2016, at 10:57 PM, Michael Blow <[email protected]> wrote:
>>
>> It would be good to get thread dumps if this happens again.
>> On Mon, May 16, 2016 at 10:56 PM Jianfeng Jia <[email protected]> wrote:
>>
>>> I revisited the logs, and luckily they haven’t been cleared. Here is part
>>> of nc1’s log:
>>>
>>> May 15, 2016 1:04:10 PM org.apache.hyracks.storage.common.buffercache.BufferCache openFile
>>> INFO: Opening file: 14 in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>>> May 15, 2016 1:04:10 PM org.apache.hyracks.storage.common.buffercache.BufferCache openFile
>>> INFO: Opening file: 13 in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>>> May 15, 2016 1:04:10 PM org.apache.hyracks.storage.common.buffercache.BufferCache createFile
>>> INFO: Creating file: /nc1/iodevice1/storage/partition_0/hackathon/log_device_idx_log_device/2016-05-15-12-56-48-712_2016-05-15-12-23-31-225_f in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>>> May 15, 2016 1:04:10 PM org.apache.hyracks.storage.common.buffercache.BufferCache openFile
>>> INFO: Opening file: 15 in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>>> ——————————————————————————————————————
>>> /// I shut down the cluster here and started the server right away.
>>> ——————————————————————————————————————
>>> May 15, 2016 1:43:12 PM org.apache.asterix.transaction.management.service.recovery.RecoveryManager startRecoveryRedoPhase
>>> INFO: Logs REDO phase completed. Redo logs count: 1197
>>> May 15, 2016 1:43:12 PM org.apache.hyracks.storage.am.lsm.common.impls.LSMHarness flush
>>> INFO: Started a flush operation for index: LSMBTree [/nc1/iodevice1/storage/partition_0/Metadata/Dataset_idx_Dataset/] ...
>>> May 15, 2016 1:43:12 PM org.apache.hyracks.storage.common.buffercache.BufferCache createFile
>>> INFO: Creating file: /nc1/iodevice1/storage/partition_0/Metadata/Dataset_idx_Dataset/2016-05-15-13-43-12-680_2016-05-15-13-43-12-680_f in cache: org.apache.hyracks.storage.common.buffercache.BufferCache@2a7f1f10
>>>
>>> No logs were generated in those 43 minutes. During that time one CPU was
>>> saturated, and I remember that no file was touched or created in the
>>> asterix folder. So maybe it is not a buffer cache problem in the recovery
>>> phase?
>>>
>>>
>>>
>>>
>>>
>>>> On May 16, 2016, at 9:28 PM, Mike Carey <[email protected]> wrote:
>>>>
>>>> Agreed and agreed. But is the spinning happening during recovery?
>>>>
>>>> (What's the role of the buffer cache during recovery?)
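
To that question: in an ARIES-style redo, every logged update is reapplied by pinning its target page through the buffer cache, which is why a cache that spins when full would stall the whole REDO phase. A schematic sketch of one redo step, with hypothetical names, not AsterixDB's actual code:

    // Schematic ARIES-style redo step; the names here are hypothetical, not
    // AsterixDB's actual API. The point: every redo pins its target page
    // through the buffer cache, so a cache that spins when full stalls the
    // entire REDO phase.
    interface Page { long getPageLSN(); void setPageLSN(long lsn); }
    interface BufferCache { Page pin(long pageId); void unpin(Page page); }
    interface LogRecord { long getPageId(); long getLSN(); void redo(Page page); }

    class RedoStep {
        static void redoIfNeeded(LogRecord rec, BufferCache cache) {
            Page page = cache.pin(rec.getPageId()); // may spin if the cache is exhausted
            try {
                if (page.getPageLSN() < rec.getLSN()) { // page predates this update
                    rec.redo(page);                     // reapply the logged change
                    page.setPageLSN(rec.getLSN());
                }
            } finally {
                cache.unpin(page);
            }
        }
    }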
>>>>
>>>>
>>>> On 5/17/16 2:10 AM, Jianfeng Jia wrote:
>>>>> I think the BufferCache is the core issue; the recovery process may just
>>>>> run into the same spin trap where it was stopped.
>>>>> And I created another issue: we should be able to abort the task so that
>>>>> we don’t need to restart the server.
>>>>>
>>>>>> On May 16, 2016, at 7:24 AM, Michael Blow <[email protected]> wrote:
>>>>>>
>>>>>> This might be related: (ASTERIXDB-1438) BufferCache spins indefinitely
>>>>>> when cache is exceeded.
>>>>>>
>>>>>> https://issues.apache.org/jira/browse/ASTERIXDB-1438
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> -MDB
>>>>>>
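
The failure mode in ASTERIXDB-1438 is an unbounded spin while searching for an evictable page; a hedged sketch of the pattern and a bounded variant (PagePool and findVictim are hypothetical stand-ins, not the real BufferCache code):

    // Hedged sketch of the ASTERIXDB-1438 failure mode: an unbounded
    // victim-search loop spins forever once every page in the cache is
    // pinned. PagePool and findVictim() are hypothetical stand-ins, not
    // the real BufferCache code.
    class BoundedVictimSearch {
        private static final int MAX_ATTEMPTS = 1000;

        // A real cache would scan its page map for an unpinned page and
        // return its slot, or -1 if everything is currently pinned.
        interface PagePool { int findVictim(); }

        static int allocate(PagePool pool) {
            for (int attempt = 0; attempt < MAX_ATTEMPTS; attempt++) {
                int slot = pool.findVictim();
                if (slot >= 0) {
                    return slot;    // found an evictable page
                }
                Thread.yield();     // back off instead of hot-spinning
            }
            // Every attempt saw only pinned pages: fail loudly rather than
            // spin forever with one CPU at 100% (the symptom reported here).
            throw new IllegalStateException("Buffer cache exhausted: no evictable page");
        }
    }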
>>>>>> On Mon, May 16, 2016 at 1:52 AM Mike Carey <[email protected]> wrote:
>>>>>>
>>>>>>> Glad it worked out - can someone also capture the core issue in JIRA? Thx!
>>>>>>> On May 15, 2016 11:40 PM, "Jianfeng Jia" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Great! The server is back now. Thanks a lot!
>>>>>>>>> On May 15, 2016, at 2:26 PM, Murtadha Hubail <[email protected]> wrote:
>>>>>>>>> You can delete the existing log files and create new empty ones with
>>>>>>>>> an incremented log file number, but it is very important that you
>>>>>>>>> don’t delete the checkpoint file.
>>>>>>>>> Of course any data in the old log files will be lost, but the data
>>>>>>>>> already on disk will be available.
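
For anyone hitting this later, a minimal sketch of that workaround; the "transaction_log_" prefix and numeric suffix are assumptions about the log file naming here, not necessarily AsterixDB's actual layout:

    // Hedged sketch of the workaround above, using java.nio.file. The
    // "transaction_log_" prefix and numeric suffix are assumptions about
    // the log file naming, not necessarily AsterixDB's actual layout.
    // The checkpoint file is deliberately left untouched.
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.Stream;

    class TxnLogReset {
        private static final String PREFIX = "transaction_log_"; // assumed naming

        static void reset(Path logDir) throws IOException {
            List<Path> logs;
            try (Stream<Path> files = Files.list(logDir)) {
                // Select only the txn log files; checkpoint files are kept.
                logs = files.filter(p -> p.getFileName().toString().startsWith(PREFIX))
                            .collect(Collectors.toList());
            }
            long maxNum = -1;
            for (Path log : logs) {
                String suffix = log.getFileName().toString().substring(PREFIX.length());
                maxNum = Math.max(maxNum, Long.parseLong(suffix));
                Files.delete(log); // data only in this log is lost, as noted above
            }
            // Create one new, empty log file with an incremented file number.
            Files.createFile(logDir.resolve(PREFIX + (maxNum + 1)));
        }
    }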
>>>>>>>>>> On May 15, 2016, at 1:23 PM, Jianfeng Jia <[email protected]> wrote:
>>>>>>>>>> Hi,
>>>>>>>>>> We submitted a long-running join+insert query and stopped the
>>>>>>>>>> cluster to stop it. However, when it restarted it ran the recovery
>>>>>>>>>> forever, and the logs show that it is opening and creating a lot of
>>>>>>>>>> buffer cache files.
>>>>>>>>>>
>>>>>>>>>> In order to bring the cluster back to answer the query, are there
>>>>>>>>>> any hacking solutions, such as removing the recovery txn logs? I’m
>>>>>>>>>> worried that it will ruin the cluster somehow.
>>>>>>>>>> We are in a contest, so any early help is really appreciated!
>>>>>>>>>> Thanks!
Best,
Jianfeng Jia
PhD Candidate of Computer Science
University of California, Irvine