[ 
https://issues.apache.org/jira/browse/ATLAS-1720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sarath Subramanian updated ATLAS-1720:
--------------------------------------
    Description: 
Some of the integration tests (ITs) in Atlas fail intermittently with the 
exception "Could not execute operation due to backend exception".

Investigation shows that this is caused by a Berkeley DB LockTimeoutException 
(https://github.com/thinkaurelius/titan/issues/1113).

The default lock timeout for Berkeley DB is 500 ms. If a thread (some IT) is 
waiting on a Titan storage resource that is locked by another thread, and that 
thread does not release the lock within 500 ms, the waiting thread fails with 
the above exception (see the error log below).

The fix is to increase storage.lock.wait-time for Berkeley DB to 10000 ms. 
This is consistent with the lock wait timeout specified for HBase.
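
As a rough illustration of the fix, the setting might look like the fragment 
below. This is a sketch under assumptions, not the committed change: the file 
name atlas-application.properties and the atlas.graph. key prefix are not 
stated in this issue and are assumed here. Only storage.lock.wait-time and the 
value 10000 ms come from the description above.

{code}
# Assumed location: atlas-application.properties (not named in this issue)
# Assumed: Atlas passes Titan options through with the atlas.graph. prefix
atlas.graph.storage.backend=berkeleyje
# Raise the lock wait time to 10000 ms (10 s), as proposed above
atlas.graph.storage.lock.wait-time=10000
{code}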

{code}
Caused by: com.sleepycat.je.LockTimeoutException: (JE 5.0.73) Lock expired. Locker 1516581475 7535_NotificationHookConsumer thread-0_Txn: waited for lock on database=edgestore LockAddr:284896285 LSN=0x0/0x21d55f type=WRITE grant=WAIT_PROMOTION timeoutMillis=500 startTime=1491261268442 endTime=1491261268942
Owners: [<LockInfo locker="1445928922 7537_qtp184901207-1038 - e015a355-d6c5-4424-b7a7-833a289aea9d_Txn" type="READ"/>, <LockInfo locker="1516581475 7535_NotificationHookConsumer thread-0_Txn" type="READ"/>]
Waiters: []
Transaction 1445928922 7537_qtp184901207-1038 - e015a355-d6c5-4424-b7a7-833a289aea9d_Txn waits for  LockAddr:471572402 Owners:<LockInfo locker="1516581475 7535_NotificationHookConsumer thread-0_Txn" type="WRITE"/> Waiters:[<LockInfo locker="1445928922 7537_qtp184901207-1038 - e015a355-d6c5-4424-b7a7-833a289aea9d_Txn" type="READ"/>]
Transaction 1516581475 7535_NotificationHookConsumer thread-0_Txn owns LockAddr:471572402 <LockInfo locker="1516581475 7535_NotificationHookConsumer thread-0_Txn" type="WRITE"/>
Transaction 1516581475 7535_NotificationHookConsumer thread-0_Txn waits for LockAddr:284896285
{code}

> Increase titan storage.lock.wait-time for Berkeley DB to fix intermittent IT 
> failures 
> -------------------------------------------------------------------------------------
>
>                 Key: ATLAS-1720
>                 URL: https://issues.apache.org/jira/browse/ATLAS-1720
>             Project: Atlas
>          Issue Type: Bug
>          Components:  atlas-core
>    Affects Versions: trunk, 0.9-incubating
>            Reporter: Sarath Subramanian
>            Assignee: Sarath Subramanian



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
