[
https://issues.apache.org/jira/browse/HBASE-29109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Viraj Jasani resolved HBASE-29109.
----------------------------------
Fix Version/s: 2.5.12
Hadoop Flags: Reviewed
Resolution: Fixed
> Process stage of TakeSnapshotHandler runs Async which is holding the
> EXCLUSIVE Lock of Table for around 9 mins blocking ASSIGN procedures of SCP
> ------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-29109
> URL: https://issues.apache.org/jira/browse/HBASE-29109
> Project: HBase
> Issue Type: Improvement
> Affects Versions: 2.5.8
> Reporter: Prathyusha
> Assignee: Prathyusha
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.5.12
>
>
> {{[SnapshotManager|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/master/snapshot/SnapshotManager.java#L659-L660]
> does *prepare* and *process* as below}}
> {quote}{{{}{color:#6a3e3e}handler{color}{color:#000000}.prepare();{color}{}}}{color:#7f0055}this{color}{color:#000000}.{color}{color:#0000c0}executorService{color}{color:#000000}.submit({color}{color:#6a3e3e}handler{color}{color:#000000});
> (process called here){color}
> {quote}
> For Enabled tables prepare() takes an EXCLUSIVE lock with a timeout
> and right when it starts processing it releases the current EXCLUSIVE lock on
> the table and takes a SHARED one instead
> {quote}{color:#000000} {color}{color:#7f0055}try{color}{color:#000000}
> {{color}
> {color:#000000} {color}{color:#7f0055}if{color}{color:#000000}
> (downgradeToSharedTableLock()) {{color}
> {color:#000000} {color}{color:#3f7f5f}// release the exclusive lock and hold
> the shared lock instead{color}
> {color:#000000}
> {color}{color:#6a3e3e}tableLockToRelease{color}{color:#000000} =
> {color}{color:#0000c0}master{color}{color:#000000}.getLockManager().createMasterLock({color}{color:#0000c0}snapshotTable{color}{color:#000000},{color}
> {color:#000000} LockType.{color}{color:#0000c0}SHARED{color}{color:#000000},
> {color}{color:#7f0055}this{color}{color:#000000}.getClass().getName() +
> {color}{color:#2a00ff}": take snapshot "{color}{color:#000000} +
> {color}{color:#0000c0}snapshot{color}{color:#000000}.getName());{color}
> {color:#000000}
> {color}{color:#0000c0}tableLock{color}{color:#000000}.release();{color}
> {color:#000000} {color}{color:#7f0055}boolean{color}{color:#000000}
> {color}{color:#6a3e3e}isTableLockAcquired{color}{color:#000000} =
> {color}{color:#6a3e3e}tableLockToRelease{color}{color:#000000}.tryAcquire({color}{color:#7f0055}this{color}{color:#000000}.{color}{color:#0000c0}lockAcquireTimeoutMs{color}{color:#000000});{color}
> {color:#000000} {color}{color:#7f0055}if{color}{color:#000000}
> (!{color}{color:#6a3e3e}isTableLockAcquired{color}{color:#000000}) {{color}
> {color:#000000}
> {color}{color:#0000c0}LOG{color}{color:#000000}.error({color}{color:#2a00ff}"Could
> not acquire shared lock on table {} in {} ms"{color}{color:#000000},
> {color}{color:#0000c0}snapshotTable{color}{color:#000000},{color}
> {color:#000000}
> {color}{color:#0000c0}lockAcquireTimeoutMs{color}{color:#000000});{color}
> {color:#000000} {color}{color:#7f0055}throw{color}{color:#000000}
> {color}{color:#7f0055}new{color}{color:#000000}
> IOException({color}{color:#2a00ff}"Could not acquire shared lock on table
> "{color}{color:#000000} +
> {color}{color:#0000c0}snapshotTable{color}{color:#000000});{color}
> {color:#000000} }{color}
> {color:#000000} }{color}
> {quote}
> Since the process stage runs in ExecutorService sometimes there is a delay
> and Snapshot hold the EXCLUSIVE too long, blocking the ASSIGN child
> procedures of SCP
> Below are the logs we saw for the same in our production
> 1.Process stage of Snapshot started
> {quote}2025-01-15 02:15:09,608 INFO
> [ER_SNAPSHOT_OPERATIONS-master/hmaster-3:60000-5360]
> snapshot.TakeSnapshotHandler - Running SKIPFLUSH table snapshot
> <TABLE_NAME>_1736902854385_1736906318965_0 C_M_SNAPSHOT_TABLE on table
> <TABLE_NAME>
> {quote}
> 2.then the EXCLUSIVE lock is released
> {quote}2025-01-15 02:15:09,615 INFO [PEWorker-20]
> procedure2.ProcedureExecutor - Finished pid=26291867, state=SUCCESS;
> org.apache.hadoop.hbase.master.locking.LockProcedure, tableName=<TABLE_NAME>,
> type=EXCLUSIVE in 9 mins, 55.422 sec
> {quote}
> 3.ASSIGN of SCP was able to then get the lock
> {quote}2025-01-15 02:15:09,615 INFO [PEWorker-20]
> procedure.MasterProcedureScheduler - Took xlock for pid=26292238,
> ppid=26291869, state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE;
> TransitRegionStateProcedure table=<TABLE_NAME>,
> region=e9d104da353c65725b33f902c8a1449d, ASSIGN
> {quote}
> 4.Though once lock is acquired ASSIGN was pretty quick, total time of that
> region availability shoot up to 9 mins
> {quote}2025-01-15 02:15:10,384 INFO [PEWorker-24]
> procedure2.ProcedureExecutor - Finished pid=26292238, ppid=26291869,
> state=SUCCESS; TransitRegionStateProcedure table=<TABLE_NAME>,
> region=e9d104da353c65725b33f902c8a1449d, ASSIGN in 9 mins, 1.364 sec
> {quote}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)