[
https://issues.apache.org/jira/browse/HDFS-10830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15458534#comment-15458534
]
Arpit Agarwal commented on HDFS-10830:
--------------------------------------
Thanks for the clarification [~xiaochen].
[~manojg], no one signals the waiter, so we just have to replace the wait with
a release-sleep-reacquire loop. We do have to release the lock before sleeping,
though, to avoid a potential deadlock.
If you haven't started, I can post a simple patch to fix this. It would also
require a change to replace FsVolumeList#checkDirsMutex with a separate
reentrant lock.
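
A minimal sketch of the release-sleep-reacquire idea, for illustration only. It
is not the actual patch: the class name, the waitForRelease helper, the
AtomicInteger reference count and the sleep interval are all assumptions, and a
plain ReentrantLock stands in for the dataset lock. The caller is assumed to
hold the lock exactly once.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical stand-in for the volume-removal wait; not Hadoop code.
class ReleaseSleepReacquire {

  /**
   * Polls until refCount reaches zero or timeoutMs elapses. The caller must
   * hold the lock (once) on entry; the lock is released around every sleep
   * and is held again when the method returns.
   */
  static boolean waitForRelease(ReentrantLock lock, AtomicInteger refCount,
      long timeoutMs, long sleepMs) throws InterruptedException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (refCount.get() > 0) {
      if (System.nanoTime() - deadline >= 0) {
        return false;            // timed out, references still outstanding
      }
      lock.unlock();             // release so reference holders can make progress
      try {
        Thread.sleep(sleepMs);   // nobody signals us, so just sleep and re-check
      } finally {
        lock.lock();             // reacquire before touching shared state again
      }
    }
    return true;
  }

  public static void main(String[] args) throws Exception {
    ReentrantLock datasetLock = new ReentrantLock();
    AtomicInteger refCount = new AtomicInteger(1);

    // A reader that needs the same lock in order to drop its reference.
    Thread reader = new Thread(() -> {
      try {
        Thread.sleep(50);
      } catch (InterruptedException e) {
        return;
      }
      datasetLock.lock();
      try {
        refCount.decrementAndGet();
      } finally {
        datasetLock.unlock();
      }
    });
    reader.start();

    datasetLock.lock();
    try {
      boolean released = waitForRelease(datasetLock, refCount, 5000, 10);
      System.out.println("volume released: " + released);
    } finally {
      datasetLock.unlock();
    }
    reader.join();
  }
}
```

If the sleeping thread kept the lock, the reader in this sketch could not drop
its reference until the timeout expired, which is the deadlock risk mentioned
above.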
> FsDatasetImpl#removeVolumes() crashes with IllegalMonitorStateException when
> vol being removed is in use
> --------------------------------------------------------------------------------------------------------
>
> Key: HDFS-10830
> URL: https://issues.apache.org/jira/browse/HDFS-10830
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: hdfs
> Affects Versions: 3.0.0-alpha1
> Reporter: Manoj Govindassamy
> Assignee: Manoj Govindassamy
>
> The {{FsDatasetImpl#removeVolumes()}} operation crashes abruptly with
> IllegalMonitorStateException whenever the volume being removed is in use
> concurrently.
> It looks like {{removeVolumes()}} waits on the monitor object "this" (that is,
> FsDatasetImpl), which it has never locked, leading to
> IllegalMonitorStateException. This monitor wait happens only when the volume
> being removed is in use (reference count > 0). The thread performing the
> remove-volume operation thus crashes abruptly, and block invalidations for the
> removed volumes are skipped entirely.
> {code:title=FsDatasetImpl.java|borderStyle=solid}
>   @Override
>   public void removeVolumes(Set<File> volumesToRemove, boolean clearFailure) {
>     ..
>     ..
>     try (AutoCloseableLock lock = datasetLock.acquire()) {  // <== LOCK acquire datasetLock
>       for (int idx = 0; idx < dataStorage.getNumStorageDirs(); idx++) {
>         .. .. ..
>         asyncDiskService.removeVolume(sd.getCurrentDir());  // <== volume SD1 remove
>         volumes.removeVolume(absRoot, clearFailure);
>         // <== WAIT on "this" ?? But, we haven't locked it yet. This will cause
>         //     IllegalMonitorStateException and crash getBlockReports()/FBR thread!
>         volumes.waitVolumeRemoved(5000, this);
>         for (String bpid : volumeMap.getBlockPoolList()) {
>           List<ReplicaInfo> blocks = new ArrayList<>();
>           for (Iterator<ReplicaInfo> it = volumeMap.replicas(bpid).iterator();
>               it.hasNext(); ) {
>             .. .. ..
>             it.remove();  // <== volumeMap removal
>           }
>           blkToInvalidate.put(bpid, blocks);
>         }
>       .. ..
>     }  // <== LOCK release datasetLock
>     // Call this outside the lock.
>     for (Map.Entry<String, List<ReplicaInfo>> entry :
>         blkToInvalidate.entrySet()) {
>       ..
>       for (ReplicaInfo block : blocks) {
>         invalidate(bpid, block);  // <== Notify NN of Block removal
>       }
>     }
> {code}
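
The failure mode in the description can be reproduced in isolation. The sketch
below is illustrative only and uses hypothetical names (MonitorWaitDemo and its
helpers); it assumes, as the description suggests, that the dataset lock is a
java.util.concurrent lock rather than the object's intrinsic monitor. Holding
such a lock does not make the thread the owner of the monitor, so
Object.wait() still throws.

```java
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo class; not part of Hadoop.
class MonitorWaitDemo {

  /** Calls wait() while holding only a ReentrantLock, not the monitor. */
  static boolean throwsWhenOnlyExplicitLockHeld(Object monitor)
      throws InterruptedException {
    ReentrantLock explicitLock = new ReentrantLock();
    explicitLock.lock();
    try {
      monitor.wait(10);          // current thread does not own monitor
      return false;
    } catch (IllegalMonitorStateException e) {
      return true;               // this is the crash described in the report
    } finally {
      explicitLock.unlock();
    }
  }

  /** Calls wait() inside synchronized (monitor), which is legal. */
  static boolean throwsWhenMonitorHeld(Object monitor)
      throws InterruptedException {
    synchronized (monitor) {
      try {
        monitor.wait(10);        // times out quietly after about 10 ms
        return false;
      } catch (IllegalMonitorStateException e) {
        return true;
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Object dataset = new Object();
    System.out.println("explicit lock only: throws = "
        + throwsWhenOnlyExplicitLockHeld(dataset));
    System.out.println("synchronized:       throws = "
        + throwsWhenMonitorHeld(dataset));
  }
}
```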