[ https://issues.apache.org/jira/browse/HDFS-10830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15458960#comment-15458960 ]
Arpit Agarwal commented on HDFS-10830:
--------------------------------------

bq. Wouldn't it be better to go for the following ..wait and signal model compared to polling

I completely agree, but that may be a more complex change. Let's fix the immediate problem first and address the signaling improvement later. Sound fair?

I assigned it to myself.

> FsDatasetImpl#removeVolumes() crashes with IllegalMonitorStateException when vol being removed is in use
> --------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-10830
>                 URL: https://issues.apache.org/jira/browse/HDFS-10830
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.0.0-alpha1
>            Reporter: Manoj Govindassamy
>            Assignee: Arpit Agarwal
>         Attachments: HDFS-10830.01.patch
>
>
> {{FsDatasetImpl#removeVolumes()}} operation crashes abruptly with
> IllegalMonitorStateException whenever the volume being removed is in use
> concurrently.
> Looks like {{removeVolumes()}} is waiting on a monitor object "this" (that is
> FsDatasetImpl) which it has never locked, leading to
> IllegalMonitorStateException. This monitor wait happens only when the volume
> being removed is in use (reference count > 0). The thread performing this
> remove volume operation thus crashes abruptly and block invalidations for the
> removed volumes are totally skipped.
> {code:title=FsDatasetImpl.java|borderStyle=solid}
>   @Override
>   public void removeVolumes(Set<File> volumesToRemove, boolean clearFailure) {
>     ..
>     ..
>     try (AutoCloseableLock lock = datasetLock.acquire()) {   <== LOCK acquire datasetLock
>       for (int idx = 0; idx < dataStorage.getNumStorageDirs(); idx++) {
>         .. .. ..
>         asyncDiskService.removeVolume(sd.getCurrentDir());   <== volume SD1 remove
>         volumes.removeVolume(absRoot, clearFailure);
>         volumes.waitVolumeRemoved(5000, this);               <== WAIT on "this" ?? But, we haven't locked it yet.
>                                                                  This will cause IllegalMonitorStateException
>                                                                  and crash getBlockReports()/FBR thread!
>         for (String bpid : volumeMap.getBlockPoolList()) {
>           List<ReplicaInfo> blocks = new ArrayList<>();
>           for (Iterator<ReplicaInfo> it = volumeMap.replicas(bpid).iterator(); it.hasNext(); ) {
>             .. .. ..
>             it.remove();                                     <== volumeMap removal
>           }
>           blkToInvalidate.put(bpid, blocks);
>         }
>         .. ..
>     }                                                        <== LOCK release datasetLock
>     // Call this outside the lock.
>     for (Map.Entry<String, List<ReplicaInfo>> entry : blkToInvalidate.entrySet()) {
>       ..
>       for (ReplicaInfo block : blocks) {
>         invalidate(bpid, block);                             <== Notify NN of Block removal
>       }
>     }
> {code}
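
The failure mode described above can be reproduced outside HDFS. The following is only a minimal sketch, not the code from HDFS-10830.01.patch: the class {{MonitorWaitDemo}} and its methods are hypothetical, and a plain {{ReentrantLock}} stands in for {{datasetLock}} on the assumption that holding such a lock does not confer ownership of an object's intrinsic monitor. It shows that {{Object.wait()}} throws IllegalMonitorStateException unless the caller is inside a {{synchronized}} block on the same object.

```java
import java.util.concurrent.locks.ReentrantLock;

public class MonitorWaitDemo {

    /**
     * Mirrors the failing pattern: hold an unrelated explicit lock, then wait
     * on an object whose intrinsic monitor we do not own.
     * Returns true if IllegalMonitorStateException was thrown.
     */
    static boolean waitWithoutMonitor(Object monitor, long timeoutMs) {
        ReentrantLock otherLock = new ReentrantLock(); // stand-in for datasetLock (assumption)
        otherLock.lock();
        try {
            // Holding otherLock does not make this thread the owner of
            // monitor's intrinsic lock, so wait() is illegal here.
            monitor.wait(timeoutMs);
            return false;
        } catch (IllegalMonitorStateException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        } finally {
            otherLock.unlock();
        }
    }

    /**
     * Legal variant: own the monitor via synchronized before waiting.
     * Returns true if IllegalMonitorStateException was thrown (it should not be).
     */
    static boolean waitWithMonitor(Object monitor, long timeoutMs) {
        try {
            synchronized (monitor) {
                // Returns after the timeout, a notify, or a spurious wakeup.
                monitor.wait(timeoutMs);
            }
            return false;
        } catch (IllegalMonitorStateException e) {
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        Object monitor = new Object();
        // prints "without monitor: exception thrown = true"
        System.out.println("without monitor: exception thrown = "
            + waitWithoutMonitor(monitor, 10));
        // prints "with monitor: exception thrown = false"
        System.out.println("with monitor: exception thrown = "
            + waitWithMonitor(monitor, 10));
    }
}
```

Wrapping the wait in {{synchronized}} is just the smallest change that makes the call legal in this sketch. The wait-and-signal model mentioned in the comment would more likely pair the explicit lock with its own condition object instead of an intrinsic monitor, which is the larger change being deferred.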