[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool
[ https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641693#comment-17641693 ]

ASF GitHub Bot commented on HDFS-16855:
---

dingshun3016 commented on PR #5170:
URL: https://github.com/apache/hadoop/pull/5170#issuecomment-1333269882

> @dingshun3016 This seems to happen only when addBlockPool() is invoked and CachingGetSpaceUsed#used < 0, so why not handle it by, for example, forbidding refresh() during the first ReplicaCachingGetSpaceUsed#init()?

@MingXiangLi Thanks for the reply. Forbidding refresh() during the first ReplicaCachingGetSpaceUsed#init() would leave the value of dfsUsage at 0 until the next refresh(). If we remove the BLOCK_POOl-level write lock in the org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool(String bpid, Configuration conf) method, what would the impact be? Do you have any other suggestions?

> Remove the redundant write lock in addBlockPool
> -----------------------------------------------
>
>                 Key: HDFS-16855
>                 URL: https://issues.apache.org/jira/browse/HDFS-16855
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: dingshun
>            Priority: Major
>              Labels: pull-request-available
>
> While patching the datanode's fine-grained lock, we found that the datanode could not start; a deadlock may occur in addBlockPool, so we can remove the redundant write lock.
> {code:xml}
> <!-- getspaceused classname -->
> <property>
>   <name>fs.getspaceused.classname</name>
>   <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed</value>
> </property>
> {code}
> {code:java}
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool
> // takes the write lock
> @Override
> public void addBlockPool(String bpid, Configuration conf)
>     throws IOException {
>   LOG.info("Adding block pool " + bpid);
>   AddBlockPoolException volumeExceptions = new AddBlockPoolException();
>   try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl,
>       bpid)) {
>     try {
>       volumes.addBlockPool(bpid, conf);
>     } catch (AddBlockPoolException e) {
>       volumeExceptions.mergeException(e);
>     }
>     volumeMap.initBlockPool(bpid);
>     Set<String> vols = storageMap.keySet();
>     for (String v : vols) {
>       lockManager.addLock(LockLevel.VOLUME, bpid, v);
>     }
>   }
> }
> {code}
> {code:java}
> // org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica
> // needs the read lock
> void replicas(String bpid, Consumer<Iterator<ReplicaInfo>> consumer) {
>   LightWeightResizableGSet<Block, ReplicaInfo> m = null;
>   try (AutoCloseDataSetLock l = lockManager.readLock(LockLevel.BLOCK_POOl,
>       bpid)) {
>     m = map.get(bpid);
>     if (m != null) {
>       m.getIterator(consumer);
>     }
>   }
> }
> {code}
>
> Because the read lock is requested from a different thread (the per-volume worker spawned below), the write lock cannot be downgraded to a read lock:
> {code:java}
> void addBlockPool(final String bpid, final Configuration conf)
>     throws IOException {
>   long totalStartTime = Time.monotonicNow();
>   final Map<FsVolumeSpi, IOException> unhealthyDataDirs =
>       new ConcurrentHashMap<FsVolumeSpi, IOException>();
>   List<Thread> blockPoolAddingThreads = new ArrayList<Thread>();
>   for (final FsVolumeImpl v : volumes) {
>     Thread t = new Thread() {
>       public void run() {
>         try (FsVolumeReference ref = v.obtainReference()) {
>           FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>               " on volume " + v + "...");
>           long startTime = Time.monotonicNow();
>           v.addBlockPool(bpid, conf);
>           long timeTaken = Time.monotonicNow() - startTime;
>           FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>               " on " + v + ": " + timeTaken + "ms");
>         } catch (IOException ioe) {
>           FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>               ". Will throw later.", ioe);
>           unhealthyDataDirs.put(v, ioe);
>         }
>       }
>     };
>     blockPoolAddingThreads.add(t);
>     t.start();
>   }
>   for (Thread t : blockPoolAddingThreads) {
>     try {
>       t.join();
>     } catch (InterruptedException ie) {
>       throw new IOException(ie);
>     }
>   }
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
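The thread-affinity problem described in the report (the BLOCK_POOl write lock taken by addBlockPool cannot be downgraded by the per-volume worker threads that need the read lock) can be reproduced with plain java.util.concurrent primitives. This is a minimal self-contained illustration, not Hadoop's AutoCloseableLock/DataSetLockManager code; a bounded tryLock stands in for the indefinite blocking that hangs the datanode:

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LockDowngradeDemo {

    // The thread that owns the write lock may also take the read lock:
    // the classic write-to-read downgrade.
    static boolean sameThreadCanRead() {
        ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
        rw.writeLock().lock();
        try {
            boolean got = rw.readLock().tryLock();
            if (got) {
                rw.readLock().unlock();
            }
            return got;
        } finally {
            rw.writeLock().unlock();
        }
    }

    // A *different* thread cannot take the read lock while the write lock is
    // held. This mirrors addBlockPool() holding the block-pool write lock
    // while its spawned worker threads end up needing the read lock: the
    // workers block, and the parent blocks in join() waiting for them.
    static boolean otherThreadCanRead() {
        ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
        rw.writeLock().lock();
        try {
            final boolean[] got = {false};
            Thread reader = new Thread(() -> {
                try {
                    // Bounded wait so this demo terminates; the real code
                    // parks indefinitely at this point.
                    got[0] = rw.readLock().tryLock(200, TimeUnit.MILLISECONDS);
                    if (got[0]) {
                        rw.readLock().unlock();
                    }
                } catch (InterruptedException ignored) {
                }
            });
            reader.start();
            try {
                reader.join();
            } catch (InterruptedException ignored) {
            }
            return got[0];
        } finally {
            rw.writeLock().unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("owner thread can downgrade: " + sameThreadCanRead());
        System.out.println("other thread can read:      " + otherThreadCanRead());
    }
}
```

ReentrantReadWriteLock only supports downgrading within the owning thread, so a reader on any other thread parks until the writer releases; in addBlockPool the writer is itself join()ing those readers, which closes the deadlock cycle.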
[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool
[ https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641690#comment-17641690 ]

ASF GitHub Bot commented on HDFS-16855:
---

dingshun3016 opened a new pull request, #5170:
URL: https://github.com/apache/hadoop/pull/5170

While patching the datanode's fine-grained lock, we found that the datanode could not start; a deadlock may occur in addBlockPool, so we can remove the redundant write lock.

org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool takes the write lock.
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica needs the read lock.

Because the read lock is requested from a different thread, the write lock cannot be downgraded to a read lock.

> Remove the redundant write lock in addBlockPool
> -----------------------------------------------
>
>                 Key: HDFS-16855
>                 URL: https://issues.apache.org/jira/browse/HDFS-16855
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: dingshun
>            Priority: Major
>              Labels: pull-request-available

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool
[ https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641688#comment-17641688 ]

ASF GitHub Bot commented on HDFS-16855:
---

dingshun3016 commented on PR #5170:
URL: https://github.com/apache/hadoop/pull/5170#issuecomment-1333268646

> @dingshun3016 This seems to happen only when addBlockPool() is invoked and CachingGetSpaceUsed#used < 0, so why not handle it by, for example, forbidding refresh() during the first ReplicaCachingGetSpaceUsed#init()?

> Remove the redundant write lock in addBlockPool
> -----------------------------------------------
>
>                 Key: HDFS-16855
>                 URL: https://issues.apache.org/jira/browse/HDFS-16855
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: dingshun
>            Priority: Major
>              Labels: pull-request-available

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool
[ https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641689#comment-17641689 ]

ASF GitHub Bot commented on HDFS-16855:
---

dingshun3016 closed pull request #5170: HDFS-16855. Remove the redundant write lock in addBlockPool.
URL: https://github.com/apache/hadoop/pull/5170

> Remove the redundant write lock in addBlockPool
> -----------------------------------------------
>
>                 Key: HDFS-16855
>                 URL: https://issues.apache.org/jira/browse/HDFS-16855
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: dingshun
>            Priority: Major
>              Labels: pull-request-available

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16633) Reserved Space For Replicas is not released on some cases
[ https://issues.apache.org/jira/browse/HDFS-16633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Akira Ajisaka updated HDFS-16633:
---------------------------------
    Fix Version/s: 3.3.9

Backported to branch-3.3.

> Reserved Space For Replicas is not released on some cases
> ---------------------------------------------------------
>
>                 Key: HDFS-16633
>                 URL: https://issues.apache.org/jira/browse/HDFS-16633
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>    Affects Versions: 3.1.2
>            Reporter: Prabhu Joseph
>            Assignee: Ashutosh Gupta
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.3.9
>
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> We have found that the reserved space for replicas is not released in some cases in a Cx production cluster. There are fixes such as HDFS-9530 and HDFS-8072, but the issue is still not completely resolved. We have tried to debug the root cause, but this will take a lot of time since it is a Cx production cluster.
> However, there is an easier way to fix the issue completely: release any remaining reserved space of the replica from the volume. DataXceiver#writeBlock's finally block will call BlockReceiver#close, which will check whether the ReplicaInfo has any remaining reserved space and, if so, release it from the volume.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
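The fix described in the issue, releasing any leftover reservation back to the volume on close, can be sketched with toy classes. `Volume` and `Replica` below are illustrative stand-ins, not the actual FsVolumeImpl/ReplicaInPipeline/BlockReceiver types:

```java
public class ReservedSpaceRelease {

    // Toy stand-in for a datanode volume tracking space reserved for
    // in-flight replicas (illustrative only, not Hadoop's FsVolumeImpl).
    static class Volume {
        long reservedForReplicas;

        void releaseReservedSpace(long bytes) {
            reservedForReplicas -= bytes;
        }
    }

    // Toy stand-in for a replica being written; reserves space on creation.
    static class Replica {
        private final Volume volume;
        private long bytesReserved;

        Replica(Volume volume, long bytesReserved) {
            this.volume = volume;
            this.bytesReserved = bytesReserved;
            volume.reservedForReplicas += bytesReserved;
        }

        long getReservedBytes() {
            return bytesReserved;
        }

        // The idea of the fix: on close, release whatever reservation is
        // still outstanding, regardless of how the write pipeline ended.
        // Zeroing bytesReserved makes a second close() a no-op.
        void close() {
            if (bytesReserved > 0) {
                volume.releaseReservedSpace(bytesReserved);
                bytesReserved = 0;
            }
        }
    }

    public static void main(String[] args) {
        Volume v = new Volume();
        Replica r = new Replica(v, 128 * 1024 * 1024L);
        // Simulate an aborted write that left its reservation behind,
        // then close: the leftover reservation is returned to the volume.
        r.close();
        System.out.println("reserved after close: " + v.reservedForReplicas); // 0
    }
}
```

The key property is that close() is idempotent and unconditional, so the reservation is returned even on the failure paths that the earlier fixes missed.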
[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool
[ https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641675#comment-17641675 ]

ASF GitHub Bot commented on HDFS-16855:
---

MingXiangLi commented on PR #5170:
URL: https://github.com/apache/hadoop/pull/5170#issuecomment-1333200919

@dingshun3016 This seems to happen only when addBlockPool() is invoked and CachingGetSpaceUsed#used < 0, so why not handle it by, for example, forbidding refresh() during the first CachingGetSpaceUsed#init()?

> Remove the redundant write lock in addBlockPool
> -----------------------------------------------
>
>                 Key: HDFS-16855
>                 URL: https://issues.apache.org/jira/browse/HDFS-16855
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: dingshun
>            Priority: Major
>              Labels: pull-request-available

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-7343) HDFS smart storage management
[ https://issues.apache.org/jira/browse/HDFS-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641615#comment-17641615 ]

Brahma Reddy Battula commented on HDFS-7343:
---

{quote}Hi Brahma, currently we have no plan to merge this feature to upstream. We have a repo to maintain this project. See [https://github.com/Intel-bigdata/SSM]
{quote}
Ok, thanks. [~zhouwei] / [~PhiloHe]
i) Are any features pending? Is it production-ready?
ii) Are Kafka and ZK required to deploy this?
iii) Is there any chance of moving this to Apache as a subproject or through incubation?

> HDFS smart storage management
> -----------------------------
>
>                 Key: HDFS-7343
>                 URL: https://issues.apache.org/jira/browse/HDFS-7343
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Kai Zheng
>            Assignee: Wei Zhou
>            Priority: Major
>         Attachments: HDFS-Smart-Storage-Management-update.pdf, HDFS-Smart-Storage-Management.pdf, HDFSSmartStorageManagement-General-20170315.pdf, HDFSSmartStorageManagement-Phase1-20170315.pdf, access_count_tables.jpg, move.jpg, tables_in_ssm.xlsx
>
> As discussed in HDFS-7285, it would be better to have a comprehensive and flexible storage policy engine considering file attributes, metadata, data temperature, storage type, EC codec, available hardware capabilities, user/application preference, etc.
> Modified the title for re-purpose.
> We'd extend this effort a bit and aim to work on a comprehensive solution to provide a smart storage management service, for convenient, intelligent, and effective utilization of erasure coding or replicas, the HDFS cache facility, HSM offerings, and all kinds of tools (balancer, mover, disk balancer and so on) in a large cluster.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16837) [RBF SBN] ClientGSIContext should merge RouterFederatedStates to get the max state id for each namespace
[ https://issues.apache.org/jira/browse/HDFS-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641597#comment-17641597 ]

ASF GitHub Bot commented on HDFS-16837:
---

simbadzina commented on PR #5123:
URL: https://github.com/apache/hadoop/pull/5123#issuecomment-1332948822

@ZanderXu I committed some changes to the tests which are causing merge conflicts with your PR. There are just two conflicts though.

> [RBF SBN] ClientGSIContext should merge RouterFederatedStates to get the max state id for each namespace
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16837
>                 URL: https://issues.apache.org/jira/browse/HDFS-16837
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>              Labels: pull-request-available
>
> ClientGSIContext should merge the local and remote RouterFederatedState to get the max state id for each namespace.
> The related code is as follows:
> {code:java}
> @Override
> public synchronized void receiveResponseState(RpcResponseHeaderProto header) {
>   if (header.hasRouterFederatedState()) {
>     // BUG here
>     routerFederatedState = header.getRouterFederatedState();
>   } else {
>     lastSeenStateId.accumulate(header.getStateId());
>   }
> } {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
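The bug flagged in the snippet is that the latest response overwrites the whole federated-state map instead of being merged with the local state. The intended merge semantics can be sketched with the namespace-to-state-id map simplified to a plain `Map<String, Long>` (the real code operates on a protobuf RouterFederatedState blob, so this is an illustration of the semantics, not the actual patch):

```java
import java.util.HashMap;
import java.util.Map;

public class FederatedStateMerge {

    // Merge local and remote namespace -> stateId maps, keeping the max
    // state id per namespace. Namespaces known only to one side survive;
    // the whole-map overwrite in the buggy code would silently drop the
    // local-only entries and could move state ids backwards.
    static Map<String, Long> merge(Map<String, Long> local,
                                   Map<String, Long> remote) {
        Map<String, Long> merged = new HashMap<>(local);
        for (Map.Entry<String, Long> e : remote.entrySet()) {
            merged.merge(e.getKey(), e.getValue(), Math::max);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Long> local = new HashMap<>();
        local.put("ns0", 120L);
        local.put("ns1", 300L);

        Map<String, Long> remote = new HashMap<>();
        remote.put("ns0", 150L); // newer than the local view
        remote.put("ns2", 10L);  // namespace only this response knows about

        // ns1 would be lost entirely by a whole-map overwrite.
        System.out.println(merge(local, remote));
    }
}
```

Keeping the per-namespace maximum matters for standby reads: a state id that moves backwards can route a client to an observer that has not yet caught up to edits the client already saw.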
[jira] [Commented] (HDFS-16809) EC striped block is not sufficient when doing in maintenance
[ https://issues.apache.org/jira/browse/HDFS-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641575#comment-17641575 ]

ASF GitHub Bot commented on HDFS-16809:
---

hadoop-yetus commented on PR #5050:
URL: https://github.com/apache/hadoop/pull/5050#issuecomment-1332827003

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:--------|:-------:|:-------:|
| +0 :ok: | reexec | 1m 9s | | Docker mode activated. |
| _ Prechecks _ | | | | |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
| _ trunk Compile Tests _ | | | | |
| +1 :green_heart: | mvninstall | 41m 48s | | trunk passed |
| +1 :green_heart: | compile | 1m 39s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | compile | 1m 30s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 17s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 40s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 17s | | trunk passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 34s | | trunk passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 44s | | trunk passed |
| +1 :green_heart: | shadedclient | 26m 20s | | branch has no errors when building and testing our client artifacts. |
| _ Patch Compile Tests _ | | | | |
| +1 :green_heart: | mvninstall | 1m 24s | | the patch passed |
| +1 :green_heart: | compile | 1m 29s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javac | 1m 29s | | the patch passed |
| +1 :green_heart: | compile | 1m 20s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 20s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 58s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5050/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 81 unchanged - 1 fixed = 83 total (was 82) |
| +1 :green_heart: | mvnsite | 1m 27s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 58s | | the patch passed with JDK Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 24s | | the patch passed with JDK Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 31s | | the patch passed |
| +1 :green_heart: | shadedclient | 29m 27s | | patch has no errors when building and testing our client artifacts. |
| _ Other Tests _ | | | | |
| -1 :x: | unit | 461m 15s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5050/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 55s | | The patch does not generate ASF License warnings. |
| | | 584m 11s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
| | hadoop.hdfs.server.namenode.ha.TestObserverNode |
| | hadoop.hdfs.TestWriteConfigurationToDFS |
| | hadoop.hdfs.TestLeaseRecovery2 |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5050/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5050 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux f36a8911e6e0 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 7f6b3165fbcd8621650a6da458b9bd5374918ec1 |
| Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjd
[jira] [Commented] (HDFS-16839) It should consider EC reconstruction work when we determine if a node is busy
[ https://issues.apache.org/jira/browse/HDFS-16839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641424#comment-17641424 ]

ASF GitHub Bot commented on HDFS-16839:
---

Kidd53685368 commented on PR #5128:
URL: https://github.com/apache/hadoop/pull/5128#issuecomment-1332401791

Thanks for the reviews!

> It should consider EC reconstruction work when we determine if a node is busy
> -----------------------------------------------------------------------------
>
>                 Key: HDFS-16839
>                 URL: https://issues.apache.org/jira/browse/HDFS-16839
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Kidd5368
>            Assignee: Kidd5368
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.3.9
>
> In chooseSourceDatanodes(), I think it's more reasonable to take EC reconstruction work into consideration when we determine whether a node is busy or not.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
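The improvement can be sketched as a predicate. The method name, parameters, and threshold handling below are illustrative, not the actual chooseSourceDatanodes() code:

```java
public class BusyNodeCheck {

    // Hypothetical sketch of the idea above: count EC reconstruction work
    // alongside plain replication work when deciding whether a datanode is
    // too busy to be picked as a reconstruction source.
    static boolean isBusy(int replicationWork, int ecReconstructionWork,
                          int maxReplicationStreams) {
        // Before the change, only replicationWork was compared against the
        // limit, so the EC reconstruction tasks a node was already carrying
        // were invisible to the busy check.
        return replicationWork + ecReconstructionWork >= maxReplicationStreams;
    }

    public static void main(String[] args) {
        int maxStreams = 2;
        System.out.println(isBusy(1, 0, maxStreams)); // false: under the limit
        System.out.println(isBusy(1, 1, maxStreams)); // true: EC work pushes it over
    }
}
```

The point of the change is load accuracy: a node saturated with EC reconstruction tasks previously still looked idle to the replication scheduler and kept receiving more work.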
[jira] [Resolved] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Erik Krogen resolved HDFS-16550.
--------------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed

> [SBN read] Improper cache-size for journal node may cause cluster crash
> -----------------------------------------------------------------------
>
>                 Key: HDFS-16550
>                 URL: https://issues.apache.org/jira/browse/HDFS-16550
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Tao Li
>            Assignee: Tao Li
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
>         Attachments: image-2022-04-21-09-54-29-751.png, image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> When we introduced {*}SBN Read{*}, we encountered a problem while upgrading the JournalNodes.
> Cluster info:
> *Active: nn0*
> *Standby: nn1*
> 1. Rolling restart of the journal nodes. {color:#ff}(related config: dfs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx1G){color}
> 2. The cluster ran for a while; edits cache usage kept increasing until memory was used up.
> 3. The {color:#ff}active namenode (nn0){color} shut down because of "{_}Timed out waiting 12ms for a quorum of nodes to respond{_}".
> 4. nn1 was transitioned to the active state.
> 5. The {color:#ff}new active namenode (nn1){color} also shut down because of "{_}Timed out waiting 12ms for a quorum of nodes to respond{_}".
> 6. {color:#ff}The cluster crashed{color}.
>
> Related code:
> {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>       DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
>     Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
>         "maximum JVM memory is only %d bytes. It is recommended that you " +
>         "decrease the cache size or increase the heap size.",
>         capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>       "of bytes: " + capacity);
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> } {code}
> Currently, dfs.journalnode.edit-cache-size.bytes can be set to a larger size than the memory requested by the process. If {*}dfs.journalnode.edit-cache-size.bytes > 0.9 * Runtime.getRuntime().maxMemory(){*}, only warn logs are printed during journalnode startup. This can easily be overlooked by users. However, after the cluster runs for a certain period of time, it is likely to cause the cluster to crash.
>
> NN log:
> !image-2022-04-21-09-54-57-111.png|width=1012,height=47!
> !image-2022-04-21-12-32-56-170.png|width=809,height=218!
> IMO, we should not set the {{cache size}} to a fixed value, but to a ratio of the maximum memory, which is 0.2 by default.
> This avoids the problem of a too-large cache size. In addition, users can actively adjust the heap size when they need to increase the cache size.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
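The ratio-based sizing proposed in the issue can be sketched as follows; the class, method, and constant names are illustrative, not the actual JournaledEditsCache change:

```java
public class EditCacheSizing {

    // Illustrative default matching the 0.2 ratio proposed in the issue.
    static final double DEFAULT_FRACTION = 0.2;

    // Derive the journaled-edits cache capacity from a fraction of the JVM
    // max heap instead of an absolute byte count, so the cache can never be
    // configured larger than the heap it lives in.
    static long capacityFor(long maxHeapBytes, double fraction) {
        if (fraction <= 0.0 || fraction >= 1.0) {
            throw new IllegalArgumentException(
                "cache fraction must be in (0, 1), got " + fraction);
        }
        return (long) (maxHeapBytes * fraction);
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        long capacity = capacityFor(maxHeap, DEFAULT_FRACTION);
        System.out.println("max heap: " + maxHeap
            + " bytes, cache capacity: " + capacity + " bytes");
    }
}
```

Rejecting fractions at or above 1.0 turns the silent misconfiguration described above (cache capacity larger than the heap, warned about but allowed) into a hard startup failure.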
[jira] [Commented] (HDFS-16853) The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed because of HADOOP-18324
[ https://issues.apache.org/jira/browse/HDFS-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641406#comment-17641406 ] ASF GitHub Bot commented on HDFS-16853: --- xkrogen commented on PR #5162: URL: https://github.com/apache/hadoop/pull/5162#issuecomment-1332369461 cc @omalley > The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed > because of HADOOP-18324 > --- > > Key: HDFS-16853 > URL: https://issues.apache.org/jira/browse/HDFS-16853 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: ZanderXu >Assignee: ZanderXu >Priority: Major > Labels: pull-request-available > > The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed > with the error message "Waiting for cluster to become active", and the blocking > jstack is as follows: > {code:java} > "BP-1618793397-192.168.3.4-1669198559828 heartbeating to > localhost/127.0.0.1:54673" #260 daemon prio=5 os_prio=31 tid=0x7fc1108fa000 nid=0x19303 waiting on condition [0x700017884000] > java.lang.Thread.State: WAITING (parking) > at sun.misc.Unsafe.park(Native Method) > - parking to wait for <0x0007430a9ec0> (a > java.util.concurrent.SynchronousQueue$TransferQueue) > at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) > at > java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(SynchronousQueue.java:762) > at > java.util.concurrent.SynchronousQueue$TransferQueue.transfer(SynchronousQueue.java:695) > at > java.util.concurrent.SynchronousQueue.put(SynchronousQueue.java:877) > at > org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1186) > at org.apache.hadoop.ipc.Client.call(Client.java:1482) > at org.apache.hadoop.ipc.Client.call(Client.java:1429) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258) > at > org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139) > at com.sun.proxy.$Proxy23.sendHeartbeat(Unknown Source) > at > 
org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClientSideTranslatorPB.java:168) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:570) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:714) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:915) > at java.lang.Thread.run(Thread.java:748) {code} > After looking into the code, we found that this bug was introduced by > HADOOP-18324: the RpcRequestSender exited without cleaning up the > rpcRequestQueue, which left the BPServiceActor blocked while sending its request.
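The failure mode in the jstack above can be reproduced in isolation: a SynchronousQueue has no capacity, so put() parks until a consumer takes the element; if the consumer thread exits without draining the queue, the producer waits forever. This is a hedged standalone sketch, not the Hadoop IPC code itself; a timed offer() stands in for the indefinite put() so the demo terminates:

```java
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.TimeUnit;

// Sketch of the hang described above: a SynchronousQueue has no capacity,
// so put() blocks until a consumer takes the element. If the consumer
// thread (the RpcRequestSender in the HADOOP-18324 change) exits without
// draining the queue, the producer parks forever -- the WAITING frame in
// the jstack. A timed offer() is used here so the demo terminates.
public class SyncQueueHangDemo {
  public static void main(String[] args) throws InterruptedException {
    SynchronousQueue<String> rpcRequestQueue = new SynchronousQueue<>();
    // No consumer thread is running, so the handoff can never happen:
    boolean delivered =
        rpcRequestQueue.offer("heartbeat", 200, TimeUnit.MILLISECONDS);
    System.out.println("delivered=" + delivered);
  }
}
```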
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641405#comment-17641405 ] ASF GitHub Bot commented on HDFS-16550: --- xkrogen merged PR #4209: URL: https://github.com/apache/hadoop/pull/4209
[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash
[ https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641403#comment-17641403 ] ASF GitHub Bot commented on HDFS-16550: --- xkrogen commented on PR #4209: URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1332367011 The `TestLeaseRecovery2` failure is tracked in HDFS-16853. LGTM. Merging to trunk. Thanks for the contribution @tomscut !
[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool
[ https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641299#comment-17641299 ] dingshun commented on HDFS-16855: - [~hexiaoqiao] I've looked through the logic on the latest trunk branch but can't seem to find the fix. If you later find the PR that fixes it, please post it here. Thanks. In addition, I would like to ask: if I remove the BLOCK_POOl-level write lock in the org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool(String bpid, Configuration conf) method, what would the impact be? > Remove the redundant write lock in addBlockPool > --- > > Key: HDFS-16855 > URL: https://issues.apache.org/jira/browse/HDFS-16855 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: dingshun >Priority: Major > Labels: pull-request-available > > When patching the datanode's fine-grained lock, we found that the datanode > couldn't start; a deadlock probably occurred during addBlockPool, so we can remove the redundant lock. 
> {code:java} > // getspaceused classname > > fs.getspaceused.classname > > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed > {code} > {code:java} > // > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool > > // takes the writeLock > @Override > public void addBlockPool(String bpid, Configuration conf) > throws IOException { > LOG.info("Adding block pool " + bpid); > AddBlockPoolException volumeExceptions = new AddBlockPoolException(); > try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, > bpid)) { > try { > volumes.addBlockPool(bpid, conf); > } catch (AddBlockPoolException e) { > volumeExceptions.mergeException(e); > } > volumeMap.initBlockPool(bpid); > Set<String> vols = storageMap.keySet(); > for (String v : vols) { > lockManager.addLock(LockLevel.VOLUME, bpid, v); > } > } > > } {code} > {code:java} > // > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica > // needs the readLock > void replicas(String bpid, Consumer<Iterator<ReplicaInfo>> consumer) { > LightWeightResizableGSet<Block, ReplicaInfo> m = null; > try (AutoCloseDataSetLock l = lockManager.readLock(LockLevel.BLOCK_POOl, > bpid)) { > m = map.get(bpid); > if (m != null) { > m.getIterator(consumer); > } > } > } {code} > > Because the read lock is requested from a different thread, the write lock cannot > be downgraded to a read lock: > {code:java} > void addBlockPool(final String bpid, final Configuration conf) throws > IOException { > long totalStartTime = Time.monotonicNow(); > final Map<FsVolumeSpi, IOException> unhealthyDataDirs = > new ConcurrentHashMap<FsVolumeSpi, IOException>(); > List<Thread> blockPoolAddingThreads = new ArrayList<Thread>(); > for (final FsVolumeImpl v : volumes) { > Thread t = new Thread() { > public void run() { > try (FsVolumeReference ref = v.obtainReference()) { > FsDatasetImpl.LOG.info("Scanning block pool " + bpid + > " on volume " + v + "..."); > long startTime = Time.monotonicNow(); > v.addBlockPool(bpid, conf); > long timeTaken = Time.monotonicNow() - startTime; > FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid + > " on " + v + ": " + timeTaken + "ms"); > } catch (IOException ioe) { > FsDatasetImpl.LOG.info("Caught exception while scanning " + v + > ". Will throw later.", ioe); > unhealthyDataDirs.put(v, ioe); > } > } > }; > blockPoolAddingThreads.add(t); > t.start(); > } > for (Thread t : blockPoolAddingThreads) { > try { > t.join(); > } catch (InterruptedException ie) { > throw new IOException(ie); > } > } > } {code} >
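The cross-thread point in the description above is the crux: a ReentrantReadWriteLock only supports lock downgrading within the thread that holds the write lock, so read-lock attempts from the spawned per-volume scan threads block for as long as addBlockPool holds the BLOCK_POOL write lock. A minimal standalone sketch of that pattern (names are illustrative, not the FsDatasetImpl code; a timed tryLock stands in for the indefinite block seen on the DataNode):

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the deadlock pattern described above: ReentrantReadWriteLock
// only allows downgrading within the thread that holds the write lock.
// A read-lock request from any *other* thread blocks, so per-volume scan
// threads spawned while the write lock is held cannot proceed. Names are
// illustrative; this is not the FsDatasetImpl code.
public class CrossThreadLockDemo {
  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    lock.writeLock().lock(); // plays FsDatasetImpl#addBlockPool
    final boolean[] acquired = {false};
    Thread volumeScanner = new Thread(() -> {
      try {
        // plays deepCopyReplica() running inside a volume-scan thread;
        // a timed tryLock stands in for the indefinite block.
        acquired[0] = lock.readLock().tryLock(200, TimeUnit.MILLISECONDS);
      } catch (InterruptedException ignored) {
      }
    });
    volumeScanner.start();
    volumeScanner.join();
    System.out.println("readLockAcquired=" + acquired[0]);
    lock.writeLock().unlock();
  }
}
```

Had the read lock been requested from the same thread that holds the write lock, the acquisition would succeed (reentrant downgrade); across threads it can only wait, which is why addBlockPool joining on those threads deadlocks.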
[jira] [Commented] (HDFS-16809) EC striped block is not sufficient when doing in maintenance
[ https://issues.apache.org/jira/browse/HDFS-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641287#comment-17641287 ] ASF GitHub Bot commented on HDFS-16809: --- dingshun3016 commented on PR #5050: URL: https://github.com/apache/hadoop/pull/5050#issuecomment-1332104291 @tasanuma Thanks for your review. I have submitted the relevant test cases; please review them. > EC striped block is not sufficient when doing in maintenance > > > Key: HDFS-16809 > URL: https://issues.apache.org/jira/browse/HDFS-16809 > Project: Hadoop HDFS > Issue Type: Bug > Components: ec, erasure-coding >Reporter: dingshun >Assignee: dingshun >Priority: Major > Labels: pull-request-available > > When doing maintenance, the EC striped block is not sufficiently replicated, which > can lead to missing blocks
[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool
[ https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641171#comment-17641171 ] Xiaoqiao He commented on HDFS-16855: [~dingshun] Thanks for the detailed explanation. IIRC, early in the DataNode fine-grained locking work this could indeed cause a deadlock, and it was fixed in follow-up PRs. Sorry, I can't find the related PR right now. Would you mind checking the logic on branch trunk? More feedback is welcome if you hit any other issue. Thanks again. cc [~Aiphag0] Any more suggestions?