[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool

2022-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641693#comment-17641693
 ] 

ASF GitHub Bot commented on HDFS-16855:
---

dingshun3016 commented on PR #5170:
URL: https://github.com/apache/hadoop/pull/5170#issuecomment-1333269882

   > @dingshun3016 This seems to only happen when invoke addBlockPool() and 
CachingGetSpaceUsed#used < 0, so why not handle it for example like forbid 
refresh() when ReplicaCachingGetSpaceUsed#init() at first time ?
   
   @MingXiangLi  thanks reply
   
   forbid refresh() when ReplicaCachingGetSpaceUsed #init() at first time,it 
will cause the value of dfsUsage to be 0 until the next time refresh(). 
   
if remove the BLOCK_POOl level write lock in the 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool(String
 bpid, Configuration conf) method, what will be the impact ?
   
   do you have any other suggestions?




> Remove the redundant write lock in addBlockPool
> ---
>
> Key: HDFS-16855
> URL: https://issues.apache.org/jira/browse/HDFS-16855
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: dingshun
>Priority: Major
>  Labels: pull-request-available
>
> When patching the datanode's fine-grained lock, we found that the datanode 
> couldn't start,maybe happened deadlock,when addBlockPool, so we can remove it.
> {code:java}
> // getspaceused classname
>   
>     fs.getspaceused.classname
>     
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed
>    {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool
>  
> // get writeLock
> @Override
> public void addBlockPool(String bpid, Configuration conf)
> throws IOException {
>   LOG.info("Adding block pool " + bpid);
>   AddBlockPoolException volumeExceptions = new AddBlockPoolException();
>   try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> try {
>   volumes.addBlockPool(bpid, conf);
> } catch (AddBlockPoolException e) {
>   volumeExceptions.mergeException(e);
> }
> volumeMap.initBlockPool(bpid);
> Set vols = storageMap.keySet();
> for (String v : vols) {
>   lockManager.addLock(LockLevel.VOLUME, bpid, v);
> }
>   }
>  
> } {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica
> // need readLock
> void replicas(String bpid, Consumer> consumer) {
>   LightWeightResizableGSet m = null;
>   try (AutoCloseDataSetLock l = lockManager.readLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> m = map.get(bpid);
> if (m !=null) {
>   m.getIterator(consumer);
> }
>   }
> } {code}
>  
> because it is not the same thread, so the write lock cannot be downgraded to 
> a read lock
> {code:java}
> void addBlockPool(final String bpid, final Configuration conf) throws 
> IOException {
>   long totalStartTime = Time.monotonicNow();
>   final Map unhealthyDataDirs =
>   new ConcurrentHashMap();
>   List blockPoolAddingThreads = new ArrayList();
>   for (final FsVolumeImpl v : volumes) {
> Thread t = new Thread() {
>   public void run() {
> try (FsVolumeReference ref = v.obtainReference()) {
>   FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>   " on volume " + v + "...");
>   long startTime = Time.monotonicNow();
>   v.addBlockPool(bpid, conf);
>   long timeTaken = Time.monotonicNow() - startTime;
>   FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>   " on " + v + ": " + timeTaken + "ms");
> } catch (IOException ioe) {
>   FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>   ". Will throw later.", ioe);
>   unhealthyDataDirs.put(v, ioe);
> }
>   }
> };
> blockPoolAddingThreads.add(t);
> t.start();
>   }
>   for (Thread t : blockPoolAddingThreads) {
> try {
>   t.join();
> } catch (InterruptedException ie) {
>   throw new IOException(ie);
> }
>   }
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool

2022-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641690#comment-17641690
 ] 

ASF GitHub Bot commented on HDFS-16855:
---

dingshun3016 opened a new pull request, #5170:
URL: https://github.com/apache/hadoop/pull/5170

   When patching the datanode's fine-grained lock, we found that the datanode 
couldn't start,maybe happened deadlock,when addBlockPool, so we can remove it.
   
   
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool
get writeLock
   
   
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica
  need readLock
   
   because it is not the same thread, so the write lock cannot be downgraded to 
a read lock




> Remove the redundant write lock in addBlockPool
> ---
>
> Key: HDFS-16855
> URL: https://issues.apache.org/jira/browse/HDFS-16855
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: dingshun
>Priority: Major
>  Labels: pull-request-available
>
> When patching the datanode's fine-grained lock, we found that the datanode 
> couldn't start,maybe happened deadlock,when addBlockPool, so we can remove it.
> {code:java}
> // getspaceused classname
>   
>     fs.getspaceused.classname
>     
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed
>    {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool
>  
> // get writeLock
> @Override
> public void addBlockPool(String bpid, Configuration conf)
> throws IOException {
>   LOG.info("Adding block pool " + bpid);
>   AddBlockPoolException volumeExceptions = new AddBlockPoolException();
>   try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> try {
>   volumes.addBlockPool(bpid, conf);
> } catch (AddBlockPoolException e) {
>   volumeExceptions.mergeException(e);
> }
> volumeMap.initBlockPool(bpid);
> Set vols = storageMap.keySet();
> for (String v : vols) {
>   lockManager.addLock(LockLevel.VOLUME, bpid, v);
> }
>   }
>  
> } {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica
> // need readLock
> void replicas(String bpid, Consumer> consumer) {
>   LightWeightResizableGSet m = null;
>   try (AutoCloseDataSetLock l = lockManager.readLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> m = map.get(bpid);
> if (m !=null) {
>   m.getIterator(consumer);
> }
>   }
> } {code}
>  
> because it is not the same thread, so the write lock cannot be downgraded to 
> a read lock
> {code:java}
> void addBlockPool(final String bpid, final Configuration conf) throws 
> IOException {
>   long totalStartTime = Time.monotonicNow();
>   final Map unhealthyDataDirs =
>   new ConcurrentHashMap();
>   List blockPoolAddingThreads = new ArrayList();
>   for (final FsVolumeImpl v : volumes) {
> Thread t = new Thread() {
>   public void run() {
> try (FsVolumeReference ref = v.obtainReference()) {
>   FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>   " on volume " + v + "...");
>   long startTime = Time.monotonicNow();
>   v.addBlockPool(bpid, conf);
>   long timeTaken = Time.monotonicNow() - startTime;
>   FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>   " on " + v + ": " + timeTaken + "ms");
> } catch (IOException ioe) {
>   FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>   ". Will throw later.", ioe);
>   unhealthyDataDirs.put(v, ioe);
> }
>   }
> };
> blockPoolAddingThreads.add(t);
> t.start();
>   }
>   for (Thread t : blockPoolAddingThreads) {
> try {
>   t.join();
> } catch (InterruptedException ie) {
>   throw new IOException(ie);
> }
>   }
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool

2022-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641688#comment-17641688
 ] 

ASF GitHub Bot commented on HDFS-16855:
---

dingshun3016 commented on PR #5170:
URL: https://github.com/apache/hadoop/pull/5170#issuecomment-1333268646

   > @dingshun3016 This seems to only happen when invoke addBlockPool() and 
CachingGetSpaceUsed#used < 0, so why not handle it for example like forbid 
refresh() when ReplicaCachingGetSpaceUsed#init() at first time ?
   
   




> Remove the redundant write lock in addBlockPool
> ---
>
> Key: HDFS-16855
> URL: https://issues.apache.org/jira/browse/HDFS-16855
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: dingshun
>Priority: Major
>  Labels: pull-request-available
>
> When patching the datanode's fine-grained lock, we found that the datanode 
> couldn't start,maybe happened deadlock,when addBlockPool, so we can remove it.
> {code:java}
> // getspaceused classname
>   
>     fs.getspaceused.classname
>     
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed
>    {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool
>  
> // get writeLock
> @Override
> public void addBlockPool(String bpid, Configuration conf)
> throws IOException {
>   LOG.info("Adding block pool " + bpid);
>   AddBlockPoolException volumeExceptions = new AddBlockPoolException();
>   try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> try {
>   volumes.addBlockPool(bpid, conf);
> } catch (AddBlockPoolException e) {
>   volumeExceptions.mergeException(e);
> }
> volumeMap.initBlockPool(bpid);
> Set vols = storageMap.keySet();
> for (String v : vols) {
>   lockManager.addLock(LockLevel.VOLUME, bpid, v);
> }
>   }
>  
> } {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica
> // need readLock
> void replicas(String bpid, Consumer> consumer) {
>   LightWeightResizableGSet m = null;
>   try (AutoCloseDataSetLock l = lockManager.readLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> m = map.get(bpid);
> if (m !=null) {
>   m.getIterator(consumer);
> }
>   }
> } {code}
>  
> because it is not the same thread, so the write lock cannot be downgraded to 
> a read lock
> {code:java}
> void addBlockPool(final String bpid, final Configuration conf) throws 
> IOException {
>   long totalStartTime = Time.monotonicNow();
>   final Map unhealthyDataDirs =
>   new ConcurrentHashMap();
>   List blockPoolAddingThreads = new ArrayList();
>   for (final FsVolumeImpl v : volumes) {
> Thread t = new Thread() {
>   public void run() {
> try (FsVolumeReference ref = v.obtainReference()) {
>   FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>   " on volume " + v + "...");
>   long startTime = Time.monotonicNow();
>   v.addBlockPool(bpid, conf);
>   long timeTaken = Time.monotonicNow() - startTime;
>   FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>   " on " + v + ": " + timeTaken + "ms");
> } catch (IOException ioe) {
>   FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>   ". Will throw later.", ioe);
>   unhealthyDataDirs.put(v, ioe);
> }
>   }
> };
> blockPoolAddingThreads.add(t);
> t.start();
>   }
>   for (Thread t : blockPoolAddingThreads) {
> try {
>   t.join();
> } catch (InterruptedException ie) {
>   throw new IOException(ie);
> }
>   }
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool

2022-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641689#comment-17641689
 ] 

ASF GitHub Bot commented on HDFS-16855:
---

dingshun3016 closed pull request #5170: HDFS-16855. Remove the redundant write 
lock in addBlockPool. 
URL: https://github.com/apache/hadoop/pull/5170




> Remove the redundant write lock in addBlockPool
> ---
>
> Key: HDFS-16855
> URL: https://issues.apache.org/jira/browse/HDFS-16855
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: dingshun
>Priority: Major
>  Labels: pull-request-available
>
> When patching the datanode's fine-grained lock, we found that the datanode 
> couldn't start,maybe happened deadlock,when addBlockPool, so we can remove it.
> {code:java}
> // getspaceused classname
>   
>     fs.getspaceused.classname
>     
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed
>    {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool
>  
> // get writeLock
> @Override
> public void addBlockPool(String bpid, Configuration conf)
> throws IOException {
>   LOG.info("Adding block pool " + bpid);
>   AddBlockPoolException volumeExceptions = new AddBlockPoolException();
>   try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> try {
>   volumes.addBlockPool(bpid, conf);
> } catch (AddBlockPoolException e) {
>   volumeExceptions.mergeException(e);
> }
> volumeMap.initBlockPool(bpid);
> Set vols = storageMap.keySet();
> for (String v : vols) {
>   lockManager.addLock(LockLevel.VOLUME, bpid, v);
> }
>   }
>  
> } {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica
> // need readLock
> void replicas(String bpid, Consumer> consumer) {
>   LightWeightResizableGSet m = null;
>   try (AutoCloseDataSetLock l = lockManager.readLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> m = map.get(bpid);
> if (m !=null) {
>   m.getIterator(consumer);
> }
>   }
> } {code}
>  
> because it is not the same thread, so the write lock cannot be downgraded to 
> a read lock
> {code:java}
> void addBlockPool(final String bpid, final Configuration conf) throws 
> IOException {
>   long totalStartTime = Time.monotonicNow();
>   final Map unhealthyDataDirs =
>   new ConcurrentHashMap();
>   List blockPoolAddingThreads = new ArrayList();
>   for (final FsVolumeImpl v : volumes) {
> Thread t = new Thread() {
>   public void run() {
> try (FsVolumeReference ref = v.obtainReference()) {
>   FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>   " on volume " + v + "...");
>   long startTime = Time.monotonicNow();
>   v.addBlockPool(bpid, conf);
>   long timeTaken = Time.monotonicNow() - startTime;
>   FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>   " on " + v + ": " + timeTaken + "ms");
> } catch (IOException ioe) {
>   FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>   ". Will throw later.", ioe);
>   unhealthyDataDirs.put(v, ioe);
> }
>   }
> };
> blockPoolAddingThreads.add(t);
> t.start();
>   }
>   for (Thread t : blockPoolAddingThreads) {
> try {
>   t.join();
> } catch (InterruptedException ie) {
>   throw new IOException(ie);
> }
>   }
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16633) Reserved Space For Replicas is not released on some cases

2022-11-30 Thread Akira Ajisaka (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Akira Ajisaka updated HDFS-16633:
-
Fix Version/s: 3.3.9

Backported to branch-3.3.

> Reserved Space For Replicas is not released on some cases
> -
>
> Key: HDFS-16633
> URL: https://issues.apache.org/jira/browse/HDFS-16633
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: hdfs
>Affects Versions: 3.1.2
>Reporter: Prabhu Joseph
>Assignee: Ashutosh Gupta
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> Have found the Reserved Space For Replicas is not released on some cases in a 
> Cx Prod cluster. There are few fixes like HDFS-9530 and HDFS-8072 but still 
> the issue is not completely fixed. Have tried to debug the root cause but 
> this will take lot of time as it is Cx Prod Cluster. 
> But we have an easier way to fix the issue completely by releasing any 
> remaining reserved space of the Replica from the Volume. 
> DataXceiver#writeBlock finally will call BlockReceiver#close which will check 
> if the ReplicaInfo has any remaining reserved space, if so release from the 
> Volume.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool

2022-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641675#comment-17641675
 ] 

ASF GitHub Bot commented on HDFS-16855:
---

MingXiangLi commented on PR #5170:
URL: https://github.com/apache/hadoop/pull/5170#issuecomment-1333200919

   @dingshun3016 This seems to only happen when invoke addBlockPool() and 
CachingGetSpaceUsed#used < 0, so why not handle it for example like forbid 
refresh() when CachingGetSpaceUsed#init() at first time ? 




> Remove the redundant write lock in addBlockPool
> ---
>
> Key: HDFS-16855
> URL: https://issues.apache.org/jira/browse/HDFS-16855
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: dingshun
>Priority: Major
>  Labels: pull-request-available
>
> When patching the datanode's fine-grained lock, we found that the datanode 
> couldn't start,maybe happened deadlock,when addBlockPool, so we can remove it.
> {code:java}
> // getspaceused classname
>   
>     fs.getspaceused.classname
>     
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed
>    {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool
>  
> // get writeLock
> @Override
> public void addBlockPool(String bpid, Configuration conf)
> throws IOException {
>   LOG.info("Adding block pool " + bpid);
>   AddBlockPoolException volumeExceptions = new AddBlockPoolException();
>   try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> try {
>   volumes.addBlockPool(bpid, conf);
> } catch (AddBlockPoolException e) {
>   volumeExceptions.mergeException(e);
> }
> volumeMap.initBlockPool(bpid);
> Set vols = storageMap.keySet();
> for (String v : vols) {
>   lockManager.addLock(LockLevel.VOLUME, bpid, v);
> }
>   }
>  
> } {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica
> // need readLock
> void replicas(String bpid, Consumer> consumer) {
>   LightWeightResizableGSet m = null;
>   try (AutoCloseDataSetLock l = lockManager.readLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> m = map.get(bpid);
> if (m !=null) {
>   m.getIterator(consumer);
> }
>   }
> } {code}
>  
> because it is not the same thread, so the write lock cannot be downgraded to 
> a read lock
> {code:java}
> void addBlockPool(final String bpid, final Configuration conf) throws 
> IOException {
>   long totalStartTime = Time.monotonicNow();
>   final Map unhealthyDataDirs =
>   new ConcurrentHashMap();
>   List blockPoolAddingThreads = new ArrayList();
>   for (final FsVolumeImpl v : volumes) {
> Thread t = new Thread() {
>   public void run() {
> try (FsVolumeReference ref = v.obtainReference()) {
>   FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>   " on volume " + v + "...");
>   long startTime = Time.monotonicNow();
>   v.addBlockPool(bpid, conf);
>   long timeTaken = Time.monotonicNow() - startTime;
>   FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>   " on " + v + ": " + timeTaken + "ms");
> } catch (IOException ioe) {
>   FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>   ". Will throw later.", ioe);
>   unhealthyDataDirs.put(v, ioe);
> }
>   }
> };
> blockPoolAddingThreads.add(t);
> t.start();
>   }
>   for (Thread t : blockPoolAddingThreads) {
> try {
>   t.join();
> } catch (InterruptedException ie) {
>   throw new IOException(ie);
> }
>   }
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-7343) HDFS smart storage management

2022-11-30 Thread Brahma Reddy Battula (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-7343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641615#comment-17641615
 ] 

Brahma Reddy Battula commented on HDFS-7343:


{quote}Hi Brahma, currently we have no plan to merge this feature to upstream. 
We have a repo to maintain this project. See 
[https://github.com/Intel-bigdata/SSM]
{quote}
Ok. thanks.

[~zhouwei] /[~PhiloHe] 

i) Any features are pending.? is it production ready..?

ii) kafka and ZK are required to deploy this.?

iii) any chance to move this as subproject or incubation to apache..?

> HDFS smart storage management
> -
>
> Key: HDFS-7343
> URL: https://issues.apache.org/jira/browse/HDFS-7343
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kai Zheng
>Assignee: Wei Zhou
>Priority: Major
> Attachments: HDFS-Smart-Storage-Management-update.pdf, 
> HDFS-Smart-Storage-Management.pdf, 
> HDFSSmartStorageManagement-General-20170315.pdf, 
> HDFSSmartStorageManagement-Phase1-20170315.pdf, access_count_tables.jpg, 
> move.jpg, tables_in_ssm.xlsx
>
>
> As discussed in HDFS-7285, it would be better to have a comprehensive and 
> flexible storage policy engine considering file attributes, metadata, data 
> temperature, storage type, EC codec, available hardware capabilities, 
> user/application preference and etc.
> Modified the title for re-purpose.
> We'd extend this effort some bit and aim to work on a comprehensive solution 
> to provide smart storage management service in order for convenient, 
> intelligent and effective utilizing of erasure coding or replicas, HDFS cache 
> facility, HSM offering, and all kinds of tools (balancer, mover, disk 
> balancer and so on) in a large cluster.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16837) [RBF SBN] ClientGSIContext should merge RouterFederatedStates to get the max state id for each namespace

2022-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641597#comment-17641597
 ] 

ASF GitHub Bot commented on HDFS-16837:
---

simbadzina commented on PR #5123:
URL: https://github.com/apache/hadoop/pull/5123#issuecomment-1332948822

   @ZanderXu I committed some changes to the tests which are causing merge 
conflicts with your PR. There are just two conflicts though.




> [RBF SBN] ClientGSIContext should merge RouterFederatedStates to get the max 
> state id for each namespace
> 
>
> Key: HDFS-16837
> URL: https://issues.apache.org/jira/browse/HDFS-16837
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> ClientGSIContext should merge local and remote RouterFederatedState to get 
> the max state id for each namespace.
> And the related code as bellows:
> {code:java}
> @Override
> public synchronized void receiveResponseState(RpcResponseHeaderProto header) {
>   if (header.hasRouterFederatedState()) {
> // BUG here
> routerFederatedState = header.getRouterFederatedState();
>   } else {
> lastSeenStateId.accumulate(header.getStateId());
>   }
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16809) EC striped block is not sufficient when doing in maintenance

2022-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641575#comment-17641575
 ] 

ASF GitHub Bot commented on HDFS-16809:
---

hadoop-yetus commented on PR #5050:
URL: https://github.com/apache/hadoop/pull/5050#issuecomment-1332827003

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m  9s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  41m 48s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 39s |  |  trunk passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  compile  |   1m 30s |  |  trunk passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  checkstyle  |   1m 17s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 40s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 17s |  |  trunk passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 34s |  |  trunk passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 44s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  26m 20s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 24s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 29s |  |  the patch passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javac  |   1m 29s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 20s |  |  the patch passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  javac  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 58s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5050/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 2 new + 81 unchanged - 
1 fixed = 83 total (was 82)  |
   | +1 :green_heart: |  mvnsite  |   1m 27s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 58s |  |  the patch passed with JDK 
Ubuntu-11.0.16+8-post-Ubuntu-0ubuntu120.04  |
   | +1 :green_heart: |  javadoc  |   1m 24s |  |  the patch passed with JDK 
Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07  |
   | +1 :green_heart: |  spotbugs  |   3m 31s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  29m 27s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 461m 15s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5050/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 55s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 584m 11s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
   |   | hadoop.hdfs.server.namenode.ha.TestObserverNode |
   |   | hadoop.hdfs.TestWriteConfigurationToDFS |
   |   | hadoop.hdfs.TestLeaseRecovery2 |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.41 ServerAPI=1.41 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5050/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5050 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux f36a8911e6e0 4.15.0-191-generic #202-Ubuntu SMP Thu Aug 4 
01:49:29 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 7f6b3165fbcd8621650a6da458b9bd5374918ec1 |
   | Default Java | Private Build-1.8.0_342-8u342-b07-0ubuntu1~20.04-b07 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjd

[jira] [Commented] (HDFS-16839) It should consider EC reconstruction work when we determine if a node is busy

2022-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641424#comment-17641424
 ] 

ASF GitHub Bot commented on HDFS-16839:
---

Kidd53685368 commented on PR #5128:
URL: https://github.com/apache/hadoop/pull/5128#issuecomment-1332401791

   Thanks for the reviews!




> It should consider EC reconstruction work when we determine if a node is busy
> -
>
> Key: HDFS-16839
> URL: https://issues.apache.org/jira/browse/HDFS-16839
> Project: Hadoop HDFS
>  Issue Type: Improvement
>Reporter: Kidd5368
>Assignee: Kidd5368
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
>
> In chooseSourceDatanodes( ), I think it's more reasonable if we take EC 
> reconstruction work as a consideration when we determine if a node is busy or 
> not.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash

2022-11-30 Thread Erik Krogen (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Erik Krogen resolved HDFS-16550.

Fix Version/s: 3.4.0
   Resolution: Fixed

> [SBN read] Improper cache-size for journal node may cause cluster crash
> ---
>
> Key: HDFS-16550
> URL: https://issues.apache.org/jira/browse/HDFS-16550
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
> Attachments: image-2022-04-21-09-54-29-751.png, 
> image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When we introduced {*}SBN Read{*}, we encountered a situation during upgrade 
> the JournalNodes.
> Cluster Info: 
> *Active: nn0*
> *Standby: nn1*
> 1. Rolling restart journal node. {color:#ff}(related config: 
> fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color}
> 2. The cluster runs for a while, edits cache usage is increasing and memory 
> is used up.
> 3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed 
> out waiting 12ms for a quorum of nodes to respond”{_}.
> 4. Transfer nn1 to Active state.
> 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of 
> “{_}Timed out waiting 12ms for a quorum of nodes to respond” too{_}.
> 6. {color:#ff}The cluster crashed{color}.
>  
> Related code:
> {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>   DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
> Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
> "maximum JVM memory is only %d bytes. It is recommended that you " +
> "decrease the cache size or increase the heap size.",
> capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>   "of bytes: " + capacity);
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> } {code}
> Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size 
> than the memory requested by the process. If 
> {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * 
> Runtime.getruntime().maxMemory(){*}, only warn logs are printed during 
> journalnode startup. This can easily be overlooked by users. However, as the 
> cluster runs to a certain period of time, it is likely to cause the cluster 
> to crash.
>  
> NN log:
> !image-2022-04-21-09-54-57-111.png|width=1012,height=47!
> !image-2022-04-21-12-32-56-170.png|width=809,height=218!
> IMO, we should not set the {{cache size}} to a fixed value, but to the ratio 
> of maximum memory, which is 0.2 by default.
> This avoids the problem of too large cache size. In addition, users can 
> actively adjust the heap size when they need to increase the cache size.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16853) The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed because HADOOP-18324

2022-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641406#comment-17641406
 ] 

ASF GitHub Bot commented on HDFS-16853:
---

xkrogen commented on PR #5162:
URL: https://github.com/apache/hadoop/pull/5162#issuecomment-1332369461

   cc @omalley 




> The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed 
> because HADOOP-18324
> ---
>
> Key: HDFS-16853
> URL: https://issues.apache.org/jira/browse/HDFS-16853
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ZanderXu
>Assignee: ZanderXu
>Priority: Major
>  Labels: pull-request-available
>
> The UT TestLeaseRecovery2#testHardLeaseRecoveryAfterNameNodeRestart failed 
> with error message: Waiting for cluster to become active. And the blocking 
> jstack as bellows:
> {code:java}
> "BP-1618793397-192.168.3.4-1669198559828 heartbeating to 
> localhost/127.0.0.1:54673" #260 daemon prio=5 os_prio=31 tid=0x
> 7fc1108fa000 nid=0x19303 waiting on condition [0x700017884000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for  <0x0007430a9ec0> (a 
> java.util.concurrent.SynchronousQueue$TransferQueue)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at 
> java.util.concurrent.SynchronousQueue$TransferQueue.awaitFulfill(SynchronousQueue.java:762)
>         at 
> java.util.concurrent.SynchronousQueue$TransferQueue.transfer(SynchronousQueue.java:695)
>         at 
> java.util.concurrent.SynchronousQueue.put(SynchronousQueue.java:877)
>         at 
> org.apache.hadoop.ipc.Client$Connection.sendRpcRequest(Client.java:1186)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1482)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1429)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:258)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Invoker.invoke(ProtobufRpcEngine2.java:139)
>         at com.sun.proxy.$Proxy23.sendHeartbeat(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.sendHeartbeat(DatanodeProtocolClient
> SideTranslatorPB.java:168)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:570)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:714)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:915)
>         at java.lang.Thread.run(Thread.java:748)  {code}
> After looking into the code and found that this bug is imported by 
> HADOOP-18324. Because RpcRequestSender exited without cleaning up the 
> rpcRequestQueue, then caused BPServiceActor was blocked in sending request.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash

2022-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641405#comment-17641405
 ] 

ASF GitHub Bot commented on HDFS-16550:
---

xkrogen merged PR #4209:
URL: https://github.com/apache/hadoop/pull/4209




> [SBN read] Improper cache-size for journal node may cause cluster crash
> ---
>
> Key: HDFS-16550
> URL: https://issues.apache.org/jira/browse/HDFS-16550
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-04-21-09-54-29-751.png, 
> image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When we introduced {*}SBN Read{*}, we encountered a situation during upgrade 
> the JournalNodes.
> Cluster Info: 
> *Active: nn0*
> *Standby: nn1*
> 1. Rolling restart journal node. {color:#ff}(related config: 
> fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color}
> 2. The cluster runs for a while, edits cache usage is increasing and memory 
> is used up.
> 3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed 
> out waiting 12ms for a quorum of nodes to respond”{_}.
> 4. Transfer nn1 to Active state.
> 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of 
> “{_}Timed out waiting 12ms for a quorum of nodes to respond” too{_}.
> 6. {color:#ff}The cluster crashed{color}.
>  
> Related code:
> {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>   DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
> Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
> "maximum JVM memory is only %d bytes. It is recommended that you " +
> "decrease the cache size or increase the heap size.",
> capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>   "of bytes: " + capacity);
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> } {code}
> Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size 
> than the memory requested by the process. If 
> {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * 
> Runtime.getruntime().maxMemory(){*}, only warn logs are printed during 
> journalnode startup. This can easily be overlooked by users. However, as the 
> cluster runs to a certain period of time, it is likely to cause the cluster 
> to crash.
>  
> NN log:
> !image-2022-04-21-09-54-57-111.png|width=1012,height=47!
> !image-2022-04-21-12-32-56-170.png|width=809,height=218!
> IMO, we should not set the {{cache size}} to a fixed value, but to the ratio 
> of maximum memory, which is 0.2 by default.
> This avoids the problem of too large cache size. In addition, users can 
> actively adjust the heap size when they need to increase the cache size.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16550) [SBN read] Improper cache-size for journal node may cause cluster crash

2022-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16550?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641403#comment-17641403
 ] 

ASF GitHub Bot commented on HDFS-16550:
---

xkrogen commented on PR #4209:
URL: https://github.com/apache/hadoop/pull/4209#issuecomment-1332367011

   `TestLeaseRecoveryV2` failure is tracked in HDFS-16853. LGTM. Merging to 
trunk. Thanks for the contribution @tomscut !




> [SBN read] Improper cache-size for journal node may cause cluster crash
> ---
>
> Key: HDFS-16550
> URL: https://issues.apache.org/jira/browse/HDFS-16550
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Tao Li
>Assignee: Tao Li
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2022-04-21-09-54-29-751.png, 
> image-2022-04-21-09-54-57-111.png, image-2022-04-21-12-32-56-170.png
>
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> When we introduced {*}SBN Read{*}, we encountered a situation during upgrade 
> the JournalNodes.
> Cluster Info: 
> *Active: nn0*
> *Standby: nn1*
> 1. Rolling restart journal node. {color:#ff}(related config: 
> fs.journalnode.edit-cache-size.bytes=1G, -Xms1G, -Xmx=1G){color}
> 2. The cluster runs for a while, edits cache usage is increasing and memory 
> is used up.
> 3. {color:#ff}Active namenode(nn0){color} shutdown because of “{_}Timed 
> out waiting 12ms for a quorum of nodes to respond”{_}.
> 4. Transfer nn1 to Active state.
> 5. {color:#ff}New Active namenode(nn1){color} also shutdown because of 
> “{_}Timed out waiting 12ms for a quorum of nodes to respond” too{_}.
> 6. {color:#ff}The cluster crashed{color}.
>  
> Related code:
> {code:java}
> JournaledEditsCache(Configuration conf) {
>   capacity = conf.getInt(DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_KEY,
>   DFSConfigKeys.DFS_JOURNALNODE_EDIT_CACHE_SIZE_DEFAULT);
>   if (capacity > 0.9 * Runtime.getRuntime().maxMemory()) {
> Journal.LOG.warn(String.format("Cache capacity is set at %d bytes but " +
> "maximum JVM memory is only %d bytes. It is recommended that you " +
> "decrease the cache size or increase the heap size.",
> capacity, Runtime.getRuntime().maxMemory()));
>   }
>   Journal.LOG.info("Enabling the journaled edits cache with a capacity " +
>   "of bytes: " + capacity);
>   ReadWriteLock lock = new ReentrantReadWriteLock(true);
>   readLock = new AutoCloseableLock(lock.readLock());
>   writeLock = new AutoCloseableLock(lock.writeLock());
>   initialize(INVALID_TXN_ID);
> } {code}
> Currently, *fs.journalNode.edit-cache-size-bytes* can be set to a larger size 
> than the memory requested by the process. If 
> {*}fs.journalNode.edit-cache-sie.bytes > 0.9 * 
> Runtime.getruntime().maxMemory(){*}, only warn logs are printed during 
> journalnode startup. This can easily be overlooked by users. However, as the 
> cluster runs to a certain period of time, it is likely to cause the cluster 
> to crash.
>  
> NN log:
> !image-2022-04-21-09-54-57-111.png|width=1012,height=47!
> !image-2022-04-21-12-32-56-170.png|width=809,height=218!
> IMO, we should not set the {{cache size}} to a fixed value, but to the ratio 
> of maximum memory, which is 0.2 by default.
> This avoids the problem of too large cache size. In addition, users can 
> actively adjust the heap size when they need to increase the cache size.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool

2022-11-30 Thread dingshun (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641299#comment-17641299
 ] 

dingshun commented on HDFS-16855:
-

[~hexiaoqiao] 

I've looked at the logic of the latest trunk branch, but can't seem to find 
anything. If you later find a PR that fixes it, please post it. thanks

In addition, I would like to ask, if I remove the BLOCK_POOl level write lock 
in the 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool(String
 bpid, Configuration conf) method, what will be the impact ?

> Remove the redundant write lock in addBlockPool
> ---
>
> Key: HDFS-16855
> URL: https://issues.apache.org/jira/browse/HDFS-16855
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: dingshun
>Priority: Major
>  Labels: pull-request-available
>
> When patching the datanode's fine-grained lock, we found that the datanode 
> couldn't start,maybe happened deadlock,when addBlockPool, so we can remove it.
> {code:java}
> // getspaceused classname
>   
>     fs.getspaceused.classname
>     
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed
>    {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool
>  
> // get writeLock
> @Override
> public void addBlockPool(String bpid, Configuration conf)
> throws IOException {
>   LOG.info("Adding block pool " + bpid);
>   AddBlockPoolException volumeExceptions = new AddBlockPoolException();
>   try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> try {
>   volumes.addBlockPool(bpid, conf);
> } catch (AddBlockPoolException e) {
>   volumeExceptions.mergeException(e);
> }
> volumeMap.initBlockPool(bpid);
> Set vols = storageMap.keySet();
> for (String v : vols) {
>   lockManager.addLock(LockLevel.VOLUME, bpid, v);
> }
>   }
>  
> } {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica
> // need readLock
> void replicas(String bpid, Consumer> consumer) {
>   LightWeightResizableGSet m = null;
>   try (AutoCloseDataSetLock l = lockManager.readLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> m = map.get(bpid);
> if (m !=null) {
>   m.getIterator(consumer);
> }
>   }
> } {code}
>  
> because it is not the same thread, so the write lock cannot be downgraded to 
> a read lock
> {code:java}
> void addBlockPool(final String bpid, final Configuration conf) throws 
> IOException {
>   long totalStartTime = Time.monotonicNow();
>   final Map unhealthyDataDirs =
>   new ConcurrentHashMap();
>   List blockPoolAddingThreads = new ArrayList();
>   for (final FsVolumeImpl v : volumes) {
> Thread t = new Thread() {
>   public void run() {
> try (FsVolumeReference ref = v.obtainReference()) {
>   FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>   " on volume " + v + "...");
>   long startTime = Time.monotonicNow();
>   v.addBlockPool(bpid, conf);
>   long timeTaken = Time.monotonicNow() - startTime;
>   FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>   " on " + v + ": " + timeTaken + "ms");
> } catch (IOException ioe) {
>   FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>   ". Will throw later.", ioe);
>   unhealthyDataDirs.put(v, ioe);
> }
>   }
> };
> blockPoolAddingThreads.add(t);
> t.start();
>   }
>   for (Thread t : blockPoolAddingThreads) {
> try {
>   t.join();
> } catch (InterruptedException ie) {
>   throw new IOException(ie);
> }
>   }
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16809) EC striped block is not sufficient when doing in maintenance

2022-11-30 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641287#comment-17641287
 ] 

ASF GitHub Bot commented on HDFS-16809:
---

dingshun3016 commented on PR #5050:
URL: https://github.com/apache/hadoop/pull/5050#issuecomment-1332104291

   @tasanuma Thanks for your review, I have submitted the relevant test cases, 
please review it




> EC striped block is not sufficient when doing in maintenance
> 
>
> Key: HDFS-16809
> URL: https://issues.apache.org/jira/browse/HDFS-16809
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: ec, erasure-coding
>Reporter: dingshun
>Assignee: dingshun
>Priority: Major
>  Labels: pull-request-available
>
> When doing maintenance, ec striped block is not sufficient, which will lead 
> to miss block



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16855) Remove the redundant write lock in addBlockPool

2022-11-30 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17641171#comment-17641171
 ] 

Xiaoqiao He commented on HDFS-16855:


[~dingshun] Thanks for the detailed explanation. IIRC, at the beginning of 
DataNode's fine-grained lock, it indeed could cause dead lock here, hence it 
had fixed at the following PRs. Sorry I don't find the related PR now. Would 
you mind to check the logic branch trunk? welcome more feedback if any issue 
you meet. Thanks again. cc [~Aiphag0] Any more suggestions?

> Remove the redundant write lock in addBlockPool
> ---
>
> Key: HDFS-16855
> URL: https://issues.apache.org/jira/browse/HDFS-16855
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: dingshun
>Priority: Major
>  Labels: pull-request-available
>
> When patching the datanode's fine-grained lock, we found that the datanode 
> couldn't start,maybe happened deadlock,when addBlockPool, so we can remove it.
> {code:java}
> // getspaceused classname
>   
>     fs.getspaceused.classname
>     
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed
>    {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#addBlockPool
>  
> // get writeLock
> @Override
> public void addBlockPool(String bpid, Configuration conf)
> throws IOException {
>   LOG.info("Adding block pool " + bpid);
>   AddBlockPoolException volumeExceptions = new AddBlockPoolException();
>   try (AutoCloseableLock lock = lockManager.writeLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> try {
>   volumes.addBlockPool(bpid, conf);
> } catch (AddBlockPoolException e) {
>   volumeExceptions.mergeException(e);
> }
> volumeMap.initBlockPool(bpid);
> Set vols = storageMap.keySet();
> for (String v : vols) {
>   lockManager.addLock(LockLevel.VOLUME, bpid, v);
> }
>   }
>  
> } {code}
> {code:java}
> // 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl#deepCopyReplica
> // need readLock
> void replicas(String bpid, Consumer> consumer) {
>   LightWeightResizableGSet m = null;
>   try (AutoCloseDataSetLock l = lockManager.readLock(LockLevel.BLOCK_POOl, 
> bpid)) {
> m = map.get(bpid);
> if (m !=null) {
>   m.getIterator(consumer);
> }
>   }
> } {code}
>  
> because it is not the same thread, so the write lock cannot be downgraded to 
> a read lock
> {code:java}
> void addBlockPool(final String bpid, final Configuration conf) throws 
> IOException {
>   long totalStartTime = Time.monotonicNow();
>   final Map unhealthyDataDirs =
>   new ConcurrentHashMap();
>   List blockPoolAddingThreads = new ArrayList();
>   for (final FsVolumeImpl v : volumes) {
> Thread t = new Thread() {
>   public void run() {
> try (FsVolumeReference ref = v.obtainReference()) {
>   FsDatasetImpl.LOG.info("Scanning block pool " + bpid +
>   " on volume " + v + "...");
>   long startTime = Time.monotonicNow();
>   v.addBlockPool(bpid, conf);
>   long timeTaken = Time.monotonicNow() - startTime;
>   FsDatasetImpl.LOG.info("Time taken to scan block pool " + bpid +
>   " on " + v + ": " + timeTaken + "ms");
> } catch (IOException ioe) {
>   FsDatasetImpl.LOG.info("Caught exception while scanning " + v +
>   ". Will throw later.", ioe);
>   unhealthyDataDirs.put(v, ioe);
> }
>   }
> };
> blockPoolAddingThreads.add(t);
> t.start();
>   }
>   for (Thread t : blockPoolAddingThreads) {
> try {
>   t.join();
> } catch (InterruptedException ie) {
>   throw new IOException(ie);
> }
>   }
> } {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org