[jira] [Updated] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit
[ https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15651:
---------------------------
    Attachment: HDFS-15651.002.patch

> Client could not obtain block when DN CommandProcessingThread exit
> ------------------------------------------------------------------
>
>                 Key: HDFS-15651
>                 URL: https://issues.apache.org/jira/browse/HDFS-15651
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Yiqun Lin
>            Assignee: Aiphago
>            Priority: Major
>         Attachments: HDFS-15651.001.patch, HDFS-15651.002.patch, HDFS-15651.patch
>
>
> In our cluster, we applied the HDFS-14997 improvement.
> We found a case in which CommandProcessingThread exits because of an OOM error.
> The OOM error was caused by an abnormal application of ours running on this DN node.
> {noformat}
> 2020-10-18 10:27:12,604 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Command processor encountered fatal exception and exit.
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:717)
>         at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:173)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:222)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2005)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:671)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:617)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1247)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.access$1000(BPServiceActor.java:1194)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread$3.run(BPServiceActor.java:1299)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1221)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1208)
> {noformat}
> The main point is that a crashed CommandProcessingThread has a very bad impact: none of the commands in NN responses are processed on the DN side any more.
> We enabled block tokens for data access, but the DN command DNA_ACCESSKEYUPDATE was not processed in time by the DN. We then saw lots of SASL errors in the DN log caused by key expiration:
> {noformat}
> javax.security.sasl.SaslException: DIGEST-MD5: IO error acquiring password [Caused by org.apache.hadoop.security.token.SecretManager$InvalidToken: Can't re-compute password for block_token_identifier (expiryDate=xxx, keyId=xx, userId=xxx, blockPoolId=, blockId=xxx, access modes=[READ]), since the required block key (keyID=xxx) doesn't exist.]
> {noformat}
>
> On the client side, our users received lots of 'could not obtain block' errors with BlockMissingException.
> CommandProcessingThread is a critical thread; it should always be running.
> {code:java}
>   /**
>    * CommandProcessingThread that process commands asynchronously.
>    */
>   class CommandProcessingThread extends Thread {
>     private final BPServiceActor actor;
>     private final BlockingQueue queue;
> ...
>     @Override
>     public void run() {
>       try {
>         processQueue();
>       } catch (Throwable t) {
>         LOG.error("{} encountered fatal exception and exit.", getName(), t); // <=== should not exit this thread
>       }
>     }
> {code}
> When an unexpected error happens, better handling would be to either:
> * catch the exception, deal with the error appropriately and let processQueue continue to run
> or
> * exit the DN process so that an admin user can investigate

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
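The two handling options proposed in the report above can be sketched roughly as follows. This is only an illustration and not the Hadoop code or the attached patch: the class ResilientCommandProcessor, its fatalHandler callback and its counters are hypothetical names, and the split "continue on Exception, escalate on Error" is an assumption about how the two options could be combined.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/**
 * Hypothetical stand-in for a command-processing thread. It is NOT the
 * Hadoop BPServiceActor code; every name here is an assumption.
 */
class ResilientCommandProcessor extends Thread {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final Consumer<Throwable> fatalHandler;
  private final AtomicInteger processed = new AtomicInteger();
  private final AtomicInteger failed = new AtomicInteger();

  ResilientCommandProcessor(Consumer<Throwable> fatalHandler) {
    super("CommandProcessor");
    this.fatalHandler = fatalHandler;
    setDaemon(true);
  }

  void enqueue(Runnable command) {
    queue.add(command);
  }

  int getProcessed() {
    return processed.get();
  }

  int getFailed() {
    return failed.get();
  }

  /** Runs one command. Returns false when the processing loop must stop. */
  boolean processOne(Runnable command) {
    try {
      command.run();
      processed.incrementAndGet();
      return true;
    } catch (Exception e) {
      // First option: record the failure and keep processing later commands.
      failed.incrementAndGet();
      return true;
    } catch (Throwable t) {
      // Second option: an Error (for example OutOfMemoryError) is handed to
      // the fatal handler instead of letting the thread die silently.
      fatalHandler.accept(t);
      return false;
    }
  }

  @Override
  public void run() {
    try {
      while (processOne(queue.take())) {
        // Keep taking commands until processOne reports a fatal error.
      }
    } catch (InterruptedException e) {
      // Restore the interrupt flag and let the thread end.
      Thread.currentThread().interrupt();
    }
  }
}
```

In a real daemon the fatalHandler would be the place to stop the process so that an admin notices, for example by logging and then calling Runtime.getRuntime().halt(1). Which errors count as recoverable is a design decision that this sketch does not settle.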
[jira] [Updated] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit
[ https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15651:
---------------------------
    Attachment: HDFS-15651.001.patch

> Client could not obtain block when DN CommandProcessingThread exit
> ------------------------------------------------------------------
>
>                 Key: HDFS-15651
>                 URL: https://issues.apache.org/jira/browse/HDFS-15651
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Yiqun Lin
>            Assignee: Aiphago
>            Priority: Major
>         Attachments: HDFS-15651.001.patch, HDFS-15651.patch
[jira] [Commented] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit
[ https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17221284#comment-17221284 ]

Aiphago commented on HDFS-15651:
--------------------------------

[^HDFS-15651.patch]

I think it is better to let BPServiceActor exit when the error is caused by the hardware.

> Client could not obtain block when DN CommandProcessingThread exit
> ------------------------------------------------------------------
>
>                 Key: HDFS-15651
>                 URL: https://issues.apache.org/jira/browse/HDFS-15651
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Yiqun Lin
>            Priority: Major
>         Attachments: HDFS-15651.patch
[jira] [Updated] (HDFS-15651) Client could not obtain block when DN CommandProcessingThread exit
[ https://issues.apache.org/jira/browse/HDFS-15651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15651:
---------------------------
    Attachment: HDFS-15651.patch

> Client could not obtain block when DN CommandProcessingThread exit
> ------------------------------------------------------------------
>
>                 Key: HDFS-15651
>                 URL: https://issues.apache.org/jira/browse/HDFS-15651
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Yiqun Lin
>            Priority: Major
>         Attachments: HDFS-15651.patch
[jira] [Comment Edited] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196824#comment-17196824 ]

Aiphago edited comment on HDFS-15382 at 9/16/20, 9:35 AM:
----------------------------------------------------------

This is a sample based on our 2.7 version. The main logic is similar, but applying it to trunk needs some more work.

was (Author: aiphag0):
This is a sample base our 2.7 version, main logic is similar, but apply to trunck need some work todo.

> Split FsDatasetImpl from blockpool lock to blockpool volume lock
> ----------------------------------------------------------------
>
>                 Key: HDFS-15382
>                 URL: https://issues.apache.org/jira/browse/HDFS-15382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Major
>             Fix For: 3.2.1
>
>         Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, image-2020-06-03-1.png
>
>
> In HDFS-15180 we split the lock down to blockpool granularity. But when one volume is under heavy load, it still blocks other requests that are in the same blockpool but on a different volume. So we split the lock into two levels to avoid this and to improve datanode performance.
[jira] [Comment Edited] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196824#comment-17196824 ]

Aiphago edited comment on HDFS-15382 at 9/16/20, 9:35 AM:
----------------------------------------------------------

This is a sample base our 2.7 version, main logic is similar, but apply to trunck need some work todo.

was (Author: aiphag0):
This is a sample base our 2.7 version, mian logic is similar, but apply to trunck need some work todo.

> Split FsDatasetImpl from blockpool lock to blockpool volume lock
> ----------------------------------------------------------------
>
>                 Key: HDFS-15382
>                 URL: https://issues.apache.org/jira/browse/HDFS-15382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Major
>             Fix For: 3.2.1
>
>         Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, image-2020-06-03-1.png
[jira] [Comment Edited] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196824#comment-17196824 ]

Aiphago edited comment on HDFS-15382 at 9/16/20, 9:33 AM:
----------------------------------------------------------

This is a sample base our 2.7 version, mian logic is similar, but apply to trunck need some work todo.

was (Author: aiphag0):
This is a sample base our 2.7 version, mian logic is similar, but apply to trunck need some wrok todo.

> Split FsDatasetImpl from blockpool lock to blockpool volume lock
> ----------------------------------------------------------------
>
>                 Key: HDFS-15382
>                 URL: https://issues.apache.org/jira/browse/HDFS-15382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Major
>             Fix For: 3.2.1
>
>         Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, image-2020-06-03-1.png
[jira] [Updated] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15382:
---------------------------
    Attachment: HDFS-15382-sample.patch

> Split FsDatasetImpl from blockpool lock to blockpool volume lock
> ----------------------------------------------------------------
>
>                 Key: HDFS-15382
>                 URL: https://issues.apache.org/jira/browse/HDFS-15382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Major
>             Fix For: 3.2.1
>
>         Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, image-2020-06-03-1.png
[jira] [Commented] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17196824#comment-17196824 ]

Aiphago commented on HDFS-15382:
--------------------------------

This is a sample base our 2.7 version, mian logic is similar, but apply to trunck need some wrok todo.

> Split FsDatasetImpl from blockpool lock to blockpool volume lock
> ----------------------------------------------------------------
>
>                 Key: HDFS-15382
>                 URL: https://issues.apache.org/jira/browse/HDFS-15382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Major
>             Fix For: 3.2.1
>
>         Attachments: HDFS-15382-sample.patch, image-2020-06-02-1.png, image-2020-06-03-1.png
[jira] [Comment Edited] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124767#comment-17124767 ]

Aiphago edited comment on HDFS-15382 at 6/3/20, 8:45 AM:
---------------------------------------------------------

!image-2020-06-03-1.png|width=628,height=378!

The simple lock model looks like this. Parts of it are implemented as follows:
 * finalizeReplica(), append() and createRbw() first take the BlockPoolLock read lock, and then take the BlockPoolLock-volume-lock write lock.
 * getStoredBlock() and getMetaDataInputStream() first take the BlockPoolLock read lock, and then take the BlockPoolLock-volume-lock read lock.
 * deepCopyReplica() and getBlockReports() take the BlockPoolLock read lock.
 * Delete operations hold the BlockPoolLock write lock.
 * The GSet inside replicaMap is changed to a synchronized one to make it thread safe. replicaMap itself is the same as in HDFS-15180 and only uses the blockpool lock.

was (Author: aiphag0):
!image-2020-06-03-1.png|width=628,height=378!

The simple lock model like this.Parts of implemented as follows
 # As for finalizeReplica(),append(),createRbw()First get BlockPoolLock read lock,and then get BlockPoolLock-volume-lock write lock.
 # As for getStoredBlock(),getMetaDataInputStream()First get BlockPoolLock read lock,and the then get BlockPoolLock-volume-lock read lock.
 # As for deepCopyReplica(),getBlockReports() get the BlockPoolLock read lock.
 # As for delete hold the BlockPoolLock write lock.
 # The change of replicaMap's Gset change to sync to make thread safe.And replicaMap itself is the same as HDFS-15180 only control blockpool lock

> Split FsDatasetImpl from blockpool lock to blockpool volume lock
> ----------------------------------------------------------------
>
>                 Key: HDFS-15382
>                 URL: https://issues.apache.org/jira/browse/HDFS-15382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Major
>             Fix For: 3.2.1
>
>         Attachments: image-2020-06-02-1.png, image-2020-06-03-1.png
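The lock ordering described in the comment above (blockpool lock first, then the per-volume lock) can be illustrated with a small sketch. It is not taken from HDFS-15382-sample.patch: the class TwoLevelLockManager, its method names and the use of plain strings for blockpool and volume ids are all assumptions made for the illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/** Hypothetical two-level lock holder; not the FsDatasetImpl implementation. */
class TwoLevelLockManager {
  private final Map<String, ReentrantReadWriteLock> blockPoolLocks = new ConcurrentHashMap<>();
  private final Map<String, ReentrantReadWriteLock> volumeLocks = new ConcurrentHashMap<>();

  ReentrantReadWriteLock blockPoolLock(String bpid) {
    return blockPoolLocks.computeIfAbsent(bpid, k -> new ReentrantReadWriteLock());
  }

  ReentrantReadWriteLock volumeLock(String bpid, String volume) {
    return volumeLocks.computeIfAbsent(bpid + "|" + volume, k -> new ReentrantReadWriteLock());
  }

  /** Write path (like createRbw): blockpool read lock, then volume write lock. */
  <T> T withVolumeWrite(String bpid, String volume, Supplier<T> op) {
    ReentrantReadWriteLock bp = blockPoolLock(bpid);
    ReentrantReadWriteLock vol = volumeLock(bpid, volume);
    bp.readLock().lock();
    try {
      vol.writeLock().lock();
      try {
        return op.get();
      } finally {
        vol.writeLock().unlock();
      }
    } finally {
      bp.readLock().unlock();
    }
  }

  /** Read path (like getStoredBlock): blockpool read lock, then volume read lock. */
  <T> T withVolumeRead(String bpid, String volume, Supplier<T> op) {
    ReentrantReadWriteLock bp = blockPoolLock(bpid);
    ReentrantReadWriteLock vol = volumeLock(bpid, volume);
    bp.readLock().lock();
    try {
      vol.readLock().lock();
      try {
        return op.get();
      } finally {
        vol.readLock().unlock();
      }
    } finally {
      bp.readLock().unlock();
    }
  }

  /** Blockpool-wide read (like getBlockReports): blockpool read lock only. */
  <T> T withBlockPoolRead(String bpid, Supplier<T> op) {
    ReentrantReadWriteLock bp = blockPoolLock(bpid);
    bp.readLock().lock();
    try {
      return op.get();
    } finally {
      bp.readLock().unlock();
    }
  }

  /** Blockpool-wide write (like delete): blockpool write lock excludes every volume user. */
  <T> T withBlockPoolWrite(String bpid, Supplier<T> op) {
    ReentrantReadWriteLock bp = blockPoolLock(bpid);
    bp.writeLock().lock();
    try {
      return op.get();
    } finally {
      bp.writeLock().unlock();
    }
  }
}
```

Because every volume-level operation holds only the blockpool read lock, writers on different volumes of the same blockpool do not wait for each other, while an operation holding the blockpool write lock still waits for all of them to finish. Always taking the blockpool lock before the volume lock keeps the ordering consistent.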
[jira] [Commented] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124767#comment-17124767 ]

Aiphago commented on HDFS-15382:
--------------------------------

!image-2020-06-03-1.png|width=628,height=378!

The simple lock model like this.Parts of implemented as follows
 # As for finalizeReplica(),append(),createRbw()First get BlockPoolLock read lock,and then get BlockPoolLock-volume-lock write lock.
 # As for getStoredBlock(),getMetaDataInputStream()First get BlockPoolLock read lock,and the then get BlockPoolLock-volume-lock read lock.
 # As for deepCopyReplica(),getBlockReports() get the BlockPoolLock read lock.
 # As for delete hold the BlockPoolLock write lock.
 # The change of replicaMap's Gset change to sync to make thread safe.And replicaMap itself is the same as HDFS-15180 only control blockpool lock

> Split FsDatasetImpl from blockpool lock to blockpool volume lock
> ----------------------------------------------------------------
>
>                 Key: HDFS-15382
>                 URL: https://issues.apache.org/jira/browse/HDFS-15382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Major
>             Fix For: 3.2.1
>
>         Attachments: image-2020-06-02-1.png, image-2020-06-03-1.png
[jira] [Updated] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15382:
---------------------------
    Attachment: image-2020-06-03-1.png

> Split FsDatasetImpl from blockpool lock to blockpool volume lock
> ----------------------------------------------------------------
>
>                 Key: HDFS-15382
>                 URL: https://issues.apache.org/jira/browse/HDFS-15382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Major
>             Fix For: 3.2.1
>
>         Attachments: image-2020-06-02-1.png, image-2020-06-03-1.png
[jira] [Updated] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15382:
---------------------------
    Attachment:     (was: 2020-06-03.png)

> Split FsDatasetImpl from blockpool lock to blockpool volume lock
> ----------------------------------------------------------------
>
>                 Key: HDFS-15382
>                 URL: https://issues.apache.org/jira/browse/HDFS-15382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Major
>             Fix For: 3.2.1
>
>         Attachments: image-2020-06-02-1.png
[jira] [Updated] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15382:
---------------------------
    Attachment: 2020-06-03.png

> Split FsDatasetImpl from blockpool lock to blockpool volume lock
> ----------------------------------------------------------------
>
>                 Key: HDFS-15382
>                 URL: https://issues.apache.org/jira/browse/HDFS-15382
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Major
>             Fix For: 3.2.1
>
>         Attachments: 2020-06-03.png, image-2020-06-02-1.png
[jira] [Comment Edited] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124736#comment-17124736 ] Aiphago edited comment on HDFS-15382 at 6/3/20, 8:12 AM: - {quote}it is confused with {{ReplicaCachingGetSpaceUsed}}, IIUC, {{ReplicaCachingGetSpaceUsed}} is calculated in memory directly rather than sync info from disk, right? so why is it related to this changes? Also the log print is based on our internal version rather than branch trunk, some notes could be better. {quote} ReplicaCachingGetSpaceUsed copy replica from FsDataSetImpl most of time is spend in wait FsDataSetImpl lock,so 'Copy replica infos' time spend can reflect the time wait for the lock. was (Author: aiphag0): {quote}it is confused with {{ReplicaCachingGetSpaceUsed}}, IIUC, {{ReplicaCachingGetSpaceUsed}} is calculated in memory directly rather than sync info from disk, right? so why is it related to this changes? Also the log print is based on our internal version rather than branch trunk, some notes could be better. {quote} {{ReplicaCachingGetSpaceUsed copy replica from FsDataSetImpl most of time is spend in wait }}{{FsDataSetImpl lock,so 'Copy replica infos' time spend can reflect the time wait for the lock.}}{{}} > Split FsDatasetImpl from blockpool lock to blockpool volume lock > - > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > Attachments: image-2020-06-02-1.png > > > In HDFS-15180 we split lock to blockpool grain size.But when one volume is in > heavy load and will block other request which in same blockpool but different > volume.So we split lock to two leval to avoid this happend.And to improve > datanode performance. 
[jira] [Commented] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17124736#comment-17124736 ] Aiphago commented on HDFS-15382: {quote}it is confused with {{ReplicaCachingGetSpaceUsed}}, IIUC, {{ReplicaCachingGetSpaceUsed}} is calculated in memory directly rather than sync info from disk, right? so why is it related to this changes? Also the log print is based on our internal version rather than branch trunk, some notes could be better. {quote} {{ReplicaCachingGetSpaceUsed}} copies the replica infos from {{FsDatasetImpl}}, and most of that time is spent waiting for the {{FsDatasetImpl}} lock, so the 'Copy replica infos' duration reflects the time spent waiting for the lock. > Split FsDatasetImpl from blockpool lock to blockpool volume lock > - > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > Attachments: image-2020-06-02-1.png > > > In HDFS-15180 we split the lock to block pool granularity. But when one volume is under heavy load, it blocks other requests that are in the same block pool but on a different volume. So we split the lock into two levels to avoid this and to improve DataNode performance.
[jira] [Updated] (HDFS-15382) Split FsDatasetImpl from blockpool lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15382: --- Summary: Split FsDatasetImpl from blockpool lock to blockpool volume lock (was: Split DataNode FsDatasetImpl lock to blockpool volume lock ) > Split FsDatasetImpl from blockpool lock to blockpool volume lock > - > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > Attachments: image-2020-06-02-1.png > > > In HDFS-15180 we split lock to blockpool grain size.But when one volume is in > heavy load and will block other request which in same blockpool but different > volume.So we split lock to two leval to avoid this happend.And to improve > datanode performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15382) Split DataNode FsDatasetImpl lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123379#comment-17123379 ] Aiphago commented on HDFS-15382: After the improvement, the copy time of our in-cache du is very low. We also improved it to copy only the replicas in each BlockPoolSlice.
{code:java}
2020-06-02 12:44:16,586 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed: Copy replica infos, blockPoolId: BP-xxx, replicas size: 665, duration: 0ms
2020-06-02 12:44:16,586 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed: Refresh dfs used, bpid: BP-xxx,replicas size: 665, dfsUsed: 15925882188 on volume: DS-4f1f820a-460f-4fa9-89be-49caed604a52, duration: 0ms , isopen hardlink false
2020-06-02 12:44:16,586 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed: Copy replica infos, blockPoolId: BP-xxx, replicas size: 699, duration: 1ms
2020-06-02 12:44:16,586 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed: Copy replica infos, blockPoolId: BP-xxx, replicas size: 698, duration: 1ms
2020-06-02 12:44:16,587 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed: Copy replica infos, blockPoolId: BP-xxx, replicas size: 638, duration: 0ms
2020-06-02 12:44:16,587 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed: Refresh dfs used, bpid: BP-xxx, replicas size: 638, dfsUsed: 16519661992 on volume: DS-b2eb6423-d0bd-493e-a102-d317e55815ce, duration: 0ms , isopen hardlink false
2020-06-02 12:44:16,588 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed: Copy replica infos, blockPoolId: BP-xxx, replicas size: 644, duration: 0ms
2020-06-02 12:44:16,588 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed: Refresh dfs used, bpid: BP-xxx, replicas size: 644, dfsUsed: 16636348641 on volume: DS-83a2deeb-2389-4036-9f13-df61fc6b35f6, duration: 0ms , isopen hardlink false
2020-06-02 12:44:16,588 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed: Copy replica infos, BP-xxx, replicas size: 663, duration: 0ms
{code}
> Split DataNode FsDatasetImpl lock to blockpool volume lock > --- > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > Attachments: image-2020-06-02-1.png > > > In HDFS-15180 we split the lock to block pool granularity. But when one volume is under heavy load, it blocks other requests that are in the same block pool but on a different volume. So we split the lock into two levels to avoid this and to improve DataNode performance.
[jira] [Commented] (HDFS-15382) Split DataNode FsDatasetImpl lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17123372#comment-17123372 ] Aiphago commented on HDFS-15382: We have been running this patch in our production cluster for some days. Here are the metrics from some randomly chosen DataNodes. We added the metric before upgrading to this patch. The unit is ms. !image-2020-06-02-1.png|width=923,height=233! > Split DataNode FsDatasetImpl lock to blockpool volume lock > --- > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > Attachments: image-2020-06-02-1.png > > > In HDFS-15180 we split the lock to block pool granularity. But when one volume is under heavy load, it blocks other requests that are in the same block pool but on a different volume. So we split the lock into two levels to avoid this and to improve DataNode performance.
[jira] [Updated] (HDFS-15382) Split DataNode FsDatasetImpl lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15382: --- Attachment: image-2020-06-02-1.png > Split DataNode FsDatasetImpl lock to blockpool volume lock > --- > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > Attachments: image-2020-06-02-1.png > > > In HDFS-15180 we split lock to blockpool grain size.But when one volume is in > heavy load and will block other request which in same blockpool but different > volume.So we split lock to two leval to avoid this happend.And to improve > datanode performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15382) Split DataNode FsDatasetImpl lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15382: --- Affects Version/s: (was: 3.2.1) (was: 2.9.2) > Split DataNode FsDatasetImpl lock to blockpool volume lock > --- > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > > In HDFS-15180 we split lock to blockpool grain size.But when one volume is in > heavy load and will block other request which in same blockpool but different > volume.So we split lock to two leval to avoid this happend.And to improve > datanode performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15382) Split DataNode FsDatasetImpl lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15382: --- Target Version/s: (was: 3.2.0, 2.9.2) > Split DataNode FsDatasetImpl lock to blockpool volume lock > --- > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 2.9.2, 3.2.1 >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > > In HDFS-15180 we split lock to blockpool grain size.But when one volume is in > heavy load and will block other request which in same blockpool but different > volume.So we split lock to two leval to avoid this happend.And to improve > datanode performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15382) Split DataNode FsDatasetImpl lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15382: --- Attachment: (was: image-2020-06-02.png) > Split DataNode FsDatasetImpl lock to blockpool volume lock > --- > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 2.9.2, 3.2.1 >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > > In HDFS-15180 we split lock to blockpool grain size.But when one volume is in > heavy load and will block other request which in same blockpool but different > volume.So we split lock to two leval to avoid this happend.And to improve > datanode performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15382) Split DataNode FsDatasetImpl lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15382: --- Attachment: image-2020-06-02.png > Split DataNode FsDatasetImpl lock to blockpool volume lock > --- > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 2.9.2, 3.2.1 >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > Attachments: image-2020-06-02.png > > > In HDFS-15180 we split lock to blockpool grain size.But when one volume is in > heavy load and will block other request which in same blockpool but different > volume.So we split lock to two leval to avoid this happend.And to improve > datanode performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Work started] (HDFS-15382) Split DataNode FsDatasetImpl lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on HDFS-15382 started by Aiphago. -- > Split DataNode FsDatasetImpl lock to blockpool volume lock > --- > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > > In HDFS-15180 we split lock to blockpool grain size.But when one volume is in > heavy load and will block other request which in same blockpool but different > volume.So we split lock to two leval to avoid this happend.And to improve > datanode performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15382) Split DataNode FsDatasetImpl lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15382: --- Affects Version/s: 2.9.2 3.2.1 > Split DataNode FsDatasetImpl lock to blockpool volume lock > --- > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 2.9.2, 3.2.1 >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > > In HDFS-15180 we split lock to blockpool grain size.But when one volume is in > heavy load and will block other request which in same blockpool but different > volume.So we split lock to two leval to avoid this happend.And to improve > datanode performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15382) Split DataNode FsDatasetImpl lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15382: --- Fix Version/s: 2.9.2 3.2.1 > Split DataNode FsDatasetImpl lock to blockpool volume lock > --- > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 2.9.2, 3.2.1 > > > In HDFS-15180 we split lock to blockpool grain size.But when one volume is in > heavy load and will block other request which in same blockpool but different > volume.So we split lock to two leval to avoid this happend.And to improve > datanode performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15382) Split DataNode FsDatasetImpl lock to blockpool volume lock
[ https://issues.apache.org/jira/browse/HDFS-15382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15382: --- Fix Version/s: (was: 2.9.2) > Split DataNode FsDatasetImpl lock to blockpool volume lock > --- > > Key: HDFS-15382 > URL: https://issues.apache.org/jira/browse/HDFS-15382 > Project: Hadoop HDFS > Issue Type: Improvement >Reporter: Aiphago >Assignee: Aiphago >Priority: Major > Fix For: 3.2.1 > > > In HDFS-15180 we split lock to blockpool grain size.But when one volume is in > heavy load and will block other request which in same blockpool but different > volume.So we split lock to two leval to avoid this happend.And to improve > datanode performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-15382) Split DataNode FsDatasetImpl lock to blockpool volume lock
Aiphago created HDFS-15382: -- Summary: Split DataNode FsDatasetImpl lock to blockpool volume lock Key: HDFS-15382 URL: https://issues.apache.org/jira/browse/HDFS-15382 Project: Hadoop HDFS Issue Type: Improvement Reporter: Aiphago Assignee: Aiphago In HDFS-15180 we split the lock to block pool granularity. But when one volume is under heavy load, it blocks other requests that are in the same block pool but on a different volume. So we split the lock into two levels to avoid this and to improve DataNode performance.
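The two-level locking described in this issue can be sketched roughly as follows. This is only an illustration of the idea, not the patch attached to HDFS-15382: the class TwoLevelLockSketch and all of its method names are hypothetical, and it assumes one read-write lock per block pool plus one per (block pool, volume) pair. Volume-scoped work takes the pool lock in shared mode and its own volume lock in exclusive mode, so a busy volume should not block requests on another volume of the same block pool.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

/** Hypothetical two-level lock: block pool level first, then volume level. */
class TwoLevelLockSketch {
  private final ConcurrentHashMap<String, ReentrantReadWriteLock> poolLocks =
      new ConcurrentHashMap<>();
  private final ConcurrentHashMap<String, ReentrantReadWriteLock> volumeLocks =
      new ConcurrentHashMap<>();

  private ReentrantReadWriteLock poolLock(String bpid) {
    return poolLocks.computeIfAbsent(bpid, k -> new ReentrantReadWriteLock());
  }

  private ReentrantReadWriteLock volumeLock(String bpid, String volume) {
    return volumeLocks.computeIfAbsent(bpid + "/" + volume,
        k -> new ReentrantReadWriteLock());
  }

  /** Volume-scoped work: pool lock shared, volume lock exclusive. */
  <T> T withVolumeWriteLock(String bpid, String volume, Supplier<T> work) {
    ReentrantReadWriteLock pool = poolLock(bpid);
    ReentrantReadWriteLock vol = volumeLock(bpid, volume);
    pool.readLock().lock();
    try {
      vol.writeLock().lock();
      try {
        return work.get();
      } finally {
        vol.writeLock().unlock();
      }
    } finally {
      pool.readLock().unlock();
    }
  }

  /** Pool-wide work: pool lock exclusive, which excludes all volume-scoped work. */
  <T> T withPoolWriteLock(String bpid, Supplier<T> work) {
    ReentrantReadWriteLock pool = poolLock(bpid);
    pool.writeLock().lock();
    try {
      return work.get();
    } finally {
      pool.writeLock().unlock();
    }
  }

  /** Demo: while vol-a is held by another thread, vol-b is still usable. */
  static boolean differentVolumesDoNotBlock() {
    TwoLevelLockSketch locks = new TwoLevelLockSketch();
    CountDownLatch holding = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    Thread busy = new Thread(() -> {
      locks.<Void>withVolumeWriteLock("BP-1", "vol-a", () -> {
        holding.countDown();
        try {
          release.await(); // simulate a volume under heavy load
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
        return null;
      });
    });
    busy.start();
    try {
      holding.await();
      return locks.withVolumeWriteLock("BP-1", "vol-b", () -> true);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      release.countDown();
    }
  }
}
```

Taking the pool lock before the volume lock on every path keeps the acquisition order consistent, which is one way to avoid introducing a lock-order inversion between the two levels.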
[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
[ https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080192#comment-17080192 ] Aiphago commented on HDFS-15180: Hi [~zhuqi], thanks for your feedback. I think the GenerationStamp may have changed before we split the block pool lock. Also, our version uses a write lock in DataNode.transferReplicaForPipelineRecovery; this differs from the patch and may be related to this problem. > DataNode FsDatasetImpl Fine-Grained Locking via BlockPool. > --- > > Key: HDFS-15180 > URL: https://issues.apache.org/jira/browse/HDFS-15180 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: Aiphago >Priority: Major > Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, HDFS-15180.003.patch, HDFS-15180.004.patch, image-2020-03-10-17-22-57-391.png, image-2020-03-10-17-31-58-830.png, image-2020-03-10-17-34-26-368.png, image-2020-04-09-11-20-36-459.png > > > Now the FsDatasetImpl datasetLock is heavy when there are many namespaces in a big cluster. We could split the FsDatasetImpl datasetLock by block pool.
[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
[ https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17065536#comment-17065536 ] Aiphago commented on HDFS-15180: ping [~sodonnell] , [~linyiqun], [~weichiu] Any advice ?Thanks. > DataNode FsDatasetImpl Fine-Grained Locking via BlockPool. > --- > > Key: HDFS-15180 > URL: https://issues.apache.org/jira/browse/HDFS-15180 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: Aiphago >Priority: Major > Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, > HDFS-15180.003.patch, HDFS-15180.004.patch, > image-2020-03-10-17-22-57-391.png, image-2020-03-10-17-31-58-830.png, > image-2020-03-10-17-34-26-368.png > > > Now the FsDatasetImpl datasetLock is heavy, when their are many namespaces in > big cluster. If we can split the FsDatasetImpl datasetLock via blockpool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
[ https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17064192#comment-17064192 ] Aiphago commented on HDFS-15180: Fix the problem in UT > DataNode FsDatasetImpl Fine-Grained Locking via BlockPool. > --- > > Key: HDFS-15180 > URL: https://issues.apache.org/jira/browse/HDFS-15180 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: Aiphago >Priority: Major > Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, > HDFS-15180.003.patch, HDFS-15180.004.patch, > image-2020-03-10-17-22-57-391.png, image-2020-03-10-17-31-58-830.png, > image-2020-03-10-17-34-26-368.png > > > Now the FsDatasetImpl datasetLock is heavy, when their are many namespaces in > big cluster. If we can split the FsDatasetImpl datasetLock via blockpool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
[ https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15180: --- Attachment: HDFS-15180.004.patch > DataNode FsDatasetImpl Fine-Grained Locking via BlockPool. > --- > > Key: HDFS-15180 > URL: https://issues.apache.org/jira/browse/HDFS-15180 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: Aiphago >Priority: Major > Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, > HDFS-15180.003.patch, HDFS-15180.004.patch, > image-2020-03-10-17-22-57-391.png, image-2020-03-10-17-31-58-830.png, > image-2020-03-10-17-34-26-368.png > > > Now the FsDatasetImpl datasetLock is heavy, when their are many namespaces in > big cluster. If we can split the FsDatasetImpl datasetLock via blockpool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
[ https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063806#comment-17063806 ] Aiphago commented on HDFS-15180: ok,I'll fix later > DataNode FsDatasetImpl Fine-Grained Locking via BlockPool. > --- > > Key: HDFS-15180 > URL: https://issues.apache.org/jira/browse/HDFS-15180 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: Aiphago >Priority: Major > Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, > HDFS-15180.003.patch, image-2020-03-10-17-22-57-391.png, > image-2020-03-10-17-31-58-830.png, image-2020-03-10-17-34-26-368.png > > > Now the FsDatasetImpl datasetLock is heavy, when their are many namespaces in > big cluster. If we can split the FsDatasetImpl datasetLock via blockpool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
[ https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15180: --- Attachment: HDFS-15180.003.patch > DataNode FsDatasetImpl Fine-Grained Locking via BlockPool. > --- > > Key: HDFS-15180 > URL: https://issues.apache.org/jira/browse/HDFS-15180 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: Aiphago >Priority: Major > Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, > HDFS-15180.003.patch, image-2020-03-10-17-22-57-391.png, > image-2020-03-10-17-31-58-830.png, image-2020-03-10-17-34-26-368.png > > > Now the FsDatasetImpl datasetLock is heavy, when their are many namespaces in > big cluster. If we can split the FsDatasetImpl datasetLock via blockpool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
[ https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17063204#comment-17063204 ] Aiphago commented on HDFS-15180: Hi [~zhuqi], thanks for the valuable suggestions. Changed the lock style to use try() without finally{}. Changed transferReplicaForPipelineRecovery to use a read lock. Waiting for the UT result. [^HDFS-15180.002.patch] > DataNode FsDatasetImpl Fine-Grained Locking via BlockPool. > --- > > Key: HDFS-15180 > URL: https://issues.apache.org/jira/browse/HDFS-15180 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: Aiphago >Priority: Major > Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, image-2020-03-10-17-22-57-391.png, image-2020-03-10-17-31-58-830.png, image-2020-03-10-17-34-26-368.png > > > Now the FsDatasetImpl datasetLock is heavy when there are many namespaces in a big cluster. We could split the FsDatasetImpl datasetLock by block pool.
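The "try() without finally{}" lock style mentioned in this comment is presumably Java's try-with-resources applied to a lock wrapper. A minimal stand-alone sketch of that pattern is shown below. Hadoop ships its own lock utility for this purpose; the class here, AutoCloseLockSketch, is a hypothetical stand-in written only for illustration and is not taken from the patch.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical wrapper that lets a Lock be released by try-with-resources. */
class AutoCloseLockSketch implements AutoCloseable {
  private final Lock lock;

  AutoCloseLockSketch(Lock lock) {
    this.lock = lock;
  }

  /** Acquires the lock and returns this wrapper for use in try(...). */
  AutoCloseLockSketch acquire() {
    lock.lock();
    return this;
  }

  /** Called automatically at the end of the try block, even on exceptions. */
  @Override
  public void close() {
    lock.unlock();
  }

  /** Reads a value under the read lock without an explicit finally block. */
  static int readUnderLock(ReentrantReadWriteLock rw, int[] value) {
    AutoCloseLockSketch readLock = new AutoCloseLockSketch(rw.readLock());
    try (AutoCloseLockSketch ignored = readLock.acquire()) {
      return value[0];
    }
  }

  /** Shows that the write lock is released even when the body throws. */
  static boolean releasedAfterException(ReentrantReadWriteLock rw) {
    AutoCloseLockSketch writeLock = new AutoCloseLockSketch(rw.writeLock());
    try (AutoCloseLockSketch ignored = writeLock.acquire()) {
      throw new IllegalStateException("simulated failure inside the lock");
    } catch (IllegalStateException expected) {
      return !rw.isWriteLocked();
    }
  }
}
```

The resource is closed before any catch clause runs, so by the time the exception is handled the lock has already been released.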
[jira] [Updated] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
[ https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15180: --- Attachment: HDFS-15180.002.patch > DataNode FsDatasetImpl Fine-Grained Locking via BlockPool. > --- > > Key: HDFS-15180 > URL: https://issues.apache.org/jira/browse/HDFS-15180 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: Aiphago >Priority: Major > Attachments: HDFS-15180.001.patch, HDFS-15180.002.patch, > image-2020-03-10-17-22-57-391.png, image-2020-03-10-17-31-58-830.png, > image-2020-03-10-17-34-26-368.png > > > Now the FsDatasetImpl datasetLock is heavy, when their are many namespaces in > big cluster. If we can split the FsDatasetImpl datasetLock via blockpool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
[ https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15180: --- Attachment: HDFS-15180.001.patch > DataNode FsDatasetImpl Fine-Grained Locking via BlockPool. > --- > > Key: HDFS-15180 > URL: https://issues.apache.org/jira/browse/HDFS-15180 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: HDFS-15180.001.patch, image-2020-03-10-17-22-57-391.png, > image-2020-03-10-17-31-58-830.png, image-2020-03-10-17-34-26-368.png > > > Now the FsDatasetImpl datasetLock is heavy, when their are many namespaces in > big cluster. If we can split the FsDatasetImpl datasetLock via blockpool. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15180) DataNode FsDatasetImpl Fine-Grained Locking via BlockPool.
[ https://issues.apache.org/jira/browse/HDFS-15180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17059199#comment-17059199 ] Aiphago commented on HDFS-15180: Hi [~zhuqi], thanks for your proposal. We split the dataset lock in our earlier version, based on about 2.7, and have run it as a gray deployment in our production cluster for weeks. It looks like a good improvement in our version. But the trunk version differs greatly from our version, so there is a lot of work to do. Our idea is, first, to split the lock by block pool; second, to split each block pool lock into volume locks; third, to remove the remaining IO inside the lock, as HDFS-15000 says. If you are interested in this, we can do it together. Here is the demo patch; it may have some problems. > DataNode FsDatasetImpl Fine-Grained Locking via BlockPool. > --- > > Key: HDFS-15180 > URL: https://issues.apache.org/jira/browse/HDFS-15180 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.2.0 >Reporter: zhuqi >Assignee: zhuqi >Priority: Major > Attachments: image-2020-03-10-17-22-57-391.png, image-2020-03-10-17-31-58-830.png, image-2020-03-10-17-34-26-368.png > > > Now the FsDatasetImpl datasetLock is heavy when there are many namespaces in a big cluster. We could split the FsDatasetImpl datasetLock by block pool.
[jira] [Commented] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002146#comment-17002146 ] Aiphago commented on HDFS-15068: Renewed the patch: changed log.debug to e.printStackTrace(). [^HDFS-15068.005.patch] > DataNode could meet deadlock if invoke refreshVolumes when register > --- > > Key: HDFS-15068 > URL: https://issues.apache.org/jira/browse/HDFS-15068 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Xiaoqiao He >Assignee: Aiphago >Priority: Critical > Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch, HDFS-15068.003.patch, HDFS-15068.004.patch, HDFS-15068.005.patch > > > DataNode can deadlock when `dfsadmin -reconfig datanode ip:host start` is invoked to trigger #refreshVolumes. 1. DataNode#refreshVolumes first holds the datanode instance ownable {{synchronizer}} on entering the method, then tries to hold the BPOfferService {{readlock}} in `bpos.getNamespaceInfo()` in the following code segment.
{code:java}
for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
  nsInfos.add(bpos.getNamespaceInfo());
}
{code}
2. BPOfferService#registrationSucceeded (which is invoked by #register when the DataNode starts or by #reregister in processCommandFromActor) first holds the BPOfferService {{writelock}}, then tries to hold the datanode instance ownable {{synchronizer}} in the following method.
{code:java}
synchronized void bpRegistrationSucceeded(DatanodeRegistration bpRegistration,
    String blockPoolId) throws IOException {
  id = bpRegistration;
  if(!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) {
    throw new IOException("Inconsistent Datanode IDs. Name-node returned "
        + bpRegistration.getDatanodeUuid()
        + ". Expecting " + storage.getDatanodeUuid());
  }

  registerBlockPoolWithSecretManager(bpRegistration, blockPoolId);
}
{code}
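The two paths in this report take the same two locks in opposite order, which is the classic lock-order inversion. One common remedy is to make both paths agree on an order, for example by reading the namespace info before entering the monitor. The sketch below illustrates only that general technique; it is not necessarily what the attached HDFS-15068 patches do, and the class LockOrderSketch with its fields and methods is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical model of the two code paths with a consistent lock order. */
class LockOrderSketch {
  private final Object dataNodeMonitor = new Object();
  private final ReentrantReadWriteLock bposLock = new ReentrantReadWriteLock();
  private String namespaceInfo = "ns-0";
  private int registrations = 0;

  /** Path 1: read namespace info first, and only then enter the monitor. */
  List<String> refreshVolumes() {
    List<String> nsInfos = new ArrayList<>();
    bposLock.readLock().lock();
    try {
      nsInfos.add(namespaceInfo);
    } finally {
      bposLock.readLock().unlock();
    }
    synchronized (dataNodeMonitor) {
      // Volume changes would happen here, with no BPOfferService lock held.
      return nsInfos;
    }
  }

  /** Path 2: write lock first, then the monitor, as in the report. */
  void registrationSucceeded(String newInfo) {
    bposLock.writeLock().lock();
    try {
      namespaceInfo = newInfo;
      synchronized (dataNodeMonitor) {
        registrations++;
      }
    } finally {
      bposLock.writeLock().unlock();
    }
  }

  int getRegistrations() {
    synchronized (dataNodeMonitor) {
      return registrations;
    }
  }

  /** Runs both paths concurrently; returns true if both threads finish. */
  static boolean runConcurrently(int rounds) {
    LockOrderSketch dn = new LockOrderSketch();
    Thread a = new Thread(() -> {
      for (int i = 0; i < rounds; i++) {
        dn.refreshVolumes();
      }
    });
    Thread b = new Thread(() -> {
      for (int i = 0; i < rounds; i++) {
        dn.registrationSucceeded("ns-" + i);
      }
    });
    a.start();
    b.start();
    try {
      a.join(10_000);
      b.join(10_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !a.isAlive() && !b.isAlive();
  }
}
```

In this arrangement no thread waits for the read-write lock while holding the monitor, so the wait cycle from the report cannot form. The trade-off is that the namespace info may be slightly stale by the time the monitor is entered.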
[jira] [Updated] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aiphago updated HDFS-15068: --- Attachment: HDFS-15068.005.patch > DataNode could meet deadlock if invoke refreshVolumes when register > --- > > Key: HDFS-15068 > URL: https://issues.apache.org/jira/browse/HDFS-15068 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: Xiaoqiao He >Assignee: Aiphago >Priority: Critical > Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch, HDFS-15068.003.patch, HDFS-15068.004.patch, HDFS-15068.005.patch > > > DataNode can deadlock when `dfsadmin -reconfig datanode ip:host start` is invoked to trigger #refreshVolumes. 1. DataNode#refreshVolumes first holds the datanode instance ownable {{synchronizer}} on entering the method, then tries to hold the BPOfferService {{readlock}} in `bpos.getNamespaceInfo()` in the following code segment.
{code:java}
for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
  nsInfos.add(bpos.getNamespaceInfo());
}
{code}
2. BPOfferService#registrationSucceeded (which is invoked by #register when the DataNode starts or by #reregister in processCommandFromActor) first holds the BPOfferService {{writelock}}, then tries to hold the datanode instance ownable {{synchronizer}} in the following method.
{code:java}
synchronized void bpRegistrationSucceeded(DatanodeRegistration bpRegistration,
    String blockPoolId) throws IOException {
  id = bpRegistration;
  if(!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) {
    throw new IOException("Inconsistent Datanode IDs. Name-node returned "
        + bpRegistration.getDatanodeUuid()
        + ". Expecting " + storage.getDatanodeUuid());
  }

  registerBlockPoolWithSecretManager(bpRegistration, blockPoolId);
}
{code}
[jira] [Updated] (HDFS-15000) Improve FsDatasetImpl to avoid IO operation in datasetLock
[ https://issues.apache.org/jira/browse/HDFS-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15000:
---------------------------
    Attachment: HDFS-15000.001.patch

> Improve FsDatasetImpl to avoid IO operation in datasetLock
> ----------------------------------------------------------
>
> Key: HDFS-15000
> URL: https://issues.apache.org/jira/browse/HDFS-15000
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-15000.001.patch
>
> As HDFS-14997 mentioned, some methods in #FsDatasetImpl, such as #finalizeBlock, #finalizeReplica and #createRbw, perform IO operations while holding the datasetLock, which blocks other logic when the IO load is very high. We should make the lock finer-grained or move the IO operations out of the datasetLock.
[jira] [Updated] (HDFS-15000) Improve FsDatasetImpl to avoid IO operation in datasetLock
[ https://issues.apache.org/jira/browse/HDFS-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15000:
---------------------------
    Attachment: (was: HDFS-15000.001.patch)

> Improve FsDatasetImpl to avoid IO operation in datasetLock
> ----------------------------------------------------------
>
> Key: HDFS-15000
> URL: https://issues.apache.org/jira/browse/HDFS-15000
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Major
>
> As HDFS-14997 mentioned, some methods in #FsDatasetImpl, such as #finalizeBlock, #finalizeReplica and #createRbw, perform IO operations while holding the datasetLock, which blocks other logic when the IO load is very high. We should make the lock finer-grained or move the IO operations out of the datasetLock.
[jira] [Comment Edited] (HDFS-15000) Improve FsDatasetImpl to avoid IO operation in datasetLock
[ https://issues.apache.org/jira/browse/HDFS-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000742#comment-17000742 ]

Aiphago edited comment on HDFS-15000 at 12/20/19 9:21 AM:
----------------------------------------------------------

Hi [~sodonnell], thanks for your comment. I have a question about your idea.
{quote}I wonder if it would be possible to refactor things so we do steps 1, 2, 3 and 5, drop the lock and then do the IO operation to actually create the file. In the event the IO fails, re-take the lock and clean up the volume map.
{quote}
If we do the IO after step 5, the volume map already holds the replica info while the IO may not have finished. If the IO then fails, you have to take the lock again to clean up, so the volume map may hold this replica for a long time, which may cause consistency problems when other threads read this replica.

My thoughts are:
# Do not change the order of the steps within a method.
# Make the IO asynchronous: release the lock and wait until another thread signals this thread when the IO has finished.
# Keep the IO operations in order across the different methods, such as #finalizeBlock, #finalizeReplica and #createRbw.
# Update the volume map only after the IO operation has completed.

was (Author: aiphag0):
Hi [~sodonnell], thanks for your comment. I have a question about your idea.

??I wonder if it would be possible to refactor things so we do steps 1, 2, 3 and 5, drop the lock and then do the IO operation to actually create the file. In the event the IO fails, re-take the lock and clean up the volume map.??

If we do the IO after step 5, the volume map already holds the replica info while the IO may not have finished. If the IO then fails, you have to take the lock again to clean up, so the volume map may hold this replica for a long time, which may cause consistency problems when other threads read this replica.

My thoughts are:
# Do not change the order of the steps within a method.
# Make the IO asynchronous: release the lock and wait until another thread signals this thread when the IO has finished.
# Keep the IO operations in order across the different methods, such as #finalizeBlock, #finalizeReplica and #createRbw.
# Update the volume map only after the IO operation has completed.

> Improve FsDatasetImpl to avoid IO operation in datasetLock
> ----------------------------------------------------------
>
> Key: HDFS-15000
> URL: https://issues.apache.org/jira/browse/HDFS-15000
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-15000.001.patch
>
> As HDFS-14997 mentioned, some methods in #FsDatasetImpl, such as #finalizeBlock, #finalizeReplica and #createRbw, perform IO operations while holding the datasetLock, which blocks other logic when the IO load is very high. We should make the lock finer-grained or move the IO operations out of the datasetLock.
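The first and last points of the list in the comment above (keep the step order, and touch the volume map only after the IO has finished) can be sketched roughly as follows. This is only an illustration under assumptions, not the HDFS-15000 patch: `IoOutsideLockSketch`, `DiskOp`, `createReplica` and the `inFlight` set are hypothetical names, the map of block IDs stands in for the real volume map, and the asynchronous signalling from the second point is left out.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a dataset that must not do disk IO under its lock.
final class IoOutsideLockSketch {
  /** A slow disk operation; hypothetical interface for illustration. */
  interface DiskOp {
    void run() throws IOException;
  }

  private final Object datasetLock = new Object();
  // blockId -> replica state; plays the role of the volume map.
  private final Map<Long, String> volumeMap = new HashMap<>();
  // Blocks whose IO is still running; they are not yet visible in volumeMap.
  private final Set<Long> inFlight = new HashSet<>();

  /** Returns true only if the IO succeeded and the replica was published. */
  boolean createReplica(long blockId, DiskOp io) {
    synchronized (datasetLock) {
      // Cheap in-memory checks and the reservation happen under the lock.
      if (volumeMap.containsKey(blockId) || !inFlight.add(blockId)) {
        return false;
      }
    }
    boolean ok = false;
    try {
      io.run(); // slow disk IO, performed without holding datasetLock
      ok = true;
    } catch (IOException e) {
      // IO failed: nothing was published, so there is nothing to roll back.
    } finally {
      synchronized (datasetLock) {
        inFlight.remove(blockId);
        if (ok) {
          volumeMap.put(blockId, "RBW"); // publish only after the IO is done
        }
      }
    }
    return ok;
  }

  boolean contains(long blockId) {
    synchronized (datasetLock) {
      return volumeMap.containsKey(blockId);
    }
  }

  public static void main(String[] args) {
    IoOutsideLockSketch dataset = new IoOutsideLockSketch();
    System.out.println(dataset.createReplica(1L, () -> { }));
    System.out.println(dataset.createReplica(2L, () -> {
      throw new IOException("simulated disk failure");
    }));
  }
}
```

With this ordering a failed IO never leaves a replica in the map, which is the consistency concern raised in the comment; the cost is the extra bookkeeping for blocks whose IO is still in flight.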
[jira] [Commented] (HDFS-15000) Improve FsDatasetImpl to avoid IO operation in datasetLock
[ https://issues.apache.org/jira/browse/HDFS-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000742#comment-17000742 ]

Aiphago commented on HDFS-15000:
--------------------------------

Hi [~sodonnell], thanks for your comment. I have a question about your idea.

??I wonder if it would be possible to refactor things so we do steps 1, 2, 3 and 5, drop the lock and then do the IO operation to actually create the file. In the event the IO fails, re-take the lock and clean up the volume map.??

If we do the IO after step 5, the volume map already holds the replica info while the IO may not have finished. If the IO then fails, you have to take the lock again to clean up, so the volume map may hold this replica for a long time, which may cause consistency problems when other threads read this replica.

My thoughts are:
# Do not change the order of the steps within a method.
# Make the IO asynchronous: release the lock and wait until another thread signals this thread when the IO has finished.
# Keep the IO operations in order across the different methods, such as #finalizeBlock, #finalizeReplica and #createRbw.
# Update the volume map only after the IO operation has completed.

> Improve FsDatasetImpl to avoid IO operation in datasetLock
> ----------------------------------------------------------
>
> Key: HDFS-15000
> URL: https://issues.apache.org/jira/browse/HDFS-15000
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Major
>
> As HDFS-14997 mentioned, some methods in #FsDatasetImpl, such as #finalizeBlock, #finalizeReplica and #createRbw, perform IO operations while holding the datasetLock, which blocks other logic when the IO load is very high. We should make the lock finer-grained or move the IO operations out of the datasetLock.
[jira] [Commented] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17000679#comment-17000679 ]

Aiphago commented on HDFS-15068:
--------------------------------

Hi [~elgoiri], thanks for the valuable suggestions. I have run the unit test without the patch and fixed the problems as you suggested. Updated the patch: [^HDFS-15068.004.patch]

> DataNode could meet deadlock if invoke refreshVolumes when register
> -------------------------------------------------------------------
>
> Key: HDFS-15068
> URL: https://issues.apache.org/jira/browse/HDFS-15068
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Critical
> Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch, HDFS-15068.003.patch, HDFS-15068.004.patch
>
> The DataNode can deadlock when `dfsadmin -reconfig datanode host:ipc_port start` is invoked to trigger #refreshVolumes.
> 1. DataNode#refreshVolumes takes the DataNode instance monitor ({{synchronized}}) when it enters the method, then tries to take the BPOfferService {{readlock}} when it calls `bpos.getNamespaceInfo()` in the following code segment.
> {code:java}
> for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
>   nsInfos.add(bpos.getNamespaceInfo());
> }
> {code}
> 2. BPOfferService#registrationSucceeded (invoked by #register when the DataNode starts, or by #reregister during processCommandFromActor) takes the BPOfferService {{writelock}} first, then tries to take the DataNode instance monitor in the following method.
> {code:java}
> synchronized void bpRegistrationSucceeded(DatanodeRegistration bpRegistration,
>     String blockPoolId) throws IOException {
>   id = bpRegistration;
>   if (!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) {
>     throw new IOException("Inconsistent Datanode IDs. Name-node returned "
>         + bpRegistration.getDatanodeUuid()
>         + ". Expecting " + storage.getDatanodeUuid());
>   }
>
>   registerBlockPoolWithSecretManager(bpRegistration, blockPoolId);
> }
> {code}
[jira] [Updated] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15068:
---------------------------
    Attachment: HDFS-15068.004.patch

> DataNode could meet deadlock if invoke refreshVolumes when register
> -------------------------------------------------------------------
>
> Key: HDFS-15068
> URL: https://issues.apache.org/jira/browse/HDFS-15068
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Critical
> Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch, HDFS-15068.003.patch, HDFS-15068.004.patch
>
> The DataNode can deadlock when `dfsadmin -reconfig datanode host:ipc_port start` is invoked to trigger #refreshVolumes.
> 1. DataNode#refreshVolumes takes the DataNode instance monitor ({{synchronized}}) when it enters the method, then tries to take the BPOfferService {{readlock}} when it calls `bpos.getNamespaceInfo()` in the following code segment.
> {code:java}
> for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
>   nsInfos.add(bpos.getNamespaceInfo());
> }
> {code}
> 2. BPOfferService#registrationSucceeded (invoked by #register when the DataNode starts, or by #reregister during processCommandFromActor) takes the BPOfferService {{writelock}} first, then tries to take the DataNode instance monitor in the following method.
> {code:java}
> synchronized void bpRegistrationSucceeded(DatanodeRegistration bpRegistration,
>     String blockPoolId) throws IOException {
>   id = bpRegistration;
>   if (!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) {
>     throw new IOException("Inconsistent Datanode IDs. Name-node returned "
>         + bpRegistration.getDatanodeUuid()
>         + ". Expecting " + storage.getDatanodeUuid());
>   }
>
>   registerBlockPoolWithSecretManager(bpRegistration, blockPoolId);
> }
> {code}
[jira] [Comment Edited] (HDFS-15000) Improve FsDatasetImpl to avoid IO operation in datasetLock
[ https://issues.apache.org/jira/browse/HDFS-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999776#comment-16999776 ]

Aiphago edited comment on HDFS-15000 at 12/19/19 6:32 AM:
----------------------------------------------------------

Submitted a demo patch. The main idea is to perform the IO operations (which may depend on each other's order) asynchronously and without holding the lock, while keeping the operations in order. Any suggestions or problems?

was (Author: aiphag0):
Submitted a demo patch. The main idea is to perform the IO operations (which may depend on each other's order) asynchronously and without holding the lock, while keeping the operations in order.

> Improve FsDatasetImpl to avoid IO operation in datasetLock
> ----------------------------------------------------------
>
> Key: HDFS-15000
> URL: https://issues.apache.org/jira/browse/HDFS-15000
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-15000.001.patch
>
> As HDFS-14997 mentioned, some methods in #FsDatasetImpl, such as #finalizeBlock, #finalizeReplica and #createRbw, perform IO operations while holding the datasetLock, which blocks other logic when the IO load is very high. We should make the lock finer-grained or move the IO operations out of the datasetLock.
[jira] [Commented] (HDFS-15000) Improve FsDatasetImpl to avoid IO operation in datasetLock
[ https://issues.apache.org/jira/browse/HDFS-15000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999776#comment-16999776 ]

Aiphago commented on HDFS-15000:
--------------------------------

Submitted a demo patch. The main idea is to perform the IO operations (which may depend on each other's order) asynchronously and without holding the lock, while keeping the operations in order.

> Improve FsDatasetImpl to avoid IO operation in datasetLock
> ----------------------------------------------------------
>
> Key: HDFS-15000
> URL: https://issues.apache.org/jira/browse/HDFS-15000
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-15000.001.patch
>
> As HDFS-14997 mentioned, some methods in #FsDatasetImpl, such as #finalizeBlock, #finalizeReplica and #createRbw, perform IO operations while holding the datasetLock, which blocks other logic when the IO load is very high. We should make the lock finer-grained or move the IO operations out of the datasetLock.
[jira] [Commented] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16999736#comment-16999736 ]

Aiphago commented on HDFS-15068:
--------------------------------

Hi [~hexiaoqiao], thanks for the valuable advice. Updated the patch: [^HDFS-15068.003.patch]

> DataNode could meet deadlock if invoke refreshVolumes when register
> -------------------------------------------------------------------
>
> Key: HDFS-15068
> URL: https://issues.apache.org/jira/browse/HDFS-15068
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Critical
> Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch, HDFS-15068.003.patch
>
> The DataNode can deadlock when `dfsadmin -reconfig datanode host:ipc_port start` is invoked to trigger #refreshVolumes.
> 1. DataNode#refreshVolumes takes the DataNode instance monitor ({{synchronized}}) when it enters the method, then tries to take the BPOfferService {{readlock}} when it calls `bpos.getNamespaceInfo()` in the following code segment.
> {code:java}
> for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
>   nsInfos.add(bpos.getNamespaceInfo());
> }
> {code}
> 2. BPOfferService#registrationSucceeded (invoked by #register when the DataNode starts, or by #reregister during processCommandFromActor) takes the BPOfferService {{writelock}} first, then tries to take the DataNode instance monitor in the following method.
> {code:java}
> synchronized void bpRegistrationSucceeded(DatanodeRegistration bpRegistration,
>     String blockPoolId) throws IOException {
>   id = bpRegistration;
>   if (!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) {
>     throw new IOException("Inconsistent Datanode IDs. Name-node returned "
>         + bpRegistration.getDatanodeUuid()
>         + ". Expecting " + storage.getDatanodeUuid());
>   }
>
>   registerBlockPoolWithSecretManager(bpRegistration, blockPoolId);
> }
> {code}
[jira] [Updated] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15068:
---------------------------
    Attachment: HDFS-15068.003.patch

> DataNode could meet deadlock if invoke refreshVolumes when register
> -------------------------------------------------------------------
>
> Key: HDFS-15068
> URL: https://issues.apache.org/jira/browse/HDFS-15068
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Critical
> Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch, HDFS-15068.003.patch
>
> The DataNode can deadlock when `dfsadmin -reconfig datanode host:ipc_port start` is invoked to trigger #refreshVolumes.
> 1. DataNode#refreshVolumes takes the DataNode instance monitor ({{synchronized}}) when it enters the method, then tries to take the BPOfferService {{readlock}} when it calls `bpos.getNamespaceInfo()` in the following code segment.
> {code:java}
> for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
>   nsInfos.add(bpos.getNamespaceInfo());
> }
> {code}
> 2. BPOfferService#registrationSucceeded (invoked by #register when the DataNode starts, or by #reregister during processCommandFromActor) takes the BPOfferService {{writelock}} first, then tries to take the DataNode instance monitor in the following method.
> {code:java}
> synchronized void bpRegistrationSucceeded(DatanodeRegistration bpRegistration,
>     String blockPoolId) throws IOException {
>   id = bpRegistration;
>   if (!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) {
>     throw new IOException("Inconsistent Datanode IDs. Name-node returned "
>         + bpRegistration.getDatanodeUuid()
>         + ". Expecting " + storage.getDatanodeUuid());
>   }
>
>   registerBlockPoolWithSecretManager(bpRegistration, blockPoolId);
> }
> {code}
[jira] [Updated] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15068:
---------------------------
    Attachment: HDFS-15068.002.patch

> DataNode could meet deadlock if invoke refreshVolumes when register
> -------------------------------------------------------------------
>
> Key: HDFS-15068
> URL: https://issues.apache.org/jira/browse/HDFS-15068
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Critical
> Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch
>
> The DataNode can deadlock when `dfsadmin -reconfig datanode host:ipc_port start` is invoked to trigger #refreshVolumes.
> 1. DataNode#refreshVolumes takes the DataNode instance monitor ({{synchronized}}) when it enters the method, then tries to take the BPOfferService {{readlock}} when it calls `bpos.getNamespaceInfo()` in the following code segment.
> {code:java}
> for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
>   nsInfos.add(bpos.getNamespaceInfo());
> }
> {code}
> 2. BPOfferService#registrationSucceeded (invoked by #register when the DataNode starts, or by #reregister during processCommandFromActor) takes the BPOfferService {{writelock}} first, then tries to take the DataNode instance monitor in the following method.
> {code:java}
> synchronized void bpRegistrationSucceeded(DatanodeRegistration bpRegistration,
>     String blockPoolId) throws IOException {
>   id = bpRegistration;
>   if (!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) {
>     throw new IOException("Inconsistent Datanode IDs. Name-node returned "
>         + bpRegistration.getDatanodeUuid()
>         + ". Expecting " + storage.getDatanodeUuid());
>   }
>
>   registerBlockPoolWithSecretManager(bpRegistration, blockPoolId);
> }
> {code}
[jira] [Commented] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16998860#comment-16998860 ]

Aiphago commented on HDFS-15068:
--------------------------------

Thanks [~hexiaoqiao] for the review. I have uploaded the patch with a unit test: [^HDFS-15068.002.patch]

> DataNode could meet deadlock if invoke refreshVolumes when register
> -------------------------------------------------------------------
>
> Key: HDFS-15068
> URL: https://issues.apache.org/jira/browse/HDFS-15068
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Critical
> Attachments: HDFS-15068.001.patch, HDFS-15068.002.patch
>
> The DataNode can deadlock when `dfsadmin -reconfig datanode host:ipc_port start` is invoked to trigger #refreshVolumes.
> 1. DataNode#refreshVolumes takes the DataNode instance monitor ({{synchronized}}) when it enters the method, then tries to take the BPOfferService {{readlock}} when it calls `bpos.getNamespaceInfo()` in the following code segment.
> {code:java}
> for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
>   nsInfos.add(bpos.getNamespaceInfo());
> }
> {code}
> 2. BPOfferService#registrationSucceeded (invoked by #register when the DataNode starts, or by #reregister during processCommandFromActor) takes the BPOfferService {{writelock}} first, then tries to take the DataNode instance monitor in the following method.
> {code:java}
> synchronized void bpRegistrationSucceeded(DatanodeRegistration bpRegistration,
>     String blockPoolId) throws IOException {
>   id = bpRegistration;
>   if (!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) {
>     throw new IOException("Inconsistent Datanode IDs. Name-node returned "
>         + bpRegistration.getDatanodeUuid()
>         + ". Expecting " + storage.getDatanodeUuid());
>   }
>
>   registerBlockPoolWithSecretManager(bpRegistration, blockPoolId);
> }
> {code}
[jira] [Updated] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-15068:
---------------------------
    Attachment: HDFS-15068.001.patch

> DataNode could meet deadlock if invoke refreshVolumes when register
> -------------------------------------------------------------------
>
> Key: HDFS-15068
> URL: https://issues.apache.org/jira/browse/HDFS-15068
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Critical
> Attachments: HDFS-15068.001.patch
>
> The DataNode can deadlock when `dfsadmin -reconfig datanode host:ipc_port start` is invoked to trigger #refreshVolumes.
> 1. DataNode#refreshVolumes takes the DataNode instance monitor ({{synchronized}}) when it enters the method, then tries to take the BPOfferService {{readlock}} when it calls `bpos.getNamespaceInfo()` in the following code segment.
> {code:java}
> for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
>   nsInfos.add(bpos.getNamespaceInfo());
> }
> {code}
> 2. BPOfferService#registrationSucceeded (invoked by #register when the DataNode starts, or by #reregister during processCommandFromActor) takes the BPOfferService {{writelock}} first, then tries to take the DataNode instance monitor in the following method.
> {code:java}
> synchronized void bpRegistrationSucceeded(DatanodeRegistration bpRegistration,
>     String blockPoolId) throws IOException {
>   id = bpRegistration;
>   if (!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) {
>     throw new IOException("Inconsistent Datanode IDs. Name-node returned "
>         + bpRegistration.getDatanodeUuid()
>         + ". Expecting " + storage.getDatanodeUuid());
>   }
>
>   registerBlockPoolWithSecretManager(bpRegistration, blockPoolId);
> }
> {code}
[jira] [Commented] (HDFS-15068) DataNode could meet deadlock if invoke refreshVolumes when register
[ https://issues.apache.org/jira/browse/HDFS-15068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16997922#comment-16997922 ]

Aiphago commented on HDFS-15068:
--------------------------------

I changed the lock order of refreshVolumes(); here is the patch: [^HDFS-15068.001.patch]

> DataNode could meet deadlock if invoke refreshVolumes when register
> -------------------------------------------------------------------
>
> Key: HDFS-15068
> URL: https://issues.apache.org/jira/browse/HDFS-15068
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode
> Reporter: Xiaoqiao He
> Assignee: Aiphago
> Priority: Critical
> Attachments: HDFS-15068.001.patch
>
> The DataNode can deadlock when `dfsadmin -reconfig datanode host:ipc_port start` is invoked to trigger #refreshVolumes.
> 1. DataNode#refreshVolumes takes the DataNode instance monitor ({{synchronized}}) when it enters the method, then tries to take the BPOfferService {{readlock}} when it calls `bpos.getNamespaceInfo()` in the following code segment.
> {code:java}
> for (BPOfferService bpos : blockPoolManager.getAllNamenodeThreads()) {
>   nsInfos.add(bpos.getNamespaceInfo());
> }
> {code}
> 2. BPOfferService#registrationSucceeded (invoked by #register when the DataNode starts, or by #reregister during processCommandFromActor) takes the BPOfferService {{writelock}} first, then tries to take the DataNode instance monitor in the following method.
> {code:java}
> synchronized void bpRegistrationSucceeded(DatanodeRegistration bpRegistration,
>     String blockPoolId) throws IOException {
>   id = bpRegistration;
>   if (!storage.getDatanodeUuid().equals(bpRegistration.getDatanodeUuid())) {
>     throw new IOException("Inconsistent Datanode IDs. Name-node returned "
>         + bpRegistration.getDatanodeUuid()
>         + ". Expecting " + storage.getDatanodeUuid());
>   }
>
>   registerBlockPoolWithSecretManager(bpRegistration, blockPoolId);
> }
> {code}
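The lock-order problem described in this issue, and the kind of reordering the comment mentions, can be illustrated with a minimal sketch. It is not the attached patch: `Node` and `Service` are hypothetical stand-ins for DataNode and BPOfferService, and the sketch assumes that reading the namespace info before entering the instance monitor is acceptable for the caller.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-ins for DataNode and BPOfferService; not the real HDFS classes.
final class LockOrderSketch {
  /** Plays the role of BPOfferService: state guarded by a read/write lock. */
  static final class Service {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private String namespaceInfo = "ns-unregistered";

    String getNamespaceInfo() {
      lock.readLock().lock();
      try {
        return namespaceInfo;
      } finally {
        lock.readLock().unlock();
      }
    }

    /** Takes the write lock first, then the node monitor, as registration does. */
    void registrationSucceeded(Node node, String nsInfo) {
      lock.writeLock().lock();
      try {
        namespaceInfo = nsInfo;
        node.bpRegistrationSucceeded();
      } finally {
        lock.writeLock().unlock();
      }
    }
  }

  /** Plays the role of DataNode: state guarded by its own monitor. */
  static final class Node {
    private final List<Service> services;
    private int registrations;

    Node(List<Service> services) {
      this.services = Collections.unmodifiableList(new ArrayList<>(services));
    }

    synchronized void bpRegistrationSucceeded() {
      registrations++;
    }

    synchronized int registrations() {
      return registrations;
    }

    List<String> refreshVolumes() {
      // Read from every service BEFORE entering the monitor, so this thread
      // never waits for a service lock while it holds the monitor.
      List<String> nsInfos = new ArrayList<>();
      for (Service s : services) {
        nsInfos.add(s.getNamespaceInfo());
      }
      synchronized (this) {
        // The actual volume changes would happen here, under the monitor.
        return nsInfos;
      }
    }
  }

  /** Runs both code paths concurrently; true if both threads finish in time. */
  static boolean runConcurrently(int iterations) {
    Service service = new Service();
    Node node = new Node(Collections.singletonList(service));
    Thread refresher = new Thread(() -> {
      for (int i = 0; i < iterations; i++) {
        node.refreshVolumes();
      }
    });
    Thread registrar = new Thread(() -> {
      for (int i = 0; i < iterations; i++) {
        service.registrationSucceeded(node, "ns-" + i);
      }
    });
    refresher.setDaemon(true);
    registrar.setDaemon(true);
    refresher.start();
    registrar.start();
    try {
      refresher.join(10_000);
      registrar.join(10_000);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    }
    return !refresher.isAlive() && !registrar.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("both threads finished: " + runConcurrently(10_000));
  }
}
```

Because the refreshing thread no longer holds the instance monitor while it waits for a service lock, the two threads cannot block each other in a cycle. If `refreshVolumes` were declared `synchronized` as a whole, the same two threads could end up waiting on each other, which is the situation the issue describes.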
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16984170#comment-16984170 ]

Aiphago commented on HDFS-14986:

Thanks a lot for the review [~linyiqun].

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Affects Versions: 2.10.0
> Reporter: Ryan Wu
> Assignee: Aiphago
> Priority: Major
> Fix For: 3.3.0, 2.10.1, 2.11.0
>
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch, HDFS-14986.003.patch, HDFS-14986.004.patch, HDFS-14986.005.patch, HDFS-14986.006.patch
>
>
> Running DU across lots of disks is very expensive. We applied the patch HDFS-14313 to get used space from ReplicaInfo in memory. However, new du threads throw the exception
> {code:java}
> // 2019-11-08 18:07:13,858 ERROR [refreshUsed-/home/vipshop/hard_disk/7/dfs/dn/current/BP-1203969992--1450855658517] org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed: ReplicaCachingGetSpaceUsed refresh error
> java.util.ConcurrentModificationException: Tree has been modified outside of iterator
>     at org.apache.hadoop.hdfs.util.FoldedTreeSet$TreeSetIterator.checkForModification(FoldedTreeSet.java:311)
>     at org.apache.hadoop.hdfs.util.FoldedTreeSet$TreeSetIterator.hasNext(FoldedTreeSet.java:256)
>     at java.util.AbstractCollection.addAll(AbstractCollection.java:343)
>     at java.util.HashSet.<init>(HashSet.java:120)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.deepCopyReplica(FsDatasetImpl.java:1052)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.ReplicaCachingGetSpaceUsed.refresh(ReplicaCachingGetSpaceUsed.java:73)
>     at org.apache.hadoop.fs.CachingGetSpaceUsed$RefreshThread.run(CachingGetSpaceUsed.java:178)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
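The stack trace above shows `new HashSet<>(...)` iterating the replica set inside `deepCopyReplica` while another thread modifies it. A common remedy for this pattern, sketched below, is to take the snapshot under the same lock that writers hold. This is an illustration only, not the contents of the HDFS-14986 patches: `ReplicaSnapshotSketch`, its lock, and its methods are hypothetical names, and a plain `TreeSet` stands in for `FoldedTreeSet`.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical stand-in for a dataset that tracks replicas; not HDFS code. */
final class ReplicaSnapshotSketch {
  // A single lock guards every read and write of the replica set.
  private final Object datasetLock = new Object();
  private final Set<Long> replicas = new TreeSet<>();

  void addReplica(long blockId) {
    synchronized (datasetLock) {
      replicas.add(blockId);
    }
  }

  /** Copies the set while holding the lock, so the iteration cannot race a writer. */
  Set<Long> deepCopyReplicas() {
    synchronized (datasetLock) {
      return new HashSet<>(replicas);
    }
  }

  /** Takes snapshots while a writer thread adds ids; returns the final size. */
  static int copyWhileWriting(int writes) {
    ReplicaSnapshotSketch dataset = new ReplicaSnapshotSketch();
    Thread writer = new Thread(() -> {
      for (long id = 0; id < writes; id++) {
        dataset.addReplica(id);
      }
    });
    writer.start();
    while (writer.isAlive()) {
      dataset.deepCopyReplicas(); // would risk the exception without the lock
    }
    try {
      writer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return dataset.deepCopyReplicas().size();
  }
}
```

The trade-off is that the copy now blocks writers for its duration, so a refresh thread that snapshots a very large set should do as little as possible while it holds the lock.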
[jira] [Updated] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-14986:
---
Attachment: HDFS-14986.006.patch

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch, HDFS-14986.003.patch, HDFS-14986.004.patch, HDFS-14986.005.patch, HDFS-14986.006.patch
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983513#comment-16983513 ]

Aiphago commented on HDFS-14986:

Good advice. I changed the retry times to 10 and now close the stream inside the while loop. [^HDFS-14986.006.patch]

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch, HDFS-14986.003.patch, HDFS-14986.004.patch, HDFS-14986.005.patch, HDFS-14986.006.patch
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983213#comment-16983213 ]

Aiphago commented on HDFS-14986:

Hi [~linyiqun], thank you for your valuable advice. I improved the patch as you suggested. In addition, I renamed shouldInitRefresh to shouldFirstRefresh so that it is easier to distinguish. [^HDFS-14986.005.patch]

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch, HDFS-14986.003.patch, HDFS-14986.004.patch, HDFS-14986.005.patch
[jira] [Updated] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-14986:
---
Attachment: HDFS-14986.005.patch

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch, HDFS-14986.003.patch, HDFS-14986.004.patch, HDFS-14986.005.patch
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16980042#comment-16980042 ]

Aiphago commented on HDFS-14986:

Hi [~linyiqun], I get your idea, but it may have a small problem. When FSCachingGetSpaceUsed invokes init() and used < 0, the first refresh() is skipped. That leaves used = 0 until the next refresh(), which by default comes 10 minutes later. So I added runImmediately to solve this problem, and also to keep other subclasses from calling refresh() twice at the first du. That is the idea behind patch 004.

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch, HDFS-14986.003.patch, HDFS-14986.004.patch
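The comment above describes an ordering problem: if init() skips its inline refresh, the background loop would normally sleep a full interval before the first refresh, leaving the usage figure stale for that long. The sketch below models that reasoning with a hypothetical `FirstRefreshSketch` class; the `shouldFirstRefresh` and `runImmediately` names come from the thread, but the class, its methods, and the recorded "sleep" events are assumptions made for illustration and are not taken from patch 004.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the init()/refresh-loop interaction; not HDFS code. */
final class FirstRefreshSketch {
  /** Records "refresh" and "sleep" steps in the order they happen. */
  final List<String> events = new ArrayList<>();
  private final boolean shouldFirstRefresh;
  private long used = -1; // negative means "not measured yet"

  FirstRefreshSketch(boolean shouldFirstRefresh) {
    this.shouldFirstRefresh = shouldFirstRefresh;
  }

  void refresh() {
    used = 42; // placeholder measurement
    events.add("refresh");
  }

  /**
   * Refreshes inline only when the subclass allows it. Returns true when the
   * value is still unknown, meaning the loop should refresh right away.
   */
  boolean init() {
    if (used < 0 && shouldFirstRefresh) {
      refresh();
    }
    return used < 0;
  }

  /** Simulated refresh loop: skips the first sleep when runImmediately is set. */
  void refreshLoop(boolean runImmediately, int cycles) {
    for (int i = 0; i < cycles; i++) {
      if (!runImmediately) {
        events.add("sleep"); // stands in for waiting one refresh interval
      }
      runImmediately = false;
      refresh();
    }
  }
}
```

In both configurations exactly one refresh happens before the first sleep: inline in init() when `shouldFirstRefresh` is true, or at the top of the loop when it is false. That is the property the comment is after, with no stale value for a whole interval and no double refresh at startup.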
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979853#comment-16979853 ]

Aiphago commented on HDFS-14986:

Hi [~linyiqun], thank you for your valuable advice. setShouldInitRefresh() is now invoked only in ReplicaCachingGetSpaceUsed. [^HDFS-14986.004.patch]

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch, HDFS-14986.003.patch, HDFS-14986.004.patch
[jira] [Updated] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-14986:
---
Attachment: HDFS-14986.004.patch

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch, HDFS-14986.003.patch, HDFS-14986.004.patch
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978951#comment-16978951 ]

Aiphago commented on HDFS-14986:

Hi [~jianliang.wu], I just assigned this Jira to myself. Please feel free to assign it back if you would also like to work on it.

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch, HDFS-14986.003.patch
[jira] [Assigned] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago reassigned HDFS-14986:
--
Assignee: Aiphago (was: Ryan Wu)

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Aiphago
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch, HDFS-14986.003.patch
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16978950#comment-16978950 ]

Aiphago commented on HDFS-14986:

Hi [~linyiqun], thank you for your valuable advice. In the previous patch I modified CachingGetSpaceUsed#init(), but that also affects subclasses of CachingGetSpaceUsed such as DU. So I added a filter, and the related unit tests now pass. [^HDFS-14986.003.patch]

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Ryan Wu
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch, HDFS-14986.003.patch
[jira] [Updated] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-14986:
---
Attachment: HDFS-14986.003.patch

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Ryan Wu
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch, HDFS-14986.003.patch
[jira] [Updated] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-14986:
---
Attachment: HDFS-14986.002.patch

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Ryan Wu
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16977328#comment-16977328 ]

Aiphago commented on HDFS-14986:

Hi [~linyiqun], I updated the patch. Could you review it again? [^HDFS-14986.002.patch]

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Ryan Wu
> Priority: Major
> Attachments: HDFS-14986.001.patch, HDFS-14986.002.patch
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16976301#comment-16976301 ]

Aiphago commented on HDFS-14986:

Thanks for your advice; I'll fix it later.

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
> Key: HDFS-14986
> URL: https://issues.apache.org/jira/browse/HDFS-14986
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: datanode, performance
> Reporter: Ryan Wu
> Assignee: Ryan Wu
> Priority: Major
> Attachments: HDFS-14986.001.patch
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16974888#comment-16974888 ]

Aiphago commented on HDFS-14986:

Here is the patch for trunk. Could [~leosun08], [~hexiaoqiao], and [~linyiqun] review it? Thanks very much. [^HDFS-14986.001.patch]

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
>                 Key: HDFS-14986
>                 URL: https://issues.apache.org/jira/browse/HDFS-14986
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, performance
>            Reporter: Ryan Wu
>            Assignee: Ryan Wu
>            Priority: Major
>         Attachments: HDFS-14986.001.patch
[jira] [Updated] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-14986:
---
    Attachment: HDFS-14986.001.patch

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
>                 Key: HDFS-14986
>                 URL: https://issues.apache.org/jira/browse/HDFS-14986
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, performance
>            Reporter: Ryan Wu
>            Assignee: Ryan Wu
>            Priority: Major
>         Attachments: HDFS-14986.001.patch
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973916#comment-16973916 ]

Aiphago commented on HDFS-14986:

I mean that I can fix the deadlock problem while still adding the dataset lock; my first comment already describes the solution.

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
>                 Key: HDFS-14986
>                 URL: https://issues.apache.org/jira/browse/HDFS-14986
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, performance
>            Reporter: Ryan Wu
>            Assignee: Ryan Wu
>            Priority: Major
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973906#comment-16973906 ]

Aiphago commented on HDFS-14986:

We run a branch older than 2.8, which uses synchronized, and we have fixed the deadlock problem there. I think it is better to add the dataset lock at FsDatasetImpl#deepCopyReplica in trunk as well; the method for solving the deadlock problem is the same.

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
>                 Key: HDFS-14986
>                 URL: https://issues.apache.org/jira/browse/HDFS-14986
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, performance
>            Reporter: Ryan Wu
>            Assignee: Ryan Wu
>            Priority: Major
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973321#comment-16973321 ]

Aiphago commented on HDFS-14986:

"DataNode: [... #283 daemon prio=5 os_prio=0 tid=0x7fd9826a7800 nid=0x7463 in Object.wait() [0x7fd949616000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(Native Method)
    - waiting on <0x0006b375a1d0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2)
    at java.lang.Thread.join(Thread.java:1249)
    - locked <0x0006b375a1d0> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList$2)
    at java.lang.Thread.join(Thread.java:1323)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsVolumeList.addBlockPool(FsVolumeList.java:423)
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.addBlockPool(FsDatasetImpl.java:2509)
    - locked <0x0006b33a3b10> (a org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1388)
    at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:311)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:232)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:720)

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
>                 Key: HDFS-14986
>                 URL: https://issues.apache.org/jira/browse/HDFS-14986
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, performance
>            Reporter: Ryan Wu
>            Assignee: Ryan Wu
>            Priority: Major
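The thread dump above appears to show a thread that holds the FsDatasetImpl monitor while it waits in Thread.join() for another thread. If the joined thread needs the same monitor, neither can make progress. The following is a minimal sketch of that pattern, not the DataNode code: the class JoinUnderLockSketch, its methods, and the plain Object used as the lock are all hypothetical stand-ins, and a bounded join is used so the demonstration terminates.

```java
/** Hypothetical sketch: joining a worker while holding the lock the worker needs. */
class JoinUnderLockSketch {
    // Stand-in for the dataset lock (assumption: a plain monitor).
    private final Object datasetLock = new Object();

    // The worker needs the dataset lock, e.g. to copy the replica set.
    private Thread newWorker() {
        Thread t = new Thread(() -> {
            synchronized (datasetLock) {
                // the work that requires the lock would go here
            }
        });
        t.setDaemon(true);
        return t;
    }

    private static void join(Thread t, long millis) {
        try {
            t.join(millis); // 0 means wait without a time limit
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Joins the worker while still holding the lock. timeoutMillis must be
     * greater than 0; with an unbounded join this would never return.
     * Returns whether the worker finished before the timeout.
     */
    boolean joinWhileHoldingLock(long timeoutMillis) {
        Thread worker = newWorker();
        boolean finished;
        synchronized (datasetLock) {
            worker.start();
            join(worker, timeoutMillis);  // worker is blocked on datasetLock
            finished = !worker.isAlive();
        }
        join(worker, 0);                  // lock released, worker can finish
        return finished;
    }

    /** Starts the worker under the lock but joins only after releasing it. */
    boolean joinAfterReleasingLock(long timeoutMillis) {
        Thread worker = newWorker();
        synchronized (datasetLock) {
            worker.start();
        }
        join(worker, timeoutMillis);
        return !worker.isAlive();
    }
}
```

In this sketch joinWhileHoldingLock reports that the worker did not finish, because the worker cannot enter the synchronized block until the caller leaves it, while joinAfterReleasingLock lets the worker complete. Which of the two shapes the real fix takes is not stated in this thread.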
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973317#comment-16973317 ]

Aiphago commented on HDFS-14986:

I would like to submit a patch to try to fix this bug later.

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
>                 Key: HDFS-14986
>                 URL: https://issues.apache.org/jira/browse/HDFS-14986
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, performance
>            Reporter: Ryan Wu
>            Assignee: Ryan Wu
>            Priority: Major
[jira] [Commented] (HDFS-14986) ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
[ https://issues.apache.org/jira/browse/HDFS-14986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973314#comment-16973314 ]

Aiphago commented on HDFS-14986:

We hit the same concurrency problem with HDFS-14313 earlier, on our own Hadoop version, and solved it by adding the dataset lock in FsDatasetImpl#deepCopyReplica(). However, this can cause a deadlock when the DataNode restarts. When a BlockPoolSlice is initialized for the first time, BlockPoolSlice#loadDfsUsed() may return -1. ReplicaCachingGetSpaceUsed#init() then blocks because FsDatasetImpl#deepCopyReplica() cannot get the dataset lock, and the thread that holds the dataset lock cannot release it at the same time.

> ReplicaCachingGetSpaceUsed throws ConcurrentModificationException
> --
>
>                 Key: HDFS-14986
>                 URL: https://issues.apache.org/jira/browse/HDFS-14986
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, performance
>            Reporter: Ryan Wu
>            Assignee: Ryan Wu
>            Priority: Major
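The idea of taking the dataset lock while copying the replica set can be sketched roughly as below. This is an illustration under assumptions, not the HDFS-14986 patch: ReplicaSnapshotSketch, its fields, and its methods are hypothetical, a java.util.TreeSet stands in for FoldedTreeSet, and a ReentrantLock stands in for the dataset lock. The point is only that the copy and every mutation go through the same lock, so the iterator used by the copy never sees a concurrent modification.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch: copy a shared set under the lock its writers use. */
class ReplicaSnapshotSketch {
    // Stand-in for the dataset lock (assumption).
    private final ReentrantLock datasetLock = new ReentrantLock();
    // Stand-in for the per-block-pool replica set (assumption).
    private final TreeSet<Long> replicaIds = new TreeSet<>();

    /** Writers mutate the set only while holding the lock. */
    void addReplica(long id) {
        datasetLock.lock();
        try {
            replicaIds.add(id);
        } finally {
            datasetLock.unlock();
        }
    }

    /** The copy iterates the set under the same lock, so no writer can interleave. */
    List<Long> snapshot() {
        datasetLock.lock();
        try {
            return new ArrayList<>(replicaIds);
        } finally {
            datasetLock.unlock();
        }
    }

    /** Runs a writer thread while the caller takes repeated snapshots. */
    static int demo() {
        ReplicaSnapshotSketch sketch = new ReplicaSnapshotSketch();
        Thread writer = new Thread(() -> {
            for (long i = 0; i < 10_000; i++) {
                sketch.addReplica(i);
            }
        });
        writer.start();
        for (int i = 0; i < 100; i++) {
            sketch.snapshot(); // would risk ConcurrentModificationException without the lock
        }
        try {
            writer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sketch.snapshot().size();
    }
}
```

As the comment above notes, the cost of this approach is that the copying thread now waits for the lock, so any code path that holds the lock while waiting for that thread has to be restructured.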
[jira] [Commented] (HDFS-14836) FileIoProvider should not increase FileIoErrors metric in datanode volume metric
[ https://issues.apache.org/jira/browse/HDFS-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932097#comment-16932097 ]

Aiphago commented on HDFS-14836:

Hi [~jojochuang], I ran TestDFSUpgradeWithHA locally with the patch, and the JUnit tests pass.

> FileIoProvider should not increase FileIoErrors metric in datanode volume
> metric
>
>                 Key: HDFS-14836
>                 URL: https://issues.apache.org/jira/browse/HDFS-14836
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Minor
>         Attachments: HDFS-14836-trunk-001.patch, HDFS-14836.patch
>
>
> I found that the FileIoErrors metric increases in
> BlockSender.sendPacket() when fileIoProvider.transferToSocketFully() is used.
> But in https://issues.apache.org/jira/browse/HDFS-2054, exceptions such as
> "Broken pipe" and "Connection reset" are ignored.
> So should fileIoProvider apply a filter when it increases the FileIoErrors count?
[jira] [Commented] (HDFS-14836) FileIoProvider should not increase FileIoErrors metric in datanode volume metric
[ https://issues.apache.org/jira/browse/HDFS-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16929875#comment-16929875 ]

Aiphago commented on HDFS-14836:

Hi [~jojochuang], I have updated the code. Could you help review it again? Thanks very much.

> FileIoProvider should not increase FileIoErrors metric in datanode volume
> metric
>
>                 Key: HDFS-14836
>                 URL: https://issues.apache.org/jira/browse/HDFS-14836
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Minor
>         Attachments: HDFS-14836-trunk-001.patch, HDFS-14836.patch

--
This message was sent by Atlassian Jira (v8.3.2#803003)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-14836) FileIoProvider should not increase FileIoErrors metric in datanode volume metric
[ https://issues.apache.org/jira/browse/HDFS-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-14836:
---
    Attachment: HDFS-14836-trunk-001.patch

> FileIoProvider should not increase FileIoErrors metric in datanode volume
> metric
>
>                 Key: HDFS-14836
>                 URL: https://issues.apache.org/jira/browse/HDFS-14836
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Minor
>         Attachments: HDFS-14836-trunk-001.patch, HDFS-14836.patch
[jira] [Commented] (HDFS-14836) FileIoProvider should not increase FileIoErrors metric in datanode volume metric
[ https://issues.apache.org/jira/browse/HDFS-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16928173#comment-16928173 ]

Aiphago commented on HDFS-14836:

Thanks [~jojochuang] for the comment.
{quote}it would be the best if we can avoid string-matching exception messages. "Broken pipe" and "Connection reset" are usually thrown as a SocketException. Would it make sense to check exception class name instead? Better, catch SocketException and do not call onFailure().
{quote}
If we check only the exception class name, the range may be too broad. I mean that there may be other SocketExceptions that do not match "Broken pipe" or "Connection reset". So why not stay consistent with HDFS-2054?
{quote}for socket related exceptions, you don't want to call onFailure(); however, those exceptions should be re-thrown too. They should not be ignored silently.
{quote}
That's a good suggestion; I will change it later.

> FileIoProvider should not increase FileIoErrors metric in datanode volume
> metric
>
>                 Key: HDFS-14836
>                 URL: https://issues.apache.org/jira/browse/HDFS-14836
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Minor
>         Attachments: HDFS-14836.patch
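The two points discussed in this comment, filtering by message rather than by exception class and re-throwing without counting, could look roughly like the sketch below. It is not the attached patch: FileIoErrorFilterSketch, IoAction, isClientDisconnect, and the counter field are hypothetical names, and matching with startsWith is an assumption about how the messages are compared.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch: count I/O failures except client-side disconnects. */
class FileIoErrorFilterSketch {
    /** Stand-in for the FileIoErrors volume metric (assumption). */
    final AtomicLong fileIoErrors = new AtomicLong();

    /** An I/O operation that may fail, e.g. a transfer to a socket. */
    interface IoAction {
        void run() throws IOException;
    }

    /** True for the two messages named in the thread; the exact matching rule is an assumption. */
    static boolean isClientDisconnect(IOException e) {
        String message = e.getMessage();
        return message != null
            && (message.startsWith("Broken pipe")
                || message.startsWith("Connection reset"));
    }

    /**
     * Runs the action. Every IOException is re-thrown to the caller, but
     * only failures that are not client disconnects increase the counter.
     */
    void transfer(IoAction action) throws IOException {
        try {
            action.run();
        } catch (IOException e) {
            if (!isClientDisconnect(e)) {
                fileIoErrors.incrementAndGet();
            }
            throw e; // never swallowed silently
        }
    }
}
```

Matching on the message keeps other SocketExceptions counted as errors, which is the concern raised above; the trade-off is that the check depends on message text that the JVM or operating system produces.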
[jira] [Updated] (HDFS-14836) FileIoProvider should not increase FileIoErrors metric in datanode volume metric
[ https://issues.apache.org/jira/browse/HDFS-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-14836:
---
    Affects Version/s:     (was: 2.9.1)

> FileIoProvider should not increase FileIoErrors metric in datanode volume
> metric
>
>                 Key: HDFS-14836
>                 URL: https://issues.apache.org/jira/browse/HDFS-14836
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Minor
>         Attachments: HDFS-14836.patch
[jira] [Commented] (HDFS-14836) FileIoProvider should not increase FileIoErrors metric in datanode volume metric
[ https://issues.apache.org/jira/browse/HDFS-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16927338#comment-16927338 ]

Aiphago commented on HDFS-14836:

Hi [~jojochuang], could you help review this one? Thanks very much.

> FileIoProvider should not increase FileIoErrors metric in datanode volume
> metric
>
>                 Key: HDFS-14836
>                 URL: https://issues.apache.org/jira/browse/HDFS-14836
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.1
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Minor
>         Attachments: HDFS-14836.patch
[jira] [Updated] (HDFS-14836) FileIoProvider should not increase FileIoErrors metric in datanode volume metric
[ https://issues.apache.org/jira/browse/HDFS-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-14836:
---
    Attachment: HDFS-14836.patch

> FileIoProvider should not increase FileIoErrors metric in datanode volume
> metric
>
>                 Key: HDFS-14836
>                 URL: https://issues.apache.org/jira/browse/HDFS-14836
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.1
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Minor
>         Attachments: HDFS-14836.patch
[jira] [Commented] (HDFS-14836) FileIoProvider should not increase FileIoErrors metric in datanode volume metric
[ https://issues.apache.org/jira/browse/HDFS-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16926300#comment-16926300 ]

Aiphago commented on HDFS-14836:

Hi [~jojochuang], thanks for your attention. As in HDFS-2054, "Broken pipe" and "Connection reset" are caused by the client rather than the datanode, and the datanode may increment the FileIoErrors counter many times because of these exceptions. So I think it's better to filter them out.

> FileIoProvider should not increase FileIoErrors metric in datanode volume
> metric
>
>                 Key: HDFS-14836
>                 URL: https://issues.apache.org/jira/browse/HDFS-14836
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.1
>            Reporter: Aiphago
>            Assignee: Aiphago
>            Priority: Minor
[jira] [Updated] (HDFS-14836) FileIoProvider should not increase FileIoErrors metric in datanode volume metric
[ https://issues.apache.org/jira/browse/HDFS-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-14836:
---
    Summary: FileIoProvider should not increase FileIoErrors metric in datanode volume metric  (was: FileIoProvider will increase FileIoErrors metric in datanode volume metric)

> FileIoProvider should not increase FileIoErrors metric in datanode volume
> metric
>
>                 Key: HDFS-14836
>                 URL: https://issues.apache.org/jira/browse/HDFS-14836
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.1
>            Reporter: Aiphago
>            Priority: Minor
[jira] [Updated] (HDFS-14836) FileIoProvider will increase FileIoErrors metric in datanode volume metric
[ https://issues.apache.org/jira/browse/HDFS-14836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aiphago updated HDFS-14836:
---
    Description:
I found that the FileIoErrors metric increases in BlockSender.sendPacket() when fileIoProvider.transferToSocketFully() is used. But in https://issues.apache.org/jira/browse/HDFS-2054, exceptions such as "Broken pipe" and "Connection reset" are ignored.
So should fileIoProvider apply a filter when it increases the FileIoErrors count?

  was: I found that the FileIoErrors metric increases in BlockSender.sendPacket() when fileIoProvider.transferToSocketFully() is used. But in https://issues.apache.org/jira/browse/HDFS-2054, exceptions such as "Broken pipe" and "Connection reset" are ignored. So should fileIoProvider apply a filter when it increases the FileIoErrors count?

> FileIoProvider will increase FileIoErrors metric in datanode volume metric
> --
>
>                 Key: HDFS-14836
>                 URL: https://issues.apache.org/jira/browse/HDFS-14836
>             Project: Hadoop HDFS
>          Issue Type: Bug
>    Affects Versions: 2.9.1
>            Reporter: Aiphago
>            Priority: Minor
[jira] [Created] (HDFS-14836) FileIoProvider will increase FileIoErrors metric in datanode volume metric
Aiphago created HDFS-14836:
--
             Summary: FileIoProvider will increase FileIoErrors metric in datanode volume metric
                 Key: HDFS-14836
                 URL: https://issues.apache.org/jira/browse/HDFS-14836
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.9.1
            Reporter: Aiphago

I found that the FileIoErrors metric increases in BlockSender.sendPacket() when fileIoProvider.transferToSocketFully() is used. But in https://issues.apache.org/jira/browse/HDFS-2054, exceptions such as "Broken pipe" and "Connection reset" are ignored. So should fileIoProvider apply a filter when it increases the FileIoErrors count?