[jira] [Commented] (HDFS-13224) RBF: Resolvers to support mount points across multiple subclusters
[ https://issues.apache.org/jira/browse/HDFS-13224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17689432#comment-17689432 ] Daniel Ma commented on HDFS-13224:
--
[~elgoiri] Could you please share the design doc for this feature? It is not clear to me which scenarios need to span subclusters. Thanks.
> RBF: Resolvers to support mount points across multiple subclusters
> Key: HDFS-13224
> URL: https://issues.apache.org/jira/browse/HDFS-13224
> Project: Hadoop HDFS
> Issue Type: Sub-task
> Reporter: Íñigo Goiri
> Assignee: Íñigo Goiri
> Priority: Major
> Fix For: 3.1.0, 2.10.0, 2.9.1, 3.0.3
> Attachments: HDFS-13224-branch-2.000.patch, HDFS-13224.000.patch, HDFS-13224.001.patch, HDFS-13224.002.patch, HDFS-13224.003.patch, HDFS-13224.004.patch, HDFS-13224.005.patch, HDFS-13224.006.patch, HDFS-13224.007.patch, HDFS-13224.008.patch, HDFS-13224.009.patch, HDFS-13224.010.patch
>
> Currently, a mount point points to a single subcluster. We should be able to spread files in a mount point across subclusters.
--
This message was sent by Atlassian Jira (v8.20.10#820010)
-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
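HDFS-13224 adds resolvers that can spread the files of one mount point over several subclusters, for example by hashing the file name. As a rough illustration of the idea only — this is not the actual Hadoop resolver code, and the class name is invented — a deterministic hash-based choice of destination subcluster might look like:

```java
// Hypothetical sketch (NOT the real HDFS-13224 resolver): pick a destination
// subcluster for a file deterministically by hashing its name, similar in
// spirit to a HASH destination-order policy.
import java.util.List;

public class HashResolverSketch {
    private final List<String> subclusters;

    public HashResolverSketch(List<String> subclusters) {
        this.subclusters = subclusters;
    }

    /** Map a file name to one subcluster; the same name always resolves the same way. */
    public String resolve(String fileName) {
        // floorMod keeps the bucket non-negative even for negative hashCodes.
        int bucket = Math.floorMod(fileName.hashCode(), subclusters.size());
        return subclusters.get(bucket);
    }
}
```

Because the mapping is a pure function of the file name, any Router resolving the same path independently reaches the same subcluster, which is the property a multi-destination mount point needs.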
[jira] [Commented] (HDFS-16115) Asynchronously handling BPServiceActor commands may result in BPServiceActor never failing even when CommandProcessingThread exits with a fatal error
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17651129#comment-17651129 ] Daniel Ma commented on HDFS-16115:
--
[~hemanthboyina] [~brahma] Could you please help to review?
> Asynchronously handling BPServiceActor commands may result in BPServiceActor never failing even when CommandProcessingThread exits with a fatal error
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Affects Versions: 3.3.1
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Critical
> Fix For: 3.3.1
> Attachments: 0001-HDFS-16115.patch
>
> This is an improvement issue with two sub-issues:
> 1. The BPServiceActor thread handles NameNode commands asynchronously (CommandProcessingThread does the actual processing). If an exception or error makes CommandProcessingThread fail and stop, BPServiceActor is not aware of that and keeps enqueuing commands from the NameNode to be handled by CommandProcessingThread, even though that thread is already dead.
> 2. Building on the first sub-issue: if CommandProcessingThread dies from a non-fatal error such as "unable to create new native thread" (caused by too many threads existing in the OS), that error deserves more tolerance than simply shutting the thread down with no automatic recovery, because such errors can often resolve themselves soon:
> {code:java}
> // code placeholder
> 2021-07-02 16:26:02,315 | WARN | Command processor | Exception happened when process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
> at java.lang.Thread.start0(Native Method)
> at java.lang.Thread.start(Thread.java:717)
> at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
> at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
> at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
> at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)
> {code}
> Currently the DataNode's BPServiceActor cannot return to normal even after the non-fatal error is eliminated. Therefore, this patch does two things:
> 1. Add a retry mechanism to the BPServiceActor and CommandProcessingThread threads; the retry count is 5 by default and configurable.
> 2. Add a periodic monitor thread in BPOfferService: if a BPServiceActor thread dies owing to too many non-fatal errors, it is not simply removed from the BPServiceActor list kept in BPOfferService; instead, the monitor thread periodically tries to restart these dead BPServiceActor threads. The interval is also configurable.
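The retry mechanism proposed above can be sketched roughly as follows. This is a hypothetical illustration, not the actual BPServiceActor patch; the class and method names are invented, and the consecutive-failure budget stands in for the configurable retry count (5 by default in the proposal):

```java
// Hypothetical sketch of the proposed idea: instead of dying on the first
// non-fatal error, the command-processing loop tolerates up to maxRetries
// consecutive failures before giving up. Invented names, illustrative only.
public class RetryingCommandProcessor {
    private final int maxRetries;
    private int consecutiveFailures = 0;

    public RetryingCommandProcessor(int maxRetries) {
        this.maxRetries = maxRetries;
    }

    /**
     * Runs one command; returns true if the processor should keep running,
     * false once the consecutive-failure budget is exhausted.
     */
    public boolean processOne(Runnable command) {
        try {
            command.run();
            consecutiveFailures = 0;   // a success resets the failure budget
            return true;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            return consecutiveFailures < maxRetries;
        }
    }
}
```

The key design point is that only *consecutive* failures count: a transient "unable to create new native thread" spike that clears up resets the budget on the next successful command, instead of permanently killing the thread.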
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16871:
-
Description:
The DiskBalancer process reads DataNode hostnames as lowercase,
!screenshot-1.png!
but there is no case normalization in getNodeByName.
!screenshot-2.png!
For a DataNode with a lowercase hostname everything works. But for a DataNode with an uppercase hostname, when the DiskBalancer process tries to migrate data on it, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

was:
The DiskBalancer process reads DataNode hostnames as lowercase,
!screenshot-1.png!
but there is no case normalization in getNodeByName.
!screenshot-2.png!
For a DataNode with a lowercase hostname everything works. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

> DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
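The fix direction described above — normalize hostname case so lookups match regardless of capitalization — can be sketched as below. This is an illustrative toy, not the actual DiskBalancer code; the class and method names are invented:

```java
// Hypothetical sketch of the fix idea: normalize hostnames to lower case on
// both insert and lookup, so a getNodeByName-style resolution is
// case-insensitive. Invented names; not the real DiskBalancer implementation.
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

public class NodeRegistrySketch {
    private final Map<String, String> nodesByName = new HashMap<>();

    public void addNode(String hostname, String nodeUuid) {
        // Store under the lowercased hostname, mirroring how DiskBalancer
        // reads hostnames in lowercase.
        nodesByName.put(hostname.toLowerCase(Locale.ROOT), nodeUuid);
    }

    /** Lookup normalizes the same way, so "node-group-1YlRf0002" still matches. */
    public String getNodeByName(String hostname) {
        return nodesByName.get(hostname.toLowerCase(Locale.ROOT));
    }
}
```

With both sides normalized, the "Unable to find the specified node" failure for mixed-case hostnames like `node-group-1YlRf0002` cannot occur.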
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16871:
-
Attachment: screenshot-2.png
> DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
>
> The DiskBalancer process reads DataNode hostnames as lowercase,
> !screenshot-1.png!
> but there is no case normalization in getNodeByName.
> For a DataNode with a lowercase hostname everything works. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
> {code:java}
> 2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
> {code}
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16871:
-
Description:
The DiskBalancer process reads DataNode hostnames as lowercase,
!screenshot-1.png!
but there is no case normalization in getNodeByName.
!screenshot-2.png!
For a DataNode with a lowercase hostname everything works. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

was:
The DiskBalancer process reads DataNode hostnames as lowercase,
!screenshot-1.png!
but there is no case normalization in getNodeByName.
For a DataNode with a lowercase hostname everything works. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

> DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png, screenshot-2.png
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16871:
-
Description:
The DiskBalancer process reads DataNode hostnames as lowercase,
!screenshot-1.png!
but there is no case normalization in getNodeByName.
For a DataNode with a lowercase hostname everything works. But for a DataNode with an uppercase hostname, an IllegalArgumentException is thrown as below:
{code:java}
2022-10-09 16:15:26,631 ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException: Unable to find the specified node. node-group-1YlRf0002
{code}

> DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png
[jira] [Assigned] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma reassigned HDFS-16871:
-
Assignee: Daniel Ma
> DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png
[jira] [Created] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
Daniel Ma created HDFS-16871:
Summary: DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
Key: HDFS-16871
URL: https://issues.apache.org/jira/browse/HDFS-16871
Project: Hadoop HDFS
Issue Type: Bug
Reporter: Daniel Ma
Attachments: screenshot-1.png
[jira] [Updated] (HDFS-16871) DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
[ https://issues.apache.org/jira/browse/HDFS-16871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16871:
-
Attachment: screenshot-1.png
> DiskBalancer process may throw IllegalArgumentException when the target DataNode hostname contains uppercase letters
> Key: HDFS-16871
> URL: https://issues.apache.org/jira/browse/HDFS-16871
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
> Attachments: screenshot-1.png
[jira] [Updated] (HDFS-16870) Client IP should also be recorded when NameNode is processing reportBadBlocks
[ https://issues.apache.org/jira/browse/HDFS-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16870:
-
Description:
There are two scenarios involved in reportBadBlocks:
1. The HDFS client reports a bad block to the NameNode when the block size is inconsistent with its meta;
2. A DataNode reports a bad block to the NameNode via heartbeat if a replica stored on the DataNode is corrupted or modified.
Currently, when the NameNode processes a reportBadBlocks RPC request, only the DataNode address is recorded in the log message. The client IP should also be recorded to distinguish where the report came from, which is very useful for troubleshooting.

was:
There are two scenarios involved in reportBadBlocks:
1. The HDFS client reports a bad block to the NameNode when the block size is inconsistent with its meta;
2. A DataNode reports a bad block to the NameNode via heartbeat if a replica stored on the DataNode is corrupted or modified.
Currently, when the NameNode processes a reportBadBlocks RPC request, only the DataNode IP is recorded in the log message. The client IP should also be recorded to distinguish where the report came from, which is very useful for troubleshooting.

> Client IP should also be recorded when NameNode is processing reportBadBlocks
> Key: HDFS-16870
> URL: https://issues.apache.org/jira/browse/HDFS-16870
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Daniel Ma
> Priority: Trivial
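The logging improvement proposed above amounts to adding the caller's address to the existing message. A minimal hedged sketch follows, with an invented message format and class name (this is not the actual NameNode log line):

```java
// Hypothetical sketch of the HDFS-16870 idea: a reportBadBlocks log message
// that carries both the DataNode holding the replica and the client that
// reported it, so operators can tell the two report paths apart.
public class BadBlockLogSketch {
    public static String format(String blockId, String datanodeAddr, String clientIp) {
        // Including the reporter distinguishes client-side size-mismatch
        // reports from DataNode heartbeat corruption reports.
        return String.format(
            "Reported bad block %s on datanode %s (reported by client %s)",
            blockId, datanodeAddr, clientIp);
    }
}
```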
[jira] [Updated] (HDFS-16870) Client IP should also be recorded when NameNode is processing reportBadBlocks
[ https://issues.apache.org/jira/browse/HDFS-16870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16870:
-
Description:
There are two scenarios involved in reportBadBlocks:
1. The HDFS client reports a bad block to the NameNode when the block size is inconsistent with its meta;
2. A DataNode reports a bad block to the NameNode via heartbeat if a replica stored on the DataNode is corrupted or modified.
Currently, when the NameNode processes a reportBadBlocks RPC request, only the DataNode IP is recorded in the log message. The client IP should also be recorded to distinguish where the report came from, which is very useful for troubleshooting.

> Client IP should also be recorded when NameNode is processing reportBadBlocks
> Key: HDFS-16870
> URL: https://issues.apache.org/jira/browse/HDFS-16870
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: Daniel Ma
> Priority: Trivial
[jira] [Created] (HDFS-16870) Client IP should also be recorded when NameNode is processing reportBadBlocks
Daniel Ma created HDFS-16870:
Summary: Client IP should also be recorded when NameNode is processing reportBadBlocks
Key: HDFS-16870
URL: https://issues.apache.org/jira/browse/HDFS-16870
Project: Hadoop HDFS
Issue Type: Improvement
Reporter: Daniel Ma
[jira] [Updated] (HDFS-16869) NameNode fails to start owing to a zero-size clientId recorded in the edit log
[ https://issues.apache.org/jira/browse/HDFS-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16869:
-
Description:
We first encountered this issue on 3.3.1 while upgrading from 3.1.1 to 3.3.1; it can cause NameNode startup to fail, though only occasionally, not every time. The root cause of the zero-size clientId has still not been found after long-term investigation, so we add a protective check to exclude zero-size clientIds from being added to the cache.

was:
The root cause of the zero-size clientId has still not been found, so we add a protective check to exclude zero-size clientIds from being added to the cache.

> NameNode fails to start owing to a zero-size clientId recorded in the edit log
> Key: HDFS-16869
> URL: https://issues.apache.org/jira/browse/HDFS-16869
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.3.1, 3.3.2, 3.3.3, 3.3.4
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
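The protective check described above can be sketched as a simple guard in front of the cache insert. These are hypothetical names, not the actual edit-log replay or retry-cache code:

```java
// Hypothetical sketch of the HDFS-16869 guard: a zero-size clientId read
// from the edit log is rejected instead of being cached, so a malformed
// entry cannot break NameNode startup later. Invented names, illustrative only.
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class RetryCacheSketch {
    private final Map<String, Integer> cache = new HashMap<>();

    /** Returns true if the entry was cached, false if the clientId was rejected. */
    public boolean addCacheEntry(byte[] clientId, int callId) {
        if (clientId == null || clientId.length == 0) {
            return false;  // guard: skip malformed (zero-size) clientIds
        }
        cache.put(new String(clientId, StandardCharsets.UTF_8), callId);
        return true;
    }
}
```

Since the root cause of the empty clientId is unknown, a defensive reject-and-continue is the conservative choice: it trades a lost retry-cache entry for a NameNode that still starts.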
[jira] [Updated] (HDFS-16869) NameNode fails to start owing to a zero-size clientId recorded in the edit log
[ https://issues.apache.org/jira/browse/HDFS-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16869:
-
Description:
The root cause of the zero-size clientId has still not been found, so we add a protective check to exclude zero-size clientIds from being added to the cache.

was:
The DelegationTokenRenewer timeout feature may cause high CPU utilization and an object leak.
1. If the YARN cluster is idle, i.e. almost no token-renewal events are triggered, the DelegationTokenRenewerPoolTracker thread does nothing but loop, causing high CPU utilization.
2. Renewal events are held in a map named futures that has no removal logic, so the map grows without bound over time.

> NameNode fails to start owing to a zero-size clientId recorded in the edit log
> Key: HDFS-16869
> URL: https://issues.apache.org/jira/browse/HDFS-16869
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.3.1, 3.3.2, 3.3.3, 3.3.4
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
[jira] [Updated] (HDFS-16869) NameNode fails to start owing to a zero-size clientId recorded in the edit log
[ https://issues.apache.org/jira/browse/HDFS-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16869:
-
Issue Type: Bug (was: Improvement)
> NameNode fails to start owing to a zero-size clientId recorded in the edit log
> Key: HDFS-16869
> URL: https://issues.apache.org/jira/browse/HDFS-16869
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.3.1, 3.3.2, 3.3.3, 3.3.4
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
>
> The DelegationTokenRenewer timeout feature may cause high CPU utilization and an object leak.
> 1. If the YARN cluster is idle, i.e. almost no token-renewal events are triggered, the DelegationTokenRenewerPoolTracker thread does nothing but loop, causing high CPU utilization.
> 2. Renewal events are held in a map named futures that has no removal logic, so the map grows without bound over time.
[jira] [Updated] (HDFS-16869) NameNode fails to start owing to a zero-size clientId recorded in the edit log
[ https://issues.apache.org/jira/browse/HDFS-16869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16869:
-
Summary: NameNode fails to start owing to a zero-size clientId recorded in the edit log (was: DelegationTokenRenewer timeout feature may cause high CPU utilization and an object leak)
> NameNode fails to start owing to a zero-size clientId recorded in the edit log
> Key: HDFS-16869
> URL: https://issues.apache.org/jira/browse/HDFS-16869
> Project: Hadoop HDFS
> Issue Type: Improvement
> Affects Versions: 3.3.1, 3.3.2, 3.3.3, 3.3.4
> Reporter: Daniel Ma
> Assignee: Daniel Ma
> Priority: Major
>
> The DelegationTokenRenewer timeout feature may cause high CPU utilization and an object leak.
> 1. If the YARN cluster is idle, i.e. almost no token-renewal events are triggered, the DelegationTokenRenewerPoolTracker thread does nothing but loop, causing high CPU utilization.
> 2. Renewal events are held in a map named futures that has no removal logic, so the map grows without bound over time.
[jira] [Created] (HDFS-16869) DelegationTokenRenewer timeout feature may cause high CPU utilization and an object leak
Daniel Ma created HDFS-16869:
Summary: DelegationTokenRenewer timeout feature may cause high CPU utilization and an object leak
Key: HDFS-16869
URL: https://issues.apache.org/jira/browse/HDFS-16869
Project: Hadoop HDFS
Issue Type: Improvement
Affects Versions: 3.3.4, 3.3.3, 3.3.2, 3.3.1
Reporter: Daniel Ma
Assignee: Daniel Ma

The DelegationTokenRenewer timeout feature may cause high CPU utilization and an object leak.
1. If the YARN cluster is idle, i.e. almost no token-renewal events are triggered, the DelegationTokenRenewerPoolTracker thread does nothing but loop, causing high CPU utilization.
2. Renewal events are held in a map named futures that has no removal logic, so the map grows without bound over time.
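The futures-map leak described in point 2 above is commonly fixed by pruning completed entries from the tracking map. A hedged sketch with invented names follows (this is not the actual DelegationTokenRenewer code):

```java
// Hypothetical sketch of the leak fix: finished renewer futures are removed
// from the tracking map instead of accumulating forever. Invented names.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Future;

public class FutureTrackerSketch {
    private final Map<String, Future<?>> futures = new ConcurrentHashMap<>();

    public void track(String eventId, Future<?> f) {
        futures.put(eventId, f);
    }

    /** Drop every finished future; returns how many entries remain tracked. */
    public int pruneCompleted() {
        futures.values().removeIf(Future::isDone);
        return futures.size();
    }
}
```

Calling such a prune from the tracker's periodic pass (rather than busy-looping) would address both halves of the report: the map stops growing, and the tracker thread has real work gated by a schedule instead of a tight loop.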
[jira] [Comment Edited] (HDFS-16060) There is an inconsistency between replicas of DataNodes when hardware is abnormal
[ https://issues.apache.org/jira/browse/HDFS-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535888#comment-17535888 ] Daniel Ma edited comment on HDFS-16060 at 5/12/22 6:27 AM:
--
[~ferhui] Thanks for your report; I have encountered a similar issue. There are some points I would like to confirm:
1. Does the write operation succeed when the exception appears in the DataNode log?
2. Does the read operation also fail?

was (Author: daniel ma):
[~ferhui] Thanks for your report; I have encountered a similar issue. There are some points I would like to confirm:
1. Does the write operation succeed even with the exception in the DataNode log?
2. Does the read operation also fail?

> There is an inconsistency between replicas of DataNodes when hardware is abnormal
> Key: HDFS-16060
> URL: https://issues.apache.org/jira/browse/HDFS-16060
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.4.0
> Reporter: Hui Fei
> Priority: Major
>
> We found the following case in a production environment:
> * replicas of the same block are stored on dn1 and dn2
> * the replicas on dn1 and dn2 are different
> * verifying meta & data for the replica succeeds on dn1, and likewise on dn2
> The user code is just copyFromLocal.
> Some error logs were found on the DataNode first:
> {quote}
> 2021-05-27 04:54:20,471 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Checksum error in block BP-1453431581-x.x.x.x-1531302155027:blk_13892199285_12902824176 from /y.y.y.y:47960
> org.apache.hadoop.fs.ChecksumException: Checksum error: DFSClient_NONMAPREDUCE_-1760730985_129 at 0 exp: 37939694 got: -1180138774
> at org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native Method)
> at org.apache.hadoop.util.NativeCrc32.verifyChunkedSumsByteArray(NativeCrc32.java:69)
> at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:347)
> at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:294)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:438)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:582)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:885)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
> at java.lang.Thread.run(Thread.java:748)
> {quote}
> After this, a new pipeline is created and then wrong data and meta are written to the disk file.
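The ChecksumException in the log above comes from comparing an expected checksum sent with the packet against one recomputed over the received chunk (the "exp: ... got: ..." pair). As a simplified illustration only — HDFS actually uses CRC32C with native code paths, and this toy uses plain `java.util.zip.CRC32` with invented names:

```java
// Simplified sketch of per-chunk verification: recompute a CRC over the
// received bytes and compare it with the checksum shipped alongside them.
// Mirrors the exp-vs-got comparison in the ChecksumException message above.
import java.util.zip.CRC32;

public class ChunkVerifierSketch {
    public static long checksum(byte[] chunk) {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        return crc.getValue();
    }

    /** Returns false exactly when a ChecksumException-style mismatch occurs. */
    public static boolean verify(byte[] chunk, long expected) {
        return checksum(chunk) == expected;
    }
}
```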
[jira] [Comment Edited] (HDFS-16060) There is an inconsistency between replicas of DataNodes when hardware is abnormal
[ https://issues.apache.org/jira/browse/HDFS-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535888#comment-17535888 ] Daniel Ma edited comment on HDFS-16060 at 5/12/22 6:22 AM:
--
[~ferhui] Thanks for your report; I have encountered a similar issue. There are some points I would like to confirm:
1. Does the write operation succeed even with the exception in the DataNode log?
2. Does the read operation also fail?

was (Author: daniel ma):
[~ferhui] Thanks for your report; I have encountered a similar issue. There are some points I would like to confirm:
1. Does the write operation succeed even with the exception in the DataNode log?
2. Does the read operation fail owing to inconsistent replica data and meta?

> There is an inconsistency between replicas of DataNodes when hardware is abnormal
> Key: HDFS-16060
> URL: https://issues.apache.org/jira/browse/HDFS-16060
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 3.4.0
> Reporter: Hui Fei
> Priority: Major
>
> We found the following case in a production environment:
> * replicas of the same block are stored on dn1 and dn2
> * the replicas on dn1 and dn2 are different
> * verifying meta & data for the replica succeeds on dn1, and likewise on dn2
> The user code is just copyFromLocal.
> Some error logs were found on the DataNode first:
> {quote}
> 2021-05-27 04:54:20,471 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Checksum error in block BP-1453431581-x.x.x.x-1531302155027:blk_13892199285_12902824176 from /y.y.y.y:47960
> org.apache.hadoop.fs.ChecksumException: Checksum error: DFSClient_NONMAPREDUCE_-1760730985_129 at 0 exp: 37939694 got: -1180138774
> at org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native Method)
> at org.apache.hadoop.util.NativeCrc32.verifyChunkedSumsByteArray(NativeCrc32.java:69)
> at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:347)
> at org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:294)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:438)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:582)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:885)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137)
> at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253)
> at java.lang.Thread.run(Thread.java:748)
> {quote}
> After this, a new pipeline is created and then wrong data and meta are written to the disk file.
[jira] [Commented] (HDFS-16060) There is an inconsistency between replicas of datanodes when hardware is abnormal
[ https://issues.apache.org/jira/browse/HDFS-16060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535888#comment-17535888 ] Daniel Ma commented on HDFS-16060: -- [~ferhui] Thanks for your report; I have encountered a similar issue. There are some doubts I want to confirm: 1- Does the write operation succeed even with the exception info in the Datanode log? 2- Does the read operation fail owing to an inconsistency between the replica data and its meta? > There is an inconsistency between replicas of datanodes when hardware is > abnormal > > > Key: HDFS-16060 > URL: https://issues.apache.org/jira/browse/HDFS-16060 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.4.0 >Reporter: Hui Fei >Priority: Major > > We found the following case in a production environment: > * replicas of the same block are stored on dn1 and dn2 > * the replicas on dn1 and dn2 are different > * meta & data verify successfully for the replica on dn1, and the same on dn2 > The user code is just copyFromLocal. > Some error logs were found on the datanode at first: > {quote} > 2021-05-27 04:54:20,471 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: > Checksum error in block > BP-1453431581-x.x.x.x-1531302155027:blk_13892199285_12902824176 from > /y.y.y.y:47960 > org.apache.hadoop.fs.ChecksumException: Checksum error: > DFSClient_NONMAPREDUCE_-1760730985_129 at 0 exp: 37939694 got: -1180138774 > at > org.apache.hadoop.util.NativeCrc32.nativeComputeChunkedSumsByteArray(Native > Method) > at > org.apache.hadoop.util.NativeCrc32.verifyChunkedSumsByteArray(NativeCrc32.java:69) > at > org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:347) > at > org.apache.hadoop.util.DataChecksum.verifyChunkedSums(DataChecksum.java:294) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.verifyChunks(BlockReceiver.java:438) > at > org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:582) > at > 
org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:885) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:801) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opWriteBlock(Receiver.java:137) > at > org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:74) > at > org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:253) > at java.lang.Thread.run(Thread.java:748) > {quote} > After this, a new pipeline is created and then the wrong data and meta are written to the > disk file. -- This message was sent by Atlassian Jira (v8.20.7#820007) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17377326#comment-17377326 ] Daniel Ma commented on HDFS-15796: -- [~sodonnell] Yes, the NameNode will exit when the exception happens, which results in an unexpected NameNode switchover > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Assignee: Daniel Ma >Priority: Critical > Attachments: HDFS-15796-0001.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16115) Asynchronous handling of the BPServiceActor command mechanism may result in BPServiceActor never failing even when CommandProcessingThread exits with a fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma reassigned HDFS-16115: Assignee: Daniel Ma > Asynchronous handling of the BPServiceActor command mechanism may result in > BPServiceActor never failing even when CommandProcessingThread exits with a fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Assignee: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > This is an improvement issue. It actually has two sub-issues: > 1- The BPServiceActor thread handles commands from the NameNode asynchronously (CommandProcessingThread handles the commands), so if any exception or error in CommandProcessingThread causes that thread to fail and stop, BPServiceActor cannot become aware of it and keeps putting commands from the NameNode into the queue to be handled by CommandProcessingThread, even though CommandProcessingThread is already dead. 
> 2- The second sub-issue builds on the first: if CommandProcessingThread died > owing to a non-fatal error such as "unable to create new native thread", which > is caused by too many threads existing in the OS, this kind of problem should be > given much more tolerance instead of simply shutting down the thread with no > automatic recovery, because the non-fatal errors mentioned above > can probably recover soon by themselves. > {code:java} > // code placeholder > 2021-07-02 16:26:02,315 | WARN | Command processor | Exception happened when > process queue BPServiceActor.java:1393 > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382) > at > 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365) > {code} > Currently, the DataNode's BPServiceActor cannot return to normal even when the > non-fatal error is eliminated. > Therefore, in this patch, two things will be done: > 1- Add a retry mechanism to the BPServiceActor and CommandProcessingThread > threads, with a retry limit that is 5 by default and configurable; > 2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor > thread dies owing to too many non-fatal errors, it should not be simply > removed from the BPServiceActor list held by BPOfferService; instead, > the monitor thread will periodically try to restart these dead > BPServiceActor threads. The interval is also configurable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
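The bounded-retry idea described above can be sketched roughly as follows. This is an illustrative sketch only, not code from the attached patch: the names `CommandRetrySketch`, `isNonFatal`, `runWithRetries`, and `MAX_NON_FATAL_RETRIES` are hypothetical, and the non-fatal error messages are the two quoted in this issue.

```java
// Hypothetical sketch: a worker loop that tolerates transient OS-level
// errors up to a configurable limit before letting the failure propagate.
public class CommandRetrySketch {
    static final int MAX_NON_FATAL_RETRIES = 5; // "5 by default and configurable"

    // Matches the two transient OS-limit errors quoted in this issue.
    static boolean isNonFatal(Throwable t) {
        String msg = t.getMessage();
        return msg != null
                && (msg.contains("unable to create new native thread")
                    || msg.contains("Too many open files"));
    }

    // Runs the task, retrying on non-fatal errors; rethrows anything else.
    static void runWithRetries(Runnable task) {
        int failures = 0;
        while (true) {
            try {
                task.run();
                return;
            } catch (Error e) {
                if (isNonFatal(e) && ++failures <= MAX_NON_FATAL_RETRIES) {
                    continue; // transient OS limit: retry instead of dying
                }
                throw e; // fatal error or retries exhausted
            }
        }
    }
}
```

The real patch would apply this classification inside CommandProcessingThread's queue loop rather than around a single task, but the retry-budget logic is the same.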
[jira] [Commented] (HDFS-16117) Add file count info in audit log to record the file count for delete and getListing RPC requests to assist user troubleshooting when RPC cost is increasing
[ https://issues.apache.org/jira/browse/HDFS-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376938#comment-17376938 ] Daniel Ma commented on HDFS-16117: -- Hello [~sodonnell], could you please help review this patch? > Add file count info in audit log to record the file count for delete and > getListing RPC requests to assist user troubleshooting when RPC cost is > increasing > > > Key: HDFS-16117 > URL: https://issues.apache.org/jira/browse/HDFS-16117 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Assignee: Daniel Ma >Priority: Major > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16117.patch > > > Currently, there is no file count in the audit log for delete and getListing RPC > requests; therefore, when RPC cost increases, it is not easy to figure > out whether a time-consuming RPC request is related to too many files > being operated on in that request. > > Therefore, it is necessary to add file count info to the audit log to assist > maintenance -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
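A minimal illustration of the kind of audit line the proposal describes is sketched below. The `fileCount` field name and the tab-separated format are assumptions for illustration only, not the format used by the attached patch.

```java
// Hypothetical audit-line formatter: appends a fileCount field so that
// slow delete/getListing calls can be correlated with the number of
// files they touched.
public class AuditLineSketch {
    static String auditLine(String cmd, String src, long fileCount) {
        // Tab-separated key=value pairs, mirroring the NameNode audit-log style.
        return String.format("cmd=%s\tsrc=%s\tfileCount=%d", cmd, src, fileCount);
    }
}
```

With such a field, an operator seeing a spike in delete latency could check whether the slow calls carry unusually large `fileCount` values before looking elsewhere.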
[jira] [Assigned] (HDFS-16117) Add file count info in audit log to record the file count for delete and getListing RPC requests to assist user troubleshooting when RPC cost is increasing
[ https://issues.apache.org/jira/browse/HDFS-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma reassigned HDFS-16117: Assignee: Daniel Ma > Add file count info in audit log to record the file count for delete and > getListing RPC requests to assist user troubleshooting when RPC cost is > increasing > > > Key: HDFS-16117 > URL: https://issues.apache.org/jira/browse/HDFS-16117 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Assignee: Daniel Ma >Priority: Major > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16117.patch > > > Currently, there is no file count in the audit log for delete and getListing RPC > requests; therefore, when RPC cost increases, it is not easy to figure > out whether a time-consuming RPC request is related to too many files > being operated on in that request. > > Therefore, it is necessary to add file count info to the audit log to assist > maintenance -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-16093) DataNodes under decommission will still be returned to the client via getLocatedBlocks, so the client may request decommissioning datanodes to read which will cause badl
[ https://issues.apache.org/jira/browse/HDFS-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma reassigned HDFS-16093: Assignee: Daniel Ma > DataNodes under decommission will still be returned to the client via > getLocatedBlocks, so the client may request decommissioning datanodes to read, > which will cause heavy contention on disk I/O. > -- > > Key: HDFS-16093 > URL: https://issues.apache.org/jira/browse/HDFS-16093 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Assignee: Daniel Ma >Priority: Critical > > DataNodes under decommission will still be returned to the client via > getLocatedBlocks, so the client may request decommissioning datanodes to read, > which will cause heavy contention on disk I/O. > Therefore, datanodes under decommission should be removed from the return > list of the getLocatedBlocks API. > !image-2021-06-29-10-50-44-739.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
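The proposed filtering can be sketched as below. `DnStub` is a simplified stand-in for `DatanodeInfo`, and the fall-back behavior when every replica is on a decommissioning node is an assumption of this sketch, not something stated in the issue; the real change would live in the NameNode's block-location path.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: drop replicas hosted on nodes under decommission from the
// located-block reply, falling back to the full list when no other
// replicas remain so the client still has somewhere to read from.
public class LocatedReplicaFilterSketch {
    static class DnStub {
        final String name;
        final boolean decommissioning;
        DnStub(String name, boolean decommissioning) {
            this.name = name;
            this.decommissioning = decommissioning;
        }
    }

    static List<DnStub> filterDecommissioning(List<DnStub> locations) {
        List<DnStub> preferred = new ArrayList<>();
        for (DnStub d : locations) {
            if (!d.decommissioning) {
                preferred.add(d);
            }
        }
        // If every replica is on a decommissioning node, return them all;
        // otherwise serve only the non-decommissioning replicas.
        return preferred.isEmpty() ? locations : preferred;
    }
}
```

An alternative with the same effect would be to keep the decommissioning replicas but sort them last, which is closer to how the NameNode already deprioritizes stale nodes.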
[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376586#comment-17376586 ] Daniel Ma edited comment on HDFS-15796 at 7/7/21, 1:52 PM: --- [~sodonnell] Yes, it is more elegant.(y) was (Author: daniel ma): [~sodonnell] Yes, it is more elegant. > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Attachments: HDFS-15796-0001.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-15796: - Attachment: HDFS-15796-0001.patch > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Attachments: HDFS-15796-0001.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-15796: - Attachment: (was: 0002-HDFS-15796.patch) > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-15796: - Attachment: (was: 0003-HDFS-15796.patch) > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-15796: - Attachment: (was: 0001-HDFS-15796.patch) > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376586#comment-17376586 ] Daniel Ma commented on HDFS-15796: -- [~sodonnell] Yes, it is more elegant. > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Attachments: 0001-HDFS-15796.patch, 0002-HDFS-15796.patch, > 0003-HDFS-15796.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-15796: - Attachment: 0003-HDFS-15796.patch > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Attachments: 0001-HDFS-15796.patch, 0002-HDFS-15796.patch, > 0003-HDFS-15796.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16115) Asynchronous handling of the BPServiceActor command mechanism may result in BPServiceActor never failing even when CommandProcessingThread exits with a fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376571#comment-17376571 ] Daniel Ma commented on HDFS-16115: -- Hello [~hexiaoqiao], thanks for the review. 1- For non-fatal errors, I define two kinds at present: these two errors are caused by too many threads in the OS and too many open files in the OS, both of which can probably recover soon. Even if the OS limit cannot be recovered proactively, users expect that HDFS can recover automatically after manual intervention. {code:java} // code placeholder enum NON_FATAL_TYPES { THREAD_EXCEED("unable to create new native thread"), FILE_EXCEED("Too many open files"); private final String errorMsg; NON_FATAL_TYPES(String errorMsg) { this.errorMsg = errorMsg; } public String getErrorMsg() { return errorMsg; } } {code} 2- The main defect of HDFS-15651 is that the BPServiceActor thread never comes back to normal unless the DataNode is restarted, which real users report is unacceptable in a production environment. That is why I developed this feature. > Asynchronous handling of the BPServiceActor command mechanism may result in > BPServiceActor never failing even when CommandProcessingThread exits with a fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > This is an improvement issue. It actually has two sub-issues: > 1- The BPServiceActor thread handles commands from the NameNode asynchronously (CommandProcessingThread handles the commands), so if any exception or error in CommandProcessingThread causes that thread to fail and stop, BPServiceActor cannot become aware of it and keeps putting commands from the NameNode into the queue to be handled by CommandProcessingThread, even though CommandProcessingThread is already dead. 
> 2- The second sub-issue builds on the first: if CommandProcessingThread died > owing to a non-fatal error such as "unable to create new native thread", which > is caused by too many threads existing in the OS, this kind of problem should be > given much more tolerance instead of simply shutting down the thread with no > automatic recovery, because the non-fatal errors mentioned above > can probably recover soon by themselves. > {code:java} > // code placeholder > 2021-07-02 16:26:02,315 | WARN | Command processor | Exception happened when > process queue BPServiceActor.java:1393 > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382) > at > 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365) > {code} > Currently, the DataNode's BPServiceActor cannot return to normal even when the > non-fatal error is eliminated. > Therefore, in this patch, two things will be done: > 1- Add a retry mechanism to the BPServiceActor and CommandProcessingThread > threads, with a retry limit that is 5 by default and configurable; > 2- Add a periodic monitor thread in BPOfferService: if a BPServiceActor > thread dies owing to too many non-fatal errors, it should
[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376429#comment-17376429 ] Daniel Ma edited comment on HDFS-15796 at 7/7/21, 11:57 AM: [~sodonnell] Yes, I also think that is a better way: {code:java} // A better approach may be to return a new ArrayList from getTargets, e.g.: {code} The patch is updated; please help review. was (Author: daniel ma): [~sodonnell] Yes, I totally agree with the solution you raised. I will work on it and update the patch > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Attachments: 0001-HDFS-15796.patch, 0002-HDFS-15796.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
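The fix agreed on in this thread (returning a new ArrayList from getTargets so the RedundancyMonitor iterates a snapshot) can be illustrated with a minimal sketch. The class and field names here are simplified stand-ins, not the actual BlockManager code.

```java
import java.util.ArrayList;
import java.util.List;

// Returning a copy from the getter means the iterating thread walks a
// snapshot, so later mutation of the underlying list can no longer raise
// ConcurrentModificationException in that thread.
public class TargetsSnapshotSketch {
    private final List<String> targets = new ArrayList<>();

    synchronized void addTarget(String datanode) {
        targets.add(datanode);
    }

    // Defensive copy: callers get an independent snapshot to iterate.
    synchronized List<String> getTargets() {
        return new ArrayList<>(targets);
    }
}
```

The trade-off is an extra allocation per call, which is usually acceptable for short lists like block targets; the alternative of synchronizing every iteration over the shared list would hold the lock far longer.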
[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-15796: - Attachment: 0002-HDFS-15796.patch > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Attachments: 0001-HDFS-15796.patch, 0002-HDFS-15796.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376429#comment-17376429 ] Daniel Ma commented on HDFS-15796: -- [~sodonnell] yes, I totally agree with the solution you raised. I will work on it, and update the patch > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Attachments: 0001-HDFS-15796.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16117) Add file count info in audit log to record the file count for delete and getListing RPC request to assist user trouble shooting when RPC cost is increasing
[ https://issues.apache.org/jira/browse/HDFS-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16117: - Summary: Add file count info in audit log to record the file count for delete and getListing RPC request to assist user trouble shooting when RPC cost is increasing (was: Add file count info in audit log to record the file count count in delete and getListing RPC request to assist user trouble shooting when RPC cost is increasing ) > Add file count info in audit log to record the file count for delete and > getListing RPC request to assist user trouble shooting when RPC cost is > increasing > > > Key: HDFS-16117 > URL: https://issues.apache.org/jira/browse/HDFS-16117 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Major > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16117.patch > > > Currently, there is no file count in audit log for delete and getListing RPC > request, therefore, for the increasing RPC call, it is not easy to configure > it out whether the time-consuming RPC request is related to too many files > be operated in the RPC request. > > Therefore, It it necessary to add file count info in the audit log to assist > maintenance -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16117) Add file count info in audit log to record the file count count in delete and getListing RPC request to assist user trouble shooting when RPC cost is increasing
[ https://issues.apache.org/jira/browse/HDFS-16117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16117: - Attachment: 0001-HDFS-16117.patch > Add file count info in audit log to record the file count count in delete and > getListing RPC request to assist user trouble shooting when RPC cost is > increasing > - > > Key: HDFS-16117 > URL: https://issues.apache.org/jira/browse/HDFS-16117 > Project: Hadoop HDFS > Issue Type: Improvement > Components: namenode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Major > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16117.patch > > > Currently, there is no file count in audit log for delete and getListing RPC > request, therefore, for the increasing RPC call, it is not easy to configure > it out whether the time-consuming RPC request is related to too many files > be operated in the RPC request. > > Therefore, It it necessary to add file count info in the audit log to assist > maintenance -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16117) Add file count info in audit log to record the file count count in delete and getListing RPC request to assist user trouble shooting when RPC cost is increasing
Daniel Ma created HDFS-16117: Summary: Add file count info in audit log to record the file count count in delete and getListing RPC request to assist user trouble shooting when RPC cost is increasing Key: HDFS-16117 URL: https://issues.apache.org/jira/browse/HDFS-16117 Project: Hadoop HDFS Issue Type: Improvement Components: namenode Affects Versions: 3.3.1 Reporter: Daniel Ma Fix For: 3.3.1 Currently, there is no file count in the audit log for delete and getListing RPC requests; therefore, when RPC cost is increasing, it is not easy to figure out whether a time-consuming RPC request is related to too many files being operated on in that request. Therefore, it is necessary to add file count info to the audit log to assist maintenance -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
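As a rough illustration of the proposal, an audit entry could carry one extra field with the number of files the operation touched; the field name fileCount and the layout below are hypothetical, not the actual patch:

```java
// Hypothetical audit-line builder showing where a file count could be
// appended for delete/getListing operations (field names are illustrative).
class AuditLine {
    static String format(String user, String cmd, String src, int fileCount) {
        StringBuilder sb = new StringBuilder();
        sb.append("ugi=").append(user)
          .append("\tcmd=").append(cmd)
          .append("\tsrc=").append(src);
        // The extra field: how many files the RPC actually touched, so a
        // slow delete/getListing can be correlated with its size.
        sb.append("\tfileCount=").append(fileCount);
        return sb.toString();
    }
}
```

With such a field, an operator seeing a slow delete in the audit log can immediately tell whether it removed three files or three million.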
[jira] [Updated] (HDFS-16094) HDFS balancer process start failed owing to daemon pid file not being cleared in some exception scenarios
[ https://issues.apache.org/jira/browse/HDFS-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16094: - Target Version/s: 3.1.1 (was: 3.4.0) > HDFS balancer process start failed owing to daemon pid file not being cleared in some exception scenarios > > > Key: HDFS-16094 > URL: https://issues.apache.org/jira/browse/HDFS-16094 > Project: Hadoop HDFS > Issue Type: Improvement > Components: scripts >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Major > > The HDFS balancer process fails to start because the daemon pid file is not cleared in some exception scenarios, and there is no useful information in the log to troubleshoot it, as below. > {code:java} > // code placeholder > hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}") > {code} > but actually, the process is not running, contrary to what the error message above claims. > Therefore, more explicit information should be printed in the error log to tell users to clear the pid file and where the pid file is located. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
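The underlying check the script could perform is whether the pid recorded in the file still belongs to a live process; a rough sketch of that logic follows (in Java for illustration, though the real daemon scripts are shell; assumes Java 11+ for ProcessHandle and Files.readString):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of stale-pid-file detection: if the pid file exists but no live
// process has that pid, the caller can report "stale pid file, please remove
// <path>" instead of the misleading "is running as process N" message.
class PidFileCheck {
    static boolean isStale(Path pidFile) throws IOException {
        if (!Files.exists(pidFile)) {
            return false; // no pid file at all, nothing to clean up
        }
        long pid;
        try {
            pid = Long.parseLong(Files.readString(pidFile).trim());
        } catch (NumberFormatException e) {
            return true; // unparsable content counts as stale
        }
        if (pid <= 0) {
            return true; // never a valid live pid
        }
        // ProcessHandle.of returns an empty Optional when no process has
        // that pid, so an absent or dead process means the file is stale.
        return ProcessHandle.of(pid).map(h -> !h.isAlive()).orElse(true);
    }
}
```

A stale result would let the startup path print the pid file's location and a removal hint rather than a false "already running" error.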
[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376146#comment-17376146 ] Daniel Ma edited comment on HDFS-16115 at 7/7/21, 2:01 AM: --- [~hexiaoqiao] Thanks for your review and tips. Actually the issue my patch attempts to solve is totally different from the one HDFS-15651 addresses. I had noticed that Jira previously, but it cannot solve my issue completely. What I try to solve in this patch is: 1-Once the CommandProcessingThread catches a non-fatal error or exception, it is retried up to 5 times instead of being interrupted immediately, and after the max retry count is reached, we stop the corresponding BPServiceActor thread as well. HDFS-15651 simply closes the thread no matter what kind of error occurred, but many non-fatal errors, such as the "cannot create new native thread" error, can probably recover automatically once the OS thread count drops; meanwhile the BPServiceActor service stays dead and cannot recover by itself. 2-In my patch, for non-fatal errors, BPOfferService runs a periodic thread that tries to recover BPServiceActor threads that died owing to a non-fatal error, which is the essential difference between our patch and HDFS-15651 was (Author: daniel ma): [~hexiaoqiao] Thanks for your review and tips. Actually the issue that my patch attempt to solve is totally different from the one HDFS-15651 mentioned. I have noticed this Jira perviously, but it can not solve my issue perfectly. What I try to solve in this patch is : 1-Once CommandProcess thread is dead, we need to stop the corresponding BPServiceActor thread as well. 
2-In my patch, for the non-fatal error, BPOfferService thread always running a periodical thread to try to recover the BPServiceActor thread that is dead owing to non-fatal error, which is the essential difference between our patch and HDFS-15651 > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. Actually the issue has two sub issues: > 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( > CommandProcessThread handle commands ), so if there are any exceptions or > errors happen in thread CommandProcessthread resulting the thread fails and > stop, of which BPServiceActor cannot aware and still keep putting commands > from namenode into queues waiting to be handled by CommandProcessThread, > actually CommandProcessThread was dead already. 
> 2-the second sub issue is based on the first one, if CommandProcessThread was > dead owing to some non-fatal errors like "can not create native thread" which > is caused by too many threads existed in OS, this kind of problem should be > given much more torlerance instead of simply shudown the thread and never > recover automatically, because the non-fatal errors mentioned above probably > can be recovered soon by itself, > {code:java} > //代码占位符 > 2021-07-02 16:26:02,315 | WARN | Command processor | Exception happened when > process queue BPServiceActor.java:1393 > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698) > at >
[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375401#comment-17375401 ] Daniel Ma edited comment on HDFS-16115 at 7/7/21, 1:47 AM: --- Hello [~brahmareddy],[~hemant] Could you pls help to review this patch. thanks. was (Author: daniel ma): Hello [~brahmareddy],[~hemant] Pls help to review this patch. thanks. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. Actually the issue has two sub issues: > 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( > CommandProcessThread handle commands ), so if there are any exceptions or > errors happen in thread CommandProcessthread resulting the thread fails and > stop, of which BPServiceActor cannot aware and still keep putting commands > from namenode into queues waiting to be handled by CommandProcessThread, > actually CommandProcessThread was dead already. 
> 2-the second sub issue is based on the first one, if CommandProcessThread was > dead owing to some non-fatal errors like "can not create native thread" which > is caused by too many threads existed in OS, this kind of problem should be > given much more torlerance instead of simply shudown the thread and never > recover automatically, because the non-fatal errors mentioned above probably > can be recovered soon by itself, > {code:java} > //代码占位符 > 2021-07-02 16:26:02,315 | WARN | Command processor | Exception happened when > process queue BPServiceActor.java:1393 > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382) > at > 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365) > {code} > currently, Datanode BPServiceActor cannot turn to normal even when the > non-fatal error was eliminated. > Therefore, in this patch, two things will be done: > 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread > thread which is 5 by default and configurable; > 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor > thread is dead owing to too many times non-fatal error, it should not be > simply removed from BPServviceActor lists stored in BPOfferService, instead, > the monitor thread will periodically try to start these special dead > BPServiceActor thread. the interval is also configurable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375401#comment-17375401 ] Daniel Ma edited comment on HDFS-16115 at 7/7/21, 1:46 AM: --- Hello [~brahmareddy],[~hemant] Pls help to review this patch. thanks. was (Author: daniel ma): Hello [~brahmareddy], [~ayush] Pls help to review this patch. thanks. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. Actually the issue has two sub issues: > 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( > CommandProcessThread handle commands ), so if there are any exceptions or > errors happen in thread CommandProcessthread resulting the thread fails and > stop, of which BPServiceActor cannot aware and still keep putting commands > from namenode into queues waiting to be handled by CommandProcessThread, > actually CommandProcessThread was dead already. 
> 2-the second sub issue is based on the first one, if CommandProcessThread was > dead owing to some non-fatal errors like "can not create native thread" which > is caused by too many threads existed in OS, this kind of problem should be > given much more torlerance instead of simply shudown the thread and never > recover automatically, because the non-fatal errors mentioned above probably > can be recovered soon by itself, > {code:java} > //代码占位符 > 2021-07-02 16:26:02,315 | WARN | Command processor | Exception happened when > process queue BPServiceActor.java:1393 > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382) > at > 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365) > {code} > currently, Datanode BPServiceActor cannot turn to normal even when the > non-fatal error was eliminated. > Therefore, in this patch, two things will be done: > 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread > thread which is 5 by default and configurable; > 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor > thread is dead owing to too many times non-fatal error, it should not be > simply removed from BPServviceActor lists stored in BPOfferService, instead, > the monitor thread will periodically try to start these special dead > BPServiceActor thread. the interval is also configurable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
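The bounded-retry behaviour the patch describes (5 retries by default before the processor gives up, leaving recovery to a monitor) can be sketched in simplified form; this is a hypothetical model, not the actual BPServiceActor code, and the names and defaults are illustrative only:

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Simplified model of the proposed behaviour: non-fatal failures while
// processing queued commands are tolerated up to a configurable limit
// instead of killing the processor on the first error.
class CommandProcessor {
    interface Command { void run() throws Exception; }

    private final Queue<Command> queue = new ConcurrentLinkedQueue<>();
    private final int maxRetries; // the patch proposes 5 by default
    int failures = 0;
    boolean dead = false;

    CommandProcessor(int maxRetries) { this.maxRetries = maxRetries; }

    void enqueue(Command c) { queue.add(c); }

    // Drain the queue; tolerate up to maxRetries consecutive failures.
    void drain() {
        Command c;
        while (!dead && (c = queue.poll()) != null) {
            try {
                c.run();
                failures = 0; // reset the counter after a success
            } catch (Exception e) {
                queue.add(c); // re-queue the failed command for a retry
                if (++failures >= maxRetries) {
                    // Give up; in the patch's design a periodic monitor in
                    // BPOfferService would later try to restart the actor.
                    dead = true;
                }
            }
        }
    }
}
```

The key point the sketch captures is that a transient failure no longer ends processing immediately, while a persistent one still produces a detectable dead state that a supervisor can act on.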
[jira] [Commented] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17376146#comment-17376146 ] Daniel Ma commented on HDFS-16115: -- [~hexiaoqiao] Thanks for your review and tips. Actually the issue that my patch attempt to solve is totally different from the one HDFS-15651 mentioned. I have noticed this Jira perviously, but it can not solve my issue perfectly. What I try to solve in this patch is : 1-Once CommandProcess thread is dead, we need to stop the corresponding BPServiceActor thread as well. 2-In my patch, for the non-fatal error, BPOfferService thread always running a periodical thread to try to recover the BPServiceActor thread that is dead owing to non-fatal error, which is the essential difference between our patch and HDFS-15651 > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. Actually the issue has two sub issues: > 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( > CommandProcessThread handle commands ), so if there are any exceptions or > errors happen in thread CommandProcessthread resulting the thread fails and > stop, of which BPServiceActor cannot aware and still keep putting commands > from namenode into queues waiting to be handled by CommandProcessThread, > actually CommandProcessThread was dead already. 
> 2-the second sub issue is based on the first one, if CommandProcessThread was > dead owing to some non-fatal errors like "can not create native thread" which > is caused by too many threads existed in OS, this kind of problem should be > given much more torlerance instead of simply shudown the thread and never > recover automatically, because the non-fatal errors mentioned above probably > can be recovered soon by itself, > {code:java} > //代码占位符 > 2021-07-02 16:26:02,315 | WARN | Command processor | Exception happened when > process queue BPServiceActor.java:1393 > java.lang.OutOfMemoryError: unable to create new native thread > at java.lang.Thread.start0(Native Method) > at java.lang.Thread.start(Thread.java:717) > at > java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) > at > java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382) > at > 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365) > {code} > currently, Datanode BPServiceActor cannot turn to normal even when the > non-fatal error was eliminated. > Therefore, in this patch, two things will be done: > 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread > thread which is 5 by default and configurable; > 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor > thread is dead owing to too many times non-fatal error, it should not be > simply removed from BPServviceActor lists stored in BPOfferService, instead, > the monitor thread will periodically try to start these special dead > BPServiceActor thread. the interval is also configurable. -- This message was sent by
[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16115: - Description: It is an improvement issue. Actually the issue has two sub issues: 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( CommandProcessThread handle commands ), so if there are any exceptions or errors happen in thread CommandProcessthread resulting the thread fails and stop, of which BPServiceActor cannot aware and still keep putting commands from namenode into queues waiting to be handled by CommandProcessThread, actually CommandProcessThread was dead already. 2-the second sub issue is based on the first one, if CommandProcessThread was dead owing to some non-fatal errors like "can not create native thread" which is caused by too many threads existed in OS, this kind of problem should be given much more torlerance instead of simply shudown the thread and never recover automatically, because the non-fatal errors mentioned above probably can be recovered soon by itself, {code:java} //代码占位符 2021-07-02 16:26:02,315 | WARN | Command processor | Exception happened when process queue BPServiceActor.java:1393 java.lang.OutOfMemoryError: unable to create new native thread at java.lang.Thread.start0(Native Method) at java.lang.Thread.start(Thread.java:717) at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957) at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237) at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365) {code} currently, Datanode BPServiceActor cannot turn to normal even when the non-fatal error was eliminated. Therefore, in this patch, two things will be done: 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread which is 5 by default and configurable; 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread is dead owing to too many times non-fatal error, it should not be simply removed from BPServviceActor lists stored in BPOfferService, instead, the monitor thread will periodically try to start these special dead BPServiceActor thread. the interval is also configurable. was: It is an improvement issue. Actually the issue has two sub issues: 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( CommandProcessThread handle commands ), so if there are any exceptions or errors happen in thread CommandProcessthread resulting the thread fails and stop, of which BPServiceActor cannot aware and still keep putting commands from namenode into queues waiting to be handled by CommandProcessThread, actually CommandProcessThread was dead already. 
2-the second sub issue is based on the first one, if CommandProcessThread was dead owing to some non-fatal errors like "can not create native thread" which is caused by too many threads existed in OS, this kind of problem should be given much more torlerance instead of simply shudown the thread and never recover automatically, because the non-fatal errors mentioned above probably can be recovered soon by itself, currently, Datanode BPServiceActor cannot turn to normal even when the non-fatal error was eliminated. Therefore, in this patch, two things will be done: 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread which is 5 by default and configurable; 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread is dead owing to too many times non-fatal error, it should not be simply removed from BPServviceActor lists stored in BPOfferService, instead, the monitor thread will periodically try to start these special
[jira] [Commented] (HDFS-16098) ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException
[ https://issues.apache.org/jira/browse/HDFS-16098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375427#comment-17375427 ] Daniel Ma commented on HDFS-16098: -- [~wangyanfu] Thanks for reporting this issue. Could you pls share more details about the error stack? > ERROR tools.DiskBalancerCLI: java.lang.IllegalArgumentException > --- > > Key: HDFS-16098 > URL: https://issues.apache.org/jira/browse/HDFS-16098 > Project: Hadoop HDFS > Issue Type: Bug > Components: diskbalancer >Affects Versions: 2.6.0 > Environment: VERSION info: > Hadoop 2.6.0-cdh5.14.4 >Reporter: wangyanfu >Priority: Blocker > Labels: diskbalancer > Fix For: 2.6.0 > > Attachments: image-2021-07-01-18-34-54-905.png, on-branch-3.1.jpg > > Original Estimate: 504h > Remaining Estimate: 504h > > When I tried to run > hdfs diskbalancer -plan $(hostname -f) > > > > I got this notice: > 21/06/30 11:30:41 ERROR tools.DiskBalancerCLI: > java.lang.IllegalArgumentException > > Then I tried writing the real hostname into my command; it did not work and gave the same error notice. > I also tried using --plan instead of -plan; it did not work and gave the same error notice. > I found this > [link|https://community.cloudera.com/t5/Support-Questions/Error-trying-to-balance-disks-on-node/m-p/59989#M54850] > but there is no resolution there. Can somebody help me? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375421#comment-17375421 ] Daniel Ma edited comment on HDFS-15796 at 7/6/21, 9:57 AM: --- [~sodonnell] Thanks for reviewing. Actually you missed the for loop here:
{code:java}
// code placeholder
synchronized (pendingReconstruction) {
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }
}
{code}
The problem happens when the code above iterates over the DataNodes stored in the pendingReconstruction object while the DataNode list is also being modified elsewhere. In other words, if you modify a List (delete or add an element) while traversing it at the same time, a ConcurrentModificationException will be thrown.

was (Author: daniel ma): [~sodonnell] Thanks for reviewing. Actually you missed the for loop here:
{code:java}
// code placeholder
synchronized (pendingReconstruction) {
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }
}
{code}
The problem happens when the code above iterates over the DataNodes stored in the pendingReconstruction object while the DataNode list is also being modified elsewhere. In other words, if you modify a List (delete or add an element) while traversing it at the same time, a ConcurrentModificationException will be thrown.
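The fail-fast iterator behavior described above can be reproduced in isolation (a minimal standalone sketch; the CmeDemo class and the list contents are illustrative, not HDFS code):

{code:java}
// Illustrative sketch only, not HDFS code: demonstrates the fail-fast
// iterator throwing ConcurrentModificationException when the list is
// structurally modified during traversal.
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {
    static boolean triggersCme() {
        List<String> targets = new ArrayList<>();
        targets.add("dn1");
        targets.add("dn2");
        targets.add("dn3");
        try {
            for (String dn : targets) {
                // Simulates another code path removing an element while
                // the for-each loop's iterator is still walking the list.
                targets.remove(dn);
            }
        } catch (ConcurrentModificationException e) {
            return true;  // the iterator detected the concurrent change
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("CME thrown: " + triggersCme());
    }
}
{code}

This is why the loop in BlockManager must not run while another thread mutates the same targets list, even inside a synchronized block, unless both sides synchronize on the same lock.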
> ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Attachments: 0001-HDFS-15796.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375421#comment-17375421 ] Daniel Ma commented on HDFS-15796: -- [~sodonnell] Thanks for reviewing. Actually you missed the for loop here:
{code:java}
// code placeholder
synchronized (pendingReconstruction) {
  List<DatanodeStorageInfo> targets = pendingReconstruction
      .getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }
}
{code}
The problem happens when the code above iterates over the DataNodes stored in the pendingReconstruction object while the DataNode list is also being modified elsewhere. In other words, if you modify a List (delete or add an element) while traversing it at the same time, a ConcurrentModificationException will be thrown. > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Attachments: 0001-HDFS-15796.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. 
| BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16115: - Description: It is an improvement issue. Actually the issue has two sub-issues: 1- The BPServiceActor thread handles commands from the NameNode asynchronously (CommandProcessingThread handles the commands), so if any exception or error occurs in CommandProcessingThread and the thread fails and stops, BPServiceActor is not aware of it and keeps putting commands from the NameNode into queues waiting to be handled by CommandProcessingThread, even though CommandProcessingThread is already dead. 2- The second sub-issue builds on the first: if CommandProcessingThread died owing to a non-fatal error such as "can not create native thread" (caused by too many threads existing in the OS), this kind of problem should be given more tolerance instead of simply shutting the thread down with no automatic recovery, because such non-fatal errors can often resolve themselves soon. Currently, the DataNode's BPServiceActor cannot return to normal even after the non-fatal error is eliminated. Therefore, this patch does two things: 1-Add a retry mechanism to the BPServiceActor and CommandProcessingThread threads, with a retry count that is 5 by default and configurable; 2-Add a periodic monitor thread to BPOfferService: if a BPServiceActor thread dies after too many non-fatal errors, it should not simply be removed from the BPServiceActor list stored in BPOfferService; instead, the monitor thread periodically tries to restart these dead BPServiceActor threads. The interval is also configurable. was: It is an improvement issue. 
Actually the issue has two sub-issues: 1- The BPServiceActor thread handles commands from the NameNode asynchronously (CommandProcessingThread handles the commands), so if any exception or error occurs in CommandProcessingThread and the thread fails and stops, BPServiceActor is not aware of it and keeps putting commands from the NameNode into queues waiting to be handled by CommandProcessingThread, even though CommandProcessingThread is already dead. 2- The second sub-issue builds on the first: if CommandProcessingThread died owing to a non-fatal error such as "can not create native thread" (caused by too many threads existing in the OS), this kind of problem should be given more tolerance instead of simply shutting the thread down with no automatic recovery, because such non-fatal errors can often resolve themselves soon. Currently, the DataNode's BPServiceActor cannot return to normal even after the non-fatal error is eliminated. Therefore, this patch does two things: 1-Add a retry mechanism to the BPServiceActor and CommandProcessingThread threads, with a retry count that is 5 by default and configurable; 2-Add a periodic monitor thread to BPOfferService: if a BPServiceActor thread dies after too many non-fatal errors, it should not simply be removed from the BPServiceActor list stored in BPOfferService; instead, the monitor thread periodically tries to restart these dead BPServiceActor threads. The interval is also configurable. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. 
Actually the issue has two sub-issues: > 1- The BPServiceActor thread handles commands from the NameNode asynchronously > (CommandProcessingThread handles the commands), so if any exception or error > occurs in CommandProcessingThread and the thread fails and stops, BPServiceActor > is not aware of it and keeps putting commands from the NameNode into queues > waiting to be handled by CommandProcessingThread, even though > CommandProcessingThread is already dead. > 2- The second sub-issue builds on the first: if CommandProcessingThread died > owing to a non-fatal error such as "can not create native thread" (caused by too > many threads existing in the OS), this kind of problem should be given more > tolerance instead of simply shutting the thread down with no automatic recovery, > because such non-fatal errors can often resolve themselves soon. > Currently, the DataNode's BPServiceActor
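The second part of the plan above (keep dead actors in the BPOfferService list and periodically try to restart them) could look roughly like the following. This is a hedged standalone sketch, not the actual BPOfferService change; ActorMonitor, Actor, and sweep are hypothetical names:

{code:java}
// Illustrative sketch only, not the actual BPOfferService code; the
// ActorMonitor class and Actor interface are hypothetical names.
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ActorMonitor {
    // Minimal stand-in for a BPServiceActor-like worker thread.
    public interface Actor {
        boolean isAlive();
        void restart();
    }

    private final List<Actor> actors = new CopyOnWriteArrayList<>();
    private final ScheduledExecutorService scheduler =
        Executors.newSingleThreadScheduledExecutor();

    public void register(Actor actor) {
        actors.add(actor);
    }

    // One pass: dead actors are NOT removed from the list;
    // they are restarted instead.
    public void sweep() {
        for (Actor a : actors) {
            if (!a.isAlive()) {
                a.restart();
            }
        }
    }

    // Runs sweep() periodically; the interval is configurable.
    public void start(long intervalMillis) {
        scheduler.scheduleWithFixedDelay(
            this::sweep, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
{code}

CopyOnWriteArrayList is used here so the periodic sweep can iterate safely while other threads register actors, which is the same iterate-while-modifying hazard discussed in HDFS-15796.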
[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16115: - Description: It is an improvement issue. Actually the issue has two sub issues: 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( CommandProcessThread handle commands ), so if there are any exceptions or errors happen in thread CommandProcessthread resulting the thread fails and stop, of which BPServiceActor cannot aware and still keep putting commands from namenode into queues waiting to be handled by CommandProcessThread, actually CommandProcessThread was dead already. 2-the second sub issue is based on the first one, if CommandProcessThread was dead owing to some non-fatal errors like "can not create native thread" which is caused by too many threads existed in OS, this kind of problem should be given much more torlerance instead of simply shudown the thread and never recover automatically, because the non-fatal errors mentioned above probably can be recovered soon by itself, currently, Datanode BPServiceActor cannot turn to normal even when the non-fatal error was eliminated. Therefore, in this patch, two things will be done: 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread which is 5 by default and configurable; 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread is dead owing to too many times non-fatal error, it should not be simply removed from BPServviceActor lists stored in BPOfferService, instead, the monitor thread will periodically try to start these special dead BPService Actor thread. the interval is also configurable. was: It is an improvement issue. 
Actually the issue has two sub issues: 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( CommandProcessThread handle commands ), so if there are any exceptions or errors happen in thread CommandProcessthread resulting the thread fails and stop, of which BPServiceActor cannot aware and still keep putting commands from namenode into queues waiting to be handled by CommandProcessThread, actually CommandProcessThread was dead already. 2-the second sub issue is based on the first one, if CommandProcessThread was dead owing to some non-fatal errors like "can not create native thread" which is caused by too many threads existed in OS, this kind of problem should be given much more torlerance instead of simply shudown the thread and never recover automatically, because the non-fatal errors mentioned above probably can be recovered soon by itself, currently, Datanode BPServiceActor cannot turn to normal even when the non-fatal error was eliminated. Therefore, in this patch, two things will be done: 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread which is 5 by default and configurable; 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread is dead owing to too many times non-fatal error, it should not be simply remove from BPServviceActor lists stored in BPOfferService, instead, the monitor thread will periodically try to start these special dead BPService Actor thread. the interval is also configurable. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. 
Actually the issue has two sub issues: > 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( > CommandProcessThread handle commands ), so if there are any exceptions or > errors happen in thread CommandProcessthread resulting the thread fails and > stop, of which BPServiceActor cannot aware and still keep putting commands > from namenode into queues waiting to be handled by CommandProcessThread, > actually CommandProcessThread was dead already. > 2-the second sub issue is based on the first one, if CommandProcessThread was > dead owing to some non-fatal errors like "can not create native thread" which > is caused by too many threads existed in OS, this kind of problem should be > given much more torlerance instead of simply shudown the thread and never > recover automatically, because the non-fatal errors mentioned above probably > can be recovered soon by itself, > currently, Datanode BPServiceActor
[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16115: - Description: It is an improvement issue. Actually the issue has two sub issues: 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( CommandProcessThread handle commands ), so if there are any exceptions or errors happen in thread CommandProcessthread resulting the thread fails and stop, of which BPServiceActor cannot aware and still keep putting commands from namenode into queues waiting to be handled by CommandProcessThread, actually CommandProcessThread was dead already. 2-the second sub issue is based on the first one, if CommandProcessThread was dead owing to some non-fatal errors like "can not create native thread" which is caused by too many threads existed in OS, this kind of problem should be given much more torlerance instead of simply shudown the thread and never recover automatically, because the non-fatal errors mentioned above probably can be recovered soon by itself, currently, Datanode BPServiceActor cannot turn to normal even when the non-fatal error was eliminated. Therefore, in this patch, two things will be done: 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread which is 5 by default and configurable; 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread is dead owing to too many times non-fatal error, it should not be simply remove from BPServviceActor lists stored in BPOfferService, instead, the monitor thread will periodically try to start these special dead BPService Actor thread. the interval is also configurable. was: It is an improvement issue. 
Actually the issue has two sub issues: 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( CommandProcessThread handle commands ), so if there are any exceptions or errors happen in thread CommandProcessthread resulting the thread fails and stop, of which BPServiceActor cannot aware and still keep putting commands from namenode into queues waiting to be handled by CommandProcessThread, actually CommandProcessThread was dead already. 2-the second sub issue is based on the first one, if CommandProcessThread was dead owing to some non-fatal errors like "can not create native thread" which is caused by too many threads existed in OS, this kind of problem should be given much more torlerance instead of simply shudown the thread and never recover automatically, because the non-fatal errors mentioned above probably can be recovered soon by itself, currently, Datanode BPServiceActor cannot turn to normal even when the non-fatal error was eliminated. Therefor, in this patch, two things was be done: 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread which is 5 by default and configurable; 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread is dead owing to too many times non-fatal error, it should not be simply remove from BPServviceActor lists stored in BPOfferService, instead, the monitor thread will periodically try to start these special dead BPService Actor thread. the interval is also configurable. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. 
Actually the issue has two sub issues: > 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( > CommandProcessThread handle commands ), so if there are any exceptions or > errors happen in thread CommandProcessthread resulting the thread fails and > stop, of which BPServiceActor cannot aware and still keep putting commands > from namenode into queues waiting to be handled by CommandProcessThread, > actually CommandProcessThread was dead already. > 2-the second sub issue is based on the first one, if CommandProcessThread was > dead owing to some non-fatal errors like "can not create native thread" which > is caused by too many threads existed in OS, this kind of problem should be > given much more torlerance instead of simply shudown the thread and never > recover automatically, because the non-fatal errors mentioned above probably > can be recovered soon by itself, > currently, Datanode BPServiceActor
[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16115: - Description: It is an improvement issue. Actually the issue has two sub issues: 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( CommandProcessThread handle commands ), so if there are any exceptions or errors happen in thread CommandProcessthread resulting the thread fails and stop, of which BPServiceActor cannot aware and still keep putting commands from namenode into queues waiting to be handled by CommandProcessThread, actually CommandProcessThread was dead already. 2-the second sub issue is based on the first one, if CommandProcessThread was dead owing to some non-fatal errors like "can not create native thread" which is caused by too many threads existed in OS, this kind of problem should be given much torlerance instead of simply shudown the thread and never recover automatically, because the non-fatal errors mentioned above probably can be recovered soon by itself, currently, Datanode BPServiceActor cannot turn to normal even when the non-fatal error was eliminated. Therefor, in this patch, two things was be done: 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread which is 5 by default and configurable; 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread is dead owing to too many times non-fatal error, it should not be simply remove from BPServviceActor lists stored in BPOfferService, instead, the monitor thread will periodically try to start these special dead BPService Actor thread. the interval is also configurable. was: It is an improvement issue. 
Actually the issue has two sub issues: 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( CommandProcessThread handle commands ), so if there are any exceptions or errors happen in thread CommandProcessthread resulting the thread fails and stop, of which BPServiceActor cannot aware and still keep putting commands from namenode into queues waiting to be handled by CommandProcessThread, actually CommandProcessThread was dead already. 2-the second sub issue is based on the first one, if CommandProcessThread was dead owing to some non-fatal errors like "can not create native thread" which is caused by too many threads existed on the node, this kind of problem should be given much torlerance instead of simply shudown the thread and never recover automatically, because the non-fatal errors mentioned above probably can be recovered soon by itself, currently, Datanode BPServiceActor cannot turn to normal even when the non-fatal error was eliminated. Therefor, in this patch, two things was be done: 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread which is 5 by default and configurable; 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread is dead owing to too many times non-fatal error, it should not be simply remove from BPServviceActor lists stored in BPOfferService, instead, the monitor thread will periodically try to start these special dead BPService Actor thread. the interval is also configurable. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. 
Actually the issue has two sub issues: > 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( > CommandProcessThread handle commands ), so if there are any exceptions or > errors happen in thread CommandProcessthread resulting the thread fails and > stop, of which BPServiceActor cannot aware and still keep putting commands > from namenode into queues waiting to be handled by CommandProcessThread, > actually CommandProcessThread was dead already. > 2-the second sub issue is based on the first one, if CommandProcessThread was > dead owing to some non-fatal errors like "can not create native thread" which > is caused by too many threads existed in OS, this kind of problem should be > given much torlerance instead of simply shudown the thread and never recover > automatically, because the non-fatal errors mentioned above probably can be > recovered soon by itself, > currently, Datanode BPServiceActor cannot turn to
[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16115: - Description: It is an improvement issue. Actually the issue has two sub issues: 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( CommandProcessThread handle commands ), so if there are any exceptions or errors happen in thread CommandProcessthread resulting the thread fails and stop, of which BPServiceActor cannot aware and still keep putting commands from namenode into queues waiting to be handled by CommandProcessThread, actually CommandProcessThread was dead already. 2-the second sub issue is based on the first one, if CommandProcessThread was dead owing to some non-fatal errors like "can not create native thread" which is caused by too many threads existed in OS, this kind of problem should be given much more torlerance instead of simply shudown the thread and never recover automatically, because the non-fatal errors mentioned above probably can be recovered soon by itself, currently, Datanode BPServiceActor cannot turn to normal even when the non-fatal error was eliminated. Therefor, in this patch, two things was be done: 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread which is 5 by default and configurable; 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread is dead owing to too many times non-fatal error, it should not be simply remove from BPServviceActor lists stored in BPOfferService, instead, the monitor thread will periodically try to start these special dead BPService Actor thread. the interval is also configurable. was: It is an improvement issue. 
Actually the issue has two sub issues: 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( CommandProcessThread handle commands ), so if there are any exceptions or errors happen in thread CommandProcessthread resulting the thread fails and stop, of which BPServiceActor cannot aware and still keep putting commands from namenode into queues waiting to be handled by CommandProcessThread, actually CommandProcessThread was dead already. 2-the second sub issue is based on the first one, if CommandProcessThread was dead owing to some non-fatal errors like "can not create native thread" which is caused by too many threads existed in OS, this kind of problem should be given much torlerance instead of simply shudown the thread and never recover automatically, because the non-fatal errors mentioned above probably can be recovered soon by itself, currently, Datanode BPServiceActor cannot turn to normal even when the non-fatal error was eliminated. Therefor, in this patch, two things was be done: 1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread which is 5 by default and configurable; 2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread is dead owing to too many times non-fatal error, it should not be simply remove from BPServviceActor lists stored in BPOfferService, instead, the monitor thread will periodically try to start these special dead BPService Actor thread. the interval is also configurable. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. 
Actually the issue has two sub issues: > 1- BPServerActor thread handle commands from NameNode in aysnchronous way ( > CommandProcessThread handle commands ), so if there are any exceptions or > errors happen in thread CommandProcessthread resulting the thread fails and > stop, of which BPServiceActor cannot aware and still keep putting commands > from namenode into queues waiting to be handled by CommandProcessThread, > actually CommandProcessThread was dead already. > 2-the second sub issue is based on the first one, if CommandProcessThread was > dead owing to some non-fatal errors like "can not create native thread" which > is caused by too many threads existed in OS, this kind of problem should be > given much more torlerance instead of simply shudown the thread and never > recover automatically, because the non-fatal errors mentioned above probably > can be recovered soon by itself, > currently, Datanode BPServiceActor cannot
[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16115: - Description: It is an improvement issue. Actually the issue has two sub-issues: 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. 2- The second sub-issue builds on the first: if the CommandProcessingThread died owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors can often resolve themselves soon; currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. Therefore, this patch does two things: 1- Adds a retry mechanism to the BPServiceActor and CommandProcessingThread threads, with a retry limit that is 5 by default and configurable; 2- Adds a periodic monitor thread to BPOfferService: if a BPServiceActor thread dies after too many non-fatal errors, it is not simply removed from the BPServiceActor list stored in BPOfferService; instead, the monitor thread periodically tries to restart these dead BPServiceActor threads. The interval is also configurable. was: It is an improvement issue. 
Actually the issue has two sub-issues: 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. 2- The second sub-issue builds on the first: if the CommandProcessingThread died owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors can often resolve themselves soon; currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. Therefore, this patch does two things: 1- Adds a retry mechanism to the BPServiceActor and CommandProcessingThread threads, with a retry limit that is 5 by default and configurable; 2- Adds a periodic monitor thread to BPOfferService: if a BPServiceActor thread dies after too many non-fatal errors, it is not simply removed from the BPServiceActor list stored in BPOfferService; instead, the monitor thread periodically tries to restart these dead BPServiceActor threads. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. 
Actually the issue has two sub-issues: > 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. > 2- The second sub-issue builds on the first: if the CommandProcessingThread died owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors can often resolve themselves soon; > currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated.
[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16115: - Description: It is an improvement issue. Actually the issue has two sub-issues: 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. 2- The second sub-issue builds on the first: if the CommandProcessingThread died owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors can often resolve themselves soon; currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. Therefore, this patch does two things: 1- Adds a retry mechanism to the BPServiceActor and CommandProcessingThread threads, with a retry limit that is 5 by default and configurable; 2- Adds a periodic monitor thread to BPOfferService: if a BPServiceActor thread dies after too many non-fatal errors, it is not simply removed from the BPServiceActor list stored in BPOfferService; instead, the monitor thread periodically tries to restart these dead BPServiceActor threads. was: It is an improvement issue. 
Actually the issue has two sub-issues: 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. 2- The second sub-issue builds on the first: if the CommandProcessingThread died owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors can often resolve themselves soon; currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. Actually the issue has two sub-issues: > 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. 
> 2- The second sub-issue builds on the first: if the CommandProcessingThread died owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors can often resolve themselves soon; > currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. > Therefore, this patch does two things: > 1- Adds a retry mechanism to the BPServiceActor and CommandProcessingThread threads, with a retry limit that is 5 by default and configurable; > 2- Adds a periodic monitor thread to BPOfferService: if a BPServiceActor thread dies after too many non-fatal errors, it is not simply removed from the BPServiceActor list stored in BPOfferService; instead, the monitor thread periodically tries to restart these dead BPServiceActor threads.
[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16115: - Description: It is an improvement issue. Actually the issue has two sub-issues: 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. 2- The second sub-issue builds on the first: if the CommandProcessingThread died owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors can often resolve themselves soon; currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. was: It is an improvement issue. Actually the issue has two sub-issues: 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. 
2- The second sub-issue builds on the first: if the CommandProcessingThread fails owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors may resolve themselves soon; currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. Actually the issue has two sub-issues: > 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. > 2- The second sub-issue builds on the first: if the CommandProcessingThread died owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors can often resolve themselves soon; > currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
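The bounded-retry mechanism proposed in the patch above can be sketched as follows. This is a minimal, self-contained illustration under stated assumptions: the class `RetryingWorker` and its methods are hypothetical names, not the actual BPServiceActor/CommandProcessingThread code. The worker drains a command queue and tolerates up to a configurable number of consecutive non-fatal failures before declaring itself dead, rather than dying on the first error.

```java
import java.util.Queue;

/**
 * Hypothetical sketch of the retry idea from HDFS-16115: tolerate up to a
 * configurable number of consecutive non-fatal failures (5 by default in
 * the proposed patch) instead of stopping on the first one.
 */
public class RetryingWorker {
    /** One unit of work; throwing models a non-fatal error. */
    public interface Command { void run() throws Exception; }

    private final int maxRetries;
    private int consecutiveFailures = 0;
    private boolean dead = false;

    public RetryingWorker(int maxRetries) { this.maxRetries = maxRetries; }

    /** Drains the queue; returns how many commands were handled successfully. */
    public int drain(Queue<Command> queue) {
        int handled = 0;
        while (!dead && !queue.isEmpty()) {
            Command c = queue.peek();
            try {
                c.run();
                queue.poll();             // remove the command only after success
                handled++;
                consecutiveFailures = 0;  // a success resets the failure counter
            } catch (Exception nonFatal) {
                if (++consecutiveFailures >= maxRetries) {
                    dead = true;          // give up only after repeated failures
                }                         // otherwise loop around and retry
            }
        }
        return handled;
    }

    public boolean isDead() { return dead; }
}
```

A transient failure (for example, a temporary inability to create a native thread) is simply retried on the next loop iteration; only `maxRetries` consecutive failures mark the worker dead, which is the state the proposed BPOfferService monitor thread would then detect and attempt to repair.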
[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375401#comment-17375401 ] Daniel Ma edited comment on HDFS-16115 at 7/6/21, 9:33 AM: --- Hello [~brahmareddy], [~ayush] Pls help to review this patch. thanks. was (Author: daniel ma): [~brahmareddy] [~ayush] Pls help to review this patch. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. Actually the issue has two sub-issues: > 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. > 2- The second sub-issue builds on the first: if the CommandProcessingThread fails owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors may resolve themselves soon; > currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375401#comment-17375401 ] Daniel Ma commented on HDFS-16115: -- [~brahmareddy] [~ayush] Pls help to review this patch. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. Actually the issue has two sub-issues: > 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. > 2- The second sub-issue builds on the first: if the CommandProcessingThread fails owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors may resolve themselves soon; > currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16115: - Description: It is an improvement issue. Actually the issue has two sub-issues: 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. 2- The second sub-issue builds on the first: if the CommandProcessingThread fails owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors may resolve themselves soon; currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. was: It is an improvement issue. 
Actually the issue has two sub-issues: 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread. 2- The second sub-issue builds on the first: if the CommandProcessingThread fails owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors may resolve themselves soon; currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. Actually the issue has two sub-issues: > 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread, even though the CommandProcessingThread is already dead. 
> 2- The second sub-issue builds on the first: if the CommandProcessingThread fails owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors may resolve themselves soon; > currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16115: - Attachment: 0001-HDFS-16115.patch > Asynchronously handle BPServiceActor command mechanism may result in > BPServiceActor never fails even CommandProcessingThread is closed with fatal > error. > > > Key: HDFS-16115 > URL: https://issues.apache.org/jira/browse/HDFS-16115 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.3.1 > > Attachments: 0001-HDFS-16115.patch > > > It is an improvement issue. Actually the issue has two sub-issues: > 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread. > 2- The second sub-issue builds on the first: if the CommandProcessingThread fails owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors may resolve themselves soon; > currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.
Daniel Ma created HDFS-16115: Summary: Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error. Key: HDFS-16115 URL: https://issues.apache.org/jira/browse/HDFS-16115 Project: Hadoop HDFS Issue Type: Improvement Components: datanode Affects Versions: 3.3.1 Reporter: Daniel Ma Fix For: 3.3.1 It is an improvement issue. Actually the issue has two sub-issues: 1- The BPServiceActor thread handles commands from the NameNode asynchronously (the CommandProcessingThread handles the commands), so if any exception or error occurs in the CommandProcessingThread and causes it to fail and stop, the BPServiceActor is not aware of this and keeps putting commands from the NameNode into the queue to be handled by the CommandProcessingThread. 2- The second sub-issue builds on the first: if the CommandProcessingThread fails owing to a non-fatal error such as "cannot create native thread" (caused by too many threads existing on the node), this kind of problem deserves more tolerance than simply shutting down the thread with no automatic recovery, because such non-fatal errors may resolve themselves soon; currently, the DataNode BPServiceActor cannot return to normal even after the non-fatal error has been eliminated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17375331#comment-17375331 ] Daniel Ma commented on HDFS-15796: -- [~weichiu],[~hexiaoqiao] Pls help to review this patch, thanks > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Attachments: 0001-HDFS-15796.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
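The stack trace above is the standard symptom of one thread structurally modifying an ArrayList while another thread (here, the RedundancyMonitor) is iterating it. The sketch below illustrates two common remedies; it is illustrative only, not the actual BlockManager fix: iterate over a defensive snapshot taken under a lock, or use a CopyOnWriteArrayList, whose iterators read a stable snapshot and never throw ConcurrentModificationException.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class SafeIteration {
    /** Reproduces the failure mode: structural modification mid-iteration. */
    static boolean plainListThrows() {
        List<Integer> list = new ArrayList<>(List.of(1, 2, 3));
        try {
            for (Integer v : list) {
                list.remove(v);   // mutating the list while iterating it
            }
            return false;
        } catch (ConcurrentModificationException expected) {
            return true;          // the fail-fast iterator detects the change
        }
    }

    /** Remedy 1: copy under the shared lock, then iterate the private copy. */
    static int sumViaSnapshot(List<Integer> shared) {
        List<Integer> snapshot;
        synchronized (shared) {
            snapshot = new ArrayList<>(shared);  // writers must hold the same lock
        }
        int sum = 0;
        for (int v : snapshot) sum += v;         // nobody else touches the copy
        return sum;
    }

    /** Remedy 2: CopyOnWriteArrayList iterators never throw CME. */
    static int sumViaCow(CopyOnWriteArrayList<Integer> shared) {
        int sum = 0;
        for (int v : shared) sum += v;           // iterates an immutable snapshot
        return sum;
    }
}
```

The trade-off: CopyOnWriteArrayList copies the backing array on every write, so it fits read-mostly lists; an explicit snapshot under a lock is usually cheaper when writes are frequent, as block-reconstruction work lists tend to be.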
[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-15796: - Attachment: 0001-HDFS-15796.patch > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Attachments: 0001-HDFS-15796.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-15796: - Target Version/s: 3.3.1 (was: 3.4.0) > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Attachments: 0001-HDFS-15796.patch > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16094) HDFS balancer process start failed owing to daemon pid file is not cleared in some exception scenario
[ https://issues.apache.org/jira/browse/HDFS-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16094: - Summary: HDFS balancer process start failed owing to daemon pid file is not cleared in some exception scenario (was: HDFS start failed owing to daemon pid file is not cleared in some exception scenario) > HDFS balancer process start failed owing to daemon pid file is not cleared in > some exception scenario > > > Key: HDFS-16094 > URL: https://issues.apache.org/jira/browse/HDFS-16094 > Project: Hadoop HDFS > Issue Type: Improvement > Components: scripts >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Major > > NameNode start failed owing to the daemon pid file not being cleared in some > exception scenarios, but there is no useful information in the log to troubleshoot, > as shown below. > {code:java} > // code placeholder > hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}") > {code} > but actually, the process is not running as the error message above claims. > Therefore, more explicit information should be printed in the error log to > guide users to clear the pid file and to show where the pid file is located. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16094) HDFS balancer process start failed owing to daemon pid file is not cleared in some exception scenario
[ https://issues.apache.org/jira/browse/HDFS-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16094: - Description: The HDFS balancer process failed to start owing to the daemon pid file not being cleared in some exception scenarios, but there is no useful information in the log to troubleshoot, as shown below. {code:java} // code placeholder hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}") {code} but actually, the process is not running as the error message above claims. Therefore, more explicit information should be printed in the error log to guide users to clear the pid file and to show where the pid file is located. was: NameNode start failed owing to the daemon pid file not being cleared in some exception scenarios, but there is no useful information in the log to troubleshoot, as shown below. {code:java} // code placeholder hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}") {code} but actually, the process is not running as the error message above claims. Therefore, more explicit information should be printed in the error log to guide users to clear the pid file and to show where the pid file is located. > HDFS balancer process start failed owing to daemon pid file is not cleared in > some exception scenario > > > Key: HDFS-16094 > URL: https://issues.apache.org/jira/browse/HDFS-16094 > Project: Hadoop HDFS > Issue Type: Improvement > Components: scripts >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Major > > The HDFS balancer process failed to start owing to the daemon pid file not being cleared in > some exception scenarios, but there is no useful information in the log to > troubleshoot, as shown below. > {code:java} > // code placeholder > hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}") > {code} > but actually, the process is not running as the error message above claims. > Therefore, more explicit information should be printed in the error log to > guide users to clear the pid file and to show where the pid file is located. 
> -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16094) HDFS fails to start because the daemon pid file is not cleared in some exception scenarios
[ https://issues.apache.org/jira/browse/HDFS-16094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-16094: - Summary: HDFS fails to start because the daemon pid file is not cleared in some exception scenarios (was: NameNode fails to start because the daemon pid file is not cleared in some exception scenarios) > HDFS fails to start because the daemon pid file is not cleared in some exception > scenarios > --- > > Key: HDFS-16094 > URL: https://issues.apache.org/jira/browse/HDFS-16094 > Project: Hadoop HDFS > Issue Type: Improvement > Components: scripts >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Major > > The NameNode fails to start because the daemon pid file is not cleared in some > exception scenarios, but there is no useful information in the log to troubleshoot > with, as below. > {code:java} > hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}") > {code} > In fact, the process is not running as the error message claims. Therefore, more > explicit information should be printed in the error log to guide users to clear > the pid file and to show where the pid file is located. >
[jira] [Created] (HDFS-16094) NameNode fails to start because the daemon pid file is not cleared in some exception scenarios
Daniel Ma created HDFS-16094: Summary: NameNode fails to start because the daemon pid file is not cleared in some exception scenarios Key: HDFS-16094 URL: https://issues.apache.org/jira/browse/HDFS-16094 Project: Hadoop HDFS Issue Type: Improvement Components: scripts Affects Versions: 3.3.1 Reporter: Daniel Ma The NameNode fails to start because the daemon pid file is not cleared in some exception scenarios, but there is no useful information in the log to troubleshoot with, as below. {code:java} hadoop_error "${daemonname} is running as process $(cat "${daemon_pidfile}") {code} In fact, the process is not running as the error message claims. Therefore, more explicit information should be printed in the error log to guide users to clear the pid file and to show where the pid file is located.
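As an editorial aside, the ambiguous "is running as process N" failure described above could be made actionable by first checking whether the recorded pid actually maps to a live process. The sketch below is illustrative only: the `PidFileCheck` class and its messages are hypothetical and are not part of hadoop-functions.sh. It uses Java's `ProcessHandle` to test pid liveness and tells the user exactly which stale file to remove.

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class PidFileCheck {

    /** Returns true if a process with the given pid is currently running. */
    static boolean isAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    /**
     * Decide whether a daemon may start. If the pidfile names a live process,
     * report it; if it names a dead process, name the exact file to remove
     * instead of the ambiguous "is running as process N" message.
     */
    static String checkPidFile(Path pidFile) {
        if (!Files.exists(pidFile)) {
            return "ok: no pidfile";
        }
        try {
            long pid = Long.parseLong(Files.readString(pidFile).trim());
            if (isAlive(pid)) {
                return "error: daemon already running as pid " + pid;
            }
            return "stale pidfile " + pidFile + " (pid " + pid
                + " is not running); remove it and retry";
        } catch (Exception e) {
            return "error: unreadable pidfile " + pidFile;
        }
    }

    public static void main(String[] args) {
        // The current JVM's own pid is certainly alive.
        System.out.println(isAlive(ProcessHandle.current().pid()));
        System.out.println(checkPidFile(Path.of("/no-such-pidfile-12345")));
    }
}
```

The same two-step check (read pid, verify liveness) could be done in the shell wrapper itself; the point is that the error path distinguishes "daemon really running" from "stale pidfile left behind".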
[jira] [Commented] (HDFS-16093) DataNodes under decommission will still be returned to the client via getLocatedBlocks, so the client may read from decommissioning datanodes, which will cause bad disk I/O contention
[ https://issues.apache.org/jira/browse/HDFS-16093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371015#comment-17371015 ] Daniel Ma commented on HDFS-16093: -- Hi [~tomscut], thanks for your quick reply. The Jira you mentioned can relieve this issue to some extent, but I think only DataNodes in service should be returned to the client. DataNodes in an abnormal state such as DECOMMISSION or MAINTENANCE should be removed from the returned list. > DataNodes under decommission will still be returned to the client via > getLocatedBlocks, so the client may read from decommissioning datanodes, > which will cause bad disk I/O contention. > -- > > Key: HDFS-16093 > URL: https://issues.apache.org/jira/browse/HDFS-16093 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.3.1 >Reporter: Daniel Ma >Priority: Critical > > DataNodes under decommission will still be returned to the client via > getLocatedBlocks, so the client may read from decommissioning datanodes, > which will cause bad disk I/O contention. > Therefore, datanodes under decommission should be removed from the list > returned by the getLocatedBlocks API. > !image-2021-06-29-10-50-44-739.png!
[jira] [Created] (HDFS-16093) DataNodes under decommission will still be returned to the client via getLocatedBlocks, so the client may read from decommissioning datanodes, which will cause bad disk I/O contention
Daniel Ma created HDFS-16093: Summary: DataNodes under decommission will still be returned to the client via getLocatedBlocks, so the client may read from decommissioning datanodes, which will cause bad disk I/O contention. Key: HDFS-16093 URL: https://issues.apache.org/jira/browse/HDFS-16093 Project: Hadoop HDFS Issue Type: Improvement Affects Versions: 3.3.1 Reporter: Daniel Ma DataNodes under decommission will still be returned to the client via getLocatedBlocks, so the client may read from decommissioning datanodes, which will cause bad disk I/O contention. Therefore, datanodes under decommission should be removed from the list returned by the getLocatedBlocks API. !image-2021-06-29-10-50-44-739.png!
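The proposal above, returning only in-service replicas to clients, can be sketched with a simplified model. The `AdminState` enum and `Replica` record below are stand-ins for Hadoop's DatanodeInfo admin states, not the real API:

```java
import java.util.List;
import java.util.stream.Collectors;

public class ReplicaFilter {

    // Simplified stand-in for DatanodeInfo.AdminStates in Hadoop.
    enum AdminState { IN_SERVICE, DECOMMISSION_INPROGRESS, DECOMMISSIONED, IN_MAINTENANCE }

    // Hypothetical replica location: host plus the admin state of its datanode.
    record Replica(String host, AdminState state) {}

    /** Keep only replicas hosted on datanodes that are fully in service. */
    static List<Replica> inServiceOnly(List<Replica> located) {
        return located.stream()
            .filter(r -> r.state() == AdminState.IN_SERVICE)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Replica> located = List.of(
            new Replica("dn1", AdminState.IN_SERVICE),
            new Replica("dn2", AdminState.DECOMMISSION_INPROGRESS),
            new Replica("dn3", AdminState.IN_SERVICE));
        // Only dn1 and dn3 survive the filter.
        System.out.println(inServiceOnly(located).size()); // prints 2
    }
}
```

A real change would also have to keep at least one replica when every location is decommissioning, otherwise the client would be left with nothing to read from; the filter here ignores that edge case for brevity.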
[jira] [Comment Edited] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370995#comment-17370995 ] Daniel Ma edited comment on HDFS-15796 at 6/29/21, 2:36 AM: [~weichiu] No idea what kind of condition can reproduce this problem. It seems the targets object is modified elsewhere while computeReconstructionWorkForBlocks is in progress, owing to a thread-safety issue.
{code:java}
// Step 2: choose target nodes for each reconstruction task
for (BlockReconstructionWork rw : reconWork) {
  // Exclude all of the containing nodes from being targets.
  // This list includes decommissioning or corrupt nodes.
  final Set<Node> excludedNodes = new HashSet<>(rw.getContainingNodes());
  List<DatanodeStorageInfo> targets =
      pendingReconstruction.getTargets(rw.getBlock());
  if (targets != null) {
    for (DatanodeStorageInfo dn : targets) {
      if (!excludedNodes.contains(dn.getDatanodeDescriptor())) {
        excludedNodes.add(dn.getDatanodeDescriptor());
      }
    }
  }
  // choose replication targets: NOT HOLDING THE GLOBAL LOCK
  final BlockPlacementPolicy placementPolicy =
      placementPolicies.getPolicy(rw.getBlock().getBlockType());
  rw.chooseTargets(placementPolicy, storagePolicySuite, excludedNodes);
}
{code}
was (Author: daniel ma): [~weichiu] No idea what kind of condition can reproduce this problem. It seems the targets object is modified elsewhere while computeReconstructionWorkForBlocks is in progress. {code:java} // Step 2: choose target nodes for each reconstruction task for (BlockReconstructionWork rw : reconWork) { // Exclude all of the containing nodes from being targets. // This list includes decommissioning or corrupt nodes. final Set<Node> excludedNodes = new HashSet<>(rw.getContainingNodes()); List<DatanodeStorageInfo> targets = pendingReconstruction.getTargets(rw.getBlock()); if (targets != null) { for (DatanodeStorageInfo dn : targets) { if (!excludedNodes.contains(dn.getDatanodeDescriptor())) { excludedNodes.add(dn.getDatanodeDescriptor()); } } } // choose replication targets: NOT HOLDING THE GLOBAL LOCK final BlockPlacementPolicy placementPolicy = placementPolicies.getPolicy(rw.getBlock().getBlockType()); rw.chooseTargets(placementPolicy, storagePolicySuite, excludedNodes); } {code} > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > >
[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370997#comment-17370997 ] Daniel Ma commented on HDFS-15796: -- [~sodonnell] We have made some modifications to the open-source version, such as merging some patches from newer versions into our 3.1.1 build, so the line numbers in the error stack trace are not exactly the same. > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > >
[jira] [Commented] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370995#comment-17370995 ] Daniel Ma commented on HDFS-15796: -- [~weichiu] No idea what kind of condition can reproduce this problem. It seems the targets object is modified elsewhere while computeReconstructionWorkForBlocks is in progress. > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. | BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > >
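The reported stack trace is the classic fail-fast behavior of `ArrayList`: a structural modification of the `targets` list while other code is iterating it triggers `ConcurrentModificationException` on the next `next()` call. The minimal sketch below reproduces the failure and shows one common mitigation, iterating over a defensive snapshot; it is illustrative only, not the actual fix applied in Hadoop:

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;

public class CmeDemo {

    /** Iterate while mutating the same list: fails fast with CME. */
    static boolean iterateUnsafely(List<String> targets) {
        try {
            for (String t : targets) {
                if ("dn1".equals(t)) {
                    targets.add("dn-extra"); // structural change mid-iteration
                }
            }
            return true;
        } catch (ConcurrentModificationException e) {
            return false; // the fail-fast iterator detected the change
        }
    }

    /** Iterate over a snapshot so modifications cannot break the loop. */
    static boolean iterateSnapshot(List<String> targets) {
        for (String t : new ArrayList<>(targets)) { // defensive copy
            if ("dn1".equals(t)) {
                targets.add("dn-extra"); // original list changes; loop is safe
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(iterateUnsafely(new ArrayList<>(List.of("dn1", "dn2")))); // prints false
        System.out.println(iterateSnapshot(new ArrayList<>(List.of("dn1", "dn2")))); // prints true
    }
}
```

The snapshot trades a small copy cost for safety; an alternative is to take the same lock that guards writers to the list, which is what the "NOT HOLDING THE GLOBAL LOCK" comment in the quoted code hints is missing here.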
[jira] [Updated] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
[ https://issues.apache.org/jira/browse/HDFS-15796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daniel Ma updated HDFS-15796: - Description: ConcurrentModificationException error happens on NameNode occasionally. {code:java} 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor thread received Runtime exception. | BlockManager.java:4746 java.util.ConcurrentModificationException at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) at java.util.ArrayList$Itr.next(ArrayList.java:859) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) at java.lang.Thread.run(Thread.java:748) {code} was: ConcurrentModificationException error happens on NameNode occasionally !file:///C:/Users/m00425105/AppData/Roaming/eSpace_Desktop/UserData/m00425105/imagefiles/10B02DC2-A9F0-4AE6-949B-92B8F1E9249A.png! > ConcurrentModificationException error happens on NameNode occasionally > -- > > Key: HDFS-15796 > URL: https://issues.apache.org/jira/browse/HDFS-15796 > Project: Hadoop HDFS > Issue Type: Bug > Components: hdfs >Affects Versions: 3.1.1 >Reporter: Daniel Ma >Priority: Critical > Fix For: 3.1.1 > > > ConcurrentModificationException error happens on NameNode occasionally. > > {code:java} > 2021-01-23 20:21:18,107 | ERROR | RedundancyMonitor | RedundancyMonitor > thread received Runtime exception. 
| BlockManager.java:4746 > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:909) > at java.util.ArrayList$Itr.next(ArrayList.java:859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:1907) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeBlockReconstructionWork(BlockManager.java:1859) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeDatanodeWork(BlockManager.java:4862) > at > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager$RedundancyMonitor.run(BlockManager.java:4729) > at java.lang.Thread.run(Thread.java:748) > {code} > >
[jira] [Created] (HDFS-15796) ConcurrentModificationException error happens on NameNode occasionally
Daniel Ma created HDFS-15796: Summary: ConcurrentModificationException error happens on NameNode occasionally Key: HDFS-15796 URL: https://issues.apache.org/jira/browse/HDFS-15796 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Affects Versions: 3.1.1 Reporter: Daniel Ma Fix For: 3.1.1 ConcurrentModificationException error happens on NameNode occasionally !file:///C:/Users/m00425105/AppData/Roaming/eSpace_Desktop/UserData/m00425105/imagefiles/10B02DC2-A9F0-4AE6-949B-92B8F1E9249A.png!