[ https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376146#comment-17376146 ]
Daniel Ma edited comment on HDFS-16115 at 7/7/21, 2:01 AM:
-----------------------------------------------------------

[~hexiaoqiao] Thanks for your review and tips.

Actually, the issue my patch attempts to solve is quite different from the one described in HDFS-15651. I noticed that Jira previously, but it does not fully solve my issue.

What I try to solve in this patch is:

1-Once CommandProcessingThread catches a non-fatal error or exception, it retries up to 5 times instead of simply being interrupted, and after it reaches the maximum number of retries, we need to stop the corresponding BPServiceActor thread as well. In HDFS-15651, no matter what kind of error occurs, the thread is simply closed. But many non-fatal errors, such as "cannot create native thread", are likely to recover automatically; with that approach, even after the number of threads in the OS drops, the BPServiceActor service stays dead and cannot recover by itself.

2-In my patch, for non-fatal errors, BPOfferService keeps a periodic thread running that tries to recover any BPServiceActor thread that died owing to a non-fatal error. This is the essential difference between our patch and HDFS-15651.

was (Author: daniel ma):
[~hexiaoqiao] Thanks for your review and tips.

Actually, the issue my patch attempts to solve is quite different from the one described in HDFS-15651. I noticed that Jira previously, but it does not fully solve my issue.

What I try to solve in this patch is:

1-Once CommandProcessingThread is dead, we need to stop the corresponding BPServiceActor thread as well.

2-In my patch, for non-fatal errors, BPOfferService keeps a periodic thread running that tries to recover any BPServiceActor thread that died owing to a non-fatal error. This is the essential difference between our patch and HDFS-15651.

> Asynchronously handle BPServiceActor command mechanism may result in
> BPServiceActor never fails even CommandProcessingThread is closed with fatal
> error.
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Affects Versions: 3.3.1
> Reporter: Daniel Ma
> Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> It is an improvement issue. Actually the issue has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode asynchronously
> (CommandProcessingThread handles the commands). If an exception or error in
> CommandProcessingThread causes that thread to fail and stop, BPServiceActor is
> not aware of it and keeps putting commands from the NameNode into the queue to
> be handled by CommandProcessingThread, even though CommandProcessingThread is
> already dead.
> 2- The second sub-issue builds on the first. If CommandProcessingThread died
> owing to a non-fatal error such as "cannot create native thread", which is
> caused by too many threads existing in the OS, the problem deserves much more
> tolerance than simply shutting down the thread and never recovering
> automatically, because such non-fatal errors are likely to clear up soon on
> their own.
> {code:java}
> //code placeholder
> 2021-07-02 16:26:02,315 | WARN | Command processor | Exception happened when process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
>     at java.lang.Thread.start0(Native Method)
>     at java.lang.Thread.start(Thread.java:717)
>     at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>     at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
>     at
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
>     at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
>     at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
>     at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)
> {code}
> Currently, the DataNode BPServiceActor cannot return to normal even after the
> non-fatal error has been eliminated.
> Therefore, this patch does two things:
> 1- Add a retry mechanism to the BPServiceActor thread and
> CommandProcessingThread; the retry count is 5 by default and configurable.
> 2- Add a periodic monitor thread to BPOfferService. If a BPServiceActor thread
> dies after too many non-fatal errors, it should not simply be removed from the
> BPServiceActor list stored in BPOfferService; instead, the monitor thread
> periodically tries to restart these dead BPServiceActor threads. The interval
> is also configurable.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
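
For illustration only, the bounded retry described in item 1 of the issue could look roughly like the sketch below. This is not the code from 0001-HDFS-16115.patch: the class `BoundedRetrySketch`, the `Outcome` enum, the `backoffMillis` parameter, and the rule used to decide whether an error is non-fatal are all hypothetical and are assumptions made for the example.

```java
/** Hypothetical sketch of a bounded retry; not taken from the HDFS-16115 patch. */
final class BoundedRetrySketch {

    /** Result of running a task under the retry policy. */
    enum Outcome { SUCCEEDED, RETRIES_EXHAUSTED, FATAL }

    /**
     * Assumption: ordinary exceptions are non-fatal, and so is an
     * OutOfMemoryError caused by the OS refusing to create a native thread,
     * because the thread count may drop again. Every other Error is fatal.
     */
    static boolean isNonFatal(Throwable t) {
        if (t instanceof OutOfMemoryError) {
            String msg = t.getMessage();
            return msg != null && msg.contains("unable to create new native thread");
        }
        return t instanceof Exception;
    }

    /**
     * Runs the task, retrying after non-fatal failures.
     * maxRetries counts retries only, so the task runs at most maxRetries + 1 times.
     */
    static Outcome runWithRetry(Runnable task, int maxRetries, long backoffMillis) {
        int failures = 0;
        while (true) {
            try {
                task.run();
                return Outcome.SUCCEEDED;
            } catch (Throwable t) {
                if (!isNonFatal(t)) {
                    return Outcome.FATAL;              // give up immediately
                }
                failures++;
                if (failures > maxRetries) {
                    return Outcome.RETRIES_EXHAUSTED;  // caller should stop the actor
                }
                try {
                    Thread.sleep(backoffMillis);       // wait before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return Outcome.FATAL;              // interrupted: stop retrying
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Fails twice with the non-fatal OOM, then succeeds on the third call.
        Outcome outcome = runWithRetry(() -> {
            if (calls[0]++ < 2) {
                throw new OutOfMemoryError("unable to create new native thread");
            }
        }, 5, 10L);
        System.out.println(outcome + " after " + calls[0] + " calls");
    }
}
```

Under these assumptions, the caller would stop the corresponding BPServiceActor on `RETRIES_EXHAUSTED` or `FATAL`, and a periodic monitor such as the one proposed in item 2 could later try to restart only the actors that stopped with `RETRIES_EXHAUSTED`.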