[jira] [Comment Edited] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

Daniel Ma (Jira) Tue, 06 Jul 2021 19:02:05 -0700


    [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376146#comment-17376146
 ]


Daniel Ma edited comment on HDFS-16115 at 7/7/21, 2:01 AM:
-----------------------------------------------------------

[~hexiaoqiao] Thanks for your review and tips. Actually, the issue that my patch 
attempts to solve is totally different from the one described in HDFS-15651.

I noticed that Jira previously, but it does not fully solve my issue. 

What I am trying to solve in this patch is:

1-Once the CommandProcessingThread catches a non-fatal error or exception, it 
retries up to 5 times instead of simply being interrupted, and once it reaches 
the maximum number of retries, we need to stop the corresponding BPServiceActor 
thread as well. 

In HDFS-15651, no matter what kind of error occurs, the thread is simply 
closed. But many non-fatal errors can probably recover automatically, like the 
"cannot create native thread" error. Even after the number of threads in the 
OS drops, the BPServiceActor stays dead and cannot recover by itself.

2-In my patch, for non-fatal errors, BPOfferService keeps a periodic thread 
running that tries to recover any BPServiceActor thread that died owing to a 
non-fatal error, which is the essential difference between our patch and 
HDFS-15651.
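
To make point 1 concrete, here is a minimal sketch of the retry idea. It is not the code from 0001-HDFS-16115.patch: the class name RetrySketch, the helper names isNonFatal and runWithRetry, the backoff parameter, and the rule used to classify an error as non-fatal are all assumptions made for illustration. Only the default of 5 retries comes from the issue text.

```java
/** Illustrative sketch only; names and classification rule are assumptions. */
final class RetrySketch {
  /** Default retry count mentioned in the issue; the constant name is hypothetical. */
  static final int DEFAULT_MAX_RETRIES = 5;

  /**
   * Assumed classification: only the "unable to create new native thread"
   * OutOfMemoryError from the stack trace is treated as non-fatal here.
   */
  static boolean isNonFatal(Throwable t) {
    return t instanceof OutOfMemoryError
        && t.getMessage() != null
        && t.getMessage().contains("unable to create new native thread");
  }

  /**
   * Runs the task, retrying non-fatal failures up to maxRetries times.
   * Returns true on success; false means the caller should stop the
   * owning actor (fatal error, retries exhausted, or interruption).
   */
  static boolean runWithRetry(Runnable task, int maxRetries, long backoffMillis) {
    int failures = 0;
    while (true) {
      try {
        task.run();
        return true;
      } catch (Throwable t) {
        if (!isNonFatal(t) || ++failures > maxRetries) {
          return false;
        }
      }
      try {
        // Give the OS a chance to free threads before the next attempt.
        Thread.sleep(backoffMillis);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
  }
}
```

With maxRetries set to 5, a task that keeps failing with a non-fatal error runs six times in total (one initial attempt plus five retries) before the sketch gives up.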

 


was (Author: daniel ma):
[~hexiaoqiao] Thanks for your review and tips. Actually, the issue that my patch 
attempts to solve is totally different from the one described in HDFS-15651.

I noticed that Jira previously, but it does not fully solve my issue. 

What I am trying to solve in this patch is:

1-Once the CommandProcessingThread is dead, we need to stop the corresponding 
BPServiceActor thread as well. 

2-In my patch, for non-fatal errors, BPOfferService keeps a periodic thread 
running that tries to recover any BPServiceActor thread that died owing to a 
non-fatal error, which is the essential difference between our patch and 
HDFS-15651.

 

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> --------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16115
>                 URL: https://issues.apache.org/jira/browse/HDFS-16115
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode
>    Affects Versions: 3.3.1
>            Reporter: Daniel Ma
>            Priority: Critical
>             Fix For: 3.3.1
>
>         Attachments: 0001-HDFS-16115.patch
>
>
> This is an improvement issue. It actually has two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode 
> asynchronously (CommandProcessingThread handles the commands), so if any 
> exception or error in CommandProcessingThread causes that thread to fail and 
> stop, BPServiceActor is not aware of it and keeps putting commands from the 
> NameNode into the queue to be handled by CommandProcessingThread, although 
> CommandProcessingThread is already dead.
> 2- The second sub-issue builds on the first one. If CommandProcessingThread 
> died owing to a non-fatal error like "can not create native thread", which 
> is caused by too many threads existing in the OS, this kind of problem should 
> be given much more tolerance instead of simply shutting down the thread and 
> never recovering automatically, because non-fatal errors like the one above 
> can probably recover by themselves soon:
> {code:java}
> // code placeholder
> 2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when process queue BPServiceActor.java:1393
> java.lang.OutOfMemoryError: unable to create new native thread
>         at java.lang.Thread.start0(Native Method)
>         at java.lang.Thread.start(Thread.java:717)
>         at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
>         at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
>         at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
>         at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
>         at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)
> {code}
> Currently, the DataNode BPServiceActor cannot return to normal even after 
> the non-fatal error has been eliminated.
> Therefore, this patch does two things:
> 1- Add a retry mechanism to the BPServiceActor thread and the 
> CommandProcessingThread thread; the retry count is 5 by default and 
> configurable.
> 2- Add a periodic monitor thread to BPOfferService. If a BPServiceActor 
> thread is dead owing to too many non-fatal errors, it should not simply be 
> removed from the BPServiceActor list stored in BPOfferService; instead, the 
> monitor thread periodically tries to restart these dead BPServiceActor 
> threads. The interval is also configurable.
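
The monitor idea in item 2 could look roughly like the sketch below. It is an illustration under stated assumptions, not the attached patch: the Actor interface and its three methods, the class name ActorMonitorSketch, and the thread name are hypothetical stand-ins for whatever BPOfferService and BPServiceActor actually expose. Only java.util.concurrent classes from the JDK are used.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; all names here are assumptions. */
final class ActorMonitorSketch {
  /** Hypothetical view of an actor; not the real BPServiceActor API. */
  interface Actor {
    boolean isAlive();
    boolean diedFromNonFatalError();
    void restart();
  }

  private final List<Actor> actors = new CopyOnWriteArrayList<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "actor-monitor");
        t.setDaemon(true);
        return t;
      });

  void register(Actor actor) {
    actors.add(actor);
  }

  /** One monitoring pass; returns how many dead actors were restarted. */
  int checkOnce() {
    int restarted = 0;
    for (Actor a : actors) {
      // Dead actors stay in the list so that they can be retried later.
      if (!a.isAlive() && a.diedFromNonFatalError()) {
        try {
          a.restart();
          restarted++;
        } catch (RuntimeException | OutOfMemoryError e) {
          // Still failing; leave it for the next pass.
        }
      }
    }
    return restarted;
  }

  /** Starts periodic checks; the interval would come from configuration. */
  void start(long intervalMillis) {
    scheduler.scheduleWithFixedDelay(
        this::checkOnce, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
```

Keeping the pass in a separate checkOnce method makes the restart rule easy to exercise without waiting for the scheduler.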



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
