[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Attachment: 0001-HDFS-16115.patch

> Asynchronously handle BPServiceActor command mechanism may result in 
> BPServiceActor never fails even CommandProcessingThread is closed with fatal 
> error.
> 
>
> Key: HDFS-16115
> URL: https://issues.apache.org/jira/browse/HDFS-16115
> Project: Hadoop HDFS
> Issue Type: Improvement
> Components: datanode
> Affects Versions: 3.3.1
> Reporter: Daniel Ma
> Priority: Critical
> Fix For: 3.3.1
>
> Attachments: 0001-HDFS-16115.patch
>
>
> This is an improvement issue with two sub-issues:
> 1- The BPServiceActor thread handles commands from the NameNode
> asynchronously (CommandProcessingThread handles the commands). If an
> exception or error occurs in CommandProcessingThread and causes the thread
> to fail and stop, BPServiceActor is not aware of it and keeps queuing
> commands from the NameNode for CommandProcessingThread to handle.
> 2- The second sub-issue builds on the first. If CommandProcessingThread
> fails because of a non-fatal error such as "can not create native thread",
> which is caused by too many threads existing on the node, the error should
> be tolerated rather than simply shutting down the thread with no automatic
> recovery, because such a non-fatal error may soon recover by itself.
> Currently, the DataNode BPServiceActor cannot return to normal even after
> the non-fatal error has been eliminated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue with two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously
(CommandProcessingThread handles the commands). If an exception or error
occurs in CommandProcessingThread and causes the thread to fail and stop,
BPServiceActor is not aware of it and keeps queuing commands from the NameNode
for CommandProcessingThread to handle, even though CommandProcessingThread is
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread fails
because of a non-fatal error such as "can not create native thread", which is
caused by too many threads existing on the node, the error should be tolerated
rather than simply shutting down the thread with no automatic recovery,
because such a non-fatal error may soon recover by itself.

Currently, the DataNode BPServiceActor cannot return to normal even after the
non-fatal error has been eliminated.






[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue with two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously
(CommandProcessingThread handles the commands). If any exceptions or errors
occur in CommandProcessingThread and cause the thread to fail and stop,
BPServiceActor is not aware of it and keeps queuing commands from the NameNode
for CommandProcessingThread to handle, even though CommandProcessingThread is
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread dies
because of non-fatal errors such as "can not create native thread", which is
caused by too many threads existing on the node, the errors should be
tolerated rather than simply shutting down the thread with no automatic
recovery, because such non-fatal errors can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the
non-fatal error has been eliminated.






[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue with two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously
(CommandProcessingThread handles the commands). If any exceptions or errors
occur in CommandProcessingThread and cause the thread to fail and stop,
BPServiceActor is not aware of it and keeps queuing commands from the NameNode
for CommandProcessingThread to handle, even though CommandProcessingThread is
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread dies
because of non-fatal errors such as "can not create native thread", which is
caused by too many threads existing on the node, the errors should be
tolerated rather than simply shutting down the thread with no automatic
recovery, because such non-fatal errors can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the
non-fatal error has been eliminated.

Therefore, in this patch, two things were done:

1- Add a retry mechanism to the BPServiceActor thread and the
CommandProcessingThread thread; the number of retries is 5 by default and is
configurable.

2- Add a periodic monitor thread to BPOfferService. If a BPServiceActor thread
dies because of too many non-fatal errors, it should not simply be removed
from the BPServiceActor list stored in BPOfferService; instead, the monitor
thread periodically tries to restart these dead BPServiceActor threads.




[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue with two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously
(CommandProcessingThread handles the commands). If any exceptions or errors
occur in CommandProcessingThread and cause the thread to fail and stop,
BPServiceActor is not aware of it and keeps queuing commands from the NameNode
for CommandProcessingThread to handle, even though CommandProcessingThread is
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread dies
because of non-fatal errors such as "can not create native thread", which is
caused by too many threads existing on the node, the errors should be
tolerated rather than simply shutting down the thread with no automatic
recovery, because such non-fatal errors can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the
non-fatal error has been eliminated.

Therefore, in this patch, two things were done:

1- Add a retry mechanism to the BPServiceActor thread and the
CommandProcessingThread thread; the number of retries is 5 by default and is
configurable.

2- Add a periodic monitor thread to BPOfferService. If a BPServiceActor thread
dies because of too many non-fatal errors, it should not simply be removed
from the BPServiceActor list stored in BPOfferService; instead, the monitor
thread periodically tries to restart these dead BPServiceActor threads. The
interval is also configurable.




[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue with two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously
(CommandProcessingThread handles the commands). If any exceptions or errors
occur in CommandProcessingThread and cause the thread to fail and stop,
BPServiceActor is not aware of it and keeps queuing commands from the NameNode
for CommandProcessingThread to handle, even though CommandProcessingThread is
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread dies
because of non-fatal errors such as "can not create native thread", which is
caused by too many threads existing in the OS, the errors should be tolerated
rather than simply shutting down the thread with no automatic recovery,
because such non-fatal errors can probably be recovered from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the
non-fatal error has been eliminated.

Therefore, in this patch, two things were done:

1- Add a retry mechanism to the BPServiceActor thread and the
CommandProcessingThread thread; the number of retries is 5 by default and is
configurable.

2- Add a periodic monitor thread to BPOfferService. If a BPServiceActor thread
dies because of too many non-fatal errors, it should not simply be removed
from the BPServiceActor list stored in BPOfferService; instead, the monitor
thread periodically tries to restart these dead BPServiceActor threads. The
interval is also configurable.




[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue with two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously
(CommandProcessingThread handles the commands). If any exceptions or errors
occur in CommandProcessingThread and cause the thread to fail and stop,
BPServiceActor is not aware of it and keeps queuing commands from the NameNode
for CommandProcessingThread to handle, even though CommandProcessingThread is
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread dies
because of non-fatal errors such as "can not create native thread", which is
caused by too many threads existing in the OS, this kind of problem should be
given much more tolerance rather than simply shutting down the thread with no
automatic recovery, because such non-fatal errors can probably be recovered
from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the
non-fatal error has been eliminated.

Therefore, in this patch, two things were done:

1- Add a retry mechanism to the BPServiceActor thread and the
CommandProcessingThread thread; the number of retries is 5 by default and is
configurable.

2- Add a periodic monitor thread to BPOfferService. If a BPServiceActor thread
dies because of too many non-fatal errors, it should not simply be removed
from the BPServiceActor list stored in BPOfferService; instead, the monitor
thread periodically tries to restart these dead BPServiceActor threads. The
interval is also configurable.




[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue with two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously
(CommandProcessingThread handles the commands). If any exceptions or errors
occur in CommandProcessingThread and cause the thread to fail and stop,
BPServiceActor is not aware of it and keeps queuing commands from the NameNode
for CommandProcessingThread to handle, even though CommandProcessingThread is
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread dies
because of non-fatal errors such as "can not create native thread", which is
caused by too many threads existing in the OS, this kind of problem should be
given much more tolerance rather than simply shutting down the thread with no
automatic recovery, because such non-fatal errors can probably be recovered
from soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the
non-fatal error has been eliminated.
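
The first sub-issue can be illustrated with a minimal sketch. It is not Hadoop
code: CommandDispatcherSketch and its methods are hypothetical names, and the
liveness check in enqueue() is only one possible way for a producer to notice
that its consumer thread has died.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical stand-in for an actor thread feeding a command thread. */
final class CommandDispatcherSketch {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread worker = new Thread(this::drain, "command-worker");

    /** Starts the consumer thread; enqueue() rejects commands before this. */
    void start() {
        worker.setDaemon(true);
        // Keep the sketch quiet; real code would log the failure instead.
        worker.setUncaughtExceptionHandler((t, e) -> { });
        worker.start();
    }

    // Consumer loop: an unchecked exception or error thrown by a command
    // escapes the loop and terminates the worker thread.
    private void drain() {
        try {
            while (true) {
                queue.take().run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Producer side: refuse to queue work that no thread will ever run. */
    void enqueue(Runnable command) {
        if (!worker.isAlive()) {
            throw new IllegalStateException("command worker is not running");
        }
        queue.add(command);
    }

    /** Waits up to millis for the worker to exit; true if it has exited. */
    boolean awaitWorkerExit(long millis) {
        try {
            worker.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }
}
```

Without the isAlive() check, enqueue() would keep accepting commands that
nothing will ever run, which is the behavior described in sub-issue 1.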

Therefore, in this patch, two things will be done:

1- Add a retry mechanism to the BPServiceActor thread and the
CommandProcessingThread thread; the number of retries is 5 by default and is
configurable.

2- Add a periodic monitor thread to BPOfferService. If a BPServiceActor thread
dies because of too many non-fatal errors, it should not simply be removed
from the BPServiceActor list stored in BPOfferService; instead, the monitor
thread periodically tries to restart these dead BPServiceActor threads. The
interval is also configurable.
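
A rough sketch of the two measures, under stated assumptions: the class and
method names are hypothetical and are not taken from 0001-HDFS-16115.patch,
"5 by default" is read as five retries after the first attempt, the set of
errors treated as non-fatal is a guess, and backoff between retries is omitted
for brevity.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of bounded retries plus a restart monitor. */
final class ActorRecoverySketch {
    static final int DEFAULT_MAX_RETRIES = 5; // "5 by default" per the description

    private final Runnable actorBody;
    private final ScheduledExecutorService monitor =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "actor-monitor");
            t.setDaemon(true);
            return t;
        });
    private Thread actor; // guarded by "this"
    private int starts;   // guarded by "this"

    ActorRecoverySketch(Runnable actorBody) {
        this.actorBody = actorBody;
    }

    /**
     * Measure 1: run the task, retrying up to maxRetries times after the
     * first failure. Returns the number of attempts used, or rethrows the
     * last failure once the retries are exhausted.
     */
    static int runWithRetries(Runnable task, int maxRetries) {
        int attempt = 0;
        while (true) {
            attempt++;
            try {
                task.run();
                return attempt;
            } catch (RuntimeException | OutOfMemoryError e) {
                // Assumption: these are the failures considered non-fatal.
                if (attempt > maxRetries) {
                    throw e;
                }
            }
        }
    }

    /** Measure 2: start a fresh actor thread if none is running. */
    synchronized boolean restartIfDead() {
        if (actor != null && actor.isAlive()) {
            return false;
        }
        // A terminated Thread cannot be started again, so create a new one.
        actor = new Thread(actorBody, "actor-" + (++starts));
        actor.setDaemon(true);
        actor.setUncaughtExceptionHandler((t, e) -> { });
        actor.start();
        return true;
    }

    /** Runs restartIfDead() periodically at a configurable interval. */
    void startMonitor(long intervalMillis) {
        monitor.scheduleWithFixedDelay(
            this::restartIfDead, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    void stopMonitor() {
        monitor.shutdownNow();
    }

    synchronized int starts() {
        return starts;
    }

    /** Waits up to millis for the current actor to exit; true if exited. */
    boolean awaitActorExit(long millis) {
        Thread t;
        synchronized (this) {
            t = actor;
        }
        if (t == null) {
            return true;
        }
        try {
            t.join(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !t.isAlive();
    }
}
```

In this sketch the dead actor stays registered with its owner and is simply
replaced by a new thread on the next monitor tick, rather than being dropped
permanently.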




[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue with two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessingThread processes the commands). If an exception or error in 
CommandProcessingThread causes that thread to fail and stop, BPServiceActor is 
not aware of it and keeps putting commands from the NameNode into the queue to 
be handled by CommandProcessingThread, even though CommandProcessingThread is 
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread dies 
from a non-fatal error such as "can not create native thread", which is caused 
by too many threads existing in the OS, the failure should be tolerated rather 
than simply shutting down the thread with no automatic recovery, because 
non-fatal errors like this will probably clear up on their own soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

Therefore, this patch does two things:

1- Adds a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread thread; the retry count is 5 by default and is 
configurable.

2- Adds a periodic monitor thread to BPOfferService. If a BPServiceActor thread 
dies because of too many non-fatal errors, it should not simply be removed from 
the BPServiceActor list stored in BPOfferService; instead, the monitor thread 
periodically tries to restart these dead BPServiceActor threads. The interval 
is also configurable.
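
The monitor thread in item 2 might look something like the sketch below. It is 
an assumption-laden illustration, not the patch itself: ActorMonitor and the 
nested Actor interface are hypothetical stand-ins for BPOfferService and 
BPServiceActor, and the interval parameter stands for the configurable 
interval mentioned above.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of a periodic monitor; not taken from the patch. */
final class ActorMonitor {
  /** Hypothetical stand-in for a BPServiceActor; not the Hadoop class. */
  interface Actor {
    boolean isAlive();
    boolean diedOfNonFatalError();
    void restart();
  }

  private final List<Actor> actors = new CopyOnWriteArrayList<>();
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "actor-monitor");
        t.setDaemon(true); // do not keep the JVM alive just for monitoring
        return t;
      });

  void register(Actor actor) {
    actors.add(actor);
  }

  /** One monitoring pass; returns how many dead actors were restarted. */
  int checkOnce() {
    int restarted = 0;
    for (Actor actor : actors) {
      if (!actor.isAlive() && actor.diedOfNonFatalError()) {
        try {
          actor.restart();
          restarted++;
        } catch (RuntimeException e) {
          // Leave the actor registered and try again on the next pass.
        }
      }
    }
    return restarted;
  }

  /** Runs checkOnce repeatedly with the given delay between passes. */
  void start(long intervalMillis) {
    scheduler.scheduleWithFixedDelay(
        this::checkOnce, intervalMillis, intervalMillis, TimeUnit.MILLISECONDS);
  }

  void stop() {
    scheduler.shutdownNow();
  }
}
```

Dead actors stay in the list, so a later pass can still bring them back once 
the underlying non-fatal condition has cleared.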

  was:
It is an improvement issue. Actually the issue has two sub issues:

1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
CommandProcessThread handle commands ), so if there are any exceptions or 
errors happen in thread CommandProcessthread resulting the thread fails and 
stop, of which BPServiceActor cannot aware and still keep putting commands from 
namenode into queues waiting to be handled by CommandProcessThread, actually 
CommandProcessThread was dead already.

2-the second sub issue is based on the first one, if CommandProcessThread was 
dead owing to some non-fatal errors like "can not create native thread" which 
is caused by too many threads existed in OS, this kind of problem should be 
given much more torlerance instead of simply shudown the thread and never 
recover automatically, because the non-fatal errors mentioned above probably 
can be recovered soon by itself,

currently, Datanode BPServiceActor cannot turn to normal even when the 
non-fatal error was eliminated.

Therefore, in this patch, two things will be done:

1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread 
which is 5 by default and configurable;

2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread 
is dead owing to  too many times non-fatal error, it should not be simply 
remove from BPServviceActor lists stored in BPOfferService, instead, the 
monitor thread will periodically try to start these special dead BPService 
Actor thread. the interval is also configurable.



[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue with two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessingThread processes the commands). If an exception or error in 
CommandProcessingThread causes that thread to fail and stop, BPServiceActor is 
not aware of it and keeps putting commands from the NameNode into the queue to 
be handled by CommandProcessingThread, even though CommandProcessingThread is 
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread dies 
from a non-fatal error such as "can not create native thread", which is caused 
by too many threads existing in the OS, the failure should be tolerated rather 
than simply shutting down the thread with no automatic recovery, because 
non-fatal errors like this will probably clear up on their own soon.

Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

Therefore, this patch does two things:

1- Adds a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread thread; the retry count is 5 by default and is 
configurable.

2- Adds a periodic monitor thread to BPOfferService. If a BPServiceActor thread 
dies because of too many non-fatal errors, it should not simply be removed from 
the BPServiceActor list stored in BPOfferService; instead, the monitor thread 
periodically tries to restart these dead BPServiceActor threads. The interval 
is also configurable.
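
To make sub-issue 1 concrete, the sketch below shows a producer that checks 
whether its consumer thread is still running before it queues more work. It is 
a simplified illustration and not how BPServiceActor is written: the class 
GuardedCommandQueue and its methods are hypothetical, and the check-then-add 
in enqueue is not atomic, so a real fix would need more care.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch of a queue whose producer notices a dead consumer. */
final class GuardedCommandQueue {
  private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
  private final Thread consumer;
  private volatile Throwable failure; // why the consumer died, if it did

  GuardedCommandQueue() {
    consumer = new Thread(this::drain, "command-processor");
    consumer.setDaemon(true);
    // Record the cause instead of letting the thread disappear silently.
    consumer.setUncaughtExceptionHandler((thread, error) -> failure = error);
  }

  void start() {
    consumer.start();
  }

  private void drain() {
    try {
      while (true) {
        queue.take().run(); // an exception or error thrown here ends the thread
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /** Rejects new commands once the consumer thread is no longer running. */
  void enqueue(Runnable command) {
    if (!consumer.isAlive()) {
      throw new IllegalStateException("command processor is not running", failure);
    }
    queue.add(command);
  }

  /** Waits up to timeoutMillis; returns true if the consumer has exited. */
  boolean awaitConsumerExit(long timeoutMillis) {
    try {
      consumer.join(timeoutMillis);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return !consumer.isAlive();
  }
}
```

Without a check of this kind, commands simply accumulate in the queue after 
the consumer has died, which is the behaviour sub-issue 1 describes.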

  was:
It is an improvement issue. Actually the issue has two sub issues:

1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
CommandProcessThread handle commands ), so if there are any exceptions or 
errors happen in thread CommandProcessthread resulting the thread fails and 
stop, of which BPServiceActor cannot aware and still keep putting commands from 
namenode into queues waiting to be handled by CommandProcessThread, actually 
CommandProcessThread was dead already.

2-the second sub issue is based on the first one, if CommandProcessThread was 
dead owing to some non-fatal errors like "can not create native thread" which 
is caused by too many threads existed in OS, this kind of problem should be 
given much more torlerance instead of simply shudown the thread and never 
recover automatically, because the non-fatal errors mentioned above probably 
can be recovered soon by itself,

currently, Datanode BPServiceActor cannot turn to normal even when the 
non-fatal error was eliminated.

Therefore, in this patch, two things will be done:

1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread 
which is 5 by default and configurable;

2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread 
is dead owing to  too many times non-fatal error, it should not be simply 
removed from BPServviceActor lists stored in BPOfferService, instead, the 
monitor thread will periodically try to start these special dead BPService 
Actor thread. the interval is also configurable.



[jira] [Updated] (HDFS-16115) Asynchronously handle BPServiceActor command mechanism may result in BPServiceActor never fails even CommandProcessingThread is closed with fatal error.

2021-07-06 Thread Daniel Ma (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Ma updated HDFS-16115:
-
Description: 
This is an improvement issue with two sub-issues:

1- The BPServiceActor thread handles commands from the NameNode asynchronously 
(CommandProcessingThread processes the commands). If an exception or error in 
CommandProcessingThread causes that thread to fail and stop, BPServiceActor is 
not aware of it and keeps putting commands from the NameNode into the queue to 
be handled by CommandProcessingThread, even though CommandProcessingThread is 
already dead.

2- The second sub-issue builds on the first. If CommandProcessingThread dies 
from a non-fatal error such as "can not create native thread", which is caused 
by too many threads existing in the OS, the failure should be tolerated rather 
than simply shutting down the thread with no automatic recovery, because 
non-fatal errors like this will probably clear up on their own soon.
{code:java}
// code placeholder
2021-07-02 16:26:02,315 | WARN  | Command processor | Exception happened when process queue BPServiceActor.java:1393
java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:717)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:957)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1367)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.execute(FsDatasetAsyncDiskService.java:180)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetAsyncDiskService.deleteAsync(FsDatasetAsyncDiskService.java:229)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2315)
at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2237)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:752)
at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:698)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processCommand(BPServiceActor.java:1417)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.lambda$enqueue$2(BPServiceActor.java:1463)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1382)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1365)
{code}
Currently, the DataNode BPServiceActor cannot return to normal even after the 
non-fatal error has been eliminated.

Therefore, this patch does two things:

1- Adds a retry mechanism to the BPServiceActor thread and the 
CommandProcessingThread thread; the retry count is 5 by default and is 
configurable.

2- Adds a periodic monitor thread to BPOfferService. If a BPServiceActor thread 
dies because of too many non-fatal errors, it should not simply be removed from 
the BPServiceActor list stored in BPOfferService; instead, the monitor thread 
periodically tries to restart these dead BPServiceActor threads. The interval 
is also configurable.
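
One possible way to recognise the error in the log above as non-fatal is 
sketched here. The class ErrorClassifier and its method are hypothetical and 
not part of the patch; the message text is the one shown in the log, and 
matching on message strings is brittle because the wording can differ between 
JVM versions.

```java
/** Hypothetical sketch; the patch may classify errors differently. */
final class ErrorClassifier {
  // Message text copied from the OutOfMemoryError in the log above.
  private static final String THREAD_EXHAUSTION = "unable to create new native thread";

  private ErrorClassifier() {
  }

  /**
   * Returns true for an OutOfMemoryError caused by thread exhaustion, which
   * the description treats as a condition that may clear up on its own.
   */
  static boolean isThreadExhaustion(Throwable t) {
    if (!(t instanceof OutOfMemoryError)) {
      return false;
    }
    String message = t.getMessage();
    return message != null && message.contains(THREAD_EXHAUSTION);
  }
}
```

Other OutOfMemoryError variants, such as heap exhaustion, are deliberately 
not matched, so they would still be handled as fatal.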

  was:
It is an improvement issue. Actually the issue has two sub issues:

1- BPServerActor thread handle commands from NameNode in aysnchronous way ( 
CommandProcessThread handle commands ), so if there are any exceptions or 
errors happen in thread CommandProcessthread resulting the thread fails and 
stop, of which BPServiceActor cannot aware and still keep putting commands from 
namenode into queues waiting to be handled by CommandProcessThread, actually 
CommandProcessThread was dead already.

2-the second sub issue is based on the first one, if CommandProcessThread was 
dead owing to some non-fatal errors like "can not create native thread" which 
is caused by too many threads existed in OS, this kind of problem should be 
given much more torlerance instead of simply shudown the thread and never 
recover automatically, because the non-fatal errors mentioned above probably 
can be recovered soon by itself,

currently, Datanode BPServiceActor cannot turn to normal even when the 
non-fatal error was eliminated.

Therefore, in this patch, two things will be done:

1-Add retry mechanism in BPServiceActor thread and CommandProcessThread thread 
which is 5 by default and configurable;

2-Add a monitor periodical thread in BPOfferService, if a BPServiceActor thread 
is dead owing to  too many times non-fatal error, it should not be simply 
removed from BPServviceActor lists stored in BPOfferService, instead, the 
monitor thread will periodically try to start these special dead