[ 
https://issues.apache.org/jira/browse/AMBARI-8768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hari Sekhon updated AMBARI-8768:
--------------------------------
    Description: 
The Ambari agent is susceptible to hanging when the 'df' command blocks. This 
causes loss of heartbeat and manageability. I've seen this happen when the 
NFS gateway's HDFS mount point blocks because HDFS isn't available (we had set 
the NFS soft option on the mount point but then realized that wasn't a good 
idea, as not everyone's processes and scripts will handle failure gracefully 
and retry properly).

Restarting the agent also leaves the df process bound to port 8670, which has 
to be killed manually before the Ambari agent can restart and bind 
successfully. Even then, the agent hangs after connecting to the 8440 ca and 
never fully initializes, so the heartbeat still never comes back.

The df command should either run in a separate thread, so that it cannot block 
the main heartbeat and management functions, or have a timeout set on its 
execution to prevent this issue.
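
A minimal sketch of the timeout option, in Python and assuming Python 3.7+ 
(the agent's own code and Python version may differ). The helper name 
run_with_timeout, the 5 second default and the 'df -kP' invocation are 
illustrative assumptions, not the agent's actual implementation. After a 
timeout the sketch sends a kill signal but deliberately does not wait for the 
child, since a df stuck in uninterruptible sleep on a hung NFS mount may not 
exit until the mount recovers:

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds=5.0):
    """Run cmd and return its stdout, or None if it does not finish in time.

    Hypothetical helper for illustration only.
    """
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    try:
        out, _ = proc.communicate(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # Ask the kernel to kill the child, but do NOT call proc.wait():
        # a process blocked on a hung mount may not die promptly, and
        # waiting for it would reintroduce the hang in the caller.
        proc.kill()
        proc.stdout.close()
        return None
    return out


def disk_usage_report(timeout_seconds=5.0):
    """Return raw 'df -kP' output, or None if df blocked for too long."""
    return run_with_timeout(["df", "-kP"], timeout_seconds)
```

A caller that gets None back could skip the disk information for that 
heartbeat instead of blocking, so the heartbeat itself keeps flowing.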

Regards,

Hari Sekhon
http://www.linkedin.com/in/harisekhon

  was:
The Ambari agent is susceptible to hanging when the 'df' command blocks. This 
causes loss of heartbeat and manageability. I've seen this happen when the 
NFS gateway's HDFS mount point blocks because HDFS isn't available (we had set 
the NFS hard option on the mount point but then realized that wasn't a good 
idea, as not everyone's processes and scripts will handle failure gracefully 
and retry properly).

Restarting the agent also leaves the df process bound to port 8670, which has 
to be killed manually before the Ambari agent can restart and bind 
successfully. Even then, the agent hangs after connecting to the 8440 ca and 
never fully initializes, so the heartbeat still never comes back.

The df command should either run in a separate thread, so that it cannot block 
the main heartbeat and management functions, or have a timeout set on its 
execution to prevent this issue.

Regards,

Hari Sekhon
http://www.linkedin.com/in/harisekhon


> Ambari agent Heartbeat lost when df hangs (NFS gateway), also prevents proper 
> re-initialization of agent upon restart
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: AMBARI-8768
>                 URL: https://issues.apache.org/jira/browse/AMBARI-8768
>             Project: Ambari
>          Issue Type: Bug
>          Components: ambari-agent
>    Affects Versions: 1.7.0
>         Environment: HDP 2.1
>            Reporter: Hari Sekhon



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
