[ https://issues.apache.org/jira/browse/AMBARI-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alejandro Fernandez updated AMBARI-12267:
-----------------------------------------
    Description: 
Ambari keeps track of a file, /etc/hadoop/conf/dfs_data_dir_mount.hist, 
that maps each HDFS data dir to its last known mount point.
This mapping is used to detect when a data dir becomes unmounted, in order to 
prevent HDFS from writing to the root partition.
Consider the example of a DataNode configured with these volumes:
/dev/sda -> / 
/dev/sdb -> /grid/0
/dev/sdc -> /grid/1
/dev/sdd -> /grid/2
Typically, each /grid/#/ directory contains a data folder.
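
As a rough sketch of how this detection works (assuming a simple 
data_dir,mount_point line format for the history file; the helper names here 
are hypothetical, not the agent's actual API):

{code:python}
import os

HIST_FILE = "/etc/hadoop/conf/dfs_data_dir_mount.hist"

def get_mount_point_for_dir(path):
    # Walk up from path until we hit a mount point (e.g. /grid/0, or / itself).
    path = os.path.abspath(path)
    while not os.path.ismount(path):
        path = os.path.dirname(path)
    return path

def load_history(hist_file=HIST_FILE):
    # Parse the history file into {data_dir: last_known_mount_point}.
    history = {}
    if not os.path.exists(hist_file):
        return history
    with open(hist_file) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            data_dir, mount_point = line.split(",", 1)
            history[data_dir] = mount_point
    return history

def unmounted_data_dirs(data_dirs):
    # Data dirs whose mount point regressed to / since the last known state:
    # writing to these would silently fill the root partition.
    bad = []
    history = load_history()
    for data_dir in data_dirs:
        last = history.get(data_dir)
        current = get_mount_point_for_dir(data_dir)
        if last is not None and last != "/" and current == "/":
            bad.append(data_dir)
    return bad
{code}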

If hdfs-site contains dfs.datanode.failed.volumes.tolerated with a value > 0, 
then the DataNode will tolerate up to that many failed volumes; otherwise, the 
DataNode will die.
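
For illustration, the tolerance check boils down to something like this (the 
property name is real; the decision function is a simplification of the 
DataNode's behavior):

{code:python}
def datanode_should_die(num_failed_volumes, hdfs_site):
    # dfs.datanode.failed.volumes.tolerated defaults to 0, meaning any
    # single volume failure shuts the DataNode down.
    tolerated = int(hdfs_site.get("dfs.datanode.failed.volumes.tolerated", 0))
    return num_failed_volumes > tolerated
{code}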

In AMBARI-12252, I fixed a bug so that Ambari would prevent an unmounted drive 
from allowing HDFS to write to the root partition.
However, this approach relies on the /etc/hadoop/conf/dfs_data_dir_mount.hist 
file existing and on the original configuration being correct.

The ideal way to fix this is to:
* h4. Track which data dirs the admin wants mounted on a non-root partition.
If the admin wants all data dirs on non-root mounts but the initial 
install is incorrect, this should be reported as a problem.
* h4. Keep the history of the mount points in the database. 
Today, if the cache file is deleted or the host is reimaged, this information 
is lost.
* h4. Introduce a new state between FAILED and COMPLETED, such as 
COMPLETED_WITH_ERRORS, that will allow tasks to appear differently in the 
UI, so the user can clearly see when a critical but non-fatal error 
happened (see the sketch after this list).
* h4. Integrate with the Alert Framework.
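
A minimal sketch of the proposed COMPLETED_WITH_ERRORS state, assuming a 
simple enum-style task model; all names other than COMPLETED, 
COMPLETED_WITH_ERRORS, and FAILED are illustrative:

{code:python}
from enum import Enum

class TaskState(Enum):
    PENDING = "PENDING"                              # illustrative
    IN_PROGRESS = "IN_PROGRESS"                      # illustrative
    COMPLETED = "COMPLETED"
    COMPLETED_WITH_ERRORS = "COMPLETED_WITH_ERRORS"  # proposed new state
    FAILED = "FAILED"

def final_task_state(had_critical_errors, had_fatal_error):
    # A task that hit a critical but non-fatal error (e.g. one data dir
    # silently unmounted) still completes, but is flagged so the UI can
    # render it differently from a clean COMPLETED.
    if had_fatal_error:
        return TaskState.FAILED
    if had_critical_errors:
        return TaskState.COMPLETED_WITH_ERRORS
    return TaskState.COMPLETED
{code}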



> Ambari to improve tracking of data dirs becoming unmounted
> ----------------------------------------------------------
>
>                 Key: AMBARI-12267
>                 URL: https://issues.apache.org/jira/browse/AMBARI-12267
>             Project: Ambari
>          Issue Type: Story
>          Components: ambari-agent
>    Affects Versions: 2.0.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>             Fix For: 2.2.0



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
