[ https://issues.apache.org/jira/browse/AMBARI-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Alejandro Fernandez updated AMBARI-12267:
-----------------------------------------
Description:
Ambari keeps track of a file, /etc/hadoop/conf/dfs_data_dir_mount.hist, that contains a mapping of HDFS data dirs to their last known mount points. This is used to detect when a data dir becomes unmounted, in order to prevent HDFS from writing to the root partition.

Consider the example of a DataNode configured with these volumes:
/dev/sda -> /
/dev/sdb -> /grid/0
/dev/sdc -> /grid/1
/dev/sdd -> /grid/2

Typically, each /grid/#/ directory contains a data folder. If hdfs-site contains dfs.datanode.failed.volumes.tolerated with a value > 0, the DataNode will tolerate a volume failure; otherwise, the DataNode will die.

In AMBARI-12252, I fixed a bug so that Ambari would prevent an unmounted drive from allowing HDFS to write to the root partition. However, this approach relies on the /etc/hadoop/conf/dfs_data_dir_mount.hist file existing and on the original configuration being correct.

The ideal way to fix this is:
* Track which data dirs the admin wants mounted on a non-root partition. If the admin wants all data dirs on non-root mounts but the initial install is incorrect, this should be reported as a problem.
* Keep the history of the mount points in the database. Today, if the cache file is deleted or the host is reimaged, this information is lost.
* Introduce a new state between FAILED and COMPLETED, such as COMPLETED_WITH_ERRORS, so that such tasks look different in the UI and the user can clearly see when a critical but non-fatal error happened.
* Integrate with the Alert Framework.
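The unmount check described above can be sketched roughly as follows. This is an illustrative sketch only, not the actual agent code: the real logic lives in the Ambari agent's Python libraries, and the history-file format assumed here (one "data_dir,mount_point" line per entry) and the helper names are assumptions for illustration.

```python
# Sketch of the dfs_data_dir_mount.hist check: flag data dirs whose last
# known mount point was non-root but which now resolve to "/", meaning the
# drive became unmounted and HDFS would write to the root partition.
# The history-file format and function names below are illustrative assumptions.

HISTORY_FILE = "/etc/hadoop/conf/dfs_data_dir_mount.hist"  # path from the issue

def get_mount_point(path, mounts):
    """Return the longest mount point in `mounts` that is a prefix of `path`.
    `mounts` is a list of currently mounted points, e.g. ["/", "/grid/0"]."""
    best = "/"
    for m in mounts:
        if path == m or path.startswith(m.rstrip("/") + "/"):
            if len(m) > len(best):
                best = m
    return best

def read_history(lines):
    """Parse history lines of the assumed form 'data_dir,mount_point'."""
    history = {}
    for line in lines:
        line = line.strip()
        if line and "," in line:
            data_dir, mount = line.split(",", 1)
            history[data_dir] = mount
    return history

def unmounted_dirs(data_dirs, mounts, history):
    """Data dirs whose last known mount was non-root but now resolve to '/'."""
    bad = []
    for d in data_dirs:
        last = history.get(d)
        now = get_mount_point(d, mounts)
        if last is not None and last != "/" and now == "/":
            bad.append(d)  # drive became unmounted; do not write to root
    return bad
```

Because the check depends entirely on the history file, deleting the file (or reimaging the host) silently disables it, which is why the issue proposes persisting the mount history in the database instead.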
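The proposed COMPLETED_WITH_ERRORS state could look something like the sketch below. Note this is a hypothetical illustration of the proposal, not an existing Ambari constant: Ambari's real task states are defined elsewhere (in Java), and the `finalize` helper is invented here to show the intent.

```python
from enum import Enum

# Illustrative sketch of the proposed task state: a task that ran to
# completion but hit a critical, non-fatal error (e.g. a data dir found on
# the root partition) should surface differently in the UI than either
# COMPLETED or FAILED. Names below are assumptions, not Ambari's actual API.

class TaskStatus(Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    COMPLETED_WITH_ERRORS = "COMPLETED_WITH_ERRORS"  # critical but non-fatal
    FAILED = "FAILED"

def finalize(exit_code, critical_warnings):
    """Map a task result onto a status; warnings alone do not fail the task."""
    if exit_code != 0:
        return TaskStatus.FAILED
    if critical_warnings:
        return TaskStatus.COMPLETED_WITH_ERRORS
    return TaskStatus.COMPLETED
```

The design point is that the task still succeeds (services stay up), but the UI can render it distinctly so the admin notices the problem instead of it being buried in a green COMPLETED task.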
> Ambari to improve tracking of data dirs becoming unmounted
> ----------------------------------------------------------
>
>                 Key: AMBARI-12267
>                 URL: https://issues.apache.org/jira/browse/AMBARI-12267
>             Project: Ambari
>          Issue Type: Story
>          Components: ambari-agent
>    Affects Versions: 2.0.0
>            Reporter: Alejandro Fernandez
>            Assignee: Alejandro Fernandez
>             Fix For: 2.2.0
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)