[ https://issues.apache.org/jira/browse/HDFS-7604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Nauroth updated HDFS-7604:
--------------------------------
    Attachment: HDFS-7604-screenshot-4.png

I've done another mock-up of the UI. This version avoids adding clutter to the existing Datanodes page and instead moves failure information to its own dedicated page.

Just like in the existing screenshot 3, there is a new field on the summary for Total Failed Volumes. I also intend to display lost capacity in parentheses next to it. However, unlike last time, the existing Datanodes page is unchanged. Instead, the volume failure information is on a new Datanode Volumes page. This is hyperlinked from both the Total Failed Volumes field in the summary and a new tab in the top nav.

The new page has a table displaying only the DataNodes that have volume failures. For each one, it displays the address, seconds since last contact, time of last volume failure, number of failed volumes, estimated capacity lost due to these volume failures, and a list of every failed storage location's path. I say that the capacity lost is an estimate, because there are going to be some edge cases that could prevent us from displaying accurate information here. For example, if a volume has an I/O error before we get a chance to check its capacity, then it's unknown how much storage is available on that volume.

The end user workflow I imagine for this is that an admin first checks the summary information and notices a non-zero count for failed volumes. Then, the admin navigates to the Datanode Volumes page to get a list of volume failures across the cluster. This view lists only the DataNodes with volume failures, so the admin won't need to scan through the master list looking for individual nodes with a non-zero volume failure count. This can act as a sort of work queue for the admin recovering or replacing disks.

I have not updated the patch. I need to rework the heartbeat information to provide this data for the UI.
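
As a rough illustration only (not taken from the patch), the rows of the proposed Datanode Volumes table could be modeled along these lines in Java. Every class, field, and method name here is hypothetical. Treating an unreadable capacity as "unknown" and skipping it in the total is just one possible way to handle the edge case described above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; none of these names come from the HDFS-7604 patch. */
final class VolumeFailureReport {

  /** Assumed marker for a volume that failed before its capacity was read. */
  static final long UNKNOWN_CAPACITY = -1L;

  /** One row of the proposed Datanode Volumes table. */
  static final class Row {
    final String address;
    final long secondsSinceLastContact;
    final long lastFailureTimeMillis;
    final List<String> failedLocations;
    final long estimatedCapacityLostBytes; // may be UNKNOWN_CAPACITY

    Row(String address, long secondsSinceLastContact,
        long lastFailureTimeMillis, List<String> failedLocations,
        long estimatedCapacityLostBytes) {
      this.address = address;
      this.secondsSinceLastContact = secondsSinceLastContact;
      this.lastFailureTimeMillis = lastFailureTimeMillis;
      this.failedLocations = failedLocations;
      this.estimatedCapacityLostBytes = estimatedCapacityLostBytes;
    }

    /** The failed-volume count is derived from the list of failed paths. */
    int failedVolumes() {
      return failedLocations.size();
    }
  }

  /** Keeps only DataNodes with at least one failed volume, in input order. */
  static List<Row> withFailures(List<Row> all) {
    List<Row> result = new ArrayList<Row>();
    for (Row row : all) {
      if (row.failedVolumes() > 0) {
        result.add(row);
      }
    }
    return result;
  }

  /** Cluster-wide count for the Total Failed Volumes summary field. */
  static int totalFailedVolumes(List<Row> rows) {
    int total = 0;
    for (Row row : rows) {
      total += row.failedVolumes();
    }
    return total;
  }

  /** Sums the known estimates; rows with unknown capacity are skipped. */
  static long totalEstimatedCapacityLost(List<Row> rows) {
    long total = 0L;
    for (Row row : rows) {
      if (row.estimatedCapacityLostBytes != UNKNOWN_CAPACITY) {
        total += row.estimatedCapacityLostBytes;
      }
    }
    return total;
  }
}
```

Deriving the count from the path list keeps the two values from drifting apart. Because unknown capacities are skipped, the summed figure is a lower bound rather than an exact number, which matches its description as an estimate.
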
Meanwhile, Last Failure Time and Estimated Capacity Lost are displayed as TODO in the screenshot. Further feedback is welcome while I continue coding a new patch.

> Track and display failed DataNode storage locations in NameNode.
> ----------------------------------------------------------------
>
>                 Key: HDFS-7604
>                 URL: https://issues.apache.org/jira/browse/HDFS-7604
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: datanode, namenode
>            Reporter: Chris Nauroth
>            Assignee: Chris Nauroth
>         Attachments: HDFS-7604-screenshot-1.png, HDFS-7604-screenshot-2.png,
> HDFS-7604-screenshot-3.png, HDFS-7604-screenshot-4.png, HDFS-7604.001.patch,
> HDFS-7604.prototype.patch
>
>
> During heartbeats, the DataNode can report a list of its storage locations
> that have been taken out of service due to failure (such as due to a bad disk
> or a permissions problem). The NameNode can track these failed storage
> locations and then report them in JMX and the NameNode web UI.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)