I think it's relevant for us; we should consider running the analysis tool too.
Tim

---------- Forwarded message ----------
From: Stack <[email protected]>
Date: Tue, Oct 7, 2014 at 8:10 AM
Subject: Re: An important read
To: HBase Dev List <[email protected]>

Nkeywal points out HBASE-10452 has fixes for problems found by the
Aspirator tool mentioned in the paper.

I made HBASE-12187, "Review in source the paper 'Simple Testing Can
Prevent Most Critical Failures'", a critical against 1.0. Let's run
through their list of 'catastrophic failures' before we cut the 1.0
release.

St.Ack

On Mon, Oct 6, 2014 at 8:55 PM, Andrew Purtell <[email protected]> wrote:

> https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf
>
> Simple Testing Can Prevent Most Critical Failures: An Analysis of
> Production Failures in Distributed Data-intensive Systems
> Yuan et al., University of Toronto
>
> Large, production quality distributed systems still fail periodically, and
> do so sometimes catastrophically, where most or all users experience an
> outage or data loss. We present the result of a comprehensive study
> investigating 198 randomly selected, user-reported failures that occurred
> on Cassandra, HBase, Hadoop Distributed File System (HDFS), Hadoop
> MapReduce, and Redis, with the goal of understanding how one or multiple
> faults eventually evolve into a user-visible failure. We found that from a
> testing point of view, almost all failures require only 3 or fewer nodes to
> reproduce, which is good news considering that these services typically run
> on a very large number of nodes. However, multiple inputs are needed to
> trigger the failures with the order between them being important. Finally,
> we found the error logs of these systems typically contain sufficient data
> on both the errors and the input events that triggered the failure,
> enabling the diagnose and the reproduction of the production failures.
>
> We found the majority of catastrophic failures could easily have been
> prevented by performing simple testing on error handling code – the last
> line of defense – even without an understanding of the software design. We
> extracted three simple rules from the bugs that have lead to some of the
> catastrophic failures, and developed a static checker, Aspirator, capable
> of locating these bugs. Over 30% of the catastrophic failures would have
> been prevented had Aspirator been used and the identified bugs fixed.
> Running Aspirator on the code of 9 distributed systems located 143 bugs and
> bad practices that have been fixed or confirmed by the developers.
>
>
> This is an interesting benefit of open source and open development
> process. Please read this detailed analysis of availability and data loss
> bugs resulting from improper error handling, in HBase and other systems.
> The authors focus on a particular pattern of defect and cause. The point is
> well taken. It would be worth taking time where possible to revisit
> exception handling, especially where we have low test coverage.
>
> Also, consider HBASE-11912. The static analyses mentioned in this paper
> could likely be implemented with error-prone. Development and code review
> will always be uneven in a volunteer open source project. However if we
> agree on some baseline practices, and those are amenable to static
> analysis, then we could build validation of those practices into the
> compiler, in effect.
>
> --
> Best regards,
>
>    - Andy
>
> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> (via Tom White)
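
To make the idea of statically checking error-handling code a bit more concrete, here is a rough sketch in Python that scans Java source text for catch blocks containing nothing but whitespace and comments, and notes whether a TODO or FIXME sits inside. Everything in it is an assumption made for illustration: the pattern, the helper name find_suspect_handlers, and the sample snippet are hypothetical, and this is neither Aspirator nor an error-prone check. A regex cannot really parse Java (nested braces, string literals), so hits are only candidates for review.

```python
import re

# Hypothetical pattern: a Java catch clause whose body holds only
# whitespace, // line comments, or /* block comments */.
EMPTY_CATCH = re.compile(
    r"catch\s*\(([^)]*)\)\s*\{((?:\s|//[^\n]*|/\*.*?\*/)*)\}",
    re.DOTALL,
)


def find_suspect_handlers(java_source):
    """Return (line_number, exception_decl, has_todo) per empty catch block."""
    hits = []
    for m in EMPTY_CATCH.finditer(java_source):
        # 1-based line number of the 'catch' keyword.
        line = java_source.count("\n", 0, m.start()) + 1
        has_todo = bool(re.search(r"TODO|FIXME", m.group(2)))
        hits.append((line, m.group(1).strip(), has_todo))
    return hits


# Made-up sample input, not taken from any real code base.
SAMPLE = """\
try {
  region.flush();
} catch (IOException e) {
  // TODO: handle this
}
try {
  region.close();
} catch (Exception e) {
  LOG.error("close failed", e);
  throw e;
}
"""

if __name__ == "__main__":
    for line, decl, has_todo in find_suspect_handlers(SAMPLE):
        print(f"line {line}: catch ({decl}) has an empty body, todo={has_todo}")
```

On the sample, only the first handler is reported; the second one logs and rethrows, so it is left alone. A real check would work on the compiler's syntax tree rather than on raw text.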
