Thanks for the write-up. Would the new tests be sub-tasks of HBASE-7290?
Cheers

On Mon, Jan 14, 2013 at 10:32 AM, Aleksandr Shulman <[email protected]> wrote:

> Hi everyone,
>
> I'd like to start a thread about Cloudera's testing efforts on the upcoming
> snapshots feature. This is a new feature and it's important that we explain
> our testing efforts and get the community's opinion on what we'd all like
> to see tested. My hope is that from this discussion, we can get more ideas
> about what needs to be tested and gain confidence in the testing we have in
> place.
>
> Before I begin, I'd like to introduce myself. I'm Aleks Shulman. I'm a
> software engineer at Cloudera, working primarily on HBase. Within HBase, I
> am focusing on the quality side of things. What this means to me is a
> conversation unto itself, but in brief, I will be writing tests and test
> frameworks. I will also be an advocate for the user experience, with
> particular focus on API compatibility and ease-of-use.
>
> So let's discuss snapshots:
> There are two main areas that should be tested, and they correspond nicely
> to what can be done as unit tests and what is better left to a Jenkins job
> or some other automation: unit testing and non-unit testing. We've been
> working on this for a bit, so there is already some progress in these
> areas:
>
> Unit testing - In progress or completed:
>
> 1. HBase Snapshots Repeatability and Idempotency Test:
> This test class verifies proper behavior with regard to performing
> restore/clone operations on tables that themselves were created as a clone
> or restored from a snapshot. This is an interesting set of cases because of
> the way snapshots work. They work by pointing to the original HFiles.
> We can use these tests to verify correctness in the file system and test
> closure under deletion of the original table.
>
> 2. HBase Snapshots HTable Descriptor Test
> This test class verifies proper behavior with regard to changes to the
> information about the table itself before and after snapshotting in the
> 'before' table and the 'after' table.
>
> 3. HBase Snapshots HFileLink Test
> This test class inspects the correctness of the HFileLink files. It looks
> into their permissions, the naming convention, and how they respond to
> events. Events may include an HFile being deleted or moved.
>
> 4. HBase Snapshots Table Dimensions Test
> This test class inspects operations on tables that are empty, have only one
> row, have one or two CFs, etc. Basically, any edge scenario in what the
> table looks like that may affect the way it is snapshotted or
> restored/cloned.
>
> 5. HBase Snapshots Independence Test
> This test should verify that all aspects of table independence are
> guaranteed between the original table and the restored snapshot/clone.
> This includes things like data mutations, compactions, splits, etc. It also
> includes metadata changes.
>
> 6. HBase Snapshots Aborted or Failed Snapshot Cleanup
> Verifies that no cruft is left over after an attempt to snapshot a table
> fails or is aborted. We should be able to account for every file in the
> file system before and after.
>
> 7. HBase Snapshots HFile Archive Test
> This test task is to fill in any gaps in testing of archiving as it relates
> to snapshots. Snapshots rely on the HFileArchiver/LogArchiver with
> two new cleaners (SnapshotHFile/SnapshotLog Cleaners), so we'd need to go
> through and find out what needs to be tested between them.
>
> 8. HBase Snapshots Export Test
> This test should verify that export of a snapshot to another cluster works
> properly.
> Implemented as: mvn clean test -PlocalTests
> -Dtest=org.apache.hadoop.hbase.snapshot.TestExportSnapshot
> However, we need to add more tests around chmod, chown, and checksums.
>
> 9. HBase Snapshots Concurrent Snapshots Test
> This test class will enforce proper behavior in situations where race
> conditions can occur. For example, if one process attempts to restore a
> table and another one tries to do so simultaneously, what happens? We need
> to know how dangerous this could be and whether it is possible for data to
> be lost.
> Covered in HBASE-7536.
>
> Unit testing - Lightly tested so far, or tests we are hoping to write soon:
>
> 1. HBase Snapshots File System Correctness Tests
>
> This test class verifies proper behavior with regard to what the file
> system looks like. What the file system contains should be predictable
> after certain events, both snapshot-specific and environment-specific.
> For example, after a snapshot, we should expect there to be files in the
> /hbase/.snapshot/ folder. Also, after a split occurs on the base table and
> the underlying HFiles go through flux, we should be able to know beforehand
> where files move. In particular, this is important to test after repeated
> deletions and modifications. Also, we want to make sure no cruft remains
> after various operations occur.
>
>
> 2. HBase Snapshots (Re)Naming Test [Note: Renaming snapshots is not
> supported yet!]
>
> These tests should verify valid/invalid names for snapshots. In particular,
> it should use the rename_snapshot command to attempt to rename to a table
> that already exists, or to a snapshot that already exists (or had existed
> but was deleted).
> Things like special characters or semantically-meaningful characters are
> important as well. Other things that need to be tested are what happens if
> a snapshot is created, deleted, the underlying table is modified, and then
> another snapshot is taken. The snapshot should contain the most recent
> data.
>
>
> 3. Snapshots logline test:
> Verifies that the proper loglines are generated for events.
> Manual testing for this might include making sure that spurious,
> misleading, or unnecessary log lines are not present.
>
> 4. HBase Snapshots Aborted or Failed Clone or Restore
>
> Verifies that no cruft is left over after an attempt to restore or clone a
> snapshotted table fails or is aborted and that further snapshots can take
> place. This may be tricky and could require writing some additional
> utilities.
>
> Non-unit testing:
>
> This area of testing is less straightforward and more exploratory in
> nature. It's open-ended but with some direction. Particularly, we want to
> test a lot of "what if this happens when we do something related to
> snapshots". By "this happens", I mean compactions, splits, processes dying,
> master failing over to backup master, etc. By "something related to
> snapshots", that could mean taking a snapshot, restoring a snapshot, or
> cloning a snapshot, among other things. In addition, we can see what
> happens as scaling factors (e.g., the number of regions, amount of data per
> node, duration of test, and frequency of compactions/splits) increase.
> Finally, we should benchmark the time it takes to take/restore/clone a
> snapshot and see how it changes with scale factors.
>
> We are testing some of these combinations internally. When we see something
> go awry, we fix and rerun the trial, with the expectation that the feature
> becomes more stable and reliable.
>
> Some of the things we have tried:
> - Long running tests: Run repeated snapshots while verifying that all is
> well.
>
> - Meanness tests:
> 1. Killing the master
> 2. Performing a compaction
> 3. Table enable/disable
>
> Feel free to follow up with questions.
>
> --
> Best Regards,
>
> Aleks Shulman
> 847.814.5804
> Cloudera
>
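The "account for every file in the file system before and after" check from the cleanup tests above could be sketched roughly as follows. This is only an illustration under stated assumptions: it walks a local temporary directory rather than HDFS, the helper names (list_files, leftover_files, demo) are hypothetical and not part of HBase, and the ".tmp" directory under ".snapshot" is made up to stand in for whatever a failed snapshot might leave behind.

```python
import os
import tempfile


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def leftover_files(before, after):
    """Files that exist after the operation but did not exist before it."""
    return sorted(after - before)


def demo():
    """Simulate a failed snapshot that leaves one stray file behind."""
    with tempfile.TemporaryDirectory() as root:
        # Pre-existing table data (hypothetical layout, not the real HBase one).
        region_dir = os.path.join(root, "table", "region")
        os.makedirs(region_dir)
        with open(os.path.join(region_dir, "hfile1"), "w") as f:
            f.write("data")

        before = list_files(root)

        # Stand-in for an aborted snapshot: a partial file in a made-up
        # working directory under .snapshot.
        work_dir = os.path.join(root, ".snapshot", ".tmp")
        os.makedirs(work_dir)
        with open(os.path.join(work_dir, "partial"), "w") as f:
            f.write("")

        after = list_files(root)
        return leftover_files(before, after)
```

A real test would expect leftover_files to return an empty list after a failed or aborted snapshot. Note that this sketch only compares files, so an empty directory left behind would go unnoticed.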
