[ https://issues.apache.org/jira/browse/ACCUMULO-4092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15074140#comment-15074140 ]
Eric Newton commented on ACCUMULO-4092:
---------------------------------------

My guess is that it would perform fast enough to leave it as a conditional update. Conditional mutations are 3x slower, and lock out all other users of the row. However, the tablet server should be the only user of that row.

> metadata table corruption on recovery
> -------------------------------------
>
> Key: ACCUMULO-4092
> URL: https://issues.apache.org/jira/browse/ACCUMULO-4092
> Project: Accumulo
> Issue Type: Bug
> Components: tserver
> Affects Versions: 1.6.4
> Environment: large production system, 1.6.2 with local patches, hadoop 2.2
> Reporter: Eric Newton
>
> I suspect that we are getting metadata table corruption on WAL recovery.
> There have been several hints that this has occurred over the past 2 years,
> but I have not had strong evidence for it until today.
> A large production cluster was recently upgraded to 1.6.4. Upon shutdown, it
> had several consistency check failures.
> When a tablet is unloaded, it double-checks the entries for the tablet held
> in memory against the metadata for the tablet. When the production system was
> restarted for the upgrade, this check failed for several tablets. In
> particular, there were file references for the tablet, that did not exist in
> memory.
> This particular system has a very large table which is organized by date.
> Almost all of the tablets that failed the check occurred on the same date. If
> the metadata tablet for those tablets was recovered on that date, and there
> is some bug recovering the WAL entries, they would have affected multiple
> tablets on the same day.
> After searching around the logs, we did find that the metadata tablet for the
> corrupt tablets did experience a recovery on the date in question.
> Unfortunately, the WAL files were GC'd many weeks ago.
> We need more information to track down the bug. Some possible ways to get
> this information include:
> 1) add periodic consistency checks: It's simple, and would detect problems
> earlier. In a test environment, we might be able to keep all the archived
> WALs.
> 2) upon metadata tablet recovery, the master could issue a request for
> consistency checks for the affected tablets. If checks fail, the recovery
> logs could be archived.
> 3) add metadata splits to the long-running tests which would add many more
> metadata tablet recoveries
> I suspect the bug is subtle, and may not cause data loss, since we don't see
> data loss in continuous ingest tests. But that doesn't mean that deleted
> data isn't being returned to a table, since the CI test does not delete data.
> The uptime for this system is measured in months and includes several hundred
> nodes. The metadata tablet is spread over most of the cluster.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
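
For illustration only: the unload-time check described in the issue (file references held in memory compared against the metadata entries for the tablet) amounts to a set comparison. The sketch below is not Accumulo's implementation; the function names, the report keys, and the file paths are all hypothetical, and it assumes both sides can be listed as plain strings.

```python
def check_tablet_files(in_memory_files, metadata_files):
    """Compare a tablet's in-memory file references with its metadata entries.

    Hypothetical helper: both arguments are iterables of file-path strings.
    Returns the references that appear on only one side.
    """
    in_memory = set(in_memory_files)
    in_metadata = set(metadata_files)
    return {
        # The failure mode reported in the issue: the metadata table
        # references files that the tablet does not hold in memory.
        "only_in_metadata": sorted(in_metadata - in_memory),
        # The opposite mismatch, reported for completeness.
        "only_in_memory": sorted(in_memory - in_metadata),
    }


def is_consistent(report):
    """True when neither side has a reference the other lacks."""
    return not report["only_in_metadata"] and not report["only_in_memory"]


# Made-up paths: the metadata side carries one extra file reference.
report = check_tablet_files(
    in_memory_files=["/t-0001/F0000a.rf", "/t-0001/F0000b.rf"],
    metadata_files=["/t-0001/F0000a.rf", "/t-0001/F0000b.rf", "/t-0001/F0000c.rf"],
)
```

Run periodically, or after a metadata tablet recovery as suggested in options 1) and 2), a non-empty `only_in_metadata` list would flag the affected tablet early enough that the recovery logs could still be archived.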