[ https://issues.apache.org/jira/browse/ACCUMULO-1053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13576648#comment-13576648 ]
Eric Newton commented on ACCUMULO-1053:
---------------------------------------

Here's a basic analysis of the data loss. It is incomplete, but I want to document the approach.

First, I look at the reported missing entries. They look like this:

{noformat}
7001d8d9c37ff1a0 259018854a9e0ab5
700255fb34764bf0 598c79d47356f8da
70031a5ad5ff67c0 017b55debd8b9ba7
...
{noformat}

You read this as "row id 7001d8d9c37ff1a0 is missing; it is referenced in row 259018854a9e0ab5". A basic check with the shell confirms this:

{noformat}
root@test ci> scan -r 7001d8d9c37ff1a0
root@test ci> scan -r 259018854a9e0ab5 -st
259018854a9e0ab5 5705:3563 [] 1360613984602 74430138-9c16-4135-8230-53b2c7e9af94:000000001987ebe3:7001d8d9c37ff1a0:87abf256
{noformat}

The reference row contains:

||Key Element||Data||
|row|random id|
|cf:cq|random column|
|ts|ingest time|
|value|ingester uuid : counter : reference : checksum|

Find the ingester by grepping through the ingest logs for the ingester uuid. From there, find the moment when the flush occurred to look for errors:

{noformat}
...
FLUSH 1360613974304 3085438 3081429 427999984 1000000
FLUSH 1360614028073 53769 49119 428999984 1000000
...
{noformat}

Verify the count matches the timeframe (000000001987ebe3 == 428338147):

{noformat}
FLUSH 1360613974304 3085438 3081429 427999984 1000000
                                    1360613984602 428338147
FLUSH 1360614028073 53769 49119 428999984 1000000
{noformat}

It's good to verify these little assumptions, because it's very easy to make a small mistake and look at the wrong log file, or misread a digit.
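The bracketing check above can be sketched as a few lines of Python. The field layout (uuid : hex counter : reference : checksum) is taken from the shell output shown; the helper names are hypothetical, not part of Accumulo:

```python
# Sketch of the consistency check: parse the reference-row value and confirm
# its counter and timestamp fall between the two FLUSH log lines that
# bracket the entry. parse_value/bracketed are illustrative names only.

def parse_value(value):
    """Split a continuous-ingest value into its four fields."""
    uuid, counter, reference, checksum = value.split(":")
    return uuid, int(counter, 16), reference, checksum

def bracketed(ts, counter, flush_before, flush_after):
    """flush_* are (timestamp_ms, running_count) pairs from FLUSH lines."""
    t0, c0 = flush_before
    t1, c1 = flush_after
    return t0 <= ts <= t1 and c0 <= counter <= c1

value = "74430138-9c16-4135-8230-53b2c7e9af94:000000001987ebe3:7001d8d9c37ff1a0:87abf256"
uuid, counter, reference, checksum = parse_value(value)
print(counter)  # 0x1987ebe3 == 428338147
print(bracketed(1360613984602, counter,
                (1360613974304, 427999984),
                (1360614028073, 428999984)))  # True
```

Automating this removes exactly the "misread a digit" class of mistake mentioned above.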
Find the tablet name by scanning the !METADATA table:

{noformat}
scan -b 5;7001d8d9c37ff1a0 -c ~tab:~pr,file
5;7020c49ba5e354a file:/t-0008a8t/A0008ld2.rf [] 95094250,2306420
5;7020c49ba5e354a file:/t-0008a8t/C000919w.rf [] 95750715,2319223
5;7020c49ba5e354a file:/t-0008a8t/F000921z.rf [] 6838391,170103
5;7020c49ba5e354a file:/t-0008a8t/F000924e.rf [] 5517442,136722
5;7020c49ba5e354a ~tab:~pr [] \x0170000000000000a8
{noformat}

Find the location of the tablet at the time of the loss by looking for "5;7020c49ba5e354a" in all the tserver logs, sorted by time. Now you can find the walogs in use by the tablet server, and the updates made by the ingester (by ip address).

In this case, the data was written to a walog, and the tablet was successfully minor compacted. The minor compaction file was compacted into a 2nd file, and that file was compacted into C000919w.rf. However, C000919w.rf does not contain 7001d8d9c37ff1a0, even though there was no read of the WAL and no recovery took place; all of this took place on a single server.

I've turned on a 48-hour trash, and will re-run the test, so I can narrow down where the data is lost.

> continuous ingest detected data loss
> ------------------------------------
>
>                 Key: ACCUMULO-1053
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-1053
>             Project: Accumulo
>          Issue Type: Bug
>          Components: test, tserver
>            Reporter: Eric Newton
>            Assignee: Eric Newton
>            Priority: Critical
>             Fix For: 1.5.0
>
>
> Now that we're logging directly to HDFS, we added datanodes to the agitator.
> That is, we are now killing data nodes during ingest, and now we are losing
> data.
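As an addendum, the log-correlation step in the walkthrough (collecting every mention of the tablet extent across tserver logs, ordered by time) can be sketched as a small script. The log-line layout here (epoch-millis as the first field) and all file names are assumptions for illustration, not Accumulo's actual log format:

```python
# Hypothetical helper: gather every line mentioning a tablet extent across a
# set of tserver logs and sort them by timestamp, so the tablet's
# assignments and compactions read as one timeline.

def extent_timeline(logs, extent):
    """logs: mapping of log file name -> list of lines.
    Returns (timestamp, file, line) tuples sorted by timestamp."""
    matches = []
    for fname, lines in logs.items():
        for line in lines:
            if extent in line:
                ts = int(line.split()[0])  # assumed: first field is epoch ms
                matches.append((ts, fname, line))
    return sorted(matches)

# Toy example with fabricated log lines:
logs = {
    "tserver_a.log": [
        "1360613974304 assigned 5;7020c49ba5e354a",
        "1360613984700 minor compaction 5;7020c49ba5e354a -> F000921z.rf",
    ],
    "tserver_b.log": [
        "1360613980000 unrelated tablet 5;0123456789abcdef",
    ],
}
for ts, fname, line in extent_timeline(logs, "5;7020c49ba5e354a"):
    print(ts, fname)
```

In practice the same effect comes from grep plus sort, as described above; a script is just easier to re-run as the investigation narrows.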