[ https://issues.apache.org/jira/browse/HBASE-13853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577670#comment-14577670 ]
stack commented on HBASE-13853: ------------------------------- +1 [~enis] > ITBLL data loss in HBase-1.1 > ---------------------------- > > Key: HBASE-13853 > URL: https://issues.apache.org/jira/browse/HBASE-13853 > Project: HBase > Issue Type: Bug > Reporter: Enis Soztutar > Assignee: Enis Soztutar > Priority: Blocker > Fix For: 2.0.0, 1.1.1 > > Attachments: 13853.remainder.txt, hbase-13853_v1-branch-1.1.patch > > > After 2 days of log mining with ITBLL Search tool (thanks Stack, BTW for the > tool), the sequence of events are like this: > One of mappers from Search reportedly finds missing keys in some WAL file: > {code} > 2015-06-07 06:49:24,573 INFO [main] org.apache.hadoop.mapred.MapTask: > Processing split: > hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.1433637872643 > (-9223372036854775808:9223372036854775807) length:132679657 2015-06-07 > 06:49:24,639 INFO [main] org.apache.hadoop.hbase.mapreduce.WALInputFormat: > Opening reader for > hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/oldWALs/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.1433637872643 > (-9223372036854775808:9223372036854775807) length:132679657 > 2015-06-07 06:50:16,383 INFO [main] > org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Loaded keys > to find: count=2384870 2015-06-07 06:50:16,607 INFO [main] > org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found > cell=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:prev/1433637873293/Put/vlen=16/seqid=0 > 2015-06-07 06:50:16,663 INFO [main] > org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found > cell=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:count/1433637873293/Put/vlen=8/seqid=0 > 2015-06-07 06:50:16,664 INFO [main] > org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found > cell=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC/meta:client/1433637873293/Put/vlen=71/seqid=0 > 2015-06-07 06:50:16,671 INFO [main] > org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found > cell=$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\xCEP\x12"\x04/meta:prev/1433637873293/Put/vlen=16/seqid=0 > 2015-06-07 06:50:16,672 INFO [main] > org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found > cell=$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\xCEP\x12"\x04/meta:count/1433637873293/Put/vlen=8/seqid=0 > 2015-06-07 06:50:16,678 INFO [main] > org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found > cell=$\x1A\x99\x06\x86\xE7\x07\xA2\xA7\xB2\xFB\xCEP\x12"\x04/meta:client/1433637873293/Put/vlen=71/seqid=0 > 2015-06-07 06:50:16,679 INFO [main] > org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList$Search: Found > cell=\x1F\x80#\x95\xAE:i=)S\x01\x08 > \xD6\xFF\xD5/meta:prev/1433637873293/Put/vlen=16/seqid=0 > {code} > {{hlog -p}} confirms the missing keys are there in the file: > {code} > Sequence=7276 , region=b086e29f909c9790446cd457c1ea7674 at write > timestamp=Sun Jun 07 00:44:33 UTC 2015 > row=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=meta:prev > value: \x1B\xF5\xF3^\x8D\xB4\x85\xE3\xF4wS\x9A]\x0D\xABe > row=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=meta:count > value: \x00\x00\x00\x00\x002\x87l > row=!+\xF1CB\x08\x13\xA0W\x94\xC4\x1C\xDA\x1D\x0D\xBC, column=meta:client > value: Job: job_1433466891829_0002 Task: > attempt_1433466891829_0002_m_000003_0 > {code} > When the RS gets killing from CM, master does WAL splitting: > {code} > ./hbase-hbase-regionserver-os-enis-dal-test-jun-4-2.log:2015-06-07 > 00:46:12,581 INFO [RS_LOG_REPLAY_OPS-os-enis-dal-test-jun-4-2:16020-0] > wal.WALSplitter: Processed 2971 edits across 4 regions; edits skipped=740; > log > file=hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/WALs/os-enis-dal-test-jun-4-4.openstacklocal,16020,1433636320201-splitting/os-enis-dal-test-jun-4-4.openstacklocal%2C16020%2C1433636320201.default.1433637872643, > length=132679657, corrupted=false, progress failed=false > {code} > The edits with Sequence=7276 should be in this recovered.edits file: > {code} > 2015-06-07 00:46:12,574 INFO [split-log-closeStream-2] wal.WALSplitter: > Closed wap > hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/recovered.edits/0000000000000007276.temp > (wrote 739 edits in 3950ms) > 2015-06-07 00:46:12,580 INFO [split-log-closeStream-2] wal.WALSplitter: > Rename > hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/recovered.edits/0000000000000007276.temp > to > hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/recovered.edits/0000000000000008752 > {code} > Notice that the file {{0000000000000008752}} should contain {{(wrote 739 > edits in 3950ms)}}. > When the region gets opened again, it replays the recovered.edits files: > {code} > 2015-06-07 00:46:17,764 DEBUG > [RS_OPEN_REGION-os-enis-dal-test-jun-4-5:16020-0] regionserver.HRegion: Found > 4 recovered edits file(s) under > hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674 > {code} > However, it does not see ANY entry in the recovered.edits file > {{0000000000000008752}}: > {code} > 2015-06-07 00:46:17,796 DEBUG > [RS_OPEN_REGION-os-enis-dal-test-jun-4-5:16020-0] regionserver.HRegion: > Applied 0, skipped 0, firstSequenceIdInLog=-1, maxSequenceIdInLog=-1, > path=hdfs://os-enis-dal-test-jun-4-1.openstacklocal:8020/apps/hbase/data/data/default/IntegrationTestBigLinkedList/b086e29f909c9790446cd457c1ea7674/recovered.edits/0000000000000008752 > {code} > Looking at the length of the file, it is very small which means that the > edits are not written. > Now, If you are still with me, I think the reason why we do not write any > edit to {{0000000000000008752}} although we say we have written 739 edits is > due to HBASE-12405. > Specifically, we pass {{FSHLog.getEarliestMemstoreSeqNum() - 1}} as the > stores flush seqId, which gets saved in the master and used with > {{LastSequenceId}}. > The WALSplitter skips the cells per store using {{filterCellByStore()}} which > is logical. However, the data loss happens, because > {{FSHLog.getEarliestMemstoreSeqNum()}} ignores flushing edits. For example, > if we have seqId = 10 to 100 as flushing, and seqId 101 in the new memstore, > and heartbeat to master during the flush will carry 101 as the earliest flush > seqId. if the flush does not complete (due to RS getting killed) this results > in incorrectly skipped edits. > The WAL replay code also uses this kind of filtering so this may also explain > some of the failures with DLR. -- This message was sent by Atlassian JIRA (v6.3.4#6332)