[ https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551344#comment-16551344 ]
stack commented on HBASE-18152: ------------------------------- Saw this again. Here is current example: {code} 2018-07-17 22:33:03,678 WARN [Thread-18] wal.ProcedureWALFormatReader: NOT INCREASING! current=class_name: "org.apache.hadoop.hbase.master.assignment.AssignProcedure" parent_id: 3902 proc_id: 3906 submitted_time: 1531890545613 owner: "stack" state: RUNNABLE stack_id: 6 stack_id: 16 last_update: 1531890545980 state_message { type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData" value: "\b\002\022F\b\200\336\354\331\312,\022\'\n\adefault\022\034IntegrationTestBigLinkedList\032\004\230\344(~\"\b\234q\307\034q\307\034h(\0000\0008\000" } , candidate=class_name: "org.apache.hadoop.hbase.master.assignment.AssignProcedure" parent_id: 3902 proc_id: 3906 submitted_time: 1531890545613 owner: "stack" state: RUNNABLE stack_id: 6 last_update: 1531890545812 state_message { type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData" value: "\b\002\022F\b\200\336\354\331\312,\022\'\n\adefault\022\034IntegrationTestBigLinkedList\032\004\230\344(~\"\b\234q\307\034q\307\034h(\0000\0008\000" } ... {code} We have written the entries backwards indeed. Looking at the logging for this particular procedure, I see the below entries which correspond to the last_update times: {code} .... 2018-07-17 22:09:05,812 INFO [PEWorker-8] assignment.AssignProcedure: Starting pid=3906, ppid=3902, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=IntegrationTestBigLinkedList, region=26d8fa5172ffdf3e89319e4b73fbf164; rit=OFFLINE, location=ve0538.halxg.cloudera.com,16020,1531890104902; forceNewPlan=false, retain=true .... 2018-07-17 22:09:05,980 INFO [PEWorker-10] assignment.RegionTransitionProcedure: Dispatch pid=3906, ppid=3902, state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure table=IntegrationTestBigLinkedList, region=26d8fa5172ffdf3e89319e4b73fbf164; rit=OPENING, location=ve0536.halxg.cloudera.com,16020,1531889066114 {code} Notice how they are on different worker threads. The procedures "execute" in the correct order but the edits land in the store in another order, something quiet possible given the distance between execute, post-execution cleanup, and store. > [AMv2] Corrupt Procedure WAL file; procedure data stored out of order > --------------------------------------------------------------------- > > Key: HBASE-18152 > URL: https://issues.apache.org/jira/browse/HBASE-18152 > Project: HBase > Issue Type: Bug > Components: Region Assignment > Affects Versions: 2.0.0 > Reporter: stack > Assignee: stack > Priority: Critical > Fix For: 3.0.0 > > Attachments: HBASE-18152.master.001.patch, > hbase-hbase-master-ctr-e138-1518143905142-221855-01-000002.hwx.site.log.gz, > pv2-00000000000000000036.log, pv2-00000000000000000047.log, > reading_bad_wal.patch > > > I've seen corruption from time-to-time testing. Its rare enough. Often we > can get over it but sometimes we can't. It took me a while to capture an > instance of corruption. Turns out we are write to the WAL out-of-order which > undoes a basic tenet; that WAL content is ordered in line w/ execution. > Below I'll post a corrupt WAL. > Looking at the write-side, there is a lot going on. I'm not clear on how we > could write out of order. Will try and get more insight. Meantime parking > this issue here to fill data into. -- This message was sent by Atlassian JIRA (v7.6.3#76005)