[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

stack (JIRA) Fri, 20 Jul 2018 15:00:39 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-18152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551344#comment-16551344
 ]


stack commented on HBASE-18152:
-------------------------------

Saw this again. Here is current example:

{code}
2018-07-17 22:33:03,678 WARN  [Thread-18] wal.ProcedureWALFormatReader: NOT 
INCREASING! current=class_name: 
"org.apache.hadoop.hbase.master.assignment.AssignProcedure"
parent_id: 3902
proc_id: 3906
submitted_time: 1531890545613
owner: "stack"
state: RUNNABLE
stack_id: 6
stack_id: 16
last_update: 1531890545980
state_message {
  type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData"
  value: 
"\b\002\022F\b\200\336\354\331\312,\022\'\n\adefault\022\034IntegrationTestBigLinkedList\032\004\230\344(~\"\b\234q\307\034q\307\034h(\0000\0008\000"
}
, candidate=class_name: 
"org.apache.hadoop.hbase.master.assignment.AssignProcedure"
parent_id: 3902
proc_id: 3906
submitted_time: 1531890545613
owner: "stack"
state: RUNNABLE
stack_id: 6
last_update: 1531890545812
state_message {
  type_url: "type.googleapis.com/hbase.pb.AssignRegionStateData"
  value: 
"\b\002\022F\b\200\336\354\331\312,\022\'\n\adefault\022\034IntegrationTestBigLinkedList\032\004\230\344(~\"\b\234q\307\034q\307\034h(\0000\0008\000"
}
...
{code}

We have written the entries backwards indeed.

Looking at the logging for this particular procedure, I see the below entries 
which correspond to the last_update times:

{code}
....
2018-07-17 22:09:05,812 INFO  [PEWorker-8] assignment.AssignProcedure: Starting 
pid=3906, ppid=3902, state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure 
table=IntegrationTestBigLinkedList, region=26d8fa5172ffdf3e89319e4b73fbf164; 
rit=OFFLINE, location=ve0538.halxg.cloudera.com,16020,1531890104902; 
forceNewPlan=false, retain=true
....
2018-07-17 22:09:05,980 INFO  [PEWorker-10] 
assignment.RegionTransitionProcedure: Dispatch pid=3906, ppid=3902, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=IntegrationTestBigLinkedList, region=26d8fa5172ffdf3e89319e4b73fbf164; 
rit=OPENING, location=ve0536.halxg.cloudera.com,16020,1531889066114
{code}

Notice how they are on different worker threads. The procedures "execute" in 
the correct order but the edits land in the store in another order, something 
quiet possible given the distance between execute, post-execution cleanup, and 
store.



> [AMv2] Corrupt Procedure WAL file; procedure data stored out of order
> ---------------------------------------------------------------------
>
>                 Key: HBASE-18152
>                 URL: https://issues.apache.org/jira/browse/HBASE-18152
>             Project: HBase
>          Issue Type: Bug
>          Components: Region Assignment
>    Affects Versions: 2.0.0
>            Reporter: stack
>            Assignee: stack
>            Priority: Critical
>             Fix For: 3.0.0
>
>         Attachments: HBASE-18152.master.001.patch, 
> hbase-hbase-master-ctr-e138-1518143905142-221855-01-000002.hwx.site.log.gz, 
> pv2-00000000000000000036.log, pv2-00000000000000000047.log, 
> reading_bad_wal.patch
>
>
> I've seen corruption from time-to-time testing.  Its rare enough. Often we 
> can get over it but sometimes we can't. It took me a while to capture an 
> instance of corruption. Turns out we are write to the WAL out-of-order which 
> undoes a basic tenet; that WAL content is ordered in line w/ execution.
> Below I'll post a corrupt WAL.
> Looking at the write-side, there is a lot going on. I'm not clear on how we 
> could write out of order. Will try and get more insight. Meantime parking 
> this issue here to fill data into.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HBASE-18152) [AMv2] Corrupt Procedure WAL file; procedure data stored out of order

Reply via email to