[jira] [Commented] (HBASE-10329) Fail the writes rather than proceeding silently to prevent data loss when AsyncSyncer encounters null writer and its writes aren't synced by other Asyncer

stack (JIRA) Mon, 13 Jan 2014 12:36:59 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13869838#comment-13869838
 ]


stack commented on HBASE-10329:
-------------------------------

Thanks [~fenghh].  Looks great.  Thanks for persisting and fixing my hack.

bq. ....and stack fixed this by adding 'if (writer != null)' to protect the 
sync operation....

The check for nulll writer is actually an old 'problem' done in a few places 
about the code IIRC so kudos digging in.

Over the w/e I was working on my HBASE-10156.  Long story short, I ran into the 
same issue.  I need to hold the writer thread while the log is rolled out from 
under it only I can't hold the writer thread at any arbitrary point; I have to 
hold the writer when it attains the highest outstanding sync point.  Only then 
I can roll the log (patch coming soon).  Having this issue made me wonder how 
the current implementation does this dance.  This issue seems to indicate it 
didn't. 

Good on you.

> Fail the writes rather than proceeding silently to prevent data loss when 
> AsyncSyncer encounters null writer and its writes aren't synced by other 
> Asyncer
> ----------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-10329
>                 URL: https://issues.apache.org/jira/browse/HBASE-10329
>             Project: HBase
>          Issue Type: Bug
>          Components: regionserver, wal
>    Affects Versions: 0.98.0
>            Reporter: Feng Honghua
>            Assignee: Feng Honghua
>            Priority: Critical
>             Fix For: 0.98.0, 0.99.0
>
>         Attachments: HBASE-10329-trunk_v0.patch
>
>
> Last month after I introduced multiple AsyncSyncer threads to improve the 
> throughput for lower number client write threads, [~stack] encountered a NPE 
> while doing the test where null-writer occurs in AsyncSyncer when doing sync. 
> Since we have run many times test in cluster to verify the throughput 
> improvement, and never encountered such NPE, it really confused me. (and 
> [~stack] fixed this by adding 'if (writer != null)' to protect the sync 
> operation)
> These days from time to time I wondered why the writer can be null in 
> AsyncSyncer and whether it's safe to fix it by just adding a null checking 
> before doing sync, as [~stack] did. After some digging, I find out the case 
> where AsyncSyncer can encounter null-writer, it is as below:
> 1. t1: AsyncWriter appends writes to hdfs, triggers AsyncSyncer 1 with 
> writtenTxid==100
> 2. t2: AsyncWriter appends writes to hdfs, triggers AsyncSyncer 2 with 
> writtenTxid==200
> 3. t3: rollWriter starts, it grabs the updateLock to prevents further writes 
> from client writes to enter pendingWrites, and then waits for all items(<= 
> 200) in pendingWrites to append and finally sync to hdfs
> 4. t4: AsyncSyncer 2 finishes, now syncedTillHere==200(it also help sync 
> <=100 as a whole)
> 5. t5: rollWriter now can close writer, set writer=null...
> 6. t6: AsyncSyncer 1 starts to do sync and finds the writer is null... before 
> rollWriter sets writer to the newly rolled Writer
> We can see:
> 1. the null writer is possible only after there are multiple AsyncSyncer 
> threads, that's why we never encountered it before introducing multiple 
> AsyncSyncer threads.
> 2. since rollWriter can set writer=null only after all items of pendingWrites 
> sync to hdfs, and AsyncWriter is in the critical path of this task and there 
> is only one single AsyncWriter thread, so AsyncWriter can't encounter null 
> writer, that's why we never encounter null writer in AsyncWriter though it 
> also uses writer. This is the same reason as why null-writer never occurs 
> when there is a single AsyncSyncer thread.
> And we should treat differently when writer == null in AsyncSyncer:
> 1. if txidToSync <= syncedTillHere, this means all writes this AsyncSyncer 
> care about have already been synced by other AsyncSyncer, we can safely 
> ignore sync(as [~stack] does here);
> 2. if txidToSync > syncedTillHere, we need fail all the writes with txid <= 
> txidToSync to avoid data loss: user gets successful write response but can't 
> read out the writes after getting the successful write response, from user's 
> perspective this is data loss (according to above analysis, such case should 
> not occur, but we still should add such defensive treatment to prevent data 
> loss if it really occurs, such as by some bug introduced later)
> also fix the bug where isSyncing needs to reset to false when writer.sync 
> encounters IOException: AsyncSyncer swallows such exception by failing all 
> writes with txid<=txidToSync, and this AsyncSyncer thread is now ready to do 
> later sync, its isSyncing needs to be reset to false in the IOException 
> handling block, otherwise it can't be selected by AsyncWriter to do sync



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)

[jira] [Commented] (HBASE-10329) Fail the writes rather than proceeding silently to prevent data loss when AsyncSyncer encounters null writer and its writes aren't synced by other Asyncer

Reply via email to