[ https://issues.apache.org/jira/browse/HBASE-10329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ted Yu updated HBASE-10329: --------------------------- Status: Patch Available (was: Open) > Fail the writes rather than proceeding silently to prevent data loss when > AsyncSyncer encounters null writer and its writes aren't synced by other > Asyncer > ---------------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: HBASE-10329 > URL: https://issues.apache.org/jira/browse/HBASE-10329 > Project: HBase > Issue Type: Bug > Components: regionserver, wal > Reporter: Feng Honghua > Assignee: Feng Honghua > Attachments: HBASE-10329-trunk_v0.patch > > > Last month after I introduced multiple AsyncSyncer threads to improve the > throughput for lower number client write threads, [~stack] encountered a NPE > while doing the test where null-writer occurs in AsyncSyncer when doing sync. > Since we have run many times test in cluster to verify the throughput > improvement, and never encountered such NPE, it really confused me. (and > [~stack] fixed this by adding 'if (writer != null)' to protect the sync > operation) > These days from time to time I wondered why the writer can be null in > AsyncSyncer and whether it's safe to fix it by just adding a null checking > before doing sync, as [~stack] did. After some digging, I find out the case > where AsyncSyncer can encounter null-writer, it is as below: > 1. t1: AsyncWriter appends writes to hdfs, triggers AsyncSyncer 1 with > writtenTxid==100 > 2. t2: AsyncWriter appends writes to hdfs, triggers AsyncSyncer 2 with > writtenTxid==200 > 3. t3: rollWriter starts, it grabs the updateLock to prevents further writes > from client writes to enter pendingWrites, and then waits for all items(<= > 200) in pendingWrites to append and finally sync to hdfs > 4. t4: AsyncSyncer 2 finishes, now syncedTillHere==200(it also help sync > <=100 as a whole) > 5. t5: rollWriter now can close writer, set writer=null... > 6. t6: AsyncSyncer 1 starts to do sync and finds the writer is null... before > rollWriter sets writer to the newly rolled Writer > We can see: > 1. the null writer is possible only after there are multiple AsyncSyncer > threads, that's why we never encountered it before introducing multiple > AsyncSyncer threads. > 2. since rollWriter can set writer=null only after all items of pendingWrites > sync to hdfs, and AsyncWriter is in the critical path of this task and there > is only one single AsyncWriter thread, so AsyncWriter can't encounter null > writer, that's why we never encounter null writer in AsyncWriter though it > also uses writer. This is the same reason as why null-writer never occurs > when there is a single AsyncSyncer thread. > And we should treat differently when writer == null in AsyncSyncer: > 1. if txidToSync <= syncedTillHere, this means all writes this AsyncSyncer > care about have already been synced by other AsyncSyncer, we can safely > ignore sync(as [~stack] does here); > 2. if txidToSync > syncedTillHere, we need fail all the writes with txid <= > txidToSync to avoid data loss: user gets successful write response but can't > read out the writes after getting the successful write response, from user's > perspective this is data loss (according to above analysis, such case should > not occur, but we still should add such defensive treatment to prevent data > loss if it really occurs, such as by some bug introduced later) > also fix the bug where isSyncing needs to reset to false when writer.sync > encounters IOException: AsyncSyncer swallows such exception by failing all > writes with txid<=txidToSync, and this AsyncSyncer thread is now ready to do > later sync, its isSyncing needs to be reset to false in the IOException > handling block, otherwise it can't be selected by AsyncWriter to do sync -- This message was sent by Atlassian JIRA (v6.1.5#6160)