[ 
https://issues.apache.org/jira/browse/HBASE-13877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enis Soztutar updated HBASE-13877:
----------------------------------
    Attachment: hbase-13877_v3-branch-1.1.patch

Here is a v3 patch that is a little more comprehensive. Defining the semantics 
for a failed flushCache() or close() is hard, since the region cannot do much 
by itself. 

The contract for flushCache() and close() was already that the RS should abort 
when a DroppedSnapshotException (DSE) is thrown. The patch makes this more 
explicit and adds a safeguard: the region itself calls abort if rss is passed. 
Since flush() can be called from several different callers (MemstoreFlusher, 
snapshot, etc.), we also have to guarantee that the region is put in the 
closing state before the DSE is thrown, so that no other writes / flushes can 
happen. This is needed because we cannot call {{close(true)}} inside 
flushCache(): we cannot promote our read lock to a write lock. The caller 
should receive the DSE, then abort itself and the RSS, which then calls 
close(true). But there is a window of time before the RSS calls close(true), 
so no other flushes should come in while the caller handles the exception. 
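
The ordering described above can be sketched roughly as follows. This is a toy 
model, not the patch itself: FlushGuardSketch, Region, Abortable and the 
failCommit flag are hypothetical names made up for illustration, and the 
DroppedSnapshotException here is a local stand-in for the HBase class. It only 
shows the intended sequence under the read lock: mark the region closing, 
abort through rss if one was passed, then throw the DSE.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Toy model only: none of these classes are the real HBase types.
class FlushGuardSketch {

    /** Local stand-in for HBase's DroppedSnapshotException. */
    static class DroppedSnapshotException extends RuntimeException {
        DroppedSnapshotException(String msg, Throwable cause) { super(msg, cause); }
    }

    /** Hypothetical stand-in for the rss handle passed to the region. */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    static class Region {
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
        private final AtomicBoolean closing = new AtomicBoolean(false);
        private final AtomicBoolean closed = new AtomicBoolean(false);
        private final Abortable rss; // may be null

        Region(Abortable rss) { this.rss = rss; }

        boolean isClosing() { return closing.get(); }
        boolean isClosed() { return closed.get(); }

        /**
         * Flushes under the read lock. Returns false if the region is already
         * closing. failCommit simulates a failure after flush prepare.
         */
        boolean flushCache(boolean failCommit) {
            lock.readLock().lock();
            try {
                if (closing.get()) {
                    return false; // refuse new flushes once closing
                }
                if (failCommit) {
                    // Only the read lock is held, so close(true) cannot be called
                    // here (it needs the write lock). Mark the region closing first...
                    closing.set(true);
                    DroppedSnapshotException dse = new DroppedSnapshotException(
                        "flush failed after prepare", new RuntimeException("simulated"));
                    // ...then abort through rss if one was passed, and throw.
                    if (rss != null) {
                        rss.abort("replay of WAL required", dse);
                    }
                    throw dse;
                }
                return true;
            } finally {
                lock.readLock().unlock();
            }
        }

        /** Takes the write lock, so it waits for in-flight flushes to drain. */
        void close(boolean abort) {
            lock.writeLock().lock();
            try {
                closing.set(true);
                closed.set(true);
            } finally {
                lock.writeLock().unlock();
            }
        }
    }

    public static void main(String[] args) {
        Region region = new Region((why, cause) -> System.out.println("abort: " + why));
        try {
            region.flushCache(true);
        } catch (DroppedSnapshotException e) {
            // The caller's window: further flushes are refused until close(true).
            System.out.println("flush allowed? " + region.flushCache(false));
            region.close(true);
        }
    }
}
```

In this sketch the closing flag is what covers the window between the DSE 
being thrown and close(true) being called: a second flushCache() call in that 
window returns without flushing anything.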

[~saint....@gmail.com], [~Apache9] does the patch make sense? It also touches 
upon HBASE-10514. 


 



> Interrupt to flush from TableFlushProcedure causes dataloss in ITBLL
> --------------------------------------------------------------------
>
>                 Key: HBASE-13877
>                 URL: https://issues.apache.org/jira/browse/HBASE-13877
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>            Priority: Blocker
>             Fix For: 2.0.0, 1.2.0, 1.1.1
>
>         Attachments: hbase-13877_v1.patch, hbase-13877_v2-branch-1.1.patch, 
> hbase-13877_v3-branch-1.1.patch
>
>
> ITBLL with 1.25B rows failed for me (and Stack as reported in 
> https://issues.apache.org/jira/browse/HBASE-13811?focusedCommentId=14577834&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14577834)
>  
> HBASE-13811 and HBASE-13853 fixed an issue with WAL edit filtering. 
> The root cause this time seems to be different. It is due to the 
> procedure-based flush interrupting the flush request when the procedure is 
> cancelled because of an exception elsewhere. This leaves the memstore 
> snapshot intact without aborting the server. The next flush then flushes the 
> previous memstore snapshot with the current seqId (as opposed to the seqId 
> from the memstore snapshot). This creates an hfile with a larger seqId than 
> its contents warrant. The previous behavior in 0.98 and 1.0 (I believe) is 
> that any interruption / exception after flush prepare causes an RS abort.
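
The failure sequence in the quoted description can be modelled with a small 
toy, shown here only as an illustration. StaleSnapshotSketch, Memstore and its 
methods are hypothetical names, edits are reduced to bare sequence ids, and the 
numbers (1..10, 11..20) are made up; the real memstore and flush code are far 
more involved.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only: edits are represented just by their sequence ids.
class StaleSnapshotSketch {

    static class Memstore {
        final List<Long> active = new ArrayList<>();
        final List<Long> snapshot = new ArrayList<>();

        void add(long seqId) { active.add(seqId); }

        /** Flush prepare: snapshot the active edits, unless a snapshot is still pending. */
        void prepare() {
            if (snapshot.isEmpty()) {
                snapshot.addAll(active);
                active.clear();
            }
        }

        /** Flush commit: returns {seqId stamped on the file, max seqId actually in it}. */
        long[] commit(long flushSeqId) {
            long maxInFile = Collections.max(snapshot);
            snapshot.clear();
            return new long[] {flushSeqId, maxInFile};
        }
    }

    /** Replays the sequence from the issue description on the given memstore. */
    static long[] interruptedThenFlushed(Memstore m) {
        for (long s = 1; s <= 10; s++) m.add(s);
        m.prepare();            // snapshot now holds edits 1..10
        // The flush is interrupted here: no commit, and no server abort.
        for (long s = 11; s <= 20; s++) m.add(s);
        m.prepare();            // no-op: the old snapshot is still pending
        return m.commit(20);    // old snapshot written with the current seqId
    }

    public static void main(String[] args) {
        Memstore m = new Memstore();
        long[] file = interruptedThenFlushed(m);
        System.out.println("file seqId=" + file[0] + ", max seqId in file=" + file[1]
            + ", edits still unflushed=" + m.active.size());
    }
}
```

In this model the file ends up stamped with seqId 20 while the newest edit it 
contains is 10, and edits 11..20 are still only in the active memstore. 
Aborting the server at the point of the interruption, as 0.98 and 1.0 are 
believed to do, avoids ever reaching the second prepare().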



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
