[ 
https://issues.apache.org/jira/browse/HBASE-26791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17505292#comment-17505292
 ] 

Josh Elser edited comment on HBASE-26791 at 3/12/22, 4:22 PM:
--------------------------------------------------------------

{quote}isn't the broader issue here the fact RS1 doesn't abort immediately upon 
the loss of its ZK lock? Shouldn't we rather ensure an RS abort is triggered 
and all ongoing operations (including any hstore flushes) are interrupted right 
away?
{quote}
-Yes and no. In normal cases, yeah, we should just be able to interrupt the 
threads and expect them all to exit gracefully. However, when you start to 
consider JVM pauses and the like, it's non-deterministic if we can expect one 
thread in the RS to notice that we lost the RS lock, send an interrupt to all 
other flush/compaction threads, and then those threads to notice and take 
action on that.-

-If we can avoid it another way, there's value in that.-

edit: I really have to get better at making sure I refresh the page before 
commenting :(


> Memstore flush fencing issue for SFT
> ------------------------------------
>
>                 Key: HBASE-26791
>                 URL: https://issues.apache.org/jira/browse/HBASE-26791
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.6.0, 3.0.0-alpha-3
>            Reporter: Szabolcs Bukros
>            Assignee: Duo Zhang
>            Priority: Major
>
> The scenario is the following:
>  # rs1 is flushing file to S3 for region1
>  # rs1 loses ZK lock
>  # region1 gets assigned to rs2
>  # rs2 opens region1
>  # rs1 completes flush and updates sft file for region1
>  # rs2 has a different “version” of the sft file for region1
> The flush should fail at the end, but the SFT file gets overwritten before 
> that, resulting in potential data loss.
>  
> Potential solutions include:
>  * Adding a timestamp to the tracker file names. This, together with creating 
> a new tracker file when an RS opens the region, would allow us to list the 
> available tracker files before an update and compare the timestamps found to 
> the one stored in memory, to verify the store still owns the latest tracker file.
>  * Using the existing timestamp in the tracker file content. This would also 
> require us to create a new tracker file when a new rs opens the region, but 
> instead of listing the available tracker files, we could try to load and 
> de-serialize the last tracker file and compare the timestamp found in it to 
> the one stored in memory.
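
The first proposed solution can be sketched roughly as follows. This is a minimal in-memory illustration only, not HBase's actual store file tracker code: the class names (TrackerFencingSketch, StoreDir, StoreOwner) and the method tryUpdateTracker are hypothetical, and the "directory" is just a sorted set of the timestamps that would be encoded in the tracker file names.

```java
import java.util.Optional;
import java.util.TreeSet;

/** Hypothetical sketch of timestamp fencing for tracker files; not HBase's real implementation. */
public class TrackerFencingSketch {

    /** Stand-in for a store directory: holds only the timestamps encoded in tracker file names. */
    static final class StoreDir {
        private final TreeSet<Long> trackerTimestamps = new TreeSet<>();

        synchronized void createTracker(long ts) {
            trackerTimestamps.add(ts);
        }

        /** "Lists" the tracker files and returns the newest timestamp, if any. */
        synchronized Optional<Long> latestTracker() {
            return trackerTimestamps.isEmpty()
                ? Optional.empty()
                : Optional.of(trackerTimestamps.last());
        }
    }

    /** One region server's in-memory view of a store it has opened. */
    static final class StoreOwner {
        private final StoreDir dir;
        private final long ownedTimestamp;

        StoreOwner(StoreDir dir, long openTimestamp) {
            this.dir = dir;
            this.ownedTimestamp = openTimestamp;
            // Opening the region creates a new tracker file carrying this timestamp.
            dir.createTracker(openTimestamp);
        }

        /** Returns true if the update may proceed, false if a newer tracker file exists. */
        boolean tryUpdateTracker() {
            long latest = dir.latestTracker().orElse(ownedTimestamp);
            return latest == ownedTimestamp;
        }
    }

    public static void main(String[] args) {
        StoreDir region1 = new StoreDir();
        StoreOwner rs1 = new StoreOwner(region1, 100L); // rs1 opens region1
        StoreOwner rs2 = new StoreOwner(region1, 200L); // region1 is reassigned to rs2
        System.out.println("rs1 may update: " + rs1.tryUpdateTracker()); // false
        System.out.println("rs2 may update: " + rs2.tryUpdateTracker()); // true
    }
}
```

In this sketch the late flush on rs1 is rejected because a tracker file with a newer timestamp exists. The check and the subsequent write are not atomic here, so how much of the window this closes on a real file system or object store is left open.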



