[ https://issues.apache.org/jira/browse/HBASE-6758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457296#comment-13457296 ]

Devaraj Das commented on HBASE-6758:
------------------------------------

[~jdcryans] Thanks for looking. Responses below.

bq. My understanding of this patch is that it reduces the race condition but it 
still leaves a small window eg you can take the "fileNotInUse" snapshot, get 
"false", and the moment after that the log could roll. If this is correct, I'm 
not sure it's worth the added complexity.

I don't think there is ever that window. The replication executor thread picks 
up a path that the LogRoller puts in the replicator's queue BEFORE the log roll 
happens (and the HLog constructor puts the first path there before the 
replication executor starts). The replication executor is always trailing, so 
when HLog says a path is not in use (being written to), it is a fact that it 
indeed is not being written to, and any writes that ever happened were in the 
past. Also note that currentPath is reset AFTER the log roll, which delays it 
even further. Roughly, the ordering I am relying on looks like the sketch below.
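
To spell that out, here is an illustrative sketch (class, field and method 
names are made up; this is not the actual HLog/ReplicationSource code): the 
roller hands the new path to the replication queue before the roll completes, 
so once the reader is told a path is no longer current, no writer will ever 
touch that path again.

{code:java}
import java.nio.file.Path;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Illustrative sketch only -- not the actual HLog/ReplicationSource code.
 * The point is the ordering: the new log path is handed to the replication
 * queue BEFORE the roll completes, so the replication executor is always
 * trailing the writer.
 */
class LogRollOrderingSketch {
  // Queue the replication executor drains; holds the paths in the order
  // they became the "current" log.
  private final BlockingQueue<Path> replicationQueue = new LinkedBlockingQueue<>();

  // Reset only AFTER the roll has completed (hence "kind of delayed").
  private volatile Path currentPath;

  /** The constructor hands over the first path before the executor starts. */
  LogRollOrderingSketch(Path firstLog) {
    replicationQueue.add(firstLog);
    currentPath = firstLog;
  }

  /** Called by the log-roller thread. */
  void rollLog(Path newLog) {
    replicationQueue.add(newLog);   // (1) enqueue for replication first
    // (2) ... close the old writer and open the new one here ...
    currentPath = newLog;           // (3) only now does "current" move on
  }

  /**
   * What the executor asks before declaring a file done. Once this returns
   * false for a path, no writer will ever touch that path again, because
   * the roll that made it non-current has already happened.
   */
  boolean fileInUse(Path p) {
    return p.equals(currentPath);
  }
}
{code}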

bq. It seems to me this is a case where we'd need to lock HLog.cacheFlushLock 
for the time we read the log to be 100% sure log rolling doesn't happen. This 
has multiple side effects like delaying flushes and log rolls for a few ms 
while replication is reading the log. It would also require having a way to get 
to the WAL from ReplicationSource.

Yeah, I tried my best to avoid taking that crucial lock!
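Just to show why: holding that lock for the whole read would look roughly like 
the hypothetical sketch below (not code from the patch, and cacheFlushLock is 
stood in by a plain ReentrantLock), and that is exactly what stalls flushes and 
rolls for as long as the read takes.

{code:java}
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical sketch of the rejected approach, not code from the patch. */
class LockWhileReadingSketch {
  // Stand-in for HLog.cacheFlushLock; log rolls and flushes need this lock too.
  private final ReentrantLock cacheFlushLock = new ReentrantLock();

  void replicateCurrentLog() {
    cacheFlushLock.lock();   // blocks log rolls (and flushes) ...
    try {
      readAndShipEntries();  // ... for as long as the read takes
    } finally {
      cacheFlushLock.unlock();
    }
  }

  private void readAndShipEntries() { /* read WAL entries, ship to the peer */ }
}
{code}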

bq. Anyways, one solution I can think of that doesn't involve leaking HRS into 
replication would be giving the log a "second chance". Basically if you get an 
EOF, flip the secondChance bit. If it's on then you don't get rid of that log 
yet. Reset the bit when you loop back to read, now if there was new data added 
you should get it else go to the next log.

I considered some variant of this. However, I gave it up and took a more 
conservative approach - make sure that the replication-executor thread gets at 
least one pass at a CLOSED file. All other solutions seemed incomplete to me 
and prone to races...
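
For reference, the "second chance" variant from the comment above would go 
roughly like this (names are made up, and this is my reading of the idea, not 
code from the attached patch); my worry is that it still depends on the reader 
happening to loop back after the very last write.

{code:java}
/**
 * Illustrative sketch of the "second chance" idea described above, not code
 * from the attached patch. Field and method names are invented.
 */
class SecondChanceSketch {
  private boolean secondChance = false;

  /** Called on EOF; returns true when it is safe to move on to the next log. */
  boolean onEndOfFile(boolean moreLogsQueued) {
    if (!secondChance) {
      // First EOF on this log: don't give it up yet, loop back and re-read
      // in case the writer appends more data before/while closing.
      secondChance = true;
      return false;
    }
    // Second EOF in a row with no new data read in between: move on.
    secondChance = false;
    return moreLogsQueued;
  }

  /** Any successful read resets the bit. */
  void onEntriesRead() {
    secondChance = false;
  }
}
{code}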

[~stack] I forgot to answer one of your previous questions.
bq. Should currentFilePath be an atomic reference so all threads see the 
changes when they happen?

I think volatile suffices for the use case here.
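Since the field is only ever assigned by the roller thread and read by the 
replication executor (no read-modify-write, so no compare-and-swap is needed), 
a volatile gives the visibility guarantee we want. Something like the sketch 
below (illustrative only, names invented):

{code:java}
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicReference;

/** Illustrative only: why volatile is enough for the current-path reference. */
class CurrentPathVisibility {
  // Written by the log-roller thread, read by the replication executor.
  // volatile guarantees readers see the latest assignment; the field is
  // never read-modified-written, so CAS buys us nothing here.
  private volatile Path currentPath;

  void onLogRoll(Path newPath) { currentPath = newPath; }  // single writer
  Path getCurrentPath() { return currentPath; }            // many readers

  // AtomicReference would only matter if we needed an atomic
  // compare-and-set, e.g.:
  private final AtomicReference<Path> ref = new AtomicReference<>();
  boolean swapIfStillOld(Path oldP, Path newP) {
    return ref.compareAndSet(oldP, newP);
  }
}
{code}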
                
> [replication] The replication-executor should make sure the file that it is 
> replicating is closed before declaring success on that file
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-6758
>                 URL: https://issues.apache.org/jira/browse/HBASE-6758
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Devaraj Das
>            Assignee: Devaraj Das
>         Attachments: 6758-1-0.92.patch
>
>
> I have seen cases where the replication-executor would lose data to replicate 
> since the file hasn't been closed yet. Upon closing, the new data becomes 
> visible. Before that happens the ZK node shouldn't be deleted in 
> ReplicationSourceManager.logPositionAndCleanOldLogs. Changes need to be made 
> in ReplicationSource.processEndOfFile as well (currentPath related).
