[ 
https://issues.apache.org/jira/browse/HBASE-14790?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15039718#comment-15039718
 ] 

Duo Zhang commented on HBASE-14790:
-----------------------------------

{quote}
2. dn1 received the WAL entry, and it was read by ReplicationSource and 
replicated to the slave cluster.
3. dn1 and rs both crash; dn2 and dn3 have not received this WAL entry yet, 
and rs has not bumped the GS of this block yet.
4. NameNode completes the file with a length that does not contain this WAL 
entry, since the GS of the blocks on dn2 and dn3 is correct and NameNode does 
not know there used to be a block with a longer length.
{quote}
In a fan-out implementation this problem is obvious, but in a pipelined 
implementation it is not that straightforward, and I used to think I was wrong 
and that this could not happen in a pipelined implementation. Data only 
becomes visible on a datanode after the datanode receives the ack from 
downstream. So if the pipeline is dn1->dn2->dn3, then dn3 is the first 
datanode that makes the data visible to a client, and usually we assume the 
data has also been written to dn1 and dn2. But, perhaps for performance 
reasons, {{BlockReceiver}} sends a packet to the downstream mirror before 
writing it to local disk. So it could happen that dn3 makes the data visible 
and it is read by a client, but dn1 and dn2 crash before writing the data to 
local disk. Then kill the client and dn3, and restart dn1 and dn2, whoops...

I had a discussion with my workmate [~yangzhe1991], and we think that if we 
allow duplicate WAL entries in HBase, then the pipeline recovery part could 
also be moved to a background thread. We could just rewrite the WAL entries 
after the acked point to the new file, which would also reduce the recovery 
latency.
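
A rough sketch of that idea, assuming replay tolerates duplicate entries (all 
class and method names below are hypothetical, not existing HBase APIs):

{code:java}
import java.io.IOException;
import java.util.List;

// Hypothetical WAL abstractions used only for this sketch.
interface WalEntry {}

interface WalWriter {
  void append(WalEntry entry) throws IOException;
  void sync() throws IOException;
  void close() throws IOException;
}

// On a pipeline error we switch to a new WAL file right away and let this task
// re-append everything after the acked point in the background. Entries that
// actually made it into the old file are duplicated, which is why this only
// works if duplicate WAL entries are allowed on replay.
class BackgroundLogRecovery implements Runnable {

  private final WalWriter brokenWriter;   // writer whose pipeline failed
  private final WalWriter newWriter;      // freshly opened WAL file
  private final List<WalEntry> unacked;   // entries appended after the acked point

  BackgroundLogRecovery(WalWriter brokenWriter, WalWriter newWriter, List<WalEntry> unacked) {
    this.brokenWriter = brokenWriter;
    this.newWriter = newWriter;
    this.unacked = unacked;
  }

  @Override
  public void run() {
    try {
      for (WalEntry entry : unacked) {
        newWriter.append(entry); // may duplicate entries already persisted in the old file
      }
      newWriter.sync();
      brokenWriter.close();      // best-effort close of the broken file
    } catch (IOException e) {
      // Give up on this writer and let the normal log-roll path retry.
    }
  }
}
{code}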

And for keeping an "acked length", I think we could make use of the fsync 
method in HDFS. We could call fsync asynchronously to update the length on the 
namenode. The replication source should not read beyond the length obtained 
from the namenode (do not trust the visible length read from a datanode). The 
advantage here is that when a region server crashes, we can still get this 
value from the namenode, and the file will eventually be closed by someone, so 
the length will finally be correct.
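
A minimal sketch of that, using the existing HDFS client API (assuming the WAL 
stream is an {{HdfsDataOutputStream}} and the Hadoop version in use supports 
the {{UPDATE_LENGTH}} sync flag; the surrounding class is hypothetical):

{code:java}
import java.io.IOException;
import java.util.EnumSet;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream;
import org.apache.hadoop.hdfs.client.HdfsDataOutputStream.SyncFlag;

// Hypothetical helper: persist the acked length on the namenode and let the
// replication source read only up to the length the namenode knows about.
class AckedLengthTracker {

  private final FileSystem fs;
  private final Path walPath;
  private final FSDataOutputStream out;

  AckedLengthTracker(FileSystem fs, Path walPath, FSDataOutputStream out) {
    this.fs = fs;
    this.walPath = walPath;
    this.out = out;
  }

  // Call asynchronously (e.g. from a background thread) after appends have been acked.
  void updateAckedLength() throws IOException {
    if (out instanceof HdfsDataOutputStream) {
      // Updates the file length on the namenode, not just on the datanodes.
      ((HdfsDataOutputStream) out).hsync(EnumSet.of(SyncFlag.UPDATE_LENGTH));
    } else {
      out.hsync(); // fallback: flush to datanodes only
    }
  }

  // Unlike the visible length reported by a datanode, this value survives a
  // region server crash and becomes exact once the file is eventually closed.
  long getAckedLength() throws IOException {
    return fs.getFileStatus(walPath).getLen();
  }
}
{code}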

Thanks.

> Implement a new DFSOutputStream for logging WAL only
> ----------------------------------------------------
>
>                 Key: HBASE-14790
>                 URL: https://issues.apache.org/jira/browse/HBASE-14790
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Duo Zhang
>
> The original {{DFSOutputStream}} is very powerful and aims to serve all 
> purposes. But in fact, we do not need most of its features if we only want to 
> log the WAL. For example, we do not need pipeline recovery, since we could 
> just close the old logger and open a new one. We also do not need to write 
> multiple blocks, since we could open a new logger if the old file is too 
> large.
> And the most important thing is that it is hard to handle all the corner 
> cases to avoid data loss or data inconsistency (such as HBASE-14004) when 
> using the original DFSOutputStream, due to its complicated logic. The 
> complicated logic also forces us to use some magical tricks to increase 
> performance. For example, we need to use multiple threads to call {{hflush}} 
> when logging, and now we use 5 threads. But why 5 and not 10 or 100?
> So here, I propose we implement our own {{DFSOutputStream}} for 
> logging the WAL. For correctness, and also for performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
