[ 
https://issues.apache.org/jira/browse/HADOOP-12403?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741083#comment-14741083
 ] 

stack commented on HADOOP-12403:
--------------------------------

bq. The latest HBase WAL write model (HBASE-8755) uses multiple AsyncSyncer 
threads to sync data to HDFS.

It would be preferable if we did not have to do this against the HDFS client. A 
single thread doing syncs back-to-back would be ideal, but experiments showed 
that 5 threads, each running a sync, seem to be optimal (throughput-wise) for 
setting up a syncing pipeline. We need to dig into why 5, and why this is 
needed at all. Just FYI.

> Enable multiple writes in flight for HBase WAL writing backed by WASB
> ---------------------------------------------------------------------
>
>                 Key: HADOOP-12403
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12403
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: azure
>            Reporter: Duo Xu
>            Assignee: Duo Xu
>         Attachments: HADOOP-12403.01.patch, HADOOP-12403.02.patch, 
> HADOOP-12403.03.patch
>
>
> Azure HDI HBase clusters use Azure blob storage as the file system. We found 
> that the bottleneck was in writing to the write-ahead log (WAL). The latest 
> HBase WAL write model (HBASE-8755) uses multiple AsyncSyncer threads to sync 
> data to HDFS. However, our WASB driver is still based on a single-threaded 
> model, so when the sync threads call into the WASB layer, only one thread at 
> a time is allowed to send data to Azure storage. This JIRA introduces a new 
> write model in the WASB layer that allows multiple writes in parallel.
> 1. Since we use a page blob for the WAL, this will cause "holes" in the page 
> blob, because every write starts on a new page. We use the first two bytes of 
> every page to record the actual data size of that page.
> 2. When reading the WAL, we need to know its actual size. This is the sum of 
> the values stored in the first two bytes of every page. However, looping 
> over every page to get the size would be very slow, considering that a 
> normal WAL is 128MB and each page is 512 bytes. So every time a write 
> succeeds, a blob metadata entry called "total_data_uploaded" is updated.
> 3. Although we allow multiple writes in flight, we need to make sure the sync 
> threads that call into the WASB layer return in order. Reading the HBase 
> source file FSHLog.java, we find that every sync request is associated with 
> a transaction id. If the sync succeeds, all the transactions prior to this 
> transaction id are assumed to be in Azure Storage. We use a queue to store 
> the sync requests and make sure they return to the HBase layer in order.
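
To make the page layout in points 1 and 2 concrete, here is a minimal Java sketch. It is not taken from the attached patches. The class and method names are hypothetical, and the big-endian header and the 510-byte payload limit per page are assumptions; the description only states that pages are 512 bytes and that the first two bytes record the data size. The sketch starts each write on a fresh page and compares the slow per-page scan with a running counter that stands in for the "total_data_uploaded" metadata.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class PageFraming {
    static final int PAGE_SIZE = 512;                      // page size from the description
    static final int HEADER_SIZE = 2;                      // first two bytes hold the data size
    static final int MAX_PAYLOAD = PAGE_SIZE - HEADER_SIZE; // assumption: 510 data bytes per page

    /** Splits one write into whole pages, so every write starts on a new page. */
    static List<byte[]> toPages(byte[] data) {
        List<byte[]> pages = new ArrayList<>();
        for (int off = 0; off < data.length; off += MAX_PAYLOAD) {
            int len = Math.min(MAX_PAYLOAD, data.length - off);
            ByteBuffer page = ByteBuffer.allocate(PAGE_SIZE); // zero-filled; big-endian is an assumption
            page.putShort((short) len);                       // header: actual data size of this page
            page.put(data, off, len);                         // the rest of the page stays as a "hole"
            pages.add(page.array());
        }
        return pages;
    }

    /** Reads the data size stored in a page header. */
    static int payloadLength(byte[] page) {
        return ByteBuffer.wrap(page).getShort() & 0xFFFF;
    }

    /** Slow path: sums every page header, the scan that the metadata counter avoids. */
    static long totalDataSize(List<byte[]> pages) {
        long total = 0;
        for (byte[] page : pages) {
            total += payloadLength(page);
        }
        return total;
    }

    public static void main(String[] args) {
        List<byte[]> blob = new ArrayList<>();
        long totalDataUploaded = 0; // stands in for the "total_data_uploaded" blob metadata
        for (int size : new int[] {100, 600}) {
            blob.addAll(toPages(new byte[size]));
            totalDataUploaded += size; // updated after each successful write
        }
        // Prints: 3 pages, 700 data bytes, counter=700
        System.out.println(blob.size() + " pages, " + totalDataSize(blob)
                + " data bytes, counter=" + totalDataUploaded);
    }
}
```

For scale, a 128MB WAL with 512-byte pages has 262,144 pages, which is why reading one metadata value is preferable to scanning every header.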

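The in-order return described in point 3 can be sketched as follows. This is an illustration under assumptions, not the code in the patches: the class OrderedSyncTracker and its methods are hypothetical, and instead of blocking the calling sync threads it simply returns the transaction ids that may be acknowledged. Uploads can finish in any order, but a transaction id is released only once every request queued before it has finished.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

public class OrderedSyncTracker {
    /** One queued sync request, identified by its transaction id. */
    private static final class Pending {
        final long txid;
        boolean done;
        Pending(long txid) { this.txid = txid; }
    }

    private final Queue<Pending> queue = new ArrayDeque<>();

    /** Registers a sync request; callers are assumed to enqueue in txid order. */
    public synchronized void enqueue(long txid) {
        queue.add(new Pending(txid));
    }

    /**
     * Marks the upload for txid as finished and returns the txids that can now
     * be acknowledged, in queue order. The list is empty while an earlier
     * request is still in flight.
     */
    public synchronized List<Long> complete(long txid) {
        for (Pending p : queue) {
            if (p.txid == txid) {
                p.done = true;
                break;
            }
        }
        List<Long> released = new ArrayList<>();
        while (!queue.isEmpty() && queue.peek().done) {
            released.add(queue.poll().txid);
        }
        return released;
    }

    public static void main(String[] args) {
        OrderedSyncTracker tracker = new OrderedSyncTracker();
        for (long id = 1; id <= 3; id++) {
            tracker.enqueue(id);
        }
        System.out.println(tracker.complete(2)); // [] because txid 1 is still in flight
        System.out.println(tracker.complete(1)); // [1, 2]
        System.out.println(tracker.complete(3)); // [3]
    }
}
```

A real driver would more likely park each sync thread until its request reaches the head of the queue; returning the released ids keeps the sketch single-threaded and easy to check.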


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
