[ 
https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13840001#comment-13840001
 ] 

Himanshu Vashishtha commented on HBASE-8755:
--------------------------------------------

This is some awesome stuff happening here. 

I have a few comments, apart from what Stack already mentioned.

1) Log rolling thread safety: log rolling happens in parallel with flush/sync. 
Currently, in FSHLog, the sync call grabs the updateLock to ensure it has a 
non-null writer (because of parallel log rolling). How does this patch ensure 
the writer is non-null? Or is that not needed anymore? Also, if you go for the 
updateLock in sync, that might result in deadlock.
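A minimal, self-contained sketch of the pattern I'm referring to (not the 
actual FSHLog code; names are illustrative): the roller swaps the writer under 
updateLock, so the sync path also reads the writer under that lock to avoid 
dereferencing a null or just-closed writer.
{code}
// Illustrative sketch only: shows why the sync path reads the writer
// under the same lock the log-rolling thread uses to swap it.
public class WriterSwapSketch {
  interface Writer {
    void sync() throws java.io.IOException;
    void close() throws java.io.IOException;
  }

  private final Object updateLock = new Object();
  private Writer writer; // swapped by the log-rolling thread

  // Log-rolling thread: close the old writer and install the new one.
  void rollWriter(Writer newWriter) throws java.io.IOException {
    synchronized (updateLock) {
      if (writer != null) {
        writer.close();
      }
      writer = newWriter;
    }
  }

  // Sync path: snapshot the writer under the lock so we never hit a
  // null writer while a roll is in progress.
  void sync() throws java.io.IOException {
    Writer w;
    synchronized (updateLock) {
      w = writer;
      if (w == null) {
        return; // nothing to sync yet
      }
    }
    w.sync();
  }
}
{code}
The question is essentially how the new writer/syncer threads get this 
guarantee without taking updateLock themselves (which could deadlock with the 
roller).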

2) Error handling: it is not very clear how flush/sync failures are being 
handled. For example, if a write fails for txid 10, it notifies AsyncSyncer 
that the writer is done with txid 10. AsyncSyncer then notifies the notifier 
thread, which finally notifies the blocked handler using notifyAll. The handler 
checks for the failedTxid here:
{code}
+          if (txid <= this.failedTxid.get()) {
{code}
Let's say there are two handlers waiting for sync, t1 on txid 8 and t2 on txid 
10, and t1 wakes up on the notification. Would t1 also get this exception? 
Wouldn't that be wrong, since txid 8 may have succeeded? Please correct me if I 
missed anything.
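Just to illustrate the concern (assumed field names, not the patch's actual 
code): if, besides failedTxid, the handler also knew the highest txid that has 
already synced successfully, a waiter on txid 8 would not be failed just 
because txid 10 blew up.
{code}
// Illustration only: track the highest synced txid alongside the highest
// failed txid, so a waiter can tell "my txid already made it" apart from
// "my txid is covered by the failure".
import java.util.concurrent.atomic.AtomicLong;

class SyncStatusSketch {
  private final AtomicLong syncedTillHere = new AtomicLong(0); // highest txid known durable
  private final AtomicLong failedTxid = new AtomicLong(0);     // highest txid known failed

  void onSyncSuccess(long txid) {
    syncedTillHere.accumulateAndGet(txid, Math::max);
  }

  void onSyncFailure(long txid) {
    failedTxid.accumulateAndGet(txid, Math::max);
  }

  // Called by a handler woken from its wait on `txid`.
  boolean txidFailed(long txid) {
    // In the t1/t2 example: txid 8 may already be <= syncedTillHere,
    // so it should not be reported as failed even though failedTxid is 10.
    return txid > syncedTillHere.get() && txid <= failedTxid.get();
  }
}
{code}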

3) I think the current HLogPE doesn't do justice to the real use case. Almost 
*all* HLog calls are appendNoSync, followed by a sync call.
In the current HLogPE, we are making append calls, which also do the sync. 
When I changed it to represent the above common case, the performance numbers 
of the current FSHLog using HLogPE improve quite a bit. I still need to figure 
out the reason, but the HLogPE change affects the perf numbers considerably IMO.
For example, on a 5 node cluster, I see this difference on trunk:
Earlier: 
{code}
Summary: threads=3, iterations=100000 took 218.590s 1372.432ops/s
{code}
Post HLogPE change:
{code}
Summary: threads=3, iterations=100000 took 172.648s 1737.640ops/s
{code}
I think it would be great if we can test this with the corrected HLogPE. What 
do you think, Fenghua?
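To make the distinction concrete, here is a small sketch of the two benchmark 
shapes (the interface is simplified and assumed; the real HLog 
append/appendNoSync signatures take region, table and WALEdit arguments):
{code}
// Sketch of the two loop shapes HLogPE could exercise.
interface Wal {
  void append(byte[] edit) throws java.io.IOException;       // appends AND syncs
  long appendNoSync(byte[] edit) throws java.io.IOException; // buffers, returns a txid
  void sync(long txid) throws java.io.IOException;           // blocks until txid is durable
}

class HLogPEShapes {
  // What HLogPE measures today: every append carries its own sync.
  static void appendWithSync(Wal wal, byte[][] edits) throws java.io.IOException {
    for (byte[] e : edits) {
      wal.append(e);
    }
  }

  // The common production shape: append without sync, then sync on the txid.
  static void appendNoSyncThenSync(Wal wal, byte[][] edits) throws java.io.IOException {
    for (byte[] e : edits) {
      long txid = wal.appendNoSync(e);
      wal.sync(txid);
    }
  }
}
{code}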

4) Perf numbers are super impressive. It would have been wonderful to have such 
numbers for a smaller number of handler threads as well (e.g., 5-10 threads). 
IMHO, that represents the most common scenario, but I could be wrong. I know 
this has been beaten to death in the discussions above, but I'm just echoing my 
thoughts here.

5) I should also mention that while working on a different use case, I was 
trying to bring a layer of indirection between the regionserver handlers and 
the sync operation (sync is the most costly part of the whole HLog story). What 
resulted is a separate set of syncer threads, which do the work of flushing the 
edits and syncing to HDFS. This is what it looks like:
a) The handlers append their entries to the FSHLog buffer as they do currently. 
b) They invoke the sync API. There, they wait for the syncer threads to do the 
work for them and notify. 
c) It results in a batched sync effort, but without much extra locking or extra 
threads. Basically, it is similar to what you did here, minus the Writer & 
Notifier threads and minus the bufferLock.
I mention it here because with that approach I see some perf improvement even 
with a smaller number of handler threads, and it also keeps the current 
deferredLogFlush behavior. This work is still in progress and is still a 
prototype. It would be great to know your take on it.
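A rough, self-contained sketch of the prototype's flow (assumed names; the real 
code appends WALEdits and calls writer.sync() against HDFS):
{code}
// Rough sketch: handlers append and then block until a syncer thread has
// synced past their txid; each syncer batches whatever accumulated while
// the previous sync was in flight.
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class GroupCommitSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition synced = lock.newCondition();
  private final Condition workAvailable = lock.newCondition();
  private long lastAppendedTxid = 0;
  private long syncedTillHere = 0;

  /** Handler path: append the edit (elided) and get a txid back. */
  long appendNoSync() {
    lock.lock();
    try {
      lastAppendedTxid++;        // edit goes into the WAL buffer here
      workAvailable.signal();    // wake a syncer
      return lastAppendedTxid;
    } finally {
      lock.unlock();
    }
  }

  /** Handler path: block until some syncer has covered our txid. */
  void sync(long txid) throws InterruptedException {
    lock.lock();
    try {
      while (syncedTillHere < txid) {
        synced.await();
      }
    } finally {
      lock.unlock();
    }
  }

  /** Run by each of the (e.g. 3) syncer threads. */
  void syncerLoop() throws InterruptedException {
    while (true) {
      long target;
      lock.lock();
      try {
        while (lastAppendedTxid == syncedTillHere) {
          workAvailable.await();
        }
        target = lastAppendedTxid;   // everything buffered so far
      } finally {
        lock.unlock();
      }

      flushAndSyncToHdfs(target);    // the expensive HDFS flush/sync

      lock.lock();
      try {
        if (target > syncedTillHere) {
          syncedTillHere = target;
        }
        synced.signalAll();          // wake every waiting handler <= target
      } finally {
        lock.unlock();
      }
    }
  }

  private void flushAndSyncToHdfs(long upToTxid) {
    // elided: writer.append(buffered edits up to upToTxid) + writer.sync()
  }
}
{code}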

Here are some numbers on a 5 node cluster; hdfs 2.1.0; hbase is trunk; client 
on a different node. I haven't tested it with a larger number of threads, but 
it would be good to compare, I think (it's 2am here...). It uses 3 syncers at 
the moment (and varying that would be a good thing to experiment with too). 
Also, "without patch" in the table below means trunk + the HLogPE patch.

||Threads||w/o patch time||w/o patch ops||w/ patch time||w/ patch ops||
|3|172.648s|1737.640ops/s|170.332s|1761.266ops/s|
|3|170.977s|1754.622ops/s|174.568s|1718.528ops/s|
|5|213.738s|2339.313ops/s|191.119s|2616.171ops/s|
|5|211.072s|2368.860ops/s|189.671s|2636.144ops/s|
|10|254.641s|3926.419ops/s|216.494s|4619.319ops/s|
|10|251.503s|3978.564ops/s|215.333s|4643.266ops/s|
|10|251.692s|3970.579ops/s|217.151s|4605.854ops/s|
|20|648.943s|6163.870ops/s|646.279s|6189.277ops/s|
|20|658.654s|6072.991ops/s|656.277s|6094.987ops/s|
|25|282.654s|8861.991ops/s|249.277s|10033.987ops/s|


> A new write thread model for HLog to improve the overall HBase write 
> throughput
> -------------------------------------------------------------------------------
>
>                 Key: HBASE-8755
>                 URL: https://issues.apache.org/jira/browse/HBASE-8755
>             Project: HBase
>          Issue Type: Improvement
>          Components: Performance, wal
>            Reporter: Feng Honghua
>            Assignee: stack
>            Priority: Critical
>         Attachments: 8755trunkV2.txt, HBASE-8755-0.94-V0.patch, 
> HBASE-8755-0.94-V1.patch, HBASE-8755-0.96-v0.patch, 
> HBASE-8755-trunk-V0.patch, HBASE-8755-trunk-V1.patch, 
> HBASE-8755-trunk-v4.patch
>
>
> In the current write model, each write handler thread (executing put()) 
> individually goes through a full 'append (hlog local buffer) => HLog writer 
> append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, 
> which incurs heavy contention on updateLock and flushLock.
> The only existing optimization, checking whether the current syncTillHere > txid 
> in the hope that another thread has already written/synced its txid to hdfs so 
> the write/sync can be skipped, actually helps much less than expected.
> Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi 
> proposed a new write thread model for writing hdfs sequence files, and the 
> prototype implementation shows a 4X improvement in throughput (from 17000 to 
> 70000+). 
> I applied this new write thread model in HLog, and the performance test in our 
> test cluster shows about a 3X throughput improvement (from 12150 to 31520 for 1 
> RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput (1K row-size) 
> even beats that of BigTable (the Percolator paper published in 2011 says 
> Bigtable's write throughput then was 31002). I can provide the detailed 
> performance test results if anyone is interested.
> The change for the new write thread model is as below:
>  1> All put handler threads append their edits to HLog's local pending buffer 
> (and notify the AsyncWriter thread that there are new edits in the local buffer);
>  2> All put handler threads wait in the HLog.syncer() function for the 
> underlying threads to finish the sync that covers their txid;
>  3> A single AsyncWriter thread is responsible for retrieving all the buffered 
> edits from HLog's local pending buffer and writing them to hdfs 
> (hlog.writer.append); it notifies the AsyncFlusher thread that there are new 
> writes to hdfs that need a sync;
>  4> A single AsyncFlusher thread is responsible for issuing a sync to hdfs 
> to persist the writes made by AsyncWriter; it notifies the AsyncNotifier 
> thread that the sync watermark has increased;
>  5> A single AsyncNotifier thread is responsible for notifying all pending 
> put handler threads that are waiting in the HLog.syncer() function;
>  6> There is no LogSyncer thread any more (since the AsyncWriter/AsyncFlusher 
> threads always do the same job it did).



--
This message was sent by Atlassian JIRA
(v6.1#6144)
