[ https://issues.apache.org/jira/browse/HBASE-8755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13839662#comment-13839662 ]
stack commented on HBASE-8755:
------------------------------

Here is more review on the patch. Make the changes suggested below and I'll +1 it. (Discussion off-line w/ Feng on this issue helped me better understand this patch and put to rest any notion that there is an easier 'fix' than the one proposed here. That said, there is much room for improvement, but this can be done in a follow-on.)

Remove these asserts rather than comment them out, given they depend on a facility this patch removes. Leaving them in will only make the next reader of the code -- very likely lacking the context you have -- feel uneasy, thinking someone removed asserts just to get tests to pass.

{code}
- assertTrue("Should have an outstanding WAL edit", ((FSHLog) log).hasDeferredEntries());
+ //assertTrue("Should have an outstanding WAL edit", ((FSHLog) log).hasDeferredEntries());
{code}

On the below...

{code}
+import java.util.Random;
{code}

... using a Random for choosing an arbitrary thread from a list of 4 is heavyweight. Can you not take the last digit of the timestamp or nano timestamp, or some attribute of the edit, instead? Something more lightweight?

Please remove all mentions of AsyncFlush since it no longer exists:

{code}
// all writes pending on AsyncWrite/AsyncFlush thread with
{code}

Leaving it in will confuse readers when they can't find any such thread class.

Is this comment right?

{code}
// txid <= failedTxid will fail by throwing asyncIOE
{code}

Should it be >= failedTxid?

This should be volatile since it is set by AsyncSyncer and then used by the main FSHLog thread (you have an assert to check it is not null -- maybe you ran into an issue here already?):

{code}
+ private IOException asyncIOE = null;
{code}

bq. + private final Object bufferLock = new Object();

'bufferLock' is a very generic name. Could it be more descriptive? It is a lock held for a short while while AsyncWriter moves queued edits off the globally seen queue to a local queue, just before we send the edits to the WAL. You add a method named getPendingWrites that requires this lock be held.
Could we tie the method and the lock together better? Name it pendingWritesLock? (The name of the list that holds the pending writes is pendingWrites.)

bq. ...because the HDFS write-method is pretty heavyweight as far as locking is concerned.

I think the heavyweight referred to in the above is hbase locking, not hdfs locking as the comment would imply. If you agree (you know this code better than I), please adjust the comment.

Comments on what these threads do will help the next code reader. AsyncWriter does the adding of edits to HDFS. AsyncSyncer needs a comment because it is oxymoronic (though it makes sense in this context). In particular, a comment would draw out why we need so many instances of a syncer thread, because everyone's first thought here is going to be: why do we need this? Ditto on the AsyncNotifier. In the reviews above, folks have asked why we need this thread at all, and a code reader will likely think similarly on a first pass. Bottom line, your patch raised questions from reviewers; it would be cool if the questions were answered in code comments where possible so the questions do not come up again.

{code}
+ private final AsyncWriter asyncWriter;
+ private final AsyncSyncer[] asyncSyncers = new AsyncSyncer[5];
+ private final AsyncNotifier asyncNotifier;
{code}

You remove the LogSyncer facility in this patch. That is good (need to note this in release notes). Your patch should remove the optional flush config from hbase-default.xml too since it is no longer relevant.

{code}
- this.optionalFlushInterval =
-     conf.getLong("hbase.regionserver.optionallogflushinterval", 1 * 1000);
{code}

I see it here...

{code}
hbase-common/src/main/resources/hbase-default.xml: <name>hbase.regionserver.optionallogflushinterval</name>
{code}

A small nit is you might look at other threads in hbase and see how they are named...
{code}
+ asyncWriter = new AsyncWriter("AsyncHLogWriter");
{code}

Ditto here:

{code}
+ asyncSyncers[i] = new AsyncSyncer("AsyncHLogSyncer" + i);
{code}

Probably make the number of asyncSyncers a configuration (you don't have to put the option out in hbase-default.xml... just make it so that if someone is reading the code and trips over this issue, they can change it by adding to hbase-site.xml w/o having to change code -- let's not reproduce the hard-coded '80' that is in the head of dfsclient we discussed yesterday -- smile).

... and here:

{code}
asyncNotifier = new AsyncNotifier("AsyncHLogNotifier");
{code}

Not important, but check out how other threads are named in hbase. It might be good if these better align.

Maybe make a method for shutting down all these threads, or use the Threads#shutdown method in Threads.java?

bq. LOG.error("Exception while waiting for AsyncNotifier threads to die", e);

Do LOG.error("Exception while waiting for " + t.getName() + " thread to die", e); instead?

Call this method per AsyncSyncer rather than do as you have here:

{code}
+ try {
+   for (int i = 0; i < asyncSyncers.length; ++i) {
+     asyncSyncers[i].interrupt();
+     asyncSyncers[i].join();
+   }
+ } catch (InterruptedException e) {
+   LOG.error("Exception while waiting for AsyncSyncer threads to die", e);
+ }
{code}

where if we fail to join on the first AsyncSyncer, we will not shut down the rest. This is a nit. Not important since we are closing down anyway, and if an exception happens here, it is probably fatal. These threads are NOT daemon threads though, so a failure to shut them down will hold up the JVM's going down. Maybe they should all be daemon threads just in case?

I see this in close:

{code}
+ synchronized (bufferLock) {
+   doWrite(info, logKey, edits, htd);
+   txid = this.unflushedEntries.incrementAndGet();
+ }
{code}

... but we do not seem to be doing it on the other call to doWrite at around line #969 inside append.
Should we be holding the lock at this location (maybe we do and I am just not seeing it because I am just looking at FSHLog and the patch side-by-side)?

What is going on here?

{code}
+ // wake up (called by (write) handler thread) AsyncWriter thread
+ // to write buffered writes to HDFS
+ public void setPendingTxid(long txid) {
{code}

This method is only called at close time if I read the patch right. It is not called by '...(write) handler thread'? I am having trouble understanding how this signaling works.

Is this 'fatal'? Or is it an 'error' since we are going into recovery... trying to roll the log?

{code}
+ LOG.fatal("Error while AsyncWriter write, request close of hlog ", e);
{code}

Here we failed a write:

{code}
+ } catch (IOException e) {
+   LOG.fatal("Error while AsyncWriter write, request close of hlog ", e);
    requestLogRoll();
-   Threads.sleep(this.optionalFlushInterval);
+
+   asyncIOE = e;
+   failedTxid.set(this.txidToWrite);
  }
{code}

... and request a log roll, yet we carry on to try and sync, an op that will likely fail? Are we ok here? We updated the write txid but not the sync txid, so that should be fine.

Do we need this: if (!asyncSyncers[i].isSyncing())? DFSClient will allow us to call sync concurrently?

Yeah... could use the last digit of the txid rather than random? That should be enough?

{code}
+ int idx = rd.nextInt(asyncSyncers.length);
+ asyncSyncers[idx].setWrittenTxid(this.lastWrittenTxid);
{code}

In the above, what if the writtenTxid is > than what we are to sync? Will that cause a problem? What will the sync do in this case? (nvm... I asked this question already. It is answered later in the code. The sync threads are running all the time. I thought they fired once and completed. My misunderstanding. Ignore the above.)

Can these be static classes or do they need context from the hosting FSHLog?
{code}
private class AsyncSyncer extends HasThread {
{code}

This class definitely needs more comment, either here or above where it is set up as a data member:

{code}
+ // thread to request HDFS to sync the WALEdits written by AsyncWriter
+ // to make those WALEdits durable on HDFS side
{code}

These method names should not talk about 'flush'. They should be named 'sync' instead:

{code}
+ private long txidToFlush = 0;
+ private long lastFlushedTxid = 0;
{code}

Same for the flushLock.

Why AtomicBoolean and not just a volatile here?

{code}
private AtomicBoolean isSyncing = new AtomicBoolean(false);
{code}

You are not using any feature of Atomic other than get and set, so volatile should work.

bq. + LOG.info(getName() + " exiting");

The above is very important. All your threads do this? Nothing worse than a silently exiting thread (smile).

Here is that comment again:

{code}
+ // appends new writes to the pendingWrites. It is better to keep it in
+ // our own queue rather than writing it to the HDFS output stream because
+ // HDFSOutputStream.writeChunk is not lightweight at all.
+ // it's caller's responsibility to hold updateLock before call this method
{code}

It talks about writeChunk being expensive, but we are not doing anything to ameliorate dfsclient writes if I dig down into our log writer... we just write as we go. So either the comment is correct and we are ignoring it in implementation, or the comment is wrong and it should be removed. This is out of scope of your patch; you are just moving this comment, I'd say. If so, update the comment to add a TODO saying it needs investigation.

Patch is great. Let's get it in. The above remarks are mostly minor, just looking for clarification.
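On the point above about replacing java.util.Random when picking an AsyncSyncer: a minimal sketch of selecting by txid modulo the pool size, as the review suggests. The method name pickSyncer is illustrative, not from the patch:

```java
public class SyncerPick {
    // Pick a syncer from the pool by txid rather than java.util.Random:
    // no Random object to allocate or contend on, and consecutive txids
    // still rotate evenly across all syncers.
    static int pickSyncer(long txid, int numSyncers) {
        return (int) (txid % numSyncers);
    }

    public static void main(String[] args) {
        int numSyncers = 5;  // matches the hard-coded pool size in the patch
        for (long txid = 100; txid < 105; txid++) {
            System.out.println("txid " + txid + " -> syncer " + pickSyncer(txid, numSyncers));
        }
    }
}
```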
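On the shutdown nit: a sketch of interrupting and joining each syncer inside its own try/catch, so a failed join on one thread does not skip the rest. Plain Thread stands in for HasThread here, and System.err for LOG; the dummy worker bodies are purely illustrative:

```java
public class SyncerShutdownSketch {
    // Interrupt and join each syncer independently: if join() on one thread
    // is interrupted, we log it and still attempt to stop the others.
    static void shutdownAll(Thread[] syncers) {
        for (Thread syncer : syncers) {
            syncer.interrupt();
            try {
                syncer.join();
            } catch (InterruptedException e) {
                System.err.println("Exception while waiting for " + syncer.getName()
                        + " thread to die: " + e);
            }
        }
    }

    public static void main(String[] args) {
        Thread[] syncers = new Thread[3];
        for (int i = 0; i < syncers.length; i++) {
            syncers[i] = new Thread(() -> {
                try {
                    Thread.sleep(60_000);  // stands in for a syncer blocked waiting for work
                } catch (InterruptedException e) {
                    // interrupt is our shutdown signal; fall through and exit
                }
                // log on exit -- per the review, never exit a thread silently
                System.out.println(Thread.currentThread().getName() + " exiting");
            }, "AsyncHLogSyncer" + i);
            syncers[i].start();
        }
        shutdownAll(syncers);
    }
}
```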
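On making the syncer count configurable: a sketch of the lookup, using plain java.util.Properties in place of HBase's Configuration so it runs standalone. The key name hbase.regionserver.hlog.asyncer.number is an assumption for illustration; the point is just a site-overridable value whose code default stays at 5, with no entry added to hbase-default.xml:

```java
import java.util.Properties;

public class SyncerCountConfig {
    // Assumed key name -- documented only here, not shipped in hbase-default.xml.
    static final String KEY = "hbase.regionserver.hlog.asyncer.number";

    // Default stays at 5 (the current hard-coded pool size); an operator who
    // trips over this can override it in hbase-site.xml without a code change.
    static int getSyncerCount(Properties conf) {
        String v = conf.getProperty(KEY);
        return v == null ? 5 : Integer.parseInt(v);
    }

    public static void main(String[] args) {
        System.out.println(getSyncerCount(new Properties()));  // no override -> default

        Properties site = new Properties();
        site.setProperty(KEY, "8");                            // hbase-site.xml override
        System.out.println(getSyncerCount(site));
    }
}
```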
Thanks [~fenghh]

> A new write thread model for HLog to improve the overall HBase write throughput
> -------------------------------------------------------------------------------
>
>          Key: HBASE-8755
>          URL: https://issues.apache.org/jira/browse/HBASE-8755
>      Project: HBase
>   Issue Type: Improvement
>   Components: Performance, wal
>     Reporter: Feng Honghua
>     Assignee: stack
>     Priority: Critical
>  Attachments: 8755trunkV2.txt, HBASE-8755-0.94-V0.patch, HBASE-8755-0.94-V1.patch, HBASE-8755-0.96-v0.patch, HBASE-8755-trunk-V0.patch, HBASE-8755-trunk-V1.patch, HBASE-8755-trunk-v4.patch
>
>
> In the current write model, each write handler thread (executing put()) individually goes through a full 'append (hlog local buffer) => HLog writer append (write to hdfs) => HLog writer sync (sync hdfs)' cycle for each write, which incurs heavy contention on updateLock and flushLock.
> The only existing optimization -- checking whether the current syncTillHere > txid, in the expectation that another thread has already written/synced our txid to hdfs so the write/sync can be omitted -- actually helps much less than expected.
> Three of my colleagues (Ye Hangjun / Wu Zesheng / Zhang Peng) at Xiaomi proposed a new write thread model for writing hdfs sequence files, and the prototype implementation shows a 4X improvement in throughput (from 17000 to 70000+).
> I applied this new write thread model to HLog, and the performance test in our test cluster shows about 3X throughput improvement (from 12150 to 31520 for 1 RS, from 22000 to 70000 for 5 RS); the 1 RS write throughput (1K row-size) even beats that of BigTable (the Percolator paper published in 2011 says Bigtable's write throughput then was 31002). I can provide the detailed performance test results if anyone is interested.
> The change for the new write thread model is as below:
> 1> All put handler threads append their edits to HLog's local pending buffer (notifying the AsyncWriter thread that there are new edits in the local buffer);
> 2> All put handler threads wait in the HLog.syncer() function for the underlying threads to finish the sync that contains their txid;
> 3> A single AsyncWriter thread is responsible for retrieving all the buffered edits in HLog's local pending buffer and writing them to hdfs (hlog.writer.append), notifying the AsyncFlusher thread that there are new writes to hdfs that need a sync;
> 4> A single AsyncFlusher thread is responsible for issuing a sync to hdfs to persist the writes made by AsyncWriter, notifying the AsyncNotifier thread that the sync watermark has increased;
> 5> A single AsyncNotifier thread is responsible for notifying all pending put handler threads that are waiting in the HLog.syncer() function;
> 6> No LogSyncer thread any more (since the AsyncWriter/AsyncFlusher threads are always doing the job it did).

--
This message was sent by Atlassian JIRA
(v6.1#6144)
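The six-step model quoted above can be sketched, very roughly, as a single-writer/single-syncer wait/notify pipeline. This is a toy illustration, not the patch's code: the AsyncFlusher in the description corresponds to the patch's AsyncSyncer, the patch runs a pool of syncers rather than one, and all names below are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

// Toy pipeline: handlers append to a pending buffer and wait; a writer
// thread drains the buffer; a syncer thread advances the durable watermark
// and wakes the waiting handlers.
public class WalPipelineSketch {
    private final Object lock = new Object();
    private final List<String> pendingWrites = new ArrayList<>();
    private long unflushedTxid = 0;  // step 1: bumped by handlers on append
    private long writtenTxid = 0;    // step 3: advanced by the writer thread
    private long syncedTxid = 0;     // step 4: advanced by the syncer thread
    private boolean closed = false;

    // Step 1: handler buffers the edit locally and wakes the writer.
    long append(String edit) {
        synchronized (lock) {
            pendingWrites.add(edit);
            long txid = ++unflushedTxid;
            lock.notifyAll();
            return txid;
        }
    }

    // Steps 2 and 5: handler blocks until its txid is durable.
    void syncer(long txid) throws InterruptedException {
        synchronized (lock) {
            while (syncedTxid < txid) lock.wait();
        }
    }

    // Step 3: drain all buffered edits and "write" them in one batch.
    // (The real patch holds the lock only to move edits to a local list,
    // doing the actual HDFS write outside it.)
    void writerLoop() throws InterruptedException {
        while (true) {
            synchronized (lock) {
                while (pendingWrites.isEmpty() && !closed) lock.wait();
                if (closed) return;
                pendingWrites.clear();       // hlog.writer.append() would go here
                writtenTxid = unflushedTxid;
                lock.notifyAll();            // wake the syncer
            }
        }
    }

    // Step 4: "sync" everything written so far, then notify waiters.
    void syncerLoop() throws InterruptedException {
        while (true) {
            synchronized (lock) {
                while (syncedTxid >= writtenTxid && !closed) lock.wait();
                if (closed) return;
                syncedTxid = writtenTxid;    // writer.sync() would go here
                lock.notifyAll();            // wake handlers waiting in syncer()
            }
        }
    }

    void close() {
        synchronized (lock) { closed = true; lock.notifyAll(); }
    }

    public static void main(String[] args) throws Exception {
        WalPipelineSketch wal = new WalPipelineSketch();
        Thread writer = new Thread(() -> { try { wal.writerLoop(); } catch (InterruptedException ignored) {} });
        Thread syncer = new Thread(() -> { try { wal.syncerLoop(); } catch (InterruptedException ignored) {} });
        writer.start();
        syncer.start();
        long txid = wal.append("edit-1");
        wal.syncer(txid);  // blocks until the pipeline has "synced" edit-1
        System.out.println("synced up to txid " + txid);
        wal.close();
        writer.join();
        syncer.join();
    }
}
```

The sketch shows why the model reduces contention: handlers only touch the in-memory buffer under the lock, while the expensive append and sync run on dedicated threads that batch work for many txids at once.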