[jira] [Updated] (HBASE-9775) Client write path perf issues
[ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-9775:

Attachment: 9775.rig.v3.patch

v3 drops Mockito. Mockito keeps a reference to every invocation so that it can keep running counts, and I could not figure out how to disable this facility. The patch is better without it anyway. v3 no longer has heap issues, at least at the current 'scales'.

The patch as it stands is configured to do the inverse of the previous patch: now I have a single 'server' and ten clients beating up on it. It doesn't take long for the clients to 'overrun' the server. The server cannot respond in time, so we just keep throwing RegionBusyException more and more frequently, which I think simulates what E was seeing on the 'big' cluster.

Will dig in tomorrow on what we can do on RBE, i.e. how to back off better (Elliott had ideas in here).

Client write path perf issues

Key: HBASE-9775
URL: https://issues.apache.org/jira/browse/HBASE-9775
Project: HBase
Issue Type: Bug
Components: Client
Affects Versions: 0.96.0
Reporter: Elliott Clark
Priority: Critical
Attachments: 9775.rig.txt, 9775.rig.v2.patch, 9775.rig.v3.patch, Charts Search Cloudera Manager - ITBLL.png, Charts Search Cloudera Manager.png, hbase-9775.patch, job_run.log, short_ycsb.png, ycsb.png, ycsb_insert_94_vs_96.png

Testing on larger clusters has not had the desired throughput increases.

--
This message was sent by Atlassian JIRA (v6.1#6144)
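On the open question in the message above of how a client might back off when it keeps hitting RegionBusyException, here is a minimal sketch of one common option: capped exponential backoff with jitter. It is not taken from any of the attached patches and is not HBase code. The class name `BusyBackoff`, its methods, and the 100 ms / 5,000 ms values are all assumptions chosen for illustration.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical sketch, not HBase code: capped exponential backoff with jitter
// for a client that keeps being rejected by a busy server.
class BusyBackoff {
    // Upper bound of the pause before retry number `attempt` (0-based):
    // baseMs * 2^attempt, capped at maxMs.
    static long ceilingMs(long baseMs, long maxMs, int attempt) {
        int shift = Math.min(attempt, 20); // keep the shift small to avoid overflow
        return Math.min(maxMs, baseMs << shift);
    }

    // Actual pause: a random value in [ceiling/2, ceiling], so that many
    // clients rejected at the same moment do not all retry together.
    static long pauseMs(long baseMs, long maxMs, int attempt) {
        long ceiling = ceilingMs(baseMs, maxMs, attempt);
        long half = ceiling / 2;
        return half + ThreadLocalRandom.current().nextLong(ceiling - half + 1);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 8; attempt++) {
            System.out.println("attempt " + attempt
                + ": ceiling " + ceilingMs(100, 5_000, attempt) + " ms"
                + ", picked " + pauseMs(100, 5_000, attempt) + " ms");
        }
    }
}
```

The jitter matters in the ten-clients-one-server setup described above: without it, clients rejected together would tend to come back together and overrun the server again.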
[jira] [Updated] (HBASE-9775) Client write path perf issues
[ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-9775:

Attachment: 9775.rig.v2.patch

Rebased for 0.96. To use it, just run the main method of TestClientNoCluster.

After updating, there is no noticeable difference: we run up to 100 threads and stay there, with nearly all of them in wait mode.
[jira] [Updated] (HBASE-9775) Client write path perf issues
[ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

stack updated HBASE-9775:

Attachment: 9775.rig.txt

I'm trying to write a rig that the client can run in so we can inspect it. Attached is a bit of code that mocks a cluster of 1k servers and 100k regions. Currently it runs without throwing exceptions or failing.

When I put it under the profiler, we spin up 7 threads and that seems to keep us running nicely; we never go beyond 7. If I add some friction by adding a pause to the mock Put handler so that it takes time to process the puts, the thread count spins up and tops out at 100, which looks like it is AsyncProcess#maxTotalConcurrentTasks, whose config is hbase.client.max.total.tasks.

I suppose I should randomize the order in which I put (it is sort of ordered at the moment), but even then it looks like I'd be hitting only 1/10th of the servers at a time.

Let me update and see what [~liochon]'s recent changes do.
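The "1/10th of the servers at a time" estimate above follows from a client-wide cap of 100 in-flight tasks against 1k mock servers. A rough sketch of that arithmetic, using a plain `java.util.concurrent.Semaphore` as a stand-in for the cap, is below. This is not the AsyncProcess implementation; `TaskCap` and its methods are hypothetical names, and the model assumes one task per server with none finishing in the meantime.

```java
import java.util.concurrent.Semaphore;

// Hypothetical stand-in for a client-wide cap on in-flight tasks; this is
// not the AsyncProcess implementation.
class TaskCap {
    private final Semaphore permits;

    TaskCap(int maxTotalTasks) {
        this.permits = new Semaphore(maxTotalTasks);
    }

    // Try to start one task; false means the cap is reached and the caller must wait.
    boolean tryStart() {
        return permits.tryAcquire();
    }

    // Mark one task as finished, freeing a slot.
    void finish() {
        permits.release();
    }

    // Count how many of `servers` can have a task in flight at once if each
    // server gets one task and none has finished yet.
    static int serversServedAtOnce(int servers, int maxTotalTasks) {
        TaskCap cap = new TaskCap(maxTotalTasks);
        int started = 0;
        for (int s = 0; s < servers; s++) {
            if (cap.tryStart()) {
                started++;
            }
        }
        return started;
    }

    public static void main(String[] args) {
        // 1,000 mock servers with a cap of 100, as observed in the rig.
        System.out.println(serversServedAtOnce(1_000, 100)); // prints 100
    }
}
```

With slow handlers, every slot stays occupied, so the remaining 900 servers sit idle from this client's point of view until a task finishes.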
[jira] [Updated] (HBASE-9775) Client write path perf issues
[ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elliott Clark updated HBASE-9775:

Attachment: ycsb.png

{quote}
I observed better write performances on the 0.96 than 0.94, by about 20% when inserting 100m of rows from an empty cluster. There are around 18 regions at this stage IIRC, so the cluster size should not matter that much when we start from an empty table. I've inserted around 1b w/o issue on 0.96.
{quote}

Our performance team independently ran some ycsb tests vs HBase 0.94.6. Here's the graph that they generated.

Blue is 0.94. Orange is 0.96.
The X axis is target throughput.
The Y axis is actual throughput.
[jira] [Updated] (HBASE-9775) Client write path perf issues
[ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeffrey Zhong updated HBASE-9775:

Attachment: hbase-9775.patch

I think I found a bug in AsyncProcess that hurts performance. Below is the code snippet:

{code}
incTaskCounters(multiAction.getRegions(), loc.getServerName());
Runnable runnable = Trace.wrap(AsyncProcess.sendMultiAction, new Runnable() {
    receiveMultiAction(initialActions, multiAction, loc, res, numAttempt, errorsByServer);
  } finally {
    decTaskCounters(multiAction.getRegions(), loc.getServerName());
  }
{code}

receiveMultiAction resubmits failed edits recursively. As a result, when an error happens we bump the task counters twice: the counters for the original task are only decremented in the finally block, so they are still held while the retry is resubmitted. The overlap lasts for a retry interval, which is a long time for client operations.

I attached a patch for your reference.
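The double counting described above can be reproduced with a toy model. The sketch below is not the AsyncProcess source and does not claim to show what the attached patch does. `TaskCounterDemo`, `sendNested` and `sendReleasedFirst` are hypothetical names, and a single counter stands in for the per-region and per-server counters. It only shows that resubmitting from inside the `try` block makes the in-flight count peak at 2 for what is logically one task.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical model of the pattern described above; this is not the
// AsyncProcess source.
class TaskCounterDemo {
    final AtomicInteger tasks = new AtomicInteger(); // in-flight task counter
    int peak = 0;                                    // highest value observed

    // Buggy shape: the retry is submitted from inside try, so the original
    // task still holds its count while the retry takes another one.
    void sendNested(int failuresLeft) {
        tasks.incrementAndGet();
        peak = Math.max(peak, tasks.get());
        try {
            if (failuresLeft > 0) {
                sendNested(failuresLeft - 1); // resubmit while still counted
            }
        } finally {
            tasks.decrementAndGet();
        }
    }

    // One possible alternative shape: release the count first, then resubmit.
    void sendReleasedFirst(int failuresLeft) {
        tasks.incrementAndGet();
        peak = Math.max(peak, tasks.get());
        tasks.decrementAndGet();
        if (failuresLeft > 0) {
            sendReleasedFirst(failuresLeft - 1);
        }
    }

    public static void main(String[] args) {
        TaskCounterDemo nested = new TaskCounterDemo();
        nested.sendNested(1);
        System.out.println("nested peak = " + nested.peak); // prints 2

        TaskCounterDemo released = new TaskCounterDemo();
        released.sendReleasedFirst(1);
        System.out.println("released-first peak = " + released.peak); // prints 1
    }
}
```

When the counter is also what enforces a cap on concurrent tasks, each failing task occupying two slots for a retry interval reduces the useful concurrency exactly when the cluster is already struggling.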
[jira] [Updated] (HBASE-9775) Client write path perf issues
[ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elliott Clark updated HBASE-9775:

Attachment: ycsb_insert_94_vs_96.png

I ran a 0.94 vs 0.96 comparison. Here are the results. You can see that 0.94 handily beats 0.96 until compaction becomes the limiting factor.
[jira] [Updated] (HBASE-9775) Client write path perf issues
[ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elliott Clark updated HBASE-9775:

Attachment: Charts Search Cloudera Manager - ITBLL.png

Here's what the network looked like at the time.
[jira] [Updated] (HBASE-9775) Client write path perf issues
[ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elliott Clark updated HBASE-9775:

Attachment: job_run.log

Here are the logs.

The RSs are running G1GC, so there should be no issues with GC pausing. These are the pause times I'm seeing:

2013-10-16T10:19:23.182-0700: [GC pause (young), 0.10152600 secs]

All of the boxes are on 10 gig.

I ran:
{code}
hbase org.apache.hadoop.hbase.test.IntegrationTestBigLinkedList --monkey calm Loop 2 154 2500 IntegrationTestBigLinkedList 77 job_run.log 21
{code}

So there should be 2 clients per region server.
[jira] [Updated] (HBASE-9775) Client write path perf issues
[ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elliott Clark updated HBASE-9775:

Summary: Client write path perf issues (was: Client write path scales very badly with more servers)