[ https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797182#comment-13797182 ]
Nicolas Liochon commented on HBASE-9775:
----------------------------------------

Thanks, Elliott.

bq. So there should be 2 clients per region server.

That's something that would work fine with 0.94 out of the box, right? Is there anything on the server that could explain the server timeout (SocketTimeoutException)?

With 150 clients, and each client able to send 2 queries per server, a server can receive 300 queries simultaneously. On average it should be less: a client can have only 100 tasks in flight, so it will be about 200 (but that's an average: an unlucky server can still receive all 300 requests at once). The limit on threads doesn't apply here: there should be fewer than 250 threads per client.

Here are the differences I see between 0.94 and 0.96 that could be related. I may be wrong; I'm not sure about all the backports.
- With the settings above, a 0.94 server would have received at most 150 queries (1 per client), instead of up to 300 now (about 200 on average).
- The server rejects the client when it's busy (HBASE-9467). That increases the number of retries to do and, under heavy load, can make us fail on something that would have worked before.
- We're much more aggressive on the time before retrying (100ms vs. 1000ms), and the backoff schedule is different. It was { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64 }; it's now { 1, 2, 3, 5, 10, 100 }. The number of retries was 10; it's now 31. But we increase the server load by retrying more aggressively: for example, the new settings make the client send 4 queries within 1 second when they fail. If the servers can handle the load, that's great; with 150 clients doing this, maybe not.
- We now stop after ~5 minutes (calculated from the number of retries and the backoff times), regardless of the number of retries actually made. I'm not sure that's the point here (I would need the debug logs to know), but I've seen it in these tests on other clusters (we were not doing all the retries).

Is there anything that I forgot?
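As a sanity check on the "~5 minutes" figure, here is a small standalone sketch (not actual HBase client code; it just uses the pause, backoff table, and retry count quoted above, and assumes the last multiplier is reused once the table is exhausted) that sums the cumulative sleep time over 31 retries:

```java
public class RetryBudget {
    public static void main(String[] args) {
        long pauseMs = 100;                       // 0.96 pause (hbase.client.pause)
        int[] backoff = { 1, 2, 3, 5, 10, 100 }; // 0.96 multiplier table quoted above
        int retries = 31;                         // 0.96 retry count quoted above

        long totalMs = 0;
        for (int i = 0; i < retries; i++) {
            // past the end of the table, keep using the last multiplier
            int mult = backoff[Math.min(i, backoff.length - 1)];
            totalMs += pauseMs * mult;
        }
        System.out.println(totalMs + " ms total back-off");
    }
}
```

This yields 262100 ms, i.e. roughly 4.4 minutes of cumulative back-off, which is consistent with the "~5 minutes" ceiling mentioned above once per-attempt RPC time is added on top.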
If we want to compare 0.94 and 0.96, maybe we should use the same settings, i.e.:
pause: 1000ms
backoff: { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64 }
hbase.client.max.perserver.tasks: 1

This does not match exactly (0.96 will still send more tasks at peaks, as it always sends data to all servers, for example, and there is still the time limit and the effect of HBASE-9467, which makes me more comfortable with more retries), but hopefully we're not too far off. We can use hbase.client.max.total.tasks if we need to control the clients more. I'm not sure it should be the default (at least for the backoff; the strategy was to improve latency at the cost of server load), but it could be recommended for upgrades and/or map reduce tasks.

Lastly, what's the configuration of the box?

> Client write path perf issues
> -----------------------------
>
>                 Key: HBASE-9775
>                 URL: https://issues.apache.org/jira/browse/HBASE-9775
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.96.0
>            Reporter: Elliott Clark
>            Priority: Critical
>         Attachments: Charts Search Cloudera Manager - ITBLL.png, Charts Search Cloudera Manager.png, job_run.log, short_ycsb.png, ycsb_insert_94_vs_96.png
>
> Testing on larger clusters has not had the desired throughput increases.

--
This message was sent by Atlassian JIRA (v6.1#6144)