[ 
https://issues.apache.org/jira/browse/HBASE-9775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13797182#comment-13797182
 ] 

Nicolas Liochon commented on HBASE-9775:
----------------------------------------

Thanks, Elliott.

bq. So there should be 2 clients per region server.
That's something that would work fine with 0.94 out of the box, right?

Is there anything on the server that could explain the server timeout 
(SocketTimeoutException)?

With 150 clients, and each client able to send 2 queries per server, a 
server can receive 300 queries simultaneously.
On average it should be less: a client can have only 100 tasks in flight, so it 
will be around 200 (but that's an average: an unlucky server can receive all 300 
requests). The limit on the threads doesn't hold here: there should be fewer 
than 250 threads per client.
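As a back-of-the-envelope check (note that the server count below is an assumed value for illustration, not a number from this issue):

```java
// Hypothetical sketch of the concurrency math above. The two limits
// correspond to hbase.client.max.perserver.tasks and
// hbase.client.max.total.tasks; the server count is assumed.
public class LoadEstimate {

    // Worst case: every client queues its per-server limit against one server.
    static int peakPerServer(int clients, int maxPerServerTasks) {
        return clients * maxPerServerTasks;
    }

    // Average case: each client is capped at maxTotalTasks in flight
    // overall, spread across all the servers.
    static double avgPerServer(int clients, int maxTotalTasks, int servers) {
        return clients * (double) maxTotalTasks / servers;
    }

    public static void main(String[] args) {
        System.out.println(peakPerServer(150, 2));      // 300
        System.out.println(avgPerServer(150, 100, 75)); // 200.0, assuming 75 servers
    }
}
```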

Here are the differences I see between 0.94 and 0.96 that could be related. 
I may be wrong; I'm not sure about all the backports.
 - with the settings above, a server would have received 150 queries max (1 per 
client), instead of the 300 worst case / 150 average we have now.
 - the server rejects the client when it's busy (HBASE-9467). That increases the 
number of retries to do, and, under heavy load, can make us fail on 
something that would have worked before.
 - we're much more aggressive on the time before retrying (100ms vs. 1000ms), 
and the backoff schedule is different. It was { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64 }; 
it's now { 1, 2, 3, 5, 10, 100 }. The number of retries was 10; it's now 31. But 
we increase the server load as we retry more aggressively: for example, the 
new settings make the client send 4 queries in 1 second when they fail. 
If the servers can handle the load, that's great. If there are 150 clients like 
this, maybe not.
 - we now stop after ~5 minutes (calculated from the number of retries and the 
backoff times), whatever the number of retries actually made. I'm not sure 
that's the issue here (I would need the debug logs to know), but I've seen it in 
this test on other clusters (we were not doing all the retries).
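For reference, the retry budget implied by those numbers can be sketched as below. This is a rough estimate: the multiplier tables and retry counts are the ones quoted above, and it ignores the jitter the client adds to each pause.

```java
// Rough sketch of the retry budget discussed above: each retry i sleeps
// pauseMs * multipliers[min(i, last)], and the last multiplier repeats.
public class RetryBackoff {

    static long totalBackoffMs(long pauseMs, int[] multipliers, int retries) {
        long total = 0;
        for (int i = 0; i < retries; i++) {
            total += pauseMs * multipliers[Math.min(i, multipliers.length - 1)];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] v096 = {1, 2, 3, 5, 10, 100};                // 0.96: pause 100ms, 31 retries
        int[] v094 = {1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64}; // 0.94: pause 1000ms, 10 retries

        // The first three 0.96 pauses sum to 600ms, hence 4 sends in the first second.
        System.out.println(totalBackoffMs(100, v096, 31));  // 262100 ms, roughly the ~5 minute budget
        System.out.println(totalBackoffMs(1000, v094, 10)); // 71000 ms, about 71 seconds
    }
}
```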

Is there anything that I forgot?

If we want to compare 0.94 and 0.96, maybe we should use the same settings, 
i.e.
pause: 1000ms
backoff:  { 1, 1, 1, 2, 2, 4, 4, 8, 16, 32, 64 }
hbase.client.max.perserver.tasks: 1

This does not match exactly (the 0.96 client will still send more tasks at 
peaks, as it always sends data to all servers, for example, and there are still 
the time limit and the effect of HBASE-9467, which make me more comfortable 
with more retries), but hopefully we're not too far off. We can use 
hbase.client.max.total.tasks if we need to control the clients more.
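If it helps, those settings could look like the fragment below in hbase-site.xml. This is only a sketch: as far as I know the backoff multiplier table itself is a compile-time constant (HConstants.RETRY_BACKOFF), so only the pause and the task limits are configurable this way.

```xml
<!-- Sketch: approximating the 0.94-style client behavior on 0.96. -->
<property>
  <name>hbase.client.pause</name>
  <value>1000</value> <!-- ms between retries, before the backoff multiplier -->
</property>
<property>
  <name>hbase.client.max.perserver.tasks</name>
  <value>1</value> <!-- at most 1 in-flight task per region server -->
</property>
<property>
  <name>hbase.client.max.total.tasks</name>
  <value>100</value> <!-- overall in-flight cap, if we need to throttle more -->
</property>
```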

I'm not sure it should be the default (at least for the backoff; the strategy 
was to favor latency over server load). But it could be recommended for 
upgrades and/or map reduce tasks.

Lastly, what's the configuration of the box?

> Client write path perf issues
> -----------------------------
>
>                 Key: HBASE-9775
>                 URL: https://issues.apache.org/jira/browse/HBASE-9775
>             Project: HBase
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 0.96.0
>            Reporter: Elliott Clark
>            Priority: Critical
>         Attachments: Charts Search   Cloudera Manager - ITBLL.png, Charts 
> Search   Cloudera Manager.png, job_run.log, short_ycsb.png, 
> ycsb_insert_94_vs_96.png
>
>
> Testing on larger clusters has not had the desired throughput increases.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
