Hi Daryn,

 

A slightly related question: have you used -refreshCallQueue to tune the 
config for the fair call queue instead of the normal maintenance (failover + 
restart)?  If so, what is the performance impact?

 

Thanks,
Fengnan

 

From: Daryn Sharp <da...@verizonmedia.com>
Date: Thursday, November 5, 2020 at 10:58 AM
To: Fengnan Li <loyal...@gmail.com>
Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
Subject: Re: [E] Cost Based FairCallQueue latency issue

 

We're internally running the patch I submitted on HDFS-14403, which was 
subsequently modified by other people in the community, so it's possible the 
community flavor may behave differently.  I vaguely remember the RpcMetrics 
time unit was changed from micros to millis.  Measuring in millis has 
meaningless precision.

 

WeightedTimeCostProvider is what enables the feature.  The blacklist is a 
different feature, so if twiddling that conf caused noticeable latency 
differences then I'd suggest examining that change.

 

I don't think you are going to see much benefit from 2 queues with a .01 decay 
factor.  I'd suggest at least 4 queues with 0.5 decay so users generating heavy 
load don't keep popping back up in priority so quickly.
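
To make the decay-factor point concrete, here is a minimal sketch of the arithmetic, assuming the scheduler multiplies each user's remembered cost by the decay factor once per period and that the user sends no new calls in the meantime. The function names, the starting cost of 1000 and the floor of 1 are made up for illustration; the real scheduler compares each user's share of the total cost against the thresholds rather than an absolute floor.

```python
def decayed_cost(initial_cost: float, decay_factor: float, periods: int) -> float:
    """Cost still remembered after `periods` decay sweeps with no new calls."""
    return initial_cost * decay_factor ** periods


def periods_until_below(initial_cost: float, decay_factor: float, floor: float) -> int:
    """Count decay sweeps until the remembered cost drops below `floor`.

    `floor` is a hypothetical stand-in for "low enough to regain priority".
    """
    if not 0.0 < decay_factor < 1.0:
        raise ValueError("decay_factor must be strictly between 0 and 1")
    if floor <= 0.0:
        raise ValueError("floor must be positive")
    periods = 0
    cost = initial_cost
    while cost >= floor:
        cost *= decay_factor  # one sweep of the (assumed) multiplicative decay
        periods += 1
    return periods


if __name__ == "__main__":
    # Compare the 0.01 factor from the posted config with the suggested 0.5.
    for factor in (0.01, 0.5):
        print(factor, periods_until_below(1000.0, factor, 1.0))
```

Under these assumptions the 0.01 factor forgets the heavy user after 2 sweeps, while 0.5 takes 10. With the 20000 ms period from the config in this thread, that is roughly 40 seconds versus 200 seconds of being held at lower priority.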

 

 

 

On Thu, Nov 5, 2020 at 11:43 AM Fengnan Li <loyal...@gmail.com> wrote:

Thanks for the response Daryn!

 

I agree with you that the overall average qtime will increase due to the 
penalty FCQ brings to the heavy users. However, in our environment, for the 
same reason I intentionally turned off call selection between queues, i.e. 
the cost is calculated as usual, but all users stay in the first queue. This 
is to avoid the overall impact. 

Here are our configs; the red one is what I added for internal use to turn on 
this feature (so that only selected users are actually moved into the second 
queue when their cost reaches the threshold).

 

There are two patches for Cost Based FCQ: 
https://issues.apache.org/jira/browse/HADOOP-16266 and 
https://issues.apache.org/jira/browse/HDFS-14667. Which version are you using? 

I am currently trying to debug them one by one.

 

Thanks,
Fengnan

 

<property>
  <name>ipc.8020.callqueue.capacity.weights</name>
  <value>99,1</value>
</property>
<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
<property>
  <name>ipc.8020.cost-provider.impl</name>
  <value>org.apache.hadoop.ipc.WeightedTimeCostProvider</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.blacklisted.users.enabled</name>
  <value>true</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.decay-factor</name>
  <value>0.01</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.period-ms</name>
  <value>20000</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.thresholds</name>
  <value>15</value>
</property>
<property>
  <name>ipc.8020.faircallqueue.multiplexer.weights</name>
  <value>99,1</value>
</property>
<property>
  <name>ipc.8020.scheduler.priority.levels</name>
  <value>2</value>
</property>

 

From: Daryn Sharp <da...@verizonmedia.com>
Date: Thursday, November 5, 2020 at 9:19 AM
To: Fengnan Li <loyal...@gmail.com>
Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
Subject: Re: [E] Cost Based FairCallQueue latency issue

 

I submitted the original 2.8 cost-based FCQ patch (thanks to community members 
for porting to other branches).  We've been running with it since early 2019 on 
all clusters.  Multiple clusters run at a baseline of ~30k+ ops/sec with some 
bursting over 100k ops/sec.  

 

If you are looking at the overall average qtime, yes, that metric is expected 
to increase, and that means it's working as designed.  De-prioritizing 
write-heavy users will naturally result in increased qtime for those calls.  
Within a bucket, call N's qtime is the sum of the qtime+processing for the 
prior 0..N-1 calls.  This will appear very high for congested low-priority 
buckets receiving a fraction of the service rate, and it skews the overall 
average.
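
The skew described above can be sketched with a toy model. It assumes a single FIFO bucket in which each call waits for the processing of every call ahead of it, stretched by the fraction of service time the bucket receives. The function name, the `service_share` parameter and the sample numbers are hypothetical, and real handlers run concurrently, so this only illustrates the shape of the effect.

```python
from itertools import accumulate
from statistics import mean


def queue_times(processing_ms, service_share=1.0):
    """Approximate queue time of each call in one FIFO bucket.

    Call i waits for the processing of calls 0..i-1, divided by the
    fraction of handler time this bucket gets (a modelling assumption).
    """
    if not 0.0 < service_share <= 1.0:
        raise ValueError("service_share must be in (0, 1]")
    # Running total of work queued ahead of each call: [0, p0, p0+p1, ...]
    prior_work = list(accumulate(processing_ms, initial=0.0))[:-1]
    return [work / service_share for work in prior_work]


if __name__ == "__main__":
    # Hypothetical load: 4 calls of 1 ms each in both buckets.
    high = queue_times([1.0] * 4, service_share=0.99)
    low = queue_times([1.0] * 4, service_share=0.01)
    print("high-priority mean qtime:", mean(high))
    print("low-priority mean qtime:", mean(low))
    print("overall mean qtime:", mean(high + low))
```

In this model the overall mean is dominated by the low-priority bucket even though the high-priority calls are barely delayed, which is why per-priority qtime is more informative than the overall average.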

 

 

On Fri, Oct 30, 2020 at 3:51 PM Fengnan Li <loyal...@gmail.com> wrote:

Hi all,



Has anyone deployed Cost Based Fair Call Queue in their production cluster? We 
ran into some RPC queue latency degradation at ~30k-40k rps. I tried to debug 
but didn’t find anything suspicious. It is worth mentioning that the extra heap 
usage for storing the call cost has not caused any memory issue.



Thanks,

Fengnan
