Thanks Daryn,

 

0.01 is just an initial config and will not actually penalize heavy users. We are doing this just to have the code exercised without really using the feature.

The blacklist feature is another step in the same direction, meaning heavy users won't have their calls placed in the second queue.

 

With the above two settings I am evaluating the qtime, since the setup is basically 99% of the call queue size and 100% of the handler resources compared to a single simple queue. That's why I don't understand the qtime difference.

 

I have heard the lock time metrics might be one issue; did you notice that call taking long?

 

Fengnan

 

From: Daryn Sharp <da...@verizonmedia.com>
Date: Thursday, November 5, 2020 at 10:58 AM
To: Fengnan Li <loyal...@gmail.com>
Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
Subject: Re: [E] Cost Based FairCallQueue latency issue

 

We're internally running the patch I submitted on HDFS-14403, which was subsequently modified by other people in the community, so it's possible the community flavor behaves differently.  I vaguely remember the RpcMetrics time unit was changed from micros to millis.  Measuring in millis has meaningless precision.

 

WeightedTimeCostProvider is what enables the feature.  The blacklist is a different feature, so if twiddling that conf caused noticeable latency differences then I'd suggest examining that change.

 

I don't think you are going to see much benefit from 2 queues with a 0.01 decay factor.  I'd suggest at least 4 queues with 0.5 decay so users generating heavy load don't keep popping back up in priority so quickly.
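[To put numbers on the "popping back up" point: the decay scheduler multiplies each user's accumulated cost by the decay factor once per sweep, so the factor controls how quickly heavy load is forgotten. A minimal sketch, in plain Python with illustrative numbers only:]

```python
def decayed_cost(initial_cost, decay_factor, periods):
    """Cost remaining after `periods` decay sweeps, each of which
    multiplies the accumulated cost by the decay factor."""
    return initial_cost * decay_factor ** periods

# A user who accumulated 1,000,000 cost units:
fast = [decayed_cost(1_000_000, 0.01, p) for p in range(3)]  # nearly gone after 2 sweeps
slow = [decayed_cost(1_000_000, 0.5, p) for p in range(3)]   # halves each sweep
```

[With 0.01, a heavy user's cost shrinks to 1% of its value every period (20s in the config below), so they fall back under the threshold and regain top priority almost immediately; with 0.5 the penalty persists for several periods.]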

 

 

 

On Thu, Nov 5, 2020 at 11:43 AM Fengnan Li <loyal...@gmail.com> wrote:

Thanks for the response Daryn!

 

I agree with you that the overall average qtime will increase due to the penalty FCQ brings to heavy users. However, in our environment, out of the same consideration, I intentionally turned off call selection between queues, i.e. the cost is calculated as usual, but all users stay in the first queue. This is to avoid the overall impact.

Here are our configs; the red one (the decay-scheduler.blacklisted.users.enabled property) is what I added for internal use to turn this feature on, making sure only selected users are actually added into the second queue when their cost reaches the threshold.

 

There are two patches for Cost Based FCQ: 
https://issues.apache.org/jira/browse/HADOOP-16266 and 
https://issues.apache.org/jira/browse/HDFS-14667. Which version are you using? 

I am right now trying to debug one by one.

 

Thanks,
Fengnan

 

<property>
  <name>ipc.8020.callqueue.capacity.weights</name>
  <value>99,1</value>
</property>
<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
<property>
  <name>ipc.8020.cost-provider.impl</name>
  <value>org.apache.hadoop.ipc.WeightedTimeCostProvider</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.blacklisted.users.enabled</name>
  <value>true</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.decay-factor</name>
  <value>0.01</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.period-ms</name>
  <value>20000</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.thresholds</name>
  <value>15</value>
</property>
<property>
  <name>ipc.8020.faircallqueue.multiplexer.weights</name>
  <value>99,1</value>
</property>
<property>
  <name>ipc.8020.scheduler.priority.levels</name>
  <value>2</value>
</property>
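[For comparison, Daryn's suggestion of at least 4 queues with 0.5 decay would look roughly like the following. This is only a sketch: the thresholds and multiplexer weights are my guesses at reasonable splits for 4 levels (I believe they match the scheduler's defaults, but that should be verified), not values that have been tested here.]

```xml
<property>
  <name>ipc.8020.scheduler.priority.levels</name>
  <value>4</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.decay-factor</name>
  <value>0.5</value>
</property>
<property>
  <name>ipc.8020.decay-scheduler.thresholds</name>
  <value>13,25,50</value>
</property>
<property>
  <name>ipc.8020.faircallqueue.multiplexer.weights</name>
  <value>8,4,2,1</value>
</property>
```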

 

From: Daryn Sharp <da...@verizonmedia.com>
Date: Thursday, November 5, 2020 at 9:19 AM
To: Fengnan Li <loyal...@gmail.com>
Cc: Hdfs-dev <hdfs-dev@hadoop.apache.org>
Subject: Re: [E] Cost Based FairCallQueue latency issue

 

I submitted the original 2.8 cost-based FCQ patch (thanks to community members 
for porting to other branches).  We've been running with it since early 2019 on 
all clusters.  Multiple clusters run at a baseline of ~30k+ ops/sec with some 
bursting over 100k ops/sec.  

 

If you are looking at the overall average qtime, yes, that metric is expected 
to increase and means it's working as designed.  De-prioritizing write-heavy 
users will naturally result in increased qtime for those calls.  Within a 
bucket, call N's qtime is the sum of the qtime+processing for the prior 0..N-1 
calls.  This will appear very high for congested low-priority buckets receiving 
a fraction of the service rate, and it will skew the overall average.
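[A toy model of that skew, in plain Python; `service_fraction` is a made-up stand-in for the share of handler service a bucket receives, and the numbers are illustrative only:]

```python
def bucket_qtimes(num_calls, processing_ms, service_fraction):
    """Queue time of each call in one bucket: call N waits for calls
    0..N-1 to drain, and a bucket getting only a fraction of the
    handler service drains proportionally slower."""
    drain_ms = processing_ms / service_fraction
    return [n * drain_ms for n in range(num_calls)]

high = bucket_qtimes(99, processing_ms=1.0, service_fraction=0.99)  # 99% share
low = bucket_qtimes(99, processing_ms=1.0, service_fraction=0.01)   # 1% share

# The congested low-priority bucket's qtimes dwarf the high-priority
# ones, so the overall average is dominated by the penalized calls.
overall_avg = (sum(high) + sum(low)) / (len(high) + len(low))
```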

 

 

On Fri, Oct 30, 2020 at 3:51 PM Fengnan Li <loyal...@gmail.com> wrote:

Hi all,



Has someone deployed the Cost Based Fair Call Queue in their production cluster? We 
ran into some RPC queue latency degradation at ~30k-40k rps. I tried to debug 
but didn't find anything suspicious. It is worth mentioning that there is no memory 
issue from the extra heap used to store the call costs.



Thanks,

Fengnan
