[dpdk-dev] [PATCH v2 6/7] rte_sched: eliminate floating point in calculating byte clock

Stephen Hemminger Tue, 17 Feb 2015 11:05:00 -0500

On Mon, 16 Feb 2015 22:44:31 +0000
"Dumitrescu, Cristian" <cristian.dumitrescu at intel.com> wrote:


> Hi Stephen,
> 
> Sorry, NACK.
> 
> 1. Overflow issue
> As you declare cycles_per_byte as uint32_t, for a CPU frequency of 2-3 GHz, 
> the line of code below results in overflow:
>       port->cycles_per_byte = (rte_get_tsc_hz() << RTE_SCHED_TIME_SHIFT) / 
> params->rate;
> Therefore, there is most likely a significant accuracy loss, which might 
> result in more packets allowed to go out than it should.

The tsc shifted is still 64 bits.
and rate is 32 bits bytes/sec.

I chose scale such that
if clock = 3 Ghz
then min rate = 715 bytes/sec =  5722 bits/sec

> 2. Integer division has a higher cost than floating point division
> My understanding is we are considering a performance improvement by replacing 
> the double precision floating point division in:
>       double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte;
> with an integer division:
>       uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT) / 
> port->cycles_per_byte;
> I don't think this is going to have the claimed benefit, as acording to 
> "Intel 64 and IA-32 Architectures Optimization  Reference Manual" (Appendix 
> C), the latency of the integer division instruction is significantly bigger 
> than the latency of integer division:
>       Instruction FDIV double precision: latency = 38-40 cycles
>       Instruction IDIV: latency = 56 - 80 cycles

I observed that performance when from 5Gbit/sec to 10Gbit/sec.
Mostly because the floating point engages more instruction units and does not
pipeline. Cycle count is not everything.  This was on Ivy Bridge processor.


> 3. Alternative
> I hear though your suggestion about replacing the floating point division 
> with a more performant construction. One suggestion would be to replace it 
> with an integer multiplication followed by a shift right, probably by using a 
> uint64_t bytes_per_cycle_scaled_up (the inverse of cycles_per_bytes). I need 
> to prototype this code myself. Would you be OK to look into providing an 
> alternative implementation?
>

I looked into multiplative integer method, and will do it in future. But it has
more scaling issues since it would require that the values both be 32 bits.

[dpdk-dev] [PATCH v2 6/7] rte_sched: eliminate floating point in calculating byte clock

Reply via email to