On Thu, 12 Mar 2015 19:06:39 +0000 "Dumitrescu, Cristian" <cristian.dumitrescu at intel.com> wrote:
> > > > -----Original Message----- > > From: Stephen Hemminger [mailto:stephen at networkplumber.org] > > Sent: Tuesday, March 10, 2015 4:14 PM > > To: Dumitrescu, Cristian > > Cc: dev at dpdk.org; Stephen Hemminger; Stephen Hemminger > > Subject: [PATCH v2 6/6] rte_sched: eliminate floating point in calculating > > byte > > clock > > > > From: Stephen Hemminger <shemming at brocade.com> > > > > The old code was doing a floating point divide for each rte_dequeue() > > which is very expensive. Change to using fixed point scaled math instead. > > This improved performance from 5Gbit/sec to 10 Gbit/sec > > > > Signed-off-by: Stephen Hemminger <stephen at networkplumber.org> > > --- > > v2 -- no changes > > despite objections, the performance observation is real > > on Intel(R) Core(TM) i7-3770 CPU > > > > lib/librte_sched/rte_sched.c | 14 ++++++++++---- > > 1 file changed, 10 insertions(+), 4 deletions(-) > > > > diff --git a/lib/librte_sched/rte_sched.c b/lib/librte_sched/rte_sched.c > > index 74d0e0a..522a647 100644 > > --- a/lib/librte_sched/rte_sched.c > > +++ b/lib/librte_sched/rte_sched.c > > @@ -102,6 +102,9 @@ > > > > #define RTE_SCHED_BMP_POS_INVALID UINT32_MAX > > > > +/* For cycles_per_byte calculation */ > > +#define RTE_SCHED_TIME_SHIFT 20 > > + > > struct rte_sched_subport { > > /* Token bucket (TB) */ > > uint64_t tb_time; /* time of last update */ > > @@ -239,7 +242,7 @@ struct rte_sched_port { > > uint64_t time_cpu_cycles; /* Current CPU time measured in CPU > > cyles */ > > uint64_t time_cpu_bytes; /* Current CPU time measured in bytes > > */ > > uint64_t time; /* Current NIC TX time measured in bytes > > */ > > - double cycles_per_byte; /* CPU cycles per byte */ > > + uint32_t cycles_per_byte; /* CPU cycles per byte (scaled) */ > > > > /* Scheduling loop detection */ > > uint32_t pipe_loop; > > @@ -657,7 +660,9 @@ rte_sched_port_config(struct > > rte_sched_port_params *params) > > port->time_cpu_cycles = rte_get_tsc_cycles(); > > port->time_cpu_bytes = 0; > > port->time = 0; > > - port->cycles_per_byte = ((double) rte_get_tsc_hz()) / ((double) > > params->rate); > > + > > + port->cycles_per_byte = (rte_get_tsc_hz() << > > RTE_SCHED_TIME_SHIFT) > > + / params->rate; > > > > /* Scheduling loop detection */ > > port->pipe_loop = RTE_SCHED_PIPE_INVALID; > > @@ -2126,11 +2131,12 @@ rte_sched_port_time_resync(struct > > rte_sched_port *port) > > { > > uint64_t cycles = rte_get_tsc_cycles(); > > uint64_t cycles_diff = cycles - port->time_cpu_cycles; > > - double bytes_diff = ((double) cycles_diff) / port->cycles_per_byte; > > + uint64_t bytes_diff = (cycles_diff << RTE_SCHED_TIME_SHIFT) > > + / port->cycles_per_byte; > > > > /* Advance port time */ > > port->time_cpu_cycles = cycles; > > - port->time_cpu_bytes += (uint64_t) bytes_diff; > > + port->time_cpu_bytes += bytes_diff; > > if (port->time < port->time_cpu_bytes) { > > port->time = port->time_cpu_bytes; > > } > > -- > > 2.1.4 > > Stephen, > > We agreed during the previous round to look at 64-bit multiplication option, > but looks like this patch is identical to the previous one. Did you meet any > issues in implementing this approach? As stated before, I do not think this > is the best solution for the reasons listed previously, and this part of the > code is too sensitive to take the risk. > > Since Thomas indicated these patches will be considered for 2.1 rather than > 2.0 release, it looks like we have some time to refine this work. I would > reiterate the same proposal that Thomas made: re-submit the patches where we > have consensus, and keep this one out for the moment; you and me can sync up > offline and come back with an implementation proposal that would hopefully > address the previous concerns for 2.1 release. Would this work for you? > > Thank you for your work and for your understanding! > > Regards, > Cristian > I haven't had time to do the analysis to figure out how to keep the maths in range so that 64 bit multiplicative divide would work. When I first tried it the issue was that the cycles_diff and bytes_diff would both have to scaled so that they are 32 bit values. Should be possible, might get back to it (or someone else might take up the cause).