On 20.12.2017 14:03, Ilya Maximets wrote:
>>>> +
>>>> +        if (need_to_flush) {
>>>> +            /* We didn't receive anything in the process loop.
>>>> +             * Check if we need to send something.
>>>> +             * There was no time updates on current iteration. */
>>>> +            pmd_thread_ctx_time_update(pmd);
>>>> +            process_packets = dp_netdev_pmd_flush_output_packets(pmd,
>>>> +                                                                 false);
>>>> +            cycles_count_intermediate(pmd, NULL,
>>>> +                                      process_packets ? PMD_CYCLES_PROCESSING
>>>> +                                                      : PMD_CYCLES_IDLE);
>>>
>>> I noticed the processing cycles for an rxq are not counted here. It means
>>> those processing cycles to tx pkts will no longer be reflected in the rxq
>>> to pmd assignment (or any rxq stats). I realize the tx processing cycles
>>> are now shared, so we will have some inaccuracy anyway, but for an
>>> individual rxq that has to send to vhost, it could be a significant % of
>>> its measured cycles.
>>>
>>> OTOH, this code seems like it would only hit at low rates (no packets on
>>> any queue during the last poll), so I'm not sure how significant it would
>>> be in the overall rxq-pmd assignment. E.g. if the rxq is measured as using
>>> 2% or 7% of a pmd it's probably fine for rxq-pmd assignment, whereas 20%
>>> or 70% could create a very imbalanced distribution.
>>>
>>> If it was significant for rxq-pmd assignment, I'm thinking the best way
>>> would be to add in the cycles required to tx flush each port to each rxq
>>> that has packets in the flush. It's over counting rather than under
>>> counting, but we can't assume any shared batching after a new assignment.
>>>
>>> Let me know what you think? Do you think it would only impact the rxq
>>> measurements during low traffic rates?
>>>
>>> Kevin.
>
> Hmm. I made some tests on my PVP with bonded PHY installation with
> tx-flush-interval=50, packet_size=512B and n_flows=1M.
> (full environment description available here:
> https://mail.openvswitch.org/pipermail/ovs-dev/2017-December/341700.html )
> Results:
>
>     Packet rate,
>     % of 10G line rate    % of packets flushed with rxq = NULL
>     ------------------    -------------------------------------
>            10%                           27.44%
>            25%                           24.03%
>            50%                           23.34%
>            60%                           18.63%
>            65%                           18.40%
>            70%                            3.57%
>            75%                            1.21%
>            95%                            0.04%
>
> Almost all of these flushes happened on the core which polls the HW NICs
> only. Somewhere between 65% and 70% of 10G line rate the HW NICs
> (10G ixgbe) start to receive at least one packet on each polling cycle.
>
> Do you think the above numbers are big enough?
>
> About a possible fix:
> It's hard to find out from which rx queue a packet was received by the
> time we send it; we don't even pass this information to the processing
> functions. One way is to add 'last_rxq' to the pmd context on receive and
> to add this rxq to a list near 'output_pkts' inside OVS_ACTION_ATTR_OUTPUT.
> After that we'll have to move 'cycles_count_intermediate' inside
> 'dp_netdev_process_rxq_port' and 'dp_netdev_pmd_flush_output_on_port',
> where we'll count cycles for receiving+processing and sending,
> respectively. In this case we'll be able to collect almost accurate cycles
> for each rxq separately and even count the impact of each rxq on the
> batched send, adding the appropriate portion of the shared send cycles.
>
> For me, the above approach looks like the only possible way to keep
> nearly accurate per-rxq statistics in case we need them.
>
> Thoughts?
>
>>
>> Good catch,
>>
>> I was going to push this today, but I'll hold off until this issue is
>> resolved. I don't want an existing feature such as the rxq balancing to
>> be negatively impacted if we can avoid it.
>>
>> Ian
>
> Ian, in general, I'm good with pushing the series without this patch,
> because simple output batching provides a significant performance
> improvement by itself for bonding/multiflow cases. Maybe we can delay
> only the time-based approach until we have a solution for the rxq-cycles
> issue.
>
> Best regards, Ilya Maximets.
>
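To make the shared-cycles accounting concrete before the code: the idea is
to remember the source rxq of every packet placed into 'output_pkts' and,
when the batch is flushed, split the measured send interval evenly across
those per-packet sources. Here is a minimal standalone sketch of just that
arithmetic (simplified, hypothetical names; not the actual OVS structures):

-------------------------------------------------------------------
#include <stdio.h>

#define MAX_BURST 32

struct rxq {
    const char *name;
    unsigned long long proc_cycles;   /* Analog of RXQ_CYCLES_PROC_CURR. */
};

/* Credit 'interval' cycles evenly to the per-packet source queues,
 * mirroring the 'interval /= n_rxqs' loop in the incremental below. */
static void
credit_send_cycles(struct rxq **rxqs, int n_pkts, unsigned long long interval)
{
    if (!n_pkts) {
        return;
    }
    interval /= n_pkts;
    for (int i = 0; i < n_pkts; i++) {
        if (rxqs[i]) {                /* NULL source: nothing to credit. */
            rxqs[i]->proc_cycles += interval;
        }
    }
}

int
main(void)
{
    struct rxq a = { "rxq-A", 0 }, b = { "rxq-B", 0 };
    struct rxq *sources[MAX_BURST];
    int n = 0;

    /* Suppose 20 packets of the batch came from A and 12 from B. */
    for (int i = 0; i < 20; i++) { sources[n++] = &a; }
    for (int i = 0; i < 12; i++) { sources[n++] = &b; }

    /* A flush costing 6400 cycles is 200 per packet: A gets 4000, B 2400. */
    credit_send_cycles(sources, n, 6400);
    printf("%s: %llu cycles, %s: %llu cycles\n",
           a.name, a.proc_cycles, b.name, b.proc_cycles);
    return 0;
}
-------------------------------------------------------------------

So a queue that contributed more packets to the shared batch absorbs a
proportionally larger share of the send cost, which is what the rxq-pmd
assignment needs.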
The following incremental should implement the above approach (compile
tested only):
-------------------------------------------------------------------
diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index a612f99..e490793 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -540,6 +540,7 @@ struct tx_port {
     struct hmap_node node;
     long long flush_time;
     struct dp_packet_batch output_pkts;
+    struct dp_netdev_rxq *output_pkts_rxqs[NETDEV_MAX_BURST];
 };
 
 /* A set of properties for the current processing loop that is not directly
@@ -551,6 +552,10 @@ struct dp_netdev_pmd_thread_ctx {
     long long now;
     /* Used to count cycles. See 'cycles_count_end()' */
     unsigned long long last_cycles;
+    /* RX queue from which the last packet was received. */
+    struct dp_netdev_rxq *last_rxq;
+    /* Indicates how the last counted cycles should be treated. */
+    enum pmd_cycles_counter_type current_pmd_cycles_type;
 };
 
 /* PMD: Poll modes drivers. PMD accesses devices via polling to eliminate
@@ -3219,42 +3224,53 @@ cycles_counter(void)
 /* Fake mutex to make sure that the calls to cycles_count_* are balanced */
 extern struct ovs_mutex cycles_counter_fake_mutex;
 
-/* Start counting cycles. Must be followed by 'cycles_count_end()' */
+/* Start counting cycles. Must be followed by 'cycles_count_end()'.
+ * Counting starts from the idle type state. */
 static inline void
 cycles_count_start(struct dp_netdev_pmd_thread *pmd)
     OVS_ACQUIRES(&cycles_counter_fake_mutex)
     OVS_NO_THREAD_SAFETY_ANALYSIS
 {
+    pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_IDLE;
     pmd->ctx.last_cycles = cycles_counter();
 }
 
-/* Stop counting cycles and add them to the counter 'type' */
+/* Stop counting cycles and add them to the counter of the current type. */
 static inline void
-cycles_count_end(struct dp_netdev_pmd_thread *pmd,
-                 enum pmd_cycles_counter_type type)
+cycles_count_end(struct dp_netdev_pmd_thread *pmd)
     OVS_RELEASES(&cycles_counter_fake_mutex)
     OVS_NO_THREAD_SAFETY_ANALYSIS
 {
     unsigned long long interval = cycles_counter() - pmd->ctx.last_cycles;
+    enum pmd_cycles_counter_type type = pmd->ctx.current_pmd_cycles_type;
 
     non_atomic_ullong_add(&pmd->cycles.n[type], interval);
 }
 
-/* Calculate the intermediate cycle result and add to the counter 'type' */
+/* Calculate the intermediate cycle result and add to the counter of
+ * the current type. */
 static inline void
 cycles_count_intermediate(struct dp_netdev_pmd_thread *pmd,
-                          struct dp_netdev_rxq *rxq,
-                          enum pmd_cycles_counter_type type)
+                          struct dp_netdev_rxq **rxqs, int n_rxqs)
     OVS_NO_THREAD_SAFETY_ANALYSIS
 {
     unsigned long long new_cycles = cycles_counter();
     unsigned long long interval = new_cycles - pmd->ctx.last_cycles;
+    enum pmd_cycles_counter_type type = pmd->ctx.current_pmd_cycles_type;
+    int i;
+
     pmd->ctx.last_cycles = new_cycles;
 
     non_atomic_ullong_add(&pmd->cycles.n[type], interval);
-    if (rxq && (type == PMD_CYCLES_PROCESSING)) {
+    if (n_rxqs && (type == PMD_CYCLES_PROCESSING)) {
         /* Add to the amount of current processing cycles. */
-        non_atomic_ullong_add(&rxq->cycles[RXQ_CYCLES_PROC_CURR], interval);
+        interval /= n_rxqs;
+        for (i = 0; i < n_rxqs; i++) {
+            if (rxqs[i]) {
+                non_atomic_ullong_add(&rxqs[i]->cycles[RXQ_CYCLES_PROC_CURR],
+                                      interval);
+            }
+        }
     }
 }
 
@@ -3299,6 +3315,16 @@ dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
     int output_cnt;
     bool dynamic_txqs;
     uint32_t tx_flush_interval;
+    enum pmd_cycles_counter_type save_pmd_cycles_type;
+
+    /* In case we're in PMD_CYCLES_PROCESSING state we need to count
+     * cycles for the rxq we're processing now. */
+    cycles_count_intermediate(pmd, &pmd->ctx.last_rxq, 1);
+
+    /* Save current cycles counting state to restore after accounting
+     * send cycles. */
+    save_pmd_cycles_type = pmd->ctx.current_pmd_cycles_type;
+    pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_PROCESSING;
 
     dynamic_txqs = p->port->dynamic_txqs;
     if (dynamic_txqs) {
@@ -3323,6 +3349,9 @@ dp_netdev_pmd_flush_output_on_port(struct dp_netdev_pmd_thread *pmd,
     dp_netdev_count_packet(pmd, DP_STAT_SENT_PKTS, output_cnt);
     dp_netdev_count_packet(pmd, DP_STAT_SENT_BATCHES, 1);
 
+    /* Update send cycles for all the rx queues and restore previous state. */
+    cycles_count_intermediate(pmd, p->output_pkts_rxqs, output_cnt);
+    pmd->ctx.current_pmd_cycles_type = save_pmd_cycles_type;
     return output_cnt;
 }
 
@@ -3348,7 +3377,7 @@ dp_netdev_pmd_flush_output_packets(struct dp_netdev_pmd_thread *pmd,
 
 static int
 dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
-                           struct netdev_rxq *rx,
+                           struct dp_netdev_rxq *rxq,
                            odp_port_t port_no)
 {
     struct dp_packet_batch batch;
@@ -3356,20 +3385,30 @@ dp_netdev_process_rxq_port(struct dp_netdev_pmd_thread *pmd,
     int batch_cnt = 0, output_cnt = 0;
 
     dp_packet_batch_init(&batch);
-    error = netdev_rxq_recv(rx, &batch);
+
+    cycles_count_intermediate(pmd, NULL, 0);
+    pmd->ctx.last_rxq = rxq;
+    pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_PROCESSING;
+    error = netdev_rxq_recv(rxq->rx, &batch);
+    cycles_count_intermediate(pmd, &rxq, 1);
+
     if (!error) {
         *recirc_depth_get() = 0;
         pmd_thread_ctx_time_update(pmd);
 
         batch_cnt = batch.count;
         dp_netdev_input(pmd, &batch, port_no);
+        cycles_count_intermediate(pmd, &rxq, 1);
+
         output_cnt = dp_netdev_pmd_flush_output_packets(pmd, false);
     } else if (error != EAGAIN && error != EOPNOTSUPP) {
         static struct vlog_rate_limit rl = VLOG_RATE_LIMIT_INIT(1, 5);
 
         VLOG_ERR_RL(&rl, "error receiving data from %s: %s",
-                    netdev_rxq_get_name(rx), ovs_strerror(error));
+                    netdev_rxq_get_name(rxq->rx), ovs_strerror(error));
     }
 
+    pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_IDLE;
+    pmd->ctx.last_rxq = NULL;
     return batch_cnt + output_cnt;
 }
 
@@ -3994,7 +4033,6 @@ dpif_netdev_run(struct dpif *dpif)
     struct dp_netdev *dp = get_dp_netdev(dpif);
     struct dp_netdev_pmd_thread *non_pmd;
     uint64_t new_tnl_seq;
-    int process_packets;
     bool need_to_flush = true;
 
     ovs_mutex_lock(&dp->port_mutex);
@@ -4007,15 +4045,9 @@ dpif_netdev_run(struct dpif *dpif)
                 int i;
 
                 for (i = 0; i < port->n_rxq; i++) {
-                    process_packets =
-                        dp_netdev_process_rxq_port(non_pmd,
-                                                   port->rxqs[i].rx,
-                                                   port->port_no);
-                    cycles_count_intermediate(non_pmd, NULL,
-                                              process_packets
-                                              ? PMD_CYCLES_PROCESSING
-                                              : PMD_CYCLES_IDLE);
-                    if (process_packets) {
+                    if (dp_netdev_process_rxq_port(non_pmd,
+                                                   &port->rxqs[i],
+                                                   port->port_no)) {
                         need_to_flush = false;
                     }
                 }
@@ -4026,14 +4058,10 @@ dpif_netdev_run(struct dpif *dpif)
              * Check if we need to send something.
              * There was no time updates on current iteration. */
             pmd_thread_ctx_time_update(non_pmd);
-            process_packets = dp_netdev_pmd_flush_output_packets(non_pmd,
-                                                                 false);
-            cycles_count_intermediate(non_pmd, NULL, process_packets
-                                                     ? PMD_CYCLES_PROCESSING
-                                                     : PMD_CYCLES_IDLE);
+            dp_netdev_pmd_flush_output_packets(non_pmd, false);
         }
 
-        cycles_count_end(non_pmd, PMD_CYCLES_IDLE);
+        cycles_count_end(non_pmd);
         dpif_netdev_xps_revalidate_pmd(non_pmd, false);
         ovs_mutex_unlock(&dp->non_pmd_mutex);
 
@@ -4213,17 +4241,11 @@ reload:
 
     cycles_count_start(pmd);
     for (;;) {
-        int process_packets;
         bool need_to_flush = true;
 
         for (i = 0; i < poll_cnt; i++) {
-            process_packets =
-                dp_netdev_process_rxq_port(pmd, poll_list[i].rxq->rx,
-                                           poll_list[i].port_no);
-            cycles_count_intermediate(pmd, poll_list[i].rxq,
-                                      process_packets ? PMD_CYCLES_PROCESSING
-                                                      : PMD_CYCLES_IDLE);
-            if (process_packets) {
+            if (dp_netdev_process_rxq_port(pmd, poll_list[i].rxq,
+                                           poll_list[i].port_no)) {
                 need_to_flush = false;
             }
         }
@@ -4233,10 +4255,7 @@ reload:
              * Check if we need to send something.
              * There was no time updates on current iteration. */
             pmd_thread_ctx_time_update(pmd);
-            process_packets = dp_netdev_pmd_flush_output_packets(pmd, false);
-            cycles_count_intermediate(pmd, NULL,
-                                      process_packets ? PMD_CYCLES_PROCESSING
-                                                      : PMD_CYCLES_IDLE);
+            dp_netdev_pmd_flush_output_packets(pmd, false);
         }
 
         if (lc++ > 1024) {
@@ -4257,7 +4276,7 @@ reload:
         }
     }
 
-    cycles_count_end(pmd, PMD_CYCLES_IDLE);
+    cycles_count_end(pmd);
 
     poll_cnt = pmd_load_queues_and_ports(pmd, &poll_list);
     exiting = latch_is_set(&pmd->exit_latch);
@@ -4697,6 +4716,8 @@ dp_netdev_configure_pmd(struct dp_netdev_pmd_thread *pmd, struct dp_netdev *dp,
     ovs_mutex_init(&pmd->port_mutex);
     cmap_init(&pmd->flow_table);
     cmap_init(&pmd->classifiers);
+    pmd->ctx.last_rxq = NULL;
+    pmd->ctx.current_pmd_cycles_type = PMD_CYCLES_IDLE;
    pmd_thread_ctx_time_update(pmd);
     pmd->next_optimization = pmd->ctx.now + DPCLS_OPTIMIZATION_INTERVAL;
     pmd->rxq_next_cycle_store = pmd->ctx.now + PMD_RXQ_INTERVAL_LEN;
@@ -5575,6 +5596,8 @@ dp_execute_cb(void *aux_, struct dp_packet_batch *packets_,
         }
 
         DP_PACKET_BATCH_FOR_EACH (packet, packets_) {
+            p->output_pkts_rxqs[dp_packet_batch_size(&p->output_pkts)] =
+                                                           pmd->ctx.last_rxq;
             dp_packet_batch_add(&p->output_pkts, packet);
         }
         return;
-------------------------------------------------------------------

What do you think?

Best regards, Ilya Maximets.