Hi Jerin,

Much kudos on a huge contribution to the community. I look forward to
spending more time looking at it in the next few days.
I'll bite and ask the obvious question - why would I use rte_graph over
FD.io VPP?

Ray K

On 31/01/2020 17:01, jer...@marvell.com wrote:
> From: Jerin Jacob <jer...@marvell.com>
>
> This RFC is targeted at the v20.05 release.
>
> This RFC patch includes an implementation of a graph architecture for
> packet processing using DPDK primitives.
>
> Using graph traversal for packet processing is a proven architecture
> that has been implemented in various open source libraries.
>
> A graph architecture for packet processing abstracts the data processing
> functions as "nodes" and "links" them together to create a complex
> "graph" of reusable/modular data processing functions.
>
> The RFC patch further adds performance enhancements and modularity
> to DPDK, as discussed in more detail below.
>
> What this RFC patch contains:
> -----------------------------
> 1) The API definition to "create" nodes and "link" them together into a
> "graph" for packet processing. See lib/librte_graph/rte_graph.h
>
> 2) The fast-path API definition for the graph walker and the enqueue
> functions used by the workers. See lib/librte_graph/rte_graph_worker.h
>
> 3) An optimized SW implementation of (1) and (2). See lib/librte_graph/
>
> 4) A test case to verify the graph infrastructure functionality.
> See app/test/test_graph.c
>
> 5) Performance test cases to evaluate the cost of the graph walker and
> the node enqueue fast-path functions for various combinations.
> See app/test/test_graph_perf.c
>
> 6) Packet processing nodes (null, Rx, Tx, pkt drop, IPv4 rewrite, IPv4
> lookup) using the graph infrastructure. See lib/librte_node/*
>
> 7) An example application showcasing l3fwd (same functionality as the
> existing examples/l3fwd) using the graph infrastructure and the packet
> processing nodes from item (6). See examples/l3fwd-graph/.
>
> Performance
> -----------
> 1) The graph walk and node enqueue overhead can be measured with the
> performance test application [1].
> # If all packets go from one node to the same next node (we call this a
> "home run"), moving a burst of packets is just a pointer swap.
> # In the worst case, it takes only a handful of cycles to move an object
> from one node to another.
>
> 2) Performance comparison of the existing l3fwd (completely static code
> without any nodes) vs the modular l3fwd-graph with 5 nodes
> (ip4_lookup, ip4_rewrite, ethdev_tx, ethdev_rx, pkt_drop).
> Here is a graphical representation of the l3fwd-graph as a Graphviz dot
> file: http://bit.ly/39UPPGm
>
> # l3fwd-graph performance is -2.5% wrt static l3fwd.
>
> # We have simulated a similar test with the existing librte_pipeline
> application [4].
> The ip_pipeline application is -48.62% wrt static l3fwd.
>
> The above results are on octeontx2 and may vary on other platforms.
> Platforms with larger L1 and L2 caches should show even better
> performance.
>
> Tested architectures:
> --------------------
> 1) AArch64
> 2) x86
>
>
> Graph library features
> ----------------------
> 1) Nodes as plugins
> 2) Support for out-of-tree nodes
> 3) Multi-process support
> 4) Low-overhead graph walk and node enqueue
> 5) Low-overhead statistics collection infrastructure
> 6) Support for exporting the graph as a Graphviz dot file.
> See rte_graph_export().
> Example of an exported graph: http://bit.ly/2PqbqOy
> 7) Allows alternative graph walk implementations in the future by
> segregating the fast-path and slow-path code.
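>
> As a concrete illustration of feature (6), here is a minimal sketch of
> dumping a graph to a dot file. It assumes an rte_graph_export(name,
> FILE *) style prototype as described above; the exact signature may
> differ in the final API, and the helper and file names used here are
> only illustrative.
>
>   #include <stdio.h>
>   #include <rte_graph.h>
>
>   /* Write the named graph out as Graphviz dot, e.g. to produce the
>    * diagrams referenced in this cover letter. */
>   static int
>   dump_graph_dot(const char *graph_name, const char *path)
>   {
>           FILE *f = fopen(path, "w");
>           int rc;
>
>           if (f == NULL)
>                   return -1;
>           rc = rte_graph_export(graph_name, f);
>           fclose(f);
>           return rc;
>   }
>
> The resulting file can then be rendered with, e.g.,
> "dot -Tsvg worker0.dot -o worker0.svg".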
>
>
> Advantages of the graph architecture:
> -------------------------------------
>
> 1) Memory latency is the enemy of high-speed packet processing; grouping
> similar packet processing code into a node reduces i-cache and d-cache
> misses.
> 2) It exploits the probability that most packets will follow the same
> nodes in the graph.
> 3) It allows SIMD instructions for a node's packet processing.
> 4) The modular scheme gives consumers reusable nodes.
> 5) The modular scheme allows us to abstract vendor-specific HW
> optimizations as a node.
>
>
> How this differs from the existing libpipeline library
> -------------------------------------------------------
> At a very high level, libpipeline was created to provide a modular
> plugin interface. Based on our analysis, performance is better in the
> graph model; see the details under the Performance section, item (2).
>
> This rte_graph implementation also addresses some architectural and
> implementation limitations of libpipeline:
>
> 1) Use cases like IP fragmentation and TCP ACK processing (with new TCP
> data sent out in the same context) are problematic, as
> rte_pipeline_run() passes only a 64-bit pkts_mask to the different
> tables and the packet pointers are stored in a single array inside the
> pipeline object.
>
> In the graph architecture, a node has complete control over how many
> packets it outputs to the next node, seamlessly.
>
> 2) Since the pkts_mask is passed to the different tables, it takes
> multiple for loops to extract the packets out of a fragmented
> pkts_mask. This makes it difficult to prefetch a set of packets ahead
> of time. This issue does not exist in the graph architecture.
>
> 3) Every table has two or three function pointers, unlike the graph
> architecture, which has a single function pointer per node.
>
> 4) The current libpipeline main fast-path function doesn't support a
> tree-like topology where 64 packets can be redirected to 64 different
> tables. It is currently limited to a table-based next-table id instead
> of a per-packet, action-based next-table id. So in a typical case,
> tables need to be cascaded and traversed sequentially to reach the last
> table.
>
> 5) The pkts_mask limit of 64 bits caps the maximum possible burst size.
> The graph library supports bursts of up to 256.
>
> In short, the two are significantly different architectures. Keeping
> both in DPDK and letting the end user choose the model would be the
> more appropriate decision.
>
>
> Why this RFC
> ------------
> 1) We believe the graph architecture provides the best performance for
> a reusable/modular packet processing framework.
> Since DPDK does not have one, it is good to have it in DPDK.
>
> 2) Based on our experience, NPU HW accelerators differ greatly from one
> vendor to another. Going forward, we believe API abstraction alone may
> not be enough to abstract the differences in HW. Vendor-specific nodes
> can abstract the HW differences while reusing the generic nodes as
> needed. This would help both the silicon vendors and DPDK end users.
>
> 3) The framework lets a protocol stack use the native mbuf for graph
> processing, avoiding any conversion between formats, for better
> performance.
>
> 4) DPDK has become the "go-to library" for userspace HW acceleration;
> it is good to have a native graph packet processing library in DPDK.
>
> 5) Obviously, our customers are interested in a graph library in DPDK :-)
>
> Identified tweaks for better performance on different targets
> --------------------------------------------------------------
> 1) Test with various burst size values (256, 128, 64, 32) using the
> CONFIG_RTE_GRAPH_BURST_SIZE config option.
> Based on our testing, the sweet spot on x86 and arm64 servers is a
> burst size of 256, while on arm64 embedded SoCs it is either 64 or 128.
>
> 2) Disable node statistics (using the CONFIG_RTE_LIBRTE_GRAPH_STATS
> config option) if they are not needed.
>
> 3) Use the arm64-optimized memory copy on the arm64 architecture by
> selecting CONFIG_RTE_ARCH_ARM64_MEMCPY.
>
> Commands to run tests
> ---------------------
>
> [1]
> perf test:
> echo "graph_perf_autotest" | sudo ./build/app/test/dpdk-test -c 0x30
>
> [2]
> functionality test:
> echo "graph_autotest" | sudo ./build/app/test/dpdk-test -c 0x30
>
> [3]
> l3fwd-graph:
> ./l3fwd-graph -c 0x100 -- -p 0x3 --config="(0, 0, 8)" -P
>
> [4]
> # ./ip_pipeline -c 0xff0000 -- -s route.cli
>
> route.cli (copy-paste into the shell to avoid DOS format issues):
>
> https://pastebin.com/raw/B4Ktx7TT
>
>
> Next steps
> ----------
> 1) Gather feedback from the community on the library.
> 2) Collect the API requirements from the community.
> 3) Send the next version, addressing the community's initial feedback
> and fixing the identified "pending items" below.
>
>
> Pending items (will be addressed in the next revision)
> -------------------------------------------------------
> 1) Add documentation as a patch.
> 2) Add Doxygen API documentation.
> 3) Split the patches at a more logical level for better review.
> 4) Code cleanup.
> 5) More optimizations in the nodes and the graph infrastructure.
>
>
> Programming guide and API walk-through
> --------------------------------------
> # Anatomy of a node:
> ~~~~~~~~~~~~~~~~~~~~
> See https://github.com/jerinjacobk/share/blob/master/Anatomy_of_a_node.svg
>
> The above diagram depicts the anatomy of a node, the basic building
> block of the graph framework.
>
> A node consists of:
>
> a) process():
>
> The callback function invoked by the worker thread via rte_graph_walk()
> when there is data to be processed by the node. A graph node processes
> the objects in process() and enqueues them to the next downstream node
> using the rte_node_enqueue*() functions.
>
> b) Context memory:
>
> Memory allocated by the library to store node-specific context
> information, used by the process(), init() and fini() callbacks.
>
> c) init():
>
> The callback function invoked by rte_graph_create() when a node gets
> attached to a graph.
>
> d) fini():
>
> The callback function invoked by rte_graph_destroy() when a node gets
> detached from a graph.
>
> e) Node name:
>
> The name of the node. When a node registers with the graph library, the
> library returns an ID of rte_node_t type. Either the ID or the name can
> be used to look up the node; rte_node_from_name() and
> rte_node_id_to_name() are the node lookup functions.
>
> f) nb_edges:
>
> The number of downstream nodes connected to this node. next_nodes[]
> stores the downstream node objects. The rte_node_edge_update() and
> rte_node_edge_shrink() functions are used to update the next_nodes[]
> objects. Consumers of the node APIs are free to update the next_nodes[]
> objects until rte_graph_create() is invoked.
>
> g) next_nodes[]:
>
> The dynamic array storing the downstream nodes connected to this node.
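>
> To make the anatomy above concrete, here is a minimal sketch of a node
> definition. It assumes the 'struct rte_node_register' ops/attributes
> and the RTE_NODE_REGISTER() constructor scheme described in the next
> section; field names and callback prototypes may differ slightly in the
> final patch, and "my_node" is purely illustrative.
>
>   #include <rte_graph.h>
>   #include <rte_graph_worker.h>
>
>   static int
>   my_node_init(const struct rte_graph *graph, struct rte_node *node)
>   {
>           /* node->ctx is the per-node context memory described in (b) */
>           return 0;
>   }
>
>   static uint16_t
>   my_node_process(struct rte_graph *graph, struct rte_node *node,
>                   void **objs, uint16_t nb_objs)
>   {
>           /* Hand the whole burst to edge 0 ("pkt_drop" below). */
>           rte_node_enqueue(graph, node, 0, objs, nb_objs);
>           return nb_objs;
>   }
>
>   static struct rte_node_register my_node = {
>           .name = "my_node",
>           .init = my_node_init,
>           .process = my_node_process,
>           .nb_edges = 1,
>           .next_nodes = { "pkt_drop" },   /* f)/g): static edge list */
>   };
>   RTE_NODE_REGISTER(my_node);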
>
>
> # Node creation and registration
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> a) The node implementer creates the node by implementing the ops and
> attributes of 'struct rte_node_register'.
> b) The library registers the node by invoking RTE_NODE_REGISTER on
> library load, using the constructor scheme. The constructor scheme is
> used here to support multi-process.
>
>
> # Link the nodes to create the graph topology
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> See https://github.com/jerinjacobk/share/blob/master/Link_the_nodes.svg
>
> The above diagram shows a graph topology after linking N nodes.
>
> Once the nodes are available to the program, the application or the
> node public API functions can link them together to create a complex
> packet processing graph.
>
> There are several strategies to link the nodes:
>
> Method a) Provide the next_nodes[] at node registration time.
> See 'struct rte_node_register::nb_edges'. This addresses the static
> node scheme, where the next_nodes[] of the node are known upfront.
>
> Method b) Use rte_node_edge_get(), rte_node_edge_update() and
> rte_node_edge_shrink() to update the next_nodes[] links of the node
> dynamically.
>
> Method c) Use rte_node_clone() to clone an already existing node.
> When rte_node_clone() is invoked, the library clones all the attributes
> of the node and creates a new one. The name of the cloned node is
> "parent_node_name-user_provided_name". This method enables the use case
> of Rx and Tx nodes, where several of those nodes need to be cloned
> based on the number of CPUs available in the system. The cloned nodes
> are identical except for their "context memory"; the context memory
> holds the port and queue pair information in the case of the Rx and Tx
> ethdev nodes.
>
> # Create the graph object
> ~~~~~~~~~~~~~~~~~~~~~~~~~
> Now that the nodes are linked, it is time to create a graph by
> including the required nodes. The application provides a set of node
> patterns to form a graph object; fnmatch() is used underneath for the
> pattern matching to include the required nodes.
>
> The rte_graph_create() API is used to create the graph.
>
> Example of a graph object creation:
>
> {"ethdev_rx_0_0", "ipv4-*", "ethdev_tx_0_*"}
>
> In the above example, a graph object will be created with the ethdev Rx
> node of port 0 and queue 0, all ipv4* nodes in the system, and the
> ethdev Tx node of port 0 with all its queues.
>
>
> # Multi-core graph processing
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> In the current graph library implementation, the rte_graph_walk() and
> rte_node_enqueue* fast-path API functions are designed to work on a
> single core for better performance. The fast-path API works on a graph
> object, so the multi-core graph processing strategy is to create a
> graph object PER WORKER.
>
>
> # In the fast path:
> ~~~~~~~~~~~~~~~~~~~
>
> Typical fast-path code looks like below, where the application gets the
> fast-path graph object through rte_graph_lookup() on the worker thread
> and runs rte_graph_walk() in a tight loop.
>
> struct rte_graph *graph = rte_graph_lookup("worker0");
>
> while (!done) {
>     rte_graph_walk(graph);
> }
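>
> Tying the graph creation, multi-core and fast-path sections above
> together, here is a hedged sketch of the per-worker model. It assumes a
> creation-parameter struct along the lines of 'struct rte_graph_param'
> carrying the node name patterns; the exact structure, the "worker0"
> name and the pattern strings used here are illustrative and may differ
> from the RFC code.
>
>   #include <rte_common.h>
>   #include <rte_graph.h>
>   #include <rte_graph_worker.h>
>   #include <rte_launch.h>
>   #include <rte_lcore.h>
>
>   static int
>   graph_worker(void *arg)
>   {
>           /* Look up the per-worker graph created by the main lcore. */
>           struct rte_graph *graph = rte_graph_lookup((const char *)arg);
>
>           while (1)
>                   rte_graph_walk(graph);
>           return 0;
>   }
>
>   static void
>   setup_worker0(void)
>   {
>           static const char *patterns[] = {
>                   "ethdev_rx_0_0", "ipv4-*", "ethdev_tx_0_*",
>           };
>           struct rte_graph_param prm = {
>                   .socket_id = -1, /* any NUMA socket */
>                   .nb_node_patterns = RTE_DIM(patterns),
>                   .node_patterns = patterns,
>           };
>
>           /* One graph object per worker, named after the worker. */
>           rte_graph_create("worker0", &prm);
>           rte_eal_remote_launch(graph_worker, (void *)"worker0",
>                                 rte_get_next_lcore(-1, 1, 0));
>   }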
>
> # Context update while the graph walk is in action
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> The fast-path object for a node is 'struct rte_node'.
>
> It may be that, in the slow path or while the graph walk is in action,
> the user needs to update the context of a node and hence needs access
> to the struct rte_node * memory.
>
> The rte_graph_foreach_node(), rte_graph_node_get() and
> rte_graph_node_get_by_name() APIs can be used to get the
> struct rte_node *. The rte_graph_foreach_node() iterator works on the
> struct rte_graph * fast-path graph object, while the others work on the
> graph ID or name.
>
>
> # Get the node statistics using a graph cluster
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> The user may need to know the aggregate stats of a node across multiple
> graph objects, especially when each graph object is bound to a worker
> thread.
>
> A graph cluster object is introduced for statistics.
> rte_graph_cluster_stats_create() creates a graph cluster with multiple
> graph objects, and rte_graph_cluster_stats_get() returns the aggregate
> node statistics.
>
> An example statistics output from rte_graph_cluster_stats_get():
>
> +-------+----------+------------+--------------+----------+---------------+------------+
> |Node   |calls     |objs        |realloc_count |objs/call |objs/sec(10E6) |cycles/call |
> +-------+----------+------------+--------------+----------+---------------+------------+
> |node0  |12977424  |3322220544  |5             |256.000   |3047.151872    |20.0000     |
> |node1  |12977653  |3322279168  |0             |256.000   |3047.210496    |17.0000     |
> |node2  |12977696  |3322290176  |0             |256.000   |3047.221504    |17.0000     |
> |node3  |12977734  |3322299904  |0             |256.000   |3047.231232    |17.0000     |
> |node4  |12977784  |3322312704  |1             |256.000   |3047.243776    |17.0000     |
> |node5  |12977825  |3322323200  |0             |256.000   |3047.254528    |17.0000     |
> +-------+----------+------------+--------------+----------+---------------+------------+
>
> # Node writing guidelines
> ~~~~~~~~~~~~~~~~~~~~~~~~~
>
> The process() function of a node is a fast-path function and needs to
> be written carefully to achieve maximum performance.
>
> Broadly speaking, there are two different types of nodes.
>
> 1) The first kind of node has a fixed next_nodes[] for the complete
> burst (like ethdev_rx, ethdev_tx) and is simple to write. Its process()
> function can move the object burst to the next node either using
> rte_node_next_stream_move() or using rte_node_next_stream_get() and
> rte_node_next_stream_put().
>
> 2) The second kind is the `intermediate node`, which decides the next
> node on a per-packet basis. In these nodes,
>
> a) firstly, there has to be the best possible packet processing logic;
> b) secondly, each packet needs to be queued to its next node.
>
> At least on some architectures, we get around ~10% more performance if
> we can avoid copying packet pointers from one node to the next, as that
> cost is roughly memcpy(BURST_SIZE x sizeof(void *)) x NODE_COUNT.
>
> This can be avoided only when all the packets are destined to the same
> next node. We call this the "home run" case, and we use
> rte_node_next_stream_move() to move the burst of objects by swapping
> the array pointer, i.e. to move the stream from one node to the next
> with the least number of cycles.
>
> Example of an intermediate node implementation with home run (a code
> sketch of these steps follows the in-tree node documentation below):
> a) Start with the speculation that next_node = ctx->next_node.
> This could be the next_node the application used in the previous
> invocation of this node.
>
> b) Get the next_node stream array and its available space using
> rte_node_next_stream_get(next_node, &space).
> c) While space != 0 and n_pkts_left != 0, prefetch the next pkt_set and
> process the current pkt_set to find their next node.
> d) If all the next nodes of the current pkt_set match the speculated
> next node, just count them as successfully speculated (last_spec) so
> far and continue the loop without actually moving them to the next
> node. Otherwise, on a mismatch, copy all the pkt_set pointers that were
> last_spec and move the current pkt_set to their respective next nodes
> using rte_node_enqueue_x1(). One of the next nodes can also be adopted
> as the speculated next_node if it is more probable. Also set
> last_spec = 0.
> e) If n_pkts_left != 0 and space != 0,
> go to c), as there is space in the speculated next_node.
> f) If last_spec == n_pkts_left, then we successfully speculated all the
> packets to the right next node. Just call
> rte_node_next_stream_move(node, next_node) to move the stream/object
> array to the next node. This is the home run, where we avoided the
> memcpy of buffer pointers to the next node.
> g) If space == 0 and n_pkts_left != 0, go to b).
> h) Update ctx->next_node with the more probable next node.
>
> # In-tree node documentation
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> a) librte_node/ethdev_rx.c:
> This node does rte_eth_rx_burst() into a stream buffer acquired using
> rte_node_next_stream_get() and calls rte_node_next_stream_put(count)
> only when packets have been received. Each rte_node works on only one
> Rx port and queue, which it gets from node->context.
> For each (port X, rx_queue Y), an rte_node is cloned from
> ethdev_rx_base_node as "ethdev_rx-X-Y" in rte_node_eth_config(), along
> with updating node->context. Each graph needs to be associated with a
> unique rte_node for a (port, rx_queue).
>
> b) librte_node/ethdev_tx.c:
> This node does rte_eth_tx_burst() for the burst of objects it receives.
> It sends the burst to a fixed Tx port and queue taken from
> node->context. For each port X, this rte_node is cloned from
> ethdev_tx_node_base as "ethdev_tx-X" in rte_node_eth_config(), along
> with updating node->context.
> Since each graph doesn't need more than one Tx queue per port, a Tx
> queue is assigned to each rte_node instance based on the graph id.
> Each graph needs to be associated with an rte_node for each port.
>
> c) librte_node/pkt_drop.c:
> This node frees all the objects passed to it.
>
> d) librte_node/ip4_lookup.c:
> This node is an intermediate node that does an LPM lookup for the
> received IPv4 packets, and the result determines each packet's next
> node.
> a) On a successful LPM lookup, the result contains the next_node id and
> the next-hop id with which the packet needs to be further processed.
> b) On an LPM lookup failure, objects are redirected to the pkt_drop
> node.
> rte_node_ip4_route_add() is the control-path API to add IPv4 routes.
> To achieve the home run, we use rte_node_next_stream_move() as
> mentioned in the sections above.
>
> e) librte_node/ip4_rewrite.c:
> This node gets packets from the ip4_lookup node; the next-hop id for
> each packet is embedded in rte_node_mbuf_priv1(mbuf)->nh. This id is
> used to determine the L2 header to be written to the packet before
> sending it out to a particular ethdev_tx node.
> rte_node_ip4_rewrite_add() is the control-path API to add next-hop
> info.
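>
> As promised above, here is a hedged sketch of the speculation scheme
> from steps a)-h), heavily simplified: it ignores the stream "space"
> handling of the real nodes, and classify() stands in for whatever
> per-packet lookup (e.g. LPM) an actual intermediate node would do. The
> helper names are illustrative and not part of the proposed API, except
> for the rte_node_* calls already named in this cover letter.
>
>   #include <stdint.h>
>   #include <rte_graph.h>
>   #include <rte_graph_worker.h>
>
>   /* Hypothetical per-packet lookup returning the next edge. */
>   static inline rte_edge_t
>   classify(void *obj)
>   {
>           (void)obj;
>           return 0;
>   }
>
>   static uint16_t
>   classify_node_process(struct rte_graph *graph, struct rte_node *node,
>                         void **objs, uint16_t nb_objs)
>   {
>           rte_edge_t *hint = (rte_edge_t *)node->ctx; /* a) speculation */
>           rte_edge_t speculated = *hint;
>           void **from = objs;
>           uint16_t held = 0, i;
>
>           for (i = 0; i < nb_objs; i++) {
>                   rte_edge_t next = classify(objs[i]);
>
>                   if (next == speculated) {
>                           held++; /* d) still on the speculated path */
>                           continue;
>                   }
>                   /* d) mismatch: flush what was held so far, send this
>                    * packet alone, and adopt the new edge as speculation. */
>                   if (held)
>                           rte_node_enqueue(graph, node, speculated,
>                                            from, held);
>                   rte_node_enqueue_x1(graph, node, next, objs[i]);
>                   speculated = next;
>                   from = &objs[i + 1];
>                   held = 0;
>           }
>
>           if (held == nb_objs) {
>                   /* f) home run: swap the whole stream, no pointer copies */
>                   rte_node_next_stream_move(graph, node, speculated);
>           } else if (held) {
>                   rte_node_enqueue(graph, node, speculated, from, held);
>           }
>           *hint = speculated; /* h) remember for the next invocation */
>           return nb_objs;
>   }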
> > Jerin Jacob (1): > graph: introduce graph subsystem > > Kiran Kumar K (1): > test: add graph functional tests > > Nithin Dabilpuram (2): > node: add packet processing nodes > example/l3fwd_graph: l3fwd using graph architecture > > Pavan Nikhilesh (1): > test: add graph performance test cases. > > app/test/Makefile | 5 + > app/test/meson.build | 10 +- > app/test/test_graph.c | 820 +++++++++++++++++ > app/test/test_graph_perf.c | 888 +++++++++++++++++++ > config/common_base | 13 + > config/rte_config.h | 4 + > examples/Makefile | 3 + > examples/l3fwd-graph/Makefile | 58 ++ > examples/l3fwd-graph/main.c | 1131 ++++++++++++++++++++++++ > examples/l3fwd-graph/meson.build | 13 + > examples/meson.build | 6 +- > lib/Makefile | 6 + > lib/librte_graph/Makefile | 28 + > lib/librte_graph/graph.c | 578 ++++++++++++ > lib/librte_graph/graph_debug.c | 81 ++ > lib/librte_graph/graph_ops.c | 163 ++++ > lib/librte_graph/graph_populate.c | 224 +++++ > lib/librte_graph/graph_private.h | 113 +++ > lib/librte_graph/graph_stats.c | 396 +++++++++ > lib/librte_graph/meson.build | 11 + > lib/librte_graph/node.c | 419 +++++++++ > lib/librte_graph/rte_graph.h | 277 ++++++ > lib/librte_graph/rte_graph_version.map | 46 + > lib/librte_graph/rte_graph_worker.h | 280 ++++++ > lib/librte_node/Makefile | 30 + > lib/librte_node/ethdev_ctrl.c | 106 +++ > lib/librte_node/ethdev_rx.c | 218 +++++ > lib/librte_node/ethdev_rx.h | 17 + > lib/librte_node/ethdev_rx_priv.h | 45 + > lib/librte_node/ethdev_tx.c | 74 ++ > lib/librte_node/ethdev_tx_priv.h | 33 + > lib/librte_node/ip4_lookup.c | 657 ++++++++++++++ > lib/librte_node/ip4_lookup_priv.h | 17 + > lib/librte_node/ip4_rewrite.c | 340 +++++++ > lib/librte_node/ip4_rewrite_priv.h | 44 + > lib/librte_node/log.c | 14 + > lib/librte_node/meson.build | 8 + > lib/librte_node/node_private.h | 61 ++ > lib/librte_node/null.c | 23 + > lib/librte_node/pkt_drop.c | 26 + > lib/librte_node/rte_node_eth_api.h | 31 + > lib/librte_node/rte_node_ip4_api.h | 33 + > lib/librte_node/rte_node_version.map | 9 + > lib/meson.build | 5 +- > meson.build | 1 + > mk/rte.app.mk | 2 + > 46 files changed, 7362 insertions(+), 5 deletions(-) > create mode 100644 app/test/test_graph.c > create mode 100644 app/test/test_graph_perf.c > create mode 100644 examples/l3fwd-graph/Makefile > create mode 100644 examples/l3fwd-graph/main.c > create mode 100644 examples/l3fwd-graph/meson.build > create mode 100644 lib/librte_graph/Makefile > create mode 100644 lib/librte_graph/graph.c > create mode 100644 lib/librte_graph/graph_debug.c > create mode 100644 lib/librte_graph/graph_ops.c > create mode 100644 lib/librte_graph/graph_populate.c > create mode 100644 lib/librte_graph/graph_private.h > create mode 100644 lib/librte_graph/graph_stats.c > create mode 100644 lib/librte_graph/meson.build > create mode 100644 lib/librte_graph/node.c > create mode 100644 lib/librte_graph/rte_graph.h > create mode 100644 lib/librte_graph/rte_graph_version.map > create mode 100644 lib/librte_graph/rte_graph_worker.h > create mode 100644 lib/librte_node/Makefile > create mode 100644 lib/librte_node/ethdev_ctrl.c > create mode 100644 lib/librte_node/ethdev_rx.c > create mode 100644 lib/librte_node/ethdev_rx.h > create mode 100644 lib/librte_node/ethdev_rx_priv.h > create mode 100644 lib/librte_node/ethdev_tx.c > create mode 100644 lib/librte_node/ethdev_tx_priv.h > create mode 100644 lib/librte_node/ip4_lookup.c > create mode 100644 lib/librte_node/ip4_lookup_priv.h > create mode 100644 lib/librte_node/ip4_rewrite.c > create mode 100644 
lib/librte_node/ip4_rewrite_priv.h > create mode 100644 lib/librte_node/log.c > create mode 100644 lib/librte_node/meson.build > create mode 100644 lib/librte_node/node_private.h > create mode 100644 lib/librte_node/null.c > create mode 100644 lib/librte_node/pkt_drop.c > create mode 100644 lib/librte_node/rte_node_eth_api.h > create mode 100644 lib/librte_node/rte_node_ip4_api.h > create mode 100644 lib/librte_node/rte_node_version.map >