Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling
On 11/06/2013 11:14, David Miller wrote:
> From: Eliezer Tamir
> Date: Tue, 11 Jun 2013 09:49:31 +0300
>
>> I would like to hear opinions on what needs to be added to make this
>> feature complete.
>
> I actually would like to see the Kconfig option go away, that's my
> only request.

OK, I will send a patch for this when I submit the socket option patch.

-Eliezer
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling
On 11/06/2013 10:32, Eric Dumazet wrote:
> On Tue, 2013-06-11 at 09:49 +0300, Eliezer Tamir wrote:
>> I would like to hear opinions on what needs to be added to make this
>> feature complete.
>>
>> The list I have so far is:
>> 1. add a socket option
>
> Yes, please. I do not believe all sockets on the machine are
> candidates for low latency. In fact very few of them should be,
> depending on the number of cpus and/or RX queues.

I have a patch for that, along with a patch for sockperf that I will
use for testing. Once I have tested it some more, I will send it in.

>> 3. support for epoll
>
> For this one, I honestly do not know how to proceed. The epoll Edge
> Trigger model is driven by wakeup events. The wakeups come from frames
> being delivered by the NIC (for UDP/TCP sockets).
>
> If epoll_wait() has to scan the list of epitems to be able to perform
> the llpoll callback, it will be too slow: we come back to the poll()
> model, with O(N) execution time.
>
> Ideally we would have to call back llpoll not from tcp_poll(), but
> right before putting the current thread in wait mode.

We have a few ideas; I will do a POC and see if any of them actually
work.

One thing that would really help is information about the use-cases
people care about:
Number and type of sockets, and how active they are.
How many active Ethernet ports there are.
Whether bulk and low latency traffic can be steered to separate cores
or separated in any other way.

Thanks,
Eliezer
Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling
From: Eliezer Tamir
Date: Tue, 11 Jun 2013 09:49:31 +0300

> I would like to hear opinions on what needs to be added to make this
> feature complete.
>
> The list I have so far is:
> 1. add a socket option
> 2. support for poll/select
> 3. support for epoll

I actually would like to see the Kconfig option go away, that's my only
request.

> Also, would you accept a trailing whitespace cleanup patch for
> fs/select.c?

That's not really for my tree, sorry.
Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling
On Tue, 2013-06-11 at 09:49 +0300, Eliezer Tamir wrote:
> I would like to hear opinions on what needs to be added to make this
> feature complete.
>
> The list I have so far is:
> 1. add a socket option

Yes, please. I do not believe all sockets on the machine are candidates
for low latency. In fact very few of them should be, depending on the
number of cpus and/or RX queues.

> 2. support for poll/select

As long as the cost of llpoll is bounded per poll()/select() call, it
will be ok.

> 3. support for epoll

For this one, I honestly do not know how to proceed. The epoll Edge
Trigger model is driven by wakeup events. The wakeups come from frames
being delivered by the NIC (for UDP/TCP sockets).

If epoll_wait() has to scan the list of epitems to be able to perform
the llpoll callback, it will be too slow: we come back to the poll()
model, with O(N) execution time.

Ideally we would have to call back llpoll not from tcp_poll(), but
right before putting the current thread in wait mode.

> Also, would you accept a trailing whitespace cleanup patch for
> fs/select.c?

This has to be submitted to lkml.
Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling
On 11/06/2013 07:24, David Miller wrote:
> From: Eliezer Tamir
> Date: Tue, 11 Jun 2013 05:25:42 +0300
>
>> Here is the text from the RFC and v2 cover letters, updated and
>> merged. If this is too long, please tell me what you think should
>> be removed.
>
> It's perfect, and since this went through so many iterations I
> included the changelog too.

Thank you.

I would like to hear opinions on what needs to be added to make this
feature complete.

The list I have so far is:
1. add a socket option
2. support for poll/select
3. support for epoll

Also, would you accept a trailing whitespace cleanup patch for
fs/select.c?

-Eliezer
Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling
From: Eliezer Tamir
Date: Tue, 11 Jun 2013 05:25:42 +0300

> Here is the text from the RFC and v2 cover letters, updated and
> merged. If this is too long, please tell me what you think should
> be removed.

It's perfect, and since this went through so many iterations I included
the changelog too.

Thanks!
Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling
On 10/06/2013 23:41, David Miller wrote:
> From: Eliezer Tamir
> Date: Mon, 10 Jun 2013 11:39:30 +0300
>
>> I removed the select/poll patch (was 5/7 in v9) from the set.
>> The rest are the same patches that were in v9.
>
> Reply to this email with some text to put in the merge commit,
> including basic benchmark results, so that I can apply this series.

Sorry, here is the text from the RFC and v2 cover letters, updated and
merged. If this is too long, please tell me what you think should be
removed.

Thanks,
Eliezer

---

This patch set adds the ability for the socket layer code to poll
directly on an Ethernet device's RX queue. This eliminates the cost of
the interrupt and context switch and, with proper tuning, allows us to
get very close to the HW latency.

This is a follow up to Jesse Brandeburg's Kernel Plumbers talk from
last year:
http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf

Patch 1 adds a napi_id and a hashing mechanism to look up a napi by id.
Patch 2 adds an ndo_ll_poll method and the code that supports it.
Patch 3 adds support for busy-polling on UDP sockets.
Patch 4 adds support for TCP.
Patch 5 adds the ixgbe driver code implementing ndo_ll_poll.
Patch 6 adds additional statistics to the ixgbe driver for ndo_ll_poll.
Performance numbers:

         setup                      TCP_RR             UDP_RR
kernel   config    C3/6 rx-usecs    tps  cpu%  S.dem   tps  cpu%  S.dem
patched  optimized on   100         87k  3.13  11.4    94k  3.17  10.7
patched  optimized on   0           71k  3.12  14.0    84k  3.19  12.0
patched  optimized on   adaptive    80k  3.13  12.5    90k  3.46  12.2
patched  typical   on   100         72k  3.13  14.0    79k  3.17  12.8
patched  typical   on   0           60k  2.13  16.5    71k  3.18  14.0
patched  typical   on   adaptive    67k  3.51  16.7    75k  3.36  14.5
3.9      optimized on   adaptive    25k  1.0   12.7    28k  0.98  11.2
3.9      typical   off  0           48k  1.09  7.3     52k  1.11  4.18
3.9      typical   off  adaptive    35k  1.12  4.08    38k  0.65  5.49
3.9      optimized off  adaptive    40k  0.82  4.83    43k  0.70  5.23
3.9      optimized off  0           57k  1.17  4.08    62k  1.04  3.95

Test setup details:
Machines: each with two Intel Xeon 2680 CPUs and X520 (82599) optical
NICs.
Tests: Netperf tcp_rr and udp_rr, 1 byte (round trips per second).
Kernel: unmodified 3.9 and patched 3.9.
Config: typical is derived from RH6.2, optimized is a stripped down
config.
Interrupt coalescing (ethtool rx-usecs) settings: 0=off, 1=adaptive,
100 us.
When C3/6 states were turned on (via BIOS) the performance governor was
used.

These performance numbers were measured with v2 of the patch set.
Performance of the optimized config with an rx-usecs setting of 100
(the first line in the table above) was tracked during the evolution of
the patches and has never varied by more than 1%.

Design:

A global hash table that allows us to look up a struct napi by a unique
id was added. A napi_id field was added both to struct sk_buff and
struct sk. This is used to track which NAPI we need to poll for a
specific socket. The device driver marks every incoming skb with this
id. This is propagated to the sk when the socket is looked up in the
protocol handler.

When the socket code does not find any more data on the socket queue,
it may now call ndo_ll_poll, which will crank the device's RX queue and
feed incoming packets to the stack directly from the context of the
socket.
A sysctl value (net.core.low_latency_poll) controls how many
microseconds we busy-wait before giving up. (Setting it to 0 globally
disables busy-polling.)

Locking:

1. Locking between napi poll and ndo_ll_poll:

Since what needs to be locked between a device's NAPI poll and
ndo_ll_poll is highly device / configuration dependent, we do this
inside the Ethernet driver. For example, when packets for high priority
connections are sent to separate RX queues, you might not need locking
between napi poll and ndo_ll_poll at all.

For ixgbe we only lock the RX queue. ndo_ll_poll does not touch the
interrupt state or the TX queues. (Earlier versions of this patch set
did touch them, but this design is simpler and works better.)

If a queue is actively polled by a socket (on another CPU), napi poll
will not service it, but will wait until the queue can be locked and
cleaned before doing a napi_complete(). If a socket can't lock the
queue because another CPU has it, either from napi or from another
socket polling on the queue, the socket code can busy-wait on the
socket's skb queue. ndo_ll_poll does not give preferential treatment to
data from the calling socket vs. data from others, so if another CPU is
polling, you will see your data on this socket's queue when it arrives.

ndo_ll_poll is called with local BHs disabled, so it won't race on the
same CPU with net_rx_action, which calls the napi poll method.

2. napi_hash:

The napi hash mechanism uses RCU. napi_by_id() must be called under
rcu_read_lock(). After a call to napi_hash_del(), the caller must wait
an RCU grace period before freeing the memory containing the napi
struct.
Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling
From: Eliezer Tamir
Date: Mon, 10 Jun 2013 11:39:30 +0300

> I removed the select/poll patch (was 5/7 in v9) from the set.
> The rest are the same patches that were in v9.
>
> Please consider applying.
>
> Thanks to everyone for their input.

There used to be a really nice, detailed and verbose description of the
goals and general idea of these changes, along with a lot of benchmark
data. Now I don't see it, either here in this posting or in any of the
patch commit messages.

Don't get rid of stuff like that; for a set of changes of this
magnitude you can basically consider such detailed descriptions and
information mandatory.

Reply to this email with some text to put in the merge commit,
including basic benchmark results, so that I can apply this series.

Thanks.