Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

2013-06-11 Thread Eliezer Tamir

On 11/06/2013 11:14, David Miller wrote:
> From: Eliezer Tamir 
> Date: Tue, 11 Jun 2013 09:49:31 +0300
> 
>> I would like to hear opinions on what needs to be added to make this
>> feature complete.
> 
> I actually would like to see the Kconfig option go away, that's
> my only request.


OK,

I will send a patch for this when I submit the socket option patch.

-Eliezer
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

2013-06-11 Thread Eliezer Tamir

On 11/06/2013 10:32, Eric Dumazet wrote:
> On Tue, 2013-06-11 at 09:49 +0300, Eliezer Tamir wrote:
> 
>> I would like to hear opinions on what needs to be added to make this
>> feature complete.
>> 
>> The list I have so far is:
>> 1. add a socket option
> 
> Yes, please. I do not believe all sockets on the machine are candidates
> for low latency. In fact very few of them should be, depending on the
> number of CPUs and/or RX queues.

I have a patch for that, along with a patch for sockperf that I will
use for testing.
Once I have tested it some more, I will send it in.
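(For illustration only, a minimal user-space sketch of what the per-socket knob could look like. It assumes the option lands as a SOL_SOCKET option carrying a busy-poll budget in microseconds, with the name SO_BUSY_POLL and number 46 as in later mainline kernels; at the time of this thread neither name nor number existed, so both are assumptions, not part of this patch set.)

```python
import socket

# Assumed values: the per-socket knob later landed upstream as
# SO_BUSY_POLL (option 46 at SOL_SOCKET, Linux 3.11+). Python's socket
# module does not export a named constant for it.
SO_BUSY_POLL = 46

def set_busy_poll(sock, usecs):
    """Ask for up to `usecs` microseconds of busy-polling on this socket.

    Raising the value may require CAP_NET_ADMIN on some kernels.
    """
    sock.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, usecs)

def get_busy_poll(sock):
    """Read back the socket's busy-poll budget (0 = use the global default)."""
    return sock.getsockopt(socket.SOL_SOCKET, SO_BUSY_POLL)
```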


>> 3. support for epoll
> 
> For this one, I honestly do not know how to proceed.
> 
> epoll's edge-triggered model is driven by wakeup events.
> 
> The wakeups come from frames being delivered by the NIC (for UDP/TCP
> sockets).
> 
> If epoll_wait() has to scan the list of epitems to be able to perform
> the llpoll callback, it will be too slow: we are back to the poll()
> model, with O(N) execution time.
> 
> Ideally we would call back into llpoll not from tcp_poll(), but right
> before putting the current thread into wait mode.

We have a few ideas; I will do a POC and see if any of them actually
work.

One thing that would really help is information about the use cases
people care about:

How many sockets, of which type, and how active are they?
How many active Ethernet ports are there?
Can bulk and low-latency traffic be steered to separate cores, or
separated in any other way?

Thanks,
Eliezer


Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

2013-06-11 Thread David Miller
From: Eliezer Tamir 
Date: Tue, 11 Jun 2013 09:49:31 +0300

> I would like to hear opinions on what needs to be added to make this
> feature complete.
> 
> The list I have so far is:
> 1. add a socket option
> 2. support for poll/select
> 3. support for epoll

I actually would like to see the Kconfig option go away, that's
my only request.

> Also, would you accept a trailing whitespace cleanup patch for
> fs/select.c?

That's not really for my tree, sorry.


Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

2013-06-11 Thread Eric Dumazet
On Tue, 2013-06-11 at 09:49 +0300, Eliezer Tamir wrote:

> I would like to hear opinions on what needs to be added to make this
> feature complete.
> 
> The list I have so far is:
> 1. add a socket option

Yes, please. I do not believe all sockets on the machine are candidates
for low latency. In fact very few of them should be, depending on the
number of CPUs and/or RX queues.

> 2. support for poll/select

As long as the cost of llpoll is bounded per poll()/select() call it
will be ok.

> 3. support for epoll

For this one, I honestly do not know how to proceed.

epoll's edge-triggered model is driven by wakeup events.

The wakeups come from frames being delivered by the NIC (for UDP/TCP
sockets).

If epoll_wait() has to scan the list of epitems to be able to perform
the llpoll callback, it will be too slow: we are back to the poll()
model, with O(N) execution time.

Ideally we would call back into llpoll not from tcp_poll(), but right
before putting the current thread into wait mode.
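(To make the wakeup-driven model concrete, here is a minimal Linux edge-triggered epoll example; Python's `select.epoll` wraps the epoll_create/epoll_ctl/epoll_wait syscalls. This only illustrates the ET semantics described above, not the proposed llpoll hook.)

```python
import os
import select

# Edge-triggered epoll reports readiness once per wakeup event (e.g. a
# frame delivered by the NIC), instead of re-scanning every watched fd.
r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN | select.EPOLLET)

os.write(w, b"x")                # the "frame arrival" edge
events = ep.poll(timeout=1)      # woken by the event: O(ready), not O(N)
assert events and events[0][0] == r

os.read(r, 1)                    # drain the data
assert ep.poll(timeout=0) == []  # ET: no new edge, so no new event

ep.close()
os.close(r)
os.close(w)
```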

> 
> Also, would you accept a trailing whitespace cleanup patch for
> fs/select.c?

This has to be submitted to lkml





Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

2013-06-11 Thread Eliezer Tamir

On 11/06/2013 07:24, David Miller wrote:
> From: Eliezer Tamir 
> Date: Tue, 11 Jun 2013 05:25:42 +0300
> 
>> Here is the text from the RFC and v2 cover letters, updated and
>> merged.  If this is too long, please tell me what you think should
>> be removed.
> 
> It's perfect, and since this went through so many iterations I
> included the changelog too.


Thank you.

I would like to hear opinions on what needs to be added to make this
feature complete.

The list I have so far is:
1. add a socket option
2. support for poll/select
3. support for epoll

Also, would you accept a trailing whitespace cleanup patch for
fs/select.c?

-Eliezer


Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

2013-06-10 Thread David Miller
From: Eliezer Tamir 
Date: Tue, 11 Jun 2013 05:25:42 +0300

> Here is the text from the RFC and v2 cover letters, updated and
> merged.  If this is too long, please tell me what you think should
> be removed.

It's perfect, and since this went through so many iterations I
included the changelog too.

Thanks!


Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

2013-06-10 Thread Eliezer Tamir

On 10/06/2013 23:41, David Miller wrote:
> From: Eliezer Tamir 
> Date: Mon, 10 Jun 2013 11:39:30 +0300
> 
>> I removed the select/poll patch (was 5/7 in v9) from the set.
>> The rest are the same patches that were in v9.
> 
> Reply to this email with some text to put in the merge commit,
> including basic benchmark results, so that I can apply this series.


sorry,

Here is the text from the RFC and v2 cover letters, updated and merged.
If this is too long, please tell me what you think should be removed.

Thanks,
Eliezer

---

This patch set adds the ability for the socket layer code to
poll directly on an Ethernet device's RX queue.
This eliminates the cost of the interrupt and context switch,
and with proper tuning allows us to get very close to the HW latency.

This is a follow-up to Jesse Brandeburg's Kernel Plumbers talk from
last year:

http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-Low-Latency-Sockets-slides-brandeburg.pdf

Patch 1 adds a napi_id and a hashing mechanism to look up a napi by id.
Patch 2 adds an ndo_ll_poll method and the code that supports it.
Patch 3 adds support for busy-polling on UDP sockets.
Patch 4 adds support for TCP.
Patch 5 adds the ixgbe driver code implementing ndo_ll_poll.
Patch 6 adds additional statistics to the ixgbe driver for ndo_ll_poll.


Performance numbers:
                             TCP_RR             UDP_RR
kernel  config    C3/6 rx-usecs  tps  cpu% S.dem   tps  cpu% S.dem
patched optimized on   100       87k  3.13 11.4    94k  3.17 10.7
patched optimized on   0         71k  3.12 14.0    84k  3.19 12.0
patched optimized on   adaptive  80k  3.13 12.5    90k  3.46 12.2
patched typical   on   100       72k  3.13 14.0    79k  3.17 12.8
patched typical   on   0         60k  2.13 16.5    71k  3.18 14.0
patched typical   on   adaptive  67k  3.51 16.7    75k  3.36 14.5
3.9     optimized on   adaptive  25k  1.0  12.7    28k  0.98 11.2
3.9     typical   off  0         48k  1.09  7.3    52k  1.11  4.18
3.9     typical   off  adaptive  35k  1.12  4.08   38k  0.65  5.49
3.9     optimized off  adaptive  40k  0.82  4.83   43k  0.70  5.23
3.9     optimized off  0         57k  1.17  4.08   62k  1.04  3.95

Test setup details:
Machines: each with two Intel Xeon 2680 CPUs and X520 (82599) optical NICs
Tests: Netperf tcp_rr and udp_rr, 1 byte (round trips per second)
Kernel: unmodified 3.9 and patched 3.9
Config: typical is derived from RH6.2, optimized is a stripped-down config.
Interrupt coalescing (ethtool rx-usecs) settings: 0=off, 1=adaptive, 100 us
When C3/6 states were turned on (via BIOS) the performance governor was
used.


These performance numbers were measured with v2 of the patch set.
Performance of the optimized config with an rx-usecs setting of 100
(the first line in the table above) was tracked during the evolution
of the patches and has never varied by more than 1%.


Design:
A global hash table that allows us to look up a struct napi by a
unique id was added.

A napi_id field was added to both struct sk_buff and struct sock.
This is used to track which NAPI context we need to poll for a
specific socket.

The device driver marks every incoming skb with this id.
This is propagated to the sk when the socket is looked up in the
protocol handler.
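(A hypothetical user-space analogue of this plumbing, sketched for illustration: a dict stands in for the kernel's RCU-protected global hash, and every name here is illustrative rather than the kernel's own.)

```python
# Global hash mapping napi_id -> NAPI context, as described above.
napi_hash = {}

class Napi:
    """One NAPI context (roughly, one RX queue)."""
    def __init__(self, napi_id):
        self.id = napi_id
        napi_hash[napi_id] = self       # like napi_hash_add()

class Skb:
    """An incoming packet, tagged by the driver with its queue's id."""
    def __init__(self, napi_id, payload):
        self.napi_id = napi_id
        self.payload = payload

class Sock:
    """A socket remembering which NAPI context last fed it data."""
    def __init__(self):
        self.napi_id = 0
    def deliver(self, skb):
        # Propagated to the socket when it is looked up in the
        # protocol handler.
        self.napi_id = skb.napi_id

def napi_by_id(napi_id):
    # In the kernel this lookup must run under rcu_read_lock().
    return napi_hash.get(napi_id)
```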


When the socket code does not find any more data on the socket queue,
it may now call ndo_ll_poll, which will crank the device's RX queue
and feed incoming packets to the stack directly from the context of
the socket.

A sysctl value (net.core.low_latency_poll) controls how many
microseconds we busy-wait before giving up. (Setting it to 0 globally
disables busy-polling.)
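(The loop shape this describes can be sketched as follows; `poll_queue` is a plain function standing in for ndo_ll_poll, and the budget semantics mirror the sysctl above. A sketch, not the kernel implementation.)

```python
import time

def busy_poll(poll_queue, budget_usecs):
    """Busy-wait on poll_queue() for at most budget_usecs microseconds.

    poll_queue stands in for ndo_ll_poll: it cranks the device's RX
    queue and returns a packet if one arrived, else None. A budget of
    0 disables busy-polling entirely, mirroring a sysctl value of 0.
    """
    deadline = time.monotonic() + budget_usecs / 1e6
    while budget_usecs and time.monotonic() < deadline:
        pkt = poll_queue()
        if pkt is not None:
            return pkt
    return None
```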



Locking:

1. Locking between napi poll and ndo_ll_poll:
Since what needs to be locked between a device's NAPI poll and
ndo_ll_poll is highly device/configuration dependent, we do this
inside the Ethernet driver.
For example, when packets for high-priority connections are sent to
separate RX queues, you might not need locking between napi poll and
ndo_ll_poll at all.

For ixgbe we only lock the RX queue.
ndo_ll_poll does not touch the interrupt state or the TX queues.
(Earlier versions of this patch set did touch them, but this design is
simpler and works better.)

If a queue is actively polled by a socket (on another CPU), napi poll
will not service it, but will wait until the queue can be locked and
cleaned before doing a napi_complete().
If a socket can't lock the queue because another CPU has it, either
from napi or from another socket polling on the queue, the socket code
can busy-wait on the socket's skb queue.

Ndo_ll_poll does not give preferential treatment to data from the
calling socket vs. data from others, so if another CPU is polling,
you will see your data on this socket's queue when it arrives.

Ndo_ll_poll is called with local BHs disabled, so it won't race on
the same CPU with net_rx_action, which calls the napi poll method.

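(A toy model of the lock discipline just described, using a non-blocking trylock; `napi_poll` and `socket_busy_poll` are illustrative stand-ins for the driver and socket paths, not kernel code.)

```python
import threading

class RxQueue:
    """Toy RX queue guarded by a per-queue lock, as in the ixgbe design."""
    def __init__(self):
        self.lock = threading.Lock()
        self.packets = []

def napi_poll(q):
    # NAPI path: if a socket (on another CPU) owns the queue, do not
    # service it now; the real driver retries before napi_complete().
    if not q.lock.acquire(blocking=False):
        return "deferred"
    try:
        return q.packets.pop(0) if q.packets else "clean"
    finally:
        q.lock.release()

def socket_busy_poll(q, sk_queue):
    # Socket path: if the queue lock is held elsewhere, fall back to
    # busy-waiting on the socket's own skb queue instead.
    if not q.lock.acquire(blocking=False):
        return sk_queue.pop(0) if sk_queue else None
    try:
        return q.packets.pop(0) if q.packets else None
    finally:
        q.lock.release()
```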
2. Napi_hash
The napi hash mechanism uses RCU.
napi_by_id() must be called under rcu_read_lock().
After a call to napi_hash_del(), 

Re: [PATCH v10 net-next 0/6] net: low latency Ethernet device polling

2013-06-10 Thread David Miller
From: Eliezer Tamir 
Date: Mon, 10 Jun 2013 11:39:30 +0300

> I removed the select/poll patch (was 5/7 in v9) from the set.
> The rest are the same patches that were in v9.
> 
> Please consider applying.
> 
> Thanks to everyone for their input.

There used to be a really nice, detailed, and verbose description of
the goals and general idea of these changes, along with a lot of
benchmark data.

Now I don't see it, either here in this posting or in any of the
patch commit messages.

Don't get rid of stuff like that; for a set of changes of this
magnitude you can basically consider such detailed descriptions
and information mandatory.

Reply to this email with some text to put in the merge commit,
including basic benchmark results, so that I can apply this series.

Thanks.

