Re: use TCP_NODELAY on TCP sockets?

2024-05-16 Thread Job Snijders via Bird-users
On Thu, May 16, 2024 at 12:55:38PM +0200, Ondrej Zajicek wrote:
> Yeah, I think that using TCP_NODELAY for BGP makes sense, considering
> there is already non-trivial framing and we write individual BGP
> messages with one write().

I suspect it might be better for throughput to send multiple BGP
messages with a single writev() or sendmsg() call.

Kind regards,

Job


Re: use TCP_NODELAY on TCP sockets?

2024-05-16 Thread Maria Matejka via Bird-users
Hello Douglas,

Just a really quick response: we aren't going to implement BGP over QUIC
at all, at least as long as TLS is a required part of QUIC.

There are some thoughts about BGP 5 being encoded as CBOR and sent over
something like QUIC, but in any case it doesn't make any sense to
introduce QUIC to BGP now.

Maria

On Thu, May 16, 2024 at 09:45:56AM -0300, Douglas Fischer wrote:

> I'm not a programmer, and I know almost nothing about it...
> 
> But reading this thread, I remembered this draft about BGP-over-QUIC.
> https://datatracker.ietf.org/doc/draft-retana-idr-bgp-quic/
> 
> I'm not sure what I'm going to say, but I have the impression that the
> session closure control method that will be chosen may have something to do
> with this BoQ idea.
> 
> I don't even know if what I'm saying makes any sense or not.
> But if it does, it may be something that you have to take into consideration
> when choosing the method.
> 
> And besides, there are many other "ifs"...
> We don't even know if this Draft will actually become an RFC. And if it
> becomes RFC, we don't even know if it will have traction.
> We also don't know if the BIRD project will adhere to this protocol, etc...
> 
> But, I thought it was opportune to mention.
> 
> On Thu, 16 May 2024 at 07:59, Ondrej Zajicek wrote:
> 
> > On Wed, May 15, 2024 at 06:37:18PM +0200, Job Snijders via Bird-users
> > wrote:
> >
> > > Dear BIRD people,
> > >
> > > On most systems RFC 896 TCP congestion control is used, also known as
> > > "Nagle's algorithm". This algorithm is intended to help coalesce
> > > consecutive small packets from userland applications (like BIRD) into a
> > > single larger TCP packet. The idea being it reduces bandwidth because
> > > there is less TCP overhead if data is bundled into fewer packets.
> > > ...
> > > I think using TCP_NODELAY is interesting to consider, because it seems
> > > sensible to try to deliver BGP messages as fast as possible. OpenBGPD
> > > and FRR set the TCP_NODELAY socket option.
> >
> > Hi
> >
> > Yeah, I think that using TCP_NODELAY for BGP makes sense, considering
> > there is already non-trivial framing and we write individual BGP messages
> > with one write().
> >
> > --
> > Elen sila lumenn' omentielvo
> >
> > Ondrej 'Santiago' Zajicek (email: santi...@crfreenet.org)
> > "To err is human -- to blame it on a computer is even more so."
> >
> 
> 
> -- 
> Douglas Fernando Fischer
> Control and Automation Engineer

-- 
Maria Matejka (she/her) | BIRD Team Leader | CZ.NIC, z.s.p.o.


Re: use TCP_NODELAY on TCP sockets?

2024-05-16 Thread Douglas Fischer
I'm not a programmer, and I know almost nothing about it...

But reading this thread, I remembered this draft about BGP-over-QUIC.
https://datatracker.ietf.org/doc/draft-retana-idr-bgp-quic/

I'm not sure what I'm going to say, but I have the impression that the
session closure control method that will be chosen may have something to do
with this BoQ idea.

I don't even know if what I'm saying makes any sense or not.
But if it does, it may be something that you have to take into consideration
when choosing the method.

And besides, there are many other "ifs"...
We don't even know if this Draft will actually become an RFC. And if it
becomes RFC, we don't even know if it will have traction.
We also don't know if the BIRD project will adhere to this protocol, etc...

But, I thought it was opportune to mention.

On Thu, 16 May 2024 at 07:59, Ondrej Zajicek wrote:

> On Wed, May 15, 2024 at 06:37:18PM +0200, Job Snijders via Bird-users
> wrote:
> > Dear BIRD people,
> >
> > On most systems RFC 896 TCP congestion control is used, also known as
> > "Nagle's algorithm". This algorithm is intended to help coalesce
> > consecutive small packets from userland applications (like BIRD) into a
> > single larger TCP packet. The idea being it reduces bandwidth because
> > there is less TCP overhead if data is bundled into fewer packets.
> > ...
> > I think using TCP_NODELAY is interesting to consider, because it seems
> > sensible to try to deliver BGP messages as fast as possible. OpenBGPD
> > and FRR set the TCP_NODELAY socket option.
>
> Hi
>
> Yeah, I think that using TCP_NODELAY for BGP makes sense, considering
> there is already non-trivial framing and we write individual BGP messages
> with one write().
>
> --
> Elen sila lumenn' omentielvo
>
> Ondrej 'Santiago' Zajicek (email: santi...@crfreenet.org)
> "To err is human -- to blame it on a computer is even more so."
>


-- 
Douglas Fernando Fischer
Control and Automation Engineer


Re: use TCP_NODELAY on TCP sockets?

2024-05-16 Thread Erin Shepherd
send(..., MSG_MORE)
send(..., 0)

Is roughly equivalent to
setsockopt(..., TCP_CORK, 1)
send(..., 0)
send(..., 0)
setsockopt(..., TCP_CORK, 0)

In cases where you know that you're going to write more but you want to be able 
to discard the user space buffer, MSG_MORE is useful. If holding onto the 
buffers is easy so they can then be passed to sendmsg(), then you may as well 
do that. 

(And of course you can combine both techniques)
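
Spelled out with the full argument lists, the two variants above might
look like the following Linux-only sketch. The function names are
hypothetical, and short writes are not handled:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Variant 1: flag the first write as "more to come" (Linux-specific). */
int
send_two_msg_more(int fd, const void *a, size_t alen,
                  const void *b, size_t blen)
{
  if (send(fd, a, alen, MSG_MORE) < 0)
    return -1;
  return send(fd, b, blen, 0) < 0 ? -1 : 0;
}

/* Variant 2: cork the socket around both writes (Linux-specific). */
int
send_two_corked(int fd, const void *a, size_t alen,
                const void *b, size_t blen)
{
  int on = 1, off = 0, rv = 0;

  if (setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on)) < 0)
    return -1;

  if (send(fd, a, alen, 0) < 0 || send(fd, b, blen, 0) < 0)
    rv = -1;

  /* Always uncork, so queued data is not held back after an error. */
  if (setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off)) < 0)
    rv = -1;

  return rv;
}
```

The MSG_MORE variant costs two system calls instead of four, which is
one reason to prefer it when the second write is known to follow.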

- Erin

On Thu, 16 May 2024, at 01:32, Job Snijders wrote:
> On Wed, May 15, 2024 at 09:09:47PM +0200, Erin Shepherd wrote:
> > It seems absent from the BSDs, but on Linux you can pass the MSG_MORE
> > flag to send() to override TCP_NODELAY for a specific write
> 
> Am I understanding correctly this is a variant on TCP_NOPUSH/TCP_CORK?
> "more data is coming, don't push the send button yet!"
> 
> In OpenBGPD, TCP_NODELAY is set on the socket (a socket option available
> on all platforms, I think?), and then all data is coalesced into
> sendmsg(), no need for 'corking'. From my limited testing it seems a
> full routing table should fit in ~41,000 TCP packets.
> 
> BIRD has a code path sk_sendmsg()->sendmsg() called from
> sk_maybe_write(); but based on my limited testing I'm not sure this path is
> followed in all cases, because I see way more than 41K packets for a
> full table feed (with TCP_NODELAY enabled).
> 
> Perhaps there are two separate questions here:
> 
> - are BGP messages (slightly) delayed because of TCP_NODELAY not being
>   set? (I think yes)
> - are BGP messages as efficiently coalesced into as few TCP packets as
>   possible? (with TCP_NODELAY set, I am not sure)
> 
> Kind regards,
> 
> Job
> 
> ps. To clarify why I started this thread: last week I fell into the TCP
> subsystem rabbit hole: why are things the way they are? I started
> auditing various programs related to my $dayjob and thought it would be
> good to open a conversation with the BIRD developer community. My goal
> is not necessarily to get this patch 'as-is' merged, but to learn from
> and with friendly and respected BGP developers.
> 

Re: use TCP_NODELAY on TCP sockets?

2024-05-16 Thread Ondrej Zajicek
On Wed, May 15, 2024 at 06:37:18PM +0200, Job Snijders via Bird-users wrote:
> Dear BIRD people,
> 
> On most systems RFC 896 TCP congestion control is used, also known as
> "Nagle's algorithm". This algorithm is intended to help coalesce
> consecutive small packets from userland applications (like BIRD) into a
> single larger TCP packet. The idea being it reduces bandwidth because
> there is less TCP overhead if data is bundled into fewer packets.
> ...
> I think using TCP_NODELAY is interesting to consider, because it seems
> sensible to try to deliver BGP messages as fast as possible. OpenBGPD
> and FRR set the TCP_NODELAY socket option.

Hi

Yeah, I think that using TCP_NODELAY for BGP makes sense, considering
there is already non-trivial framing and we write individual BGP messages
with one write().

-- 
Elen sila lumenn' omentielvo

Ondrej 'Santiago' Zajicek (email: santi...@crfreenet.org)
"To err is human -- to blame it on a computer is even more so."


Re: use TCP_NODELAY on TCP sockets?

2024-05-15 Thread Job Snijders via Bird-users
On Wed, May 15, 2024 at 09:09:47PM +0200, Erin Shepherd wrote:
> It seems absent from the BSDs, but on Linux you can pass the MSG_MORE
> flag to send() to override TCP_NODELAY for a specific write

Am I understanding correctly this is a variant on TCP_NOPUSH/TCP_CORK?
"more data is coming, don't push the send button yet!"

In OpenBGPD, TCP_NODELAY is set on the socket (a socket option available
on all platforms, I think?), and then all data is coalesced into
sendmsg(), no need for 'corking'. From my limited testing it seems a
full routing table should fit in ~41,000 TCP packets.

BIRD has a code path sk_sendmsg()->sendmsg() called from
sk_maybe_write(); but based on my limited testing I'm not sure this path is
followed in all cases, because I see way more than 41K packets for a
full table feed (with TCP_NODELAY enabled).

Perhaps there are two separate questions here:

- are BGP messages (slightly) delayed because of TCP_NODELAY not being
  set? (I think yes)
- are BGP messages as efficiently coalesced into as few TCP packets as
  possible? (with TCP_NODELAY set, I am not sure)

Kind regards,

Job

ps. To clarify why I started this thread: last week I fell into the TCP
subsystem rabbit hole: why are things the way they are? I started
auditing various programs related to my $dayjob and thought it would be
good to open a conversation with the BIRD developer community. My goal
is not necessarily to get this patch 'as-is' merged, but to learn from
and with friendly and respected BGP developers.


Re: use TCP_NODELAY on TCP sockets?

2024-05-15 Thread Erin Shepherd
It seems absent from the BSDs, but on Linux you can pass the MSG_MORE flag to 
send() to override TCP_NODELAY for a specific write

On Wed, 15 May 2024, at 19:40, Job Snijders via Bird-users wrote:
> Dear Marco,
> 
> On Wed, 15 May 2024 at 19:27, Marco d'Itri  wrote:
>> On May 15, Job Snijders via Bird-users  wrote:
>> 
>> > But please be very careful in considering this patch, because it does
>> > introduce some subtle changes in the on-the-wire behaviour of BIRD. For
>> > example, without this patch an UPDATE for a handful of routes and the
>> > End-of-RIB marker might end up in the same TCP packet (if this fits);
>> > but with this patch, the End-of-RIB marker ends up in its own TCP
>> > packet. As things are today, setting TCP_NODELAY will increase the
> 
>> 
>> I think that this can be fixed easily with TCP_CORK.
>> Basically, with TCP_NODELAY + TCP_CORK you can have the optimal balance 
>> of latency and overhead without using writev.
> 
> 
> Yes, thanks for sharing this suggestion.
> 
> Note that TCP_CORK is a Linux-only feature (on FreeBSD it seems aliased to 
> TCP_NOPUSH, but I might be misunderstanding that code). There are subtle 
> differences between NOPUSH and CORK: resetting CORK triggers a flush, whereas 
> resetting NOPUSH does not.
> 
> It is of course reasonable to optimize one platform and not others, we work 
> with the tools that we have - but a portable approach (using writev()?) would 
> seem attractive to me :-)
> 
> Kind regards,
> 
> Job
>> 

Re: use TCP_NODELAY on TCP sockets?

2024-05-15 Thread Job Snijders via Bird-users
Dear Marco,

On Wed, 15 May 2024 at 19:27, Marco d'Itri  wrote:

> On May 15, Job Snijders via Bird-users  wrote:
>
> > But please be very careful in considering this patch, because it does
> > introduce some subtle changes in the on-the-wire behaviour of BIRD. For
> > example, without this patch an UPDATE for a handful of routes and the
> > End-of-RIB marker might end up in the same TCP packet (if this fits);
> > but with this patch, the End-of-RIB marker ends up in its own TCP
> > packet. As things are today, setting TCP_NODELAY will increase the



> I think that this can be fixed easily with TCP_CORK.
> Basically, with TCP_NODELAY + TCP_CORK you can have the optimal balance
> of latency and overhead without using writev.



Yes, thanks for sharing this suggestion.

Note that TCP_CORK is a Linux-only feature (on FreeBSD it seems aliased to
TCP_NOPUSH, but I might be misunderstanding that code). There are subtle
differences between NOPUSH and CORK: resetting CORK triggers a flush,
whereas resetting NOPUSH does not.

It is of course reasonable to optimize one platform and not others, we work
with the tools that we have - but a portable approach (using writev()?)
would seem attractive to me :-)
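
For comparison, a per-platform wrapper could be sketched as below. The
tcp_set_cork() name is hypothetical, and the sketch only selects
whichever option the headers define; it does not paper over the
flush-on-reset difference described above:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: hold back (on = 1) or release (on = 0) partial
 * segments, using whichever option the platform defines.
 * Returns 0 on success, -1 on error or when neither option exists.
 */
int
tcp_set_cork(int fd, int on)
{
#if defined(TCP_CORK)
  return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
#elif defined(TCP_NOPUSH)
  /* Per the note above, clearing TCP_NOPUSH may not flush queued data. */
  return setsockopt(fd, IPPROTO_TCP, TCP_NOPUSH, &on, sizeof(on));
#else
  (void) fd;
  (void) on;
  return -1;
#endif
}
```

A wrapper like this still leaves the caller with platform-dependent
behaviour, which is part of the appeal of batching with writev().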

Kind regards,

Job

>


Re: use TCP_NODELAY on TCP sockets?

2024-05-15 Thread Marco d'Itri
On May 15, Job Snijders via Bird-users  wrote:

> But please be very careful in considering this patch, because it does
> introduce some subtle changes in the on-the-wire behaviour of BIRD. For
> example, without this patch an UPDATE for a handful of routes and the
> End-of-RIB marker might end up in the same TCP packet (if this fits);
> but with this patch, the End-of-RIB marker ends up in its own TCP
> packet. As things are today, setting TCP_NODELAY will increase the

I think that this can be fixed easily with TCP_CORK.
Basically, with TCP_NODELAY + TCP_CORK you can have the optimal balance 
of latency and overhead without using writev.

-- 
ciao,
Marco


use TCP_NODELAY on TCP sockets?

2024-05-15 Thread Job Snijders via Bird-users
Dear BIRD people,

On most systems RFC 896 TCP congestion control is used, also known as
"Nagle's algorithm". This algorithm is intended to help coalesce
consecutive small packets from userland applications (like BIRD) into a
single larger TCP packet. The idea being it reduces bandwidth because
there is less TCP overhead if data is bundled into fewer packets.

This happens at the cost of an increase in end-to-end latency: the
sender will locally queue up data (in the operating system's TCP
buffers) until it either receives a TCP Ack from the remote side for all
previously sent data, or sufficient additional data piled up to send out
a full-sized TCP segment.
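
The rule described above can be written as a small predicate. This is a
simplified model for illustration only: the struct and its field names
are made up here, and real TCP stacks also consider the send and
congestion windows:

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified sender state; the field names are illustrative. */
struct nagle_state {
  size_t queued;   /* bytes waiting in the send buffer */
  size_t unacked;  /* bytes sent but not yet acknowledged */
  size_t mss;      /* maximum segment size */
  bool nodelay;    /* true if TCP_NODELAY is set */
};

/* Returns true if the sender may transmit a segment right now. */
bool
nagle_may_send(const struct nagle_state *s)
{
  if (s->queued == 0)
    return false;          /* nothing to send */
  if (s->nodelay)
    return true;           /* Nagle disabled: send immediately */
  if (s->queued >= s->mss)
    return true;           /* a full-sized segment is ready */
  return s->unacked == 0;  /* small data waits for outstanding ACKs */
}
```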

In my simple testing setup (announce 255 routes to a peer) with this
patch it takes ~ 0.007478 seconds after SYN to send the UPDATE message
out. Without this patch, it takes ~ 0.206955 seconds to send the UPDATE
out (perhaps because of negative interaction between Nagle's algorithm
and Delayed Acks?).

I think using TCP_NODELAY is interesting to consider, because it seems
sensible to try to deliver BGP messages as fast as possible. OpenBGPD
and FRR set the TCP_NODELAY socket option.

But please be very careful in considering this patch, because it does
introduce some subtle changes in the on-the-wire behaviour of BIRD. For
example, without this patch an UPDATE for a handful of routes and the
End-of-RIB marker might end up in the same TCP packet (if this fits);
but with this patch, the End-of-RIB marker ends up in its own TCP
packet. As things are today, setting TCP_NODELAY will increase the
average number of packets sent to peers. Packing multiple BGP messages
into fewer TCP frames could perhaps be reintroduced by using writev()
instead of write(), when it is known a bunch of messages have been
queued up for a peer.

Another subtlety is that the effect of Nagle's algorithm depends on the
latency between the local and the remote system. Nagle's algorithm
buffers data locally as long as the remote side has not yet acked all
previously sent data, so the chance of message coalescence increases
with the distance (latency) to the peer. Without this patch BIRD will
send more individual TCP packets to peers that are close by than to
peers that are far away. With TCP_NODELAY applied, all peers will
receive the same number of TCP packets, regardless of latency.
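
The latency dependence can be illustrated with a toy model. It assumes
the application writes one message per time tick and that a single ACK
covering all sent data arrives exactly rtt_ticks after the most recent
segment. Delayed ACKs and window limits are ignored, and the function
name is hypothetical:

```c
#include <stdbool.h>
#include <stddef.h>

/* Count the segments a sender emits for n_msgs writes of msg_len bytes. */
size_t
nagle_count_segments(size_t n_msgs, size_t msg_len, size_t mss,
                     size_t rtt_ticks, bool nodelay)
{
  size_t segments = 0, queued = 0, ack_at = 0;
  bool in_flight = false;

  if (mss == 0)
    return 0;                        /* guard against a degenerate MSS */

  for (size_t t = 0; t < n_msgs || queued > 0; t++) {
    if (t < n_msgs)
      queued += msg_len;             /* application writes one message */

    if (in_flight && t >= ack_at)
      in_flight = false;             /* ACK for all sent data arrived */

    while (queued >= mss) {          /* full segments always go out */
      queued -= mss;
      segments++;
      in_flight = true;
      ack_at = t + rtt_ticks;
    }

    if (queued > 0 && (nodelay || !in_flight)) {
      queued = 0;                    /* flush the partial segment */
      segments++;
      in_flight = true;
      ack_at = t + rtt_ticks;
    }
  }

  return segments;
}
```

With 100 writes of 19 bytes (the size of a KEEPALIVE) and an MSS of
1460, this model yields 100 segments at an RTT of 1 tick, 11 segments at
an RTT of 10 ticks, and 100 segments again once nodelay is set.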

Looking forward to hearing your thoughts!

Kind regards,

Job

 sysdep/unix/io.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/sysdep/unix/io.c b/sysdep/unix/io.c
index 9b499020..f9746cb7 100644
--- a/sysdep/unix/io.c
+++ b/sysdep/unix/io.c
@@ -946,7 +946,7 @@ sock_new(pool *p)
 static int
 sk_setup(sock *s)
 {
-  int y = 1;
+  int y = 1, nodelay = 1;
   int fd = s->fd;
 
   if (s->type == SK_SSH_ACTIVE)
@@ -1048,6 +1048,10 @@ sk_setup(sock *s)
       return -1;
   }
 
+  if ((s->type == SK_TCP_PASSIVE) || (s->type == SK_TCP_ACTIVE))
+    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay)) < 0)
+      ERR("TCP_NODELAY");
+
   /* Must be after sk_set_tos4() as setting ToS on Linux also mangles priority */
   if (s->priority >= 0)
     if (sk_set_priority(s, s->priority) < 0)