On Mon, Apr 05, 2021 at 01:33:48PM -0400, Alexander Ahring Oder Aring wrote:
> Hi,
>
> On Sat, Apr 3, 2021 at 11:34 AM Alexander Ahring Oder Aring
> <aahri...@redhat.com> wrote:
> >
> ...
> > >
> > > It seems to me that the only time DLM might need to retransmit data, is
> > > when recovering from a connection failure. So why can't we just resend
> > > unacknowledged data at reconnection time? That'd probably simplify the
> > > code a lot (no need to maintain a retransmission timeout on TX, no need
> > > to handle sequence numbers that are in the future on RX).
> > >
> >
> > I can try to remove the timer, timeout and do the above approach to
> > retransmit at reconnect. Then I test it again and I will report back
> > to see if it works or why we have other problems.
> >
>
> I have an implementation of this running and so far I don't see any problems.
>
> > > Also, couldn't we set the DLM sequence numbers in
> > > dlm_midcomms_commit_buffer_3_2() rather than using a callback function
> > > in dlm_lowcomms_new_buffer()?
> > >
> ...
> >
> > Yes, I looked into TCP_REPAIR at first and I agree it can be used to
> > solve this problem. However TCP_REPAIR can be used as a part of a more
> > generic solution, there needs to be something "additional handling"
> > done e.g. additional socket options to let the application layer save
> > states before receiving errors. I am also concerned how it would work
>
> The code [0] is what I meant above. It will call
> tcp_write_queue_purge(); before reporting the error over error
> queue/callback. That need to be handled differently to allow dumping
> the actual TCP state and restore at reconnect, at least that is what I
> have in my mind.

Thanks. That's not usable as is, indeed.

Also, by retransmitting data from the previous send-queue, we risk
resending messages that the peer already received (for example because
the previous connection didn't receive the latest ACKs). I guess that
receiving the same DLM messages twice is going to confuse the peer. So
it looks like we'll need application level sequence numbers anyway.

> - Alex
>
> [0]
> https://elixir.bootlin.com/linux/v5.12-rc6/source/net/ipv4/tcp_input.c#L4239
>
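
To illustrate the point about application-level sequence numbers, here is a
minimal user-space sketch in Python. It is not DLM code: the names Sender,
Receiver, on_ack() and on_reconnect() are hypothetical, and the cumulative-ack
scheme is an assumption made only for this example. It shows that resending
every unacknowledged message at reconnect is harmless as long as the receiver
drops sequence numbers it has already delivered.

```python
from collections import deque


class Sender:
    """Hypothetical sender: keeps each message until the peer acks it."""

    def __init__(self):
        self.next_seq = 0
        self.unacked = deque()  # (seq, payload) pairs, oldest first

    def send(self, payload):
        msg = (self.next_seq, payload)
        self.next_seq += 1
        self.unacked.append(msg)
        return msg

    def on_ack(self, ack_seq):
        # Assumed cumulative ack: the peer delivered everything below ack_seq.
        while self.unacked and self.unacked[0][0] < ack_seq:
            self.unacked.popleft()

    def on_reconnect(self):
        # No timer involved: resend whatever is still unacknowledged, in order.
        return list(self.unacked)


class Receiver:
    """Hypothetical receiver: delivers each sequence number at most once."""

    def __init__(self):
        self.expected_seq = 0
        self.delivered = []

    def receive(self, msg):
        seq, payload = msg
        if seq == self.expected_seq:
            self.delivered.append(payload)
            self.expected_seq += 1
        # seq < expected_seq is a duplicate from a retransmit and is dropped.
        # seq > expected_seq is ignored here; with in-order resend at
        # reconnect it is not expected to occur in this sketch.
        return self.expected_seq  # value to send back as the ack


sender, receiver = Sender(), Receiver()
msgs = [sender.send(p) for p in ("lock", "convert", "unlock")]
acks = [receiver.receive(m) for m in msgs]

# Only the first ack reaches the sender before the connection drops.
sender.on_ack(acks[0])

# At reconnect, messages 1 and 2 are resent although the peer already has them.
for m in sender.on_reconnect():
    receiver.receive(m)

print(receiver.delivered)
```

Without the sequence check in receive(), the last two messages would be
delivered twice, which is the confusion described above.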