[lwip-users] Debugging a hang in an lwIP-based application

Freddie Chopin Thu, 21 Feb 2019 06:41:28 -0800

Hello!

For the past few days I have been debugging a hang in one of my
projects which uses lwIP and my C++ RTOS - http://distortos.org/


The networking part of the application is rather simple and consists of
a single feature (using 2 threads synchronized with mutexes),
which uses the netconn API for Modbus/TCP. This generally works well and
has been used in production for ~2 years now.

However recently it was decided that the refresh of Modbus data has to
be done twice as fast - instead of doing it every second, now I see ~13
Modbus frames every 500 ms (~26 Modbus frames per second).

This again works OK at first sight. However - and this is one of my
issues here - very infrequently and for a reason I cannot currently
find, it hangs forever. It is worth noting that under the lower load (~13
frames per second) it probably never hangs. Usually this happens
only about once a day, so it is pretty hard to debug - I have ~100 MB of
Wireshark logs and these are just from the last few hours. The firmware
has a watchdog timer manager designed for multithreaded applications,
which includes logging in case of reset, so I am ~100% certain where
the hang happened and it happened inside my 2 Modbus/TCP threads - the
second one is waiting for the first one, so in reality the hang is in
the main Modbus/TCP thread. What is important here - it seems that most
likely lwIP's main TCP/IP thread, as well as the thread used for ETH
input, are running as expected and are not hung.

Here is what I found with some additional logging:
- My thread sets netconn client connection to blocking;
- My thread calls netconn_write() with NETCONN_COPY flag set;
- This calls netconn_write_partly(), netconn_apimsg() and finally
tcpip_send_msg_wait_sem(), which all arrange for
lwip_netconn_do_write() to be called by lwIP's main TCP/IP thread - the
message is queued correctly and this thread is blocked on the semaphore
of the message;
- lwip_netconn_do_write() is called in lwIP's main TCP/IP thread and it
then calls lwip_netconn_do_writemore();
- lwip_netconn_do_writemore() calls tcp_write(), which I believe
returns ERR_MEM;
- lwip_netconn_do_writemore() returns _WITHOUT_ notifying the waiting
thread, assuming that sent_tcp() or poll_tcp() will call it again when
some memory is available (based on the comment there, which describes
this mechanism);
- lwip_netconn_do_writemore() is never called again, and the system is
reset after 20 seconds. During that time the PC application keeps
sending new Modbus/TCP frames to my device, including both new requests
and TCP retransmissions of the frames which were not ACKed. My
device does respond to the TCP Retransmission frames with simple
ACKs (which confirms that lwIP's main TCP/IP thread as well as the ETH
thread are working). I also see several "TCP Dup ACK" frames.
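
To make the sequence above easier to reason about, here is a minimal toy
model of the deferred-write mechanism as I understand it from the comment
in lwip_netconn_do_writemore(). This is NOT lwIP code: every name prefixed
with toy_ is invented for illustration, the "memory" is just a counter and
the "semaphore" is just a flag. It only shows why a write that hits
ERR_MEM depends entirely on a later callback that both frees memory and
retries the write.

```c
#include <stdbool.h>

/* Toy model only: none of these names are lwIP identifiers. */
enum toy_result { TOY_OK, TOY_NO_MEM };

struct toy_conn {
    int  free_buffers;   /* stand-in for send-side memory */
    bool write_pending;  /* a write was deferred and must be retried */
    bool waiter_woken;   /* stand-in for signalling the message semaphore */
};

/* Stand-in for tcp_write(): fails when no buffer is available. */
enum toy_result toy_tcp_write(struct toy_conn *c)
{
    if (c->free_buffers == 0)
        return TOY_NO_MEM;
    c->free_buffers--;
    return TOY_OK;
}

/* Stand-in for lwip_netconn_do_writemore(). */
void toy_do_writemore(struct toy_conn *c)
{
    if (toy_tcp_write(c) == TOY_NO_MEM) {
        /* Return WITHOUT waking the caller; rely on a later retry. */
        c->write_pending = true;
        return;
    }
    c->write_pending = false;
    c->waiter_woken = true;  /* only now is the blocked thread released */
}

/* Stand-in for the sent/poll callbacks that are supposed to retry. */
void toy_sent_or_poll(struct toy_conn *c, int buffers_freed)
{
    c->free_buffers += buffers_freed;
    if (c->write_pending)
        toy_do_writemore(c);
}
```

In this model, calling toy_do_writemore() with free_buffers == 0 leaves
the waiter blocked, toy_sent_or_poll(&c, 0) changes nothing, and only
toy_sent_or_poll(&c, 1) wakes it. So the waiter stays blocked forever if
the callback is never invoked, or if it is invoked but no memory ever
becomes available to the sender.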

One important note - my 200 MB of console logs show that this is
actually the one and only time ERR_MEM was detected inside
lwip_netconn_do_writemore().

I'm using a snapshot of lwIP from May 2016 (commit
6be7e221a55a3b80cfc03ceaf8cea86207982238, 2016-05-24 20:29:18). It is
obviously a good idea to update to the most recent version and I plan
to do that, but I'm wondering whether this can be considered a real
solution? I've browsed the history of the api_msg.c file and I did not see
anything which would look like a fix to some bug which could cause the
problem I'm seeing, but of course this was not an in-depth analysis
(there are quite a lot of changes [; ). This question is based on the
fact that I don't know how to cause the problem to appear, so after the
update all I can do is run the application again and wait for the whole
day, hoping that the problem does not appear again (which in reality
proves nothing...).

My second question - maybe someone with much better lwIP experience can
tell me whether the description of the problem fits some (common)
configuration/porting/threading/... problem? I'm obviously not saying
that the application, the driver or lwIP configuration are 100%
correct. With lwipopts.h I tried to keep my tweaks to the bare minimum,
however it still has about 40 options and I will not try to convince
anyone that I understand them all or that I'm sure I should not include
even more #defines there.

Another question - I assume that this is somehow related to the memory
management of lwIP (I'm not using any custom allocators and I'm also
not using malloc()/free() for lwIP). Is it possible that a wrong - or
rather sub-optimal - configuration of pool sizes could cause my
connection to hang forever? Or should a wrong configuration in that
respect only result in poor performance and dropped connections rather
than a fatal hang? Is it somehow possible that all memory freed by
sending out the data is immediately used by received packets, or
are these two completely separate pools?
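
For reference, these are the lwipopts.h options which, as far as I can
tell from the descriptions in opt.h, bound the memory involved in sending
and receiving. The values below are PLACEHOLDERS for illustration only,
not the configuration used in my project, and I am not claiming this list
is complete.

```c
/* lwipopts.h fragment - placeholder values, NOT my real configuration. */

/* Size of the lwIP heap; per opt.h this should be large if the
 * application sends a lot of data that needs to be copied. */
#define MEM_SIZE            (16 * 1024)

/* Number of simultaneously queued TCP segments. */
#define MEMP_NUM_TCP_SEG    32

/* Number of buffers in the pbuf pool. */
#define PBUF_POOL_SIZE      16

/* TCP maximum segment size and sender buffer space (bytes / pbufs). */
#define TCP_MSS             1460
#define TCP_SND_BUF         (4 * TCP_MSS)
#define TCP_SND_QUEUELEN    (4 * TCP_SND_BUF / TCP_MSS)

/* Statistics, to see which heap or pool actually runs out. */
#define LWIP_STATS          1
#define MEM_STATS           1
#define MEMP_STATS          1
#define LWIP_STATS_DISPLAY  1
```

With the statistics options enabled, stats_display() should print the
usage and error counters of the heap and of each pool, which might show
which of them was exhausted when ERR_MEM was returned.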

Last question in this rather long e-mail - any ideas what I should
check, what to debug and what other information I should include here?

Thanks in advance for any help!

Regards,
FCh


_______________________________________________
lwip-users mailing list
lwip-users@nongnu.org
https://lists.nongnu.org/mailman/listinfo/lwip-users
