Hello! For the past few days I have been debugging a hang in one of my projects which uses lwIP and my C++ RTOS - http://distortos.org/
The networking part of the application is rather simple and consists of a single piece of functionality (using 2 threads synchronized with mutexes), which uses the netconn API for Modbus/TCP. This generally works well and has been in production for ~2 years now. However, it was recently decided that the Modbus data has to be refreshed twice as fast - instead of once every second, I now see ~13 Modbus frames every 500 ms (~26 Modbus frames per second). At first sight this again works OK. However - and this is one of my issues here - very infrequently, and for a reason I cannot currently find, it hangs forever. It is worth noting that under the lower load (~13 frames per second) it probably never hangs.

The hang usually happens only about once a day, so it is pretty hard to debug - I have ~100 MB of Wireshark logs and these are just from the last few hours. The firmware has a watchdog timer manager designed for multithreaded applications, which includes logging in case of reset, so I am ~100% certain where the hang happened: inside my 2 Modbus/TCP threads. The second one is waiting for the first one, so in reality the hang is in the main Modbus/TCP thread. What is important here - it seems most likely that lwIP's main TCP/IP thread, as well as the thread used for ETH input, are running as expected and are not hung.
Here is what I found with some additional logging:
- My thread sets the netconn client connection to blocking;
- My thread calls netconn_write() with the NETCONN_COPY flag set;
- This calls netconn_write_partly(), netconn_apimsg() and finally tcpip_send_msg_wait_sem(), which together arrange for lwip_netconn_do_write() to be called by lwIP's main TCP/IP thread - the message is queued correctly and my thread is blocked on the semaphore of the message;
- lwip_netconn_do_write() is called in lwIP's main TCP/IP thread and it then calls lwip_netconn_do_writemore();
- lwip_netconn_do_writemore() calls tcp_write(), which I believe returns ERR_MEM;
- lwip_netconn_do_writemore() returns _WITHOUT_ notifying the waiting thread, assuming that sent_tcp() or poll_tcp() will call it again when some memory is available (based on the comment there, which describes this mechanism);
- lwip_netconn_do_writemore() is never called again, and the system is reset after 20 seconds.

During those 20 seconds the PC application keeps sending new Modbus/TCP frames to my device, including both new requests and TCP retransmissions of the frames which were not ACKed. My device does respond to the TCP retransmission frames with simple ACKs (which confirms that lwIP's main TCP/IP thread as well as the ETH thread are working). I also see several "TCP Dup ACK" frames.

One important note - my 200 MB of console logs show that this is actually the one and only time ERR_MEM was detected inside lwip_netconn_do_writemore().

I'm using a snapshot of lwIP from May 2016 (commit 6be7e221a55a3b80cfc03ceaf8cea86207982238, 2016-05-24 20:29:18). It is obviously a good idea to update to the most recent version and I plan to do that, but I'm wondering whether this can be considered a real solution?
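To make the call sequence listed above concrete, here is a minimal single-threaded model of the mechanism as I understand it. This is not lwIP code: all model_* names, the struct and the error values are made up for illustration, and the real lwip_netconn_do_writemore() can also write partially, which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for lwIP error codes (values are illustrative only). */
enum model_err { MODEL_ERR_OK = 0, MODEL_ERR_MEM = -1 };

/* Simplified model of one connection with a blocking write in progress. */
struct model_conn {
    size_t pending;        /* bytes the application still wants to write */
    size_t snd_buf_free;   /* free space in the simulated TCP send buffer */
    bool   writer_blocked; /* true while the application thread would wait on the semaphore */
    bool   retry_armed;    /* true if a later sent/poll callback is expected to retry */
};

/* Stand-in for tcp_write(): reports "out of memory" when the data does not fit. */
static enum model_err model_tcp_write(struct model_conn *c, size_t len)
{
    if (len > c->snd_buf_free) {
        return MODEL_ERR_MEM;
    }
    c->snd_buf_free -= len;
    return MODEL_ERR_OK;
}

/* Stand-in for lwip_netconn_do_writemore(), run in the TCP/IP thread. */
static void model_do_writemore(struct model_conn *c)
{
    if (model_tcp_write(c, c->pending) == MODEL_ERR_OK) {
        c->pending = 0;
        c->retry_armed = false;
        c->writer_blocked = false; /* models signalling the semaphore */
    } else {
        /* Return WITHOUT waking the writer; rely on a later callback. */
        c->retry_armed = true;
    }
}

/* Stand-in for the application calling a blocking netconn_write(). */
static void model_netconn_write(struct model_conn *c, size_t len)
{
    assert(c != NULL);
    c->pending = len;
    c->writer_blocked = true;
    model_do_writemore(c);
}

/* Stand-in for the sent/poll callbacks: space is freed, then the write is retried. */
static void model_sent_or_poll(struct model_conn *c, size_t acked)
{
    c->snd_buf_free += acked;
    if (c->retry_armed) {
        model_do_writemore(c);
    }
}
```

In this model, the hang I am seeing corresponds to model_sent_or_poll() never being called (or never retrying) after the failed write, so writer_blocked stays true until the watchdog resets the system.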
I've browsed the history of the api_msg.c file and I did not see anything which would look like a fix for a bug that could cause the problem I'm seeing, but of course this was not an in-depth analysis (there are quite a lot of changes [; ). This question is based on the fact that I don't know how to trigger the problem, so after the update all I can do is run the application again and wait for the whole day, hoping that the problem does not appear again (which in reality proves nothing...).

My second question - maybe someone with much better lwIP experience can tell me whether the description of the problem fits some (common) configuration/porting/threading/... problem? I'm obviously not saying that the application, the driver or the lwIP configuration are 100% correct. In lwipopts.h I tried to keep my tweaks to the bare minimum, however it still has about 40 options and I will not try to convince anyone that I understand them all or that I'm sure I should not include even more #defines there.

Another question - I assume that this is somehow related to the memory management of lwIP (I'm not using any custom allocators and I'm also not using malloc()/free() for lwIP). Is it possible that a wrong - or rather sub-optimal - configuration of pool sizes could cause my connection to hang forever? Or should a wrong configuration in that respect only result in poor performance and dropped connections rather than a fatal hang? Is it somehow possible that all memory freed by sending out data is immediately used by received packets, or are these two completely separate pools?

Last question in this rather long e-mail - any ideas what I should check, what to debug and what other information I should include here?

Thanks in advance for any help!

Regards,
FCh