Hello, I was on vacation last week, so I didn't make any progress on this. I want to fix it, but I need guidance. No one has commented on this...
Did anyone manage to reproduce the issue using my fork?

On Tue, Aug 23, 2022 at 10:42 AM Sebastien Lorquet <sebast...@lorquet.fr> wrote:

> Hi,
>
> is there any follow up on this point?
>
> Sebastien
>
>
> On 13/08/2022 at 16:44, Fotis Panagiotopoulos wrote:
> > Ok, I just managed to reproduce the issue on a NUCLEO-F429ZI, using the
> > NuttX apps.
> >
> > Please check my fork on
> > https://github.com/fjpanag/incubator-nuttx-apps/tree/tcp_issue
> > See the branch tcp_issue.
> >
> > I have "hacked" the NSH code to reproduce the issue.
> > A TCP connection is opened, and then closed. Then the network interface is
> > brought down. At this point the system crashes immediately.
> >
> > Note that I have locked the scheduler when the connection is closed.
> > This is to simulate an ifdown action BEFORE the FIN ACK is processed (as it
> > happens in my case).
> > My code does not have this locking, this is only for simulation purposes.
> >
> > Please use the provided defconfig. It is stored in the root of my apps fork.
> > I guess it is not related to the configuration, but my "working" sample is
> > provided.
> >
> > On Fri, Aug 12, 2022 at 7:15 PM Alan Carvalho de Assis <acas...@gmail.com>
> > wrote:
> >
> >> Hi Fotis,
> >>
> >> Yes, I understood the point. Because it needs the right timing it
> >> could be tricky to duplicate.
> >>
> >> Did you try to create a simple host server to try to emulate this
> >> connection issue?
> >>
> >> BR,
> >>
> >> Alan
> >>
> >> On 8/12/22, Fotis Panagiotopoulos <f.j.pa...@gmail.com> wrote:
> >>> I think I understand the nature of the bug.
> >>>
> >>> When closing a socket, tcp_close_eventhandler() is set as a callback in the
> >>> dev->d_devcb list.
> >>>
> >>> Typically, the server's response (FIN ACK) will have as a result
> >>> tcp_callback() to be executed, and thus the callback to be properly called,
> >>> with proper arguments.
> >>> Then the cb is properly free'd.
> >>>
> >>> If however devif_dev_event() has the chance to execute before
> >>> tcp_callback() (e.g. server's response was lost), then the callbacks take
> >>> NULL as a conn argument.
> >>> This crashes the whole system horribly.
> >>>
> >>> As you see, this requires specific timings with the server communication,
> >>> that's why this is so hard to reproduce.
> >>>
> >>> On Fri, Aug 12, 2022 at 5:13 PM Fotis Panagiotopoulos <f.j.pa...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Alan,
> >>>>
> >>>> I am trying hard to reproduce the issue reliably, but I haven't been able
> >>>> to do so yet.
> >>>>
> >>>> I noticed that when I disable CONFIG_NET_TCP_WRITE_BUFFERS, the problem
> >>>> does not disappear, rather it changes form.
> >>>> Now I occasionally get a failed assertion in wdog/wd_cancel.c line 95.
> >>>>
> >>>> I have to mention that everything in my system is commented out.
> >>>> Currently the only thing working is the network thread that opens the TCP
> >>>> connection, nothing else.
> >>>> I have disabled all of my usage of the workers, all signals etc.
> >>>> I verify that when the fault occurs, this thread is not interrupted by
> >>>> anything (using Segger SystemView).
> >>>> It looks like a scheduling issue is unlikely.
> >>>>
> >>>> I also increased the stacks more, and I added padding to the very few
> >>>> malloc's that I use.
> >>>>
> >>>> ---
> >>>>
> >>>> At this moment I observe something very interesting.
> >>>> I am calling netlib_ifdown(), which causes the attached stack trace.
> >>>>
> >>>> So:
> >>>> 1. netdev_ifdown() calls devif_dev_event() with the argument pvconn set
> >>>> explicitly to NULL.
> >>>> 2. devif_dev_event() eventually calls tcp_close_eventhandler()
> >>>> 3. tcp_close_eventhandler() assumes that conn is NOT NULL. Which causes
> >>>> the crash.
> >>>>
> >>>> This is wrong, but I don't have the understanding of it yet.
> >>>> Shall there be a check for a NULL conn?
> >>>> Or maybe tcp_close_eventhandler() is wrong to be in the cb's list in the
> >>>> first place?
> >>>> Or tcp_close_eventhandler() should be tolerant to a NULL conn argument?
> >>>>
> >>>> On Thu, Aug 11, 2022 at 12:05 PM Alan Carvalho de Assis
> >>>> <acas...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Fotis,
> >>>>>
> >>>>> Are you in sync with mainline?
> >>>>>
> >>>>> If you can create a host application to induce the issue, it will be
> >>>>> easier for us to test.
> >>>>>
> >>>>> BR,
> >>>>>
> >>>>> Alan
> >>>>>
> >>>>> On 8/9/22, Fotis Panagiotopoulos <f.j.pa...@gmail.com> wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> still trying to make the network work reliably.
> >>>>>> After fixing another issue of my application, I hit another problem.
> >>>>>>
> >>>>>> The following sequence causes NuttX to crash:
> >>>>>>
> >>>>>> 1. My application is creating a TCP socket and communicates with a server.
> >>>>>> 2. At one point the server stops responding (unrelated to NuttX / network
> >>>>>> issue).
> >>>>>> 3. The application detects the timeout, and calls close() on the socket.
> >>>>>> 4. A new socket is created, and it is connected to the server.
> >>>>>> 5. At this point, the server decides to send a FIN message for the previous
> >>>>>> connection.
> >>>>>> 6. I get a failed assertion in devif_callback.c at line 85.
> >>>>>>
> >>>>>> Note that I haven't managed to manually reproduce this issue.
> >>>>>> No matter what I do manually, everything seems to be working correctly.
> >>>>>> I just have to wait for it to happen.
> >>>>>> It seems that it is only triggered if a FIN arrives **after** a SYN.
> >>>>>>
> >>>>>> I am sure that this is only happening with CONFIG_NET_TCP_WRITE_BUFFERS
> >>>>>> enabled.
> >>>>>> I have no problems without buffering.
> >>>>>>
> >>>>>> The assertion seems right to fire.
> >>>>>> When a FIN is received for a closed connection, the same callback is free'd
> >>>>>> both by tcp_lost_connection() and later on by tcp_close_eventhandler().
> >>>>>> All these are happening within the same execution of tcp_input().
> >>>>>>
> >>>>>> Any ideas?
> >>>>>>
> >>>>>> On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet
> >>>>>> <sebast...@lorquet.fr>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> good find but
> >>>>>>>
> >>>>>>> -I don't think any usual application tinkers with PHY regs during its
> >>>>>>> lifetime except the ethernet monitor
> >>>>>>>
> >>>>>>> -the fix is certainly a lock somewhere but global or fine grained I don't
> >>>>>>> know.
> >>>>>>>
> >>>>>>> Not all calls need to be locked, e.g. the one that returns the PHY
> >>>>>>> address. Probably not needed by default, but a PHY access lock would
> >>>>>>> prevent any issue you describe.
> >>>>>>>
> >>>>>>> I will wait for people with more expertise about this.
> >>>>>>>
> >>>>>>> Just a note, don't forget that not all PHYs have an interrupt, the one on
> >>>>>>> the nucleo stm32h743zi[2] board does not have one.
> >>>>>>>
> >>>>>>> Sebastien
> >>>>>>>
> >>>>>>> On 26/07/2022 at 11:05, Fotis Panagiotopoulos wrote:
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> I have eventually found 2 issues regarding networking in my application.
> >>>>>>>> I would like to discuss the first one.
> >>>>>>>>
> >>>>>>>> My code contains something like this:
> >>>>>>>>
> >>>>>>>> int sd = socket(AF_INET, SOCK_DGRAM, 0);
> >>>>>>>>
> >>>>>>>> struct ifreq ifr;
> >>>>>>>> memset(&ifr, 0, sizeof(struct ifreq));
> >>>>>>>> strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
> >>>>>>>> ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
> >>>>>>>> ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
> >>>>>>>> ifr.ifr_mii_val_out = 0;
> >>>>>>>> ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
> >>>>>>>>
> >>>>>>>> // Do stuff with ifr.ifr_mii_val_out.
> >>>>>>>>
> >>>>>>>> close(sd);
> >>>>>>>>
> >>>>>>>> I realized that this type of ioctl will directly access the hardware,
> >>>>>>>> without any locking.
> >>>>>>>> That is, if any other task needs to use the PHY in any other way, it will
> >>>>>>>> eventually corrupt its register data.
> >>>>>>>>
> >>>>>>>> Two questions on this:
> >>>>>>>> 1. Is there any good reason for this?
> >>>>>>>> 2. What is the best way to fix it? Shall I add a driver level lock, or
> >>>>>>>> should net_lock() be used in any higher layer?
> >>>>>>>>
> >>>>>>>> On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <f.j.pa...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >>>>>>>>>> have all been working reliably for months without stopping, we know it
> >>>>>>>>>> because they critically depend on network functionality and we have
> >>>>>>>>>> reports if a card becomes unreachable. None has so far outside of
> >>>>>>>>>> dedicated tests.
> >>>>>>>>>> So I believe that there is no obvious hard bug in these drivers.
> >>>>>>>>> Good to hear that!
> >>>>>>>>> Although, I may be using a feature or protocol that you are not.
> >>>>>>>>> Of course, I don't believe that NuttX is broken per se, but a minor bug
> >>>>>>>>> may lurk somewhere...
> >>>>>>>>>
> >>>>>>>>>> I have seen that when I enable the network debugging features, it seems to
> >>>>>>>>>> hit an assertion failure before getting to nsh prompt at startup. This was
> >>>>>>>>>> on a quite recent master. I haven't had a chance to diagnose this further.
> >>>>>>>>>> Have you tried enabling these and if so, do they work?
> >>>>>>>>> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it works.
> >>>>>>>>> I have some devices under test, waiting to reproduce the issue to see if
> >>>>>>>>> this option provides any useful information.
> >>>>>>>>>
> >>>>>>>>>> Also, out of curiosity, have you tried running ostest on your board?
> >>>>>>>>> I just tried.
> >>>>>>>>> It passed all the tests.
> >>>>>>>>>
> >>>>>>>>> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet
> >>>>>>>>> <sebast...@lorquet.fr>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >>>>>>>>>> have all been working reliably for months without stopping, we know it
> >>>>>>>>>> because they critically depend on network functionality and we have
> >>>>>>>>>> reports if a card becomes unreachable. None has so far outside of
> >>>>>>>>>> dedicated tests.
> >>>>>>>>>>
> >>>>>>>>>> So I believe that there is no obvious hard bug in these drivers.
> >>>>>>>>>>
> >>>>>>>>>> Most certainly a build option on your particular config. debug is a
> >>>>>>>>>> possible issue, thread problems is another possibility.
> >>>>>>>>>>
> >>>>>>>>>> Sebastien
> >>>>>>>>>>
> >>>>>>>>>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
> >>>>>>>>>>> Hello!
> >>>>>>>>>>>
> >>>>>>>>>>> I am using Ethernet on an STM32F427 target, but I am facing some issues.
> >>>>>>>>>>> Initially the device works correctly. After some hours of continuous
> >>>>>>>>>>> operation I completely lose all network communications.
> >>>>>>>>>>> Trying to troubleshoot the issue, I enabled assertions and various other
> >>>>>>>>>>> debug features.
> >>>>>>>>>>>
> >>>>>>>>>>> Again the device works correctly for some hours, and then I get a failed
> >>>>>>>>>>> assertion at stm32_eth.c, line 1372:
> >>>>>>>>>>>
> >>>>>>>>>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
> >>>>>>>>>>>
> >>>>>>>>>>> No other errors are reported (e.g. stack overflows etc).
> >>>>>>>>>>>
> >>>>>>>>>>> I have observed that this issue usually manifests itself when there is
> >>>>>>>>>>> insufficient stack on a task.
> >>>>>>>>>>> But in my case, all tasks have oversized stacks. Typically they do not
> >>>>>>>>>>> exceed 50% utilization.
> >>>>>>>>>>> I have plenty of room available in the heap too (> 100kB).
> >>>>>>>>>>>
> >>>>>>>>>>> Regarding the rest of the firmware, I cannot see any other misbehaviour or
> >>>>>>>>>>> problem.
> >>>>>>>>>>> I haven't ever seen any other unexplained problem, assertion fail,
> >>>>>>>>>>> hard-fault etc.
> >>>>>>>>>>> The application code passes all of our tests.
> >>>>>>>>>>> In fact, even when this issue happens, although I lose network
> >>>>>>>>>>> connectivity, the rest of the system works perfectly.
> >>>>>>>>>>>
> >>>>>>>>>>> Please note that I have checked the contents of dev->d_len and dev->d_buf,
> >>>>>>>>>>> and they seem to contain valid data.
> >>>>>>>>>>> The address lies within the normal address space of the MCU, and the size
> >>>>>>>>>>> is sane.
> >>>>>>>>>>> So it doesn't look like any kind of memory corruption.
> >>>>>>>>>>>
> >>>>>>>>>>> At this point I believe that this is an actual bug either on the STM32 MAC
> >>>>>>>>>>> driver, or at the TCP/IP stack itself.
> >>>>>>>>>>> I had a look at the driver code, but I didn't see anything suspicious.
> >>>>>>>>>>>
> >>>>>>>>>>> Has anyone observed the same issue before?
> >>>>>>>>>>> Can it be affected in any way with my configuration?
> >>>>>>>>>>> Or maybe, do you have any recommendations on what to test next?
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you!
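
One of the options raised in this thread is that tcp_close_eventhandler() should tolerate a NULL conn argument when it is reached through devif_dev_event(). The following is only a minimal, self-contained model of that idea, not NuttX code: the types, flag values and function names (conn_s, callback_s, dev_s, EV_NETDEV_DOWN, EV_TCP_CLOSE, close_eventhandler, dev_event) are all hypothetical stand-ins, and whether a NULL check is the right fix for the real stack is still an open question here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical event flags, used by this model only. */
#define EV_NETDEV_DOWN (1u << 0)
#define EV_TCP_CLOSE   (1u << 1)

/* Stand-in for a TCP connection structure. */
struct conn_s
{
  int closed;
};

typedef uint16_t (*event_fn)(void *pvconn, void *pvpriv, uint16_t flags);

/* Stand-in for one entry in a device callback list. */
struct callback_s
{
  struct callback_s *next;
  event_fn event;
  void *priv;
};

/* Stand-in for a network device holding the callback list. */
struct dev_s
{
  struct callback_s *devcb;
};

/* Close handler that tolerates a NULL connection. Here pvpriv points to
 * an int that counts invocations, so the behavior can be observed.
 */
static uint16_t close_eventhandler(void *pvconn, void *pvpriv, uint16_t flags)
{
  struct conn_s *conn = pvconn;
  int *ncalls = pvpriv;

  (*ncalls)++;

  if (conn == NULL)
    {
      /* Device-wide event: no connection was passed, so never touch conn. */
      return flags;
    }

  if ((flags & EV_TCP_CLOSE) != 0)
    {
      conn->closed = 1;
    }

  return flags;
}

/* Model of a device-wide dispatch: every callback receives pvconn == NULL,
 * as described for the ifdown path in this thread.
 */
static uint16_t dev_event(struct dev_s *dev, uint16_t flags)
{
  struct callback_s *cb;
  struct callback_s *next;

  assert(dev != NULL);

  for (cb = dev->devcb; cb != NULL; cb = next)
    {
      /* Save the link first, in case a handler unlinks its own entry. */
      next = cb->next;
      if (cb->event != NULL)
        {
          flags = cb->event(NULL, cb->priv, flags);
        }
    }

  return flags;
}
```

To try it, register close_eventhandler in a dev_s callback list and call dev_event() with EV_NETDEV_DOWN: the handler runs with a NULL connection and returns without dereferencing it. Calling the handler directly with a valid conn_s and EV_TCP_CLOSE still marks the connection as closed.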