Hello,

I was on vacation last week, so I didn't make any progress on this.
I want to fix it, but I need guidance. No one has commented on this...

Did anyone manage to reproduce the issue using my fork?

On Tue, Aug 23, 2022 at 10:42 AM Sebastien Lorquet <sebast...@lorquet.fr>
wrote:

> Hi,
>
> is there any follow up on this point?
>
> Sebastien
>
>
> On 13/08/2022 at 16:44, Fotis Panagiotopoulos wrote:
> > Ok, I just managed to reproduce the issue on a NUCLEO-F429ZI, using the
> > NuttX apps.
> >
> > Please check my fork on
> > https://github.com/fjpanag/incubator-nuttx-apps/tree/tcp_issue
> > See the branch tcp_issue.
> >
> > I have "hacked" the NSH code to reproduce the issue.
> > A TCP connection is opened, and then closed. Then the network interface is
> > brought down. At this point the system crashes immediately.
> >
> > Note that I have locked the scheduler when the connection is closed.
> > This is to simulate an ifdown action BEFORE the FIN ACK is processed (as it
> > happens in my case).
> > My code does not have this locking, this is only for simulation purposes.
> >
> > Please use the provided defconfig. It is stored in the root of my apps fork.
> > I guess it is not related to the configuration, but my "working" sample is
> > provided.
> >
> >
> >
> > On Fri, Aug 12, 2022 at 7:15 PM Alan Carvalho de Assis <acas...@gmail.com>
> > wrote:
> >
> >> Hi Fotis,
> >>
> >> Yes, I understood the point. Because it needs the right timing, it
> >> could be tricky to duplicate.
> >>
> >> Did you try to create a simple host server to try to emulate this
> >> connection issue?
> >>
> >> BR,
> >>
> >> Alan
> >>
> >> On 8/12/22, Fotis Panagiotopoulos <f.j.pa...@gmail.com> wrote:
> >>> I think I understand the nature of the bug.
> >>>
> >>> When closing a socket, tcp_close_eventhandler() is set as a callback in the
> >>> dev->d_devcb list.
> >>>
> >>> Typically, the server's response (FIN ACK) causes tcp_callback() to be
> >>> executed, and thus the callback is called with proper arguments.
> >>> Then the cb is properly free'd.
> >>>
> >>> If however devif_dev_event() has the chance to execute before
> >>> tcp_callback() (e.g. server's response was lost), then the callbacks take
> >>> NULL as a conn argument.
> >>> This crashes the whole system horribly.
> >>>
> >>> As you see, this requires specific timings with the server communication,
> >>> that's why this is so hard to reproduce.
> >>>
> >>>
> >>> On Fri, Aug 12, 2022 at 5:13 PM Fotis Panagiotopoulos <f.j.pa...@gmail.com>
> >>> wrote:
> >>>
> >>>> Hi Alan,
> >>>>
> >>>> I am trying hard to reproduce the issue reliably, but I haven't been able
> >>>> to do so yet.
> >>>>
> >>>> I noticed that when I disable CONFIG_NET_TCP_WRITE_BUFFERS, the problem
> >>>> does not disappear, rather it changes form.
> >>>> Now I occasionally get a failed assertion in wdog/wd_cancel.c line 95.
> >>>>
> >>>> I have to mention that everything in my system is commented out.
> >>>> Currently the only thing working is the network thread that opens the TCP
> >>>> connection, nothing else.
> >>>> I have disabled all of my usage of the workers, all signals etc.
> >>>> I verify that when the fault occurs, this thread is not interrupted by
> >>>> anything (using Segger SystemView).
> >>>> It looks like a scheduling issue is unlikely.
> >>>>
> >>>> I also increased the stacks more, and I added padding to the very few
> >>>> malloc's that I use.
> >>>>
> >>>> ---
> >>>>
> >>>> At this moment I observe something very interesting.
> >>>> I am calling netlib_ifdown(), which causes the attached stack trace.
> >>>>
> >>>> So:
> >>>> 1. netdev_ifdown() calls devif_dev_event() with the argument pvconn set
> >>>> explicitly to NULL.
> >>>> 2. devif_dev_event() eventually calls tcp_close_eventhandler()
> >>>> 3. tcp_close_eventhandler() assumes that conn is NOT NULL, which causes
> >>>> the crash.
> >>>>
> >>>> This is wrong, but I don't have the understanding of it yet.
> >>>> Shall there be a check for a NULL conn?
> >>>> Or maybe tcp_close_eventhandler() is wrong to be in the cb's list in the
> >>>> first place?
> >>>> Or tcp_close_eventhandler() should be tolerant to a NULL conn argument?
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Thu, Aug 11, 2022 at 12:05 PM Alan Carvalho de Assis <acas...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> Hi Fotis,
> >>>>>
> >>>>> Are you in sync with mainline?
> >>>>>
> >>>>> If you can create a host application to induce the issue, it will be
> >>>>> easier for us to test.
> >>>>>
> >>>>> BR,
> >>>>>
> >>>>> Alan
> >>>>>
> >>>>> On 8/9/22, Fotis Panagiotopoulos <f.j.pa...@gmail.com> wrote:
> >>>>>> Hello,
> >>>>>>
> >>>>>> still trying to make the network work reliably.
> >>>>>> After fixing another issue of my application, I hit another problem.
> >>>>>>
> >>>>>> The following sequence causes NuttX to crash:
> >>>>>>
> >>>>>> 1. My application creates a TCP socket and communicates with a server.
> >>>>>> 2. At one point the server stops responding (unrelated to NuttX / network
> >>>>>> issue).
> >>>>>> 3. The application detects the timeout, and calls close() on the socket.
> >>>>>> 4. A new socket is created, and it is connected to the server.
> >>>>>> 5. At this point, the server decides to send a FIN message for the previous
> >>>>>> connection.
> >>>>>> 6. I get a failed assertion in devif_callback.c at line 85.
> >>>>>>
> >>>>>> Note that I haven't managed to manually reproduce this issue.
> >>>>>> No matter what I do manually, everything seems to be working correctly.
> >>>>>> I just have to wait for it to happen.
> >>>>>> It seems that it is only triggered if a FIN arrives **after** a SYN.
> >>>>>>
> >>>>>> I am sure that this is only happening with CONFIG_NET_TCP_WRITE_BUFFERS
> >>>>>> enabled.
> >>>>>> I have no problems without buffering.
> >>>>>>
> >>>>>> The assertion seems right to fire.
> >>>>>> When a FIN is received for a closed connection, the same callback is free'd
> >>>>>> both by tcp_lost_connection() and later on by tcp_close_eventhandler().
> >>>>>> All these are happening within the same execution of tcp_input().
> >>>>>>
> >>>>>> Any ideas?
> >>>>>>
> >>>>>> On Tue, Jul 26, 2022 at 3:44 PM Sebastien Lorquet <sebast...@lorquet.fr>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> good find but
> >>>>>>>
> >>>>>>> -I don't think any usual application tinkers with PHY regs during its
> >>>>>>> lifetime except the ethernet monitor
> >>>>>>>
> >>>>>>> -the fix is certainly a lock somewhere but global or fine grained I don't
> >>>>>>> know.
> >>>>>>>
> >>>>>>> Not all calls need to be locked, e.g. the one that returns the PHY
> >>>>>>> address. Probably not needed by default, but a PHY access lock would
> >>>>>>> prevent any issue you describe.
> >>>>>>>
> >>>>>>> I will wait for people with more expertise about this.
> >>>>>>>
> >>>>>>> Just a note, don't forget that not all PHYs have an interrupt, the one on
> >>>>>>> the nucleo stm32h743zi[2] board does not have one.
> >>>>>>>
> >>>>>>> Sebastien
> >>>>>>>
> >>>>>>> On 26/07/2022 at 11:05, Fotis Panagiotopoulos wrote:
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> I have eventually found 2 issues regarding networking in my application.
> >>>>>>>> I would like to discuss the first one.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> My code contains something like this:
> >>>>>>>>
> >>>>>>>> int sd = socket(AF_INET, SOCK_DGRAM, 0);
> >>>>>>>>
> >>>>>>>> struct ifreq ifr;
> >>>>>>>> memset(&ifr, 0, sizeof(struct ifreq));
> >>>>>>>> strncpy(ifr.ifr_name, CONFIG_NETIF_DEV_NAME, IFNAMSIZ);
> >>>>>>>> ifr.ifr_mii_phy_id = CONFIG_STM32_PHYADDR;
> >>>>>>>> ifr.ifr_mii_reg_num = MII_LAN8720_SECR;
> >>>>>>>> ifr.ifr_mii_val_out = 0;
> >>>>>>>> ioctl(sd, SIOCGMIIREG, (unsigned long)&ifr);
> >>>>>>>>
> >>>>>>>> // Do stuff with ifr.ifr_mii_val_out.
> >>>>>>>>
> >>>>>>>> close(sd);
> >>>>>>>>
> >>>>>>>> I realized that this type of ioctl will directly access the hardware,
> >>>>>>>> without any locking.
> >>>>>>>> That is, if any other task needs to use the PHY in any other way, it will
> >>>>>>>> eventually corrupt its register data.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Two questions on this:
> >>>>>>>> 1. Is there any good reason for this?
> >>>>>>>> 2. What is the best way to fix it? Shall I add a driver level lock, or
> >>>>>>>> should net_lock() be used in any higher layer?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Tue, Jul 19, 2022 at 10:30 PM Fotis Panagiotopoulos <f.j.pa...@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >>>>>>>>>> have all been working reliably for months without stopping, we know it
> >>>>>>>>>> because they critically depend on network functionality and we have
> >>>>>>>>>> reports if a card becomes unreachable. None has so far outside of
> >>>>>>>>>> dedicated tests.
> >>>>>>>>>> So I believe that there is no obvious hard bug in these drivers.
> >>>>>>>>> Good to hear that!
> >>>>>>>>> Although, I may be using a feature or protocol that you are not.
> >>>>>>>>> Of course, I don't believe that NuttX is broken per se, but a minor bug
> >>>>>>>>> may lurk somewhere...
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> I have seen that when I enable the network debugging features, it seems to
> >>>>>>>>>> hit an assertion failure before getting to nsh prompt at startup. This was
> >>>>>>>>>> on a quite recent master. I haven't had a chance to diagnose this further.
> >>>>>>>>>> Have you tried enabling these and if so, do they work?
> >>>>>>>>> If you refer to CONFIG_DEBUG_NET, then yes I have enabled it and it works.
> >>>>>>>>> I have some devices under test, waiting to reproduce the issue to see if
> >>>>>>>>> this option provides any useful information.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> Also, out of curiosity, have you tried running ostest on your board?
> >>>>>>>>> I just tried.
> >>>>>>>>> It passed all the tests.
> >>>>>>>>>
> >>>>>>>>> On Tue, Jul 19, 2022 at 4:44 PM Sebastien Lorquet <sebast...@lorquet.fr>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Hi,
> >>>>>>>>>>
> >>>>>>>>>> We have deployed hundreds of boards with stm32f427 and ethernet, they
> >>>>>>>>>> have all been working reliably for months without stopping, we know it
> >>>>>>>>>> because they critically depend on network functionality and we have
> >>>>>>>>>> reports if a card becomes unreachable. None has so far outside of
> >>>>>>>>>> dedicated tests.
> >>>>>>>>>>
> >>>>>>>>>> So I believe that there is no obvious hard bug in these drivers.
> >>>>>>>>>>
> >>>>>>>>>> Most certainly a build option on your particular config. debug is a
> >>>>>>>>>> possible issue, thread problems is another possibility.
> >>>>>>>>>>
> >>>>>>>>>> Sebastien
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 7/19/22 13:47, Fotis Panagiotopoulos wrote:
> >>>>>>>>>>> Hello!
> >>>>>>>>>>>
> >>>>>>>>>>> I am using Ethernet on an STM32F427 target, but I am facing some issues.
> >>>>>>>>>>> Initially the device works correctly. After some hours of continuous
> >>>>>>>>>>> operation I completely lose all network communications.
> >>>>>>>>>>> Trying to troubleshoot the issue, I enabled assertions and various other
> >>>>>>>>>>> debug features.
> >>>>>>>>>>>
> >>>>>>>>>>> Again the device works correctly for some hours, and then I get a failed
> >>>>>>>>>>> assertion at stm32_eth.c, line 1372:
> >>>>>>>>>>>
> >>>>>>>>>>> DEBUGASSERT(dev->d_len == 0 && dev->d_buf == NULL);
> >>>>>>>>>>>
> >>>>>>>>>>> No other errors are reported (e.g. stack overflows etc).
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I have observed that this issue usually manifests itself when there is
> >>>>>>>>>>> insufficient stack on a task.
> >>>>>>>>>>> But in my case, all tasks have oversized stacks. Typically they do not
> >>>>>>>>>>> exceed 50% utilization.
> >>>>>>>>>>> I have plenty of room available in the heap too (> 100kB).
> >>>>>>>>>>>
> >>>>>>>>>>> Regarding the rest of the firmware, I cannot see any other misbehaviour or
> >>>>>>>>>>> problem.
> >>>>>>>>>>> I haven't ever seen any other unexplained problem, assertion fail,
> >>>>>>>>>>> hard-fault etc.
> >>>>>>>>>>> The application code passes all of our tests.
> >>>>>>>>>>> In fact, even when this issue happens, although I lose network
> >>>>>>>>>>> connectivity, the rest of the system works perfectly.
> >>>>>>>>>>>
> >>>>>>>>>>> Please note that I have checked the contents of dev->d_len and dev->d_buf,
> >>>>>>>>>>> and they seem to contain valid data.
> >>>>>>>>>>> The address lies within the normal address space of the MCU, and the size
> >>>>>>>>>>> is sane.
> >>>>>>>>>>> So it doesn't look like any kind of memory corruption.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> At this point I believe that this is an actual bug either in the STM32 MAC
> >>>>>>>>>>> driver, or in the TCP/IP stack itself.
> >>>>>>>>>>> I had a look at the driver code, but I didn't see anything suspicious.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Has anyone observed the same issue before?
> >>>>>>>>>>> Can it be affected in any way by my configuration?
> >>>>>>>>>>> Or maybe, do you have any recommendations on what to test next?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Thank you!
> >>>>>>>>>>>
>
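
For anyone following the thread, the NULL-conn crash described above (the ifdown path passing a NULL pvconn into the device callback list, where the close handler assumes a valid connection) can be illustrated with a small host-side sketch. This is not NuttX code: every name below (fake_conn, callback, close_eventhandler, dev_event, simulate_ifdown, the FLAG_* values) is a hypothetical stand-in, and the NULL guard only illustrates one of the options raised in the thread, not a confirmed fix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names are NuttX APIs. */
#define FLAG_NETDEV_DOWN 0x01u
#define FLAG_TCP_CLOSE   0x02u

struct fake_conn
{
  int closed;
};

typedef unsigned (*event_fn)(void *pvconn, void *pvpriv, unsigned flags);

struct callback
{
  struct callback *next;
  event_fn event;
  void *priv;
};

/* Close handler that tolerates a NULL connection argument. */
static unsigned close_eventhandler(void *pvconn, void *pvpriv, unsigned flags)
{
  struct fake_conn *conn = pvconn;
  int *handled = pvpriv;

  *handled += 1;

  if (conn == NULL)
    {
      /* Device-level event (e.g. interface down): there is no
       * connection to touch, so return without dereferencing it.
       */
      return flags;
    }

  conn->closed = 1;
  return flags;
}

/* Walk the device callback list, passing the same pvconn to every entry. */
static unsigned dev_event(struct callback *list, void *pvconn, unsigned flags)
{
  struct callback *cb;

  for (cb = list; cb != NULL; cb = cb->next)
    {
      flags = cb->event(pvconn, cb->priv, flags);
    }

  return flags;
}

/* Deliver an interface-down event (NULL conn) before the normal close
 * event, and return how many times the handler ran.
 */
static int simulate_ifdown(void)
{
  int handled = 0;
  struct fake_conn conn = { 0 };
  struct callback cb = { NULL, close_eventhandler, &handled };

  dev_event(&cb, NULL, FLAG_NETDEV_DOWN);  /* would crash without the guard */
  dev_event(&cb, &conn, FLAG_TCP_CLOSE);   /* normal path */

  assert(conn.closed == 1);
  return handled;
}
```

Calling simulate_ifdown() from a test main() exercises both paths. Whether the real handler should guard against NULL, or should not be on the device list at that point at all, is exactly the open question in the thread.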
