It's not on RHEL; it's on the MRG Realtime kernel, which is a kernel
with the PREEMPT_RT patchset applied. MRG is a layered product where the
RT kernel runs on a RHEL6 user-space.

We don't really have a forum for this, which is why I asked Luis to
reach out upstream.

In answer to your question, a customer has a heartbeat script which is
calling ifconfig repeatedly. 

Clark

On Fri, 12 Jul 2013 18:35:04 +0000
"Ronciak, John" <john.ronc...@intel.com> wrote:

> The request for stats should be happening only once every 2 seconds.  Do you 
> have a script pounding on getting stats repeatedly?  Are you sure that it's 
> the request for stats that is causing the issue you are seeing or are you 
> guessing that this is the case?  Can you make this happen without bonding 
> being involved (i.e. using multiple interfaces)?
> 
> Also, we see that at least some of you work for RH.  If this is on RHEL this 
> is the incorrect forum for this and it should be handled through bugzillas 
> and the weekly engineering call with Peter Martuccelli.   If it's for Fedora 
> what's the RT stuff being used for there?  Please explain.
> 
> Cheers,
> John
> 
> 
> > -----Original Message-----
> > From: Luis Claudio R. Goncalves [mailto:lclau...@uudg.org]
> > Sent: Friday, July 12, 2013 11:08 AM
> > To: Wyborny, Carolyn
> > Cc: e1000-devel@lists.sourceforge.net; Clark Williams
> > Subject: Re: [E1000-devel] [RFC] igb: minimize busy loop on
> > igb_get_hw_semaphore
> > 
> > On Thu, Jul 11, 2013 at 10:46:31PM +0000, Wyborny, Carolyn wrote:
> > | > -----Original Message-----
> > | > From: Luis Claudio R. Goncalves [mailto:lclau...@uudg.org]
> > | > Sent: Thursday, July 11, 2013 11:45 AM
> > | > To: e1000-devel@lists.sourceforge.net
> > | > Cc: Clark Williams
> > | > Subject: Re: [E1000-devel] [RFC] igb: minimize busy loop on
> > | > igb_get_hw_semaphore
> > | >
> > | > Hello,
> > | >
> > | > A customer noticed a strange issue on his setup, a bonding interface
> > | > composed of two igb NICs. After several debug sessions we are pretty
> > | > sure the specific symptom reported is caused by a busy loop on
> > | > igb_get_hw_semaphore(). The problem was reported on a 3.0.25 kernel
> > | > but the patch below was written on 3.8.13.
> > | >
> > | > The complete scenario is described below and there is a great chance
> > | > that this issue is only present (or at least more likely to be
> > | > triggered) on the PREEMPT_RT enabled kernels... but I would like to
> > | > confirm whether this solution is valid or if there is a better way
> > | > to mitigate the problem.
> > | >
> > | > Thanks,
> > | > Luis
> > |
> > | Hello Luis,
> > |
> > | This is a complicated setup and not something we'd be doing much
> > | testing on.  The semaphore calls are intended to serialize access to
> > | certain areas in the hw, usually the PHY.  Making the delays
> > | pre-emptible does not necessarily accomplish the same thing.
> > |
> > | Have you tested the proposed patch and does it speed things up enough to
> > | find what you need to find?   Another thing to try is to reduce the time
> > | value rather than change the type of delay being used and see if you
> > | can find a way to speed things up that way.
> > 
> > First of all, thanks for replying! :)
> > 
> > I have the impression that reducing the delay time on
> > igb_get_hw_semaphore() wouldn't help much here because
> > igb_release_swfw_sync_82575() has this piece of code:
> > 
> >     while (igb_get_hw_semaphore(hw) != 0);
> > 
> > So, even if the udelay used there was 1us, in cases like the one
> > described below, you would be subjected to unbounded busy waits.
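
To make the point about the unbounded wait concrete, here is a minimal
user-space sketch, not driver code. It models the semaphore with a
hypothetical struct fake_hw and fake_try_get_semaphore() (neither exists
in igb) and contrasts the open-ended 'while (...) != 0;' retry with a
retry that sleeps between polls and gives up after a fixed number of
tries. Whether giving up is acceptable in the release path is a driver
design question this sketch does not answer.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for the hardware semaphore; not the igb API. */
struct fake_hw {
    int busy_polls;   /* number of polls for which the semaphore stays taken */
};

/* Returns true when the simulated semaphore was acquired on this poll. */
static bool fake_try_get_semaphore(struct fake_hw *hw)
{
    if (hw->busy_polls > 0) {
        hw->busy_polls--;
        return false;
    }
    return true;
}

/* Sleep instead of spinning so other threads on this CPU can run. */
static void sleep_us(long us)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = us * 1000L };

    nanosleep(&ts, NULL);
}

/*
 * Bounded acquire: return -ETIMEDOUT after max_tries failed polls
 * instead of retrying forever like `while (get_sem() != 0);`.
 */
static int get_semaphore_bounded(struct fake_hw *hw, int max_tries,
                                 long delay_us)
{
    for (int i = 0; i < max_tries; i++) {
        if (fake_try_get_semaphore(hw))
            return 0;
        sleep_us(delay_us);
    }
    return -ETIMEDOUT;
}
```

With a holder that lets go after a few polls the call returns 0; with a
holder that never lets go it returns -ETIMEDOUT after max_tries polls
rather than pinning the CPU.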
> > 
> > I can see that the issue happened because someone else (maybe even the
> > HW) was holding the semaphore for a long time.
> > 
> > Busy waits/loops are dangerous on RT when the process is running at
> > higher priorities. In this case ifconfig was a regular process but when
> > it requested the NIC stats, it held the bond->lock.
> > 
> > Then, while ifconfig was busy waiting on
> > 'while (igb_get_hw_semaphore(hw) != 0);'
> > the igb TX threads (we use threaded IRQs on RT) needed that lock. These
> > IRQ threads run at higher RT priorities, so to shorten the time they
> > spend blocked behind a lower-priority thread they perform a Priority
> > Inheritance operation: they lend their priority to the lower-priority
> > thread until it releases the lock.
> > 
> > This way, the regular process 'ifconfig' busy waiting for the HW
> > semaphore becomes a Real Time thread, running (in this example) at
> > FIFO:85 and therefore preventing any other thread of equal or lower
> > priority from getting any CPU time. If this persists for a long time,
> > several subsystems may experience problems and even collapse. One such
> > example is RCU.
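
For readers less familiar with Priority Inheritance, the closest
user-space analogy is a pthread mutex created with the
PTHREAD_PRIO_INHERIT protocol. The sketch below only shows how such a
mutex attribute is requested; it is an analogy, since the locks in the
RT kernel get this behaviour internally and not through pthreads. The
helper names make_pi_mutexattr() and attr_uses_pi() are made up for
this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

/*
 * Prepare a mutex attribute that requests priority inheritance.
 * User-space analogy only: on PREEMPT_RT the kernel's sleeping locks
 * do the boosting themselves, not through pthreads.
 * Returns 0 on success or a positive errno value.
 */
static int make_pi_mutexattr(pthread_mutexattr_t *attr)
{
    int err = pthread_mutexattr_init(attr);

    if (err)
        return err;

    err = pthread_mutexattr_setprotocol(attr, PTHREAD_PRIO_INHERIT);
    if (err)
        pthread_mutexattr_destroy(attr);
    return err;
}

/* Report whether an attribute asks for priority inheritance. */
static int attr_uses_pi(const pthread_mutexattr_t *attr)
{
    int protocol = PTHREAD_PRIO_NONE;

    if (pthread_mutexattr_getprotocol(attr, &protocol))
        return 0;
    return protocol == PTHREAD_PRIO_INHERIT;
}
```

A mutex initialised with this attribute boosts its holder to the
priority of the highest-priority waiter. That is the same mechanism
that turned the spinning ifconfig into a FIFO:85 task in the scenario
above: the boost is only helpful if the holder makes progress.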
> > 
> > Sorry if this email is getting a bit too big. While I understand the
> > need for serialization and the way it was done on
> > igb_get_hw_semaphore(), I would like to see if there is another way,
> > less likely to create a corner case in RT.
> > 
> > Again, this was observed only once and may not be easy to reproduce.
> > But it seems to be a real issue. All this scenario data was gathered by
> > debugging the vmcores (created by kdump) using crash.
> > 
> > Luis
> > 
> > | Let me know if there is more info I can provide.  I can review your
> > | full lspci -vvv, ethtool ethX output and your .config for anything
> > | else to check and, of course, a full dmesg that shows the problem you
> > | are seeing.
> > | I'm no bonding expert though, so if the problem is there, I may not
> > | have much to offer.
> > |
> > | Hope this helps.
> > |
> > | Carolyn
> > |
> > | Carolyn Wyborny
> > | Linux Development
> > | Networking Division
> > | Intel Corporation
> > |
> > |
> > | >
> > | > ----
> > | >
> > | > igb: minimize busy loop on igb_get_hw_semaphore
> > | >
> > | > Bugzilla: 976912
> > | >
> > | > In drivers/net/ethernet/intel/igb/e1000_82575.c, function
> > | > igb_release_swfw_sync_82575() there is this line:
> > | >
> > | >         while (igb_get_hw_semaphore(hw) != 0);
> > | >
> > | > That is basically a busy loop waiting on a HW semaphore.
> > | >
> > | > A customer has a setup where two igb NICs are part of a bonding interface.
> > | > This customer also has a monitoring script that calls ifconfig
> > | > often. It was observed that in this scenario there is a chance that
> > | > this ifconfig, which happens to hold the bond->lock while collecting
> > | > statistics, enters this busy loop waiting for another thread to clear
> > | > that HW semaphore.
> > | >
> > | > Meanwhile, the irq/xxx-ethY-Tx threads, running at FIFO:85, try to
> > | > acquire the bond lock, held by ifconfig. As it happens on RT, a
> > | > Priority Inheritance operation is started and ifconfig is boosted to
> > | > FIFO:85 so that it may be able to finish its work sooner and release
> > | > the bond->lock, desired by the aforementioned threads.
> > | >
> > | > As ifconfig is running in a busy loop, waiting for the HW semaphore,
> > | > this thread now runs a busy loop at a very high priority, preventing
> > | > other threads on that CPU from progressing.
> > | >
> > | > In that scenario, it seems that the thread holding the HW semaphore
> > | > is also waiting for a lock held by another task. This whole scenario
> > | > leads to RCU stall warnings, which have as a side effect a growing
> > | > number of threads being stuck.
> > | > As this progresses, the livelock reaches threads on other CPUs and
> > | > the system becomes more and more unresponsive.
> > | >
> > | > This little patch aims to prevent the busy loop at a high priority
> > | > (the code called by ifconfig in this example) from starving the threads
> > | > on the same CPU. It may not solve the issue but will at least lead
> > | > us closer to the real issue, masked by the RCU stalls created by
> > | > the busy loop.
> > | >
> > | > This is mostly a debug patch for a testing kernel.
> > | >
> > | > Signed-off-by: Luis Claudio R. Goncalves <lgonc...@redhat.com>
> > | >
> > | > diff --git a/drivers/net/ethernet/intel/igb/e1000_mac.c
> > | > b/drivers/net/ethernet/intel/igb/e1000_mac.c
> > | > index a5c7200..ec0be87 100644
> > | > --- a/drivers/net/ethernet/intel/igb/e1000_mac.c
> > | > +++ b/drivers/net/ethernet/intel/igb/e1000_mac.c
> > | > @@ -1225,7 +1225,7 @@ s32 igb_get_hw_semaphore(struct e1000_hw *hw)
> > | >                 if (!(swsm & E1000_SWSM_SMBI))
> > | >                         break;
> > | >
> > | > -               udelay(50);
> > | > +               usleep_range(50,51);
> > | >                 i++;
> > | >         }
> > | >
> > | > @@ -1244,7 +1244,7 @@ s32 igb_get_hw_semaphore(struct e1000_hw *hw)
> > | >                 if (rd32(E1000_SWSM) & E1000_SWSM_SWESMBI)
> > | >                         break;
> > | >
> > | > -               udelay(50);
> > | > +               usleep_range(50,51);
> > | >         }
> > | >
> > | >         if (i == timeout) {
> > | > --
> > --
> > [ Luis Claudio R. Goncalves             Red Hat  -  Realtime Team ]
> > [ Fingerprint: 4FDD B8C4 3C59 34BD 8BE9  2696 7203 D980 A448 C8F8 ]
> > 
> > 
> > _______________________________________________
> > E1000-devel mailing list
> > E1000-devel@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/e1000-devel
> > To learn more about Intel® Ethernet, visit
> > http://communities.intel.com/community/wired
