On Sun, 2015-01-25 at 14:54 +0200, Erez Shitrit wrote:
> On 1/23/2015 6:52 PM, Doug Ledford wrote:
> > On Thu, 2015-01-22 at 09:31 -0500, Doug Ledford wrote:
> >> My 8 patch set taken into 3.19 caused some regressions.  This patch
> >> set resolves those issues.
> >>
> >> These patches are to resolve issues created by my previous patch set.
> >> While that set worked fine in my testing, there were problems with
> >> multicast joins after the initial set of joins had completed.  Since my
> >> testing relied upon the normal set of multicast joins that happen
> >> when the interface is first brought up, I missed those problems.
> >>
> >> Symptoms vary from failure to send packets due to a failed join, to
> >> loss of connectivity after a subnet manager restart, to failure
> >> to properly release multicast groups on shutdown, resulting in hangs
> >> when the mlx4 driver attempts to unload itself via its reboot
> >> notifier handler.
> >>
> >> This set of patches has passed a number of tests above and beyond my
> >> original tests.  As suggested by Or Gerlitz, I added IPv6 and IPv4
> >> multicast tests.  I also added both subnet manager restarts and
> >> manual shutdown/restart of individual ports at the switch in order to
> >> ensure that the ENETRESET path was properly tested.  I included
> >> testing, then a subnet manager restart, then a quiescent period for
> >> caches to expire, then restarting testing to make sure that arp and
> >> neighbor discovery work after the subnet manager restart.
> >>
> >> All in all, I have not been able to trip the multicast joins up any
> >> longer.
> >>
> >> Additionally, the original impetus for my first 8 patch set was that
> >> it was simply too easy to break the IPoIB subsystem with this simple
> >> loop:
> >>
> >>     while true; do
> >>         ifconfig ib0 up
> >>         ifconfig ib0 down
> >>     done
> >>
> >> Just to be safe, I made sure this problem did not resurface.
> >>
> >> v5: fix an oversight in mcast_restart_task that leaked mcast joins
> >>
> >>     fix a failure to flush the ipoib_workqueue on deregister that
> >>     meant we could end up running our code after our device had been
> >>     removed, resulting in an oops
> >>
> >>     remove a debug message that could be triggered so fast that the
> >>     kernel printk mechanism would starve out the mcast join task
> >>     thread, resulting in what looked like a mcast failure that was
> >>     really just delayed action
> >>
> >> Doug Ledford (10):
> >>   IB/ipoib: fix IPOIB_MCAST_RUN flag usage
> >>   IB/ipoib: Add a helper to restart the multicast task
> >>   IB/ipoib: make delayed tasks not hold up everything
> >>   IB/ipoib: Handle -ENETRESET properly in our callback
> >>   IB/ipoib: don't restart our thread on ENETRESET
> >>   IB/ipoib: remove unneeded locks
> >>   IB/ipoib: fix race between mcast_dev_flush and mcast_join
> >>   IB/ipoib: fix ipoib_mcast_restart_task
> >>   IB/ipoib: flush the ipoib_workqueue on unregister
> >>   IB/ipoib: cleanup a couple debug messages
> >>
> >>  drivers/infiniband/ulp/ipoib/ipoib.h           |   1 +
> >>  drivers/infiniband/ulp/ipoib/ipoib_main.c      |   2 +
> >>  drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 234 ++++++++++++++-----------
> >>  3 files changed, 131 insertions(+), 106 deletions(-)
> >>
> > FWIW, a couple different customers have tried a test kernel I built
> > internally with my patches, and I've had multiple reports that all
> > previously observed issues have been resolved.
>
> Hi Doug,
> still see an issue with the last version, and as a result no sendonly
> or IPv6 is working.
I disagree with your assessment.  See below.

> The scenario is somehow simple to reproduce: if there is a sendonly
> multicast group that failed to join (sm refuses, perhaps the group was
> closed, etc.)

This is not the case.  "Failed to join" implies that the join completed
with an error.  According to the logs below, something impossible is
happening.

> that causes all the other mcg's to be blocked behind it
> forever.

You're right.  Because the join tasks are serialized and we only queue
up one join at a time (something I would like to address, but not in the
3.19 timeframe), if one join simply never returns, as the logs below
indicate, then it blocks up the whole system.

> for example, there is bad mcg ff12:601b:ffff:0000:0000:0000:0000:0016
> that the sm refuses to join, and after some time the user tries to send
> packets to ip address 225.5.5.5 (mcg:
> ff12:401b:ffff:0000:0000:0000:0105:0505)
> the log will show something like:
>
> [1561627.426080] ib0: no multicast record for
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join

This is printed out by sendonly_join.

> [1561633.726768] ib0: setting up send only multicast group for
> ff12:401b:ffff:0000:0000:0000:0105:0505

This is part of mcast_send and indicates we have created the new group
and added it to the mcast rbtree, but not that we've started the join.

> [1561643.498990] ib0: no multicast record for
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join

Our mcast task got run again, and we are restarting the same group for
some reason.  Theoretically that means we should have cleared the busy
flag on this group somehow, which requires that sendonly_join_complete
must have been called.  However, I can't find a single code path in
sendonly_join_complete that could possibly clear out the busy flag on
this group and not result in a printout message of some sort.  Are you
trimming the list of messages here?
The reason I say this is that these are the possibilities:

  - sendonly_join_complete with status == ENETRESET: we silently drop
    out, but we would immediately get an ib_event with netreset and
    process a flush, which would print out copious messages.

  - sendonly_join_complete with status == 0: we would call
    mcast_join_finish, and the only silent return is if this is our
    broadcast group and priv->broadcast is not set.  In that case we
    don't have link up, and the silent failures and the failure to join
    any other address are normal and expected, since the ipoib layer
    considers things dead without the broadcast group joined.  Every
    other return path will print out something, and in particular the
    attempt to get an ah on the group will always print out either
    success or failure.

  - sendonly_join_complete with any other status results in a debug
    message about the join failing.

As far as I know, your setup has the IPv4 broadcast attached and the
link is up.  The group you are actually showing as bad is the IPv6
MLDv2-capable Routers group, and failure to join it should not be
blocking anything else.  So I'm back to what I said before: unless you
are trimming messages, this debug output makes no sense whatsoever and
should be impossible to generate.  Are you sure you have a clean tree
with no possible merge oddities that are throwing your testing off?
> [1561675.645424] ib0: no multicast record for
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> [1561691.718464] ib0: no multicast record for
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> [1561707.791609] ib0: no multicast record for
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> [1561723.864839] ib0: no multicast record for
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> [1561739.937981] ib0: no multicast record for
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join
> [1561756.010895] ib0: no multicast record for
> ff12:601b:ffff:0000:0000:0000:0000:0016, starting sendonly join

These are all being queued with successively longer backoffs, exactly as
they should be.  In __ipoib_mcast_continue_join_thread, when we get
called with delay != 0, we end up doing two things.  One, we change
mcast->backoff from 1 to whatever it should be now (minimum of 2,
maximum of 16), and we set delay_until to jiffies + (HZ *
mcast->backoff).  This lets the mcast task know to skip this group until
its backoff has expired.  From the messages above, that seems to be
working.

The other thing we do is queue up the mcast thread not just once, but
twice: once at the right time for the backoff, and again immediately.
(There is a bug here: the second instance queues the task with delay as
the time offset when it should be 0.  That's a holdover from before.
Fortunately, delay is only ever 1 or 0, so it's not horrible, but things
would go a little faster if I fix it so the second instance is always
0.)  That way the instance that's queued up first will know to skip the
delayed group and continue on through the list, and that should allow us
to get to the other groups that you say are blocked.

> ....
> for ever or till the sm will decide in the future to let
> ff12:601b:ffff:0000:0000:0000:0000:0016 join, till than no sendonly
> traffic.
>
> The main cause is the concept that was broken for the send-only join,
> when you treat the sendonly like a regular mcg and add it to the mc
> list and to the mc_task etc.

I'm looking at 3.18 right now, and 3.18 adds sendonly groups to the mcg
list just like my current code does.  The only difference, and I do mean
*only*, is that it calls sendonly_join directly instead of via the
mcast_task.  The net effect of that is that sendonly joins are not
serialized in the old version and they are in the new.  I'm more than
happy to deserialize them, but I don't think that necessarily means we
have to call sendonly_join directly from mcast_send; it could also be
done by allowing the mcast_task to fire off more than one join at a
time.  Whether we use the mcast_task or call sendonly_join directly is
inconsequential; it's more an issue of whether or not we serialize the
joins.

> (not consider other issues like packet drop, and bw of the sendonly
> traffic)
> IMHO, sendonly is part of the TX flow as it was till now in the ipoib
> driver and should stay that way.

It seems like you're thinking of the mcast task thread as something that
only runs as part of mcast joins or leaves when the upper layer calls
set multicast list in our driver.  It used to be that way, but since I
reworked the task thread, that's no longer the case.  As such, I don't
think the distinction you are drawing is necessarily still correct.

> I went over your comments to my patch, will try to respond to / cover
> them ASAP.
>
> Thanks, Erez
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Doug Ledford <dledf...@redhat.com>
              GPG KeyID: 0E572FDD