Updated responses follow.
> 1. kstat stuff
>
> Unplumb of the IPMP meta-interface and a kstat read operation could
> deadlock ??
>
> T1. unplumb(IPMP meta-interface) ...->ill_glist_delete -> ill_phyint_free ->
> ipmp_grp_destroy -> ipmp_grp_destroy_kstats -> kstat_delete_netstack ->
> kstat_delete -> kstat_hold_bykid -> kstat_hold
>
> T2. read_kstat_data -> kstat_hold_bykid -> kstat_hold
> <-
> -> ks_update() i.e. ipmp_grp_update_kstats()
>
> Interleaving between T1 and T2 could result in the following sequence.
>
> T1 holds ipst->ips_ill_g_lock as writer in ill_glist_delete and attempts
> to acquire the kstat ownership in kstat_hold and cv_waits for the current
> owner in e->e_owner (say T2) to finish.
>
> T2 gets ownership of the kstat in e->e_owner in kstat_hold and
> continues to ipmp_grp_update_kstats and attempts to acquire
> ipst->ips_ill_g_lock which is held by T1.
>
> T1 and T2 deadlock.
One core problem here is that we're holding ill_g_lock when we're tearing
down the IPMP group in ill_phyint_free(). The IPMP code has no need for
ill_g_lock to be held, and in fact it will cause other problems -- e.g.
ipmp_phyint_leave_grp() (also called from ill_phyint_free()) may generate
routing socket messages, which we will putnext() with ill_g_lock held,
which also isn't good.
Looking at this more, the root of the bug is that ill_phyint_free()
doesn't just free the phyint, it also pulls the phyint out of the global
AVL lists and NULLs out various bits of linkage, and *then* does a bunch
of free-related stuff. I propose we change ill_phyint_free() to *just* do
the free-related stuff, which does not need ill_g_lock. We can then
either inline the linkage logic into ill_glist_delete() (its one caller),
or create a new ill_phyint_unlink() function to do that part. I suspect
inlining will be better because it makes the locking more obvious.
> > We end up holding IP layer locks across calls to GLD in the following
> > sequence ipmp_grp_update_kstats -> ipmp_phint_get_kstats ->
> > KSTAT_UPDATE. This is something to be aware of and be cautious. - at
> > least put in a comment
Indeed -- and even with the fix above, holding this lock across layers
still makes me uncomfortable. One possible solution would be to build up
a dynamically-allocated list of phyints (and refhold an ill on each one so
it can't go away) with ill_g_lock held, then drop ill_g_lock and walk that
list when calling down to GLDv3.
> > 3. We create the ill_grp only in I_PLINK of the IPMP ill. However the
> > IPMP ill becomes known globally as soon as the ill_glist_insert happens
> > as part of the SLIFNAME ioctl. There is a small window between the two
> > ioctls and it is not clear if something bad can happen in that
> > window. There is a check in ip_process_ioctl that fail writer
> > operations on the IPMP ill before the PLINK. But other operations may
> > also end up accessing ill_grp.
> >
> > a. ND_GET/SET. These are forked off ip_sioctl_copyin_setup.
> > ill_foward-set expects a non-null ill_grp in the case of an IPMP ill.
> >
> > b. ioctls that don't need IPI_WR. In this case ip_process_ioctl lets them.
> > SIOCGXARP: ip_sioctl_arp() expects a non-null ill_grp in the case of
> > IPMP ill.
> >
> > c. ip_opt_add_group_v6 through T_SVR4_OPTMGMT_REQ. In this case can we end
> > up accessing ill_grp in ip_ll_send_enabmulti_req ?
> >
> > If the above reasoning is right, then I suspect the same holds in the
> > interval between I_PUNLINK and ill_delete.
I've coded up a fix for this. Basically, ipmp_illgrp_create() is now
called in ipif_allocate() (right after we set PHYI_IPMP, which is what
IS_IPMP() tests), and ipmp_illgrp_destroy() is now called from
ill_delete_tail(), right before we call ill_glist_delete(). I think this
will eliminate this class of bugs, and still preserve the benefit of
failing an "unplumb" of an active group in I_PUNLINK. Please have a look:
http://cr.opensolaris.org/~meem/clearview-ipmp-illgrp/
--
meem