On Thu, Sep 18, 2014 at 08:07:31AM +0000, Wodkowski, PawelX wrote:
> > > +int
> > > +bond_mode_8023ad_deactivate_slave(struct rte_eth_dev *bond_dev,
> > > + uint8_t slave_pos)
> > > +{
> > > + struct bond_dev_private *internals = bond_dev->data->dev_private;
> > > + struct mode8023ad_data *data = &internals->mode4;
> > > + struct port *port;
> > > + uint8_t i;
> > > +
> > > + bond_mode_8023ad_stop(bond_dev);
> > > +
> > > + /* Exclude slave from transmit policy. If this slave is an aggregator
> > > + * make all aggregated slaves unselected to force sellection logic
> > > + * to select suitable aggregator for this port */
> > > + for (i = 0; i < internals->active_slave_count; i++) {
> > > + port = &data->port_list[slave_pos];
> > > + if (port->used_agregator_idx == slave_pos) {
> > > + port->selected = UNSELECTED;
> > > + port->actor_state &= ~(STATE_SYNCHRONIZATION |
> > STATE_DISTRIBUTING |
> > > + STATE_COLLECTING);
> > > +
> > > + /* Use default aggregator */
> > > + port->used_agregator_idx = i;
> > > + }
> > > + }
> > > +
> > > + port = &data->port_list[slave_pos];
> > > + timer_cancel(&port->current_while_timer);
> > > + timer_cancel(&port->periodic_timer);
> > > + timer_cancel(&port->wait_while_timer);
> > > + timer_cancel(&port->tx_machine_timer);
> > > +
> > These all seem rather racy. Alarm callbacks are executed with the alarm
> > list
> > locks not held. So there is every possibility that you could execute these
> > (or
> > any timer_cancel calls in this PMD in parallel with the internal state
> > machine
> > timer callback, and leave either with a corrupted timer list (resulting
> > from a
> > double free between here, and the actual callback site),
>
> I don't think so. Yes, callbacks are executed with alarm list locks not
> held, but
> this is not the issue because access to list itself is guarded by lock and
> ap->executing variable. So list will not be trashed. Check source of
> eal_alarm_callback(), rte_eal_alarm_set() and rte_eal_alarm_cancel().
>
Yes, you're right, the list is probably safe wht the executing bit.
> > or a timer that is
> > actually still pending when a slave is removed.
> >
> This is not the issue also, but problem might be similar. I assumed that
> alarms
> are atomic but when I looked at rte alarms closer I saw a race condition
> between and rte_eal_alarm_cancel() from bond_mode_8023ad_stop()
> and rte_eal_alarm_set() from state machines callback. This need to be
> reworked in some way.
Yes, this is what I was referring to:
CPU0 CPU1
rte_eal_alarm_callback bond_8023ad_deactivate_slave
-bond_8023_ad_periodic_cb timer_cancel
timer_set
If those timer functions operate on the same timer, the result is that you can
leave the stop/deactivate slave paths with a timer function for that slave still
pending. The bonding mode needs some internal state to serialize those
operations and determine if the timer should be reactivated.
Neil