Re: [RFC PATCH] can: improved CAN error state handling

Wolfgang Grandegger Sun, 25 Jul 2010 08:09:24 -0700

On 07/25/2010 05:08 PM, Wolfgang Grandegger wrote:
> On 07/25/2010 04:23 PM, Wolfgang Grandegger wrote:
>> On 07/25/2010 03:35 PM, Marc Kleine-Budde wrote:
>>> Wolfgang Grandegger wrote:
>>>> I found some time to work on this issue.
>>>>
>>>> So far we only report CAN error state changes to the worse:
>>>> "error-active -> error-warning -> error-passive -> bus-off", but not
>>>> "bus-off -> error-active" or "error-passive -> error-warning ->
>>>> error-active". There have been request from other users to correct
>>>> this behavior.
>>>>
>>>> This patch reports any state change reported by the SJA1000 and
>>>> MSCAN-MPC5200 controller. It introduces "CAN_ERR_CRTL_ACTIVE" to
>>>> signal state changes to error active. Furthermore, the SJA1000
>>>> triggers now the bus-off recovery procedure when restarted. So far,
>>>> the controller was *stopped* and then restarted. I did two tests to
>>>> analyze the state changes behavior:
>>>>
>>>> 1. First I forced the controller into the error-passive by sending
>>>>    a message (with cansend) without cable connected (*). Then I
>>>>    reconnected the cable (**) and continued to send messages (with
>>>>    cangen) (***) until the error active state is reached:
>>>>
>>>>    On SJA1000:
>>>>
>>>>    # candump any,0:0,#FFFFFFFF
>>>>    (*)
>>>>    can0  20000004  [8] 00 08 00 00 00 00 60 00   ERROR-WARNING
>>>>    can0  20000004  [8] 00 20 00 00 00 00 80 00   ERROR-PASSIVE
>>>>    (**)
>>>>    can0  123  [4] DE AD BE EF
>>>>    (***)
>>>>    can0  42A  [1] 00
>>>>    ...
>>>>    can0  42A  [1] 07
>>>>    can0  20000004  [8] 00 08 00 00 00 00 7F 00   ERROR-WARNING
>>>
>>> Is it possible to tell apart whether the driver goes from passive to
>>> warning or from active to warning? Without the need to have a state
>>> machine in userspace, too?
>>
>> Yes, I realized that side effect as well. The user must monitor state
>> changes to understand if the state gets worse or better. We could use an
>> error bit to indicate state improvements.
>>
>>> Looking at the tx/rx error level might be an option, but I've seen that
>>> the level can rise very quickly.
>>
>> Be aware that the tx/rx error counters are *not* readable on all CAN
>> controllers.
>>
>>>>    can0  42A  [1] 08
>>>>    ...
>>>>    can0  42A  [1] 27
>>>>    can0  20000004  [8] 00 40 00 00 00 00 5F 00   ERROR-ACTIVE
>>>>                                           \-- tx-err !!!
>>>>    On MSCAN:
>>>>
>>>>    # candump any,0:0,#FFFFFFFF
>>>>    (*)
>>>>    can2  20000004  [8] 00 08 00 00 00 00 00 00   ERROR-WARNING
>>>>    can2  20000004  [8] 00 20 00 00 00 00 00 00   ERROR-PASSIVE
>>>>    (**)
>>>>    can2  123  [4] DE AD BE EF
>>>>    can2  20000004  [8] 00 08 00 00 00 00 00 00   ERROR-WARNING
>>>>    (***)
>>>>    can2  42A  [1] 00
>>>>    can2  42A  [1] 01
>>>>    ...
>>>>    can2  42A  [1] 1E
>>>>    can2  42A  [1] 1F
>>>>    can2  20000004  [8] 00 40 00 00 00 00 00 00   ERROR-ACTIVE
>>>>
>>>>    As you can see, the state change to error-active is signaled
>>>>    differently. The MSCAN seems not to respect the specs :-(.
>>>
>>> Do you mean the fact that error active is signalled with tx err == 0?
>>
>> Yes, the SJA1000 triggers an interrupt to signal error active when the
>> error counter goes below 96.
>>
>>> From the patch below you only changed the part of the driver that runs
>>> if a change of state is detected "(old_state != priv->can.state)".
>>>
>>> Maybe it's a problem in the driver, that calculated the new state wrong.
>>> Maybe due to a problem with the hardware. (I don't know the driver nor
>>> the hardware).
>>
>> I monitored the interrupt generation as well:
>>
>> [ 1234.089446] mpc5xxx_can f0000900.can: error interrupt (canrflg=0x44)
>> [ 1234.092228] mpc5xxx_can f0000900.can: error interrupt (canrflg=0x48)
>> [ 1240.960487] mpc5xxx_can f0000900.can: error interrupt (canrflg=0x44)
>> [ 1352.336647] mpc5xxx_can f0000900.can: error interrupt (canrflg=0x40)
>>
>> The last interrupt comes that late, even if the state is misinterpreted.
>> Note also that the MSCAN has a dedicated state field for TX and RX.
>> Therefore the state is defined without a state machine in software.
>>
>>> Looking at you patches below I want to rise my old proposal again. I
>>> think we should move the state machine that assembles and sends the
>>> error frames into a common function. Each driver just needs to have a
>>> proper function that calculates the current state and then call the
>>> state handling state machine.
>>
>> The error handling code is already quite large and complex, especially
>> for the purpose it serves. Any common function improving that situation
>> would be nice. But unfortunately, we rely on the hardware :-(.
>>
>>>>    When time permits, I will also try the MSCAN of the MPC5121.
>>>
>>>> 2. I forced the controller into bus-off by short-circuiting
>>>>    the low and hi lines and sending a message (*), Then I
>>>>    restarted the controller manually with
>>>>    "ip link set can0 type can restart" (**):
>>>>
>>>>    On SJA1000:
>>>>
>>>>    # ./candump -t d any,0:0,#FFFFFFFF
>>>>    (*)
>>>>    can0  20000004  [8] 00 08 00 00 00 00 88 00   ERROR-WARNING
>>>>    can0  20000004  [8] 00 20 00 00 00 00 88 00   ERROR-PASSIVE
>>>>    can0  20000044  [8] 00 00 00 00 00 00 7F 00   BUS-OFF
>>>>    (**)
>>>>    can0  20000100  [8] 00 00 00 00 00 00 00 00   RESTARTED
>>>>    can0  20000004  [8] 00 40 00 00 00 00 00 00   ERROR-ACTIVE
>>>>                                           \-- tx-err !!!
>>>>
>>>>    On MSCAN:
>>>>
>>>>    # ./candump -t d any,0:0,#FFFFFFFF
>>>>    (*)
>>>>    can2  20000004  [8] 00 08 00 00 00 00 00 00   ERROR-WARNING
>>>>    can2  20000040  [8] 00 00 00 00 00 00 00 00   BUS-OFF
>>>>    (**)
>>>>    can2  20000100  [8] 00 00 00 00 00 00 00 00   RESTARTED
>>>>    can2  20000004  [8] 00 40 00 00 00 00 00 00   ERROR-ACTIVE
>>>>
>>>>    The MSCAN seems not to signal the ERROR-PASSIVE state before
>>>>    going bus-off.
>>>
>>> I think the signalling of ERROR-PASSIVE depends on multiple things:
>>> - the current bitrate.
>>>   I think this doesn't count so much in this short circuit test.
>>> - the CAN hardware itself: does it signal this "in-the-middle-passive"
>>>   interrupt
>>> - overall speed and load of your system:
>>>   reaction time on interrupts,
>>>   is your system fast enough to handle these two back-to-back interrupts
>>>   as two single events
>>> - driver design, i.e.
>>> - is the state change signalled from the hard-irq or from napi:
>>>   the second interrupt may occur somewhere between hardirq and napi,
>>>   so napi just sees bus off
>>> - design of the error state machine:
>>>   in the at91 and flexcan I always send a warning interrupt if going
>>>   from active to passive.
>>>
>>> The last point is another argument for a common state machine.
>>
>> Well, yes, a common state handling function would be nice, but as the
>> hardware is not really smart, a simple state machine is not enough. It's
>> not just reading the current state and check if it has changed. Have a
>> look to the SJA1000 and MSCAN driver, e.g.:
>>
>>         if (isrc & IRQ_EPI) {
>>                 /* error passive interrupt */
>>                 dev_dbg(dev->dev.parent, "error passive interrupt\n",
>>                         status);
>>                 if (status & SR_ES) {
>>                         if (priv->can.state == CAN_STATE_ERROR_WARNING)
>>                                 state = CAN_STATE_ERROR_PASSIVE;
>>                         else
>>                                 state = CAN_STATE_ERROR_WARNING;
>>                 } else {
>>                         state = CAN_STATE_ERROR_ACTIVE;
>>                 }
>>         }
>>
>> No need for another state machine. But a common function is still useful
>> to fill an error frame for the state change, of course. I will try to
>> come up with a proposal soon.
> 
> 
>   int can_handle_state_change(struct net_device *dev,
>                               struct can_frame *cf,
>                               enum can_state new_state,
>                               u16 tx_err, u16 rx_err)
> 
> If "tx_err" or "rx_err == -1", they are ignored. If "new_state ==
> CAN_STATE_UNKNOWN", the state could be derived from tx_err or rx_err.
> What do you think?


This message was sent accidentally. Sorry for the noise.

Wolfgang.
_______________________________________________
Socketcan-core mailing list
[email protected]
https://lists.berlios.de/mailman/listinfo/socketcan-core

Re: [RFC PATCH] can: improved CAN error state handling

Reply via email to