Re: [RFC PATCH] can: improved CAN error state handling

Wolfgang Grandegger Sun, 25 Jul 2010 07:23:08 -0700

On 07/25/2010 03:35 PM, Marc Kleine-Budde wrote:
> Wolfgang Grandegger wrote:
>> I found some time to work on this issue.
>>
>> So far we only report CAN error state changes to the worse:
>> "error-active -> error-warning -> error-passive -> bus-off", but not
>> "bus-off -> error-active" or "error-passive -> error-warning ->
>> error-active". There have been request from other users to correct
>> this behavior.
>>
>> This patch reports any state change reported by the SJA1000 and
>> MSCAN-MPC5200 controller. It introduces "CAN_ERR_CRTL_ACTIVE" to
>> signal state changes to error active. Furthermore, the SJA1000
>> triggers now the bus-off recovery procedure when restarted. So far,
>> the controller was *stopped* and then restarted. I did two tests to
>> analyze the state changes behavior:
>>
>> 1. First I forced the controller into the error-passive by sending
>>    a message (with cansend) without cable connected (*). Then I
>>    reconnected the cable (**) and continued to send messages (with
>>    cangen) (***) until the error active state is reached:
>>
>>    On SJA1000:
>>
>>    # candump any,0:0,#FFFFFFFF
>>    (*)
>>    can0  20000004  [8] 00 08 00 00 00 00 60 00   ERROR-WARNING
>>    can0  20000004  [8] 00 20 00 00 00 00 80 00   ERROR-PASSIVE
>>    (**)
>>    can0  123  [4] DE AD BE EF
>>    (***)
>>    can0  42A  [1] 00
>>    ...
>>    can0  42A  [1] 07
>>    can0  20000004  [8] 00 08 00 00 00 00 7F 00   ERROR-WARNING
> 
> Is it possible to tell apart whether the driver goes from passive to
> warning or from active to warning? Without the need to have a state
> machine in userspace, too?


Yes, I realized that side effect as well. The user must monitor state
changes to understand if the state gets worse or better. We could use an
error bit to indicate state improvements.

> Looking at the tx/rx error level might be an option, but I've seen that
> the level can rise very quickly.

Be aware that the tx/rx error counters are *not* readable on all CAN
controllers.

>>    can0  42A  [1] 08
>>    ...
>>    can0  42A  [1] 27
>>    can0  20000004  [8] 00 40 00 00 00 00 5F 00   ERROR-ACTIVE
>>                                           \-- tx-err !!!
>>    On MSCAN:
>>
>>    # candump any,0:0,#FFFFFFFF
>>    (*)
>>    can2  20000004  [8] 00 08 00 00 00 00 00 00   ERROR-WARNING
>>    can2  20000004  [8] 00 20 00 00 00 00 00 00   ERROR-PASSIVE
>>    (**)
>>    can2  123  [4] DE AD BE EF
>>    can2  20000004  [8] 00 08 00 00 00 00 00 00   ERROR-WARNING
>>    (***)
>>    can2  42A  [1] 00
>>    can2  42A  [1] 01
>>    ...
>>    can2  42A  [1] 1E
>>    can2  42A  [1] 1F
>>    can2  20000004  [8] 00 40 00 00 00 00 00 00   ERROR-ACTIVE
>>
>>    As you can see, the state change to error-active is signaled
>>    differently. The MSCAN seems not to respect the specs :-(.
> 
> Do you mean the fact that error active is signalled with tx err == 0?

Yes, the SJA1000 triggers an interrupt to signal error active when the
error counter goes below 96.

> From the patch below you only changed the part of the driver that runs
> if a change of state is detected "(old_state != priv->can.state)".
> 
> Maybe it's a problem in the driver, that calculated the new state wrong.
> Maybe due to a problem with the hardware. (I don't know the driver nor
> the hardware).

I monitored the interrupt generation as well:

[ 1234.089446] mpc5xxx_can f0000900.can: error interrupt (canrflg=0x44)
[ 1234.092228] mpc5xxx_can f0000900.can: error interrupt (canrflg=0x48)
[ 1240.960487] mpc5xxx_can f0000900.can: error interrupt (canrflg=0x44)
[ 1352.336647] mpc5xxx_can f0000900.can: error interrupt (canrflg=0x40)

The last interrupt comes that late, even if the state is misinterpreted.
Note also that the MSCAN has a dedicated state field for TX and RX.
Therefore the state is defined without a state machine in software.

> Looking at you patches below I want to rise my old proposal again. I
> think we should move the state machine that assembles and sends the
> error frames into a common function. Each driver just needs to have a
> proper function that calculates the current state and then call the
> state handling state machine.

The error handling code is already quite large and complex, especially
for the purpose it serves. Any common function improving that situation
would be nice. But unfortunately, we rely on the hardware :-(.

>>    When time permits, I will also try the MSCAN of the MPC5121.
> 
>> 2. I forced the controller into bus-off by short-circuiting
>>    the low and hi lines and sending a message (*), Then I
>>    restarted the controller manually with
>>    "ip link set can0 type can restart" (**):
>>
>>    On SJA1000:
>>
>>    # ./candump -t d any,0:0,#FFFFFFFF
>>    (*)
>>    can0  20000004  [8] 00 08 00 00 00 00 88 00   ERROR-WARNING
>>    can0  20000004  [8] 00 20 00 00 00 00 88 00   ERROR-PASSIVE
>>    can0  20000044  [8] 00 00 00 00 00 00 7F 00   BUS-OFF
>>    (**)
>>    can0  20000100  [8] 00 00 00 00 00 00 00 00   RESTARTED
>>    can0  20000004  [8] 00 40 00 00 00 00 00 00   ERROR-ACTIVE
>>                                           \-- tx-err !!!
>>
>>    On MSCAN:
>>
>>    # ./candump -t d any,0:0,#FFFFFFFF
>>    (*)
>>    can2  20000004  [8] 00 08 00 00 00 00 00 00   ERROR-WARNING
>>    can2  20000040  [8] 00 00 00 00 00 00 00 00   BUS-OFF
>>    (**)
>>    can2  20000100  [8] 00 00 00 00 00 00 00 00   RESTARTED
>>    can2  20000004  [8] 00 40 00 00 00 00 00 00   ERROR-ACTIVE
>>
>>    The MSCAN seems not to signal the ERROR-PASSIVE state before
>>    going bus-off.
> 
> I think the signalling of ERROR-PASSIVE depends on multiple things:
> - the current bitrate.
>   I think this doesn't count so much in this short circuit test.
> - the CAN hardware itself: does it signal this "in-the-middle-passive"
>   interrupt
> - overall speed and load of your system:
>   reaction time on interrupts,
>   is your system fast enough to handle these two back-to-back interrupts
>   as two single events
> - driver design, i.e.
> - is the state change signalled from the hard-irq or from napi:
>   the second interrupt may occur somewhere between hardirq and napi,
>   so napi just sees bus off
> - design of the error state machine:
>   in the at91 and flexcan I always send a warning interrupt if going
>   from active to passive.
> 
> The last point is another argument for a common state machine.

Well, yes, a common state handling function would be nice, but as the
hardware is not really smart, a simple state machine is not enough. It's
not just reading the current state and check if it has changed. Have a
look to the SJA1000 and MSCAN driver, e.g.:

        if (isrc & IRQ_EPI) {
                /* error passive interrupt */
                dev_dbg(dev->dev.parent, "error passive interrupt\n",
                        status);
                if (status & SR_ES) {
                        if (priv->can.state == CAN_STATE_ERROR_WARNING)
                                state = CAN_STATE_ERROR_PASSIVE;
                        else
                                state = CAN_STATE_ERROR_WARNING;
                } else {
                        state = CAN_STATE_ERROR_ACTIVE;
                }
        }

No need for another state machine. But a common function is still useful
to fill an error frame for the state change, of course. I will try to
come up with a proposal soon.

Wolfgang.


_______________________________________________
Socketcan-core mailing list
[email protected]
https://lists.berlios.de/mailman/listinfo/socketcan-core

Re: [RFC PATCH] can: improved CAN error state handling

Reply via email to