On Mar 14, 2011, at 14:59, Wolfgang Denk wrote:
> In message <a60aea13-1206-4699-9302-0df9c0f9d...@boeing.com> you wrote:
>> My own board needs both processor modules to synchronize resets to allow
>> them to come back up at all, which means that a "reset" may block for an
>> arbitrary amount of time waiting for the other module to cleanly shut down
>> and restart (or waiting for somebody to type "reset" on the other U-Boot).
>> If someone just types "reset" on the console, I want to allow them to hit
>> Ctrl-C to interrupt the process.
> 
> This is not what the "reset" command is supposed to do.  The reset
> command is supposed to be the software equivalent of someone pressing
> the reset button on your board - to the extent possible to be
> implemented in software.

On our boards, when the "reset" button is pressed in hardware, both processor 
modules on the board and all the attached hardware reset at the same time.

If just *one* of the 2 CPUs triggers the reset then only *some* of the attached 
hardware will be properly reset due to a hardware erratum, and as a result the 
board will sometimes hang or corrupt DMA transfers to the SSDs shortly after 
reset.

The only way to reset either chip safely is by resetting both at the same time, 
which requires them to communicate before the reset and wait (possibly a long 
time) for the other CPU to agree to reset.  Yes, it's a royal pain, but we're 
stuck with this hardware for the time being, and if the board can't communicate 
then it might as well hang() anyways.

This same logic is also implemented in my Linux board-support code, so when one 
CPU requests a reset the other treats it as a Ctrl-Alt-Del.
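
A rough sketch of the handshake described above, to make the control flow 
concrete.  Everything here is hypothetical and for illustration only: the 
sync_reset_* names, the ops structure and the simulated peer are not existing 
U-Boot or board code.  On real hardware the "interrupted" hook would presumably 
be backed by U-Boot's ctrlc(), and the other hooks by whatever inter-CPU 
signalling the board provides.

```c
#include <stdbool.h>

/* Hypothetical result codes for this sketch; not U-Boot API. */
enum sync_reset_result {
	SYNC_RESET_AGREED,      /* peer agreed: safe to reset both CPUs together */
	SYNC_RESET_INTERRUPTED  /* operator pressed Ctrl-C while we were waiting */
};

/* Hypothetical board hooks, passed in so the sketch needs no hardware. */
struct sync_reset_ops {
	void (*request)(void *ctx);      /* ask the other CPU to reset with us */
	void (*cancel)(void *ctx);       /* withdraw the request */
	bool (*peer_agreed)(void *ctx);  /* has the other CPU agreed yet? */
	bool (*interrupted)(void *ctx);  /* Ctrl-C seen on the console? */
};

/*
 * Wait (possibly forever) for the other CPU to agree to a reset.
 * Only a console-initiated reset should pass interruptible = true.
 */
enum sync_reset_result sync_reset_wait(const struct sync_reset_ops *ops,
				       void *ctx, bool interruptible)
{
	ops->request(ctx);
	for (;;) {
		if (ops->peer_agreed(ctx))
			return SYNC_RESET_AGREED;
		if (interruptible && ops->interrupted(ctx)) {
			ops->cancel(ctx);
			return SYNC_RESET_INTERRUPTED;
		}
	}
}

/* --- Simulated peer, for illustration only --- */
struct sim_peer {
	int polls;        /* number of times peer_agreed() has been polled */
	int agree_after;  /* peer agrees on this poll (<0: never) */
	int ctrlc_after;  /* Ctrl-C arrives after this many polls (<0: never) */
	bool requested;   /* is a reset request currently outstanding? */
};

static void sim_request(void *ctx) { ((struct sim_peer *)ctx)->requested = true; }
static void sim_cancel(void *ctx)  { ((struct sim_peer *)ctx)->requested = false; }

static bool sim_peer_agreed(void *ctx)
{
	struct sim_peer *p = ctx;

	p->polls++;
	return p->agree_after >= 0 && p->polls >= p->agree_after;
}

static bool sim_interrupted(void *ctx)
{
	struct sim_peer *p = ctx;

	return p->ctrlc_after >= 0 && p->polls >= p->ctrlc_after;
}

/* Run one simulated reset attempt and report how it ended. */
enum sync_reset_result sim_reset(int agree_after, int ctrlc_after,
				 bool interruptible)
{
	static const struct sync_reset_ops ops = {
		sim_request, sim_cancel, sim_peer_agreed, sim_interrupted
	};
	struct sim_peer peer = { 0, agree_after, ctrlc_after, false };

	return sync_reset_wait(&ops, &peer, interruptible);
}
```

The point of the "interruptible" flag is that only a reset typed at the console 
should honour Ctrl-C; a reset from any other path would keep waiting for the 
other CPU regardless.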


>>> What is the difference between these two - and why do we need
>>> different functions at all?
>>> 
>>> A reset is a reset is a reset, isn't it?
>> 
>> That might be true *IF* all boards could actually perform a real hardware 
>> reset.
>> 
>> Some can't, and instead they just jump to their reset vector (Nios-II) or to 
>> flash (some ppc 74xx/7xx systems).
> 
> So this is the "reset" on these boards, then.
> 
>> If the board just panic()ed or got an unhandled trap or exception, then you
>> don't want to do a soft-reset that assumes everything is OK.  A startup in
>> a bad environment like that could corrupt FLASH or worse.  Right now there
>> is no way to tell the difference, but the lower-level arch-specific code
>> really should care.
> 
> I don't understand your chain of arguments.
> 
> If there really is no better way to implement the reset on such
> boards, then what else can we do?
> 
> And if there are more things that could be done to provide a "better"
> reset, then why should we not always do these?

If the board is in a panic() state it may well have still-running DMA transfers 
(such as USB URBs), or be in the middle of writing to FLASH.

Performing a jump to early-boot code which is only ever tested when everything 
is OK and devices are properly initialized is a great way to cause data 
corruption.

I know for a fact that our boards would rather hang forever than try to reset 
without cooperation from the other CPU.

>>> My initial feeling is a plain NAK, for this and the rest of the patch
>>> series.  Why would we want all this?
>> 
>> While I was going through the hooks I noticed that several of them were
>> explicitly NOT safe if the board was in the middle of a panic() for whatever
> 
> Can you please provide some specific examples?  I don't understand what
> you are talking about.

Ok, using the ppmc7xx board as an example:

        /* Disable and invalidate cache */
        icache_disable();
        dcache_disable();

        /* Jump to cold reset point (in RAM) */
        _start();

        /* Should never get here */
        while(1)
                ;

This board uses the EEPRO100 driver, which appears to set up statically 
allocated TX and RX rings which the device performs DMA to/from.

If this board starts receiving packets and then panic()s, it will disable 
address translation and immediately re-relocate U-Boot into RAM, then zero the 
BSS.  If the network card tries to receive a packet after BSS is zeroed, it 
will read a packet buffer address of (probably) 0x0 from the RX ring and 
promptly overwrite part of U-Boot's memory at that address.
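
The failure mode can be reproduced in a small host-side simulation.  This is 
only a sketch under simplifying assumptions: the one-entry "ring", the sim_* 
names and the 256-byte "RAM" are all made up for illustration and do not 
reflect the real EEPRO100 descriptor layout or the ppmc7xx memory map.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_RAM_SIZE   256u  /* simulated RAM; offset 0 stands in for address 0x0 */
#define SIM_LOW_BYTES  16u   /* low memory we expect to stay intact */
#define SIM_RX_BUF     128u  /* packet buffer the driver set up before the panic */
#define SIM_PKT_LEN    8u

/* Hypothetical one-entry RX ring; the real EEPRO100 descriptor layout differs. */
struct sim_rx_desc {
	uint32_t buf_addr;       /* where the NIC will DMA the next packet */
};

struct sim_board {
	uint8_t ram[SIM_RAM_SIZE];
	struct sim_rx_desc rx;   /* statically allocated, so it lives in BSS */
	bool nic_running;
};

/* The NIC knows nothing about the CPU's state; it just follows the ring. */
static void sim_nic_receive(struct sim_board *b)
{
	uint8_t pkt[SIM_PKT_LEN];

	memset(pkt, 0x55, sizeof(pkt));
	if (b->nic_running && b->rx.buf_addr <= SIM_RAM_SIZE - SIM_PKT_LEN)
		memcpy(&b->ram[b->rx.buf_addr], pkt, sizeof(pkt));
}

/*
 * Simulate panic -> jump to _start -> BSS cleared -> one more packet arrives.
 * Returns how many bytes of low memory were overwritten by the NIC.
 */
unsigned int sim_low_bytes_clobbered(bool stop_nic_before_restart)
{
	struct sim_board b;
	unsigned int i, clobbered = 0;

	memset(b.ram, 0xAA, sizeof(b.ram));  /* 0xAA marks "untouched" */
	b.rx.buf_addr = SIM_RX_BUF;
	b.nic_running = true;

	sim_nic_receive(&b);                 /* normal operation: lands at SIM_RX_BUF */

	if (stop_nic_before_restart)
		b.nic_running = false;       /* what a real hardware reset would give us */
	memset(&b.rx, 0, sizeof(b.rx));      /* early boot zeroes the BSS */

	sim_nic_receive(&b);                 /* late packet follows the stale ring */

	for (i = 0; i < SIM_LOW_BYTES; i++)
		if (b.ram[i] != 0xAA)
			clobbered++;
	return clobbered;
}
```

With the NIC left running across the restart, the late packet lands at offset 0 
and sim_low_bytes_clobbered(false) reports the whole packet length as 
overwritten; with the NIC stopped first, low memory is untouched.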

Depending on the initialization process and memory layout, other similar boards 
could start writing garbage values to FLASH control registers and potentially 
corrupt data.

Since the panic() path is so infrequently used and tested, it's better to be 
safe and hang() on the boards which do not have a reliable hardware-level reset 
than it is to cause undefined behavior or potentially corrupt data.
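
Expressed as code, the policy I am arguing for looks roughly like the sketch 
below.  The enum and function names are hypothetical and only illustrate the 
decision; they are not the interface proposed in the patch series.

```c
#include <stdbool.h>

/* Hypothetical: why the reset path was entered. */
enum reset_reason {
	RESET_REQUESTED,  /* e.g. "reset" typed at the console, system state is sane */
	RESET_PANIC       /* panic(), unhandled trap or exception */
};

/* Hypothetical: what the board code should do about it. */
enum reset_action {
	ACTION_HW_RESET,      /* real hardware-level reset of CPU and devices */
	ACTION_SOFT_RESTART,  /* jump to the reset vector / _start */
	ACTION_HANG           /* spin forever, as hang() does */
};

/*
 * A reliable hardware reset is always preferred.  Without one, a soft
 * restart is only attempted when the system is known to be in a sane
 * state; after a panic the board hangs rather than risk stray DMA or
 * half-finished FLASH writes corrupting data during early boot.
 */
enum reset_action choose_reset_action(enum reset_reason reason,
				      bool has_hw_reset)
{
	if (has_hw_reset)
		return ACTION_HW_RESET;
	if (reason == RESET_PANIC)
		return ACTION_HANG;
	return ACTION_SOFT_RESTART;
}
```

Boards with a working hardware reset behave exactly as they do today; only the 
soft-restart boards change behaviour, and only on the panic path.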

Cheers,
Kyle Moffett
_______________________________________________
U-Boot mailing list
U-Boot@lists.denx.de
http://lists.denx.de/mailman/listinfo/u-boot
