Executive summary:

- adding a pull-up to WE# works but doesn't reduce NOR corruption
- tests confirm that block locking does protect the respective blocks
  from getting corrupted
- next: CE0 pull-up

The rework
----------

As promised in
http://lists.milkymist.org/pipermail/devel-milkymist.org/2011-October/001940.html
I added a 4.7 kOhm pull-up to the NOR's write enable signal and then
ran the power-cycling tests. The rework looks like this:

http://downloads.qi-hardware.com/people/werner/m1/nor/d2/nor-we.jpg

This was a bit trickier than it looks because the wire has a tendency
to twist during soldering, but with enough patience and flux with a
sufficiently low evaporation rate, this too can be mastered.


First results
-------------

The first run of 6159 cycles still produced numerous corruptions. It
looked as if the pull-up had reduced their frequency a little, but
this later turned out to be incorrect:

http://downloads.qi-hardware.com/people/werner/m1/nor/d2/dist.png

When doing more testing, I then had a string of X server hangs (not
caused by the testing) that yielded unusably short runs. Finally, I
had one that looked normal for a while but then went on for more than
20'000 cycles without a single (fatal) corruption of the standby
partition.

Eventually, the Flickernoise partition was corrupted, preventing the
M1 from booting. I then stopped the test and looked for an
explanation for this unexpectedly good result.


Invulnerability debunked
------------------------

When I analyzed what had happened, I found that the first block had
for some reason been locked. If the theory is true that undefined bus
states during power-down are the root cause of all our NOR troubles,
then this would mean that one such event has actually generated the
Block Lock command sequence.

Such an event may be - very rough estimate - about 1/200 as likely as
a random bus state producing a write command with a data pattern that
clears bits.

It may also be possible for a Block Unlock command to be generated,
which - if executed - would unlock the entire device.
However, given that erasing is a very slow operation, it may well be
the case that the chip shuts down before such a command can produce
much damage.


More extensive results
----------------------

That 20'000 cycles run had me confused for a while, but then I
finally got a long successful run without unexpected problems. This
one lasted for 14687 cycles, saw 33 standby corruptions, and ended
with the (unprotected) main Flickernoise partition taking a hit.

There's the graph to prove it:

http://downloads.qi-hardware.com/people/werner/m1/nor/d6/dist.png

The measured rate of 1/445 is close enough to the 1/478 I got before
the rework that they can be considered equivalent. In other words,
the rework had no effect on the rate at which NOR corruption occurs.

The correlation of adjacent intervals doesn't show anything
suspicious either:

http://downloads.qi-hardware.com/people/werner/m1/nor/d6/corr.png

The pattern analysis yields this:

00000 ____________________ | 00000000 00000000 | d6/10531-corrupt.bin
                           | 00000000 00000000 | d6/13288-corrupt.bin
                           | 00000000 00000000 | d6/14686-corrupt.bin
                           | 11001101 01000000 | d6/2209-corrupt.bin
                           | 00000000 00000000 | d6/4292-corrupt.bin
                           | 00000000 00000000 | d6/4389-corrupt.bin
                           | 00000000 00000000 | d6/4492-corrupt.bin
                           | 10011011 11110000 | d6/6091-corrupt.bin
                           | 10101010 00001011 | d6/7700-corrupt.bin
                           | 00000000 00000000 | d6/8332-corrupt.bin
                           | 00000000 00000000 | d6/9423-corrupt.bin
00002 __________________1_ | 00000010 10111101 | d6/2209-corrupt.bin
                           | 00000000 00000000 | d6/7700-corrupt.bin
00004 _________________1__ | 00000000 00000000 | d6/13288-corrupt.bin
                           | 00000000 00000000 | d6/14505-corrupt.bin
00014 _______________1_1__ | ____00__ 0____0_0 | d6/14505-corrupt.bin 1/2
00020 ______________1_____ | 0_0001__ ________ | d6/14517-corrupt.bin 1/1
                           | 0_0000__ ________ | d6/3187-corrupt.bin 1/1
                           | 1_1001__ ________ | d6/9423-corrupt.bin 1/1
00040 _____________1______ | _____0__ ________ | d6/13288-corrupt.bin 1/2
00050 _____________1_1____ | _____0__ ________ | d6/5320-corrupt.bin 1/2
00082 ____________1_____1_ | _0__00__ 0_____00 | d6/4094-corrupt.bin 1/1
00086 ____________1____11_ | _0__00__ 0____111 | d6/11961-corrupt.bin 1/1
0008a ____________1___1_1_ | 00__10__ 0____0_0 | d6/4492-corrupt.bin 1/1
000a0 ____________1_1_____ | ________ 0_______ | d6/319-corrupt.bin 1/1
000a2 ____________1_1___1_ | ____1_1_ _____00_ | d6/6528-corrupt.bin 1/1
00152 ___________1_1_1__1_ | 00__10__ __0__00_ | d6/11690-corrupt.bin 1/1
0017e ___________1_111111_ | ________ 0_______ | d6/4292-corrupt.bin 1/1
00180 ___________11_______ | ________ _0______ | d6/6313-corrupt.bin 1/1
001d0 ___________111_1____ | ________ 000_____ | d6/5732-corrupt.bin 1/1
00202 __________1_______1_ | 00__00__ __0__00_ | d6/10722-corrupt.bin 1/1
00440 _________1___1______ | ________ 0___0___ | d6/11565-corrupt.bin 1/1
                           | ________ 0___0___ | d6/9622-corrupt.bin 1/1
00800 ________1___________ | ________ ____0___ | d6/10531-corrupt.bin 1/1
0080e ________1_______111_ | ________ __0_____ | d6/13288-corrupt.bin 2/2
00830 ________1_____11____ | ________ 00__0___ | d6/8332-corrupt.bin 1/1
00840 ________1____1______ | ________ __0_0___ | d6/11745-corrupt.bin 1/1
00880 ________1___1_______ | ________ ___00___ | d6/3531-corrupt.bin 1/1
008a2 ________1___1_1___1_ | 11__01__ __0__10_ | d6/4389-corrupt.bin 1/1
008f0 ________1___1111____ | ________ _00_____ | d6/14505-corrupt.bin 2/2
009ec ________1__1111_11__ | ____10__ _1___0__ | d6/5965-corrupt.bin 1/1
00c20 ________11____1_____ | ________ 0_0_____ | d6/3120-corrupt.bin 1/1
01062 _______1_____11___1_ | 00__00__ __0__00_ | d6/14686-corrupt.bin 1/1
01200 _______1__1_________ | ________ 0_0_0___ | d6/13807-corrupt.bin 1/1
018c0 _______11___11______ | ________ _001____ | d6/6091-corrupt.bin 1/1
01942 _______11__1_1____1_ | 00__00__ __0__00_ | d6/11608-corrupt.bin 1/1
02442 ______1__1___1____1_ | 00__00__ __0__00_ | d6/2209-corrupt.bin 1/1
02832 ______1_1_____11__1_ | 00__00__ __0__00_ | d6/14206-corrupt.bin 1/1
02aa0 ______1_1_1_1_1_____ | ________ 0_0_____ | d6/5320-corrupt.bin 2/2
02ffe ______1_11111111111_ | ________ _0000___ | d6/7700-corrupt.bin 1/1
0409e _____1______1__1111_ | ________ __000___ | d6/2678-corrupt.bin 1/1

This also looks similar to the previous result. There were fewer
corruptions that left a 1 bit intact somewhere (indicated by "1" in
the pattern data field), though.

The test did not reveal any damage to locked partitions, further
strengthening our hypothesis that locking does indeed avert NOR
corruption.


Conclusion
----------

The bad news is that the WE# pull-up didn't help to prevent NOR
corruption. The good news is that it didn't introduce new problems.
But we wouldn't have expected such things anyway.

Furthermore, it looks as if locking partitions does indeed protect
them against NOR corruption, or at least makes this corruption so
unlikely that an M1 will have died of other causes long before such
corruption would happen.


What's next
-----------

I'll now try to add a pull-up to FLASH_CE_N/CE0 as well, and see how
things go.

- Werner
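
The claim under "More extensive results" that 1/445 and 1/478 can be
considered equivalent can be sanity-checked with a short sketch. It
assumes that corruptions are independent events, so that the count
behaves like a Poisson variable, and it uses a normal approximation
for the interval. The function name and the 95% level are chosen for
illustration only, and 1/478 is treated as exact because the raw
counts of the earlier run are not repeated in this mail.

```python
import math


def rate_interval(events, cycles, z=1.96):
    """Return (rate, low, high) for an event count over a number of cycles.

    Assumes a Poisson count, whose standard deviation is sqrt(events),
    and a normal approximation; z=1.96 gives roughly a 95% interval.
    """
    rate = events / cycles
    half_width = z * math.sqrt(events) / cycles
    return rate, rate - half_width, rate + half_width


# Figures from the d6 run reported in this mail.
CYCLES = 14687
CORRUPTIONS = 33

# Rate measured before the rework, as quoted in this mail.
# Assumption: treated as exact, since its raw counts are not given here.
RATE_BEFORE = 1 / 478

rate, low, high = rate_interval(CORRUPTIONS, CYCLES)
consistent = low <= RATE_BEFORE <= high

print("after rework:  1/%.0f" % (1 / rate))
print("interval:      1/%.0f .. 1/%.0f" % (1 / high, 1 / low))
print("1/478 inside interval:", consistent)
```

With 33 events the interval is wide, roughly 1/330 to 1/680, so a
difference as small as 1/445 versus 1/478 cannot be told apart from
noise under these assumptions.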
