On Tue, Oct 21, 2008 at 12:33:34AM +0100, Andy Green wrote:

> | result (based on 512byte dd): 9.197MByte/sec (98% speed-up)
>
> Wow... the 98% sounds good already but on a benefit-per-line-of-patch
> basis it's probably a record.
Well, it is both the timings _and_ the hardware ECC benefit together
(the latter work by zecke is already in your kernel tree, so it's
cheating a bit).

> | However, I don't think that all of the time is spent copying data,
> | but rather polling for when data is finished. The s3c244x (not
> | 2410) supports an RnB interrupt which should solve this issue.
> |
> | The mainline kernel NAND code doesn't have infrastructure for this
> | yet, but I'm working on this right now.
>
> Yes this is similar to the Glamo MCI thing, you ask for a block and
> then some time later you get a completion interrupt. In the meanwhile
> the MCI stack has allowed other processes to get the CPU... it'd be
> cool to have that for NAND too because at boot-time there can easily
> be other processes floating around that have a use for the CPU in
> between NAND, and if not then parallel startup will increase the
> probability of it.

Interestingly, I now have a patchset that uses completion-interrupt
based logic, and I still get exactly the same throughput at the same
CPU usage. I have confirmed via printk's in the interrupt handler and
in the completion function that the new code path is actually used.
Also, the interrupt count in /proc/interrupts visibly increases for
every 2k read from the device.

So now I'm looking at it with oprofile, and we get 35% in
__raw_readsl, which is expected: at the theoretical 45ns byte clock,
there are 18 CPU core clocks (and thus at most 18 instructions) per
byte we read from NAND. Since we read word-wise, that is 72
instructions per 32-bit word. Since our actual NAND clock runs at
50ns, it is more like 80 instructions per word. And given that our
measured data rate works out to 108.73ns per byte, we actually get
something like 173 core clocks (and thus maximum CPU instructions)
per word. So there's not much headroom, and the loads and stores in
__raw_readsl will likely take a significant share of that time.

What is really surprising, however, is default_idle showing up with
21.7%! Top reports 98% CPU load, but oprofile claims 21.7% idling.
This idle figure is really strange. The actual s3c2410 NAND driver
shows up with a total of 1.12%, and the interrupt/completion related
functions account for 0.0625%.

I'm considering this to be some kind of artefact of using oprofile.
I've never used it on ARMv4 before, and it seems it can only use the
timer tick. dyn_tick is disabled, so the timer ticks should be
monotonic (and powertop proves they are).

I'm really lost here and don't know what else to do. I'll get some
profiles on a soft-ECC kernel and on a non-irq-based-NAND kernel to
compare the results and see whether they also show this 'artefact'.
Maybe 'top' is actually wrong?

Any ideas?

Cheers,
-- 
- Harald Welte <[EMAIL PROTECTED]>           http://openmoko.org/
============================================================================
Software for the world's first truly open Free Software mobile phone
