Poor IDE performance on Linux 2.6.x
Hi,

I'm using Linux 2.6.8-rc4 (linuxppc tree) on a Pallas-based board (PPC405 core plus a Set-Top-Box-specialized SoC, Redwood 5 like). The IDE driver in use is ibm_ocp_ide.c, in UDMA-33 mode.

Measured with "hdparm -t", we get an HDD throughput of about 11MB/s. With an older kernel like 2.4.20, the throughput on the same hardware was about 22MB/s, i.e. twice as high.

I tried different I/O schedulers but, as expected with only one process accessing the hard disk, there was no difference.

The IDE driver seems to be OK. I made some measurements, and the time from "ide_do_rw_disk" until the end of the IDE IRQ isn't longer than expected. It gives a raw IDE throughput of about 29MB/s, which is near the theoretical limit of 33MB/s of the UDMA bus. The hard disk itself doesn't seem to matter, as it delivers more than 11MB/s and seems to prefetch, so that the next data is already read from disk into the drive's cache when the DMA transfer starts. The first DMA transfers are slower, probably due to seek time, real read time etc.

The time measured (I won't give exact numbers, as they depend on the transferred size and the time required for the printks) included the IDE command processing time (i.e. the time after issuing the IDE command until the IDE device asserted DRQ), so it's a kind of worst-case timing.

The problem seems to be the delay after the successful termination of the read command until the next ide_do_rw_disk is called. Mainly because I don't know the kernel's I/O subsystem well enough, I was unable to track down what's going on there.

I hacked the kernel profiler to use a critical interrupt (available on 4xx) and an on-CPU compare timer, so I was able to profile even in IRQ time. The profile, sorted and tailed, looks like:

       31 run_timer_softirq          0.0718
       42 __flush_dcache_icache      0.5526
       94 invalidate_dcache_range    1.9583
      103 finish_task_switch         0.5598
      199 memset                     2.1630
      533 __do_softirq               2.3795
     4404 __copy_tofrom_user         7.8085
     9819 cpu_idle                 175.3393
    27760 default_idle             301.7391
    43154 total                      0.2316

So except for some "__copy_tofrom_user", the CPU is just idling.

Can anybody tell me where to look, i.e. where the time is spent between the successful termination of a transfer and the start of the next I/O? Userspace just reads BIG blocks (10MB or so), so userspace latency doesn't seem to be the problem.

hdparm -T gives about 46MB/s, which is about half of our memory performance.

Felix

** Sent via the linuxppc-embedded mail list. See http://lists.linuxppc.org/
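A rough cross-check of the figures in the message above, as a minimal C sketch. The function names are made up for illustration. The only inputs are the numbers quoted in the message (11MB/s effective, about 29MB/s raw, and the profile tick counts), and the model assumes that any time not spent in a transfer is a gap between requests.

```c
#include <stdio.h>

/* Fraction of wall-clock time NOT spent transferring data, assuming the
 * bus moves data at raw_mb_s whenever a transfer is active and the
 * observed average rate is effective_mb_s. */
double gap_fraction(double effective_mb_s, double raw_mb_s)
{
    return 1.0 - effective_mb_s / raw_mb_s;
}

/* Share of all profile ticks that fall into a given set of symbols. */
double tick_share(unsigned long ticks, unsigned long total_ticks)
{
    return (double)ticks / (double)total_ticks;
}

/* Print both estimates for the figures quoted in the message. */
void print_estimates(void)
{
    printf("gap between requests: %.1f%%\n",
           100.0 * gap_fraction(11.0, 29.0));
    printf("idle ticks:           %.1f%%\n",
           100.0 * tick_share(9819UL + 27760UL, 43154UL));
}
```

With the quoted figures this gives roughly 62% of the time spent outside transfers and roughly 87% of the profile ticks in cpu_idle plus default_idle, which is consistent with the CPU waiting for the next request rather than being busy.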
IBM OCP IDE fixes
Hi,

it seems that there's a bug in drivers/ide/ppc/ibm_ocp_ide.c, in ocp_ide_build_dmatable, in the current (linuxppc) 2.6.

Explanation: bvprv gets initialized to NULL, and for each bio segment BIOVEC_PHYS_MERGEABLE is called to see whether it can be merged with the previous one. In the first iteration there is no previous segment, so a NULL pointer is passed to BIOVEC_PHYS_MERGEABLE, which doesn't check for this condition, leading to an Oops. As the first segment can never be merged with anything else, checking for a NULL bvprv should be enough.

Speaking of ibm_ocp_ide.c, it should be added to the Makefile in drivers/ide ("ide-core-$(CONFIG_BLK_DEV_IDE_STB04xxx) += ppc/ibm_ocp_ide.o"), and std_ide_cntl_scan must be called. I'm not sure if my patch is the correct way to do this.

Additionally, the ocp driver issues the IDE commands on its own in DMA mode (always WIN_READDMA or WIN_WRITEDMA), which is wrong for 48-bit addressing. I made a simple workaround, but a more generic function might be called instead.

Then there are some simple compile fixes: a missing header file (which isn't of any use anyway) and the replacement of hw_init_dma_channel by ppc4xx_init_dma_channel. ide_dma_off doesn't seem to be required anymore.

Finally, the udelay(1000*1000) calls have to be replaced by mdelay(1000) in the spinup wait, since udelay is only meant for short delays. Maybe this loop should be replaced by the more generic IDE spinup loop.

Comments?
Felix

diff -Naur linuxppc-2.5-vanilla/drivers/ide/Makefile linux-2.6/drivers/ide/Makefile
--- linuxppc-2.5-vanilla/drivers/ide/Makefile 2004-03-02 22:17:17.0 +0100
+++ linux-2.6/drivers/ide/Makefile 2004-03-04 18:33:12.0 +0100
@@ -37,6 +37,7 @@
 ide-core-$(CONFIG_BLK_DEV_MPC8xx_IDE) += ppc/mpc8xx.o
 ide-core-$(CONFIG_BLK_DEV_IDE_PMAC) += ppc/pmac.o
 ide-core-$(CONFIG_BLK_DEV_IDE_SWARM) += ppc/swarm.o
+ide-core-$(CONFIG_BLK_DEV_IDE_STB04xxx) += ppc/ibm_ocp_ide.o
 obj-$(CONFIG_BLK_DEV_IDE) += ide-core.o
 obj-$(CONFIG_IDE_GENERIC) += ide-generic.o
diff -Naur linuxppc-2.5-vanilla/drivers/ide/ide.c linux-2.6/drivers/ide/ide.c
--- linuxppc-2.5-vanilla/drivers/ide/ide.c 2004-03-02 22:16:11.0 +0100
+++ linux-2.6/drivers/ide/ide.c 2004-03-04 18:33:12.0 +0100
@@ -2156,7 +2156,7 @@
         pnpide_init(1);
     }
 #endif /* CONFIG_BLK_DEV_IDEPNP */
-#ifdef CONFIG_BLK_DEV_STD
+#if defined(CONFIG_BLK_DEV_STD) || defined(CONFIG_BLK_DEV_IDE_STB04xxx)
     {
         extern void std_ide_cntl_scan(void);
         std_ide_cntl_scan();
--- linuxppc-2.5-vanilla/drivers/ide/ppc/ibm_ocp_ide.c 2004-03-02 22:16:52.0 +0100
+++ linux-2.6/drivers/ide/ppc/ibm_ocp_ide.c 2004-03-04 19:04:29.0 +0100
@@ -23,7 +23,7 @@
 #include
 #include
-#include "ide_modes.h"
+// #include "ide_modes.h"
 #define IDE_VER "2.0"
 ppc_dma_ch_t dma_ch;
@@ -383,8 +383,8 @@
         else
             consistent_sync((void *)vaddr, size, PCI_DMA_FROMDEVICE);
-        if (!BIOVEC_PHYS_MERGEABLE(bvprv, bvec)) {
+        if (bvprv && !BIOVEC_PHYS_MERGEABLE(bvprv, bvec)) {
             if (ocp_ide_build_prd_entry(&table, prd_paddr, prd_size,
@@ -581,12 +580,18 @@
 {
     if (!ocp_ide_build_dmatable(drive, writing))
         return 1;
+
+    int lba48bit;
     drive->waiting_for_dma = 1;
     if (drive->media != ide_disk)
         return 0;
+
+    lba48bit = ((drive->id->cfs_enable_2 & 0x0400) ? 1 : 0) && (drive->addressing);
+
     ide_set_handler(drive, &ocp_ide_dma_intr, WAIT_CMD, NULL);
-    HWIF(drive)->OUTB(writing ? WIN_WRITEDMA : WIN_READDMA,
+    HWIF(drive)->OUTB(writing ? (lba48bit ? WIN_WRITEDMA_EXT : WIN_WRITEDMA)
+        : (lba48bit ? WIN_READDMA_EXT : WIN_READDMA),
         IDE_COMMAND_REG);
     return __ocp_ide_dma_begin(drive, writing);
 }
@@ -642,7 +647,7 @@
         if ((stat & 0x80) == 0) {
             break;
         }
-        udelay(1000 * 1000); /* 1 second */
+        mdelay(1000); /* 1 second */
     }
     printk(".");
@@ -657,7 +662,7 @@
         if ((stat & 0x80) == 0) {
             break;
         }
-        udelay(1000 * 1000); /* 1 second */
+        mdelay(1000); /* 1 second */
     }
     if( i < 30){
         outb_p(0xa0, io_ports[6]);
@@ -715,7 +720,7 @@
     dma_ch.ch_enable = 0; /* No chaining */
     dma_ch.tcd_disable = 1; /* No chaining */
-    if (hw_init_dma_channel(IDE_DMACH, &dma_ch) != DMA_STATUS_GOOD)
+    if (ppc4xx_init_dma_channel(IDE_DMACH, &dma_ch) != DMA_STATUS_GOOD)
         return -EBUSY;
     /* init CIC select2 reg to connect external DMA port 3 to internal
@@ -772,8 +777,10 @@
     if(!ocp_ide_spinup(hwif->index))
         return 0;
-
-    return 1;
+
+    probe_hwif_init(hwif);
+
+    return 1;
 }
@@ -821,7 +829,6 @@
     ide_hwifs[index].tuneproc = &ocp_ide_tune_drive;
     ide_hwifs[index].drives[0].autotune = 1;
     ide_hwifs[index].autodma = 1;
-    ide_hwifs[index].ide_dma_off = &ocp_ide_dma_off;
     ide_hwifs[index].ide_
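The NULL-pointer problem described in the message can be illustrated outside the kernel. The following is a simplified stand-alone model, not the driver's code: `struct seg`, `phys_mergeable` and `count_dma_entries` are hypothetical names, and the merge rule (the previous segment ends exactly where the current one starts) is an assumption standing in for BIOVEC_PHYS_MERGEABLE.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a bio segment: physical address and length. */
struct seg {
    uint32_t paddr;
    uint32_t len;
};

/* Assumed merge rule: prev ends exactly where cur begins.
 * Like the macro in the message, it dereferences prev unconditionally. */
static int phys_mergeable(const struct seg *prev, const struct seg *cur)
{
    return prev->paddr + prev->len == cur->paddr;
}

/* Count how many DMA table entries are needed when physically
 * contiguous neighbours are merged into one entry. */
size_t count_dma_entries(const struct seg *segs, size_t n)
{
    const struct seg *prev = NULL;   /* no previous segment yet */
    size_t entries = 0;

    for (size_t i = 0; i < n; i++) {
        /* The first segment has nothing to merge with, so prev must be
         * checked before the merge test is allowed to dereference it. */
        if (prev == NULL || !phys_mergeable(prev, &segs[i]))
            entries++;
        prev = &segs[i];
    }
    return entries;
}
```

For three segments at 0x1000 (0x200 bytes), 0x1200 (0x200 bytes) and 0x2000 (0x100 bytes), the first two merge and the function returns 2. Without the `prev == NULL` test, the very first iteration would dereference a NULL pointer.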
405 Critical Interrupts
Hi,

I need a low-latency interrupt on a 405-based chip with Linux 2.4. Has anybody worked on this yet?

I thought about routing the critical interrupt pretty much the same way as the hardware interrupt, but with MSR_CE disabled. MSR_CE would then be enabled even in (normal) interrupts, so we would probably have to add a __crit_cli and a __save_and_crit_cli, as someone already suggested.

Does CRIT_EXCEPTION work? Is do_IRQ reentrant? Should I use the same interrupt processing as for a normal hardware interrupt, with the exception that only "critical"-flagged interrupts are processed?

Any suggestions?

The background: the capture port of the IBM-STB045xx, which we use for IR decoding, doesn't have any buffering, so when a time-consuming interrupt is being processed (PIO network, maybe PIO IDE), we miss IR cycles.

Felix
ppcboot
Pascal wrote:
> I'm working on a Motorola MPC8xx. I would like to know if it's possible
> to run ppcboot through another ppcboot

On dbox2, we boot the ppcboot ELF from another first-stage bootloader. Works.
On another system (dreambox), I boot u-boot from another first-stage bootloader (IBM's openbios). Works.

Just take out all the SDRAM init stuff etc. and produce a file loadable from your bootloader.

felix
IDE corruption w/ 48 Bit addressing
> Users report harddisk corruption, and a quick test showed, that data
> written to 0x18+x (LBA sector 0xC00+x/512) is also written
> to x. (direct O_LARGEFILE-access to /dev/discs0/disc).

OK, an update, since this was a bit misleading: when reading from the device, the upper 3 bytes aren't updated in the right way, so the "previous content" (as specified in the ATA/6 specs) seems to be invalid and contains, well, wrong content.

felix
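For readers unfamiliar with the "previous content" wording: with 48-bit addressing, each of the LBA Low, Mid and High registers holds two bytes, the most recently written one ("current", LBA bits 0..23 across the three registers) and the one written before it ("previous", LBA bits 24..47). The sketch below only models that split in plain C. The struct and function names are hypothetical, and it does not touch any hardware.

```c
#include <stdint.h>

/* Hypothetical model of the three LBA registers in 48-bit mode. */
struct lba48_regs {
    uint8_t cur[3];   /* "current" content:  LBA bits  0..23 */
    uint8_t prev[3];  /* "previous" content: LBA bits 24..47 */
};

/* Split a 48-bit LBA into the six register bytes. */
struct lba48_regs lba48_split(uint64_t lba)
{
    struct lba48_regs r;

    for (int i = 0; i < 3; i++) {
        r.cur[i]  = (uint8_t)(lba >> (8 * i));
        r.prev[i] = (uint8_t)(lba >> (24 + 8 * i));
    }
    return r;
}

/* Reassemble the LBA that the six register bytes describe. */
uint64_t lba48_join(const struct lba48_regs *r)
{
    uint64_t lba = 0;

    for (int i = 0; i < 3; i++) {
        lba |= (uint64_t)r->cur[i]  << (8 * i);
        lba |= (uint64_t)r->prev[i] << (24 + 8 * i);
    }
    return lba;
}
```

If the three "previous" bytes are wrong, only the low 24 bits of the address survive. In this model, an LBA whose upper bytes are lost or zeroed maps onto `lba & 0xFFFFFF`, so two different sector numbers can end up addressing the same sector.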
IDE corruption w/ 48 Bit addressing
Hi,

I have a PPC-405 based board (IBM STB04500, if anyone cares), and I'm using a Maxtor 6Y200L0, a 200GB hard disk drive. Obviously this uses 48-bit addressing. I'm using linuxppc 2.4.21-pre4 devel; the same bug occurs with the 2.4.20 release.

Users report hard disk corruption, and a quick test showed that data written to 0x18+x (LBA sector 0xC00+x/512) is also written to x (direct O_LARGEFILE access to /dev/discs0/disc). This will of course corrupt the filesystem.

Now my questions:
- is this a bug in the IDE low-level interface driver?
- or maybe in the kernel?
- or maybe fixed in newer versions?
- why does the corruption start at this LBA sector?

Users reported that this occurs with different HDD models and brands, too, but only with 48-bit drives. Everything else works perfectly.

felix
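On the low-level driver question, one thing that can be checked in a driver that issues DMA commands itself is whether it picks the 48-bit (EXT) opcodes when the drive runs in 48-bit mode. This is only one possible cause, not a diagnosis of the report above. The sketch uses the opcode values from the ATA/ATAPI-6 command set; the enum and function names are illustrative, and the feature test assumes bit 10 of IDENTIFY DEVICE word 86 (the 48-bit Address feature set bit).

```c
#include <stdbool.h>
#include <stdint.h>

/* ATA DMA opcodes (values per ATA/ATAPI-6; names chosen for this sketch). */
enum {
    CMD_READ_DMA      = 0xC8,
    CMD_WRITE_DMA     = 0xCA,
    CMD_READ_DMA_EXT  = 0x25,
    CMD_WRITE_DMA_EXT = 0x35
};

/* Assumption: bit 10 of IDENTIFY word 86 reports 48-bit addressing. */
bool lba48_feature_set(uint16_t identify_word86)
{
    return (identify_word86 & 0x0400) != 0;
}

/* Pick the DMA opcode for the transfer direction and addressing mode. */
uint8_t dma_opcode(bool writing, bool lba48)
{
    if (writing)
        return lba48 ? CMD_WRITE_DMA_EXT : CMD_WRITE_DMA;
    return lba48 ? CMD_READ_DMA_EXT : CMD_READ_DMA;
}
```

A driver that always sends 0xC8 or 0xCA while the upper layers have programmed a 48-bit address leaves the upper address bytes unused by the drive, so it is worth ruling out when corruption shows up only on 48-bit drives.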
reading block of data from host problem!!!!
Anand Franklin J wrote:
> Hello All,
> can any body tell what is the problem, it just hangs in reading the
> "block 1" of zImage.treeboot and not proceeding further from tftp
> booting process. I am using IBM power pc redwood 4.

Although this isn't related to Linux, it probably just means that no tftp server is running at all. (Yes, I got this "error report" from a customer, and this was the cause.)

Try "tftp localhost" -> "get zImage.treeboot" (or whatever) and see if tftp is *really* working on your tftp server machine.

felix
file alignment of elf sections
Hi,

I'm currently porting u-boot to a ppc405-based board, but I'm failing miserably at the first step: correct linking.

To make it short (and keep it a bit on topic, sorry): how can I tell ld to align the section in the FILE to 2^16? My .text section starts at (load address) 5MB, and I'd like to have it at 64k in the ELF file. I saw lds producing exactly this output, but in my case the .text section starts immediately after the ELF header.

(The reason is that I have to convert the binary to a special format for the primary bootloader, and the conversion tool is rather dumb, but I don't want to invest time into fixing the tool since I didn't write it and nobody else cares about it.)

There must be a simple option, but I can't find it :/ (even after googling around). I have to admit that I did this before, but I can't remember HOW I did it.

felix
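As background for the question, the ELF format requires that a loadable segment's file offset and virtual address be congruent modulo the segment alignment (p_offset ≡ p_vaddr mod p_align). The sketch below only computes where a segment could land in the file under that rule. It is not an ld option, and the function name and the header size used in the example are assumptions for illustration.

```c
#include <stdint.h>

/* Smallest file offset >= min_off that is congruent to vaddr modulo
 * align. align must be non-zero. */
uint64_t next_congruent_offset(uint64_t min_off, uint64_t vaddr,
                               uint64_t align)
{
    uint64_t want = vaddr % align;             /* required remainder     */
    uint64_t off = min_off - (min_off % align) + want;

    if (off < min_off)                         /* landed before min_off  */
        off += align;
    return off;
}
```

With a load address of 5MB (0x500000), an alignment of 2^16 and an assumed 0x54 bytes of headers in front, the function returns 0x10000, i.e. the 64k file offset asked about. With a smaller alignment the same address could sit much earlier in the file, which would match a .text section that starts immediately after the header.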
C++ Library recommendations ...
Jaap-Jan Boor wrote:
> I just use libstdc++ coming with gnu g++, it's not too big (shared ~300k)
> compared to glibc (shared ~1.2 M)

The problem is that the STL is a template library, so a lot of code is generated when using these templates. I highly recommend using normal lists and casting pointers again (like in good old C times), even with the need to allocate two chunks per list item (the pointer and the data itself). Using the STL makes your application MUCH bigger and often slower.

The STL is optimized for huge data structures, but for most things memcpy'ing (using, for example, an array/vector) is much faster than using a list or hash, which has optimal (for example linear or even log) complexity. The thing the STL guys forgot to keep in mind is that 1000*linear (list insertions...) is still worse than 1*exp (memcpy when doing vector insertions... but take this only as an example) when your list has, for example, 5 entries. And most lists are NOT accessed ten thousand times and do NOT have one million entries, where 1*exp is a LOT more than 1000*linear complexity.

Just my 2 cents...

felix
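The "just memcpy it" approach for small collections can be sketched in plain C as follows. This is an illustration of the idea in the message, not a benchmark: `sorted_insert` is a hypothetical helper, and the claim that it beats a node-based list for a handful of entries is the author's, not something the code proves.

```c
#include <stddef.h>
#include <string.h>

/* Insert value into the sorted array a[0..*n) with room for cap items.
 * Elements behind the insert position are shifted with one memmove.
 * Returns 0 on success, -1 if the array is full. */
int sorted_insert(int *a, size_t *n, size_t cap, int value)
{
    size_t pos = 0;

    if (*n >= cap)
        return -1;

    /* Linear search is fine for a handful of entries. */
    while (pos < *n && a[pos] <= value)
        pos++;

    /* Shift the tail up by one slot (regions overlap, so memmove). */
    memmove(&a[pos + 1], &a[pos], (*n - pos) * sizeof a[0]);
    a[pos] = value;
    (*n)++;
    return 0;
}
```

Inserting 3, 1 and 2 into an empty array of capacity 4 leaves it as 1, 2, 3 without a single heap allocation, whereas a linked list would allocate one node per insertion.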
non-PCI OHCI (STBxxxx)
Hi,

I'm using an IBM STB04xxx-based system (PPC405 core with Set-Top-Box MPEG functions as well as some other stuff), which has an integrated OHCI controller I'd like to use. Obviously there is no PCI bus on that system, so it's not as simple as it could be.

I hacked a 2.4.19-preX OHCI driver to use consistent_alloc and consistent_sync instead of the PCI functions, with a hardcoded base address and IRQ. This worked (a bit), but was very unstable and stopped working completely with -rc3. I don't know what exactly changed, since I'm very new to USB, and I'm really unable to debug what's going wrong, for example on "device not accepting new address", since this already requires working USB transfers.

A bit disappointed by the comments in ohci.h, which state that it's "not so easy" to use non-PCI OHCI controllers, I looked into the SA case, but it didn't help me much and seems to require huge hacks (for example, they emulate the PCI functions... or is that the way to go?).

I then tried to use a 2.5 kernel. The OHCI code is well structured there, so I made an ohci-ocp.c and hacked the use of the PCI functions again. The result was working USB support, but somewhere there's still an error, as there is some data inconsistency. For example, I burned an audio CD, and it contained noise about every second. When I read a FAT disk, there are random "invalid cluster chain" error messages. Maybe some cache problem. I don't know and, as I said, I'm unable to debug this further without help :(

So I'm asking: Is there any standard approach to this? Maybe there's already a patch floating around? If someone from Monta Vista is reading this: is this going to be supported? If not: how is this going to be handled? What exactly are the issues regarding consistent_alloc and consistent_sync versus their PCI variants? Is it maybe possible to USE the PCI functions with some dummy PCI device?

If I understand correctly, consistent_alloc allocates contiguous, non-pageable memory which is directly mapped to bus addresses, and consistent_sync flushes all write-back caches (if TODEVICE) or invalidates them (if FROMDEVICE). Is this correct? Does pci_pool_alloc allocate compatible memory? Does pci_map_single do "nothing more" (functionally, if we don't look at other, more complex hardware/bridges) than a consistent_sync and bus2virt?

And finally: I usually ioremap() to use hardware memory. What's the difference between using ioremap() and using bus2virt and virt2bus? Are they deprecated? Are they only possible after an ioremap? Or is kernel memory always mapped to bus addresses which can be retrieved using virt2bus?

Thanks in advance,
Felix Domke
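The understanding of consistent_sync stated in the message can be written down as a small model. This is only a restatement of that description in C, not the kernel's implementation: the enum and function names are hypothetical, and the bidirectional case (write back and invalidate) is an added assumption that the message does not mention.

```c
/* DMA direction as seen from the CPU (hypothetical names). */
enum dma_dir {
    DIR_TO_DEVICE,      /* CPU wrote the buffer, device will read it  */
    DIR_FROM_DEVICE,    /* device will write the buffer, CPU reads it */
    DIR_BIDIRECTIONAL   /* both (assumption, not in the message)      */
};

/* Cache maintenance needed on a CPU without cache-coherent DMA. */
enum cache_op {
    OP_WRITEBACK,              /* push dirty lines out to memory      */
    OP_INVALIDATE,             /* drop lines so reads refetch memory  */
    OP_WRITEBACK_INVALIDATE    /* do both                             */
};

/* Map a DMA direction to the cache operation described in the message. */
enum cache_op cache_op_for(enum dma_dir dir)
{
    switch (dir) {
    case DIR_TO_DEVICE:
        return OP_WRITEBACK;
    case DIR_FROM_DEVICE:
        return OP_INVALIDATE;
    default:
        return OP_WRITEBACK_INVALIDATE;
    }
}
```

Seen this way, intermittent corruption such as noise on a burned CD is the kind of symptom a missing or misplaced invalidate on a FROM_DEVICE buffer could produce, which fits the "maybe some cache problem" suspicion in the message without confirming it.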