Thanks a lot to Alan for this suggestion. I think it makes sense to simulate scatter-gather in the driver for this case. I'll try it later and expect to see improved performance.
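
A rough sketch of the copy step I have in mind, written as plain userspace C so it can be compiled and checked outside the kernel. It is only an illustration under stated assumptions: `struct seg`, `gather_to_bounce()` and `scatter_from_bounce()` are hypothetical names I made up here, not kernel APIs, and the 64K buffer size is simply the upper figure Alan mentioned.

```c
#include <stddef.h>
#include <string.h>

/* Size of the preallocated linear buffer (assumption: 64K, per Alan's mail). */
#define BOUNCE_SIZE (64 * 1024)

/* Hypothetical stand-in for one scatter-gather segment of a request. */
struct seg {
    void  *buf;   /* start of this segment in memory */
    size_t len;   /* length of this segment in bytes */
};

/* The single linear buffer the hardware would actually transfer to/from. */
static unsigned char bounce[BOUNCE_SIZE];

/*
 * Write path: copy all segments, in order, into the linear bounce buffer.
 * Returns the total number of bytes gathered, or -1 if they do not fit.
 */
static long gather_to_bounce(const struct seg *segs, size_t nsegs)
{
    size_t off = 0;

    for (size_t i = 0; i < nsegs; i++) {
        if (segs[i].len > BOUNCE_SIZE - off)
            return -1;              /* request larger than the buffer */
        memcpy(bounce + off, segs[i].buf, segs[i].len);
        off += segs[i].len;
    }
    return (long)off;
}

/*
 * Read path: after the hardware has filled the bounce buffer, copy the
 * data back out to the individual segments, in order.
 * Returns the total number of bytes scattered, or -1 if they do not fit.
 */
static long scatter_from_bounce(const struct seg *segs, size_t nsegs)
{
    size_t off = 0;

    for (size_t i = 0; i < nsegs; i++) {
        if (segs[i].len > BOUNCE_SIZE - off)
            return -1;
        memcpy(segs[i].buf, bounce + off, segs[i].len);
        off += segs[i].len;
    }
    return (long)off;
}
```

In the real driver I would expect to let the block layer build the scatterlist and use the kernel's sg_copy_to_buffer() and sg_copy_from_buffer() helpers for the copy instead of hand-written loops, but I have not tried that yet.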
>-----Original Message-----
>From: Alan Cox [mailto:a...@lxorguk.ukuu.org.uk]
>Sent: April 13, 2010 23:21
>To: Gao, Yunpeng
>Cc: James Bottomley; Martin K. Petersen; Robert Hancock;
>linux-...@vger.kernel.org; linux-mmc@vger.kernel.org
>Subject: Re: How to make kernel block layer generate bigger request in the
>request queue?
>
>> And I just curious why the block layer does not merge these contiguous
>> sectors into one single request? For example, if the block layer generate
>> 'start_sect: 48776, nsect: 64, rw: r' instead of below requests, I think
>> the performance will be better.
>
>You said earlier "My hardware doesn't support scatter/gather"
>
>> start_sect: 48776, nsect: 8, rw: r
>> start_sect: 48784, nsect: 8, rw: r
>> start_sect: 48792, nsect: 8, rw: r
>> start_sect: 48800, nsect: 8, rw: r
>> start_sect: 48808, nsect: 8, rw: r
>> start_sect: 48816, nsect: 8, rw: r
>> start_sect: 48824, nsect: 8, rw: r
>> start_sect: 48832, nsect: 8, rw: r
>
>Print the bus address of each request and you will probably find they are
>not contiguous so they have not been merged because your hardware could
>not do that transfer and you have no IOMMU.
>
>If the overhead per command is really really huge you can preallocate an
>internal buffer of say 32K or 64K in your driver and tell the block layer
>you do scatter gather, then copy the buffers into a linear chunk. I'd be
>very surprised if that was a win overall on any vaguely sane hardware but
>flash with erase block overhead and the like might be one of the less
>sane cases.
>
>Alan
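
To check my understanding of Alan's explanation, here is a deliberately simplified model of the merge condition for hardware without scatter/gather. It is not the block layer's actual logic: `struct req`, its `bus_addr` field and `can_merge_without_sg()` are hypothetical names for illustration, and a 512-byte sector is assumed. The idea is that two requests can only become one transfer if they are adjacent both on disk and in bus address space.

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SIZE 512u   /* assumption: 512-byte sectors */

/* Hypothetical, simplified view of one request. */
struct req {
    uint64_t start_sect;   /* first sector on the device */
    uint32_t nsect;        /* number of sectors */
    uint64_t bus_addr;     /* bus address of the data buffer */
};

/*
 * Without scatter/gather (and without an IOMMU), request b can only be
 * appended to request a if b starts where a ends on the device AND b's
 * buffer starts where a's buffer ends in bus address space.
 */
static bool can_merge_without_sg(const struct req *a, const struct req *b)
{
    bool sectors_adjacent = a->start_sect + a->nsect == b->start_sect;
    bool memory_adjacent =
        a->bus_addr + (uint64_t)a->nsect * SECTOR_SIZE == b->bus_addr;

    return sectors_adjacent && memory_adjacent;
}
```

With this model, the requests in my log are all adjacent on disk (48776 + 8 = 48784, and so on), so the deciding factor would be whether their buffers are also adjacent in memory, which is what printing the bus addresses should show.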