On Mon, 2006-12-11 at 10:50 +0100, Jens Axboe wrote:
> On Sun, Dec 10 2006, Ming Zhang wrote:
> > Today I use blktrace observe a strange (at least to me) behavior at
> > block layer. Wonder if anybody can shed some lights? Thanks.
> >
> > Here is the detail.
> >
> > ... previous requests are ok.
> >
> > 8,16 0 782 7.025277381 4915 Q W 6768 + 32 [istiod1]
> > 8,16 0 783 7.025283850 4915 G W 6768 + 32 [istiod1]
> > 8,16 0 784 7.025286799 4915 P R [istiod1]
> > 8,16 0 785 7.025287794 4915 I W 6768 + 32 [istiod1]
> >
> > Write request to lba 6768 was inserted to the queue.
> >
> > 8,16 0 786 7.026059876 4915 Q R 6768 + 32 [istiod1]
> > 8,16 0 787 7.026064451 4915 G R 6768 + 32 [istiod1]
> > 8,16 0 788 7.026066369 4915 I R 6768 + 32 [istiod1]
> >
> > Read request to same lba was inserted to the queue as well. though it
> > can not be merged, i thought it can be satisfied by previous write
> > request directly. seems merge function does not consider this.
>
> That is the job of the upper layers, typically the page cache. For this
> scenario to take place, you must be using raw or O_DIRECT. And in that
> case, it is the job of the application to ensure proper ordering of
> requests.

ic. i assumed blkio should take responsibility on this as well. so i am
wrong.

>
> > 8,16 0 789 7.034883766 0 UT R [swapper] 2
> > 8,16 0 790 7.034904284 9 U R [kblockd/0] 2
> >
> > Unplug because of a read.
> >
> > 8,16 0 791 7.045272094 9 D R 6768 + 32 [kblockd/0]
> > 8,16 0 792 7.045654039 9 C R 6768 + 32 [0]
> >
> > Strangely, read request was sent to device before write request and thus
> > return a wrong data.
>
> Linux doesn't guarantee any request ordering for O_DIRECT io.

so this means it can be inserted front and back. and no fixed order?

>
> > 8,16 0 793 7.045669809 9 D W 6768 + 32 [kblockd/0]
> > 8,16 0 794 7.049840970 0 C W 6768 + 32 [0]
> >
> > Write finished.
> >
> > So read get a wrong data back to application. one thing not sure is
> > where (front/back) the request are insert into queue and who mess up the
> > order here.
>
> There is no mess up, you are making assumptions that aren't valid.
>
> > Is it possible for I event, we can know the extra flag, so we know where
> > it is inserted.
>
> That would be too expensive, as we have to peek inside the io scheduler
> queue. So no.

see http://lxr.linux.no/source/block/elevator.c?v=2.6.18#L341, here we
generate insert event and we know where already. so exporting that flag is
not expensive.

>
> > ---- is the code to generate this io -----. disk is a regular disk and
> > current scheduler is CFQ.
>
> Ah ok, so you are doing this inside the kernel. If you want to ensure
> write ordering, then you need to mark the request as a barrier.
>
> submit_bio(rw | (1 << BIO_RW_BARRIER), bio);

we tried that. if we mark a write request as barrier, we lose half
performance. if we mark it as BIO_RW_SYNC, it is almost no change. though
i still need to figure out the reason of that half performance loss
compared with BIO_RW_SYNC

>
> I wont comment on your design, but it seems somewhat strange - why are
> you doing this in the kernel? What is the segment switching doing?

we are writing an iscsi target in kernel level.
which segment switching u meant?

>
> BTW, this mail really isn't about blktrace, it probably should have been
> sent to the linux-kernel list. You wouldn't send a vmstat observed
> problem to the vmstat list, would you? :-)
>

make sense. not next time. thx.
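
On the barrier flag discussed above: the rw argument to submit_bio() is a
bit mask, so the barrier bit has to be combined with a bitwise OR (|). A
logical OR (||) would collapse the whole expression to 0 or 1 and silently
drop the barrier bit. Below is a minimal user-space sketch of the
difference. The DEMO_* names and their values are assumptions for
illustration only, modelled on 2.6.18-era kernel headers; they are not the
kernel's own definitions, so check include/linux/bio.h and
include/linux/fs.h in your tree.

```c
#include <assert.h>

/* Hypothetical stand-ins for the kernel's READ/WRITE direction values. */
enum { DEMO_READ = 0, DEMO_WRITE = 1 };

/* Hypothetical stand-in for BIO_RW_BARRIER: a bit position, not a mask. */
enum { DEMO_BIO_RW_BARRIER = 2 };

/* Bitwise OR: keeps the direction bit and sets the barrier bit. */
static int demo_flags_bitwise(int rw)
{
    assert(rw == DEMO_READ || rw == DEMO_WRITE);
    return rw | (1 << DEMO_BIO_RW_BARRIER);
}

/* Logical OR: the result is only ever 0 or 1, so the barrier bit is lost. */
static int demo_flags_logical(int rw)
{
    assert(rw == DEMO_READ || rw == DEMO_WRITE);
    return rw || (1 << DEMO_BIO_RW_BARRIER);
}

/* Returns 1 if the barrier bit is set in flags, 0 otherwise. */
static int demo_has_barrier(int flags)
{
    return (flags & (1 << DEMO_BIO_RW_BARRIER)) != 0;
}
```

With these stand-in values, demo_flags_bitwise(DEMO_WRITE) gives 5 (write
bit plus barrier bit), while demo_flags_logical(DEMO_WRITE) gives 1, which
is a plain write with no barrier.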
