On Mon, Dec 11 2006, Ming Zhang wrote:
> On Mon, 2006-12-11 at 15:32 +0100, Jens Axboe wrote:
> > On Mon, Dec 11 2006, Ming Zhang wrote:
> > > On Mon, 2006-12-11 at 10:50 +0100, Jens Axboe wrote:
> > > > On Sun, Dec 10 2006, Ming Zhang wrote:
> > > > > Today I used blktrace to observe a strange (at least to me)
> > > > > behavior at the block layer. I wonder if anybody can shed some
> > > > > light? Thanks.
> > > > >
> > > > > Here is the detail.
> > > > >
> > > > > ... previous requests are ok.
> > > > >
> > > > > 8,16  0  782  7.025277381  4915  Q  W 6768 + 32 [istiod1]
> > > > > 8,16  0  783  7.025283850  4915  G  W 6768 + 32 [istiod1]
> > > > > 8,16  0  784  7.025286799  4915  P  R [istiod1]
> > > > > 8,16  0  785  7.025287794  4915  I  W 6768 + 32 [istiod1]
> > > > >
> > > > > A write request to LBA 6768 was inserted into the queue.
> > > > >
> > > > > 8,16  0  786  7.026059876  4915  Q  R 6768 + 32 [istiod1]
> > > > > 8,16  0  787  7.026064451  4915  G  R 6768 + 32 [istiod1]
> > > > > 8,16  0  788  7.026066369  4915  I  R 6768 + 32 [istiod1]
> > > > >
> > > > > A read request to the same LBA was inserted into the queue as
> > > > > well. Though it cannot be merged, I thought it could be satisfied
> > > > > directly from the previous write request. It seems the merge
> > > > > function does not consider this.
> > > >
> > > > That is the job of the upper layers, typically the page cache. For
> > > > this scenario to take place, you must be using raw or O_DIRECT.
> > > > And in that case, it is the job of the application to ensure
> > > > proper ordering of requests.
> > >
> > > I see. I assumed the block I/O layer should take responsibility for
> > > this as well, so I was wrong.
> > >
> > > > > 8,16  0  789  7.034883766     0  UT R [swapper] 2
> > > > > 8,16  0  790  7.034904284     9  U  R [kblockd/0] 2
> > > > >
> > > > > Unplug because of a read.
> > > > > 8,16  0  791  7.045272094     9  D  R 6768 + 32 [kblockd/0]
> > > > > 8,16  0  792  7.045654039     9  C  R 6768 + 32 [0]
> > > > >
> > > > > Strangely, the read request was sent to the device before the
> > > > > write request, and thus returned wrong data.
> > > >
> > > > Linux doesn't guarantee any request ordering for O_DIRECT io.
> > >
> > > So this means it can be inserted front or back, with no fixed
> > > order?
> >
> > It'll be sort inserted like any other request. That might be front,
> > it might be back, or it might be somewhere in the middle.
>
> I see, so no special treatment here.
Nope. In fact the block layer and io scheduler do not know that this is
an O_DIRECT request, the bio originates from the same path as any other
regular fs request.

> > > > > 8,16  0  793  7.045669809     9  D  W 6768 + 32 [kblockd/0]
> > > > > 8,16  0  794  7.049840970     0  C  W 6768 + 32 [0]
> > > > >
> > > > > Write finished.
> > > > >
> > > > > So the read returned wrong data to the application. One thing I
> > > > > am not sure about is where (front/back) the requests were
> > > > > inserted into the queue and who messed up the order here.
> > > >
> > > > There is no mess up, you are making assumptions that aren't
> > > > valid.
> > > >
> > > > > Is it possible that for the I event we could know the extra
> > > > > flag, so we know where it is inserted?
> > > >
> > > > That would be too expensive, as we would have to peek inside the
> > > > io scheduler queue. So no.
> > >
> > > See http://lxr.linux.no/source/block/elevator.c?v=2.6.18#L341, here
> > > we generate the insert event and we already know where. So
> > > exporting that flag is not expensive.
> >
> > Maybe we are not talking about the same thing - which flag do you
> > mean? Do you mean the 'where' position? It'll be
> > ELEVATOR_INSERT_FRONT for basically any request, unless the issuer
> > specifically asked for BACK or FRONT. Those are only used in the
> > kernel, or for non-fs requests like SG_IO generated ones. So I don't
> > think the flag will add very much information that isn't already
> > given.
>
> I see. That spawns another question: why almost always
> ELEVATOR_INSERT_FRONT here? Why not a FIFO queue? Or does the later
> unplug drain from the end? I forget the details.

Typo, it was supposed to say ELEVATOR_INSERT_SORT!

> > > > > ---- is the code that generates this io -----. The disk is a
> > > > > regular disk and the current scheduler is CFQ.
> > > >
> > > > Ah ok, so you are doing this inside the kernel. If you want to
> > > > ensure write ordering, then you need to mark the request as a
> > > > barrier.
> > > >
> > > >     submit_bio(rw | (1 << BIO_RW_BARRIER), bio);
> > >
> > > We tried that. If we mark a write request as a barrier, we lose
> > > half the performance. If we mark it as BIO_RW_SYNC, there is almost
> > > no change. Though I still need to figure out the reason for that
> > > half performance loss compared with BIO_RW_SYNC.
> >
> > You lose a lot of performance for writes, as Linux will then also
> > ensure ordering at the drive level. It does so since just ordering in
> > the kernel makes little sense, if you allow the drive to reorder at
> > will anyway. It is possible to control the two parameters, but not
> > from the bio level. If you mark the bio BIO_RW_BARRIER, then that
> > will get marked SOFT and HARD barrier in the io scheduler. A soft
> > barrier has ordering ensured inside the kernel, a hard barrier has
> > ordering in the kernel and at the drive side as well.
> >
> > BIO_RW_SYNC doesn't imply any ordering constraints, it just tells the
> > kernel to make sure that we don't stall plugging the queue.
>
> I see, thanks for the explanation. So BIO_RW_SYNC just unplugs the
> queue while BIO_RW_BARRIER will ensure the order. Then in the worst
> case, BIO_RW_SYNC will lead to data inconsistency if two overlapping
> writes come in and their order is reversed.

Again, the consistency is in the care of the issuer. For regular file
system io, the page cache will give you this consistency. If you are
issuing bio's directly, you have to take care of this yourself.

> > > > I won't comment on your design, but it seems somewhat strange -
> > > > why are you doing this in the kernel? What is the segment
> > > > switching doing?
> > >
> > > We are writing an iSCSI target at the kernel level.
> >
> > One already exists :-)
>
> Oh? Which one? Do you mean IET or STGT? We are checking whether we can
> add another iomode to IET. Some people have fast storage and prefer to
> bypass the page cache completely.

I haven't tracked which projects exist, a SCSI target was just merged
the other day.
And I know that there is at least one iSCSI implementation that is
looked after by some good and experienced Linux kernel people.

--
Jens Axboe

-
To unsubscribe from this list: send the line "unsubscribe linux-btrace" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
