On Mon, 2006-12-11 at 15:32 +0100, Jens Axboe wrote:
> On Mon, Dec 11 2006, Ming Zhang wrote:
> > On Mon, 2006-12-11 at 10:50 +0100, Jens Axboe wrote:
> > > On Sun, Dec 10 2006, Ming Zhang wrote:
> > > > Today I used blktrace to observe a strange (at least to me) behavior
> > > > at the block layer. I wonder if anybody can shed some light on it?
> > > > Thanks.
> > > > 
> > > > Here are the details.
> > > > 
> > > > ... previous requests are ok.
> > > > 
> > > >   8,16   0      782     7.025277381  4915  Q   W 6768 + 32 [istiod1]
> > > >   8,16   0      783     7.025283850  4915  G   W 6768 + 32 [istiod1]
> > > >   8,16   0      784     7.025286799  4915  P   R [istiod1]
> > > >   8,16   0      785     7.025287794  4915  I   W 6768 + 32 [istiod1]
> > > > 
> > > > A write request to LBA 6768 was inserted into the queue.
> > > > 
> > > >   8,16   0      786     7.026059876  4915  Q   R 6768 + 32 [istiod1]
> > > >   8,16   0      787     7.026064451  4915  G   R 6768 + 32 [istiod1]
> > > >   8,16   0      788     7.026066369  4915  I   R 6768 + 32 [istiod1]
> > > > 
> > > > A read request to the same LBA was inserted into the queue as well.
> > > > Though it cannot be merged, I thought it could be satisfied directly
> > > > from the previous write request. It seems the merge function does not
> > > > consider this.
> > > 
> > > That is the job of the upper layers, typically the page cache. For this
> > > scenario to take place, you must be using raw or O_DIRECT. And in that
> > > case, it is the job of the application to ensure proper ordering of
> > > requests.
> > 
> > I see. I assumed the block I/O layer would take responsibility for this
> > as well, so I was wrong.
> > 
> > > 
> > > >   8,16   0      789     7.034883766     0 UT   R [swapper] 2
> > > >   8,16   0      790     7.034904284     9  U   R [kblockd/0] 2
> > > > 
> > > > Unplug because of a read.
> > > > 
> > > >   8,16   0      791     7.045272094     9  D   R 6768 + 32 [kblockd/0]
> > > >   8,16   0      792     7.045654039     9  C   R 6768 + 32 [0]
> > > > 
> > > > Strangely, the read request was sent to the device before the write
> > > > request and thus returned wrong data.
> > > 
> > > Linux doesn't guarantee any request ordering for O_DIRECT io.
> > 
> > So this means it can be inserted at the front or the back, and there is
> > no fixed order?
> 
> It'll be sort-inserted like any other request. That might be at the front,
> it might be at the back, or it might be somewhere in the middle.

I see, so there is no special treatment here.
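
In that case I guess the ordering has to come from the caller itself. A
minimal sketch of what I mean, assuming both bios have already been fully
set up by us (the helper names are mine, not from any existing code): wait
for the write to complete before queueing the overlapping read.

#include <linux/bio.h>
#include <linux/completion.h>
#include <linux/fs.h>

/* 2.6.18-style end_io: only signal once the whole bio has completed. */
static int ordered_end_io(struct bio *bio, unsigned int bytes_done, int error)
{
        if (bio->bi_size)
                return 1;               /* partial completion, keep waiting */
        complete((struct completion *)bio->bi_private);
        return 0;
}

/* Submit the write, wait for it, then submit the overlapping read. */
static void write_then_read(struct bio *wbio, struct bio *rbio)
{
        DECLARE_COMPLETION(done);

        wbio->bi_end_io = ordered_end_io;
        wbio->bi_private = &done;

        submit_bio(WRITE, wbio);
        wait_for_completion(&done);     /* write has completed at the block layer */

        submit_bio(READ, rbio);         /* the read can no longer pass the write */
}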

> 
> > > >   8,16   0      793     7.045669809     9  D   W 6768 + 32 [kblockd/0]
> > > >   8,16   0      794     7.049840970     0  C   W 6768 + 32 [0]
> > > > 
> > > > Write finished.
> > > > 
> > > > So the read got wrong data back to the application. One thing I am
> > > > not sure about is where (front/back) the requests are inserted into
> > > > the queue and what messed up the order here.
> > > 
> > > There is no mess up, you are making assumptions that aren't valid.
> > > 
> > > > For the I event, is it possible to expose the extra flag, so we know
> > > > where it is inserted?
> > > 
> > > That would be too expensive, as we have to peek inside the io scheduler
> > > queue. So no.
> > 
> > See http://lxr.linux.no/source/block/elevator.c?v=2.6.18#L341: here we
> > generate the insert event and we already know where, so exporting that
> > flag is not expensive.
> 
> Maybe we are not talking about the same thing - which flag do you mean?
> Do you mean the 'where' position? It'll be ELEVATOR_INSERT_SORT for
> basically any request, unless the issuer specifically asked for BACK or
> FRONT. Those are only used in the kernel, or for non-fs requests like
> SG_IO generated ones. So I don't think the flag will add very much
> information that isn't already given.

I see. That spawns another question: why is it almost always
ELEVATOR_INSERT_SORT here? Why not a FIFO queue? Or does a later unplug
dispatch from the end? I forgot the details.
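
For reference, the spot I was pointing at looks roughly like this in 2.6.18
(a simplified sketch from memory, not a verbatim copy of block/elevator.c):
the 'where' value is sitting right next to the trace hook, and the common
path for fs requests is the SORT case.

void elv_insert(request_queue_t *q, struct request *rq, int where)
{
        blk_add_trace_rq(q, rq, BLK_TA_INSERT); /* 'where' is already in hand */

        switch (where) {
        case ELEVATOR_INSERT_FRONT:
                list_add(&rq->queuelist, &q->queue_head);
                break;
        case ELEVATOR_INSERT_BACK:
                list_add_tail(&rq->queuelist, &q->queue_head);
                break;
        case ELEVATOR_INSERT_SORT:
                /* normal fs requests: handed to the io scheduler to sort */
                q->elevator->ops->elevator_add_req_fn(q, rq);
                break;
        }
}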

> 
> > > > ---- is the code to generate this I/O ----. The disk is a regular
> > > > disk and the current scheduler is CFQ.
> > > 
> > > Ah ok, so you are doing this inside the kernel. If you want to ensure
> > > write ordering, then you need to mark the request as a barrier.
> > > 
> > >         submit_bio(rw | (1 << BIO_RW_BARRIER), bio);
> > 
> > We tried that: if we mark a write request as a barrier, we lose half the
> > performance. If we mark it as BIO_RW_SYNC, there is almost no change.
> > Though I still need to figure out the reason for that 50% performance
> > loss compared with BIO_RW_SYNC.
> 
> You lose a lot of performance for writes, as Linux will then also ensure
> ordering at the drive level. It does so since just ordering in the
> kernel makes little sense, if you allow the drive to reorder at will
> anyway. It is possible to control the two parameters, but not from the
> bio level. If you mark the bio BIO_RW_BARRIER, then that will get marked
> SOFT and HARD barrier in the io scheduler. A soft barrier has ordering
> ensured inside the kernel, a hard barrier has ordering in the kernel and
> at the drive side as well.
> 
> BIO_RW_SYNC doesn't imply any ordering constraints, it just tells the
> kernel to make sure that we don't stall plugging the queue.

I see, thanks for the explanation. So BIO_RW_SYNC just unplugs the queue,
while BIO_RW_BARRIER ensures the ordering. Then in the worst case,
BIO_RW_SYNC can lead to data inconsistency if two overlapping writes come
in and their order is reversed.
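
So on our side the choice would look roughly like this (just a sketch; the
helper name is mine, 'rw' starts out as READ or WRITE, and the bio is
assumed to be fully built):

#include <linux/bio.h>
#include <linux/fs.h>

/* Sketch: trade ordering against latency per request (2.6.18-era flags). */
static void submit_our_bio(int rw, struct bio *bio, int need_ordering)
{
        if (need_ordering)
                /* soft+hard barrier: ordered in the elevator and at the drive */
                rw |= (1 << BIO_RW_BARRIER);
        else
                /* no ordering constraint, just avoid the plug delay */
                rw |= (1 << BIO_RW_SYNC);

        submit_bio(rw, bio);
}

That would match the numbers above: only the barrier variant pays for the
drive-level ordering.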


> 
> > > I won't comment on your design, but it seems somewhat strange - why are
> > > you doing this in the kernel? What is the segment switching doing?
> > 
> > We are writing an iSCSI target at the kernel level.
> 
> One already exists :-)

Oh? Which one? You mean IET or STGT? We are checking whether we can add
another iomode to IET. Some people have fast storage and prefer to bypass
the page cache completely.


> 
> > Which segment switching did you mean?
> 
> The set_fs() stuff around submit_bio() and friends.
> 

o, "segment switching" i know what u mean now. i will have a look, not
my code. ;)
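
For anyone following along, I believe the pattern in question is the usual
address-limit override around VFS calls that expect __user pointers,
something like the following (illustrative only, not the actual IET code):

#include <linux/fs.h>
#include <asm/uaccess.h>

/* Read into a kernel buffer through vfs_read(), which expects a __user
 * pointer, by temporarily widening the address limit ("segment switch"). */
static ssize_t kernel_read_example(struct file *filp, char *buf,
                                   size_t len, loff_t *pos)
{
        mm_segment_t old_fs = get_fs();
        ssize_t ret;

        set_fs(KERNEL_DS);
        ret = vfs_read(filp, (char __user *)buf, len, pos);
        set_fs(old_fs);                 /* always restore the old limit */

        return ret;
}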

