Re: blkdev in pagecache
On Wed, May 09, 2001 at 05:03:06PM +0200, Reto Baettig wrote:
> Jeff Garzik schrieb:
> >
> > Martin Dalecki wrote:
> > > > - I force the virtual blocksize for all the blkdev I/O
> > > >   (buffered and direct) to work with a 4096 bytes granularity instead of
> > >
> > > You mean PAGE_SIZE :-).
>
> Or maybe 8192 bytes on alphas ?!? ;-)

Again, see my argument with Jens: if we make it 8k we risk triggering lowlevel driver assumptions about b_size being <= 4k. At least on my alpha the fs has a 4k blocksize, and I don't think I have ever tested a b_size of 8k myself, so I didn't want to put too many unknown variables into the first equation ;).

Andrea

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: blkdev in pagecache
Jeff Garzik schrieb:
>
> Martin Dalecki wrote:
> > > - I force the virtual blocksize for all the blkdev I/O
> > >   (buffered and direct) to work with a 4096 bytes granularity instead of
> >
> > You mean PAGE_SIZE :-).

Or maybe 8192 bytes on alphas ?!? ;-)
Re: blkdev in pagecache
Andrea Arcangeli wrote:
>
> On Wed, May 09, 2001 at 11:13:33AM +0200, Martin Dalecki wrote:
> > > (buffered and direct) to work with a 4096 bytes granularity instead of
> >
> > You mean PAGE_SIZE :-).
>
> In my first patch it is really 4096 bytes, but yes I agree we should
> change that to PAGE_CACHE_SIZE. The _only_ reason it's 4096 fixed bytes is that
> I wasn't sure all the device drivers out there can digest a bh->b_size of
> 8k/32k/64k (for the non-x86 archs) and I checked the minimal PAGE_SIZE
> supported by linux is 4k. If Jens says I can submit 64k b_size without
> any problem for all the relevant blkdevices then I will change that in a
> jiffy ;). Anyway changing that is truly easy, just define
> BUFFERED_BLOCKSIZE to PAGE_CACHE_SIZE instead of 4096 (plus the .._BITS as
> well) and it should do the trick automatically. So for now I only cared
> to make it easy to change that.
>
> > Exactly, please see my former explanation... BTW: if you are going into
> > the range of PAGE_SIZE, it may very well be possible to remove the
> > whole page-associated mechanisms of a buffer_head?
>
> It wouldn't be that trivial to drop it, not much different from dropping
> it when a fs has a 4k blocksize. I think the dynamic allocation of the
> bh is not that bad a thing, or at least it's an orthogonal problem to
> moving the blkdev into the pagecache ;).

I think the only people who will have a hard time with this will be IBM's AS/390 people, and maybe a far fainter pile of problems will arise in the lvm and raid code... As I stated already, especially the AS/390 people are the ones most confused about the blksize_size vs. hardsect_size vs. bh->b_size semantics.

	find /usr/src/linux -exec grep blksize_size /dev/null {} \;

shows this fine, as does the corresponding BLOCK_SIZE redefinition in the lvm.h file! Well, not much worth caring about I think... (It will just *force* them to write cleaner code 8-).

> > Basically this is something which should come down to the strategy
> > routine of the corresponding device and be fixed there... And then we have this
>
> so you mean the device driver should make sure blk_size is PAGE_CACHE_SIZE
> aligned and take care of writing zeroes into the pagecache beyond the end
> of the device? That would be fine from my part but I'm not yet sure
> that's the cleanest manner to handle that.

Yes, that's about it. We *can* afford to expect that access beyond the end of a device is handled as an exception and not by checks beforehand. This should greatly simplify the main code...

> > Some notes about the code:
> >
> >         kdev_t dev = inode->i_rdev;
> > -       struct buffer_head * bh, *bufferlist[NBUF];
> > -       register char * p;
> > +       int err;
> >
> > -       if (is_read_only(dev))
> > -               return -EPERM;
> > +       err = -EIO;
> > +       if (iblock >= (blk_size[MAJOR(dev)][MINOR(dev)] >>
> >                        (BUFFERED_BLOCKSIZE_BITS - BLOCK_SIZE_BITS)))
> >                        ^
> > blk_size[MAJOR(dev)] can very well be equal NULL! In this case one is
> > supposed to assume blk_size[MAJOR(dev)][MINOR(dev)] to be INT_MAX.
> > Are you sure it's guaranteed here to be already preset?
> >
> > Same question goes for calc_end_index and calc_rsize.
>
> that's a bug indeed (a minor one at least, because all the relevant
> blkdevices initialize that array, and if it's not initialized you notice
> before you can make any damage ;), thanks for pointing it out!

This kind of slippery problem is the reason for the last tiny encapsulation patch I sent to Linus and Alan (for inclusion into 2.4.5).
Re: blkdev in pagecache
On Wed, May 09 2001, Andrea Arcangeli wrote:
> On Wed, May 09, 2001 at 04:14:52PM +0200, Jens Axboe wrote:
> > better to stay with PAGE_CACHE_SIZE and not get into dark country :-)
>
> My whole point for not using PAGE_CACHE_SIZE as virtual blocksize is
> that many architectures have a PAGE_CACHE_SIZE > 4k, up to 64k; on
> x86-64 we may even hack a 2M PAGE_SIZE/PAGE_CACHE_SIZE mode for the
> multi giga boxes. I think you agreed I'd better stay with a virtual
> blocksize of 4k fixed for now.

In that case, then yes, leaving it as a hardcoded 4k would be preferred.

--
Jens Axboe
Re: blkdev in pagecache
On Wed, May 09, 2001 at 04:14:52PM +0200, Jens Axboe wrote:
> better to stay with PAGE_CACHE_SIZE and not get into dark country :-)

My whole point for not using PAGE_CACHE_SIZE as the virtual blocksize is that many architectures have a PAGE_CACHE_SIZE > 4k, up to 64k; on x86-64 we may even hack a 2M PAGE_SIZE/PAGE_CACHE_SIZE mode for the multi giga boxes. I think you agreed I'd better stay with a virtual blocksize of 4k fixed for now.

Andrea
Re: blkdev in pagecache
On Wed, May 09 2001, Andrea Arcangeli wrote:
> On Wed, May 09, 2001 at 11:13:33AM +0200, Martin Dalecki wrote:
> > > (buffered and direct) to work with a 4096 bytes granularity instead of
> >
> > You mean PAGE_SIZE :-).
>
> In my first patch it is really 4096 bytes, but yes I agree we should
> change that to PAGE_CACHE_SIZE. The _only_ reason it's 4096 fixed bytes is that
> I wasn't sure all the device drivers out there can digest a bh->b_size of
> 8k/32k/64k (for the non-x86 archs) and I checked the minimal PAGE_SIZE
> supported by linux is 4k. If Jens says I can submit 64k b_size without
> any problem for all the relevant blkdevices then I will change that in a
> jiffy ;). Anyway changing that is truly easy, just define

On IDE it should at least be possible; it can handle single segment entries as big as 64kB for DMA. But apart from that, I think it's a lot better to stay with PAGE_CACHE_SIZE and not get into dark country :-)

--
Jens Axboe
Re: blkdev in pagecache
On Wed, May 09, 2001 at 11:13:33AM +0200, Martin Dalecki wrote:
> > (buffered and direct) to work with a 4096 bytes granularity instead of
>
> You mean PAGE_SIZE :-).

In my first patch it is really 4096 bytes, but yes, I agree we should change that to PAGE_CACHE_SIZE. The _only_ reason it's 4096 fixed bytes is that I wasn't sure all the device drivers out there can digest a bh->b_size of 8k/32k/64k (for the non-x86 archs), and I checked that the minimal PAGE_SIZE supported by linux is 4k. If Jens says I can submit a 64k b_size without any problem to all the relevant blkdevices then I will change that in a jiffy ;). Anyway, changing it is truly easy: just define BUFFERED_BLOCKSIZE to PAGE_CACHE_SIZE instead of 4096 (plus the .._BITS as well) and it should do the trick automatically. So for now I only cared to make it easy to change.

> Exactly, please see my former explanation... BTW: if you are going into
> the range of PAGE_SIZE, it may very well be possible to remove the
> whole page-associated mechanisms of a buffer_head?

It wouldn't be that trivial to drop it, not much different from dropping it when a fs has a 4k blocksize. I think the dynamic allocation of the bh is not that bad a thing, or at least it's an orthogonal problem to moving the blkdev into the pagecache ;).

> Basically this is something which should come down to the strategy
> routine of the corresponding device and be fixed there... And then we have this

So you mean the device driver should make sure blk_size is PAGE_CACHE_SIZE aligned and take care of writing zeroes into the pagecache beyond the end of the device? That would be fine from my part, but I'm not yet sure that's the cleanest manner to handle it.

> Some notes about the code:
>
>         kdev_t dev = inode->i_rdev;
> -       struct buffer_head * bh, *bufferlist[NBUF];
> -       register char * p;
> +       int err;
>
> -       if (is_read_only(dev))
> -               return -EPERM;
> +       err = -EIO;
> +       if (iblock >= (blk_size[MAJOR(dev)][MINOR(dev)] >>
>                        (BUFFERED_BLOCKSIZE_BITS - BLOCK_SIZE_BITS)))
>                        ^
> blk_size[MAJOR(dev)] can very well be equal NULL! In this case one is
> supposed to assume blk_size[MAJOR(dev)][MINOR(dev)] to be INT_MAX.
> Are you sure it's guaranteed here to be already preset?
>
> Same question goes for calc_end_index and calc_rsize.

That's a bug indeed (a minor one at least, because all the relevant blkdevices initialize that array, and if it's not initialized you notice before you can make any damage ;), thanks for pointing it out!

Andrea
Re: blkdev in pagecache
Martin Dalecki wrote:
> > - I force the virtual blocksize for all the blkdev I/O
> >   (buffered and direct) to work with a 4096 bytes granularity instead of
>
> You mean PAGE_SIZE :-).

Or maybe PAGE_CACHE_SIZE?

--
Jeff Garzik      | Game called on account of naked chick
Building 1024    |
MandrakeSoft     |
Re: blkdev in pagecache
Andrea Arcangeli wrote:
> (btw, also the current rawio uses a 512byte bh->b_size granularity that is even
> worse than the 1024byte b_size of the blkdev, O_DIRECT is much smarter
> on this side as it uses the softblocksize of the fs that can be as well
> 4k if you created the fs with -b 4096)

Amen to this. The differentiation between blksize_size and hardsect_size in linux is:

a) not quite useful, since blksize_size isn't in reality a property of the device but more a property of the actually mounted file system.

b) very confusing... see my last patch about ReiserFS, and please have a look at the AS/390 code, which basically got *very* confused about the semantics of blksize_size()

> I'll describe here some of the details of the blkdev-pagecache-1 patch:
>
> - /dev/raw* and drivers/char/raw.c gets obsoleted and replaced by

HURRA! Great stuff!

>   opening the blkdevice with O_DIRECT, it looks much saner and I
>   basically get it for free by just implementing 10 lines of the
>   blkdev_direct_IO callback, of course I didn't removed the /dev/raw*
>   API for compatibility.

PLEASE REMOVE IT AS SOON AS POSSIBLE! It's a really insane API just for ORACLE tuning, and well, most oracle deployers don't run on /dev/raw*, at least not under Linux, where it basically doesn't give you any real performance gains... Or at least one could make /dev/raw* a configure option and a module.

> - I force the virtual blocksize for all the blkdev I/O
>   (buffered and direct) to work with a 4096 bytes granularity instead of

You mean PAGE_SIZE :-).

>   the current 1024 softblocksize because we need that for getting higher
>   performance, 1024 is too low because it wastes too much ram and too
>   much cpu. So a DBMS won't be able anymore to write 512bytes to the

Exactly, please see my former explanation... BTW: if you are going into the range of PAGE_SIZE, it may very well be possible to remove the whole page-associated mechanisms of a buffer_head?

>   disk using rawio being sure it will be a single atomic block update.
>   If you use /dev/raw nothing changed of course, only opening blkdev
>   with O_DIRECT enforce a minimal granularity of 4096 bytes in the I/O.
>   I don't think this is a problem, and also O_DIRECT through the fs was
>   just using the fs softblocksize instead of the hardblocksize as unit
>   of the minimal direct-IO granularity.
>
> - writes to the blockdevice won't end in the buffer cache, so it will
>   be impossible to update the superblock of an ext2 partition mounted ro
>   for example, it must not be mounted at all to update the superblock, I
>   will need to invent an hack to fix this problem or it will get too
>   annoying. One way could simply to change ext2 and have it checking
>   the buffer to be uptodate before marking it dirty again but maybe
>   we could also do it in a generic manner that fixes all the fs at once
>   (OTOH probably not that many fs needs to be fscked online...).
>
> - mmap should be functional but it's totally untested.
>
> - currently the last `harddisk_size & 4095' bytes (if any) won't be
>   accessible via the blkdev, to avoid sending to the hardware requests
>   beyond the end of the device. Not sure how/if to solve this. But this is
>   definitely not a new issue, the same thing happens today in 2.2 and
>   2.4 after you mount a 4k filesystem on a blockdevice. OTOH I'm scared
>   a mke2fs -b 1024 could get confused. But I really don't want to
>   decrease the b_size of the buffer header even if we fix this.

Basically this is something which should come down to the strategy routine of the corresponding device and be fixed there... And then we have this gross blk_size check in ll_rw_block.c.

Some notes about the code:

        kdev_t dev = inode->i_rdev;
-       struct buffer_head * bh, *bufferlist[NBUF];
-       register char * p;
+       int err;

-       if (is_read_only(dev))
-               return -EPERM;
+       err = -EIO;
+       if (iblock >= (blk_size[MAJOR(dev)][MINOR(dev)] >>
                       (BUFFERED_BLOCKSIZE_BITS - BLOCK_SIZE_BITS)))
                       ^
blk_size[MAJOR(dev)] can very well be equal NULL! In this case one is supposed to assume blk_size[MAJOR(dev)][MINOR(dev)] to be INT_MAX. Are you sure it's guaranteed here to be already preset?

Same question goes for calc_end_index and calc_rsize.

+               goto out;

-       written = write_error = buffercount = 0;
-       blocksize = BLOCK_SIZE;
-       if (blksize_size[MAJOR(dev)] && blksize_size[MAJOR(dev)][MINOR(dev)])
-               blocksize = blksize_size[MAJOR(dev)][MINOR(dev)];
blkdev in pagecache
This night I moved the blkdev layer in pagecache in this patch: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.5pre1/blkdev-pagecache-1 It is incremental and depends on the o_direct functionality, latest o_direct patch against 2.4.5pre1 is here: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.5pre1/o_direct-5 The main reasons I moved the blkdev in pagecaches is that the current blkdev provides horrible performance with fast I/O subsystem capable of over 50mbyte/sec that I just increased x2 with a simple hack that you can see here if you're curious: ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.5pre1aa2/00_4k_block_dev-1 (btw, also the current rawio uses a 512byte bh->b_size granularity that is even worse than the 1024byte b_size of the blkdev, O_DIRECT is much smarter on this side as it uses the softblocksize of the fs that can be as well 4k if you created the fs with -b 4096) However after running this 4k_block_dev-1 hack on some more machine I noticed the blkdev layer wasn't able anymore to update the superblock of 1k ext2 filesystems and to make it "usable" in real life I needed to fix it. But I didn't wanted ot invest any further time on such an hack and I preferred to move the blkdev in pagecache and to fix the problem on top of the new better design (moving blkdev in pagecache of course introduces that same problem too as I also mentioned in one of the below points). I'll describe here some of the details of the blkdev-pagecache-1 patch: - /dev/raw* and drivers/char/raw.c gets obsoleted and replaced by opening the blkdevice with O_DIRECT, it looks much saner and I basically get it for free by just implementing 10 lines of the blkdev_direct_IO callback, of course I didn't removed the /dev/raw* API for compatibility. While testing O_DIRECT I destroyed the first 50mbyte of the root partition so I will need to wait the test box to return alive before I can make further testing ;). 
But I fixed the bug that caused the corruption before uploading the patch, so I don't expect further problems (it was only a s/i_dev/i_rdev/ thing); the regression testing was working well even though it was writing to the wrong disk ;).

- I force the virtual blocksize for all blkdev I/O (buffered and direct) to work with a 4096-byte granularity instead of the current 1024-byte softblocksize, because we need that for higher performance: 1024 is too low, as it wastes too much RAM and too much CPU. So a DBMS will no longer be able to write 512 bytes to the disk using rawio and be sure it is a single atomic block update. If you use /dev/raw nothing changed, of course; only opening the blkdev with O_DIRECT enforces a minimal granularity of 4096 bytes in the I/O. I don't think this is a problem, and O_DIRECT through the fs was also just using the fs softblocksize instead of the hardblocksize as the unit of the minimal direct-I/O granularity.

- writes to the blockdevice won't end up in the buffer cache, so it will be impossible to update the superblock of an ext2 partition mounted ro, for example; it must not be mounted at all to update the superblock. I will need to invent a hack to fix this problem or it will get too annoying. One way could simply be to change ext2 to check that the buffer is uptodate before marking it dirty again, but maybe we could also do it in a generic manner that fixes all the filesystems at once (OTOH probably not that many filesystems need to be fscked online...).

- mmap should be functional, but it's totally untested.

- currently the last `harddisk_size & 4095' bytes (if any) won't be accessible via the blkdev, to avoid sending the hardware requests beyond the end of the device. Not sure how/if to solve this, but it is definitely not a new issue: the same thing happens today in 2.2 and 2.4 after you mount a 4k filesystem on a blockdevice. OTOH I'm scared a mke2fs -b 1024 could get confused.
But I really don't want to decrease the b_size of the buffer header even if we fix this.

- to share all the filemap.c code and not change too much stuff in the first patch, I added some ISBLK checks in fast paths, basically only to check against blk_size instead of inode->i_size. I also considered changing the i_size semantics for the blkdev inodes, but I didn't want to break all the filesystems yet, so I took the localized, slower way for now (I doubt it is noticeable in benchmarks, but nevertheless it would be nice to optimize away those branches).

- once the blkdev is closed, in the block_close callback I run filemap_fdatasync; fsync_dev; filemap_fdatawait; invalidate_inode_pages2 (the fdatawait seems not necessary, but it won't hurt). I'm not calling truncate_inode_pages because those pages could still be mapped (->release is called when f_count goes down to zero, not when i_count reaches zero). I'd like to defer
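The close-time ordering described in this last point, using the 2.4-era names from the mail, would look roughly like this (a non-compilable sketch of the idea only, not the literal patch hunk; the exact signatures may differ):

```c
/* Sketch only: flush ordering at blkdev close, per the description above. */
static int block_close(struct inode *inode, struct file *filp)
{
	struct address_space *mapping = inode->i_mapping;

	filemap_fdatasync(mapping);	/* start writeback of dirty pagecache */
	fsync_dev(inode->i_rdev);	/* flush the buffer-cache side too */
	filemap_fdatawait(mapping);	/* wait for the I/O (probably redundant) */
	invalidate_inode_pages2(mapping); /* drop the pages; still-mapped pages
					   * survive, which is why
					   * truncate_inode_pages can't be
					   * used here */
	return 0;
}
```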