Re: blkdev in pagecache

2001-05-09 Thread Andrea Arcangeli

On Wed, May 09, 2001 at 05:03:06PM +0200, Reto Baettig wrote:
> Jeff Garzik wrote:
> > 
> > Martin Dalecki wrote:
> > > > - I force the virtual blocksize for all the blkdev I/O
> > > >   (buffered and direct) to work with a 4096 bytes granularity instead of
> > >
> > > You mean PAGE_SIZE :-).
> 
> Or maybe 8192 bytes on alphas ?!? ;-)

Again, see my argument with Jens: if we make it 8k we risk triggering
lowlevel driver assumptions about b_size being <= 4k. At least on my
alpha the fs has a 4k blocksize, and I don't think I have ever tested
a b_size of 8k myself, so I didn't want to put too many unknown
variables into the first equation ;).

Andrea



Re: blkdev in pagecache

2001-05-09 Thread Reto Baettig

Jeff Garzik wrote:
> 
> Martin Dalecki wrote:
> > > - I force the virtual blocksize for all the blkdev I/O
> > >   (buffered and direct) to work with a 4096 bytes granularity instead of
> >
> > You mean PAGE_SIZE :-).

Or maybe 8192 bytes on alphas ?!? ;-)



Re: blkdev in pagecache

2001-05-09 Thread Martin Dalecki

Andrea Arcangeli wrote:
> 
> On Wed, May 09, 2001 at 11:13:33AM +0200, Martin Dalecki wrote:
> > >   (buffered and direct) to work with a 4096 bytes granularity instead of
> >
> > You mean PAGE_SIZE :-).
> 
> In my first patch it is really 4096 bytes, but yes, I agree we should
> change that to PAGE_CACHE_SIZE. The _only_ reason it's a fixed 4096 bytes is
> that I wasn't sure all the device drivers out there can digest a bh->b_size
> of 8k/32k/64k (for the non-x86 archs), and I checked that the minimal
> PAGE_SIZE supported by Linux is 4k. If Jens says I can submit a 64k b_size
> without any problem for all the relevant blkdevices then I will change that
> in a jiffy ;). Anyway, changing that is truly easy: just define
> BUFFERED_BLOCKSIZE to be PAGE_CACHE_SIZE instead of 4096 (plus the .._BITS
> as well) and it should do the trick automatically. So for now I only cared
> to make it easy to change.
> 
> > Exactly, please see my former explanation... BTW: if you are going into
> > the range of PAGE_SIZE, it may very well be possible to remove the
> > whole page-associated mechanisms of a buffer_head?
> 
> It wouldn't be that trivial to drop it; not much different from dropping
> it when a fs has a 4k blocksize. I think the dynamic allocation of the
> bh is not that bad a thing, or at least it's an orthogonal problem to
> moving the blkdev into pagecache ;).

I think the only guys who will have a hard time with this will be IBM's
AS/390 people, and maybe a far fainter pile of problems will arise in the
lvm and raid code... As I stated already, especially the AS/390 people
are the ones most confused about the blksize_size vs. hardsect_size vs.
bh->b_size semantics.
find /usr/src/linux -exec grep blksize_size /dev/null {} \;
shows this nicely, as does the corresponding BLOCK_SIZE redefinition in
the lvm.h file! Not much worth caring about, I think... (It will just
*force* them to write cleaner code 8-).

> 
> > Basically this is something which should come down to the strategy routine
> > of the corresponding device and be fixed there... And then we have this
> 
> So you mean the device driver should make sure blk_size is PAGE_CACHE_SIZE
> aligned, and take care of writing zeroes into the pagecache beyond the end
> of the device? That would be fine by me, but I'm not yet sure
> that's the cleanest way to handle it.

Yes, that's about it. We *can* afford to treat access beyond the end of
a device as an exception to be handled there, rather than with checks
beforehand. This should greatly simplify the main code...

> 
> > Some notes about the code:
> >
> >   kdev_t dev = inode->i_rdev;
> > - struct buffer_head * bh, *bufferlist[NBUF];
> > - register char * p;
> > + int err;
> >
> > - if (is_read_only(dev))
> > - return -EPERM;
> > + err = -EIO;
> > + if (iblock >= (blk_size[MAJOR(dev)][MINOR(dev)] >>
> > (BUFFERED_BLOCKSIZE_BITS - BLOCK_SIZE_BITS)))
> >^
> >
> > blk_size[MAJOR(dev)] can very well be NULL! In this case one is
> > supposed to assume blk_size[MAJOR(dev)][MINOR(dev)] to be INT_MAX.
> > Are you sure it's guaranteed to be already set here?
> >
> > Same question goes for calc_end_index and calc_rsize.
> 
> That's a bug indeed (a minor one at least, because all the relevant
> blkdevices initialize that array, and if it isn't initialized you notice
> before you can do any damage ;), thanks for pointing it out!

This kind of problem slipping in is the reason for the last tiny
encapsulation patch I sent to Linus and Alan (for inclusion into 2.4.5).



Re: blkdev in pagecache

2001-05-09 Thread Jens Axboe

On Wed, May 09 2001, Andrea Arcangeli wrote:
> On Wed, May 09, 2001 at 04:14:52PM +0200, Jens Axboe wrote:
> > better to stay with PAGE_CACHE_SIZE and not get into dark country :-)
> 
> My whole point for not using PAGE_CACHE_SIZE as the virtual blocksize is
> that many architectures have a PAGE_CACHE_SIZE > 4k, up to 64k; on
> x86-64 we may even hack a 2M PAGE_SIZE/PAGE_CACHE_SIZE mode for the
> multi-giga boxes. I think you agreed I'd better stick to a fixed
> virtual blocksize of 4k for now.

In that case, then yes, leaving it as a hardcoded 4k would be preferred.

-- 
Jens Axboe




Re: blkdev in pagecache

2001-05-09 Thread Andrea Arcangeli

On Wed, May 09, 2001 at 04:14:52PM +0200, Jens Axboe wrote:
> better to stay with PAGE_CACHE_SIZE and not get into dark country :-)

My whole point for not using PAGE_CACHE_SIZE as the virtual blocksize is
that many architectures have a PAGE_CACHE_SIZE > 4k, up to 64k; on
x86-64 we may even hack a 2M PAGE_SIZE/PAGE_CACHE_SIZE mode for the
multi-giga boxes. I think you agreed I'd better stick to a fixed
virtual blocksize of 4k for now.

Andrea



Re: blkdev in pagecache

2001-05-09 Thread Jens Axboe

On Wed, May 09 2001, Andrea Arcangeli wrote:
> On Wed, May 09, 2001 at 11:13:33AM +0200, Martin Dalecki wrote:
> > >   (buffered and direct) to work with a 4096 bytes granularity instead of
> > 
> > You mean PAGE_SIZE :-).
> 
> In my first patch it is really 4096 bytes, but yes, I agree we should
> change that to PAGE_CACHE_SIZE. The _only_ reason it's a fixed 4096 bytes is
> that I wasn't sure all the device drivers out there can digest a bh->b_size
> of 8k/32k/64k (for the non-x86 archs), and I checked that the minimal
> PAGE_SIZE supported by Linux is 4k. If Jens says I can submit a 64k b_size
> without any problem for all the relevant blkdevices then I will change that
> in a jiffy ;). Anyway, changing that is truly easy: just define

On IDE at least it should be possible; it can handle single segment
entries as big as 64kB for DMA. But apart from that, I think it's a lot
better to stay with PAGE_CACHE_SIZE and not get into dark country :-)

-- 
Jens Axboe




Re: blkdev in pagecache

2001-05-09 Thread Andrea Arcangeli

On Wed, May 09, 2001 at 11:13:33AM +0200, Martin Dalecki wrote:
> >   (buffered and direct) to work with a 4096 bytes granularity instead of
> 
> You mean PAGE_SIZE :-).

In my first patch it is really 4096 bytes, but yes, I agree we should
change that to PAGE_CACHE_SIZE. The _only_ reason it's a fixed 4096 bytes is
that I wasn't sure all the device drivers out there can digest a bh->b_size
of 8k/32k/64k (for the non-x86 archs), and I checked that the minimal
PAGE_SIZE supported by Linux is 4k. If Jens says I can submit a 64k b_size
without any problem for all the relevant blkdevices then I will change that
in a jiffy ;). Anyway, changing that is truly easy: just define
BUFFERED_BLOCKSIZE to be PAGE_CACHE_SIZE instead of 4096 (plus the .._BITS
as well) and it should do the trick automatically. So for now I only cared
to make it easy to change.
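
A minimal sketch of that switch, using the BUFFERED_BLOCKSIZE names from
the patch (the PAGE_CACHE_* variant is the hypothetical future form, not
what the patch ships):

    /* today: fixed 4k virtual blocksize for all blkdev I/O */
    #define BUFFERED_BLOCKSIZE_BITS 12
    #define BUFFERED_BLOCKSIZE      (1 << BUFFERED_BLOCKSIZE_BITS)

    /* the future form, once drivers are known to digest large b_size:
     * #define BUFFERED_BLOCKSIZE_BITS PAGE_CACHE_SHIFT
     * #define BUFFERED_BLOCKSIZE      PAGE_CACHE_SIZE
     */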

> Exactly, please see my former explanation... BTW: if you are going into
> the range of PAGE_SIZE, it may very well be possible to remove the
> whole page-associated mechanisms of a buffer_head?

It wouldn't be that trivial to drop it; not much different from dropping
it when a fs has a 4k blocksize. I think the dynamic allocation of the
bh is not that bad a thing, or at least it's an orthogonal problem to
moving the blkdev into pagecache ;).

> Basically this is something which should come down to the strategy
> routine
> of the corresponding device and be fixed there... And then we have this

So you mean the device driver should make sure blk_size is PAGE_CACHE_SIZE
aligned, and take care of writing zeroes into the pagecache beyond the end
of the device? That would be fine by me, but I'm not yet sure
that's the cleanest way to handle it.

> Some notes about the code:
> 
>   kdev_t dev = inode->i_rdev;
> - struct buffer_head * bh, *bufferlist[NBUF];
> - register char * p;
> + int err;
>  
> - if (is_read_only(dev))
> - return -EPERM;
> + err = -EIO;
> + if (iblock >= (blk_size[MAJOR(dev)][MINOR(dev)] >>
> (BUFFERED_BLOCKSIZE_BITS - BLOCK_SIZE_BITS)))
>^
> 
> blk_size[MAJOR(dev)] can very well be NULL! In this case one is
> supposed to assume blk_size[MAJOR(dev)][MINOR(dev)] to be INT_MAX.
> Are you sure it's guaranteed to be already set here?
> 
> Same question goes for calc_end_index and calc_rsize.

That's a bug indeed (a minor one at least, because all the relevant
blkdevices initialize that array, and if it isn't initialized you notice
before you can do any damage ;), thanks for pointing it out!
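
A defensive lookup along these lines would cover the uninitialized case
(a sketch only; the helper name is hypothetical, not from the patch):

    /* needs <linux/blkdev.h> for blk_size[], <linux/kernel.h> for INT_MAX */
    static inline int blkdev_size_kb(kdev_t dev)
    {
        /* no blk_size table registered for this major: treat the
         * device size as unknown, i.e. INT_MAX (in 1k units) */
        if (!blk_size[MAJOR(dev)])
            return INT_MAX;
        return blk_size[MAJOR(dev)][MINOR(dev)];
    }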

Andrea



Re: blkdev in pagecache

2001-05-09 Thread Jeff Garzik

Martin Dalecki wrote:
> > - I force the virtual blocksize for all the blkdev I/O
> >   (buffered and direct) to work with a 4096 bytes granularity instead of
> 
> You mean PAGE_SIZE :-).

Or maybe PAGE_CACHE_SIZE?

-- 
Jeff Garzik  | Game called on account of naked chick
Building 1024|
MandrakeSoft |



Re: blkdev in pagecache

2001-05-09 Thread Martin Dalecki

Andrea Arcangeli wrote:

> (btw, the current rawio also uses a 512byte bh->b_size granularity, which
> is even worse than the 1024byte b_size of the blkdev; O_DIRECT is much
> smarter on this side as it uses the softblocksize of the fs, which can be
> 4k as well if you created the fs with -b 4096)

Amen to this. The differentiation between blksize_size and hardsect_size
in Linux is:
a) not quite useful, since blksize_size isn't in reality a property of
the device but more a property of the actually mounted filesystem;
b) very confusing... see my last patch about ReiserFS, and please
have a look at the AS/390 code, which basically got *very* confused
about the semantics of blksize_size.
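
For reference, these are the per-major tables being conflated, as
declared in 2.4's <linux/blkdev.h>, next to the per-buffer unit:

    extern int * blksize_size[MAX_BLKDEV];  /* soft blocksize, per minor */
    extern int * hardsect_size[MAX_BLKDEV]; /* hardware sector size, per minor */
    /* ...plus bh->b_size, the unit a single buffer_head does I/O in */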

> I'll describe here some of the details of the blkdev-pagecache-1 patch:
> 
> - /dev/raw* and drivers/char/raw.c gets obsoleted and replaced by

HURRA! Great stuff!

>   opening the blkdevice with O_DIRECT; it looks much saner, and I
>   basically get it for free by implementing just 10 lines of the
>   blkdev_direct_IO callback. Of course I didn't remove the /dev/raw*
>   API, for compatibility.

PLEASE REMOVE IT AS SOON AS POSSIBLE! It's a really insane API just for
Oracle tuning, and most Oracle deployers don't run on /dev/raw* anyway,
at least not under Linux, where it basically doesn't give you any real
performance gains... Or at least one could make /dev/raw* a configure
option and a module.
 
> - I force the virtual blocksize for all the blkdev I/O
>   (buffered and direct) to work with a 4096 bytes granularity instead of

You mean PAGE_SIZE :-).

>   the current 1024 softblocksize, because we need that for getting higher
>   performance; 1024 is too low because it wastes too much ram and too
>   much cpu. So a DBMS won't be able anymore to write 512 bytes to the

Exactly, please see my former explanation... BTW: if you are going into
the range of PAGE_SIZE, it may very well be possible to remove the
whole page-associated mechanisms of a buffer_head?

>   disk using rawio and be sure it will be a single atomic block update.
>   If you use /dev/raw nothing changed of course; only opening the blkdev
>   with O_DIRECT enforces a minimal granularity of 4096 bytes in the I/O.
>   I don't think this is a problem, and also O_DIRECT through the fs was
>   just using the fs softblocksize instead of the hardblocksize as the unit
>   of the minimal direct-IO granularity.
> 
> - writes to the blockdevice won't end up in the buffer cache, so it will
>   be impossible to update the superblock of an ext2 partition mounted ro,
>   for example; it must not be mounted at all to update the superblock. I
>   will need to invent a hack to fix this problem or it will get too
>   annoying. One way could simply be to change ext2 and have it check that
>   the buffer is uptodate before marking it dirty again, but maybe
>   we could also do it in a generic manner that fixes all the fs at once
>   (OTOH probably not that many fs need to be fscked online...).
> 
> - mmap should be functional but it's totally untested.
> 
> - currently the last `harddisk_size & 4095' bytes (if any) won't be
>   accessible via the blkdev, to avoid sending requests beyond the end
>   of the device to the hardware. Not sure how/if to solve this. But this
>   is definitely not a new issue; the same thing happens today in 2.2 and
>   2.4 after you mount a 4k filesystem on a blockdevice. OTOH I'm scared
>   that a mke2fs -b 1024 could get confused. But I really don't want to
>   decrease the b_size of the buffer header even if we fix this.

Basically this is something which should come down to the strategy routine
of the corresponding device and be fixed there... And then we have this
gross blk_size check in ll_rw_block.c.

Some notes about the code:

kdev_t dev = inode->i_rdev;
-   struct buffer_head * bh, *bufferlist[NBUF];
-   register char * p;
+   int err;
 
-   if (is_read_only(dev))
-   return -EPERM;
+   err = -EIO;
+   if (iblock >= (blk_size[MAJOR(dev)][MINOR(dev)] >>
(BUFFERED_BLOCKSIZE_BITS - BLOCK_SIZE_BITS)))
 ^

blk_size[MAJOR(dev)] can very well be NULL! In this case one is
supposed to assume blk_size[MAJOR(dev)][MINOR(dev)] to be INT_MAX.
Are you sure it's guaranteed to be already set here?

Same question goes for calc_end_index and calc_rsize.


+   goto out;
 
-   written = write_error = buffercount = 0;
-   blocksize = BLOCK_SIZE;
-   if (blksize_size[MAJOR(dev)] && blksize_size[MAJOR(dev)][MINOR(dev)])
-   blocksize = blksize_size[MAJOR(dev)][MINOR(dev)];



blkdev in pagecache

2001-05-08 Thread Andrea Arcangeli

Tonight I moved the blkdev layer into the pagecache with this patch:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.5pre1/blkdev-pagecache-1

It is incremental to, and depends on, the o_direct functionality; the
latest o_direct patch against 2.4.5pre1 is here:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/patches/v2.4/2.4.5pre1/o_direct-5

The main reason I moved the blkdev into the pagecache is that the current
blkdev layer provides horrible performance with fast I/O subsystems capable
of over 50mbyte/sec, which I just increased x2 with a simple hack that you
can see here if you're curious:


ftp://ftp.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.4/2.4.5pre1aa2/00_4k_block_dev-1

(btw, the current rawio also uses a 512byte bh->b_size granularity, which
is even worse than the 1024byte b_size of the blkdev; O_DIRECT is much
smarter on this side as it uses the softblocksize of the fs, which can be
4k as well if you created the fs with -b 4096)

However, after running this 4k_block_dev-1 hack on some more machines I
noticed the blkdev layer was no longer able to update the superblock of
1k ext2 filesystems, and to make it "usable" in real life I needed to fix
that. But I didn't want to invest any further time in such a hack, so I
preferred to move the blkdev into the pagecache and fix the problem on top
of the new, better design (moving the blkdev into the pagecache of course
introduces the same problem too, as I also mention in one of the points
below).

I'll describe here some of the details of the blkdev-pagecache-1 patch:

- /dev/raw* and drivers/char/raw.c get obsoleted and replaced by
  opening the blkdevice with O_DIRECT; it looks much saner, and I
  basically get it for free by implementing just 10 lines of the
  blkdev_direct_IO callback (see the sketch after this item). Of course
  I didn't remove the /dev/raw* API, for compatibility.

  While testing O_DIRECT I destroyed the first 50mbyte of the root
  partition, so I will need to wait for the test box to come back alive
  before I can do further testing ;). But I fixed the bug that caused the
  corruption just before uploading the patch, so I don't expect further
  problems (it was only a s/i_dev/i_rdev thing), because the regression
  testing was working well even while it was writing to the wrong disk ;).
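
  A sketch of what those ~10 lines could look like, assuming the
  generic_direct_IO() helper and a blkdev_get_block() from the o_direct
  patch (an illustration, not necessarily the code in the patch):

    static int blkdev_direct_IO(int rw, struct inode * inode,
                                struct kiobuf * iobuf,
                                unsigned long blocknr, int blocksize)
    {
        /* the blkdev pagecache maps 1:1 to the device, so the
         * generic kiobuf direct-IO engine can do all the work */
        return generic_direct_IO(rw, inode, iobuf, blocknr,
                                 blocksize, blkdev_get_block);
    }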

- I force the virtual blocksize for all the blkdev I/O
  (buffered and direct) to work with a 4096 bytes granularity instead of
  the current 1024 softblocksize, because we need that for getting higher
  performance; 1024 is too low because it wastes too much ram and too
  much cpu. So a DBMS won't be able anymore to write 512 bytes to the
  disk using rawio and be sure it will be a single atomic block update.
  If you use /dev/raw nothing changed of course; only opening the blkdev
  with O_DIRECT enforces a minimal granularity of 4096 bytes in the I/O.
  I don't think this is a problem, and also O_DIRECT through the fs was
  just using the fs softblocksize instead of the hardblocksize as the unit
  of the minimal direct-IO granularity.

- writes to the blockdevice won't end up in the buffer cache, so it will
  be impossible to update the superblock of an ext2 partition mounted ro,
  for example; it must not be mounted at all to update the superblock. I
  will need to invent a hack to fix this problem or it will get too
  annoying. One way could simply be to change ext2 and have it check that
  the buffer is uptodate before marking it dirty again, but maybe
  we could also do it in a generic manner that fixes all the fs at once
  (OTOH probably not that many fs need to be fscked online...).

- mmap should be functional but it's totally untested.

- currently the last `harddisk_size & 4095' bytes (if any) won't be
  accessible via the blkdev, to avoid sending requests beyond the end
  of the device to the hardware. Not sure how/if to solve this. But this
  is definitely not a new issue; the same thing happens today in 2.2 and
  2.4 after you mount a 4k filesystem on a blockdevice. OTOH I'm scared
  that a mke2fs -b 1024 could get confused. But I really don't want to
  decrease the b_size of the buffer header even if we fix this (a worked
  example of the rounding follows this item).
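
  The worked example referenced above (the numbers are illustrative
  only, not from the patch):

    /* blk_size[] counts 1k units (BLOCK_SIZE_BITS == 10) and the
     * virtual blocksize is 4k (BUFFERED_BLOCKSIZE_BITS == 12) */
    int size_kb   = 1001;                   /* device of 1001 KiB */
    int blocks_4k = size_kb >> (12 - 10);   /* = 250 4k blocks    */
    /* 250 * 4 KiB = 1000 KiB, so the trailing 1 KiB is exactly the
     * `harddisk_size & 4095' remainder that stays unreachable */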

- to share all the filemap.c code and not change too much stuff in
  the first patch, I added some ISBLK checks in fast paths, basically
  only to check against blk_size instead of inode->i_size. I also
  considered changing the i_size semantics for the blkdev inodes, but
  I didn't want to break all the fs yet, so I took the localized,
  slower way for now (I doubt it is noticeable in benchmarks, but
  nevertheless it would be nice to optimize away those branches; see
  the sketch after this item).
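
  Illustratively, the localized check amounts to something like this in
  the filemap.c fast paths (a sketch; the patch's exact code may differ):

    loff_t size;

    if (S_ISBLK(inode->i_mode) && blk_size[MAJOR(inode->i_rdev)])
        /* blk_size[] is in 1k units, so shift up to bytes */
        size = (loff_t) blk_size[MAJOR(inode->i_rdev)]
                                [MINOR(inode->i_rdev)] << BLOCK_SIZE_BITS;
    else
        size = inode->i_size;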

- once the blkdev is closed, in the block_close callback I do
  filemap_fdatasync; fsync_dev; filemap_fdatawait; invalidate_inode_pages2
  (the fdatawait seems not necessary, but it won't hurt; see the sketch
  after this item). I'm not calling truncate_inode_pages because those
  pages could still be mapped (->release is called when f_count goes down
  to zero, not when i_count reaches zero). I'd like to defer
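
  A sketch of that close-time sequence, assuming the 2.4 signatures of
  the four calls named above (the exact callback in the patch may differ):

    static int block_close(struct inode * inode, struct file * filp)
    {
        filemap_fdatasync(inode->i_mapping);    /* start pagecache writeback  */
        fsync_dev(inode->i_rdev);               /* flush the old buffer cache */
        filemap_fdatawait(inode->i_mapping);    /* the paranoid wait          */
        invalidate_inode_pages2(inode->i_mapping); /* drop unmapped pages     */
        return 0;
    }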
