Re: DVD blockdevice buffers

2001-06-01 Thread Pavel Machek

Hi!

> I can easily give more examples - just ask. BTW, the fact that this stuff
> is so fragmented is not a bug - we want it evenly spread over disk, just
> to have the ability to allocate a block/inode not too far from the piece
> of bitmap we'll need to modify.

BTW is this still true? This assumes that a long seek takes more time than
a short seek. With a 12,000 rpm drive, one rotation takes 5 msec. "Full" seek
is around 12 msec these days, no?

Pavel
-- 
Philips Velo 1: 1"x4"x8", 300gram, 60, 12MB, 40bogomips, linux, mutt,
details at http://atrey.karlin.mff.cuni.cz/~pavel/velo/index.html.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/




Re: DVD blockdevice buffers

2001-05-25 Thread Linus Torvalds



On 25 May 2001, Eric W. Biederman wrote:
>
> I obviously picked a bad name, and a bad place to start.
> int data_uptodate(struct page *page, unsigned offset, unsigned len)
>
> This is really an extension to PG_uptodate, not readpage.

Ugh.

The above is just horrible.

It doesn't fix any problems, it is only an ugly work-around for a
situation that never happens in real life. An application that only
re-reads the data that it just wrote itself is a _stupid_ application, and
I'm absolutely not interested in having a new interface that is useless
for everything _but_ such a stupid application.

Linus




Re: DVD blockdevice buffers

2001-05-25 Thread Eric W. Biederman

Linus Torvalds <[EMAIL PROTECTED]> writes:

> On 25 May 2001, Eric W. Biederman wrote:
> > 
> > For the small random read case we could use a 
> > mapping->a_ops->readpartialpage 
> 
> No, if so I'd prefer to just change "readpage()" to take the same kinds of
> arguments commit_page() does, namely the beginning and end of the read
> area. 

No.

I obviously picked a bad name, and a bad place to start.
int data_uptodate(struct page *page, unsigned offset, unsigned len)

This is really an extension to PG_uptodate, not readpage.  It should
never ever do any I/O.  It should just implement a check to see
whether all of the wanted data is already in the page in the page
cache.  As simply a buffer-checking entity it will likely share
virtually zero code with readpage.

> Filesystems could choose to ignore the arguments completely, and just act
> the way they already do - filling in the whole page.
> 
> OR a filesystem might know that the page is partially up-to-date (because
> of a partial write), and just return an immediate "this area is already
> uptodate" return code or something. Or it could even fill in the page
> partially, and just unlock it (but not mark it up-to-date: the reader then
> has to wait for the page and then look at PG_error to decide whether the
> partial read succeeded or not).

First, mm/filemap.c has the generic cache management, so it should make the
decision.

The logic is: does this page have the data in cache?
If so, just return it.

Otherwise, read all that you can at once.

So we either want a virtual function that can decide on a per-filesystem
basis whether we have the data we need in the page cache,
or we need to convert the buffer_head into a more generic entity
so everyone can use it.

> I don't think it really matters, I have to say. It would be very easy to
> implement (all the buffer-based filesystems already use the common
> fs/buffer.c readpage, so it would really need changes in just one place,
> along with some expanded prototypes with ignored arguments in some other
> places).
> 
> But it _could_ be a performance helper for some strange loads (write a
> partial page and immediately read it back - what a stupid program), and
> more importantly Al Viro felt earlier that a "partial read" approach might
> help his metadata-in-page-cache stuff because metadata tends to sometimes
> be scattered wildly across the disk.

Maybe.  I think that despite the similarities (partial pages) Al and I are
looking at two entirely different problems.

> So then we'd have
> 
>   int (*readpage)(struct file *, struct page *, unsigned offset, unsigned len);
> 
> 
> and the semantics would be:
>  - the function needs to start IO for _at_least_ the page area
>[offset, offset+len[
>  - return error code for _immediate_ errors (ie not asynchronous)
>  - if there was an asynchronous read error, we set PG_error
>  - if the page is fully populated, we set PG_uptodate
>  - if the page was not fully populated, but the partial read succeeded,
>the filesystem needs to have some way of keeping track of the partial
>success ("page->buffers" is obviously the way for a block-based one),
>and must _not_ set PG_uptodate.
>  - after the asynchronous operation (whether complete, partial or
>unsuccessful), the page is unlocked to tell the reader that it is done.
> 
> Now, this would be coupled with:
>  - generic_file_read() does the read-ahead decisions, and may decide that
>we really only need a partial page.
> 
> But NOTE! The above is meant to potentially avoid unnecessary IO and thus
> speed up the read-in. HOWEVER, it _will_ slow down the case where we first
> would read a small part of the page and then soon afterwards read in the
> rest of the page. I suspect that is the common case by far, and that the
> current whole-page approach is the faster one in 99% of all cases. So I'm
> not at all convinced that the above is actually worth it.

I don't want partial I/O at all.  And I always want to see reads
reading in all of the data for a page.  I just want an interface
where we can say: hey, we don't actually have to do any I/O for this
read request, just give them back their data.

> If somebody can show that the above is worth it and worth implementing (ie
> the Al Viro kind of "I have a real-life scenario where I'd like to use
> it"), and implements it (should be a fairly trivial exercise), then I'll
> happily accept new semantics like this.
> 
> But I do _not_ want to see another new function ("partialread()"), and I
> do _not_ want to see synchronous interfaces (Al's first suggestion).

My naming mistake.  I don't want to see this logic combined with
readpage.  That is an entirely different case.

I can't see how adding a slow case to PageUptodate to check for a
partially uptodate page could hurt our performance.  And I can imagine
how it could help.

Eric

Re: blkdev-pagecache-2 [was Re: DVD blockdevice buffers]

2001-05-25 Thread Andrea Arcangeli

On Fri, May 25, 2001 at 10:12:51PM +0200, Andrea Arcangeli wrote:
>   
>ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.4.5pre6/blkdev-pagecache-2
   (that should be v2.4, not v2.2 - sorry)



blkdev-pagecache-2 [was Re: DVD blockdevice buffers]

2001-05-25 Thread Andrea Arcangeli

On Thu, May 24, 2001 at 12:32:20AM +0200, Andrea Arcangeli wrote:
> userspace. I will try to work on the blkdev patch tomorrow to bring it
> in an usable state.

It seems to be in a usable state now, but it is still very early beta; I
need to recheck the whole thing, and I will do that tomorrow. For now it
should correctly handle both fsck on a ro-mounted fs and the cache
coherency across multiple inodes all pointing to the same blkdev; it
actually worked without any problem in the first basic tests I did. However I
expect it to corrupt a rw-mounted fs if you open the blkdev under it
(the fsck test happens with the fs ro), so while it's in a usable state
it's not ready for public consumption yet. Of course ramdisk is still
totally broken too. The other first-round bugs mentioned in the first
thread should be fixed. The blocksize is still hardwired to 4k; I'll
think about the read-modify-write problem later. As for the proposed
readpage API change, I think it's not worthwhile for new hardware, where
reading 1k or 4k doesn't make a relevant difference. Handling partial
I/O seems worthwhile only during writes, because a partial write would
otherwise trigger a read-modify-write operation with a synchronous read.


ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.4.5pre6/blkdev-pagecache-2

Andrea



Re: DVD blockdevice buffers

2001-05-25 Thread Stephen C. Tweedie

Hi,

On Fri, May 25, 2001 at 02:24:52PM -0400, Alexander Viro wrote:

> If you are OK with adding two extra arguments to ->readpage() I could
> submit a patch replacing that with plain and simple page cache by tomorrow.
> It should not be a problem to port, but I want to get some sleep before
> testing it...

The problem will be returning the IO completion status.  We can't just
rely on PG_Error: what happens if two separate partial reads are
submitted at once within the same page, yet the page is not completely
in cache?  If we forced readpage to be synchronous in that case we
could just return the status directly.  Otherwise we need a separate
way of determining the completion status once the page becomes
unlocked (e.g. have a special readpage return value which means "all done,
completion status is X", and resubmit the readpage to get that
completion status once the page lock is dropped).

--Stephen




Re: DVD blockdevice buffers

2001-05-25 Thread Stephen C. Tweedie

Hi,

On Fri, May 25, 2001 at 09:09:37AM -0600, Eric W. Biederman wrote:

> The case we don't get quite right are partial reads that hit cached
> data, on a page that doesn't have PG_Uptodate set.  We don't actually
> need to do the I/O on the surrounding page to satisfy the read
> request.  But we do because generic_file_read doesn't even think about
> that case.

That's *precisely* the case in question.  The whole design of the page
cache involves reading entire pages at a time, in fact.  We _could_
read in only partial pages, but in that case we end up wasting a lot
of the page.

> For the small random read case we could use a
> mapping->a_ops->readpartialpage
> function that sees if a request can be satisfied entirely
> from cached data.  But this is just to allow generic_file_read
> to handle this case.

Agreed.  The only case where blockdev-in-pagecache really results in
significantly more IO is partial writes followed by partial reads.
Reads from totally-uncached pages ought to just fill the entire page
from disk; it's only when there is something already present
in the cache for that page that we want to look for partial buffers.

--Stephen



Re: DVD blockdevice buffers

2001-05-25 Thread Alexander Viro



On Fri, 25 May 2001, Linus Torvalds wrote:

> For example, I suspect that the metadata bitmaps in particular cache so
> well that the fact that we need to do several seeks over them every once
> in a while is a non-issue: we might be happier having the bitmaps in
> memory (and having simpler code), than try to avoid the occasional seeks.
>
> The "simpler code" argument in particular is, I think, a fairly strong
> one. Our current bitmap code is quite horrible, with multiple layers of
> caching (ext2 will explicitly hold references to some blocks, while at the
> same time depending on the buffer cache to cache the other blocks -
> nightmare)

Oh, current code is a complete mess - no arguments here. 8-element LRU.
Combined with the fact that directory allocation tries to get an even
distribution of directory inodes across cylinder groups, you blow that LRU
completely on a regular basis if your fs is larger than 16 cg. For a
1Kb-block fs that's 128Mb. For 4Kb - 2Gb. And the pain starts at half that
size.

If you are OK with adding two extra arguments to ->readpage() I could
submit a patch replacing that with plain and simple page cache by tomorrow.
It should not be a problem to port, but I want to get some sleep before
testing it...




Re: DVD blockdevice buffers

2001-05-25 Thread Linus Torvalds


On Fri, 25 May 2001, Alexander Viro wrote:
> 
> OK, here's a real-world scenario: inode table on 1Kb ext2 (or 4Kb on
> Alpha, etc.) consists of compact pieces - one per cylinder group.
> 
> There is a natural mapping from inodes to offsets in that beast.
> However, these pieces can trivially be not page-aligned. readpage()
> on a boundary of two pieces means large seek.

Yes.

But by "real-world" I mean "you can tell in real life".

I see the theoretical arguments for it. But I want to know that it makes a
real difference under real load.

For example, I suspect that the metadata bitmaps in particular cache so
well that the fact that we need to do several seeks over them every once
in a while is a non-issue: we might be happier having the bitmaps in
memory (and having simpler code), than try to avoid the occasional seeks.

The "simpler code" argument in particular is, I think, a fairly strong
one. Our current bitmap code is quite horrible, with multiple layers of
caching (ext2 will explicitly hold references to some blocks, while at the
same time depending on the buffer cache to cache the other blocks -
nightmare)

Linus




Re: DVD blockdevice buffers

2001-05-25 Thread Alexander Viro



On Fri, 25 May 2001, Linus Torvalds wrote:

> If somebody can show that the above is worth it and worth implementing (ie
> the Al Viro kind of "I have a real-life scenario where I'd like to use
> it"), and implements it (should be a fairly trivial exercise), then I'll
> happily accept new semantics like this.

OK, here's a real-world scenario: inode table on 1Kb ext2 (or 4Kb on
Alpha, etc.) consists of compact pieces - one per cylinder group.

There is a natural mapping from inodes to offsets in that beast.
However, these pieces can trivially be not page-aligned. readpage()
on a boundary of two pieces means large seek.

Another example (even funnier) is bitmaps. Same story, but here you
have 1Kb per cylinder group, and a group is 8Mb in that case. I.e. on Alpha
it means that readpage() will require 7 seeks, 8Mb each. And the worst
thing is that, unless we have corrupted free-inode counters, we _will_
find what we need in the first 1Kb chunk we look at.

I can easily give more examples - just ask. BTW, the fact that this stuff
is so fragmented is not a bug - we want it evenly spread over disk, just
to have the ability to allocate a block/inode not too far from the piece
of bitmap we'll need to modify.
Al
PS: Uff... OK, looking at the locking stuff in fs/super.c was useful - I've
found a way to do it that is seriously simpler than what I used to do.
Just let me torture it for a couple of hours - so far it looks fine...




Re: DVD blockdevice buffers

2001-05-25 Thread Linus Torvalds


On 25 May 2001, Eric W. Biederman wrote:
> 
> For the small random read case we could use a 
> mapping->a_ops->readpartialpage 

No, if so I'd prefer to just change "readpage()" to take the same kinds of
arguments commit_page() does, namely the beginning and end of the read
area. 

Filesystems could choose to ignore the arguments completely, and just act
the way they already do - filling in the whole page.

OR a filesystem might know that the page is partially up-to-date (because
of a partial write), and just return an immediate "this area is already
uptodate" return code or something. Or it could even fill in the page
partially, and just unlock it (but not mark it up-to-date: the reader then
has to wait for the page and then look at PG_error to decide whether the
partial read succeeded or not).

I don't think it really matters, I have to say. It would be very easy to
implement (all the buffer-based filesystems already use the common
fs/buffer.c readpage, so it would really need changes in just one place,
along with some expanded prototypes with ignored arguments in some other
places).

But it _could_ be a performance helper for some strange loads (write a
partial page and immediately read it back - what a stupid program), and
more importantly Al Viro felt earlier that a "partial read" approach might
help his metadata-in-page-cache stuff because metadata tends to sometimes
be scattered wildly across the disk.

So then we'd have

int (*readpage)(struct file *, struct page *, unsigned offset, unsigned len);

and the semantics would be:
 - the function needs to start IO for _at_least_ the page area
   [offset, offset+len[
 - return error code for _immediate_ errors (ie not asynchronous)
 - if there was an asynchronous read error, we set PG_error
 - if the page is fully populated, we set PG_uptodate
 - if the page was not fully populated, but the partial read succeeded,
   the filesystem needs to have some way of keeping track of the partial
   success ("page->buffers" is obviously the way for a block-based one),
   and must _not_ set PG_uptodate.
 - after the asynchronous operation (whether complete, partial or
   unsuccessful), the page is unlocked to tell the reader that it is done.

Now, this would be coupled with:
 - generic_file_read() does the read-ahead decisions, and may decide that
   we really only need a partial page.

But NOTE! The above is meant to potentially avoid unnecessary IO and thus
speed up the read-in. HOWEVER, it _will_ slow down the case where we first
would read a small part of the page and then soon afterwards read in the
rest of the page. I suspect that is the common case by far, and that the
current whole-page approach is the faster one in 99% of all cases. So I'm
not at all convinced that the above is actually worth it.

If somebody can show that the above is worth it and worth implementing (ie
the Al Viro kind of "I have a real-life scenario where I'd like to use
it"), and implements it (should be a fairly trivial exercise), then I'll
happily accept new semantics like this.

But I do _not_ want to see another new function ("partialread()"), and I
do _not_ want to see synchronous interfaces (Al's first suggestion).

Linus




Re: DVD blockdevice buffers

2001-05-25 Thread Eric W. Biederman

"Stephen C. Tweedie" <[EMAIL PROTECTED]> writes:

> Hi,
> 
> On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote:
>  
> > On Wed, 23 May 2001, Stephen C. Tweedie wrote:
> > > > that the filesystems already do. And you can do it a lot _better_ than the
> > > > current buffer-cache-based approach. Done right, you can actually do all
> > > > IO in page-sized chunks, BUT fall down on sector-sized things for the
> > > > cases where you want to.
> > >
> > > Right, but you still lose the caching in that case.  The write works,
> > > but the "cache" becomes nothing more than a buffer.
> > 
> > No. It is still cached. You find the buffer with "page->buffer", and when
> > all of them are up-to-date (whether from read-in or from having written
> > to them all), you just mark the whole page up-to-date.
> 
> It works, but *only* if the application writes a whole page worth of
> data.  From the previous emails I had the understanding that this
> application is writing small data items in random 512-byte blocks.  It
> is not writing the rest of the page.  The page never becomes uptodate.
> That in itself isn't a problem, but readpage() can't tell the
> underlying layers that only a part of the page is wanted, so there's
> no way to tell readpage that the page is in fact partially uptodate.
> 
> And just telling the application to write the rest of the page too
> isn't going to cut it, because the rest of the page may contain other
> objects which aren't in cache so we can't write them without first
> reading the page.  The only alternative is to change the on-disk
> layout, forcing a minimum PAGESIZE on the IO chunks.
> 
> > This _works_. Try it on ext2 or NFS today.
> 
> Not for this workload.  Now, maybe it's not an interesting workload.
> But shifting the uptodate granularity from buffer to page sized _does_
> impact the effectiveness of the cache for such an application. 
> 
> > So in short: the page cache supports _today_ all the optimizations.
> 
> For write, perhaps; but for subsequent read, generic_read_page
> doesn't see any of the data in the page unless the whole page has been
> written.

generic_read_page???

block_read_full_page seems to handle this correctly.  At least
with respect to keeping the data around, and not doing the I/O
on data we already have.  But it still reads in the unpopulated
parts of the page even if it is unnecessary.

The case we don't get quite right are partial reads that hit cached
data, on a page that doesn't have PG_Uptodate set.  We don't actually
need to do the I/O on the surrounding page to satisfy the read
request.  But we do because generic_file_read doesn't even think about
that case.

For the small random read case we could use a
mapping->a_ops->readpartialpage
function that sees if a request can be satisfied entirely
from cached data.  But this is just to allow generic_file_read
to handle this case.

Eric



Re: DVD blockdevice buffers

2001-05-25 Thread Eric W. Biederman

Stephen C. Tweedie [EMAIL PROTECTED] writes:

 Hi,
 
 On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote:
  
  On Wed, 23 May 2001, Stephen C. Tweedie wrote:
that the filesystems already do. And you can do it a lot _better_ than the
 
current buffer-cache-based approach. Done right, you can actually do all
IO in page-sized chunks, BUT fall down on sector-sized things for the
cases where you want to.
  
   Right, but you still lose the caching in that case.  The write works,
   but the cache becomes nothing more than a buffer.
  
  No. It is still cached. You find the buffer with page-buffer, and when
  all of them are up-to-date (whether from read-in or from having written
  to them all), you just mark the whole page up-to-date.
 
 It works, but *only* if the application writes a whole page worth of
 data.  From the previous emails I had the understanding that this
 application is writing small data items in random 512-byte blocks.  It
 is not writing the rest of the page.  The page never becomes uptodate.
 That in itself isn't a problem, but readpage() can't tell the
 underlying layers that only a part of the page is wanted, so there's
 no way to tell readpage that the page is in fact partially uptodate.
 
 And just telling the application to write the rest of the page too
 isn't going to cut it, because the rest of the page may contain other
 objects which aren't in cache so we can't write them without first
 reading the page.  The only alternative is to change the on-disk
 layout, forcing a minimum PAGESIZE on the IO chunks.
 
  This _works_. Try it on ext2 or NFS today.
 
 Not for this workload.  Now, maybe it's not an interesting workload.
 But shifting the uptodate granularity from buffer to page sized _does_
 impact the effectiveness of the cache for such an application. 
 
  So in short: the page cache supports _today_ all the optimizations.
 
 For write, perhaps; but for subsequent read, generic_read_page
 doesn't see any of the data in the page unless the whole page has been
 written.

generic_read_page???

block_read_full_page seems to handle this correctly.  At least
with respect to keeping the data around, and not doing the I/O
on data we already have.  But it still reads in the unpopulated
parts of the page even if it is unnecessary.

The case we don't get quite right are partial reads that hit cached
data, on a page that doesn't have PG_Uptodate set.  We don't actually
need to do the I/O on the surrounding page to satisfy the read
request.  But we do because generic_file_read doesn't even think about
that case.

For the small random read case we could use a 
mapping-a_ops-readpartialpage 
function that sees if a request can be satisfied entirely 
from cached data.  But this is just to allow generic_file_read
to handle this, case. 

Eric
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: DVD blockdevice buffers

2001-05-25 Thread Linus Torvalds


On 25 May 2001, Eric W. Biederman wrote:
 
 For the small random read case we could use a 
 mapping-a_ops-readpartialpage 

No, if so I'd prefer to just change readpage() to take the same kinds of
arguments commit_page() does, namely the beginning and end of the read
area. 

Filesystems could choose to ignore the arguments completely, and just act
the way they already do - filling in the whole page.

OR a filesystem might know that the page is partially up-to-date (because
of a partial write), and just return an immediate this area is already
uptodate return code or something. Or it could even fill in the page
partially, and just unlock it (but not mark it up-to-date: the reader then
has to wait for the page and then look at PG_error to decide whether the
partial read succeeded or not).

I don't think it really matters, I have to say. It would be very easy to
implement (all the buffer-based filesystems already use the common
fs/buffer.c readpage, so it would really need changes in just one place,
along with some expanded prototypes with ignored arguments in some other
places).

But it _could_ be a performance helper for some strange loads (write a
partial page and immediately read it back - what a stupid program), and
more importantly Al Viro felt earlier that a partial read approach might
help his metadata-in-page-cache stuff because metadata tends to sometimes
be scattered wildly across the disk.

So then we'd have

int (*readpage)(struct file *, struct page *, unsigned offset, unsigned len);

and the semantics would be:
 - the function needs to start IO for _at_least_ the page area
   [offset, offset+len[
 - return error code for _immediate_ errors (ie not asynchronous)
 - if there was an asynchronous read error, we set PG_error
 - if the page is fully populated, we set PG_uptodate
 - if the page was not fully populated, but the partial read succeeded,
   the filesystem needs to have some way of keeping track of the partial
   success (page-buffers is obviously the way for a block-based one),
   and must _not_ set PG_uptodate.
 - after the asynchronous operation (whether complete, partial or
   unsuccessful), the page is unlocked to tell the reader that it is done.

Now, this would be coupled with:
 - generic_file_read() does the read-ahead decisions, and may decide that
   we really only need a partial page.

But NOTE! The above is meant to potentially avoid unnecessary IO and thus
speed up the read-in. HOWEVER, it _will_ slow down the case where we first
would read a small part of the page and then soon afterwards read in the
rest of the page. I suspect that is the common case by far, and that the
current whole-page approach is the faster one in 99% of all cases. So I'm
not at all convinced that the above is actually worth it.

If somebody can show that the above is worth it and worth implementing (ie
the Al Viro kind of I have a real-life schenario where I'd like to use
it), and implements it (should be a fairly trivial exercise), then I'll
happily accept new semantics like this.

But I do _not_ want to see another new function (partialread()), and I
do _not_ want to see synchronous interfaces (Al's first suggestion).

Linus

-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: DVD blockdevice buffers

2001-05-25 Thread Alexander Viro



On Fri, 25 May 2001, Linus Torvalds wrote:

 If somebody can show that the above is worth it and worth implementing (ie
 the Al Viro kind of I have a real-life schenario where I'd like to use
 it), and implements it (should be a fairly trivial exercise), then I'll
 happily accept new semantics like this.

OK, here's a real-world scenario: inode table on 1Kb ext2 (or 4Kb on
Alpha, etc.) consists of compact pieces - one per cylinder group.

There is a natural mapping from inodes to offsets in that beast.
However, these pieces can trivially be not page-aligned. readpage()
on a boundary of two pieces means large seek.

Another example (even funnier) is bitmaps. Same story, but here you can
have 1Kb per cylinder group. Which is 8Mb in that case. I.e. on Alpha
it means that readpage() will require 7 seeks, 8Mb each. And the worst
thing being, unless we have corrupted free inodes counters we _will_
find what we need in the first 1Kb chunk we are looking at.

I can easily give more examples - just ask. BTW, the fact that this stuff
is so fragmented is not a bug - we want it evenly spread over disk, just
to have the ability to allocate a block/inode not too far from the piece
of bitmap we'll need to modify.
Al
PS: Uff... OK, looking at the locking stuff in fs/super.c was useful - I've
found a way to do it that is seriously simpler than what I used to do.
Just let me torture it for a couple of hours - so far it looks fine...




Re: DVD blockdevice buffers

2001-05-25 Thread Linus Torvalds


On Fri, 25 May 2001, Alexander Viro wrote:
 
> OK, here's a real-world scenario: inode table on 1Kb ext2 (or 4Kb on
> Alpha, etc.) consists of compact pieces - one per cylinder group.
> 
> There is a natural mapping from inodes to offsets in that beast.
> However, these pieces can trivially be not page-aligned. readpage()
> on a boundary of two pieces means large seek.

Yes.

But by "real-world" I mean "you can tell in real life".

I see the theoretical arguments for it. But I want to know that it makes a
real difference under real load.

For example, I suspect that the metadata bitmaps in particular cache so
well that the fact that we need to do several seeks over them every once
in a while is a non-issue: we might be happier having the bitmaps in
memory (and having simpler code), than try to avoid the occasional seeks.

The "simpler code" argument in particular is, I think, a fairly strong
one. Our current bitmap code is quite horrible, with multiple layers of
caching (ext2 will explicitly hold references to some blocks, while at the
same time depending on the buffer cache to cache the other blocks -
nightmare)

Linus




Re: DVD blockdevice buffers

2001-05-25 Thread Alexander Viro



On Fri, 25 May 2001, Linus Torvalds wrote:

> For example, I suspect that the metadata bitmaps in particular cache so
> well that the fact that we need to do several seeks over them every once
> in a while is a non-issue: we might be happier having the bitmaps in
> memory (and having simpler code), than try to avoid the occasional seeks.
> 
> The "simpler code" argument in particular is, I think, a fairly strong
> one. Our current bitmap code is quite horrible, with multiple layers of
> caching (ext2 will explicitly hold references to some blocks, while at the
> same time depending on the buffer cache to cache the other blocks -
> nightmare)

Oh, current code is a complete mess - no arguments here. 8-element LRU.
Combined with the fact that directories allocation tries to get even
distribution of directory inodes by cylinder groups, you blow that LRU
completely on a regular basis if your fs is larger that 16 cg. For 1Kb
blocks fs it's 128Mb. For 4Kb - 2Gb. And pain starts at the half of that
size.

If you are OK with adding two extra arguments to ->readpage() I could
submit a patch replacing that with plain and simple page cache by tomorrow.
It should not be a problem to port, but I want to get some sleep before
testing it...




Re: DVD blockdevice buffers

2001-05-25 Thread Stephen C. Tweedie

Hi,

On Fri, May 25, 2001 at 09:09:37AM -0600, Eric W. Biederman wrote:

> The case we don't get quite right are partial reads that hit cached
> data, on a page that doesn't have PG_Uptodate set.  We don't actually
> need to do the I/O on the surrounding page to satisfy the read
> request.  But we do because generic_file_read doesn't even think about
> that case.

That's *precisely* the case in question.  The whole design of the page
cache involves reading entire pages at a time, in fact.  We _could_
read in only partial pages, but in that case we end up wasting a lot
of the page.

> For the small random read case we could use a
> mapping->a_ops->readpartialpage
> function that sees if a request can be satisfied entirely
> from cached data.  But this is just to allow generic_file_read
> to handle this case.

Agreed.  The only case where blockdev-in-pagecache really results in
significantly more IO is partial writes followed by partial reads.
Reads from totally-uncached pages ought to just fill the entire page
from disk; it's only when there is something already present
in the cache for that page that we want to look for partial buffers.

--Stephen



Re: DVD blockdevice buffers

2001-05-25 Thread Stephen C. Tweedie

Hi,

On Fri, May 25, 2001 at 02:24:52PM -0400, Alexander Viro wrote:

> If you are OK with adding two extra arguments to ->readpage() I could
> submit a patch replacing that with plain and simple page cache by tomorrow.
> It should not be a problem to port, but I want to get some sleep before
> testing it...

The problem will be returning the IO completion status.  We can't just
rely on PG_Error: what happens if two separate partial reads are
submitted at once within the same page, yet the page is not completely
in cache?  If we forced readpage to be synchronous in that case we
could just return the status directly.  Otherwise we need a separate
way of determining the completion status once the page becomes
unlocked (eg. have a special readpage return which means "all done,
completion status is X", and resubmit the readpage to get that
completion status once the page lock is dropped.)

--Stephen




blkdev-pagecache-2 [was Re: DVD blockdevice buffers]

2001-05-25 Thread Andrea Arcangeli

On Thu, May 24, 2001 at 12:32:20AM +0200, Andrea Arcangeli wrote:
> userspace. I will try to work on the blkdev patch tomorrow to bring it
> in a usable state.

It seems in a usable state now, but it is still very early beta: I
need to recheck the whole thing, and I will do that tomorrow. For now it
should get right the fsck of a ro-mounted fs and the cache coherency
across multiple inodes all pointing to the same blkdev; it actually
worked without any problem in the first basic tests I did. However I
expect it to corrupt a rw-mounted fs if you open the blkdev under it
(the fsck test happens with the fs ro), so while it's in a usable state
it's not ready for public consumption yet. Of course ramdisk is still
totally broken too. The other first round of bugs mentioned in the first
thread should be fixed. The blocksize is still hardwired to 4k, I'll
think about the read-modify-write problem later. About the proposed
readpage API change I think it's not worthwhile for new hardware where
reading 1k or 4k doesn't make relevant difference. Handling partial
I/O seems worthwhile only during writes because a partial write would
otherwise trigger a read-modify-write operation with a synchronous read.


ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.4.5pre6/blkdev-pagecache-2

Andrea



Re: blkdev-pagecache-2 [was Re: DVD blockdevice buffers]

2001-05-25 Thread Andrea Arcangeli

On Fri, May 25, 2001 at 10:12:51PM +0200, Andrea Arcangeli wrote:
   
ftp://ftp.kernel.org/pub/linux/kernel/people/andrea/patches/v2.2/2.4.5pre6/blkdev-pagecache-2
   ^ 4 sorry



Re: DVD blockdevice buffers

2001-05-25 Thread Eric W. Biederman

Linus Torvalds [EMAIL PROTECTED] writes:

> On 25 May 2001, Eric W. Biederman wrote:
> > 
> > For the small random read case we could use a
> > mapping->a_ops->readpartialpage
> 
> No, if so I'd prefer to just change readpage() to take the same kinds of
> arguments commit_page() does, namely the beginning and end of the read
> area.

No.

I obviously picked a bad name, and a bad place to start.
int data_uptodate(struct page *page, unsigned offset, unsigned len)

This is really an extension to PG_uptodate, not readpage.  It should
never ever do any I/O.  It should just implement a check to see
if we have all of the data wanted already in the page in the page
cache.  As simply a buffer checking entity it will likely share
virtually 0 code with read_page.

> Filesystems could choose to ignore the arguments completely, and just act
> the way they already do - filling in the whole page.
> 
> OR a filesystem might know that the page is partially up-to-date (because
> of a partial write), and just return an immediate "this area is already
> uptodate" return code or something. Or it could even fill in the page
> partially, and just unlock it (but not mark it up-to-date: the reader then
> has to wait for the page and then look at PG_error to decide whether the
> partial read succeeded or not).

First, mm/filemap.c has generic cache management, so it should make the
decision.

The logic is "does this page have the data in cache?"
If so, just return it.

Otherwise read all that you can at once.  

So we either want a virtual function that can make the decision on
a per filesystem bases if we have the data we need in the page cache.
Or we need to convert the buffer_head into a more generic entity
so everyone can use it.
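A sketch of the check Eric is proposing, with simplified standalone
stand-ins for the 2.4 structures (the real thing would use page->buffers,
the buffer_uptodate() macro, and run under the page lock):

```c
#include <assert.h>

/* Minimal stand-ins for the 2.4-era structures; illustrative only. */
struct buffer_head {
	struct buffer_head *b_this_page;  /* circular ring of buffers in page */
	unsigned b_size;                  /* block size in bytes */
	int b_uptodate;                   /* stand-in for buffer_uptodate(bh) */
};

struct page {
	int uptodate;                     /* stand-in for PageUptodate(page) */
	struct buffer_head *buffers;      /* stand-in for page->buffers */
};

/*
 * Sketch of the proposed check: is [offset, offset+len) already valid
 * in this page?  Never does I/O - it only inspects the buffer ring.
 */
static int data_uptodate(struct page *page, unsigned offset, unsigned len)
{
	struct buffer_head *bh;
	unsigned pos = 0, end = offset + len;

	if (page->uptodate)
		return 1;                 /* whole page valid: fast path */
	if (!page->buffers)
		return 0;                 /* nothing cached at finer grain */

	bh = page->buffers;
	do {
		/* A stale buffer overlapping the request means a miss. */
		if (pos < end && pos + bh->b_size > offset && !bh->b_uptodate)
			return 0;
		pos += bh->b_size;
		bh = bh->b_this_page;
	} while (bh != page->buffers);

	return 1;
}
```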

> I don't think it really matters, I have to say. It would be very easy to
> implement (all the buffer-based filesystems already use the common
> fs/buffer.c readpage, so it would really need changes in just one place,
> along with some expanded prototypes with ignored arguments in some other
> places).
> 
> But it _could_ be a performance helper for some strange loads (write a
> partial page and immediately read it back - what a stupid program), and
> more importantly Al Viro felt earlier that a partial read approach might
> help his metadata-in-page-cache stuff because metadata tends to sometimes
> be scattered wildly across the disk.

Maybe. I think, despite the similarities (partial pages), Al and I are
looking at two entirely different problems.

> So then we'd have
> 
>   int (*readpage)(struct file *, struct page *, unsigned offset, unsigned
> len);
> 
> 
> and the semantics would be:
>  - the function needs to start IO for _at_least_ the page area
>    [offset, offset+len[
>  - return error code for _immediate_ errors (ie not asynchronous)
>  - if there was an asynchronous read error, we set PG_error
>  - if the page is fully populated, we set PG_uptodate
>  - if the page was not fully populated, but the partial read succeeded,
>    the filesystem needs to have some way of keeping track of the partial
>    success (page->buffers is obviously the way for a block-based one),
>    and must _not_ set PG_uptodate.
>  - after the asynchronous operation (whether complete, partial or
>    unsuccessful), the page is unlocked to tell the reader that it is done.
> 
> Now, this would be coupled with:
>  - generic_file_read() does the read-ahead decisions, and may decide that
>    we really only need a partial page.
> 
> But NOTE! The above is meant to potentially avoid unnecessary IO and thus
> speed up the read-in. HOWEVER, it _will_ slow down the case where we first
> would read a small part of the page and then soon afterwards read in the
> rest of the page. I suspect that is the common case by far, and that the
> current whole-page approach is the faster one in 99% of all cases. So I'm
> not at all convinced that the above is actually worth it.
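The quoted semantics can be modelled with a toy, synchronous sketch
(simplified stand-in types, "I/O" is instant; a real implementation would
unlock the page from the I/O completion handler instead):

```c
#include <assert.h>

/* Toy model of the proposed readpage semantics; types are simplified
 * stand-ins, not the kernel's. */
struct buffer { int uptodate; };

struct tpage {
	int uptodate;            /* PG_uptodate */
	int locked;              /* PG_locked   */
	struct buffer b[4];      /* four 1Kb buffers in a 4Kb page */
};

/*
 * readpage(page, offset, len): bring at least [offset, offset+len) in.
 * Only fully populated pages get PG_uptodate; partial success is
 * tracked in the buffers, and unlocking signals completion.
 */
static int readpage(struct tpage *p, unsigned offset, unsigned len)
{
	unsigned first = offset / 1024, last = (offset + len - 1) / 1024, i;
	int all = 1;

	for (i = first; i <= last && i < 4; i++)
		p->b[i].uptodate = 1;     /* partial fill of the page */
	for (i = 0; i < 4; i++)
		if (!p->b[i].uptodate)
			all = 0;
	if (all)
		p->uptodate = 1;          /* fully populated: PG_uptodate */
	p->locked = 0;                    /* unlock tells the reader we're done */
	return 0;                         /* no immediate error */
}
```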

I don't want partial I/O at all.  And I always want to see reads
reading in all of the data for a page.  I just want an interface
where we can say "hey, we don't actually have to do any I/O for this
read request, give them back their data".

> If somebody can show that the above is worth it and worth implementing (ie
> the Al Viro kind of "I have a real-life scenario where I'd like to use
> it"), and implements it (should be a fairly trivial exercise), then I'll
> happily accept new semantics like this.
> 
> But I do _not_ want to see another new function (partialread()), and I
> do _not_ want to see synchronous interfaces (Al's first suggestion).

My naming mistake. I don't want to see this logic combined with
readpage.  That is an entirely different case.

I can't see how adding a slow case to PageUptodate to check for a
partially uptodate page could hurt our performance.  And I can imagine
how it could help.

Eric

Re: DVD blockdevice buffers

2001-05-25 Thread Linus Torvalds



On 25 May 2001, Eric W. Biederman wrote:

> I obviously picked a bad name, and a bad place to start.
> int data_uptodate(struct page *page, unsigned offset, unsigned len)
> 
> This is really an extension to PG_uptodate, not readpage.

Ugh.

The above is just horrible.

It doesn't fix any problems, it is only an ugly work-around for a
situation that never happens in real life. An application that only
re-reads the data that it just wrote itself is a _stupid_ application, and
I'm absolutely not interested in having a new interface that is useless
for everything _but_ such a stupid application.

Linus




Re: DVD blockdevice buffers

2001-05-24 Thread Stephen C. Tweedie

Hi,

On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote:
 
> On Wed, 23 May 2001, Stephen C. Tweedie wrote:
> > > that the filesystems already do. And you can do it a lot _better_ than the
> > > current buffer-cache-based approach. Done right, you can actually do all
> > > IO in page-sized chunks, BUT fall down on sector-sized things for the
> > > cases where you want to.
> >
> > Right, but you still lose the caching in that case.  The write works,
> > but the "cache" becomes nothing more than a buffer.
> 
> No. It is still cached. You find the buffer with "page->buffer", and when
> all of them are up-to-date (whether from read-in or from having written
> to them all), you just mark the whole page up-to-date.

It works, but *only* if the application writes a whole page worth of
data.  From the previous emails I had the understanding that this
application is writing small data items in random 512-byte blocks.  It
is not writing the rest of the page.  The page never becomes uptodate.
That in itself isn't a problem, but readpage() can't tell the
underlying layers that only a part of the page is wanted, so there's
no way to tell readpage that the page is in fact partially uptodate.

And just telling the application to write the rest of the page too
isn't going to cut it, because the rest of the page may contain other
objects which aren't in cache so we can't write them without first
reading the page.  The only alternative is to change the on-disk
layout, forcing a minimum PAGESIZE on the IO chunks.

> This _works_. Try it on ext2 or NFS today.

Not for this workload.  Now, maybe it's not an interesting workload.
But shifting the uptodate granularity from buffer to page sized _does_
impact the effectiveness of the cache for such an application. 

> So in short: the page cache supports _today_ all the optimizations.

For write, perhaps; but for subsequent read, generic_read_page
doesn't see any of the data in the page unless the whole page has been
written.

--Stephen



Re: DVD blockdevice buffers

2001-05-23 Thread Andrea Arcangeli

On Wed, May 23, 2001 at 04:40:14PM -0400, Jeff Garzik wrote:
> Linus Torvalds wrote:
> > Now, it may be that the preliminary patches from Andrea do not work this
> > way. I didn't look at them too closely, and I assume that Andrea basically
> > made the block-size be the same as the page size. That's how I would have
> > done it (and then waited for people to find real life cases where we want
> > to allow sector writes).
> 
> Due to limitations in low-level drivers, Andrea was forced to hardcode
> 4096 for the block size, instead of using PAGE_SIZE or PAGE_CACHE_SIZE.

Yes, actually to trigger the read-modify-write logic not more than with
the current buffercache I could simply decrease the softblocksize of the
blkdev pagecache to 1k, like the default granularity of the current
buffercache before any filesystem is mounted, but that would impose a
_very_ significant performance hit to the non-cached case which is quite
important as well mainly for a blkdev I think.

I measured that on high end disks reading (out of cache) with a 4k
buffercache blocksize instead of a 1k buffercache blocksize is an exact
x2 improvement, because at that speed the bottleneck becomes the work
that has to be done by the cpu.

In fact rawio on /dev/raw* is as well 2 times slower than the 2.4 4k
buffercache on blkdev in those environments (of course with rawio the
cpu is not used much compared to the buffered I/O) and that's one of the
reasons I also imposed a 4k granularity on the direct I/O from
open("/dev/hda", O_DIRECT|O_RDWR).  I didn't benchmark it yet but I
suspect that doing rawio with forced 4k bh (as opposed to the 512byte bh of
/dev/raw*) will make O_DIRECT on the blkdev much faster than the
buffered I/O on the blkdev through pagecache, just like O_DIRECT scored
the 170MByte/sec of very scalable I/O recently, I think also because it
was done through ext2 that imposed a 4k softblocksize:

http://boudicca.tux.org/hypermail/linux-kernel/2001week17/1175.html

http://boudicca.tux.org/hypermail/linux-kernel/2001week17/att-1175/01-directio.png

(boudicca.tux.org is not online at the moment but I assume it will
return online soon)

However this is still flexible; right now my first objective is to solve
the showstoppers (so for example I can run my machine with that patch
applied) and then we can think how to solve the 4k/1k/512byte
softblocksize issues, possibly automatically or selectable from
userspace. I will try to work on the blkdev patch tomorrow to bring it
in a usable state.

Andrea



Re: DVD blockdevice buffers

2001-05-23 Thread Andrea Arcangeli

On Wed, May 23, 2001 at 06:13:13PM -0400, Alexander Viro wrote:
> Uh-oh... After you solved what?

The superblock is pinned by the kernel in buffercache while you fsck a
ro mounted ext2, so I must somehow uptodate this superblock in the
buffercache before collecting away the pagecache containing more recent
info from fsck. It's all done lazily, I just thought not to break the
assumption that an idling buffercache will never become not uptodate
under you anytime because it seems not too painful to implement compared
to changing the fs, it puts the check in a slow path and it doesn't
break the API with the buffercache (so I don't need to change all the fs
to check if the superblock is still uptodate before marking it dirty).

Andrea



Re: DVD blockdevice buffers

2001-05-23 Thread Andrea Arcangeli

On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote:
> [..] I assume that Andrea basically
> made the block-size be the same as the page size. That's how I would have

exactly (softblocksize is 4k fixed, regardless of the page cache size to
avoid confusing device drivers).

> done it (and then waited for people to find real life cases where we want
> to allow sector writes).

Correct, the partial write logic is kind of disabled on x86 because the
artificial softblocksize of the blkdev pagecache matches the
pagecachesize but it should just work on the other archs.

Now I can try to make the bh more granular for partial writes in a
dynamic manner (so we don't pay the overhead of the 512byte bh in the
common case) but I think this would need its own additional logic and I
prefer to think about it after I solved the coherency issues between
pinned buffer cache and filesystem, so after the showstoppers are solved
and the patch is just usable in real life (possibly with the overhead of
read-modify-write for some workload doing small random write I/O).
An easy short term fix for removing the read-modify-write would be to use the
hardblocksize of the underlying device as the softblocksize but again
that would cause us to pay for the 512byte bhs which I don't like to... ;)

Andrea
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: DVD blockdevice buffers

2001-05-23 Thread Alexander Viro



On Thu, 24 May 2001, Andrea Arcangeli wrote:

> prefer to think about it after I solved the coherency issues between
> pinned buffer cache and filesystem, so after the showstoppers are solved

Uh-oh... After you solved what?




Re: DVD blockdevice buffers

2001-05-23 Thread Jeff Garzik

Linus Torvalds wrote:
> Now, it may be that the preliminary patches from Andrea do not work this
> way. I didn't look at them too closely, and I assume that Andrea basically
> made the block-size be the same as the page size. That's how I would have
> done it (and then waited for people to find real life cases where we want
> to allow sector writes).

Due to limitations in low-level drivers, Andrea was forced to hardcode
4096 for the block size, instead of using PAGE_SIZE or PAGE_CACHE_SIZE.

-- 
Jeff Garzik  | "Are you the police?"
Building 1024| "No, ma'am.  We're musicians."
MandrakeSoft |



Re: DVD blockdevice buffers

2001-05-23 Thread Linus Torvalds



On Wed, 23 May 2001, Stephen C. Tweedie wrote:
> > that the filesystems already do. And you can do it a lot _better_ than the
> > current buffer-cache-based approach. Done right, you can actually do all
> > IO in page-sized chunks, BUT fall down on sector-sized things for the
> > cases where you want to.
>
> Right, but you still lose the caching in that case.  The write works,
> but the "cache" becomes nothing more than a buffer.

No. It is still cached. You find the buffer with "page->buffer", and when
all of them are up-to-date (whether from read-in or from having written
to them all), you just mark the whole page up-to-date.

This _works_. Try it on ext2 or NFS today.

Now, it may be that the preliminary patches from Andrea do not work this
way. I didn't look at them too closely, and I assume that Andrea basically
made the block-size be the same as the page size. That's how I would have
done it (and then waited for people to find real life cases where we want
to allow sector writes).

So in short: the page cache supports _today_ all the optimizations. In
fact, you can, on NFS, do 4096 one-byte writes, and they will be (a)
coalesced into one write over the wire, and (b) will be cached in the page
and the page marked up-to-date.

Linus




Re: DVD blockdevice buffers

2001-05-23 Thread Stephen C. Tweedie

Hi,

On Wed, May 23, 2001 at 11:12:00AM -0700, Linus Torvalds wrote:
> 
> On Wed, 23 May 2001, Stephen C. Tweedie wrote:
> No, you can actually do all the "prepare_write()"/"commit_write()" stuff
> that the filesystems already do. And you can do it a lot _better_ than the
> current buffer-cache-based approach. Done right, you can actually do all
> IO in page-sized chunks, BUT fall down on sector-sized things for the
> cases where you want to. 

Right, but you still lose the caching in that case.  The write works,
but the "cache" becomes nothing more than a buffer.

This actually came up recently after the first posting of the
bdev-on-pagecache patches, when somebody was getting lousy
database performance for an application I think they had developed
from scratch --- it was using 512-byte blocks as the basic write
alignment and was relying on the kernel caching that.  In fact, in
that case even our old buffer cache was failing due to the default
blocksize of 1024 bytes, and he had had to add an ioctl to force the
blocksize to 512 bytes before the application would perform at all
well on Linux.

So we do have at least one real-world example which will fail if we
increase the IO granularity.  We may well decide that the pain is
worth it, but the page cache really cannot deal properly with this
right now without having an uptodate labeling at finer granularity
than the page (which would be unnecessary ugliness in most cases).

--Stephen



Re: DVD blockdevice buffers

2001-05-23 Thread Linus Torvalds


On Wed, 23 May 2001, Stephen C. Tweedie wrote:
> 
> Right.  I'd like to see buffered IO able to work well --- apart from
> the VM issues, it's the easiest way to allow the application to take
> advantage of readahead.  However, there's one sticking point we
> encountered, which is applications which write to block devices in
> units smaller than a page.  Small block writes get magically
> transformed into read/modify/write cycles if you shift the block
> devices into the page cache.

No, you can actually do all the "prepare_write()"/"commit_write()" stuff
that the filesystems already do. And you can do it a lot _better_ than the
current buffer-cache-based approach. Done right, you can actually do all
IO in page-sized chunks, BUT fall down on sector-sized things for the
cases where you want to. 

This is exactly the same issue that filesystems had with writers of less
than a page - and the page cache interfaces allow for byte-granular writes
(as actually shown by things like NFS, which do exactly that. For a block
device, the granularity obviously tends to be at least 512 bytes).

> Of course, we could just say "then don't do that" and be done with it
> --- after all, we already have this behaviour when writing to regular
> files.

No, we really don't. When you write an aligned 1kB block to a 1kB ext2
filesystem, it will _not_ do a page-sized read-modify-write. It will just
create the proper 1kB buffers, and mark one of them dirty.

Now, admittedly it is _easier_ to just always consider things 4kB in
size. And faster too, for the common cases. So it might not be worth it to
do the extra work unless somebody can show a good reason for it.
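A minimal model of the point above: an aligned sub-page write only touches
its own buffer, so no read is needed (simplified stand-in types; the
function names echo the 2.4 address_space operations but this is an
illustration, not the fs/buffer.c implementation):

```c
#include <assert.h>
#include <string.h>

/* Simplified stand-ins; the real logic lives in fs/buffer.c. */
struct buf { int uptodate, dirty; };

struct wpage {
	int uptodate;
	struct buf b[4];         /* 4Kb page with 1Kb blocksize */
};

/* prepare_write: a read is only needed for buffers the write covers
 * partially; fully overwritten buffers are never read. */
static void prepare_write(struct wpage *p, unsigned from, unsigned to,
			  int *reads_issued)
{
	unsigned i;

	for (i = 0; i < 4; i++) {
		unsigned s = i * 1024, e = s + 1024;

		if (e <= from || s >= to)
			continue;               /* untouched buffer: leave alone */
		if ((from > s || to < e) && !p->b[i].uptodate)
			(*reads_issued)++;      /* partial overlap needs a read */
	}
}

/* commit_write: written buffers become uptodate and dirty; the page is
 * only marked uptodate once every buffer in it is. */
static void commit_write(struct wpage *p, unsigned from, unsigned to)
{
	unsigned i, all = 1;

	for (i = 0; i < 4; i++) {
		unsigned s = i * 1024, e = s + 1024;

		if (e > from && s < to) {
			p->b[i].uptodate = 1;   /* fully written, or read then patched */
			p->b[i].dirty = 1;
		}
		if (!p->b[i].uptodate)
			all = 0;
	}
	if (all)
		p->uptodate = 1;
}
```

An aligned 1kB write thus dirties exactly one buffer and issues no read,
while an unaligned write into a stale buffer triggers the
read-modify-write case discussed in the thread.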

Linus




Re: DVD blockdevice buffers

2001-05-23 Thread Stephen C. Tweedie

Hi,

On Sat, May 19, 2001 at 07:36:07PM -0700, Linus Torvalds wrote:

> Right now we don't try to aggressively drop streaming pages, but it's
> possible. Using raw devices is a silly work-around that should not be
> needed, and this load shows a real problem in current Linux (one soon to
> be fixed, I think - Andrea already has some experimental patches for the
> page-cache thing).

Right.  I'd like to see buffered IO able to work well --- apart from
the VM issues, it's the easiest way to allow the application to take
advantage of readahead.  However, there's one sticking point we
encountered, which is applications which write to block devices in
units smaller than a page.  Small block writes get magically
transformed into read/modify/write cycles if you shift the block
devices into the page cache.

Of course, we could just say "then don't do that" and be done with it
--- after all, we already have this behaviour when writing to regular
files.

--Stephen



Re: DVD blockdevice buffers

2001-05-23 Thread Stephen C. Tweedie

Hi,

On Sat, May 19, 2001 at 07:36:07PM -0700, Linus Torvalds wrote:

 Right now we don't try to aggressively drop streaming pages, but it's
 possible. Using raw devices is a silly work-around that should not be
 needed, and this load shows a real problem in current Linux (one soon to
 be fixed, I think - Andrea already has some experimental patches for the
 page-cache thing).

Right.  I'd like to see buffered IO able to work well --- apart from
the VM issues, it's the easiest way to allow the application to take
advantage of readahead.  However, there's one sticking point we
encountered, which is applications which write to block devices in
units smaller than a page.  Small block writes get magically
transformed into read/modify/write cycles if you shift the block
devices into the page cache.

Of course, we could just say then don't do that and be done with it
--- after all, we already have this behaviour when writing to regular
files.

--Stephen
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



Re: DVD blockdevice buffers

2001-05-23 Thread Linus Torvalds


On Wed, 23 May 2001, Stephen C. Tweedie wrote:
 
 Right.  I'd like to see buffered IO able to work well --- apart from
 the VM issues, it's the easiest way to allow the application to take
 advantage of readahead.  However, there's one sticking point we
 encountered, which is applications which write to block devices in
 units smaller than a page.  Small block writes get magically
 transformed into read/modify/write cycles if you shift the block
 devices into the page cache.

No, you can actually do all the prepare_write()/commit_write() stuff
that the filesystems already do. And you can do it a lot _better_ than the
current buffer-cache-based approach. Done right, you can actually do all
IO in page-sized chunks, BUT fall down on sector-sized things for the
cases where you want to. 

This is exactly the same issue that filesystems had with writers of less
than a page - and the page cache interfaces allow for byte-granular writes
(as actually shown by things like NFS, which do exactly that. For a block
device, the granularity obviously tends to be at least 512 bytes).

> Of course, we could just say "then don't do that" and be done with it
> --- after all, we already have this behaviour when writing to regular
> files.

No, we really don't. When you write an aligned 1kB block to a 1kB ext2
filesystem, it will _not_ do a page-sized read-modify-write. It will just
create the proper 1kB buffers, and mark one of them dirty.

Now, admittedly it is _easier_ to just always consider things 4kB in
size. And faster too, for the common cases. So it might not be worth it to
do the extra work unless somebody can show a good reason for it.

Linus




Re: DVD blockdevice buffers

2001-05-23 Thread Stephen C. Tweedie

Hi,

On Wed, May 23, 2001 at 11:12:00AM -0700, Linus Torvalds wrote:
>
> On Wed, 23 May 2001, Stephen C. Tweedie wrote:
> No, you can actually do all the prepare_write()/commit_write() stuff
> that the filesystems already do. And you can do it a lot _better_ than the
> current buffer-cache-based approach. Done right, you can actually do all
> IO in page-sized chunks, BUT fall down on sector-sized things for the
> cases where you want to.

Right, but you still lose the caching in that case.  The write works,
but the cache becomes nothing more than a buffer.

This actually came up recently after the first posting of the
bdev-on-pagecache patches, when somebody was getting lousy
database performance for an application I think they had developed
from scratch --- it was using 512-byte blocks as the basic write
alignment and was relying on the kernel caching that.  In fact, in
that case even our old buffer cache was failing due to the default
blocksize of 1024 bytes, and he had had to add an ioctl to force the
blocksize to 512 bytes before the application would perform at all
well on Linux.

So we do have at least one real-world example which will fail if we
increase the IO granularity.  We may well decide that the pain is
worth it, but the page cache really cannot deal properly with this
right now without having an uptodate labeling at finer granularity
than the page (which would be unnecessary ugliness in most cases).

--Stephen



Re: DVD blockdevice buffers

2001-05-23 Thread Linus Torvalds



On Wed, 23 May 2001, Stephen C. Tweedie wrote:
> > that the filesystems already do. And you can do it a lot _better_ than the
> > current buffer-cache-based approach. Done right, you can actually do all
> > IO in page-sized chunks, BUT fall down on sector-sized things for the
> > cases where you want to.
> 
> Right, but you still lose the caching in that case.  The write works,
> but the cache becomes nothing more than a buffer.

No. It is still cached. You find the buffer with page->buffers, and when
all of them are up-to-date (whether from read-in or from having written
to them all), you just mark the whole page up-to-date.

This _works_. Try it on ext2 or NFS today.

Now, it may be that the preliminary patches from Andrea do not work this
way. I didn't look at them too closely, and I assume that Andrea basically
made the block-size be the same as the page size. That's how I would have
done it (and then waited for people to find real life cases where we want
to allow sector writes).

So in short: the page cache supports _today_ all the optimizations. In
fact, you can, on NFS, do 4096 one-byte writes, and they will be (a)
coalesced into one write over the wire, and (b) will be cached in the page
and the page marked up-to-date.

Linus




Re: DVD blockdevice buffers

2001-05-23 Thread Jeff Garzik

Linus Torvalds wrote:
> Now, it may be that the preliminary patches from Andrea do not work this
> way. I didn't look at them too closely, and I assume that Andrea basically
> made the block-size be the same as the page size. That's how I would have
> done it (and then waited for people to find real life cases where we want
> to allow sector writes).

Due to limitations in low-level drivers, Andrea was forced to hardcode
4096 for the block size, instead of using PAGE_SIZE or PAGE_CACHE_SIZE.

-- 
Jeff Garzik  | "Are you the police?"
Building 1024| "No, ma'am.  We're musicians."
MandrakeSoft |



Re: DVD blockdevice buffers

2001-05-23 Thread Alexander Viro



On Thu, 24 May 2001, Andrea Arcangeli wrote:

> prefer to think about it after I solved the coherency issues between
> pinned buffer cache and filesystem, so after the showstoppers are solved

Uh-oh... After you solved what?




Re: DVD blockdevice buffers

2001-05-23 Thread Andrea Arcangeli

On Wed, May 23, 2001 at 01:01:56PM -0700, Linus Torvalds wrote:
> [..] I assume that Andrea basically
> made the block-size be the same as the page size. That's how I would have

exactly (softblocksize is 4k fixed, regardless of the page cache size to
avoid confusing device drivers).

> done it (and then waited for people to find real life cases where we want
> to allow sector writes).

Correct, the partial write logic is kind of disabled on x86 because the
artificial softblocksize of the blkdev pagecache matches the
pagecachesize but it should just work on the other archs.

Now I can try to make the bh more granular for partial writes in a
dynamic manner (so we don't pay the overhead of the 512byte bh in the
common case) but I think this would need its own additional logic and I
prefer to think about it after I solved the coherency issues between
pinned buffer cache and filesystem, so after the showstoppers are solved
and the patch is just usable in real life (possibly with the overhead of
read-modify-write for some workload doing small random write I/O).
An easy short term fix for removing the read-modify-write would be to use the
hardblocksize of the underlying device as the softblocksize but again
that would cause us to pay for the 512byte bhs which I don't like to... ;)

Andrea



Re: DVD blockdevice buffers

2001-05-23 Thread Andrea Arcangeli

On Wed, May 23, 2001 at 06:13:13PM -0400, Alexander Viro wrote:
> Uh-oh... After you solved what?

The superblock is pinned by the kernel in buffercache while you fsck a
ro mounted ext2, so I must somehow uptodate this superblock in the
buffercache before collecting away the pagecache containing more recent
info from fsck. It's all done lazily, I just thought not to break the
assumption that an idling buffercache will never become not uptodate
under you anytime because it seems not too painful to implement compared
to changing the fs, it puts the check in a slow path and it doesn't
break the API with the buffercache (so I don't need to change all the fs
to check if the superblock is still uptodate before marking it dirty).

Andrea



Re: DVD blockdevice buffers

2001-05-23 Thread Andrea Arcangeli

On Wed, May 23, 2001 at 04:40:14PM -0400, Jeff Garzik wrote:
> Linus Torvalds wrote:
> > Now, it may be that the preliminary patches from Andrea do not work this
> > way. I didn't look at them too closely, and I assume that Andrea basically
> > made the block-size be the same as the page size. That's how I would have
> > done it (and then waited for people to find real life cases where we want
> > to allow sector writes).
> 
> Due to limitations in low-level drivers, Andrea was forced to hardcode
> 4096 for the block size, instead of using PAGE_SIZE or PAGE_CACHE_SIZE.

Yes, actually to trigger the read-modify-write logic no more often than
with the current buffercache I could simply decrease the softblocksize of
the blkdev pagecache to 1k, like the default granularity of the current
buffercache before any filesystem is mounted, but that would impose a
_very_ significant performance hit on the non-cached case which is quite
important as well, mainly for a blkdev I think.

I measured that on high end disks, reading (out of cache) with a 4k
buffercache blocksize instead of a 1k buffercache blocksize is an exact
2x improvement, because at that speed the bottleneck becomes the work
that has to be done by the cpu.

In fact rawio via /dev/raw* is also 2 times slower than the 2.4 4k
buffercache on blkdev in those environments (of course with rawio the
cpu is not used much compared to the buffered I/O) and that's one of the
reasons I also imposed a 4k granularity on the direct I/O from
open("/dev/hda", O_DIRECT|O_RDWR).  I haven't benchmarked it yet but I
suspect that doing rawio with forced 4k bh (as opposed to the 512-byte
bh of /dev/raw*) will make O_DIRECT on the blkdev much faster than the
buffered I/O on the blkdev through the pagecache, just like O_DIRECT
scored the 170MByte/sec of very scalable I/O recently, I think also
because it was done through ext2 which imposed a 4k softblocksize:

http://boudicca.tux.org/hypermail/linux-kernel/2001week17/1175.html

http://boudicca.tux.org/hypermail/linux-kernel/2001week17/att-1175/01-directio.png

(boudicca.tux.org is not online at the moment but I assume it will
return online soon)

However this is still flexible; right now my first objective is to solve
the showstoppers (so for example I can run my machine with that patch
applied) and then we can think about how to solve the 4k/1k/512byte
softblocksize issues, possibly automatically or selectable from
userspace. I will try to work on the blkdev patch tomorrow to bring it
into a usable state.

Andrea



Re: DVD blockdevice buffers

2001-05-21 Thread Jens Axboe

On Mon, May 21 2001, Adam Schrotenboer wrote:
> On Sun, 20 May 2001, Jens Axboe wrote:
> 
> > On Sat, May 19 2001, Adam Schrotenboer wrote:
> > > /dev/raw*  Where? I can't find it in my .config (grep RAW .config). I am 
> > > using 2.4.4-ac11 and playing w/ 2.4.5-pre3.
> > 
> > It's automagically included, no config options necessary
> > (drivers/char/raw.c)
> 
> Then where is /dev/raw* ? I'm using devfs, if that helps any.

Apparently raw doesn't set up using the devfs_reg functions; someone
needs to fix that. So either fix it (look at other drivers) or don't use
devfs.

-- 
Jens Axboe




Re: DVD blockdevice buffers

2001-05-21 Thread Adam Schrotenboer

On Sun, 20 May 2001, Jens Axboe wrote:

> On Sat, May 19 2001, Adam Schrotenboer wrote:
> > /dev/raw*  Where? I can't find it in my .config (grep RAW .config). I am 
> > using 2.4.4-ac11 and playing w/ 2.4.5-pre3.
> 
> It's automagically included, no config options necessary
> (drivers/char/raw.c)

Then where is /dev/raw* ? I'm using devfs, if that helps any.
> -- 
> Jens Axboe




Re: DVD blockdevice buffers

2001-05-19 Thread Linus Torvalds

In article <[EMAIL PROTECTED]>, Jens Axboe  <[EMAIL PROTECTED]> wrote:
>> 
>> As a result the system performance goes down. I'm still able to use
>> my applications, but es every single piece of unused memory is swapped
>> out, and swapping in costs a certain amount of time.
>
>That's why streaming media applications like a dvd player should use raw
>I/O -- to bypass system cache. See /dev/raw*

I disagree.. 

The fact is that the block device fs infrastructure is just sadly
broken. By using the buffer cache, it makes memory management very hard,
and just upgrading to the page cache would (a) speed stuff up and (b)
make it much easier for the kernel to do the right thing wrt the MM use.

Right now we don't try to aggressively drop streaming pages, but it's
possible. Using raw devices is a silly work-around that should not be
needed, and this load shows a real problem in current Linux (one soon to
be fixed, I think - Andrea already has some experimental patches for the
page-cache thing).

Linus



Re: DVD blockdevice buffers

2001-05-19 Thread Adam Schrotenboer

On Sun, 20 May 2001, Jens Axboe wrote:

> On Sat, May 19 2001, Adam Schrotenboer wrote:
> > /dev/raw*  Where? I can't find it in my .config (grep RAW .config). I am 
> > using 2.4.4-ac11 and playing w/ 2.4.5-pre3.
> 
> It's automagically included, no config options necessary
> (drivers/char/raw.c)
then why can't I find /dev/raw* (I'm using devfs, FWIW)
> 
> -- 
> Jens Axboe
> 
> 




Re: DVD blockdevice buffers

2001-05-19 Thread Jens Axboe

On Sat, May 19 2001, Adam Schrotenboer wrote:
> /dev/raw*  Where? I can't find it in my .config (grep RAW .config). I am 
> using 2.4.4-ac11 and playing w/ 2.4.5-pre3.

It's automagically included, no config options necessary
(drivers/char/raw.c)

-- 
Jens Axboe




Re: DVD blockdevice buffers

2001-05-19 Thread Adam Schrotenboer

/dev/raw*  Where? I can't find it in my .config (grep RAW .config). I am 
using 2.4.4-ac11 and playing w/ 2.4.5-pre3.

TIA
Adam Schrotenboer




Re: DVD blockdevice buffers

2001-05-18 Thread Eduard Hasenleithner

On Fri, May 18, 2001 at 09:25:31PM +0200, Jens Axboe wrote:
> On Fri, May 18 2001, Eduard Hasenleithner wrote:
> > I have a problem with the buffering mechanism of my blockdevice,
> > namely a ide_scsi DVD-ROM drive. After inserting a DVD and reading
> > data linearly from the DVD, an excessive amount of buffer memory gets
> > allocated.
> > 
> > This can easily be reproduced with
> > cat /dev/sr0 > /dev/null
> > 
> > Remember, nearly the same task is carried out when playing a DVD.
> > 
> > As a result the system performance goes down. I'm still able to use
> > my applications, but as every single piece of unused memory is swapped
> > out, and swapping in costs a certain amount of time.
> 
> That's why streaming media applications like a dvd player should use raw
> I/O -- to bypass system cache. See /dev/raw*
> 

Oh, thank you. That was very fast!

I use xine. To be honest, the procedure of how to create a raw device
is described in their FAQ. But it is not described what the raw device
does, only that it provides a speed improvement.

Until today, I didn't know what rawio actually does. Strange that I
didn't come across any information about it.

Was there an official announcement of the availability of this feature?
Is there more detailed information about rawio?

-- 
Eduard Hasenleithner
student of
Salzburg University of Applied Sciences and Technologies



Re: DVD blockdevice buffers

2001-05-18 Thread Jens Axboe

On Fri, May 18 2001, Eduard Hasenleithner wrote:
> I have a problem with the buffering mechanism of my blockdevice,
> namely a ide_scsi DVD-ROM drive. After inserting a DVD and reading
> data linearly from the DVD, an excessive amount of buffer memory gets
> allocated.
> 
> This can easily be reproduced with
>   cat /dev/sr0 > /dev/null
> 
> Remember, nearly the same task is carried out when playing a DVD.
> 
> As a result the system performance goes down. I'm still able to use
> my applications, but as every single piece of unused memory is swapped
> out, and swapping in costs a certain amount of time.

That's why streaming media applications like a dvd player should use raw
I/O -- to bypass system cache. See /dev/raw*

-- 
Jens Axboe




DVD blockdevice buffers

2001-05-18 Thread Eduard Hasenleithner

I have a problem with the buffering mechanism of my blockdevice,
namely a ide_scsi DVD-ROM drive. After inserting a DVD and reading
data linearly from the DVD, an excessive amount of buffer memory gets
allocated.

This can easily be reproduced with
cat /dev/sr0 > /dev/null

Remember, nearly the same task is carried out when playing a DVD.

As a result the system performance goes down. I'm still able to use
my applications, but as every single piece of unused memory is swapped
out, and swapping in costs a certain amount of time.

So, what went wrong? I tried to find some information on this with
google and geocrawler, but I didn't have any success :(

Kernel: linux-2.4.4

hoping for some tips ...

-- 
Eduard Hasenleithner
student of
Salzburg University of Applied Sciences and Technologies


