Re: open(O_DIRECT) on a tmpfs?

2007-01-08 Thread Bill Davidsen

Denis Vlasenko wrote:
> On Friday 05 January 2007 17:20, Bill Davidsen wrote:
> > Denis Vlasenko wrote:
> > > But O_DIRECT is _not_ about cache. At least I think it was not about
> > > cache initially, it was more about DMAing data directly from/to
> > > application address space to/from disks, saving memcpy's and double
> > > allocations. Why do you think it has those special alignment
> > > requirements? Are they cache related? Not at all!
> >
> > I'm not sure I can see how you find "don't use cache" not cache related.
> > Saving the resources needed for cache would seem to obviously leave them
> > for other processes.
>
> I feel that the word "direct" has nothing to do with caching (or lack
> thereof). "Direct" means that I want to avoid extra allocations and memcpy:
>
> write(fd, hugebuf, 100*1024*1024);
>
> Here the application uses 100 megs for hugebuf, and if it is not
> sufficiently aligned, even the smartest kernel in this universe cannot DMA
> this data to disk. No way. So it needs to allocate ANOTHER, aligned buffer,
> memcpy the data (completely flushing the L1 and L2 dcaches), and DMA it
> from there. Thus we use twice as much RAM as we really need, and do
> a lot of mostly pointless memory moves! And worse, the application cannot
> even detect it - it works, it's just slow and eats a lot of RAM and CPU.
>
> That's where O_DIRECT helps. When an app wants to avoid that, it opens the
> fd with O_DIRECT. The app in effect says: "I *do* want to avoid extra
> shuffling, because I will write huge amounts of data in big blocks."
>
> > > But _conceptually_ "direct DMAing" and "do-not-cache-me"
> > > are orthogonal, right?
> >
> > In the sense that you must do DMA or use cache, yes.
>
> Let's say I implemented a heuristic in my cp command:
> if the source file is indeed a regular file and it is
> larger than 128K, allocate an aligned 128K buffer
> and try to copy it using O_DIRECT i/o.
>
> Then I use this "enhanced" cp command to copy a large directory
> recursively, and then I run grep on that directory.
>
> Can you explain why cp shouldn't cache the data it just wrote?
> I *am* going to use it shortly thereafter!
>
> > > That's why we also have bona fide fadvise and madvise
> > > with FADV_DONTNEED/MADV_DONTNEED:
> > >
> > > http://www.die.net/doc/linux/man/man2/fadvise.2.html
> > > http://www.die.net/doc/linux/man/man2/madvise.2.html
> > >
> > > _This_ is the proper way to say "do not cache me".
> >
> > But none of those advisories says how to cache or not, only what the
> > expected behavior will be. So FADV_NOREUSE does not control cache use,
> > it simply allows the system to make assumptions.
>
> Exactly. If you don't need the data, just let the kernel know that.
> When you use O_DIRECT, you are saying "I want direct DMA to disk without
> extra copying". With fadvise(FADV_DONTNEED) you are saying
> "do not expect access in the near future" == "do not try to optimize
> for possible accesses in the near future" == "do not cache"!

As long as "don't cache" doesn't imply "don't buffer." In the case of a
large copy or other large single-file write (8.5GB backup DVDs come to
mind), the desired behavior is to buffer if possible, start writing
immediately (the data will not change in the buffer), and release the
buffer as soon as the write is complete. That doesn't seem to be the
current interpretation of DONTNEED. Or of O_DIRECT either, I agree.
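
A minimal sketch of that buffer-write-release pattern, assuming
sync_file_range() (added in Linux 2.6.17) together with
posix_fadvise(POSIX_FADV_DONTNEED); the stream_out() helper and the 8 MB
chunk size are illustrative, not something from this thread:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define CHUNK (8 * 1024 * 1024)

/* Write buf to fd, kicking off writeback right away and dropping each
 * chunk from the page cache once it is safely on disk. The final chunk
 * is left to normal writeback. */
int stream_out(int fd, const char *buf, size_t len)
{
    size_t done = 0;

    while (done < len) {
        size_t n = len - done < CHUNK ? len - done : CHUNK;

        if (write(fd, buf + done, n) != (ssize_t)n)
            return -1;
        /* start writeback immediately; the data will not change */
        sync_file_range(fd, done, n, SYNC_FILE_RANGE_WRITE);
        if (done >= CHUNK) {
            /* wait for the previous chunk, then drop it: DONTNEED
             * only evicts pages that are already clean */
            sync_file_range(fd, done - CHUNK, CHUNK,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER);
            posix_fadvise(fd, done - CHUNK, CHUNK, POSIX_FADV_DONTNEED);
        }
        done += n;
    }
    return 0;
}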

> Again: with O_DIRECT:
>
> write(fd, hugebuf, 100*1024*1024);
>
> the kernel _has_ _difficulty_ caching this data, simply because
> the data isn't copied into kernel pages anyway, and if the user
> continues to use hugebuf after write(), the kernel simply cannot
> cache that data - it _hasn't got_ the data.

In Linux, if you point the gun at your foot and pull the trigger, it goes
bang. I have no problem with that.

> But what if the user unmaps the hugebuf? What then? Should the kernel
> forget that the data in those pages is in effect cached data from
> the file being written to? Not necessarily.

Why should the kernel make an effort to remember? Incompetence, like
virtue, is its own reward.

> Four years ago Linus wrote an email about it:
>
> http://lkml.org/lkml/2002/5/11/58
>
> btw, as an Oracle DBA on my day job, I completely agree
> with Linus on the "deranged monkey" comparison in that mail...

The problem with the suggested Linux implementation is that it's
complex, and currently would move a lot of the logic into user space, in
code which is probably not portable, or might tickle bad behavior on
other systems.

Around 2.4.16 (or an -aa variant) I tried code to track writes per file,
and if some number of bytes had been written to a file without a read or
seek, any buffered blocks were queued to be written. This got around the
behavior of generating data until memory was full, then writing it all
out and leaving the disk very busy. It was just a proof of concept, but
it did spread the disk writes into a more constant load and gave more
consistent response to other i/o. There doesn't seem to be an easy
tunable to do this, probably because the need isn't all that common.

--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979


Re: open(O_DIRECT) on a tmpfs?

2007-01-05 Thread Denis Vlasenko
On Friday 05 January 2007 17:20, Bill Davidsen wrote:
> Denis Vlasenko wrote:
> > But O_DIRECT is _not_ about cache. At least I think it was not about
> > cache initially, it was more about DMAing data directly from/to
> > application address space to/from disks, saving memcpy's and double
> > allocations. Why do you think it has those special alignment requirements?
> > Are they cache related? Not at all!

> I'm not sure I can see how you find "don't use cache" not cache related. 
> Saving the resources needed for cache would seem to obviously leave them 
> for other processes.

I feel that the word "direct" has nothing to do with caching (or lack
thereof). "Direct" means that I want to avoid extra allocations and memcpy:

write(fd, hugebuf, 100*1024*1024);

Here the application uses 100 megs for hugebuf, and if it is not
sufficiently aligned, even the smartest kernel in this universe cannot DMA
this data to disk. No way. So it needs to allocate ANOTHER, aligned buffer,
memcpy the data (completely flushing the L1 and L2 dcaches), and DMA it
from there. Thus we use twice as much RAM as we really need, and do
a lot of mostly pointless memory moves! And worse, the application cannot
even detect it - it works, it's just slow and eats a lot of RAM and CPU.

That's where O_DIRECT helps. When an app wants to avoid that, it opens the
fd with O_DIRECT. The app in effect says: "I *do* want to avoid extra
shuffling, because I will write huge amounts of data in big blocks."
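
A minimal sketch of the direct path being described, assuming a filesystem
that accepts O_DIRECT; posix_memalign() supplies the 512-byte alignment the
2.6 open(2) man page asks for, and the file name is illustrative:

#define _GNU_SOURCE               /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    size_t len = 100*1024*1024;   /* the hugebuf from above */
    void *hugebuf;
    int fd;

    /* aligned allocation: the kernel can DMA straight from this buffer,
     * with no bounce buffer and no memcpy */
    if (posix_memalign(&hugebuf, 512, len))
        return 1;

    fd = open("bigfile.out", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    if (write(fd, hugebuf, len) != (ssize_t)len)
        return 1;

    close(fd);
    free(hugebuf);
    return 0;
}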

> > But _conceptually_ "direct DMAing" and "do-not-cache-me"
> > are orthogonal, right?
>
> In the sense that you must do DMA or use cache, yes.

Let's say I implemented a heuristic in my cp command:
if the source file is indeed a regular file and it is
larger than 128K, allocate an aligned 128K buffer
and try to copy it using O_DIRECT i/o.

Then I use this "enhanced" cp command to copy a large directory
recursively, and then I run grep on that directory.

Can you explain why cp shouldn't cache the data it just wrote?
I *am* going to use it shortly thereafter!

> > That's why we also have bona fide fadvise and madvise
> > with FADV_DONTNEED/MADV_DONTNEED:
> >
> > http://www.die.net/doc/linux/man/man2/fadvise.2.html
> > http://www.die.net/doc/linux/man/man2/madvise.2.html
> >
> > _This_ is the proper way to say "do not cache me".
>
> But none of those advisories says how to cache or not, only what the 
> expected behavior will be. So FADV_NOREUSE does not control cache use, 
> it simply allows the system to make assumptions.

Exactly. If you don't need the data, just let the kernel know that.
When you use O_DIRECT, you are saying "I want direct DMA to disk without
extra copying". With fadvise(FADV_DONTNEED) you are saying
"do not expect access in the near future" == "do not try to optimize
for possible accesses in the near future" == "do not cache"!
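
A minimal sketch of that "do not cache me" idiom; the write_uncached()
helper and file name are illustrative, and the fsync() is there because
DONTNEED can only drop pages that are already clean:

#include <fcntl.h>
#include <unistd.h>

/* Plain buffered write, then tell the kernel the pages won't be
 * needed again. Error handling omitted for brevity. */
void write_uncached(const char *buf, size_t len)
{
    int fd = open("dump.dat", O_WRONLY | O_CREAT, 0644);

    write(fd, buf, len);
    fsync(fd);       /* DONTNEED only drops pages that are clean */
    posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);
    close(fd);
}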

Again: with O_DIRECT:

write(fd, hugebuf, 100*1024*1024);

the kernel _has_ _difficulty_ caching this data, simply because
the data isn't copied into kernel pages anyway, and if the user
continues to use hugebuf after write(), the kernel simply cannot
cache that data - it _hasn't got_ the data.

But what if the user unmaps the hugebuf? What then? Should the kernel
forget that the data in those pages is in effect cached data from
the file being written to? Not necessarily.

Four years ago Linus wrote an email about it:

http://lkml.org/lkml/2002/5/11/58

btw, as an Oracle DBA on my day job, I completely agree
with Linus on the "deranged monkey" comparison in that mail...
--
vda


Re: open(O_DIRECT) on a tmpfs?

2007-01-05 Thread Bill Davidsen

Denis Vlasenko wrote:
> On Thursday 04 January 2007 17:19, Bill Davidsen wrote:
> > Hugh Dickins wrote:
> > In many cases the use of O_DIRECT is purely to avoid impact on cache
> > used by other applications. An application which writes a large quantity
> > of data will have less impact on other applications by using O_DIRECT,
> > assuming that the data will not be read from cache due to application
> > pattern or the data being much larger than physical memory.
>
> But O_DIRECT is _not_ about cache. At least I think it was not about
> cache initially, it was more about DMAing data directly from/to
> application address space to/from disks, saving memcpy's and double
> allocations. Why do you think it has those special alignment
> requirements? Are they cache related? Not at all!

I'm not sure I can see how you find "don't use cache" not cache related.
Saving the resources needed for cache would seem to obviously leave them
for other processes.

> After that people started adding unrelated semantics on it -
> "oh, we use O_DIRECT in our database code and it pushes EVERYTHING
> else out of cache. This is bad. Let's overload O_DIRECT to also mean
> 'do not pollute the cache'. Here's the patch".

Did O_DIRECT ever use cache in some way? Doing DMA directly out of user
space would seem to avoid using cache unless code was actually added to
write to cache as well as disk, since the data isn't needed in any buffer.

> DB people from certain well-known commercial DB have zero coding
> taste. No wonder their binaries are nearly 100 MB (!!!) in size...
>
> In all fairness, O_DIRECT's direct DMA makes it easier to implement
> "do-not-cache-me" than to do it for generic read()/write()
> (just because O_DIRECT is (was?) using a different code path,
> not integrated into the VM cache machinery that much).
>
> But _conceptually_ "direct DMAing" and "do-not-cache-me"
> are orthogonal, right?

In the sense that you must do DMA or use cache, yes.

> That's why we also have bona fide fadvise and madvise
> with FADV_DONTNEED/MADV_DONTNEED:
>
> http://www.die.net/doc/linux/man/man2/fadvise.2.html
> http://www.die.net/doc/linux/man/man2/madvise.2.html
>
> _This_ is the proper way to say "do not cache me".

But none of those advisories says how to cache or not, only what the
expected behavior will be. So FADV_NOREUSE does not control cache use,
it simply allows the system to make assumptions. If I still had the load
which generated my cache problems I would try both methods while doing a
large data copy, and see if the end result was similar. In theory
NOREUSE "could be" more efficient in its use of the disk, but also use a
lot of cache, depending on the implementation.

One of the problems with RAID-5 and large data is that you can read it a
lot faster than you can write it (in most cases), resulting in filling
the cache with data from one process. Perhaps a scheduler tunable for
allowed queued disk data would help with this, but copying a TB data set
has a very bad effect on other i/o.

> I think tmpfs should just ignore the O_DIRECT bit.
> That won't require much coding.

Since tmpfs is useful for testing programs, this would have an actual
user benefit.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979


Re: open(O_DIRECT) on a tmpfs?

2007-01-05 Thread Jesper Juhl

On 05/01/07, Jesper Juhl <[EMAIL PROTECTED]> wrote:
> On 04/01/07, Hua Zhong <[EMAIL PROTECTED]> wrote:
> > > I see that as a good argument _not_ to allow O_DIRECT on
> > > tmpfs, which inevitably impacts cache, even if O_DIRECT were
> > > requested.
> > >
> > > But I'd also expect any app requesting O_DIRECT in that way,
> > > as a caring citizen, to fall back to going without O_DIRECT
> > > when it's not supported.
> >
> > According to "man 2 open" on my system:
> >
> >        O_DIRECT
> >               Try to minimize cache effects of the I/O to and from this
> >               file.  In general this will degrade performance, but it is
> >               useful in special situations, such as when applications do
> >               their own caching.  File I/O is done directly to/from user
> >               space buffers.  The I/O is synchronous, i.e., at the
> >               completion of the read(2) or write(2) system call, data is
> >               guaranteed to have been transferred.  Under Linux 2.4
> >               transfer sizes, and the alignment of user buffer and file
> >               offset must all be multiples of the logical block size of
> >               the file system.  Under Linux 2.6 alignment to 512-byte
> >               boundaries suffices.
> >               A semantically similar interface for block devices is
> >               described in raw(8).
> >
> > This says nothing about (probably disk based) persistent backing store.
> > I don't see why tmpfs has to conflict with it.
> >
> > So I'd argue that it makes more sense to support O_DIRECT on tmpfs as
> > the memory IS the backing store.
>
> I'd agree.  O_DIRECT means data will go direct to backing store, so if
> RAM *is* the backing store as in the tmpfs case, then I see why
> O_DIRECT should fail for it...

Whoops, that should of course have read "then I *DON'T* see why
O_DIRECT should fail".

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html


Re: open(O_DIRECT) on a tmpfs?

2007-01-05 Thread Jesper Juhl

On 04/01/07, Hua Zhong <[EMAIL PROTECTED]> wrote:
> > I see that as a good argument _not_ to allow O_DIRECT on
> > tmpfs, which inevitably impacts cache, even if O_DIRECT were
> > requested.
> >
> > But I'd also expect any app requesting O_DIRECT in that way,
> > as a caring citizen, to fall back to going without O_DIRECT
> > when it's not supported.
>
> According to "man 2 open" on my system:
>
>        O_DIRECT
>               Try to minimize cache effects of the I/O to and from this
>               file.  In general this will degrade performance, but it is
>               useful in special situations, such as when applications do
>               their own caching.  File I/O is done directly to/from user
>               space buffers.  The I/O is synchronous, i.e., at the
>               completion of the read(2) or write(2) system call, data is
>               guaranteed to have been transferred.  Under Linux 2.4
>               transfer sizes, and the alignment of user buffer and file
>               offset must all be multiples of the logical block size of
>               the file system.  Under Linux 2.6 alignment to 512-byte
>               boundaries suffices.
>               A semantically similar interface for block devices is
>               described in raw(8).
>
> This says nothing about (probably disk based) persistent backing store.
> I don't see why tmpfs has to conflict with it.
>
> So I'd argue that it makes more sense to support O_DIRECT on tmpfs as
> the memory IS the backing store.



I'd agree.  O_DIRECT means data will go direct to backing store, so if
RAM *is* the backing store as in the tmpfs case, then I see why
O_DIRECT should fail for it...

I often use tmpfs when I want to test new setups - it's easy to get
rid of again and it's fast during testing. Why shouldn't I be able to
test apps that use O_DIRECT this way?

--
Jesper Juhl <[EMAIL PROTECTED]>
Don't top-post  http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please  http://www.expita.com/nomime.html


Re: open(O_DIRECT) on a tmpfs?

2007-01-05 Thread Helge Hafting

Hugh Dickins wrote:
> On Thu, 4 Jan 2007, Hua Zhong wrote:
> > So I'd argue that it makes more sense to support O_DIRECT
> > on tmpfs as the memory IS the backing store.
>
> A few more voices in favour and I'll be persuaded.  Perhaps I'm
> out of date: when O_DIRECT came in, just a few filesystems supported
> it, and it was perfectly normal for open O_DIRECT to be failed; but
> I wouldn't want tmpfs to stand out now as a lone obstacle.

Having tmpfs support O_DIRECT makes sense.
For me, O_DIRECT says "write directly to the device
and don't return till it's done."  Which is what tmpfs
always does anyway.

The support could probably be as simple as ignoring
the flag entirely - mask it away in open() or something like that.

Arguments that "O_DIRECT says don't cache it and tmpfs
_is_ the cache" don't work.  O_DIRECT says "write straight
to the device", and the device just happens to be pagecache
memory.  The tmpfs file sure isn't cached elsewhere in
addition to its tmpfs pages.

Helge Hafting


Re: open(O_DIRECT) on a tmpfs?

2007-01-05 Thread Michael Tokarev
Hugh Dickins wrote:
> On Thu, 4 Jan 2007, Michael Tokarev wrote:
>> I wonder why open() with O_DIRECT (for example) bit set is
>> disallowed on a tmpfs (again, for example) filesystem,
>> returning EINVAL.
[]
> p.s.  You said "O_DIRECT (for example)" - what other open
> flag do you think tmpfs should support which it does not?

Well.  Somehow I was under the impression that O_SYNC behaves the
same as O_DIRECT on a tmpfs.  But I was wrong - tmpfs permits
O_SYNC opens just fine.  A strange thing to do given its behaviour
with O_DIRECT - to me it's inconsistent ;)
But that's it - looks like only O_DIRECT is "mishandled"
(which is not a big deal, obviously).
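
A quick probe of that inconsistency, assuming /dev/shm is a tmpfs mount
(an illustrative choice); on kernels of this era the O_SYNC open succeeds
while the O_DIRECT open fails with EINVAL:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/dev/shm/probe";    /* a file on tmpfs */
    int fd;

    fd = open(path, O_WRONLY | O_CREAT | O_SYNC, 0644);
    printf("O_SYNC:   %s\n", fd < 0 ? strerror(errno) : "ok");
    if (fd >= 0)
        close(fd);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    printf("O_DIRECT: %s\n", fd < 0 ? strerror(errno) : "ok");
    if (fd >= 0)
        close(fd);
    return 0;
}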

Thanks for your time!

/mjt


RE: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Chen, Kenneth W
Hugh Dickins wrote on Thursday, January 04, 2007 11:14 AM
> On Thu, 4 Jan 2007, Hua Zhong wrote:
> > So I'd argue that it makes more sense to support O_DIRECT
> > on tmpfs as the memory IS the backing store.
> 
> A few more voices in favour and I'll be persuaded.  Perhaps I'm
> out of date: when O_DIRECT came in, just a few filesystems supported
> it, and it was perfectly normal for open O_DIRECT to be failed; but
> I wouldn't want tmpfs to stand out now as a lone obstacle.

Maybe a bit hackish, but all we need is an empty .direct_IO method
in shmem_aops to make __dentry_open() pass the O_DIRECT check.  The
following patch adds 40 bytes of kernel text on x86-64.  An even more
hackish but zero-cost route is to make the .direct_IO pointer non-zero via
a cast of -1 or some such (that is probably ugly as hell).


diff -Nurp linus-2.6.git/mm/shmem.c linus-2.6.git.ken/mm/shmem.c
--- linus-2.6.git/mm/shmem.c	2006-12-27 19:06:11.000000000 -0800
+++ linus-2.6.git.ken/mm/shmem.c	2007-01-04 21:03:14.000000000 -0800
@@ -2314,10 +2314,18 @@ static void destroy_inodecache(void)
kmem_cache_destroy(shmem_inode_cachep);
 }
 
+ssize_t shmem_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
+   loff_t offset, unsigned long nr_segs)
+{
+   /* dummy direct_IO function.  Not to be executed */
+   BUG();
+}
+
 static const struct address_space_operations shmem_aops = {
.writepage  = shmem_writepage,
.set_page_dirty = __set_page_dirty_nobuffers,
 #ifdef CONFIG_TMPFS
+   .direct_IO  = shmem_direct_IO,
.prepare_write  = shmem_prepare_write,
.commit_write   = simple_commit_write,
 #endif


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Nick Piggin

Denis Vlasenko wrote:
> On Thursday 04 January 2007 17:19, Bill Davidsen wrote:
> > Hugh Dickins wrote:
> > In many cases the use of O_DIRECT is purely to avoid impact on cache
> > used by other applications. An application which writes a large quantity
> > of data will have less impact on other applications by using O_DIRECT,
> > assuming that the data will not be read from cache due to application
> > pattern or the data being much larger than physical memory.
>
> But O_DIRECT is _not_ about cache. At least I think it was not about
> cache initially, it was more about DMAing data directly from/to
> application address space to/from disks, saving memcpy's and double
> allocations. Why do you think it has those special alignment
> requirements? Are they cache related? Not at all!

I don't know whether that is the case. The two issues are related -- the
IO is done zero-copy because there is no cache involved, and due to
there being no cache, there are alignment restrictions.

I think IRIX might have implemented O_DIRECT first, and although the
semantics are a bit vague, I think it has always been to do zero-copy
IO _and_ to bypass cache (ie. no splice-like tricks).

> After that people started adding unrelated semantics on it -
> "oh, we use O_DIRECT in our database code and it pushes EVERYTHING
> else out of cache. This is bad. Let's overload O_DIRECT to also mean
> 'do not pollute the cache'. Here's the patch".

It is because they already do their own caching, so going through
another, dumber cache of the same or smaller size (the pagecache) is
useless. fadvise does not change that.

That said, tmpfs's pages are not really a cache (except when they are
swapcache, but let's not complicate things). So O_DIRECT on tmpfs
may not exactly be wrong.

--
SUSE Labs, Novell Inc.


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Denis Vlasenko
On Thursday 04 January 2007 17:19, Bill Davidsen wrote:
> Hugh Dickins wrote:
> In many cases the use of O_DIRECT is purely to avoid impact on cache 
> used by other applications. An application which writes a large quantity 
> of data will have less impact on other applications by using O_DIRECT, 
> assuming that the data will not be read from cache due to application 
> pattern or the data being much larger than physical memory.

But O_DIRECT is _not_ about cache. At least I think it was not about
cache initially, it was more about DMAing data directly from/to
application address space to/from disks, saving memcpy's and double
allocations. Why do you think it has those special alignment requirements?
Are they cache related? Not at all!

After that people started adding unrelated semantics on it -
"oh, we use O_DIRECT in our database code and it pushes EVERYTHING
else out of cache. This is bad. Let's overload O_DIRECT to also mean
'do not pollute the cache'. Here's the patch".

DB people from certain well-known commercial DB have zero coding
taste. No wonder their binaries are nearly 100 MB (!!!) in size...

In all fairness, O_DIRECT's direct DMA makes it easier to implement
"do-not-cache-me" than to do it for generic read()/write()
(just because O_DIRECT is (was?) using a different code path,
not integrated into the VM cache machinery that much).

But _conceptually_ "direct DMAing" and "do-not-cache-me"
are orthogonal, right?

That's why we also have bona fide fadvise and madvise
with FADV_DONTNEED/MADV_DONTNEED:

http://www.die.net/doc/linux/man/man2/fadvise.2.html
http://www.die.net/doc/linux/man/man2/madvise.2.html

_This_ is the proper way to say "do not cache me".

I think tmpfs should just ignore the O_DIRECT bit.
That won't require much coding.
--
vda


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Mark Lord

Hugh Dickins wrote:
> On Thu, 4 Jan 2007, Hua Zhong wrote:
> > So I'd argue that it makes more sense to support O_DIRECT
> > on tmpfs as the memory IS the backing store.
>
> A few more voices in favour and I'll be persuaded.

I see no reason to restrict it as is currently done.

Policy belongs in userspace, not in the kernel,
so long as the code impact is minuscule.

Cheers


RE: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Hugh Dickins
On Thu, 4 Jan 2007, Hua Zhong wrote:
> 
> So I'd argue that it makes more sense to support O_DIRECT
> on tmpfs as the memory IS the backing store.

A few more voices in favour and I'll be persuaded.  Perhaps I'm
out of date: when O_DIRECT came in, just a few filesystems supported
it, and it was perfectly normal for open O_DIRECT to be failed; but
I wouldn't want tmpfs to stand out now as a lone obstacle.

Christoph, what's your take on this?

Hugh


RE: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Hua Zhong
> I see that as a good argument _not_ to allow O_DIRECT on 
> tmpfs, which inevitably impacts cache, even if O_DIRECT were 
> requested.
> 
> But I'd also expect any app requesting O_DIRECT in that way, 
> as a caring citizen, to fall back to going without O_DIRECT 
> when it's not supported.

According to "man 2 open" on my system:

       O_DIRECT
              Try to minimize cache effects of the I/O to and from this file.
              In general this will degrade performance, but it is useful in
              special situations, such as when applications do their own
              caching.  File I/O is done directly to/from user space buffers.
              The I/O is synchronous, i.e., at the completion of the read(2)
              or write(2) system call, data is guaranteed to have been
              transferred.  Under Linux 2.4 transfer sizes, and the alignment
              of user buffer and file offset must all be multiples of the
              logical block size of the file system.  Under Linux 2.6
              alignment to 512-byte boundaries suffices.
              A semantically similar interface for block devices is described
              in raw(8).

This says nothing about (probably disk based) persistent backing store. I don't 
see why tmpfs has to conflict with it.

So I'd argue that it makes more sense to support O_DIRECT on tmpfs as the 
memory IS the backing store.

And EINVAL isn't even a very specific error.

Hua


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Bill Davidsen

Peter Staubach wrote:
> Hugh Dickins wrote:
> > On Thu, 4 Jan 2007, Bill Davidsen wrote:
> > > In many cases the use of O_DIRECT is purely to avoid impact on cache
> > > used by other applications. An application which writes a large
> > > quantity of data will have less impact on other applications by using
> > > O_DIRECT, assuming that the data will not be read from cache due to
> > > application pattern or the data being much larger than physical memory.
> >
> > I see that as a good argument _not_ to allow O_DIRECT on tmpfs,
> > which inevitably impacts cache, even if O_DIRECT were requested.
> >
> > But I'd also expect any app requesting O_DIRECT in that way, as a caring
> > citizen, to fall back to going without O_DIRECT when it's not supported.
>
> I suppose that one could also argue that the backing store for tmpfs
> is the memory itself and thus, O_DIRECT could or should be supported.

I suspect that many applications don't try to distinguish an open error
beyond pass/fail. If the application actually tried to correct errors,
like creating missing directories, it might, but if the error is going
to be reported to the user and treated as fatal there's probably no
logic to tell "can't do it" from "could if you asked the right way."

I always thought the difference between Linux and Windows was the "big
brother" attitude. If someone wants to use O_DIRECT and tmpfs, and the
system can allow it, why have code to block it because someone thinks
they know better how the users should do things.


--
bill davidsen <[EMAIL PROTECTED]>
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Peter Staubach

Hugh Dickins wrote:
> On Thu, 4 Jan 2007, Bill Davidsen wrote:
> > In many cases the use of O_DIRECT is purely to avoid impact on cache
> > used by other applications. An application which writes a large quantity
> > of data will have less impact on other applications by using O_DIRECT,
> > assuming that the data will not be read from cache due to application
> > pattern or the data being much larger than physical memory.
>
> I see that as a good argument _not_ to allow O_DIRECT on tmpfs,
> which inevitably impacts cache, even if O_DIRECT were requested.
>
> But I'd also expect any app requesting O_DIRECT in that way, as a caring
> citizen, to fall back to going without O_DIRECT when it's not supported.

I suppose that one could also argue that the backing store for tmpfs
is the memory itself and thus, O_DIRECT could or should be supported.

       Thanx...

          ps


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Hugh Dickins
On Thu, 4 Jan 2007, Bill Davidsen wrote:
> 
> In many cases the use of O_DIRECT is purely to avoid impact on cache used by
> other applications. An application which writes a large quantity of data will
> have less impact on other applications by using O_DIRECT, assuming that the
> data will not be read from cache due to application pattern or the data being
> much larger than physical memory.

I see that as a good argument _not_ to allow O_DIRECT on tmpfs,
which inevitably impacts cache, even if O_DIRECT were requested.

But I'd also expect any app requesting O_DIRECT in that way, as a caring
citizen, to fall back to going without O_DIRECT when it's not supported.
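
A minimal sketch of that fallback, assuming the filesystem rejects
O_DIRECT at open() time with EINVAL as tmpfs does here; the helper name
is illustrative:

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

int open_maybe_direct(const char *path, int flags)
{
    int fd = open(path, flags | O_DIRECT);

    if (fd < 0 && errno == EINVAL)       /* filesystem doesn't do O_DIRECT */
        fd = open(path, flags);          /* caring citizen: go without it */
    return fd;
}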

Hugh


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Bill Davidsen

Hugh Dickins wrote:
> On Thu, 4 Jan 2007, Michael Tokarev wrote:
> > I wonder why open() with O_DIRECT (for example) bit set is
> > disallowed on a tmpfs (again, for example) filesystem,
> > returning EINVAL.
>
> Because it would be (a very small amount of) work and bloat to
> support O_DIRECT on tmpfs; because that work didn't seem useful;
> and because the nature of tmpfs (completely in page cache) is at
> odds with the nature of O_DIRECT (completely avoiding page cache),
> so it would seem misleading to support it.
>
> You have a valid view, that we should not forbid what can easily be
> allowed; and a valid (experimental) use for O_DIRECT on tmpfs; and
> a valid alternative perception, that the nature of tmpfs is already
> direct, so O_DIRECT should be allowed as a no-op upon it.

It does seem odd to require that every application using O_DIRECT would
have to contain code to make it work with tmpfs, or that the admin would
have to jump through a hoop and introduce (slight) overhead to bypass
the problem, when the implementation is mostly to stop disallowing
something which would currently work if allowed.

> On the other hand, I'm glad that you've found a good workaround,
> using loop, and suspect that it's appropriate that you should have
> to use such a workaround: if the app cares so much that it insists
> on O_DIRECT succeeding (for the ordering and persistence of its
> metadata), would it be right for tmpfs to deceive it?

In many cases the use of O_DIRECT is purely to avoid impact on cache
used by other applications. An application which writes a large quantity
of data will have less impact on other applications by using O_DIRECT,
assuming that the data will not be read from cache due to application
pattern or the data being much larger than physical memory.

> I'm inclined to stick with the status quo;
> but could be persuaded by a chorus behind you.

This isn't impacting me directly, but I can imagine some applications I
have written, which currently use O_DIRECT, failing if someone chose to
put a control file on tmpfs. I may be missing some benefit from
restricting O_DIRECT, feel free to point it out.

> Hugh
>
> p.s.  You said "O_DIRECT (for example)" - what other open
> flag do you think tmpfs should support which it does not?



--
bill davidsen <[EMAIL PROTECTED]>
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Bodo Eggert
Michael Tokarev <[EMAIL PROTECTED]> wrote:

> I wonder why open() with O_DIRECT (for example) bit set is
> disallowed on a tmpfs (again, for example) filesystem,
> returning EINVAL.
> 
> Yes, the question may seem a bit strange, because of two
> somewhat conflicting reasons.  First, there's no reason to
> use O_DIRECT with tmpfs in the first place, because tmpfs does
> not have backing store at all, so there's no place to do
> direct writes to.  But on the other hand, again due to the very
> nature of tmpfs, there's no reason not to allow O_DIRECT
> open and just ignore it, -- exactly because there's no
> backing store for this filesystem.

I'm using a tmpfs as a mostly-ramdisk, that is I've set up a large swap
partition in case I need the RAM instead of using it for a filesystem.
Therefore it will sometimes have a backing store.

OTOH, ramfs does not have this property (the cache is the backing store),
so it would make sense to allow it at least there.

BTW: Maybe you could use a ramdisk instead of the loop-on-tmpfs.
-- 
I thank GMX for sabotaging the use of my addresses by means of lies
spread via SPF.

http://david.woodhou.se/why-not-spf.html


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Hugh Dickins
On Thu, 4 Jan 2007, Michael Tokarev wrote:
> I wonder why open() with O_DIRECT (for example) bit set is
> disallowed on a tmpfs (again, for example) filesystem,
> returning EINVAL.

Because it would be (a very small amount of) work and bloat to
support O_DIRECT on tmpfs; because that work didn't seem useful;
and because the nature of tmpfs (completely in page cache) is at
odds with the nature of O_DIRECT (completely avoiding page cache),
so it would seem misleading to support it.

You have a valid view, that we should not forbid what can easily be
allowed; and a valid (experimental) use for O_DIRECT on tmpfs; and
a valid alternative perception, that the nature of tmpfs is already
direct, so O_DIRECT should be allowed as a no-op upon it.

On the other hand, I'm glad that you've found a good workaround,
using loop, and suspect that it's appropriate that you should have
to use such a workaround: if the app cares so much that it insists
on O_DIRECT succeeding (for the ordering and persistence of its
metadata), would it be right for tmpfs to deceive it?

I'm inclined to stick with the status quo;
but could be persuaded by a chorus behind you.

Hugh

p.s.  You said "O_DIRECT (for example)" - what other open
flag do you think tmpfs should support which it does not?


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Hugh Dickins
On Thu, 4 Jan 2007, Michael Tokarev wrote:
 I wonder why open() with O_DIRECT (for example) bit set is
 disallowed on a tmpfs (again, for example) filesystem,
 returning EINVAL.

Because it would be (a very small amount of) work and bloat to
support O_DIRECT on tmpfs; because that work didn't seem useful;
and because the nature of tmpfs (completely in page cache) is at
odds with the nature of O_DIRECT (completely avoiding page cache),
so it would seem misleading to support it.

You have a valid view, that we should not forbid what can easily be
allowed; and a valid (experimental) use for O_DIRECT on tmpfs; and
a valid alternative perception, that the nature of tmpfs is already
direct, so O_DIRECT should be allowed as a no-op upon it.

On the other hand, I'm glad that you've found a good workaround,
using loop, and suspect that it's appropriate that you should have
to use such a workaround: if the app cares so much that it insists
on O_DIRECT succeeding (for the ordering and persistence of its
metadata), would it be right for tmpfs to deceive it?

I'm inclined to stick with the status quo;
but could be persuaded by a chorus behind you.

Hugh

p.s.  You said O_DIRECT (for example) - what other open
flag do you think tmpfs should support which it does not?
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Bodo Eggert
Michael Tokarev [EMAIL PROTECTED] wrote:

 I wonder why open() with O_DIRECT (for example) bit set is
 disallowed on a tmpfs (again, for example) filesystem,
 returning EINVAL.
 
 Yes, the question may seems strange a bit, because of two
 somewhat conflicting reasons.  First, there's no reason to
 use O_DIRECT with tmpfs in a first place, because tmpfs does
 not have backing store at all, so there's no place to do
 direct writes to.  But on another hand, again due to the very
 nature of tmpfs, there's no reason not to allow O_DIRECT
 open and just ignore it, -- exactly because there's no
 backing store for this filesystem.

I'm using a tmpfs as a mostly-ramdisk, that is I've set up a large swap
partition in case I need the RAM instead of using it for a filesystem.
Therefore it will sometimes have a backing store.

OTOH, ramfs does not have this property (the cache is the backing store),
so it would make sense to allow it at least there.

BTW: Maybe you could use a ramdisk instead of the loop-on-tmpfs.
-- 
Ich danke GMX dafür, die Verwendung meiner Adressen mittels per SPF
verbreiteten Lügen zu sabotieren.

http://david.woodhou.se/why-not-spf.html
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Bill Davidsen

Hugh Dickins wrote:
> On Thu, 4 Jan 2007, Michael Tokarev wrote:
>> I wonder why open() with O_DIRECT (for example) bit set is
>> disallowed on a tmpfs (again, for example) filesystem,
>> returning EINVAL.
>
> Because it would be (a very small amount of) work and bloat to
> support O_DIRECT on tmpfs; because that work didn't seem useful;
> and because the nature of tmpfs (completely in page cache) is at
> odds with the nature of O_DIRECT (completely avoiding page cache),
> so it would seem misleading to support it.
>
> You have a valid view, that we should not forbid what can easily be
> allowed; and a valid (experimental) use for O_DIRECT on tmpfs; and
> a valid alternative perception, that the nature of tmpfs is already
> direct, so O_DIRECT should be allowed as a no-op upon it.


It does seem odd to require that every application using O_DIRECT
contain code to make it work with tmpfs, or that the admin jump through
a hoop and introduce (slight) overhead to work around the problem, when
the fix mostly amounts to no longer disallowing something that would
work if allowed.




> On the other hand, I'm glad that you've found a good workaround,
> using loop, and suspect that it's appropriate that you should have
> to use such a workaround: if the app cares so much that it insists
> on O_DIRECT succeeding (for the ordering and persistence of its
> metadata), would it be right for tmpfs to deceive it?


In many cases the use of O_DIRECT is purely to avoid impact on cache 
used by other applications. An application which writes a large quantity 
of data will have less impact on other applications by using O_DIRECT, 
assuming that the data will not be read from cache due to application 
pattern or the data being much larger than physical memory.


> I'm inclined to stick with the status quo;
> but could be persuaded by a chorus behind you.


This isn't impacting me directly, but I can imagine some applications I
have written, which currently use O_DIRECT, failing if someone chose to
put a control file on tmpfs. I may be missing some benefit from
restricting O_DIRECT; feel free to point it out.


> Hugh
>
> p.s.  You said "O_DIRECT (for example)" - what other open
> flag do you think tmpfs should support which it does not?



--
bill davidsen [EMAIL PROTECTED]
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Hugh Dickins
On Thu, 4 Jan 2007, Bill Davidsen wrote:
> In many cases the use of O_DIRECT is purely to avoid impact on cache used by
> other applications. An application which writes a large quantity of data will
> have less impact on other applications by using O_DIRECT, assuming that the
> data will not be read from cache due to application pattern or the data being
> much larger than physical memory.

I see that as a good argument _not_ to allow O_DIRECT on tmpfs,
which inevitably impacts cache, even if O_DIRECT were requested.

But I'd also expect any app requesting O_DIRECT in that way, as a caring
citizen, to fall back to going without O_DIRECT when it's not supported.
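
(Such a fall-back might look like the following minimal sketch: a
hypothetical helper, with error handling beyond the retry elided.)

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>

/* Try O_DIRECT first; if this filesystem refuses it (EINVAL),
 * fall back to an ordinary buffered open. */
int open_maybe_direct(const char *path, int flags)
{
	int fd = open(path, flags | O_DIRECT);

	if (fd < 0 && errno == EINVAL)
		fd = open(path, flags);
	return fd;
}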

Hugh


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Peter Staubach

Hugh Dickins wrote:
> On Thu, 4 Jan 2007, Bill Davidsen wrote:
>> In many cases the use of O_DIRECT is purely to avoid impact on cache used by
>> other applications. An application which writes a large quantity of data will
>> have less impact on other applications by using O_DIRECT, assuming that the
>> data will not be read from cache due to application pattern or the data being
>> much larger than physical memory.
>
> I see that as a good argument _not_ to allow O_DIRECT on tmpfs,
> which inevitably impacts cache, even if O_DIRECT were requested.
>
> But I'd also expect any app requesting O_DIRECT in that way, as a caring
> citizen, to fall back to going without O_DIRECT when it's not supported.


I suppose that one could also argue that the backing store for tmpfs
is the memory itself and thus, O_DIRECT could or should be supported.

   Thanx...

  ps


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Bill Davidsen

Peter Staubach wrote:
> Hugh Dickins wrote:
>> On Thu, 4 Jan 2007, Bill Davidsen wrote:
>>> In many cases the use of O_DIRECT is purely to avoid impact on cache
>>> used by other applications. An application which writes a large
>>> quantity of data will have less impact on other applications by using
>>> O_DIRECT, assuming that the data will not be read from cache due to
>>> application pattern or the data being much larger than physical memory.
>>
>> I see that as a good argument _not_ to allow O_DIRECT on tmpfs,
>> which inevitably impacts cache, even if O_DIRECT were requested.
>>
>> But I'd also expect any app requesting O_DIRECT in that way, as a caring
>> citizen, to fall back to going without O_DIRECT when it's not supported.
>
> I suppose that one could also argue that the backing store for tmpfs
> is the memory itself and thus, O_DIRECT could or should be supported.


I suspect that many applications don't try to distinguish an open error
beyond pass/fail. If the application actually tried to correct errors,
like creating missing directories, it might, but if the error is going
to be reported to the user and treated as fatal, there's probably no
logic to tell "can't do it" from "could if you asked the right way".


I always thought the difference between Linux and Windows was the big
brother attitude. If someone wants to use O_DIRECT on tmpfs, and the
system can allow it, why have code to block it just because someone
thinks they know better how users should do things?


--
bill davidsen [EMAIL PROTECTED]
 CTO TMR Associates, Inc
 Doing interesting things with small computers since 1979



RE: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Hua Zhong
> I see that as a good argument _not_ to allow O_DIRECT on
> tmpfs, which inevitably impacts cache, even if O_DIRECT were
> requested.
>
> But I'd also expect any app requesting O_DIRECT in that way,
> as a caring citizen, to fall back to going without O_DIRECT
> when it's not supported.

According to man 2 open on my system:

   O_DIRECT
  Try to minimize cache effects of the I/O to and from this file.
  In  general  this will degrade performance, but it is useful in
  special situations, such as  when  applications  do  their  own
  caching.  File I/O is done directly to/from user space buffers.
  The I/O is synchronous, i.e., at the completion of the  read(2)
  or write(2) system call, data is guaranteed to have been trans-
  ferred.  Under Linux 2.4 transfer sizes, and the  alignment  of
  user  buffer and file offset must all be multiples of the logi-
  cal block size of the file system. Under Linux 2.6 alignment to
  512-byte boundaries suffices.
  A semantically similar interface for block devices is described
  in raw(8).

This says nothing about a (probably disk-based) persistent backing
store. I don't see why tmpfs has to conflict with it.

So I'd argue that it makes more sense to support O_DIRECT on tmpfs as the 
memory IS the backing store.

And EINVAL isn't even a very specific error.
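
(For concreteness, a minimal user-space sketch of the aligned I/O the
man page describes; the path is hypothetical and error handling is
trimmed.)

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	int fd;

	/* Buffer, transfer size and file offset all aligned to 512
	 * bytes, per the Linux 2.6 rule quoted above. */
	if (posix_memalign(&buf, 512, 4096))
		return 1;
	memset(buf, 0, 4096);

	fd = open("/tmp/direct-test", O_WRONLY | O_CREAT | O_DIRECT, 0644);
	if (fd < 0)
		return 1;	/* e.g. EINVAL where the fs refuses O_DIRECT */

	write(fd, buf, 4096);	/* DMA'd straight from buf, no extra copy */
	close(fd);
	free(buf);
	return 0;
}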

Hua



RE: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Hugh Dickins
On Thu, 4 Jan 2007, Hua Zhong wrote:
> So I'd argue that it makes more sense to support O_DIRECT
> on tmpfs as the memory IS the backing store.

A few more voices in favour and I'll be persuaded.  Perhaps I'm
out of date: when O_DIRECT came in, just a few filesystems supported
it, and it was perfectly normal for open O_DIRECT to be failed; but
I wouldn't want tmpfs to stand out now as a lone obstacle.

Christoph, what's your take on this?

Hugh


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Mark Lord

Hugh Dickins wrote:
> On Thu, 4 Jan 2007, Hua Zhong wrote:
>> So I'd argue that it makes more sense to support O_DIRECT
>> on tmpfs as the memory IS the backing store.
>
> A few more voices in favour and I'll be persuaded.


I see no reason to restrict it as is currently done.

Policy belongs in userspace, not in the kernel,
so long as the code impact is minuscule.

Cheers


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Denis Vlasenko
On Thursday 04 January 2007 17:19, Bill Davidsen wrote:
> Hugh Dickins wrote:
> In many cases the use of O_DIRECT is purely to avoid impact on cache
> used by other applications. An application which writes a large quantity
> of data will have less impact on other applications by using O_DIRECT,
> assuming that the data will not be read from cache due to application
> pattern or the data being much larger than physical memory.

But O_DIRECT is _not_ about cache. At least I think it was not about
cache initially, it was more about DMAing data directly from/to
application address space to/from disks, saving memcpy's and double
allocations. Why do you think it has those special alignment requirements?
Are they cache related? Not at all!

After that people started adding unrelated semantics onto it -
oh, we use O_DIRECT in our database code and it pushes EVERYTHING
else out of cache. This is bad. Let's overload O_DIRECT to also mean
'do not pollute the cache'. Here's the patch.

DB people from certain well-known commercial DB have zero coding
taste. No wonder their binaries are nearly 100 MB (!!!) in size...

In all fairness, O_DIRECT's direct-DMA makes it easier to implement
do-not-cache-me than to do it for generic read()/write()
(just because O_DIRECT is (was?) using a different code path,
not integrated into the VM cache machinery that much).

But _conceptually_ "direct DMAing" and "do-not-cache-me"
are orthogonal, right?

That's why we also have bona fide fadvise and madvise
with FADV_DONTNEED/MADV_DONTNEED:

http://www.die.net/doc/linux/man/man2/fadvise.2.html
http://www.die.net/doc/linux/man/man2/madvise.2.html

_This_ is the proper way to say "do not cache me".
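
(A sketch of that advice after a large streaming write, using the glibc
posix_fadvise() wrapper; the fsync() matters because only clean pages
can be dropped.)

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <unistd.h>

/* After streaming out data we won't re-read, invite the kernel to
 * drop the cached pages for the whole file (offset 0, len 0). */
static void drop_cached_pages(int fd)
{
	fsync(fd);	/* dirty pages can't be dropped until written back */
	posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
}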

I think tmpfs should just ignore the O_DIRECT bit.
That wouldn't require much coding.
--
vda


Re: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Nick Piggin

Denis Vlasenko wrote:

> On Thursday 04 January 2007 17:19, Bill Davidsen wrote:
>> Hugh Dickins wrote:
>> In many cases the use of O_DIRECT is purely to avoid impact on cache
>> used by other applications. An application which writes a large quantity
>> of data will have less impact on other applications by using O_DIRECT,
>> assuming that the data will not be read from cache due to application
>> pattern or the data being much larger than physical memory.
>
> But O_DIRECT is _not_ about cache. At least I think it was not about
> cache initially, it was more about DMAing data directly from/to
> application address space to/from disks, saving memcpy's and double
> allocations. Why do you think it has those special alignment requirements?
> Are they cache related? Not at all!


I don't know whether that is the case. The two issues are related -- the
IO is done zero-copy because there is no cache involved, and due to
there being no cache, there are alignment restrictions.

I think IRIX might have implemented O_DIRECT first, and although the
semantics are a bit vague, I think it has always been to do zero copy
IO _and_ to bypass cache (ie. no splice-like tricks).


> After that people started adding unrelated semantics onto it -
> oh, we use O_DIRECT in our database code and it pushes EVERYTHING
> else out of cache. This is bad. Let's overload O_DIRECT to also mean
> 'do not pollute the cache'. Here's the patch.


It is because they already do their own caching, so going through
another, dumber cache of the same or smaller size (the pagecache) is
useless. fadvise does not change that.

That said, tmpfs's pages are not really a cache (except when they are
swapcache, but let's not complicate things). So O_DIRECT on tmpfs
may not exactly be wrong.

--
SUSE Labs, Novell Inc.


RE: open(O_DIRECT) on a tmpfs?

2007-01-04 Thread Chen, Kenneth W
Hugh Dickins wrote on Thursday, January 04, 2007 11:14 AM
> On Thu, 4 Jan 2007, Hua Zhong wrote:
>> So I'd argue that it makes more sense to support O_DIRECT
>> on tmpfs as the memory IS the backing store.
>
> A few more voices in favour and I'll be persuaded.  Perhaps I'm
> out of date: when O_DIRECT came in, just a few filesystems supported
> it, and it was perfectly normal for open O_DIRECT to be failed; but
> I wouldn't want tmpfs to stand out now as a lone obstacle.

Maybe a bit hackish, but all we need is an empty .direct_IO method
in shmem_aops to make __dentry_open() pass the O_DIRECT check.  The
following patch adds 40 bytes of kernel text on x86-64.  An even more
hackish but zero-cost route is to make the .direct_IO pointer non-zero
via a cast of -1 or some such (that is probably ugly as hell).


diff -Nurp linus-2.6.git/mm/shmem.c linus-2.6.git.ken/mm/shmem.c
--- linus-2.6.git/mm/shmem.c	2006-12-27 19:06:11.0 -0800
+++ linus-2.6.git.ken/mm/shmem.c	2007-01-04 21:03:14.0 -0800
@@ -2314,10 +2314,19 @@ static void destroy_inodecache(void)
 	kmem_cache_destroy(shmem_inode_cachep);
 }
 
+ssize_t shmem_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
+			loff_t offset, unsigned long nr_segs)
+{
+	/* dummy direct_IO method; never executed, only tested for */
+	BUG();
+	return 0;
+}
+
 static const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_nobuffers,
 #ifdef CONFIG_TMPFS
+	.direct_IO	= shmem_direct_IO,
 	.prepare_write	= shmem_prepare_write,
 	.commit_write	= simple_commit_write,
 #endif