Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Jörn Engel
On Mon, 5 March 2007 01:36:36 +0100, Arnd Bergmann wrote:
 
> Using the current glibc implementation on a compressed file system ideally
> should be a very expensive no-op because you won't actually allocate much
> space for a file when writing zeroes to it. You also don't benefit from a
> contiguous allocation in logfs, since flash has uniform seek times over
> all the medium.
>
> I'd suggest you implement posix_fallocate as a real nop and just return
> success without doing anything. You could also return ENOSPC in case
> the blocks requested by posix_fallocate don't fit on the medium without
> compression, but that is more or less just guesswork (like statfs is).

Quoting POSIX_FALLOCATE(3):
   The function posix_fallocate() ensures that disk space is allocated for
   the file referred to by the descriptor fd for the bytes  in  the range
   starting  at  offset  and continuing for len bytes.  After a successful
   call to posix_fallocate(), subsequent writes to bytes in the specified
   range are guaranteed not to fail because of lack of disk space.

   If  the  size  of  the  file  is less than offset+len, then the file is
   increased to this size; otherwise the file size is left unchanged.

Afaics, the (main) purpose of this function is not to decrease
fragmentation but to ensure mmap() won't cause any problems because the
medium fills up.  That problem exists for LogFS as well, once rw mmap()
is supported.
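
A minimal sketch of the call described in the man page excerpt above, assuming a POSIX system. The names reserve_range() and file_size() are hypothetical wrappers introduced only for illustration; posix_fallocate() and fstat() are the real interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: reserve len bytes starting at offset.
 * posix_fallocate() returns 0 on success or an error number directly;
 * it does not report failure through errno. */
static int reserve_range(int fd, off_t offset, off_t len)
{
    assert(fd >= 0);
    int err = posix_fallocate(fd, offset, len);
    if (err != 0)
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
    return err;
}

/* Hypothetical helper: size of the file behind fd, or -1 on error. */
static off_t file_size(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? st.st_size : (off_t)-1;
}
```

On success the file is at least offset+len bytes long; a second call covering a smaller range leaves the size unchanged, as the quoted text says. Whether the reservation survives compression is exactly the open question for LogFS.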

Simply returning success without doing anything would be a bug.  -ENOSPC
is a better choice, but still a lame implementation.  And falling back
on libc to write zeroes in a loop is an exercise in futility.

Does the allocation have to be persistent beyond the lifetime of the file
descriptor?  It would be fairly simple to support the write guarantee
while the file is open (or rather the inode remains cached) and drop it
afterwards.

Jörn

-- 
[One] doesn't need to know [...] how to cause a headache in order
to take an aspirin.
-- Scott Culp, Manager of the Microsoft Security Response Center, 2001
-
To unsubscribe from this list: send the line unsubscribe linux-fsdevel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Christoph Hellwig
On Sat, Mar 03, 2007 at 11:45:32PM +0100, Arnd Bergmann wrote:
>> I'd be more happy to have the write out zeroes loop in glibc.  And
>> glibc needs to have it anyway, for older kernels.
>
> A generic_fallocate makes sense to me iff we can do it in the kernel
> significantly more efficiently than in glibc, e.g. by using only
> a single page in page cache instead of one for each page to be preallocated.

We can't do that with the current page cache interfaces.  But what
might make sense is to have a block_dump_prealloc that takes a get_block
callback to do what you propose.  It still wouldn't be entirely generic,
but would allow block based filesystems to do a not entirely dumb
implementation.


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Jörn Engel
On Mon, 5 March 2007 00:32:14 +, Anton Altaparmakov wrote:
 
> I don't know how your compression algorithm works [...]

LogFS is designed for flash media, so it does not have to worry much
about reducing disk seeks.  It is log-structured, which simplifies
compression further.

When writing a block, it basically compresses it and appends it to the
log.  Writes only have to be byte-aligned, so no space is lost for
padding.

The bad news for posix_fallocate() is that even if libc is smart enough
to write random data, mmap() can still cause problems.  If the VM
decides to write a given page twice, the second write compresses better
and the medium has filled up between the two writes, the users will have
fun.

Jörn

-- 
Joern's library part 9:
http://www.scl.ameslab.gov/Publications/Gus/TwelveWays.html


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Anton Altaparmakov

On 5 Mar 2007, at 14:37, Theodore Tso wrote:
> On Sun, Mar 04, 2007 at 11:22:06PM +, Anton Altaparmakov wrote:
>> And I specifically did NOT update the initialized size in the inode
>> thus it will remain at its old value thus all new allocated blocks
>> will be considered as present but not initialized thus a read will
>> always return zero whilst a write will do the right thing and pad
>> with zeroes as necessary (if the write is smaller than the block
>> size, etc).
>
> You're describing a method of doing in-advance preallocation
> where the filesystem format explicitly has support for this kind of
> feature in a way that doesn't require pre-zeroing the data blocks in
> question.

Indeed.

> The question which this subthread was concerned about was
> whether the kernel should get involved in initializing datablocks in
> the case where the filesystem format does not have this support, or
> whether this functionality should continue to be done in userspace.
> Given that glibc already has to support this for older kernels, I
> would argue that there's no point putting in generic support for
> filesystems that can't support a more advanced way of doing things.


Yes, I understood that after I had sent my post...  And yes, I would  
agree.  If glibc already does this there does not appear to be any  
value in just moving existing functionality into the kernel.  Simply  
let dumb file systems return ENOSYS and let glibc do it...  And any  
FS which can do it better can implement the function and then glibc  
should not go anywhere near it.
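
The division of labour described above can be modelled in user space roughly as follows. This is only a sketch under stated assumptions: every name here (prealloc_fn, preallocate, dumb_fs_fallocate, smart_fs_fallocate, zero_fill_fallback) is hypothetical, and the stub bodies merely count calls instead of touching any file.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical signature shared by a filesystem hook and the libc fallback. */
typedef int (*prealloc_fn)(int fd, off_t offset, off_t len);

static int fs_calls, fallback_calls;

/* Stub for a filesystem without native preallocation support. */
static int dumb_fs_fallocate(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    fs_calls++;
    return ENOSYS;
}

/* Stub for a filesystem that can preallocate on its own. */
static int smart_fs_fallocate(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    fs_calls++;
    return 0;
}

/* Stub standing in for the existing write-zeroes loop in libc. */
static int zero_fill_fallback(int fd, off_t offset, off_t len)
{
    (void)fd; (void)offset; (void)len;
    fallback_calls++;
    return 0;
}

/* Try the filesystem first; fall back only when it reports ENOSYS. */
static int preallocate(prealloc_fn fs_op, prealloc_fn fallback,
                       int fd, off_t offset, off_t len)
{
    assert(fallback != NULL);
    int err = fs_op != NULL ? fs_op(fd, offset, len) : ENOSYS;
    if (err == ENOSYS)
        err = fallback(fd, offset, len);
    return err;
}
```

Any error other than ENOSYS (for example ENOSPC) is passed straight back to the caller, so the fallback never masks a real failure from a filesystem that does implement the operation.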


Best regards,

Anton
--
Anton Altaparmakov aia21 at cam.ac.uk (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/




Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Ulrich Drepper
Theodore Tso wrote:
> Given that glibc already has to support this for older kernels, I
> would argue that there's no point putting in generic support for
> filesystems that can't support a more advanced way of doing things.

Well, I'm sure the kernel can do better than the code we have in libc
now.  The kernel has access to the bitmasks which say which blocks have
already been allocated.  The libc code does not and we have to be very
simple-minded and simply touch every block.  And this means reading it
and then writing it back.  The kernel would know when the reading part
is not necessary.  Add to that the block granularity (we use f_bsize as
returned from fstatfs but that's not the best value in some cases) and
you have compelling data to have generic code in the kernel.  The libc
implementation can then go away completely which is a good thing.
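
The kind of user-space emulation described above can be sketched as follows. This is not the glibc source; it is a simplified illustration that assumes Linux (for fstatfs() from <sys/vfs.h>) and a descriptor opened for reading and writing. The name emulate_fallocate() and the 512-byte fallback step are assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/vfs.h>
#include <unistd.h>

/* Hypothetical emulation: touch one byte in every f_bsize-sized block that
 * overlaps [offset, offset+len), plus the final byte of the range.
 * Each byte is read and written back unchanged, so existing data survives;
 * bytes past end-of-file read as "nothing" and are written as zero. */
static int emulate_fallocate(int fd, off_t offset, off_t len)
{
    struct statfs sfs;

    assert(fd >= 0);
    if (offset < 0 || len <= 0)
        return EINVAL;
    if (fstatfs(fd, &sfs) != 0)
        return errno;

    off_t step = sfs.f_bsize > 0 ? (off_t)sfs.f_bsize : 512;
    off_t end = offset + len;
    off_t pos = offset;

    for (;;) {
        unsigned char byte = 0;
        if (pread(fd, &byte, 1, pos) < 0)   /* 0 bytes read: past EOF */
            return errno;
        ssize_t written = pwrite(fd, &byte, 1, pos);
        if (written != 1)
            return written < 0 ? errno : ENOSPC;
        if (pos == end - 1)
            break;
        /* Jump to the next block boundary, clamped to the last byte. */
        off_t next = (pos / step + 1) * step;
        pos = next < end ? next : end - 1;
    }
    return 0;
}
```

The sketch shows both costs named above: every block is read before it is written, because user space cannot tell which blocks are already allocated, and the step size is only as good as f_bsize. It is also not atomic with respect to concurrent writers.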

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Ulrich Drepper
Jörn Engel wrote:
> Does the allocation have to be persistent beyond the lifetime of the file
> descriptor?

Of course.  You call posix_fallocate once for the lifetime of the file
when it is created to ensure that all future uses will work.

It seems your filesystem will not be able to support this unless
compression is turned off.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Ulrich Drepper
Jörn Engel wrote:
> The bad news for posix_fallocate() is that even if libc is smart enough
> to write random data, mmap() can still cause problems.

This is not smart, quite to the contrary.  The standard guarantees that
all not-yet-written-to places in the file are zero.  And if a block has
already been written posix_fallocate cannot change it.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Christoph Hellwig
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
> Theodore Tso wrote:
>> Given that glibc already has to support this for older kernels, I
>> would argue that there's no point putting in generic support for
>> filesystems that can't support a more advanced way of doing things.
>
> Well, I'm sure the kernel can do better than the code we have in libc
> now.  The kernel has access to the bitmasks which say which blocks have
> already been allocated.

The layer of the kernel where a totally generic fallback would be
implemented does not have access to this information.  We could do
a mostly generic helper for block filesystems that allows them to implement
fallocate this way without a lot of their own code.


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Ulrich Drepper
Jörn Engel wrote:
>> Of course.  You call posix_fallocate once for the lifetime of the file
>> when it is created to ensure that all future uses will work.
>
> That part is not quite clear from the manpage but I trust most people
> would assume the same.

Not only that, it is what this function is for.  In the POSIX committee
we've looked at the functions in detail before adding them, even if some
information is not in the man page but instead in the Rationale.


> Still, it is quite obvious that no one designing this interface has given
> much thought to compressing filesystems.

You already have problems with supporting the functionality
posix_fallocate is supporting.  You cannot reliably support MAP_SHARED
files if all of a sudden the compression causes an expansion of a block
and that causes an ENOSPC error.  So, don't expect pity.  This is a
function in support of a real and reliable implementation of memory
mapped files.  You don't use MAP_SHARED on such filesystems, it'll eat
your kittens sooner or later anyway.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Theodore Tso
On Mon, Mar 05, 2007 at 07:15:33AM -0800, Ulrich Drepper wrote:
> Well, I'm sure the kernel can do better than the code we have in libc
> now.  The kernel has access to the bitmasks which say which blocks have
> already been allocated.  The libc code does not and we have to be very
> simple-minded and simply touch every block.  And this means reading it
> and then writing it back.  The kernel would know when the reading part
> is not necessary.  Add to that the block granularity (we use f_bsize as
> returned from fstatfs but that's not the best value in some cases) and
> you have compelling data to have generic code in the kernel.  The libc
> implementation can then go away completely which is a good thing.

You have a very good point; indeed since we don't export an interface
which allows userspace to determine whether or not a block is in use,
that does mean a huge amount of churn in the page cache.  So maybe it
would be worth doing in the kernel as a result, although the libc
implementation still wouldn't be able to go away for a long time due to
the need to be backwards compatible with older kernels that didn't
have this support.

Regards,

- Ted


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Ulrich Drepper
Theodore Tso wrote:
> [...] although the libc
> implementation still wouldn't be able to go away for a long time due to
> the need to be backwards compatible with older kernels that didn't
> have this support.

It's better than that.  If somebody compiles glibc to not run on older
kernels at all (tested at runtime) then the code is dropped.  E.g., the
current Fedora glibc does not support 2.6.8 or earlier.

So, don't let the compat code be a factor in the decision making.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖





Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Mingming Cao

Jan Kara wrote:

On Fri, 02 Mar 2007 09:40:54 +1100
Nathan Scott [EMAIL PROTECTED] wrote:




On Thu, 2007-03-01 at 14:25 -0800, Andrew Morton wrote:



On Fri, 2 Mar 2007 00:04:45 +0530
Amit K. Arora [EMAIL PROTECTED] wrote:




This is to give a heads up on a few patches that we will soon be coming up
with. These patches implement a new system call sys_fallocate() and a
new inode operation fallocate, for persistent preallocation. The new
system call, as Andrew suggested, will look like:

asmlinkage long sys_fallocate(int fd, loff_t offset, loff_t len);


...

I'd agree with Eric on the command flag extension.


Seems like a separate syscall would be better, command sounds
a bit ioctl like, especially if that command is passed into the
filesystems..




madvise, fadvise, lseek, etc seem to work OK.

I get repeatedly traumatised by patch rejects whenever a new syscall gets
added, so I'm biased.

The advantage of a command flag is that we can add new modes in the future
without causing lots of churn, waiting for arch maintainers to catch up,
potentially adding new compat code, etc.

Rename it to mode? ;)



I am wondering if it is useful to add another mode to advise block
allocation policy? Something like indicating which physical block/block
group to allocate from (goal), and whether to ask for strictly contiguous
blocks. This will help preallocation or reservation to choose the right
blocks for the file.


  Yes, I also think this would be useful so you can guide
preallocation for things like defragmentation (e.g. preallocate space
for the file being defragmented and move the file to it).

Honza

Yep, I think it makes sense to use preallocation for defragmentation.
After all, both preallocation and defragmentation call the underlying
filesystem's multiple-block allocator to try to allocate a chunk of
contiguous blocks on disk. The ext4 online defrag implementation by Takashi
already supports choosing a goal allocation block to guide the ext4
block allocator to place the defragmented file in a specific location.

Passing a little bit more of a hint to sys_fallocate() (i.e., a goal block,
and/or whether the goal block is more important than the size of the prealloc
extent) might make it more useful both for the original goal (getting
contiguous and guaranteed blocks) and for defragmentation.


Regards,
Mingming


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Eric Sandeen
Jan Kara wrote:

>> I am wondering if it is useful to add another mode to advise block
>> allocation policy? Something like indicating which physical block/block
>> group to allocate from (goal), and whether to ask for strictly contiguous
>> blocks. This will help preallocation or reservation to choose the right
>> blocks for the file.
>   Yes, I also think this would be useful so you can guide
> preallocation for things like defragmentation (e.g. preallocate space
> for the file being defragmented and move the file to it).

Hints and policies for allocation would certainly be useful, but I think
they belong outside this interface.  I.e., you could flag an inode for
whatever allocation you choose, and -then- call posix_fallocate so that
the allocator will take the hints you've given it.

See also this blurb from the posix_fallocate definition:

It is implementation-defined whether a previous posix_fadvise() call
influences allocation strategy.
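
One way to read that blurb is as a two-step sequence: advise first, then allocate. The sketch below assumes a POSIX system; advise_then_reserve() is a hypothetical name, and POSIX_FADV_SEQUENTIAL is picked arbitrarily as the advice value. Whether the advice has any effect on block placement is implementation-defined, as quoted.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: give the allocator a hint, then reserve the range.
 * Both posix_fadvise() and posix_fallocate() return 0 on success or an
 * error number directly, without setting errno. */
static int advise_then_reserve(int fd, off_t offset, off_t len)
{
    assert(fd >= 0);
    int err = posix_fadvise(fd, offset, len, POSIX_FADV_SEQUENTIAL);
    if (err != 0)
        return err;
    return posix_fallocate(fd, offset, len);
}
```

This keeps the allocation call itself free of placement arguments, which is the separation argued for above; a richer hint (such as a goal block) would need an interface that POSIX does not define.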

FWIW I don't see a lot of point in asking for strict contiguous blocks
- the allocator will presumably try to do this in any case, and I'm not
sure when you would want to fail if you get more than one extent...?

-Eric


Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Eric Sandeen
Jörn Engel wrote:
> Does the allocation have to be persistent beyond the lifetime of the file
> descriptor?  It would be fairly simple to support the write guarantee
> while the file is open (or rather the inode remains cached) and drop it
> afterwards.

The posix_fallocate() function shall ensure that any required storage
for regular file data starting at offset and continuing for len bytes is
allocated on the file system storage media.

I interpret "on the storage media" to mean that it is persistent.

-Eric


Why should we teach students Linux??

2007-03-05 Thread Roel Bindels
Hello listers,

I'm a tutor at the Faculty ICT, department NID. This is a bachelor's degree
and we are preparing our students to become something more than just
System Administrators (such as managers, consultants, etc.). Since this
department is part of the Microsoft camp, the students are educated
mostly in this direction, which I think is not a bad thing. A better
thing would be if we could give our students the opportunity to meet
both systems on the same level; at least, that is my opinion.

To change the curriculum of a study, I need a solid case. So if somebody
knows a link/document about why we should educate our students in the
Linux OS, please send it. Or an article about the usage of Linux in companies.

I hope you will all take some time to send me your best links/documents.

with best regards

Roel Bindels



Re: [RFC] Heads up on sys_fallocate()

2007-03-05 Thread Christoph Hellwig
On Mon, Mar 05, 2007 at 12:02:59PM -0800, Mingming Cao wrote:
> Yep, I think it makes sense to use preallocation for defragmentation.
> After all, both preallocation and defragmentation call the underlying
> filesystem's multiple-block allocator to try to allocate a chunk of
> contiguous blocks on disk. The ext4 online defrag implementation by Takashi
> already supports choosing a goal allocation block to guide the ext4
> block allocator to place the defragmented file in a specific location.
>
> Passing a little bit more of a hint to sys_fallocate() (i.e., a goal block,
> and/or whether the goal block is more important than the size of the prealloc
> extent) might make it more useful both for the original goal (getting
> contiguous and guaranteed blocks) and for defragmentation.

fallocate with the whence argument and flags is already quite complicated.
I'd rather have another call for placement decisions, which would
be called on an fd to make placement decisions for any further allocations
(prealloc, write, etc).