Re: posix_fallocate

2013-11-19 Thread Christoph Badura
On Mon, Nov 18, 2013 at 12:31:41PM +1100, matthew green wrote:
 i would buy this argument if mmap()ing a large sparse file
 and filling it up randomly (but with relatively large chunks
 at a time) did not lead to severely fragmented files that
 can take 10x to read, vs one written with plain sequential
 write() calls.

There's another option that should avoid that behaviour: make the file
system place the sparse blocks approximately where it would place them
were they written in sequential order.  One could do the same also
after an lseek() that creates a hole.

Such a change should be relatively straightforward for file systems like
UFS that limit the amount of data that is allocated in a cylinder group on
sequential writes.

--chris


Re: posix_fallocate

2013-11-19 Thread Eduardo Horvath
On Tue, 19 Nov 2013, Christoph Badura wrote:

 On Mon, Nov 18, 2013 at 12:31:41PM +1100, matthew green wrote:
  i would buy this argument if mmap()ing a large sparse file
  and filling it up randomly (but with relatively large chunks
  at a time) did not lead to severely fragmented files that
  can take 10x to read, vs one written with plain sequential
  write() calls.
 
 There's another option that should avoid that behaviour: make the file
 system place the sparse blocks approximately where it would place them
 were they written in sequential order.  One could do the same also
 after an lseek() that creates a hole.
 
 Such a change should be relatively straightforward for file systems like
 UFS that limit the amount of data that is allocated in a cylinder group on
 sequential writes.

Or... LFS doesn't allocate the actual location of disk blocks until it 
starts the write operation.  ISTR FFS allocates disk locations when the 
data blocks are created.  Maybe FFS should do what LFS does.  It's easier 
to make the blocks contiguous if they're all allocated at the same time.

Eduardo


Re: posix_fallocate

2013-11-17 Thread David Holland
On Sun, Nov 17, 2013 at 02:02:15AM +0100, Emmanuel Dreyfus wrote:
  NetBSD-current seems to lack posix_fallocate(2)
  http://pubs.opengroup.org/onlinepubs/009695299/functions/posix_fallocate.html
  
  Is someone already working on it, or has thoughts about how it should be
  implemented?

I have most of the plumbing (see DIOCGDISCARDINFO and DIOCDISCARD)
but not an implementation for ffs or any other fs.

I think the chief question at this level is whether to support the
keep-the-length flag for fallocate, fdiscard, both, or neither. The
linux fallocate uses this to allow allocating blocks past EOF, which
strikes me as nuts; however, for discarding blocks it seems that
there's no inherent problem with having a hole at EOF and that
shrinking the file just because you dropped the block that contains
EOF is kind of silly. (And if the block containing EOF was the only
block, does it ratchet the size all the way back to zero? Ugh.)

I vaguely recall that some time back somebody had a preliminary
implementation of either fallocate or fdiscard or both for ffs, which
was not really good enough to commit. But I forget who and in what
context, so of course now I can't find it.

-- 
David A. Holland
dholl...@netbsd.org


Re: posix_fallocate

2013-11-17 Thread Emmanuel Dreyfus
David Holland dholland-t...@netbsd.org wrote:

 I think the chief question at this level is whether to support the
 keep the length flag for fallocate, fdiscard, both, or neither. The
 linux fallocate uses this to allow allocating blocks past EOF, which
 strikes me as nuts; 

Why is it bad?

I am interested in porting software, and of course it uses
FALLOC_FL_KEEP_SIZE and FALLOC_FL_PUNCH_HOLE...
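
For readers who have not met the two Linux flags, here is a minimal sketch of how ported software typically uses them. It assumes Linux with glibc and _GNU_SOURCE; the function names, the scratch-file template and the sizes are illustrative only and do not come from any particular package.

```c
#define _GNU_SOURCE		/* Linux/glibc: exposes fallocate() and FALLOC_FL_* */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reserve blocks for [offset, offset + len) without changing st_size,
 * even when the range lies past EOF.  Returns 0 or an errno value.
 */
int
reserve_keep_size(int fd, off_t offset, off_t len)
{
	assert(offset >= 0 && len > 0);
	return fallocate(fd, FALLOC_FL_KEEP_SIZE, offset, len) == -1 ? errno : 0;
}

/*
 * Deallocate [offset, offset + len); the range then reads back as zeros.
 * Linux requires FALLOC_FL_KEEP_SIZE to accompany FALLOC_FL_PUNCH_HOLE.
 */
int
punch_hole(int fd, off_t offset, off_t len)
{
	assert(offset >= 0 && len > 0);
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
	    offset, len) == -1 ? errno : 0;
}

/*
 * Usage: write 8192 bytes, reserve space past EOF, punch a hole at the
 * front.  Returns the final st_size, or -1 on an unexpected error.
 * EOPNOTSUPP (filesystem cannot do it) is tolerated.
 */
long long
demo_linux_fallocate(void)
{
	char path[] = "/tmp/falloc-demo-XXXXXX";
	char buf[8192];
	struct stat st;
	long long result = -1;
	int error, fd;

	fd = mkstemp(path);
	if (fd == -1)
		return -1;
	unlink(path);
	memset(buf, 'x', sizeof(buf));
	if (write(fd, buf, sizeof(buf)) == (ssize_t)sizeof(buf)) {
		error = reserve_keep_size(fd, 8192, 65536);
		if (error == 0 || error == EOPNOTSUPP)
			error = punch_hole(fd, 0, 4096);
		if ((error == 0 || error == EOPNOTSUPP) && fstat(fd, &st) == 0)
			result = (long long)st.st_size;
	}
	close(fd);
	return result;
}
```

In both calls st_size stays at 8192, which is exactly the "blocks past EOF" situation under discussion.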

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: posix_fallocate

2013-11-17 Thread Rhialto
On Sun 17 Nov 2013 at 07:59:44 +, David Holland wrote:
 I think the chief question at this level is whether to support the
 keep the length flag for fallocate, fdiscard, both, or neither. The

What keep the length flag? I don't see one at the indicated URL.

The way I read it, the call just fills any holes and/or extends the
length of the file. There are no blocks reserved in some magic way.
That seems to be indicated by the wording "If the offset+len is beyond
the current file size, then posix_fallocate() shall adjust the file size
to offset+len."
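
A minimal sketch of that reading, assuming a system that already provides posix_fallocate() (NetBSD-current at the time of this thread does not). The function name and the mkstemp() template are illustrative only.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Call posix_fallocate() on an empty scratch file and return the
 * resulting st_size, or -1 on failure.
 */
long long
size_after_fallocate(off_t offset, off_t len)
{
	char path[] = "/tmp/pfa-demo-XXXXXX";
	struct stat st;
	long long result = -1;
	int error, fd;

	assert(offset >= 0 && len > 0);
	fd = mkstemp(path);
	if (fd == -1)
		return -1;
	unlink(path);

	/* posix_fallocate() returns an error number; it does not set errno. */
	error = posix_fallocate(fd, offset, len);
	if (error != 0)
		fprintf(stderr, "posix_fallocate: %s\n", strerror(error));
	else if (fstat(fd, &st) == 0)
		result = (long long)st.st_size;

	close(fd);
	return result;
}
```

Under the quoted wording, size_after_fallocate(4096, 8192) should report offset+len, that is 12288 bytes, for a file that started out empty.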

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'





Re: posix_fallocate

2013-11-17 Thread Emmanuel Dreyfus
Robert Elz k...@munnari.oz.au wrote:

 To answer both you and the Mouse - the difference is that a user process
 actually writing data consumes measurable resources, and thus is easy to
 find and kill.   When everything happens in the kernel, spotting which
 arrantly idle user process is making it happen is not at all easy.

We could fork a kernel thread that would go to userspace to do the work
with a write() loop, with appropriate credentials. Does it make sense?

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org


Re: posix_fallocate

2013-11-17 Thread Mouse
 I think the chief question at this level is whether to support the
 keep the length flag for fallocate, fdiscard, both, or neither.
 The linux fallocate uses this to allow allocating blocks past EOF,
 which strikes me as nuts;
 Why is it bad?

Well, I'm not dholland.  But my own answer to that is that it's a
_major_ change to Unix filesystem semantics for it to be possible for
there to be data after EOF (st_size); the nominal file size is no
longer the amount of data conceptually stored, nor a cap on the amount
of data actually stored, nor does it describe the end of that data.
(Unless you are going to try to support the proposition that it is
possible for a file to have blocks allocated with no contents; storage
that does not contain anything is an even more radical departure from
traditional semantics.)

Of course, there is nothing inherently wrong with changing semantics.
But a change this fundamental will affect a _lot_ of the system, far
more than just adding posix_fallocate, and should, IMO, be thought out
a lot more thoroughly.

My own view is that it _is_ nuts, because I can't come up with a
coherent paradigm for how it should work.  What is the Linux position
on the presence of data past st_size?  Is it accessible via read()?
write()?  mmap()?  If there are any of those via which it is not
accessible, does an ftruncate() call that increases st_size change
that?  Is there a way to lower st_size that doesn't free data between
the new st_size and the old st_size?  ...and the old actual last data?
How can one find out the actual distance between the beginning of the
file and the end of data?  Which command-line tools have been extended
to handle such data, and how?  What about data before offset 0, before
what has traditionally been thought of as the beginning of the file?

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: posix_fallocate

2013-11-17 Thread Robert Elz
Date:Sun, 17 Nov 2013 10:33:43 +0100
From:m...@netbsd.org (Emmanuel Dreyfus)
Message-ID:  1lch6me.jn7y3m16232ejm%m...@netbsd.org

  | We could fork a kernel thread that would go to userspace to do the work
  | with a write() loop, with appropriate credentials. Does it make sense?

It would need to be a read/write loop, nothing says that there cannot already
be blocks allocated in the space being fallocated, and their content should
not change.

But yes, if implemented that way it would be much less of a problem.

But if implemented that way, why bother at all?  Why not just put the
code in a user space libc posix_fallocate() function, and be done with
it, it should not require any kernel support at all.
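
A minimal sketch of such a userland read/write loop. The names user_fallocate() and demo_user_fallocate() are hypothetical, and the 512-byte step is an assumed lower bound on the filesystem's allocation unit; a real implementation would ask the filesystem for its block size.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define STEP 512	/* assumed lower bound on the allocation unit */

/*
 * Touch one byte in every STEP-sized block of [offset, offset + len) so
 * that each block gets allocated, writing back whatever was read so
 * existing content is preserved.  Returns 0 or an error number.
 */
int
user_fallocate(int fd, off_t offset, off_t len)
{
	struct stat st;
	off_t end, pos;
	char byte;

	assert(fd >= 0);
	if (offset < 0 || len <= 0)
		return EINVAL;
	end = offset + len;

	for (pos = offset; pos < end; pos = (pos / STEP + 1) * STEP) {
		byte = 0;	/* left at 0 when pos is at or past EOF */
		if (pread(fd, &byte, 1, pos) == -1)
			return errno;
		/* Not atomic: a concurrent write to pos at this point is lost. */
		if (pwrite(fd, &byte, 1, pos) == -1)
			return errno;
	}

	/* Every block has now been written to; grow st_size if needed. */
	if (fstat(fd, &st) == -1)
		return errno;
	if (st.st_size < end && ftruncate(fd, end) == -1)
		return errno;
	return 0;
}

/*
 * Usage: put one byte at offset 100, allocate [0, 10000), check that the
 * byte survived.  Returns the final st_size, or -1 on failure.
 */
long long
demo_user_fallocate(void)
{
	char path[] = "/tmp/ufa-demo-XXXXXX";
	struct stat st;
	long long result = -1;
	char c = 0;
	int fd;

	fd = mkstemp(path);
	if (fd == -1)
		return -1;
	unlink(path);
	if (pwrite(fd, "x", 1, 100) == 1 &&
	    user_fallocate(fd, 0, 10000) == 0 &&
	    pread(fd, &c, 1, 100) == 1 && c == 'x' &&
	    fstat(fd, &st) == 0)
		result = (long long)st.st_size;
	close(fd);
	return result;
}
```

The loop cannot tell which blocks are already allocated, so it reads and rewrites every one of them.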

That's not true of the inverse function that David Holland referred to
(though like Rhialto, I can't see what relationship that has with the
posix_fallocate() call that we were asked about) for making holes in files.
That one is not a problem, and needs to be in the kernel to be implemented
(as the physical structure of a file is deliberately not exposed to userland.)
Implementing that (assuming there's some standard interface definition for
it) might be sensible, I still see no use at all for a (kernel) 
posix_fallocate().

kre

ps: another reason that a userland process is less of a problem than the
kernel interface described in the opengroup posix_fallocate() spec, is
that a user process must either do multiple sys calls (and is subject to
being signalled, and hence terminated, between sys calls) or malloc (or
brk(2)) enough space for a buffer as big as the write call - that is
typically going to limit a single sys call to no more than a few tens of GBs
(on today's systems) as that's generally as big as a process can grow.

On the other hand, posix_fallocate() could allocate petabytes in a single
invocation of the sys call, assuming that the filesystem had that much
space available.   I haven't looked recently, but last time I did,
preemptible sys calls still didn't mean that userland signals would be
delivered in the middle of the operation of a single sys call, nor does
anything suggest that signals are supposed to interrupt the operation of
posix_fallocate() half way through - which suggests to me, that as designed,
it should continue until it is finished once invoked, whatever anyone tries
to do to the process that invoked it.



Re: posix_fallocate

2013-11-17 Thread Christos Zoulas
On Nov 17,  1:15pm, k...@munnari.oz.au (Robert Elz) wrote:
-- Subject: Re: posix_fallocate

| ps: I have not examined the FreeBSD implementation - if they've done it the
| hard, safe, way, and worked out all the potential kinks, and if it doesn't
| depend too much upon other aspects of their I/O system implementation (like
| whatever they have to make softdeps work) then perhaps copying that might be
| feasible -- if the demand for this really exists, and it isn't being requested
| just because it is in the spec and NetBSD is lacking it.

From the cursory look at it, they just write.

christos


Re: posix_fallocate

2013-11-17 Thread David Holland
On Sun, Nov 17, 2013 at 10:24:04AM +0100, Rhialto wrote:
   I think the chief question at this level is whether to support the
   keep the length flag for fallocate, fdiscard, both, or neither. The
  
  What keep the length flag? I don't see one at the indicated URL.

It's a linuxism in linux's native fallocate().

-- 
David A. Holland
dholl...@netbsd.org


Re: posix_fallocate

2013-11-17 Thread David Holland
On Sun, Nov 17, 2013 at 10:24:04AM +0100, Emmanuel Dreyfus wrote:
   I think the chief question at this level is whether to support the
   keep the length flag for fallocate, fdiscard, both, or neither. The
   linux fallocate uses this to allow allocating blocks past EOF, which
   strikes me as nuts; 
  
  Why is it bad?

What Mouse said, pretty much. Not only is it semantic nonsense, it
requires reworking fsck.

  I am interested in porting software, and of course it uses
  FALLOC_FL_KEEP_SIZE and FALLOC_FL_PUNCH_HOLE...

FALLOC_FL_PUNCH_HOLE is fdiscard(). I don't believe in deleting things
by making an allocate call, or perpetrating other people's bad design
to avoid patching a couple packages.

The question is whether FALLOC_FL_KEEP_SIZE makes sense; it does for
fdiscard, but I remain unconvinced in the case of fallocate.

-- 
David A. Holland
dholl...@netbsd.org


Re: posix_fallocate

2013-11-17 Thread David Holland
On Sun, Nov 17, 2013 at 10:33:43AM +0100, Emmanuel Dreyfus wrote:
   To answer both you and the Mouse - the difference is that a user process
   actually writing data consumes measurable resources, and thus is easy to
   find and kill.   When everything happens in the kernel, spotting which
   arrantly idle user process is making it happen is not at all easy.
  
  We could fork a kernel thread that would go to userspace to do the work
  with a write() loop, with appropriate credentials. Does it make sense?

I do not think that makes the slightest sense. Also if you really want
FALLOC_FL_KEEP_SIZE, it won't be adequate.

-- 
David A. Holland
dholl...@netbsd.org


Re: posix_fallocate

2013-11-17 Thread Mouse
 [posix_fallocate]
 We could fork a kernel thread that would go to userspace to do the
 work with a write() loop, with appropriate credentials.  Does it
 make sense?
 It would need to be a read/write loop, nothing says that there cannot
 already be blocks allocated in the space being fallocated, and their
 content should not change.

That's a reason to put it in the kernel, actually.  The kernel can tell
which ranges of a file have already had space allocated for them, so it
can be just writes.  (Well, for local filesystems.  Throw NFS or its
ilk into the mix and it gets more interesting.)

 But if implemented that way, why bother at all?  Why not just put the
 code in a user space libc posix_fallocate() function, and be done
 with it, it should not require any kernel support at all.

Well, I don't know what the point of having posix_fallocate at all
would be.  But the obvious answer to this is atomicity, especially in
the presence of other writers: the kernel is capable of making sure it
doesn't destroy someone else's write by mistake, which userland isn't
(in the userland implementation, there's a window between read and
write when someone else can write only to get overwritten).  If that
matters for the target application, that's a reason to prefer an
in-kernel implementation.

 On the other hand, posix_fallocate() could allocate petabytes in a
 single invocation of the sys call, assuming that the filesystem had
 that much space available.   I haven't looked recently, but last time
 I did, preemptible sys calls still didn't mean that userland signals
 would be delivered in the middle of the operation of a single sys
 call, [...]

No, it doesn't mean that, strictly.  But any syscall that feels like it
can be signalable during any sleep involved in its operation; this was
true even before multiprocessor support.  There will be a loop involved
_somewhere_ in any posix_fallocate() implementation, and I can't
imagine that an implementation wouldn't sleep somewhere waiting for the
underlying filesystem operations.  Those sleeps can be made
interruptible by signals; at most it will complicate the exit path.

 nor does anything suggest that signals are supposed to interrupt the
 operation of posix_fallocate() half way through - which suggests to
 me, that as designed, it should continue until it is finished once
 invoked, whatever anyone tries to do to the process that invoked it.

Actually, I see nothing in the description that prevents that.  Given
what it does, a half-completed posix_fallocate is indistinguishable, to
userland, from a never-started posix_fallocate, provided the former
hasn't got as far as affecting st_size.  It would be interesting to
deal with a posix_fallocate that raises st_size being interrupted after
it's written some but not all of the new space, especially in the
presence of another writer writing into the same area of the file, but
I feel certain those problems are solvable, even if it means pushing a
small fraction of the implementation down into the filesystem - and
maybe not even that; the existence of kqueue's EVFILT_VNODE NOTE_WRITE
means that most of the necessary machinery is already in place.

This is not to say that I support the idea of adding posix_fallocate;
I'm not sure what I think on that question.  But some of the arguments
kre has presented here do not, IMO, hold water.

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


Re: posix_fallocate

2013-11-17 Thread Robert Elz
Date:Sun, 17 Nov 2013 14:12:16 -0500 (EST)
From:Mouse mo...@rodents-montreal.org
Message-ID:  201311171912.oaa17...@chip.rodents-montreal.org

  | That's a reason to put it in the kernel, actually.  The kernel can tell
  | which ranges of a file have already had space allocated for them, so it
  | can be just writes.

Sure, but that's just an optimisation, and surely only matters if this
is done enough that the difference is actually of significance.  Do you
really believe it is, or ever will be?

  | Well, I don't know what the point of having posix_fallocate at all
  | would be.

Agree there, it seems mostly useless.

  | But the obvious answer to this is atomicity, especially in
  | the presence of other writers:

Really?  You're imagining multiple writers, writing to the same file,
without any co-ordination (like locking, or whatever other way works) and
you're actually worried about unpredictable results???   Really?

But beyond that, since you plan on allowing signals to interrupt the
operation part way through, what atomicity are you really getting anyway?

  | No, it doesn't mean that, strictly.  But any syscall that feels like it
  | can be signalable during any sleep involved in its operation;

Sure, though the internal implementation of the sys call has to explicitly
make that happen - and to be useful, the sys call interface really has to
be designed to support it.   This one isn't (and as specified, cannot be.)

That is, if you imagine the signal in question being SIGKILL (as in the
scenario I postulated initially) there's no problem - if the code for the
sys call wants to allow that to work before it completes, it easily can.

But instead imagine it is SIGALRM - now there's a problem, since there's
no way in the interface to report how much of the work was done, the only
thing that can be done if EINTR happens, is repeat the sys call.  For
a periodic signal like SIGALRM (or any of the other timer signals) chances
are that a signal will arrive during the sys call every time, resulting
in an infinite loop of fallocate/signal/fallocate/signal...
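
A sketch of the caller-side loop being described, with a retry cap added so that a periodic signal cannot keep it spinning forever. The cap and the function names are assumptions made for illustration; nothing in the interface suggests a particular limit.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Call posix_fallocate(), repeating the whole call when it reports
 * EINTR, but at most max_tries times.  Returns 0 or an error number
 * (posix_fallocate() returns the error; it does not set errno).
 */
int
fallocate_retry(int fd, off_t offset, off_t len, int max_tries)
{
	int error = EINTR;

	assert(max_tries > 0);
	while (max_tries-- > 0 && error == EINTR)
		error = posix_fallocate(fd, offset, len);
	return error;
}

/* Usage on a scratch file; returns the fallocate_retry() result. */
int
demo_fallocate_retry(void)
{
	char path[] = "/tmp/retry-demo-XXXXXX";
	int error, fd;

	fd = mkstemp(path);
	if (fd == -1)
		return errno;
	unlink(path);
	error = fallocate_retry(fd, 0, 1 << 20, 5);
	close(fd);
	return error;
}
```

Each retry restarts from the beginning of the range, because the call has no way to report how far the previous attempt got.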

This thing is just poorly designed.  Let's just ignore it (by all means,
implement the hole making part from the linux interface if desired, but
the allocation side just isn't needed).

  | Given what it does, a half-completed posix_fallocate is indistinguishable,
  | to userland, from a never-started posix_fallocate, provided the former
  | hasn't got as far as affecting st_size.

Of course, I'm only really interested in cases where the size can't help
being affected, as it starts at 0 - cases where actual holes in the middle
of a file are being filled in I see as so unlikely in practice that they're
totally irrelevant (like unless the app has just made the file by seeking
forward and writing a byte, how does it ever know whether or not there are
holes to fill?   And why would it do it that way, and follow by fallocate() ?
If fallocate() exists, surely it would just use that to make the file?)

But as above, that it is indistinguishable is the problem.

kre

ps: Note that I see that the linux way of handling fallocate isn't to
write blocks of 0's, but to allocate uninitialised blocks, and mark them
uninitialised - I assume the way that works, is that if an app reads one
of those blocks, it is just given 0's - and if it writes (the expected
operation) whatever is there (0's or junk) just gets overwritten.  That
way they make fallocate() really fast (just assigns block numbers to the file)
but it requires that the un-init flag (wherever, and however, they keep that)
is 100% reliable.   Nothing is really that reliable...   FFS doesn't have
any mechanism to do that, so actually writing 0's would be the only way,
and given that, fallocate() looks to be a total waste of time - again,
given that it is an optional sys call, that no-one is required to implement,
and so which no-one can assume actually exists.



Re: posix_fallocate

2013-11-17 Thread Rhialto
On Mon 18 Nov 2013 at 05:11:35 +0700, Robert Elz wrote:
 how does it ever know whether or not there are
 holes to fill?   And why would it do it that way, and follow by fallocate() ?

Wasn't there some (proposed? actual?) interface to find holes in files?

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl-- 'this bath is too hot.'





Re: posix_fallocate

2013-11-17 Thread Robert Elz
Date:Sun, 17 Nov 2013 23:49:50 +0100
From:Rhialto rhia...@falu.nl
Message-ID:  20131117224950.gh23...@falu.nl

  | Wasn't there some (proposed? actual?) interface to find holes in files?

I was told off-list about SEEK_HOLE and SEEK_DATA that allow (if the
filesys supports them) apps to jump to the next hole, then end of hole,
and that way work out where the holes are.
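
A minimal sketch of that hole-hopping loop, assuming a system whose lseek() supports SEEK_DATA and SEEK_HOLE (glibc exposes them under _GNU_SOURCE). The function names are illustrative. On a filesystem without hole tracking the whole file is usually reported as a single data extent.

```c
#define _GNU_SOURCE		/* glibc: exposes SEEK_DATA and SEEK_HOLE */
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Print each data extent of fd and return the total number of bytes the
 * filesystem reports as data, or -1 on error.
 */
long long
count_data_bytes(int fd)
{
	off_t data, end, hole = 0;
	long long total = 0;

	assert(fd >= 0);
	end = lseek(fd, 0, SEEK_END);
	if (end == -1)
		return -1;
	while (hole < end) {
		data = lseek(fd, hole, SEEK_DATA);	/* next data at or after hole */
		if (data == -1) {
			if (errno == ENXIO)
				break;	/* nothing but a hole up to EOF */
			return -1;
		}
		hole = lseek(fd, data, SEEK_HOLE);	/* end of that data extent */
		if (hole == -1)
			return -1;
		printf("data: [%lld, %lld)\n", (long long)data, (long long)hole);
		total += hole - data;
	}
	return total;
}

/* Usage: a scratch file holding one byte at offset 1 MiB. */
long long
demo_count_data_bytes(void)
{
	char path[] = "/tmp/hole-demo-XXXXXX";
	long long n = -1;
	int fd;

	fd = mkstemp(path);
	if (fd == -1)
		return -1;
	unlink(path);
	if (pwrite(fd, "x", 1, 1 << 20) == 1)
		n = count_data_bytes(fd);
	close(fd);
	return n;
}
```

The reported extents are rounded to whatever granularity the filesystem tracks, so the data count is normally larger than the single byte written.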

According to the linux man page, they exist there, and in FreeBSD, DragonFly and
Solaris - and might come to posix.

Personally, I think the underlying structure of files should not be made
visible to apps at all - they should just see a byte stream (perhaps with
an advisory useful block size to write in).

kre



Re: posix_fallocate

2013-11-17 Thread Mouse
 That's a reason to put it in the kernel, actually.  The kernel can
 tell which ranges of a file have already had space allocated for
 them, so it can be just writes.
 Sure, but that's just an optimisation, and surely only matters if
 this is done enough that the difference is actually of significance.
 Do you really believe it is, or ever will be?

Not just an optimization; it also affects correctness - see below.

 But the obvious answer to this is atomicity, especially in the
 presence of other writers:
 Really?  You're imagining multiple writers, writing to the same file,
 without any co-ordination (like locking, or whatever other way works)
 and you're actually worried about unpredictable results???  Really?

Actually, thinking about it more, atomicity is the wrong word.

The correct thing to worry about here is that, as I read the pointed-to
webpage, posix_fallocate() is defined to do nothing to already-present
data.  But, when racing with another writer, a non-kernel
implementation will always have conditions under which it can destroy
data written by some other process.

 No, it doesn't mean that, strictly.  But any syscall that feels like
 it can be signalable during any sleep involved in its operation;
 Sure, though the internal implementation of the sys call has to
 explicitly make that happen

Well, sure.

 - and to be useful, the sys call interface really has to be designed
 to support it.   This one isn't (and as specified, cannot be.)

Huh?  I don't see it that way.

 That is, if you imagine the [signal] in question being SIGKILL (as in
 the scenario I postulated initially) there's no problem - if the code
 for the sys call wants to allow that to work before it completes, it
 easily can.

 But instead imagine it is SIGALRM - now there's a problem, since
 there's no way in the interface to report how much of the work was
 done, the only thing that can be done if EINTR happens, is repeat the
 sys call.

Right.  That's a reason to put it in the kernel, so it doesn't need to
redo past work.  (This is not a complete fix, because it applies only
to calls which don't increase st_size.)

I note that the documentation webpage lists EINTR.

 For a periodic [signal] like SIGALRM (or any of the other timer
 signals) chances are that a signal will arrive during the sys call
 every time, resulting in an infinite loop of
 fallocate/signal/fallocate/signal...

Sure.  There are lots of ways programmers can write code which ends up
in livelock.  I don't see how this one deserves any more special
treatment than the others.

 [...] - cases where actual holes in the middle of a file are being
 filled in I see as so unlikely in practice that they're totally
 irrelevant (like unless the app has just made the file by seeking
 forward and writing a byte, how does it ever know whether or not
 there are holes to fill?  And why would it do it that way, and follow
 by fallocate()?  If fallocate() exists, surely it would just use that
 to make the file?)

I see it as being intended for programs doing things like databases:
they may want to allocate disk space when they know they'll want it but
it's still easy to back out of the operation if it fails.  Once the
space is allocated, then they can carry on knowing they won't run into
a full disk partway through, later, when it's much harder to deal with.

In this paradigm, the application doesn't know whether there used to be
a hole there and doesn't care; the important thing is that, after the
call, there isn't.

I'm not sure why it's better than (read-and-)write for such
applications, but that's the use case it feels designed for to me.

(Also, seeking and writing is not the only way to create a large file;
truncate/ftruncate can extend files on at least some systems.)
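
A small sketch of that ftruncate() route, assuming a system where ftruncate() is allowed to grow a file. The function name and scratch-file template are illustrative. It prints st_blocks next to st_size so the nominal size can be compared with the space actually allocated.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Extend an empty scratch file to `size` bytes with ftruncate() and
 * return the resulting st_size, or -1 on failure.
 */
long long
extend_with_ftruncate(off_t size)
{
	char path[] = "/tmp/sparse-demo-XXXXXX";
	struct stat st;
	int fd;

	assert(size >= 0);
	fd = mkstemp(path);
	if (fd == -1)
		return -1;
	unlink(path);

	if (ftruncate(fd, size) == -1 || fstat(fd, &st) == -1) {
		close(fd);
		return -1;
	}
	close(fd);

	/* st_blocks is the allocated space, in 512-byte units on most systems. */
	printf("st_size=%lld st_blocks=%lld\n",
	    (long long)st.st_size, (long long)st.st_blocks);
	return (long long)st.st_size;
}
```

On filesystems that support holes, st_blocks typically stays far below st_size/512 after such a call.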

 ps: Note that I see that the linux way of handling fallocate isn't to
 write blocks of 0's, but to allocate uninitialised blocks, and mark
 them uninitialised [...]  That way they make fallocate() really fast
 (just assigns block numbers to the file) but it requires that the
 un-init flag (wherever, and however, they keep that) is 100%
 reliable.  Nothing is really that reliable...

Not 100% reliable, but at least as reliable as the rest of the
filesystem.  FFS assumes that di_db[] in inodes, and the block
allocation bitmaps, won't change behind its back, too; I don't see why
this would be any different, really.

I'm not sure whether I'd be willing to pay one more bit per frag in
order to (greatly) speed up reads of allocated but unwritten blocks; my
own guess would be that such things are rare enough that optimizing
them doesn't really matter - though, of course, I don't often find uses
for things I don't have.  Perhaps Linux has found a real use for such
things.  The major use I can think of for them are things like NFS
swapfiles, where you want to allocate the whole file but have no need
to write it.  My own livebackup is in a similar situation; it could
benefit from an allocated but unwritten state 

re: posix_fallocate

2013-11-17 Thread matthew green

 Personally, I think the underlying structure of files should not be made
 visible to apps at all - they should just see a byte stream (perhaps with
 an advisory useful block size to write in).

i would buy this argument if mmap()ing a large sparse file
and filling it up randomly (but with relatively large chunks
at a time) did not lead to severely fragmented files that
can take 10x to read, vs one written with plain sequential
write() calls.  because of this, some workaround is
necessary.  it is very disappointing to see an average of
120 iops of 64KB each (and only because i formatted my
FS with 64KB blocks/frags!) when sequentially reading a
file created by mmap().

posix_fallocate() is answering a real problem.  the
workaround today is to write the file, which doubles the IO
traffic, and i am not sure we can do better with FFS, due
to the issues you've mentioned, but there are many other
filesystems in existence that do allow block allocation
without exposing prior data or initialisation.

given the current issues, i'd be happy with a userspace
implementation.


.mrg.


Re: posix_fallocate

2013-11-16 Thread Christos Zoulas
In article 1lcgiu4.18zr2h51aac07zm%m...@netbsd.org,
Emmanuel Dreyfus m...@netbsd.org wrote:
Hi

NetBSD-current seems to lack posix_fallocate(2)
http://pubs.opengroup.org/onlinepubs/009695299/functions/posix_fallocate.html

Is someone already working on it, or has thoughts about how it should be
implemented?

FreeBSD has it as a system call. It should be easy to dup.

christos



Re: posix_fallocate

2013-11-16 Thread Robert Elz
Date:Sun, 17 Nov 2013 03:18:56 + (UTC)
From:chris...@astron.com (Christos Zoulas)
Message-ID:  l69cj0$f0v$1...@ger.gmane.org

  | In article 1lcgiu4.18zr2h51aac07zm%m...@netbsd.org,
  | Emmanuel Dreyfus m...@netbsd.org wrote:
  | NetBSD-current seems to lack posix_fallocate(2)
 
  | FreeBSD has it as a system call. It should be easy to dup.

I would suggest avoiding it.   While the objective for it looks clear,
and perhaps even useful, to me it doesn't seem to be implementable safely.

To me there appears to be just two ways to implement this - the safest would
be a complex reservation scheme, which would account for blocks reserved to
a file as if they were actually allocated, and so reducing the available
space for other allocations on the filesystem.   To me that looks to be
an accounting nightmare to actually implement correctly in all cases (there
are so many weird situations that would need solutions.)

Alternatively, the system could actually allocate all required blocks at
the time of the posix_fallocate() call - effectively filling in any holes
in the designated region of the file.   The spec doesn't say what data is
to be put in the blocks allocated to fill the holes (a well behaved
application wouldn't care, as it would normally write to the file before
reading it, and would be using fallocate to guarantee that the entire set
of write sys calls it needed to make would succeed (or the fallocate()
would fail), and the system could not run out of space half way through.)

There would seem to be just two viable choices - fill the blocks with 0's,
or leave random data there.

The latter isn't really a choice, it is a security hole a mile wide, so
fill with 0's would be the only real option.  The problem is that this opens
a trivial DoS attack like 

    for (;;) {
        ftruncate(fd, (off_t)0);
        posix_fallocate(fd, (off_t)0, huge);
    }

where the (off_t) huge is however big the application can get away with
without failing.

For a sys call that is merely advisory to implement (not required)
this all seems like a poor idea to me.

Any application that really needs the function can duplicate it in user
space (just a loop of read/write sys calls over the range required) which
then costs user space resources, rather than kernel (or at least, not just
kernel).

kre

ps: I have not examined the FreeBSD implementation - if they've done it the
hard, safe, way, and worked out all the potential kinks, and if it doesn't
depend too much upon other aspects of their I/O system implementation (like
whatever they have to make softdeps work) then perhaps copying that might be
feasible -- if the demand for this really exists, and it isn't being requested
just because it is in the spec and NetBSD is lacking it.



Re: posix_fallocate

2013-11-16 Thread Mouse
 [...] this opens a trivial DoS attack like 

   for (;;) {
       ftruncate(fd, (off_t)0);
       posix_fallocate(fd, (off_t)0, huge);
   }

How, exactly, is this any more of a DoS than doing the same thing but
using one or more write() calls instead of the posix_fallocate()?

/~\ The ASCII Mouse
\ / Ribbon Campaign
 X  Against HTMLmo...@rodents-montreal.org
/ \ Email!   7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B