Re: posix_fallocate
On Mon, Nov 18, 2013 at 12:31:41PM +1100, matthew green wrote:
> i would buy this argument if mmap()ing a large sparse file and filling it up randomly (but with relatively large chunks at a time) did not lead to severely fragmented files that can take 10x to read, vs one written with plain sequential write() calls.

There's another option that should avoid that behaviour: make the file system place the sparse blocks approximately where it would place them were they written in sequential order. One could do the same also after an lseek() that creates a hole.

Such a change should be relatively straightforward for file systems like UFS that limit the amount of data that is allocated in a cylinder group on sequential writes.

--chris
Re: posix_fallocate
On Tue, 19 Nov 2013, Christoph Badura wrote:
> On Mon, Nov 18, 2013 at 12:31:41PM +1100, matthew green wrote:
> > i would buy this argument if mmap()ing a large sparse file and filling it up randomly (but with relatively large chunks at a time) did not lead to severely fragmented files that can take 10x to read, vs one written with plain sequential write() calls.
>
> There's another option that should avoid that behaviour: make the file system place the sparse blocks approximately where it would place them were they written in sequential order. One could do the same also after an lseek() that creates a hole. Such a change should be relatively straightforward for file systems like UFS that limit the amount of data that is allocated in a cylinder group on sequential writes.

Or... LFS doesn't allocate the actual location of disk blocks until it starts the write operation. ISTR FFS allocates disk locations when the data blocks are created. Maybe FFS should do what LFS does. It's easier to make the blocks contiguous if they're all allocated at the same time.

Eduardo
Re: posix_fallocate
On Sun, Nov 17, 2013 at 02:02:15AM +0100, Emmanuel Dreyfus wrote:
> NetBSD-current seems to lack posix_fallocate(2)
> http://pubs.opengroup.org/onlinepubs/009695299/functions/posix_fallocate.html
>
> Is someone already working on it, or has thoughts about how it should be implemented?

I have most of the plumbing (see DIOCGDISCARDINFO and DIOCDISCARD) but not an implementation for ffs or any other fs.

I think the chief question at this level is whether to support the keep the length flag for fallocate, fdiscard, both, or neither. The linux fallocate uses this to allow allocating blocks past EOF, which strikes me as nuts; however, for discarding blocks it seems that there's no inherent problem with having a hole at EOF and that shrinking the file just because you dropped the block that contains EOF is kind of silly. (And if the block containing EOF was the only block, does it ratchet the size all the way back to zero? Ugh.)

I vaguely recall that some time back somebody had a preliminary implementation of either fallocate or fdiscard or both for ffs, which was not really good enough to commit. But I forget who and in what context, so of course now I can't find it.

-- 
David A. Holland
dholl...@netbsd.org
Re: posix_fallocate
David Holland dholland-t...@netbsd.org wrote:
> I think the chief question at this level is whether to support the keep the length flag for fallocate, fdiscard, both, or neither. The linux fallocate uses this to allow allocating blocks past EOF, which strikes me as nuts;

Why is it bad?

I am interested to port software, and of course it uses FALLOC_FL_KEEP_SIZE and FALLOC_FL_PUNCH_HOLE...

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: posix_fallocate
On Sun 17 Nov 2013 at 07:59:44 +, David Holland wrote:
> I think the chief question at this level is whether to support the keep the length flag for fallocate, fdiscard, both, or neither. The

What keep the length flag? I don't see one at the indicated URL.

The way I read it, the call just fills any holes and/or extends the length of the file. There are no blocks reserved in some magic way. That seems to be indicated by the wording

    If the offset+len is beyond the current file size, then posix_fallocate() shall adjust the file size to offset+len.

> David A. Holland

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl    -- 'this bath is too hot.'
Re: posix_fallocate
Robert Elz k...@munnari.oz.au wrote:
> To answer both you and the Mouse - the difference is that a user process actually writing data consumes measurable resources, and thus is easy to find and kill. When everything happens in the kernel, spotting which arrantly idle user process is making it happen is not at all easy.

We could fork a kernel thread that would go to userspace to do the work with a write() loop, with appropriate credentials. Does it make sense?

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
m...@netbsd.org
Re: posix_fallocate
> > I think the chief question at this level is whether to support the keep the length flag for fallocate, fdiscard, both, or neither. The linux fallocate uses this to allow allocating blocks past EOF, which strikes me as nuts;
>
> Why is it bad?

Well, I'm not dholland. But my own answer to that is that it's a _major_ change to Unix filesystem semantics for it to be possible for there to be data after EOF (st_size); the nominal file size is no longer the amount of data conceptually stored, nor a cap on the amount of data actually stored, nor does it describe the end of that data. (Unless you are going to try to support the proposition that it is possible for a file to have blocks allocated with no contents; storage that does not contain anything is an even more radical departure from traditional semantics.)

Of course, there is nothing inherently wrong with changing semantics. But a change this fundamental will affect a _lot_ of the system, far more than just adding posix_fallocate, and should, IMO, be thought out a lot more thoroughly.

My own view is that it _is_ nuts, because I can't come up with a coherent paradigm for how it should work. What is the Linux position on the presence of data past st_size? Is it accessible via read()? write()? mmap()? If there are any of those via which it is not accessible, does an ftruncate() call that increases st_size change that? Is there a way to lower st_size that doesn't free data between the new st_size and the old st_size? ...and the old actual last data? How can one find out the actual distance between the beginning of the file and the end of data? Which command-line tools have been extended to handle such data, and how? What about data before offset 0, before what has traditionally been thought of as the beginning of the file?

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: posix_fallocate
    Date:        Sun, 17 Nov 2013 10:33:43 +0100
    From:        m...@netbsd.org (Emmanuel Dreyfus)
    Message-ID:  1lch6me.jn7y3m16232ejm%m...@netbsd.org

  | We could fork a kernel thread that would go to userspace to do the work
  | with a write() loop, with appropriate credentials. Does it make sense?

It would need to be a read/write loop, nothing says that there cannot already be blocks allocated in the space being fallocated, and their content should not change. But yes, if implemented that way it would be much less of a problem.

But if implemented that way, why bother at all? Why not just put the code in a user space libc posix_fallocate() function, and be done with it, it should not require any kernel support at all.

That's not true of the inverse function that David Holland referred to (though like Rhialto, I can't see what relationship that has with the posix_fallocate() call that we were asked about) for making holes in files. That one is not a problem, and needs to be in the kernel to be implemented (as the physical structure of a file is deliberately not exposed to userland.) Implementing that (assuming there's some standard interface definition for it) might be sensible, I still see no use at all for a (kernel) posix_fallocate().

kre

ps: another reason that a userland process is less of a problem than the kernel interface described in the opengroup posix_fallocate() spec, is that a user process must either do multiple sys calls (and is subject to being signalled, and hence terminated, between sys calls) or malloc (or brk(2)) enough space for a buffer as big as the write call - that is typically going to limit a single sys call to no more than a few tens of GBs (on today's systems) as that's generally as big as a process can grow. On the other hand, posix_fallocate() could allocate petabytes in a single invocation of the sys call, assuming that the filesystem had that much space available.

I haven't looked recently, but last time I did, preemptible sys calls still didn't mean that userland signals would be delivered in the middle of the operation of a single sys call, nor does anything suggest that signals are supposed to interrupt the operation of posix_fallocate() half way through - which suggests to me, that as designed, it should continue until it is finished once invoked, whatever anyone tries to do to the process that invoked it.
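For reference, the userland read/write loop described above could look roughly like the following. This is only a sketch under stated assumptions: user_fallocate() and touch_byte() are hypothetical names, not code from NetBSD or any libc; st_blksize is assumed to be a usable allocation granularity; and nothing here closes the window between the read and the write in which another writer can slip in, which is raised elsewhere in this thread.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700	/* for pread()/pwrite() under strict compilers */
#endif
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Read one byte at pos (0 if pos is at or past EOF) and write it back. */
static int
touch_byte(int fd, off_t pos)
{
	char c = 0;

	if (pread(fd, &c, 1, pos) == -1)
		return errno;
	if (pwrite(fd, &c, 1, pos) == -1)
		return errno;
	return 0;
}

/*
 * Hypothetical userland stand-in for posix_fallocate(): rewrites one
 * byte in every block that overlaps [offset, offset + len), so holes
 * get backed by storage while existing contents stay the same.
 * Returns 0 or an errno value, as posix_fallocate() does.
 * Not safe against concurrent writers to the same range.
 */
int
user_fallocate(int fd, off_t offset, off_t len)
{
	struct stat st;
	off_t blk, end, pos;
	int error;

	if (offset < 0 || len <= 0)
		return EINVAL;
	if (fstat(fd, &st) == -1)
		return errno;
	blk = st.st_blksize > 0 ? (off_t)st.st_blksize : 512;
	end = offset + len;

	for (pos = offset; pos < end; pos = (pos / blk + 1) * blk) {
		error = touch_byte(fd, pos);
		if (error != 0)
			return error;
	}
	/* Touch the last byte too, so st_size reaches offset + len. */
	return touch_byte(fd, end - 1);
}
```

Because every step is a separate pread()/pwrite() call, a signal can terminate the process between any two of them, which is the property argued for above.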
Re: posix_fallocate
On Nov 17, 1:15pm, k...@munnari.oz.au (Robert Elz) wrote:
-- Subject: Re: posix_fallocate

| ps: I have not examined the FreeBSD implementation - if they've done it the
| hard, safe, way, and worked out all the potential kinks, and if it doesn't
| depend too much upon other aspects of their I/O system implementation (like
| whatever they have to make softdeps work) then perhaps copying that might be
| feasible -- if the demand for this really exists, and it isn't being requested
| just because it is in the spec and NetBSD is lacking it.

From a cursory look at it, they just write.

christos
Re: posix_fallocate
On Sun, Nov 17, 2013 at 10:24:04AM +0100, Rhialto wrote:
> > I think the chief question at this level is whether to support the keep the length flag for fallocate, fdiscard, both, or neither. The
>
> What keep the length flag? I don't see one at the indicated URL.

It's a linuxism in linux's native fallocate().

-- 
David A. Holland
dholl...@netbsd.org
Re: posix_fallocate
On Sun, Nov 17, 2013 at 10:24:04AM +0100, Emmanuel Dreyfus wrote:
> > I think the chief question at this level is whether to support the keep the length flag for fallocate, fdiscard, both, or neither. The linux fallocate uses this to allow allocating blocks past EOF, which strikes me as nuts;
>
> Why is it bad?

What Mouse said, pretty much. Not only is it semantic nonsense, it requires reworking fsck.

> I am interested to port software, and of course it uses FALLOC_FL_KEEP_SIZE and FALLOC_FL_PUNCH_HOLE...

FALLOC_FL_PUNCH_HOLE is fdiscard(). I don't believe in deleting things by making an allocate call, or perpetrating other people's bad design to avoid patching a couple packages.

The question is whether FALLOC_FL_KEEP_SIZE makes sense; it does for fdiscard, but I remain unconvinced in the case of fallocate.

-- 
David A. Holland
dholl...@netbsd.org
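For readers unfamiliar with the Linux flags being discussed, the call that ported software typically makes looks like the sketch below. punch_hole() is a hypothetical wrapper name; the code assumes Linux with glibc (or musl) and falls back to returning EOPNOTSUPP elsewhere. On Linux, FALLOC_FL_PUNCH_HOLE has to be combined with FALLOC_FL_KEEP_SIZE, so this particular use never changes st_size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* fallocate() and FALLOC_FL_* are only visible with this */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical wrapper: deallocate [offset, offset + len) of fd without
 * changing the file size. Returns 0 or an errno value. On systems
 * without the Linux interface it reports EOPNOTSUPP; a NetBSD port
 * would use whatever discard interface the system ends up providing.
 */
int
punch_hole(int fd, off_t offset, off_t len)
{
#if defined(__linux__) && defined(FALLOC_FL_PUNCH_HOLE) && \
    defined(FALLOC_FL_KEEP_SIZE)
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
	    offset, len) == -1)
		return errno;
	return 0;
#else
	(void)fd;
	(void)offset;
	(void)len;
	return EOPNOTSUPP;
#endif
}
```

A caller has to be prepared for EOPNOTSUPP even on Linux, since not every file system implements hole punching.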
Re: posix_fallocate
On Sun, Nov 17, 2013 at 10:33:43AM +0100, Emmanuel Dreyfus wrote:
> > To answer both you and the Mouse - the difference is that a user process actually writing data consumes measurable resources, and thus is easy to find and kill. When everything happens in the kernel, spotting which arrantly idle user process is making it happen is not at all easy.
>
> We could fork a kernel thread that would go to userspace to do the work with a write() loop, with appropriate credentials. Does it make sense?

I do not think that makes the slightest sense.

Also if you really want FALLOC_FL_KEEP_SIZE, it won't be adequate.

-- 
David A. Holland
dholl...@netbsd.org
Re: posix_fallocate
> > [posix_fallocate]
> > We could fork a kernel thread that would go to userspace to do the work with a write() loop, with appropriate credentials. Does it make sense?
>
> It would need to be a read/write loop, nothing says that there cannot already be blocks allocated in the space being fallocated, and their content should not change.

That's a reason to put it in the kernel, actually. The kernel can tell which ranges of a file have already had space allocated for them, so it can be just writes. (Well, for local filesystems. Throw NFS or its ilk into the mix and it gets more interesting.)

> But if implemented that way, why bother at all? Why not just put the code in a user space libc posix_fallocate() function, and be done with it, it should not require any kernel support at all.

Well, I don't know what the point of having posix_fallocate at all would be. But the obvious answer to this is atomicity, especially in the presence of other writers: the kernel is capable of making sure it doesn't destroy someone else's write by mistake, which userland isn't (in the userland implementation, there's a window between read and write when someone else can write only to get overwritten). If that matters for the target application, that's a reason to prefer an in-kernel implementation.

> On the other hand, posix_fallocate() could allocate petabytes in a single invocation of the sys call, assuming that the filesystem had that much space available.
>
> I haven't looked recently, but last time I did, preemptible sys calls still didn't mean that userland signals would be delivered in the middle of the operation of a single sys call, [...]

No, it doesn't mean that, strictly. But any syscall that feels like it can be signalable during any sleep involved in its operation; this was true even before multiprocessor support.

There will be a loop involved _somewhere_ in any posix_fallocate() implementation, and I can't imagine that an implementation wouldn't sleep somewhere waiting for the underlying filesystem operations. Those sleeps can be made interruptible by signals; at most it will complicate the exit path.

> nor does anything suggest that signals are supposed to interrupt the operation of posix_fallocate() half way through - which suggests to me, that as designed, it should continue until it is finished once invoked, whatever anyone tries to do to the process that invoked it.

Actually, I see nothing in the description that prevents that. Given what it does, a half-completed posix_fallocate is indistinguishable, to userland, from a never-started posix_fallocate, provided the former hasn't got as far as affecting st_size.

It would be interesting to deal with a posix_fallocate that raises st_size being interrupted after it's written some but not all of the new space, especially in the presence of another writer writing into the same area of the file, but I feel certain those problems are solvable, even if it means pushing a small fraction of the implementation down into the filesystem - and maybe not even that; the existence of kqueue's EVFILT_VNODE NOTE_WRITE means that most of the necessary machinery is already in place.

This is not to say that I support the idea of adding posix_fallocate; I'm not sure what I think on that question. But some of the arguments kre has presented here do not, IMO, hold water.

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B
Re: posix_fallocate
    Date:        Sun, 17 Nov 2013 14:12:16 -0500 (EST)
    From:        Mouse mo...@rodents-montreal.org
    Message-ID:  201311171912.oaa17...@chip.rodents-montreal.org

  | That's a reason to put it in the kernel, actually. The kernel can tell
  | which ranges of a file have already had space allocated for them, so it
  | can be just writes.

Sure, but that's just an optimisation, and surely only matters if this is done enough that the difference is actually of significance. Do you really believe it is, or ever will be?

  | Well, I don't know what the point of having posix_fallocate at all
  | would be.

Agree there, it seems mostly useless.

  | But the obvious answer to this is atomicity, especially in
  | the presence of other writers:

Really? You're imagining multiple writers, writing to the same file, without any co-ordination (like locking, or whatever other way works) and you're actually worried about unpredictable results??? Really?

But beyond that, since you plan on allowing signals to interrupt the operation part way through, what atomicity are you really getting anyway?

  | No, it doesn't mean that, strictly. But any syscall that feels like it
  | can be signalable during any sleep involved in its operation;

Sure, though the internal implementation of the sys call has to explicitly make that happen - and to be useful, the sys call interface really has to be designed to support it. This one isn't (and as specified, cannot be.)

That is, if you imagine the signal in question being SIGKILL (as in the scenario I postulated initially) there's no problem - if the code for the sys call wants to allow that to work before it completes, it easily can. But instead imagine it is SIGALRM - now there's a problem, since there's no way in the interface to report how much of the work was done, the only thing that can be done if EINTR happens, is repeat the sys call.

For a periodic signal like SIGALRM (or any of the other timer signals) chances are that a signal will arrive during the sys call every time, resulting in an infinite loop of fallocate/signal/fallocate/signal...

This thing is just poorly designed. Let's just ignore it (by all means, implement the hole making part from the linux interface if desired, but the allocation side just isn't needed).

  | Given what it does, a half-completed posix_fallocate is indistinguishable,
  | to userland, from a never-started posix_fallocate, provided the former
  | hasn't got as far as affecting st_size.

Of course, I'm only really interested in cases where the size can't help being affected, as it starts at 0 - cases where actual holes in the middle of a file are being filled in I see as so unlikely in practice that they're totally irrelevant (like unless the app has just made the file by seeking forward and writing a byte, how does it ever know whether or not there are holes to fill? And why would it do it that way, and follow by fallocate() ? If fallocate() exists, surely it would just use that to make the file?)

But as above, that it is indistinguishable is the problem.

kre

ps: Note that I see that the linux way of handling fallocate isn't to write blocks of 0's, but to allocate uninitialised blocks, and mark them uninitialised - I assume the way that works, is that if an app reads one of those blocks, it is just given 0's - and if it writes (the expected operation) whatever is there (0's or junk) just gets overwritten. That way they make fallocate() really fast (just assigns block numbers to the file) but it requires that the un-init flag (wherever, and however, they keep that) is 100% reliable. Nothing is really that reliable...

FFS doesn't have any mechanism to do that, so actually writing 0's would be the only way, and given that, fallocate() looks to be a total waste of time - again, given that it is an optional sys call, that no-one is required to implement, and so which no-one can assume actually exists.
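The EINTR point can be made concrete with a small sketch. fallocate_retry() and its max_tries parameter are hypothetical and only for illustration; the one interface fact relied on is that posix_fallocate() returns the error number directly rather than setting errno. Since the call reports no progress, the only thing a caller can do after EINTR is start over, and the cap is there because with a periodic timer signal the retry might otherwise never finish.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for posix_fallocate() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Hypothetical helper: repeat posix_fallocate() while it reports EINTR,
 * at most max_tries times. Each retry restarts from scratch, because
 * the interface has no way to say how much was already done.
 * Returns 0, EINTR if the tries ran out, or another errno value.
 */
int
fallocate_retry(int fd, off_t offset, off_t len, int max_tries)
{
	int error = EINTR;
	int tries;

	for (tries = 0; tries < max_tries && error == EINTR; tries++)
		error = posix_fallocate(fd, offset, len);
	return error;
}
```

Whether a given implementation ever returns EINTR part way through is a separate question; the sketch only shows what the caller's side of the contract looks like.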
Re: posix_fallocate
On Mon 18 Nov 2013 at 05:11:35 +0700, Robert Elz wrote:
> how does it ever know whether or not there are holes to fill? And why would it do it that way, and follow by fallocate() ?

Wasn't there some (proposed? actual?) interface to find holes in files?

-Olaf.
-- 
___ Olaf 'Rhialto' Seibert  -- The Doctor: No, 'eureka' is Greek for
\X/ rhialto/at/xs4all.nl    -- 'this bath is too hot.'
Re: posix_fallocate
    Date:        Sun, 17 Nov 2013 23:49:50 +0100
    From:        Rhialto rhia...@falu.nl
    Message-ID:  20131117224950.gh23...@falu.nl

  | Wasn't there some (proposed? actual?) interface to find holes in files?

I was told off-list about SEEK_HOLE and SEEK_DATA that allow (if the filesys supports them) apps to jump to the next hole, then end of hole, and that way work out where the holes are. According to the linux man page, they exist there, and in FreeBSD, DragonFly and Solaris - and might come to posix.

Personally, I think the underlying structure of files should not be made visible to apps at all - they should just see a byte stream (perhaps with an advisory useful block size to write in).

kre
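As an illustration of how an application would use those two lseek() whence values to work out where the holes are, here is a minimal sketch. list_data_extents() is a hypothetical name; the code assumes a system that defines SEEK_DATA and SEEK_HOLE (glibc needs _GNU_SOURCE for that). On a file system without real support the usual fallback is to report the whole file as a single data extent.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* glibc exposes SEEK_DATA/SEEK_HOLE only with this */
#endif
#include <errno.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: print every data extent of fd to out as
 * "data [start, end)". Everything between extents is a hole.
 * Returns the number of extents found, or -1 on error.
 * Side effect: moves the file offset of fd.
 */
int
list_data_extents(int fd, FILE *out)
{
	off_t data, hole = 0;
	int n = 0;

	for (;;) {
		data = lseek(fd, hole, SEEK_DATA);
		if (data == -1)
			/* ENXIO: no data at or after this offset. */
			return errno == ENXIO ? n : -1;
		/* The end of file always counts as a hole. */
		hole = lseek(fd, data, SEEK_HOLE);
		if (hole == -1)
			return -1;
		fprintf(out, "data [%lld, %lld)\n",
		    (long long)data, (long long)hole);
		n++;
	}
}
```

The extents reported are whatever granularity the file system chooses to expose, so two runs on different file systems can disagree about the same logical file.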
Re: posix_fallocate
> > That's a reason to put it in the kernel, actually. The kernel can tell which ranges of a file have already had space allocated for them, so it can be just writes.
>
> Sure, but that's just an optimisation, and surely only matters if this is done enough that the difference is actually of significance. Do you really believe it is, or ever will be?

Not just an optimization; it also affects correctness - see below.

> > But the obvious answer to this is atomicity, especially in the presence of other writers:
>
> Really? You're imagining multiple writers, writing to the same file, without any co-ordination (like locking, or whatever other way works) and you're actually worried about unpredictable results??? Really?

Actually, thinking about it more, atomicity is the wrong word. The correct thing to worry about here is that, as I read the pointed-to webpage, posix_fallocate() is defined to do nothing to already-present data. But, when racing with another writer, a non-kernel implementation will always have conditions under which it can destroy data written by some other process.

> > No, it doesn't mean that, strictly. But any syscall that feels like it can be signalable during any sleep involved in its operation;
>
> Sure, though the internal implementation of the sys call has to explicitly make that happen

Well, sure.

> - and to be useful, the sys call interface really has to be designed to support it. This one isn't (and as specified, cannot be.)

Huh? I don't see it that way.

> That is, if you imagine the [signal] in question being SIGKILL (as in the scenario I postulated initially) there's no problem - if the code for the sys call wants to allow that to work before it completes, it easily can. But instead imagine it is SIGALRM - now there's a problem, since there's no way in the interface to report how much of the work was done, the only thing that can be done if EINTR happens, is repeat the sys call.

Right. That's a reason to put it in the kernel, so it doesn't need to redo past work. (This is not a complete fix, because it applies only to calls which don't increase st_size.) I note that the documentation webpage lists EINTR.

> For a periodic [signal] like SIGALRM (or any of the other timer signals) chances are that a signal will arrive during the sys call every time, resulting in an infinite loop of fallocate/signal/fallocate/signal...

Sure. There are lots of ways programmers can write code which ends up in livelock. I don't see how this one deserves any more special treatment than the others.

> [...] - cases where actual holes in the middle of a file are being filled in I see as so unlikely in practice that they're totally irrelevant (like unless the app has just made the file by seeking forward and writing a byte, how does it ever know whether or not there are holes to fill? And why would it do it that way, and follow by fallocate()? If fallocate() exists, surely it would just use that to make the file?)

I see it as being intended for programs doing things like databases: they may want to allocate disk space when they know they'll want it but it's still easy to back out of the operation if it fails. Once the space is allocated, then they can carry on knowing they won't run into a full disk partway through, later, when it's much harder to deal with. In this paradigm, the application doesn't know whether there used to be a hole there and doesn't care; the important thing is that, after the call, there isn't. I'm not sure why it's better than (read-and-)write for such application, but that's the use case it feels designed for to me.

(Also, seeking and writing is not the only way to create a large file; truncate/ftruncate can extend files on at least some systems.)

> ps: Note that I see that the linux way of handling fallocate isn't to write blocks of 0's, but to allocate uninitialised blocks, and mark them uninitialised [...] That way they make fallocate() really fast (just assigns block numbers to the file) but it requires that the un-init flag (wherever, and however, they keep that) is 100% reliable. Nothing is really that reliable...

Not 100% reliable, but at least as reliable as the rest of the filesystem. FFS assumes that di_db[] in inodes, and the block allocation bitmaps, won't change behind its back, too; I don't see why this would be any different, really.

I'm not sure whether I'd be willing to pay one more bit per frag in order to (greatly) speed up reads of allocated but unwritten blocks; my own guess would be that such things are rare enough that optimizing them doesn't really matter - though, of course, I don't often find uses for things I don't have. Perhaps Linux has found a real use for such things. The major use I can think of for them are things like NFS swapfiles, where you want to allocate the whole file but have no need to write it. My own livebackup is in a similar situation; it could benefit from an allocated but unwritten state
re: posix_fallocate
> Personally, I think the underlying structure of files should not be made visible to apps at all - they should just see a byte stream (perhaps with an advisory useful block size to write in).

i would buy this argument if mmap()ing a large sparse file and filling it up randomly (but with relatively large chunks at a time) did not lead to severely fragmented files that can take 10x to read, vs one written with plain sequential write() calls.

because of this, some workaround is necessary. it is very disappointing to see an average of 120 iops of 64KB each (and only because i formatted my FS with 64KB blocks/frags!) when sequentially reading a file created by mmap().

posix_fallocate() is answering a real problem. the workaround today is to write the file, which doubles the IO traffic, and i am not sure we can do better with FFS, due to the issues you've mentioned, but there are many other filesystems in existence that do allow block allocation without exposing prior data or initialisation.

given the current issues, i'd be happy with a userspace implementation.

.mrg.
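The usage pattern behind that complaint, sketched on a system that does provide posix_fallocate(): reserve the whole file first, then map it and fill it in whatever order. map_preallocated() is a hypothetical helper name. Whether the resulting on-disk layout is actually contiguous is up to the file system; the call itself only promises that the space is allocated.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for posix_fallocate() */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/*
 * Hypothetical helper: allocate size bytes for fd up front, then map
 * the file shared and writable so it can be filled in random order.
 * Returns the mapping and sets *errp to 0, or returns NULL with *errp
 * set to an errno value (ENOSPC here means nothing was mapped yet, so
 * the caller can still back out cheaply).
 */
void *
map_preallocated(int fd, size_t size, int *errp)
{
	void *p;

	*errp = posix_fallocate(fd, 0, (off_t)size);
	if (*errp != 0)
		return NULL;
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		*errp = errno;
		return NULL;
	}
	return p;
}
```

The caller is expected to munmap() the region when done; on a system where posix_fallocate() is itself emulated by writing, this costs the same doubled IO described above.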
Re: posix_fallocate
In article 1lcgiu4.18zr2h51aac07zm%m...@netbsd.org, Emmanuel Dreyfus m...@netbsd.org wrote:
> Hi
>
> NetBSD-current seems to lack posix_fallocate(2)
> http://pubs.opengroup.org/onlinepubs/009695299/functions/posix_fallocate.html
>
> Is someone already working on it, or has thoughts about how it should be implemented?

FreeBSD has it as a system call. It should be easy to dup.

christos
Re: posix_fallocate
    Date:        Sun, 17 Nov 2013 03:18:56 + (UTC)
    From:        chris...@astron.com (Christos Zoulas)
    Message-ID:  l69cj0$f0v$1...@ger.gmane.org

  | In article 1lcgiu4.18zr2h51aac07zm%m...@netbsd.org,
  | Emmanuel Dreyfus m...@netbsd.org wrote:
  | NetBSD-current seems to lack posix_fallocate(2)

  | FreeBSD has it as a system call. It should be easy to dup.

I would suggest avoiding it. While the objective for it looks clear, and perhaps even useful, to me it doesn't seem to be implementable safely.

To me there appears to be just two ways to implement this - the safest would be a complex reservation scheme, which would account for blocks reserved to a file as if they were actually allocated, and so reducing the available space for other allocations on the filesystem. To me that looks to be an accounting nightmare to actually implement correctly in all cases (there are so many weird situations that would need solutions.)

Alternatively, the system could actually allocate all required blocks at the time of the posix_fallocate() call - effectively filling in any holes in the designated region of the file. The spec doesn't say what data is to be put in the blocks allocated to fill the holes (a well behaved application wouldn't care, as it would normally write to the file before reading it, and would be using fallocate to guarantee that the entire set of write sys calls it needed to make would succeed (or the fallocate() would fail), and the system could not run out of space half way through.) There would seem to be just two viable choices - fill the blocks with 0's, or leave random data there. The latter isn't really a choice, it is a security hole a mile wide, so fill with 0's would be the only real option.

The problem is that this opens a trivial DoS attack like

    for (;;) {
            ftruncate(fd, (off_t)0);
            posix_fallocate(fd, (off_t)0, huge);
    }

where the (off_t) huge is however big the application can get away with without failing.

For a sys call that is merely advisory to implement (not required) this all seems like a poor idea to me. Any application that really needs the function can duplicate it in user space (just a loop of read/write sys calls over the range required) which then costs user space resources, rather than kernel (or at least, not just kernel).

kre

ps: I have not examined the FreeBSD implementation - if they've done it the hard, safe, way, and worked out all the potential kinks, and if it doesn't depend too much upon other aspects of their I/O system implementation (like whatever they have to make softdeps work) then perhaps copying that might be feasible -- if the demand for this really exists, and it isn't being requested just because it is in the spec and NetBSD is lacking it.
Re: posix_fallocate
> [...] this opens a trivial DoS attack like
>
>     for (;;) {
>             ftruncate(fd, (off_t)0);
>             posix_fallocate(fd, (off_t)0, huge);
>     }

How, exactly, is this any more of a DoS than doing the same thing but using one or more write() calls instead of the posix_fallocate()?

/~\ The ASCII                             Mouse
\ / Ribbon Campaign
 X  Against HTML                mo...@rodents-montreal.org
/ \ Email!           7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B