On Fri, Jan 7, 2011 at 5:12 PM, Chris Mason <chris.ma...@oracle.com> wrote:
>> I'm not sure why you would run out of memory in that case.
>
> Well, lets make sure I've got a good handle on the proposed interface:
>
> 1) fd = open(some_file, O_ATOMIC)

No, O_TRUNC should be used in open. Maybe it works with a separate
truncate too.

> 2) truncate(fd, 0)
> 3) write(fd, new data)
>
> The semantics are that we promise not to let the truncate hit the disk
> until the application does the write.
>
> We have a few choices on how we do this:
>
> 1) Leave the disk untouched, but keep something in memory that says this
> inode is really truncated
>
> 2) Record on disk that we've done our atomic truncate but it is still
> pending. We'd need some way to remove or invalidate this record after a
> crash.
>
> 3) Go ahead and do the operation but don't allow the transaction to
> commit until the write is done.
>
> option #1: keep something in memory. Well, any time we have a
> requirement to pin something in memory until userland decides to do a
> write, we risk oom.

Since the file is open, you have to keep something in memory anyway,
right? Adding a bit (or bool) does not make a difference IMO.
Isn't this comparable to opening a temp file?

> option #2: disk format change. Actually somewhat complex because if we
> haven't crashed, we need to be able to read the inode in again without
> invalidating the record but if we do crash, we have to invalidate the
> record. Not impossible, but not trivial.
>
> option #3: Pin the whole transaction. Depending on the FS this may be
> impossible. Certain operations require us to commit the transaction to
> reclaim space, and we cannot allow userland to put that on hold without
> deadlocking.

#1 is the only one that makes sense.

> What most people don't realize about the crash safe filesystems is they
> don't have fine grained transactions. There is one single transaction
> for all the operations done. This is mostly because it is less complex
> and much faster, but it also makes any 'pin the whole transaction' type
> system unusable.

AFAIK the cost is mostly more complex code / runtime. The cost is not
disk performance.
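To make the call sequence I have in mind concrete, here is a rough sketch, not a definitive implementation. O_ATOMIC is only the flag proposed in this thread and does not exist in any kernel, so it is defined as 0 below purely to let the code compile. The helper name replace_contents is made up for illustration. As compiled, this is an ordinary truncating open followed by a write, with no crash-safety guarantee; the proposal is that with a real O_ATOMIC the truncate would not reach disk before the new data does.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* HYPOTHETICAL: O_ATOMIC is the flag proposed in this thread. It is not a
 * real open(2) flag. Defining it as 0 makes this a plain O_TRUNC open. */
#ifndef O_ATOMIC
#define O_ATOMIC 0
#endif

/* Replace the contents of 'path' with 'len' bytes from 'buf'.
 * Returns 0 on success, -1 on error. (Illustrative helper name.) */
int replace_contents(const char *path, const char *buf, size_t len)
{
    /* Truncation is requested at open time via O_TRUNC rather than with a
     * separate truncate call after the open. */
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_ATOMIC, 0644);
    if (fd < 0)
        return -1;

    /* write(2) may write fewer bytes than requested, so loop. */
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            close(fd);
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }

    return close(fd);
}
```

A caller would use it as replace_contents("some_file", data, strlen(data)). Under option #1 the only extra in-memory state between the open and the first write would be the "really truncated" marker on an inode that is already pinned by the open fd.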
--
Olaf