On Fri, Jan 7, 2011 at 5:12 PM, Chris Mason <chris.ma...@oracle.com> wrote:
>> I'm not sure why you would run out of memory in that case.
>
> Well, let's make sure I've got a good handle on the proposed interface:
>
> 1) fd = open(some_file, O_ATOMIC)

No, O_TRUNC should be passed to open(). A separate truncate might work too.

> 2) ftruncate(fd, 0)
> 3) write(fd, new data)
>
> The semantics are that we promise not to let the truncate hit the disk
> until the application does the write.
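
To make the sequence concrete, here is a rough C sketch of the call pattern
with O_TRUNC passed at open time. O_ATOMIC is only the flag proposed in this
thread and exists in no kernel; it is defined as 0 below purely so the sketch
compiles, which means it runs as an ordinary, non-atomic truncating rewrite.
atomic_rewrite() is a made-up helper name.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical flag from this thread; not defined by any kernel.
 * Defining it as 0 makes this a plain (non-atomic) O_TRUNC rewrite. */
#ifndef O_ATOMIC
#define O_ATOMIC 0
#endif

/* Replace the contents of an existing file with buf[0..len).
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
static int atomic_rewrite(const char *path, const void *buf, size_t len)
{
    /* Truncation is requested in the same call that opens the file. */
    int fd = open(path, O_WRONLY | O_TRUNC | O_ATOMIC);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing; retry */
            close(fd);
            return -1;
        }
        p += n;                 /* handle short writes */
        len -= (size_t)n;
    }
    return close(fd);
}
```

With the proposed semantics, a crash between open() and the write reaching
disk would leave the old contents in place instead of an empty file.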
>
> We have a few choices on how we do this:
>
> 1) Leave the disk untouched, but keep something in memory that says this
> inode is really truncated
>
> 2) Record on disk that we've done our atomic truncate but it is still
> pending.  We'd need some way to remove or invalidate this record after a
> crash.
>
> 3) Go ahead and do the operation but don't allow the transaction to
> commit until the write is done.
>
> option #1: keep something in memory.  Well, any time we have a
> requirement to pin something in memory until userland decides to do a
> write, we risk oom.

Since the file is open, you have to keep something in memory anyway,
right? Adding a bit (or bool) does not make a difference IMO.
Isn't this comparable to opening a temp file?
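
As a toy illustration of what I mean by "a bit", something like the following
would be the whole in-memory state for option #1. This is a user-space model
only; struct toy_inode and the toy_* functions are invented for this mail and
are not how btrfs or any other filesystem represents an inode.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory view of an open inode (illustration only). */
struct toy_inode {
    size_t disk_size;         /* file size as currently recorded on disk */
    bool   pending_truncate;  /* atomic truncate requested, not yet on disk */
};

/* Atomic truncate: touch nothing on disk, just remember the request. */
static void toy_atomic_truncate(struct toy_inode *ino)
{
    ino->pending_truncate = true;
}

/* Size seen by processes using the inode while the system is running. */
static size_t toy_visible_size(const struct toy_inode *ino)
{
    return ino->pending_truncate ? 0 : ino->disk_size;
}

/* New data has been written out: the truncate and the write land together. */
static void toy_commit_write(struct toy_inode *ino, size_t new_len)
{
    ino->disk_size = new_len;
    ino->pending_truncate = false;
}

/* After a crash the flag is gone, so only the on-disk size remains. */
static size_t toy_size_after_crash(const struct toy_inode *ino)
{
    return ino->disk_size;
}
```

The flag lives only as long as the in-memory inode does, and that has to stay
around anyway while the file is open.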

> option #2: disk format change.  Actually somewhat complex because if we
> haven't crashed, we need to be able to read the inode in again without
> invalidating the record but if we do crash, we have to invalidate the
> record.  Not impossible, but not trivial.
>
> option #3: Pin the whole transaction.  Depending on the FS this may be
> impossible.  Certain operations require us to commit the transaction to
> reclaim space, and we cannot allow userland to put that on hold without
> deadlocking.

#1 is the only one that makes sense.

> What most people don't realize about the crash safe filesystems is they
> don't have fine grained transactions.  There is one single transaction
> for all the operations done.  This is mostly because it is less complex
> and much faster, but it also makes any 'pin the whole transaction' type
> system unusable.

AFAIK the cost of fine-grained transactions is mostly more complex code and
more runtime overhead, not disk performance.

-- 
Olaf
