Excerpts from Mike Fedyk's message of 2010-12-07 15:07:08 -0500:
> On Tue, Dec 7, 2010 at 11:29 AM, Chris Mason <chris.ma...@oracle.com> wrote:
> > Excerpts from Mike Fedyk's message of 2010-12-07 14:16:55 -0500:
> >> On Tue, Dec 7, 2010 at 10:44 AM, Chris Mason <chris.ma...@oracle.com> wrote:
> >> > Excerpts from Tsutomu Itoh's message of 2010-12-07 02:59:52 -0500:
> >> >> Hi,
> >> >>
> >> >> I think that the disk allocation size of each file should increase
> >> >> monotonically while the file is being written.
> >> >> But it sometimes returns to 0.  Is that correct?
> >> >
> >> > Well, there's a window during the processing of delayed allocation where
> >> > we don't have the bytes recorded as delalloc and we don't have the bytes
> >> > recorded in the inode yet.  That's why they are showing up as zero.
> >> >
> >> > We don't call inode_add_bytes() until after we insert the extent, but we
> >> > drop the delalloc byte count on the file before the IO is done.
> >> >
> >> > Fixing it will be a little tricky because all the extent accounting
> >> > assumes the inode_add_bytes happens at extent insertion time.
> >> >
> >>
> >> How does opening the inode with O_APPEND during this window know where
> >> to write the bytes?  If it's a pointer/cursor to the EOF then that
> >> size could be used during the window.  Is that right?
> >
> > This counter records the number of blocks allocated to the file, and
> > reading it with ls -l or stat is somewhat racy by nature.  Most of the
> > time it's fine; btrfs just has a really big window where the results from
> > ls -l seem wrong.
> >
> 
> I see.  Is it using per-cpu vars or something similar?

Our stat function returns the block count in the inode plus the number
of bytes we have accounted as delayed allocation.

As we do writes to the file, the delayed allocation count goes up and
then eventually we decide we need to do some IO.

Before we do the IO, we have to decide where on the disk to write the
extents.  Once that is decided, we decrement the count of delayed
allocation bytes.

This is when stat starts returning the wrong answer.

Then we do the IO, and when the IO is done we actually insert the file
extents into the file metadata.  This is when stat starts returning the
right answer again.
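
To make the sequence above concrete, here is a toy model in Python.  It is
not btrfs code: the class, its fields, and the method names are made up for
illustration, and everything is counted in bytes for simplicity.  It only
shows how a sum of two counters can dip to zero between the allocation
decision and the extent insertion.

```python
class ToyInode:
    """Toy model of the two values stat adds together (hypothetical names)."""

    def __init__(self):
        self.inode_bytes = 0     # bytes recorded in the inode
        self.delalloc_bytes = 0  # bytes accounted as delayed allocation

    def stat_bytes(self):
        # stat reports the inode count plus the delayed allocation count.
        return self.inode_bytes + self.delalloc_bytes

    def buffered_write(self, nbytes):
        # A write only raises the delayed allocation count.
        self.delalloc_bytes += nbytes

    def start_io(self):
        # The on-disk location is chosen and the delalloc count is dropped,
        # but nothing has been added to the inode yet.
        nbytes = self.delalloc_bytes
        self.delalloc_bytes = 0
        return nbytes

    def finish_io(self, nbytes):
        # The IO is done and the extent is inserted; only now are the
        # bytes added to the inode.
        self.inode_bytes += nbytes


def trace(nbytes=4096):
    """Return (step, stat value) pairs for one write followed by one IO."""
    inode = ToyInode()
    seen = []
    inode.buffered_write(nbytes)
    seen.append(("after write", inode.stat_bytes()))
    in_flight = inode.start_io()
    seen.append(("IO in flight", inode.stat_bytes()))
    inode.finish_io(in_flight)
    seen.append(("IO done", inode.stat_bytes()))
    return seen


if __name__ == "__main__":
    for step, value in trace():
        print(f"{step:>12}: {value}")
```

In this model the middle step is the window: the bytes have left the delayed
allocation count and have not reached the inode count, so the sum reads zero.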

The whole setup sounds strange, but this is how btrfs implements the
semantics from data=ordered.  We don't update the file to point to
the new blocks until after the IO is done, so we never have to wait on
the data IO before we can do a transaction commit.  It avoids all kinds
of latencies with fsync and other problems.

One easy solution is to just add another counter in the in-memory inode
for the number of bytes in flight that aren't accounted for in other
places.  But I'd rather not make the inode any bigger, so I'll have to
think if we can solve this another way.
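
For illustration only, the extra-counter idea could look roughly like the
toy model below.  This is a sketch, not a proposed patch: the class and the
in_flight_bytes field are hypothetical names, and it counts everything in
bytes.  The cost mentioned above is visible as one more field per inode.

```python
class ToyInodeWithInFlight:
    """Toy model with a third counter for bytes in flight (hypothetical)."""

    def __init__(self):
        self.inode_bytes = 0      # bytes recorded in the inode
        self.delalloc_bytes = 0   # bytes accounted as delayed allocation
        self.in_flight_bytes = 0  # bytes allocated on disk, IO not done yet

    def stat_bytes(self):
        # stat now sums all three counters.
        return self.inode_bytes + self.delalloc_bytes + self.in_flight_bytes

    def buffered_write(self, nbytes):
        self.delalloc_bytes += nbytes

    def start_io(self):
        # Move the bytes from delalloc to in-flight instead of dropping them.
        nbytes = self.delalloc_bytes
        self.delalloc_bytes -= nbytes
        self.in_flight_bytes += nbytes
        return nbytes

    def finish_io(self, nbytes):
        # Extent inserted: move the bytes from in-flight to the inode count.
        self.in_flight_bytes -= nbytes
        self.inode_bytes += nbytes


def trace_with_in_flight(nbytes=4096):
    """Return the stat value after the write, during the IO, and after it."""
    inode = ToyInodeWithInFlight()
    inode.buffered_write(nbytes)
    after_write = inode.stat_bytes()
    in_flight = inode.start_io()
    during_io = inode.stat_bytes()
    inode.finish_io(in_flight)
    after_io = inode.stat_bytes()
    return after_write, during_io, after_io


if __name__ == "__main__":
    print(trace_with_in_flight())
```

Because every byte is always held in exactly one of the three counters, the
sum stays constant across the whole write in this model.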

> 
> > But, the counter really means nothing to the btrfs internals.  When we
> > do file operations we go based on the extent pointers we find in the
> > tree and i_size (i_size is strictly maintained).
> >
> 
> Would it be too heavy of an operation to have stat walk the btrfs tree
> to get its data?
> 

I'm afraid so; stat is fairly performance critical.

-chris
