Excerpts from Mike Fedyk's message of 2010-12-07 15:07:08 -0500:
> On Tue, Dec 7, 2010 at 11:29 AM, Chris Mason <chris.ma...@oracle.com> wrote:
> > Excerpts from Mike Fedyk's message of 2010-12-07 14:16:55 -0500:
> >> On Tue, Dec 7, 2010 at 10:44 AM, Chris Mason <chris.ma...@oracle.com> wrote:
> >> > Excerpts from Tsutomu Itoh's message of 2010-12-07 02:59:52 -0500:
> >> >> Hi,
> >> >>
> >> >> I think that the disk allocation size of each file becomes a monotone
> >> >> increase when the file is made.
> >> >> But, it sometimes returns to 0. Is it correct?
> >> >
> >> > Well, there's a window during the processing of delayed allocation where
> >> > we don't have the bytes recorded as delalloc and we don't have the bytes
> >> > recorded in the inode yet.  That's why they are showing up as zero.
> >> >
> >> > We don't call inode_add_bytes() until after we insert the extent, but we
> >> > drop the delalloc byte count on the file before the IO is done.
> >> >
> >> > Fixing it will be a little tricky because all the extent accounting
> >> > assumes the inode_add_bytes happens at extent insertion time.
> >> >
> >>
> >> How does opening the inode with O_APPEND during this window know where
> >> to write the bytes?  If it's a pointer/cursor to the EOF then that
> >> size could be used during the window.  Is that right?
> >
> > This counter records the number of blocks allocated to the file, and
> > reading it with ls -l or stat is somewhat racy by nature.  Most of the
> > time it's fine, btrfs just has a really big window where the results from
> > ls -l seem wrong.
> >
>
> I see.  Is it using per-cpu vars or something similar?

Our stat function returns the block count in the inode plus the number
of bytes we have accounted as delayed allocation.

As we do writes to the file, the delayed allocation count goes up, and
eventually we decide we need to do some IO.  Before we do the IO, we
have to decide where on the disk to write the extents.  Once that is
decided, we decrement the count of delayed allocation bytes.  This is
when stat starts returning the wrong answer.

Then we do the IO, and when the IO is done we actually insert the file
extents into the file metadata.  This is when stat starts returning the
right answer again.

The whole setup sounds strange, but this is how btrfs implements the
semantics of data=ordered.  We don't update the file to point to the
new blocks until after the IO is done, so we never have to wait on the
data IO before we can do a transaction commit.  It avoids all kinds of
latencies with fsync and other problems.

One easy solution is to just add another counter in the in-memory inode
for the number of bytes in flight that aren't accounted for in other
places.  But I'd rather not make the inode any bigger, so I'll have to
think about whether we can solve this another way.

> >
> > But, the counter really means nothing to the btrfs internals.  When we
> > do file operations we go based on the extent pointers we find in the
> > tree and i_size (i_size is strictly maintained).
> >
>
> Would it be too heavy of an operation to have stat walk the btrfs tree
> to get its data?
>

I'm afraid so, stat is fairly performance critical.

-chris
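
To make the accounting window concrete, here is a rough toy model of the sequence described above. It is only a sketch and not btrfs code: the class, field, and method names (ToyInode, delalloc_bytes, start_io and so on) are made up for illustration, and the 512-byte unit is the usual st_blocks convention. The second class shows the "extra in-flight counter" idea, again purely as an assumption about how it could look.

```python
from dataclasses import dataclass

SECTOR = 512  # stat conventionally reports st_blocks in 512-byte units


@dataclass
class ToyInode:
    """Toy model of the accounting; all names here are hypothetical."""
    inode_bytes: int = 0     # bytes recorded in the inode at extent insertion
    delalloc_bytes: int = 0  # bytes accounted as delayed allocation

    def buffered_write(self, n: int) -> None:
        # A write raises the delayed allocation count.
        self.delalloc_bytes += n

    def start_io(self, n: int) -> None:
        # Disk location chosen: the delalloc count is dropped before the IO.
        self.delalloc_bytes -= n

    def finish_io(self, n: int) -> None:
        # IO done, extent inserted: stands in for inode_add_bytes().
        self.inode_bytes += n

    def stat_blocks(self) -> int:
        return (self.inode_bytes + self.delalloc_bytes) // SECTOR


@dataclass
class ToyInodeInFlight(ToyInode):
    """Variant with an extra counter for bytes that are in flight."""
    inflight_bytes: int = 0

    def start_io(self, n: int) -> None:
        super().start_io(n)
        self.inflight_bytes += n

    def finish_io(self, n: int) -> None:
        self.inflight_bytes -= n
        super().finish_io(n)

    def stat_blocks(self) -> int:
        total = self.inode_bytes + self.delalloc_bytes + self.inflight_bytes
        return total // SECTOR


def trace(inode: ToyInode, n: int = 8192) -> list:
    """Return what stat would report after each of the three steps."""
    seen = []
    inode.buffered_write(n)
    seen.append(inode.stat_blocks())
    inode.start_io(n)
    seen.append(inode.stat_blocks())  # the window
    inode.finish_io(n)
    seen.append(inode.stat_blocks())
    return seen
```

With the first class, trace(ToyInode()) passes through zero in the middle step, which is the window the original report hit. With the second, the reported value stays constant across all three steps, at the cost of one more field per in-memory inode.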