(2010/12/08 5:15), Chris Mason wrote:
> Excerpts from Mike Fedyk's message of 2010-12-07 15:07:08 -0500:
>> On Tue, Dec 7, 2010 at 11:29 AM, Chris Mason <chris.ma...@oracle.com> wrote:
>>> Excerpts from Mike Fedyk's message of 2010-12-07 14:16:55 -0500:
>>>> On Tue, Dec 7, 2010 at 10:44 AM, Chris Mason <chris.ma...@oracle.com>
>>>> wrote:
>>>>> Excerpts from Tsutomu Itoh's message of 2010-12-07 02:59:52 -0500:
>>>>>> Hi,
>>>>>>
>>>>>> I think that the disk allocation size of each file becomes a monotone
>>>>>> increase when the file is made.
>>>>>> But, it sometimes returns to 0. Is it correct?
>>>>>
>>>>> Well, there's a window during the processing of delayed allocation where
>>>>> we don't have the bytes recorded as delalloc and we don't have the bytes
>>>>> recorded in the inode yet. That's why they are showing up as zero.
>>>>>
>>>>> We don't call inode_add_bytes() until after we insert the extent, but we
>>>>> drop the delalloc byte count on the file before the IO is done.
>>>>>
>>>>> Fixing it will be a little tricky because all the extent accounting
>>>>> assumes the inode_add_bytes happens at extent insertion time.
>>>>>
>>>>
>>>> How does opening the inode with O_APPEND during this window know where
>>>> to write the bytes? If it's a pointer/cursor to the EOF then that
>>>> size could be used during the window. Is that right?
>>>
>>> This counter records the number of blocks allocated to the file, and
>>> reading it with ls -l or stat is somewhat racy by nature. Most of the
>>> time it's fine, btrfs just has a really big window where the results from
>>> ls -l seem wrong.
>>>
>>
>> I see. Is it using per-cpu vars or something similar?
>
> Our stat function returns the block count in the inode plus the number
> of bytes we have accounted as delayed allocation.
>
> As we do writes to the file, the delayed allocation count goes up and
> then eventually we decide we need to do some IO.
>
> Before we do the IO, we have to decide where on the disk to write the
> extents. Once that is decided, we decrement the count of delayed
> allocation bytes.
>
> This is when stat starts returning the wrong answer.
>
> Then we do the IO, and when the IO is done we actually insert the file
> extents into the file metadata. This is when stat starts returning the
> right answer again.
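
For readers following along, the sequence quoted above can be sketched as a
toy model. This is only an illustration in Python, not btrfs code: the class
ToyInode, its method names, and the 4096-byte write size are all made up for
the example. The only thing taken from the thread is the rule that stat
reports the bytes recorded in the inode plus the bytes accounted as delayed
allocation.

```python
from dataclasses import dataclass


@dataclass
class ToyInode:
    """Toy model of the two counters that stat is described as summing."""
    inode_bytes: int = 0      # bytes recorded in the inode (cf. inode_add_bytes())
    delalloc_bytes: int = 0   # bytes accounted as delayed allocation

    def buffered_write(self, nbytes: int) -> None:
        # A write to the file only raises the delayed allocation count.
        self.delalloc_bytes += nbytes

    def allocate_extent(self, nbytes: int) -> None:
        # The on-disk location is chosen; the delalloc count is dropped
        # before the IO has been done.
        self.delalloc_bytes -= nbytes

    def io_complete(self, nbytes: int) -> None:
        # The IO is done; the extent is inserted and the bytes are
        # finally recorded in the inode.
        self.inode_bytes += nbytes

    def stat_bytes(self) -> int:
        # What this toy stat reports: inode bytes plus delalloc bytes.
        return self.inode_bytes + self.delalloc_bytes


def trace(nbytes: int = 4096) -> list:
    """Return the value stat would report before and after each step."""
    inode = ToyInode()
    seen = [inode.stat_bytes()]
    for step in (inode.buffered_write, inode.allocate_extent, inode.io_complete):
        step(nbytes)
        seen.append(inode.stat_bytes())
    return seen


if __name__ == "__main__":
    print(trace())  # [0, 4096, 0, 4096]
```

The third value in the trace is the window in question: the bytes have left
the delalloc count but have not reached the inode yet, so the sum drops back
to 0 until the IO completes.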

Understood. However, I worry that users will be confused, because the
window during which stat reports the wrong value is so long.

>
> The whole setup sounds strange, but this is how btrfs implements the
> semantics from data=ordered. We don't update the file to point to
> the new blocks until after the IO is done, so we never have to wait on
> the data IO before we can do a transaction commit. It avoids all kinds
> of latencies with fsync and other problems.
>
> One easy solution is to just add another counter in the in-memory inode
> for the number of bytes in flight that aren't accounted for in other
> places. But I'd rather not make the inode any bigger, so I'll have to
> think if we can solve this another way.
>
>>
>>> But, the counter really means nothing to the btrfs internals. When we
>>> do file operations we go based on the extent pointers we find in the
>>> tree and i_size (i_size is strictly maintained).
>>>
>>
>> Would it be too heavy of an operation to have stat walk the btrfs tree
>> to get its data?
>>
>
> I'm afraid so, stat is fairly performance critical.
>
> -chris
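
The "another counter" idea mentioned in this message can be sketched the same
way. Again this is a toy Python model and not a patch: the class
ToyInodeWithInflight, the field inflight_bytes, and the method names are
hypothetical, and nothing here reflects how btrfs actually lays out its
in-memory inode. It only shows why a third counter would close the window, at
the cost of one more field per inode.

```python
from dataclasses import dataclass


@dataclass
class ToyInodeWithInflight:
    """Toy model with a hypothetical third counter for bytes in flight."""
    inode_bytes: int = 0      # bytes recorded in the inode
    delalloc_bytes: int = 0   # bytes accounted as delayed allocation
    inflight_bytes: int = 0   # hypothetical: allocated, IO not yet complete

    def buffered_write(self, nbytes: int) -> None:
        self.delalloc_bytes += nbytes

    def allocate_extent(self, nbytes: int) -> None:
        # Move the bytes from delalloc to in-flight instead of dropping them.
        self.delalloc_bytes -= nbytes
        self.inflight_bytes += nbytes

    def io_complete(self, nbytes: int) -> None:
        # Move the bytes from in-flight into the inode once the IO is done.
        self.inflight_bytes -= nbytes
        self.inode_bytes += nbytes

    def stat_bytes(self) -> int:
        # The toy stat now sums all three counters.
        return self.inode_bytes + self.delalloc_bytes + self.inflight_bytes


def trace_with_inflight(nbytes: int = 4096) -> list:
    """Return the value stat would report before and after each step."""
    inode = ToyInodeWithInflight()
    seen = [inode.stat_bytes()]
    for step in (inode.buffered_write, inode.allocate_extent, inode.io_complete):
        step(nbytes)
        seen.append(inode.stat_bytes())
    return seen


if __name__ == "__main__":
    print(trace_with_inflight())  # [0, 4096, 4096, 4096]
```

In this model the reported value never falls back to 0 between allocation and
IO completion, because the bytes are always held by exactly one of the three
counters.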