Hi Filipe, Chris,

I did more profiling work, and I want to write up some of the things I
noticed. Hopefully it has some value for the community.

# btrfs_transaction.dirty_pages
This "extent_io_tree" includes every eb created by
btrfs_init_new_buffer (this is the only function that adds dirty
extents there). But sometimes such eb is marked back as clean
(clean_tree_block), but is not removed from
btrfs_transaction.dirty_pages. So I think
btrfs_write_and_wait_transaction triggers a writeout for such ebs as
well, although they are not marked as dirty, so probably this won't do
anything. I don't know if this may cause any issue.
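For reference, the asymmetry I mean looks roughly like this (trimmed
from the sources as I read them; error handling dropped):

/* btrfs_init_new_buffer(): the only place dirtying this tree */
set_extent_dirty(&trans->transaction->dirty_pages, buf->start,
		 buf->start + buf->len - 1, GFP_NOFS);

/* clean_tree_block(): clears the eb's dirty bit and the counter,
 * but nothing here touches trans->transaction->dirty_pages */
if (test_and_clear_bit(EXTENT_BUFFER_DIRTY, &buf->bflags)) {
	__percpu_counter_add(&fs_info->dirty_metadata_bytes,
			     -buf->len, fs_info->dirty_metadata_batch);
	clear_extent_buffer_dirty(buf);
}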

# btrfs_fs_info.dirty_metadata_bytes
This counter accounts for dirty metadata per filesystem, not per
particular transaction: we may have transaction X in
btrfs_write_and_wait_transaction while transaction X+1 is creating
dirty ebs. This is probably intended.
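For example, btrfs_mark_buffer_dirty() bumps the counter like this
(trimmed); note it hangs off btrfs_fs_info, so ebs dirtied by
transaction X+1 and ebs being written out for transaction X hit the
same counter:

/* trimmed from btrfs_mark_buffer_dirty() */
was_dirty = set_extent_buffer_dirty(buf);
if (!was_dirty)	/* eb was clean before */
	__percpu_counter_add(&fs_info->dirty_metadata_bytes,
			     buf->len, fs_info->dirty_metadata_batch);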

# btrfs_init_new_buffer
I was trying to account for the exact number of dirty ebs per
transaction. Obviously, any such dirty eb is always created by this
function, so I compared the number of ebs created by it vs the number
of ebs written out by write_one_eb (my accounting approach is sketched
below). I am aware that:
- a particular eb can be freed within the same transaction via
btrfs_free_tree_block(), which either pins it or drops it directly
back to the free-space cache;
- delayed references may cancel one another, and run_one_delayed_ref()
may pin the eb without doing any extent tree update;
- VM writeback may happen via btree_writepages, requiring some of the
ebs to be re-COWed.
Still, the total amount of metadata written by a particular
transaction is less than the total amount of ebs created by
btrfs_init_new_buffer, even after taking all the cases above into
account. I guess that clean_tree_block() is the culprit, but I am not
sure what happens to such a "cleaned" eb. Does it just remain in the
btree_inode page cache? Or is it freed by some path I am unaware of?
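To be concrete, my accounting was along these lines (hypothetical
debug counters for illustration only; the names are made up and this
is not actual kernel code):

/* hypothetical debug counters, names made up */
static atomic64_t dbg_ebs_created; /* bump in btrfs_init_new_buffer() */
static atomic64_t dbg_ebs_written; /* bump in write_one_eb() */

/* at the end of the transaction: even after subtracting the
 * freed/pinned/re-COWed cases listed above, dbg_ebs_created
 * stayed above dbg_ebs_written */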

# should_cow_block() + BTRFS_HEADER_FLAG_WRITTEN causing a re-COW of a
tree block
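For context, the check in question looks roughly like this (trimmed
from the sources as I read them):

static noinline int should_cow_block(struct btrfs_trans_handle *trans,
				     struct btrfs_root *root,
				     struct extent_buffer *buf)
{
	/* COW is skipped only for a block created in this transaction
	 * that has not been written out yet */
	if (btrfs_header_generation(buf) == trans->transid &&
	    !btrfs_header_flag(buf, BTRFS_HEADER_FLAG_WRITTEN) &&
	    !(root->root_key.objectid != BTRFS_TREE_LOG_OBJECTID &&
	      btrfs_header_flag(buf, BTRFS_HEADER_FLAG_RELOC)))
		return 0;
	return 1;
}
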
Chris, you mentioned that btrfs should skip memory-pressure writeback
if there is not much dirty metadata. I guess you mean the
BTRFS_DIRTY_METADATA_THRESH check in btree_writepages, plus the
skipping of "for_kupdate" writeback. Grepping through the kernel
sources, I see that btree_writepages should never see the for_reclaim
bit set, because that would require btree_inode to provide the
"writepage" callback, while it only provides "writepages"
(for_reclaim is set by vmscan.c::pageout and migrate.c::writeout,
both of which require "writepage", if I am reading correctly).
So, except for explicit writeback requests, btree_writepages should
see only "for_background" writeback, which, as I understand it, is
mostly controlled by dirty_background_ratio/dirty_background_bytes
(not by memory pressure).
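For reference, the logic is roughly (trimmed from the sources):

static int btree_writepages(struct address_space *mapping,
			    struct writeback_control *wbc)
{
	struct btrfs_fs_info *fs_info;

	if (wbc->sync_mode == WB_SYNC_NONE) {
		if (wbc->for_kupdate)
			return 0;
		fs_info = BTRFS_I(mapping->host)->root->fs_info;
		/* this is a bit racy, but that's ok */
		if (percpu_counter_compare(&fs_info->dirty_metadata_bytes,
					   BTRFS_DIRTY_METADATA_THRESH) < 0)
			return 0;
	}
	return btree_write_cache_pages(mapping, wbc);
}
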
Anyway, I disabled the WB_SYNC_NONE path entirely, and indeed I was
not seeing any re-COWs. On the other hand, I hit a few OOM panics when
dirty metadata grew to several GBs.

# global_block_rsv and update_global_block_rsv
I don't fully understand what calc_global_metadata_size() is trying
to do, but I think that once the metadata size is above a couple of
GBs, update_global_block_rsv will always refill global_block_rsv back
to 512MB, as it does:
block_rsv->size = min_t(u64, num_bytes, 512 * 1024 * 1024);
and then bumps up "reserved" to "size".
As a result, should_end_transaction() will tell us that we need to
start a commit when we grab half of that, i.e., 256MB or more, from
global_block_rsv. This seems very reasonable, as we really don't want
to have more dirty metadata than that in a single transaction.
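The check I mean is roughly (trimmed from the sources as I read them):

/* should_end_transaction(): */
return !!btrfs_block_rsv_check(root, &root->fs_info->global_block_rsv, 5);

/* btrfs_block_rsv_check() with min_factor == 5: */
num_bytes = div_factor(block_rsv->size, 5);	/* 50% of size */
if (block_rsv->reserved >= num_bytes)
	return 0;	/* more than half still reserved: keep going */
return -ENOSPC;		/* less than half left: please commit */
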
However, we might call update_global_block_rsv() on various occasions,
and every such call bumps the rsv back to 512MB. For example, when we
create a new data block group, btrfs_make_block_group will call
update_global_block_rsv, allowing us to dirty more metadata before we
commit.
This seems a bit strange: why are we bumping it back up and suddenly
allowing more updates to come in?

# run_delayed_refs by the commit thread
I see that the current spirit is that when somebody detaches from a
transaction via btrfs_end_transaction, we don't want to delay it for
too long.
So even if must_run_delayed_refs==1, we run only a small bounded
batch, roughly as many delayed refs as this handle itself queued (and
at least 32):
cur = max_t(unsigned long, cur, 32)
rather than draining the global backlog. As a result, when we trigger
a commit, we might have tens of thousands of delayed refs left for the
commit thread to run. The commit thread calls btrfs_run_delayed_refs
several times before it reaches the critical section
(TRANS_STATE_COMMIT_DOING), where we don't allow anybody to join the
transaction.

Profiling these run_delayed_refs calls shows that they are slow
because:
- we need to read the extent tree blocks from disk[1] in order to
update the extent tree (read_extent_buffer_pages), and there is only
one thread (the commit thread) reading them sequentially;
- before we reach the critical section of the commit, other threads
can also be running delayed refs, so the commit thread has to compete
with them for tree-block locks (and they hold the locks for long,
because they too, it seems, read tree blocks from disk).

So my question is: shouldn't we be much more aggressive in
__btrfs_end_transaction, running delayed refs several times, checking
trans->delayed_ref_updates after each run, and returning only once
this number is zero or small enough? Something along these lines
(purely hypothetical sketch; helper name and threshold are made up):
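/* Hypothetical sketch, not a real patch: drain this handle's share
 * of delayed refs before detaching, so the commit thread finds
 * little left. Running refs can queue more refs, hence the re-check
 * in the loop; return values ignored for brevity. */
static void drain_delayed_refs(struct btrfs_trans_handle *trans,
			       struct btrfs_root *root)
{
	unsigned long cur;

	while ((cur = trans->delayed_ref_updates) > 64) { /* made-up threshold */
		trans->delayed_ref_updates = 0;
		btrfs_run_delayed_refs(trans, root, cur);
	}
}
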
This way, when we trigger a commit, it will not have many delayed
refs left to run; it will get to the critical section very quickly,
hopefully pass it very quickly (reach TRANS_STATE_UNBLOCKED), and
then we can open a new transaction while the previous one is doing
btrfs_write_and_wait_transaction.
That's what I wanted to ask.

Thanks!
Alex.


[1] In my case, btrfs metadata is ~10GB and the machine has 8GB of
RAM. Because of this we need to read a lot of ebs from disk, as they
are not in the page cache. Also keep in mind that every COW of an eb
requires a new slot in the page cache, because we index by the
"bytenr" received from the free-space cache, which is a "logical"
coordinate by which EXTENT_ITEMs are sorted in the extent tree.



On Mon, Jul 13, 2015 at 7:02 PM, Chris Mason <c...@fb.com> wrote:
> On Mon, Jul 13, 2015 at 06:55:29PM +0200, Alex Lyakas wrote:
>> Filipe,
>> Thanks for the explanation. Those reasons were not so obvious for me.
>>
>> Would it make sense not to COW the block in case-1, if we are mounted
>> with "notreelog"? Or, perhaps, to check that the block does not belong
>> to a log tree?
>>
>
> Hi Alex,
>
> The crc rules are the most important, we have to make sure the block
> isn't changed while it is in flight.  Also, think about something like
> this:
>
> transaction write block A, puts pointer to it in the btree, generation Y
>
> <hard disk properly completes the IO>
>
> transaction rewrites block A, same generation Y
>
> <hard disk drops the IO on the floor and never does it>
>
> Later on, we try to read block A again.  We find it has the correct crc
> and the correct generation number, but the contents are actually wrong.
>
>> The second case is more difficult. One problem is that
>> BTRFS_HEADER_FLAG_WRITTEN flag ends up on disk. So if we write a block
>> due to memory pressure (this is what I see happening), we complete the
>> writeback, release the extent buffer, and pages are evicted from the
>> page cache of btree_inode. After some time we read the block again
>> (because we want to modify it in the same transaction), but its header
>> is already marked as BTRFS_HEADER_FLAG_WRITTEN on disk. Even though at
>> this point it should be safe to avoid COW, we will re-COW.
>>
>> Would it make sense to have some runtime-only mechanism to lock-out
>> the write-back for an eb? I.e., if we know that eb is not under
>> writeback, and writeback is locked out from starting, we can redirty
>> the block without COW. Then we allow the writeback to start when it
>> wants to.
>>
>> In one of my test runs, btrfs had 6.4GB of metadata (before
>> raid-induced overhead), but during a particular transaction total of
>> 10GB of metadata (again, before raid-induced overhead) was written to
>> disk. (This is the total of all ebs having
>> header->generation==curr_transid, not only during commit of the
>> transaction). This particular run was with "notreelog".
>>
>> Machine had 8GB of RAM. Linux allows the btree_inode to grow its
>> page-cache upto ~6.9GB (judging by btree_inode->i_mapping->nrpages).
>> But even though the used amount of metadata is less than that, this
>> re-COW'ing of already-COW'ed blocks seems to cause page-cache
>> thrashing...
>
> Interesting.  We've addressed this in the past with changes to the
> writepage(s) callback for the btree, basically skipping memory pressure
> related writeback if there isn't that much dirty.  There is a lot of
> room to improve those decisions, like preferring to write leaves over
> nodes, especially full leaves that are not likely to change again.
>
> -chris