Hi Chris,

On 07/24/2017 08:53 PM, Chris Mason wrote:
> On 07/24/2017 02:41 PM, David Sterba wrote:
>> On Mon, Jul 24, 2017 at 02:01:07PM -0400, Chris Mason wrote:
>>> On 07/24/2017 10:25 AM, David Sterba wrote:
>>>
>>>> Thanks for the extensive historical summary, this change really
>>>> deserves it.
>>>>
>>>> Decoupling the assumptions about the device's block management is
>>>> really a good thing, mount option 'ssd' should mean that the device
>>>> just has cheap seeks. Moving the allocation tweaks to ssd_spread
>>>> provides a way to keep the behaviour for anybody who wants it.
>>>>
>>>> I'd like to push this change to 4.13-rc3, as I don't think we need
>>>> more time to let other users test this. The effects of the current
>>>> ssd implementation have been debated and debugged on IRC for a long
>>>> time.
>>>
>>> The description is great, but I'd love to see many more benchmarks.
Before starting a visual guide through a subset of the benchmark-ish
material that I gathered over the last months, I would first like to
point at a different kind of benchmark, which is not a technical one:
measuring end user adoption of the btrfs filesystem.

There are weeks in which no day goes by on IRC without a user joining
the channel in tears, asking "why is btrfs telling me my disk is full
when df says it's only 50% used?" Every time, the IRC tribe has to tell
the whole story about chunks, allocated space, used space and balance.
The user understands it a bit, or gives up, confused about why all of
this is necessary and why btrfs cannot just manage its own space a bit.

Since a few weeks, the debugging sessions have become a lot shorter:
"Ok, remount -o nossd, now try balance again. Yay, now it works without
'another xyz enospc during balance' errors. Keep it in your fstab,
kthxbye."

These are the users who know IRC or try the mailing list. Others just
rage-quit btrfs again, accompanied by some not so nice words in a
Twitter message. This patch solves a significant part of this bad 'out
of the box' behaviour.

This behaviour (fully allocating raw space and then crashing) has also
been the main reason at work to keep btrfs for only a few specific use
cases, in a tightly controlled environment with extra monitoring on it.
With the -o nossd behaviour, I can start doing much more fun things
with it now, not having to walk around all day using balance to clean
up.

>>> At Facebook we use the current ssd_spread mode in production on top
>>> of hardware raid5/6 (spinning storage) because it dramatically
>>> reduces the read/modify/write cycles done for metadata writes.
>>
>> Well, I think this is an example that ssd got misused because of the
>> side effects of the allocation.
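(As an aside, coming back to the "df says it's only 50% used" confusion
for a moment: it boils down to the difference between raw space that is
allocated to chunks and space that is actually used inside them. A toy
model, with all numbers made up purely for illustration:)

```python
# Toy model of the "df says 50% used, but btrfs says disk full"
# situation: raw disk space is first allocated to chunks, and file data
# then fills space inside those chunks. All numbers here are made up.

GiB = 1024 ** 3
MiB = 1024 ** 2

disk_size = 100 * GiB

# (chunk type, allocated bytes, used bytes) on a hypothetical disk that
# was driven into this state by lots of half-empty data chunks.
chunks = [
    ("data",     90 * GiB, 45 * GiB),
    ("metadata",  8 * GiB,  2 * GiB),
    ("system",   32 * MiB, 16 * MiB),
]

allocated = sum(a for _, a, _ in chunks)
used = sum(u for _, _, u in chunks)

# df only counts bytes actually used by file data, so the fs looks
# roughly half empty...
print("df-style usage: %.0f%%" % (100.0 * used / disk_size))

# ...but a new chunk can only be carved out of unallocated raw space,
# and that is nearly gone: the next big allocation fails with ENOSPC.
unallocated = disk_size - allocated
print("unallocated raw space: %.1f GiB" % (unallocated / GiB))
```

That's the mess that balance then has to clean up, by rewriting the
contents of half-empty chunks into fewer full ones.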
>> If you observe good patterns for raid5, then the allocator should be
>> adapted for that case, otherwise ssd/ssd_spread should be independent
>> of the raid level.

If I have an iSCSI LUN on a NetApp filer with rotating disks, which is
seek-free for random writes (they go to NVRAM and then to WAFL), but
not for random reads, does that make it an ssd, or not? (Don't answer,
it's a useless question.)

When writing this patch, I deliberately went for a scenario with
minimal impact in the code, but maximum impact in default behaviour for
the end user.

> Absolutely. The optimizations that made ssd_spread useful for first
> generation flash are the same things that raid5/6 need. Big writes,
> or said differently, a minimum size for fast writes.
>
>>> If we're going to play around with these, we need a good way to
>>> measure free space fragmentation as part of benchmarks, as well as
>>> the IO patterns coming out of the allocator.

Yes! First a collection of use cases, then some reproducible
simulations, then measurements and timelapse pictures... Interesting.

By the way, in the last weeks I've been trying to debug excessive
metadata write IO behaviour of a large filesystem (from
show_metadata_tree_sizes.py output):

EXTENT_TREE 15.64GiB 0(1021576) 1(3692) 2(16) 3(1)

So far the results only point in the direction of btrfs meaning
butterfly effects, since I cannot seem to reliably cause things to
happen. However, I'll start a separate mail thread about the findings
so far, and some helper code for measuring things. Let's not do that in
here; this thread was about the data part first.

>> Hans has a tool that visualizes the fragmentation. Most complaints
>> I've seen were about 'ssd' itself, excessive fragmentation, early
>> ENOSPC. Not many people use ssd_spread, 'ssd' gets turned on
>> automatically so it has much wider impact.
>>
>>> At least for our uses, ssd_spread matters much more for metadata
>>> than data (the data writes are large and metadata is small).

Exactly.
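(A footnote on reading that show_metadata_tree_sizes.py line above: the
numbers in parentheses are node counts per tree level, level 0 being
the leaves. Multiplied by the nodesize, they add up to the reported
total. A quick sketch; the 16KiB nodesize is an assumption on my part,
it's the usual default:)

```python
# Per-level node counts from the EXTENT_TREE line quoted above:
# level 0 (the leaves) up to level 3 (the root).
levels = {0: 1021576, 1: 3692, 2: 16, 3: 1}

NODESIZE = 16 * 1024  # bytes; assuming the common default nodesize

total_bytes = sum(levels.values()) * NODESIZE
print("EXTENT_TREE %.2fGiB" % (total_bytes / 1024 ** 3))
```

(The fact that this multiplies out to the reported 15.64GiB suggests
the 16KiB nodesize assumption holds for that filesystem.)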
Metadata. But that's a completely different story from data. Actually,
thanks for the idea, I'm going to try a test with this later...
'nossd' data and 'ssd_spread' metadata.

>> From the changes overview:
>>
>>> 1. Throw out the current ssd_spread behaviour.
>>
>> would it be ok for you to keep ssd_spread working as before?
>>
>> I'd really like to get this patch merged soon because "do not use
>> ssd mode for ssd" has started to be the recommended workaround. Once
>> this sticks, we won't need to have any ssd mode anymore ...

E.g. Debian Stretch has just been released with a 4.9 kernel, so that
workaround will continue to be used for quite a while, I guess...

> Works for me. I do want to make sure that commits in this area
> include the workload they were targeting, how they were measured and
> what impacts they had.

For this one, it simply is: btrfs default options that are supposed to
optimize things for an ssd must not exhibit behaviour that causes the
ssd to actually wear out faster.

> That way when we go back to try and change this again we'll
> understand what profiles we want to preserve.

Ok, there we go.

1. Just after I finished chunk-level btrfs-heatmap. This shows a
continuous fight against an -o ssd filesystem, using balance. One
picture per day, on a 2.5TiB or so fs:
https://www.youtube.com/watch?v=Qj1lxAasytc

2. The behaviour of -o ssd on a filesystem with a postfix mail server,
mailman mailing list server and mail logging in /var/log. Data-extent
level btrfs-heatmap timelapse, 4 pictures per hour, of the 4 block
groups with the highest vaddr:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-01-19-noautodefrag-ichiban.mp4

3. The behaviour of -o nossd on the same filesystem as in 2:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-ichiban-walk-nossd.mp4

4. Again, the same fs as in 2 and 3. Guess at which point I switched
from -o ssd (automatically chosen, it's not even an ssd at all) to
adding an explicit -o nossd...
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-07-19-btrfs_usage_ssd_vs_nossd.png

(When it goes down, that's balance at work.)

5. A video of a ~40TiB filesystem that switches from -o ssd (no, it's
not an ssd, it was automatically chosen for me) to an explicit -o nossd
at some point. I think you should be able to guess when exactly this
happens. One picture per day, ordered by virtual address space:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-ssd-to-nossd.mp4

6. The effect of -o ssd on the filesystem in 5. At this point it was
already impossible to keep the allocated space down with balance:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-04-08-backups-16-Q23.png

7. A collection of pictures of what block groups look like after
they've been abused by -o ssd:
https://syrinx.knorrie.org/~knorrie/btrfs/fragmented/

These pictures are from the filesystem in 5. After going nossd, I wrote
a custom balance script that operates on free space fragmentation
level, and started cleaning up the mess, worst fragmented first. It
turned out that for the fragmentation number it returned for a data
block group (the second number in the file name), fragmented > 200 was
quite bad, and < 200 was not that bad.

8. The same timespan as the timelapse in 5, displayed as totals:
https://syrinx.knorrie.org/~knorrie/btrfs/keep/2017-07-25-backups-17-Q23.png

All of the examples, and more details about them, can be found in
mailing list threads from the last months.

Have fun,

-- 
Hans van Kranenburg
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html