From what I gather, more elaborate handling of BPs is needed. I don't mind
implementing and evaluating prototypes of the approaches you mentioned, maybe
even some kind of hybrid. The final solution should have a more predictable,
and controllable, effect on performance when other i/o is going on.
I'm more concerned about the changes in semantics introduced by separating the
scrub metadata walk from the scrub i/o. If I understand correctly, destroy,
ddt, and possibly other code paths will be able to change BPs between those
two phases, so those changes would also have to be tracked.

Also, the new scrub algorithm would have to be guarded by a new
'spa_feature' flag.
I'm not sure whether the old behavior must be preserved for old pools?
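
For my own reference, I imagine this would be a new entry in
zpool_feature_init(), something like the sketch below. The feature name, GUID
prefix, and flags are placeholders I've made up, not a proposal:

        /* Hypothetical feature registration; every name here is a placeholder. */
        zfeature_register(SPA_FEATURE_SORTED_SCRUB,
            "org.openzfs:sorted_scrub", "sorted_scrub",
            "Scrub and resilver issue i/o in LBA-sorted order.",
            ZFEATURE_FLAG_READONLY_COMPAT, NULL);

Whether READONLY_COMPAT is the right flag, and what the feature would depend
on, are open questions on my side.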

Anyhow, if you have pointers on what code to look at, or where to start, let
me know.

Regards,

On Sat, Jul 9, 2016 at 11:25 PM Matthew Ahrens <mahr...@delphix.com> wrote:

> We had an intern work on "sorted scrub" last year.  Essentially the idea
> was to read the metadata to gather into memory all the BP's that need to be
> scrubbed, sort them by DVA (i.e. offset on disk) and then issue the scrub
> i/os in that sorted order.  However, memory can't hold all of the BP's, so
> we do multiple passes over the metadata, each pass gathering the next chunk
> of BP's.  This code is implemented and seems to work but probably needs
> some more testing and code cleanup.
>
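To check my understanding of this multi-pass variant, here is roughly how I
picture the top-level loop. All of the scan_* helpers below are names I made
up for illustration; a real version would presumably live in dsl_scan.c and
reuse the existing traversal code:

static void
sorted_scrub_multipass(spa_t *spa)
{
        boolean_t done = B_FALSE;
        uint64_t cursor = 0;    /* how far through the BP space we are */

        while (!done) {
                blkptr_t *bps;
                uint64_t nbps;

                /* One full metadata walk, keeping as many BPs as fit in memory. */
                done = scan_gather_pass(spa, &cursor, &bps, &nbps);

                /* Sort by DVA (i.e. offset on disk) and issue the scrub reads. */
                scan_sort_by_dva(bps, nbps);
                scan_issue_sorted(spa, bps, nbps);
                scan_gather_free(bps, nbps);
        }
}
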
> One of the downsides of that approach is having to do multiple passes over
> the metadata if it doesn't all fit in memory (which it typically does
> not).  In some circumstances, this is worth it, but in others not so much.
> To improve on that, we would like to do just one pass over the metadata to
> find all the block pointers.  Rather than storing the BP's sorted in
> memory, we would store them on disk, but only roughly sorted.  There are
> several ways we could do the sorting, which is one of the issues that makes
> this problem interesting.
>
> We could divide each top-level vdev into chunks (like metaslabs, but
> probably a different number of them) and for each chunk have an on-disk
> list of BP's in that chunk that need to be scrubbed/resilvered.  When we
> find a BP, we would append it to the appropriate list.  Once we have
> traversed all the metadata to find all the BP's, we would load one chunk's
> list of BP's into memory, sort it, and then issue the resilver i/os in
> sorted order.
>
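Here is a sketch of how I read the per-chunk variant. chunk_for_bp(),
scan_note_bp(), scan_chunk_append(), and SCAN_CHUNKS_PER_VDEV are all names I
made up; the per-chunk on-disk list itself could perhaps be a bpobj or some
plain object we append to:

static uint64_t
chunk_for_bp(vdev_t *vd, const blkptr_t *bp, uint64_t nchunks)
{
        uint64_t offset = DVA_GET_OFFSET(&bp->blk_dva[0]);

        /* Map the on-disk offset to one of nchunks equal slices of the vdev. */
        return (offset / (vd->vdev_asize / nchunks));
}

/* Called from the metadata traversal for every BP that needs scrubbing. */
static void
scan_note_bp(spa_t *spa, const blkptr_t *bp, dmu_tx_t *tx)
{
        vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(&bp->blk_dva[0]));
        uint64_t chunk = chunk_for_bp(vd, bp, SCAN_CHUNKS_PER_VDEV);

        /* Append to that chunk's on-disk list; it gets sorted and issued later. */
        scan_chunk_append(vd, chunk, bp, tx);
}
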
> As an alternative, it might be better to accumulate as many BP's as fit in
> memory, sort them, and then write that sorted list to disk.  Then remove
> those BP's from memory and start filling memory again, write that list,
> etc.  Then read all the sorted lists in parallel to do a merge sort.  This
> has the advantage that we do not need to append to lots of lists as we are
> traversing the metadata. Instead we have to read from lots of lists as we
> do the scrubs, but this should be more efficient.  We also don't have to
> determine beforehand how many chunks to divide each vdev into.
>
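And my reading of the sorted-runs variant: spill each memory-full batch to
disk as a sorted run, then issue with a k-way merge over all the runs. Every
scan_run_* and scan_issue_* name below is made up; only the merge loop itself
is the point:

static void
scan_issue_merged(spa_t *spa, scan_run_t **runs, int nruns)
{
        int i, best;

        for (;;) {
                best = -1;

                /* Pick the run whose next queued BP is lowest on disk. */
                for (i = 0; i < nruns; i++) {
                        if (scan_run_empty(runs[i]))
                                continue;
                        if (best == -1 ||
                            scan_run_next_offset(runs[i]) <
                            scan_run_next_offset(runs[best]))
                                best = i;
                }
                if (best == -1)
                        break;          /* all runs are drained */

                scan_issue_bp(spa, scan_run_pop(runs[best]));
        }
}
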
> If you'd like to continue working on sorted scrub along these lines, let
> me know.
>
> --matt
>
>
> On Sat, Jul 9, 2016 at 7:10 AM, Gvozden Neskovic <nesko...@gmail.com>
> wrote:
>
>> Dear OpenZFS developers,
>>
>> Since the SIMD RAID-Z code was merged into ZoL [1], I started to look into
>> the rest of the scrub/resilvering code path.
>> I've found some existing specs and ideas about how to make the process more
>> rotational-drive friendly [2][3][4][5].
>> What I've gathered from these is that the scrub should be split into
>> metadata and data traversal phases. As I'm new to ZFS, I've made a quick
>> prototype simulating a large elevator, using an AVL tree to sort blocks by
>> DVA offset [6]. It's probably broken in more than a few ways, but it was
>> just a quick hack to get a grasp of the code. The solution turned out
>> similar to the 'ASYNC_DESTROY' feature, so I'm wondering if this might be a
>> direction to take?
>>
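For context, the core of that prototype is basically an AVL tree keyed on DVA
offset. A simplified sketch of the idea, not the literal code in [6]:

typedef struct scan_elev_ent {
        avl_node_t              se_node;
        uint64_t                se_offset;      /* DVA offset, the sort key */
        blkptr_t                se_bp;
        zbookmark_phys_t        se_zb;          /* needed to re-issue the i/o */
} scan_elev_ent_t;

static int
scan_elev_compare(const void *a, const void *b)
{
        const scan_elev_ent_t *ea = a, *eb = b;

        if (ea->se_offset != eb->se_offset)
                return (ea->se_offset < eb->se_offset ? -1 : 1);
        /*
         * Tie-break on the bookmark; a real version needs a fully
         * unique ordering for the AVL tree.
         */
        if (ea->se_zb.zb_blkid != eb->se_zb.zb_blkid)
                return (ea->se_zb.zb_blkid < eb->se_zb.zb_blkid ? -1 : 1);
        return (0);
}

Scrub reads get queued in the tree instead of being issued right away, and a
drain pass then walks the tree in offset order.
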
>> At this stage, I would appreciate any input on how to proceed with this
>> project. If you're a core dev and would like
>> to provide any kind of mentorship, or are willing to answer some questions
>> from time to time, please let me know.
>> Or, if there's a perfect solution for this just waiting to be
>> implemented, even better.
>> For starters, pointers like: read this article, make sure you understand
>> this piece of code, etc., would also be very helpful.
>>
>> Regards,
>>
>> [1]
>> https://github.com/zfsonlinux/zfs/commit/ab9f4b0b824ab4cc64a4fa382c037f4154de12d6
>> [2] https://blogs.oracle.com/roch/entry/sequential_resilvering
>> [3]
>> http://wiki.old.lustre.org/images/f/ff/Rebuild_performance-2009-06-15.pdf
>> [4] https://blogs.oracle.com/ahrens/entry/new_scrub_code
>> [5] http://open-zfs.org/wiki/Projects#Periodic_Data_Validation
>> [6]
>> https://github.com/ironMann/zfs/commit/9a2ec765d2afc38ec76393dd694216fae0221443
>>
>