On Feb 24, 2014, at 8:20 AM, Prakash Surya <sur...@llnl.gov> wrote:

> On Sun, Feb 23, 2014 at 06:56:39PM -0800, Richard Elling wrote:
>> On Feb 10, 2014, at 11:26 AM, Prakash Surya <sur...@llnl.gov> wrote:
>> 
>>     On Sun, Feb 09, 2014 at 10:28:44PM -0800, Richard Elling wrote:
>>          Hi Prakash,
>> 
>>          On Feb 7, 2014, at 10:41 AM, Prakash Surya <sur...@llnl.gov> wrote:
>> 
>>               Hey guys,
>> 
>>               I've been working on some ARC performance work targeted
>>               for the ZFS on Linux implementation, but I think some of
>>               the patches I'm proposing _might_ be useful in the other
>>               implementations as well.
>> 
>>               As far as I know, the ARC code is largely the same
>>               between implementations.
>> 
>>          NB, there are several different implementations that use
>>          different metadata management approaches.
>> 
>>               Although, on Linux we try to maintain a hard limit on
>>               metadata using "arc_meta_limit" and "arc_meta_used".
>>               Thus, not all of the patches are relevant outside of ZoL,
>>               but my hunch is that many definitely are.
>> 
>>          Can you explain the reasoning here? Historically, we've
>>          tried to avoid putting in absolute limits because they must
>>          be managed, and increasing management complexity is a bad
>>          idea.
>> 
>>     Honestly, I don't particularly like the distinction made between
>>     "data" and "metadata" in the ARC. I haven't seen any reason why
>>     it's needed, but it was introduced long ago, and I assume there was
>>     a reason for it (see illumos commit
>>     0e8c61582669940ab28fea7e6dd2935372681236).
>> 
>> There is a large amount of experience in the field where some workloads
>> are data-intensive and others are attribute-intensive. At LLNL you
>> probably mostly see data-intensive workloads, but attribute-intensive
>> workloads are quite common
> 
> How are you classifying "attribute-intensive"? I've never heard that
> term used before, so I don't really know what it means.

It is used quite frequently in describing workloads when studying the
performance of file systems.

> 
> We're interested in a variety of use cases, actually. Lustre has an MDS
> that is mainly xattrs, zero-length files, and fat zaps and micro zaps.
> It also has OSS nodes which store the "data" portion of files and zaps.
> But we also care about the ZPL, which can be any number of use cases,
> really. I have ZFS running on my desktop; it's not just the Lustre case
> I care about.

cool

> 
>> in the commercial application space. There is no doubt that some workloads 
>> perform much better when properly matched to the caching strategy.
> 
> OK. I'm not quite sure how this relates, though? What are you saying the
> caching strategy of the ARC is?
> 
> My tested workloads weren't "targeting" any "caching strategy". They
> just highlighted some shortcomings in the ZoL code, which I tried to
> address. I didn't try to "change" the ARC; I just wanted to "fix" it for
> certain pathologies that nobody really cared about or noticed before.

I see.

> 
>> 
>> 
>>     So, with that said, I'm not adding the "arc_meta_limit". It's
>>     already been in the code for a while, although the Illumos tree and
>>     the ZoL tree differ in their behavior when that limit is reached.
>>     The reason _why_ we differ isn't cut and dried, and might simply be
>>     due to a misunderstanding.
>> 
>>     I tend to think that maintaining an absolute limit on the metadata
>>     is a good thing, but solely because we have this arbitrary notion
>>     baked into the eviction processes that "metadata" is more important
>>     than "data". I think that given a specific workload (I haven't
>>     tested this), the ARC could fill up with "not as important"
>>     metadata because the data is always pitched first (whether or not
>>     the data is getting more ghost hits).
>> 
>> 
>>               To highlight, I think these might be of particular
>>               interest:
>> 
>>                 * 22be556 Disable aggressive arc_p growth by default
>> 
>>          MRU (p) growth is the result of demand, yes?
>> 
>>     What do you mean by "result of demand"?
>> 
>> ZFS doesn't decide to grow the MRU on its own; that is driven by the
>> application's demand.
> 
> The MRU target (i.e. "p") grows as the MRU ghost list receives hits and
> shrinks as the MFU ghost list receives hits. It *also* grows when
> anonymous data is added and arc_size is less than arc_c (at least
> upstream; we just disabled that in ZoL).
> 
> So, while I don't really agree with your sentence, I think(?) you meant
> what I wrote above.
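> 
> To make that concrete, here is a tiny user-space model of the adaptation
> described above. It's an illustrative sketch only, not the actual arc.c
> code; the real implementation scales each adjustment by the relative
> ghost list sizes and applies further limits, and the names and numbers
> below are made up:
> 
>     /* toy model of ARC "p" adaptation; build with: cc -o arcp arcp.c */
>     #include <stdint.h>
>     #include <stdio.h>
> 
>     #define MIN(a, b) ((a) < (b) ? (a) : (b))
> 
>     static uint64_t arc_c = 1024;  /* overall cache target (arbitrary units) */
>     static uint64_t arc_p = 512;   /* MRU target; MFU target is arc_c - arc_p */
>     static uint64_t arc_size = 0;  /* amount currently cached */
> 
>     /* Hit in the MRU ghost list: grow the MRU target. */
>     static void mru_ghost_hit(uint64_t bytes)
>     {
>         arc_p = MIN(arc_c, arc_p + bytes);
>     }
> 
>     /* Hit in the MFU ghost list: shrink the MRU target (grow the MFU's share). */
>     static void mfu_ghost_hit(uint64_t bytes)
>     {
>         arc_p = (arc_p > bytes) ? arc_p - bytes : 0;
>     }
> 
>     /* New anonymous (dirty) data: upstream also bumps "p" while arc_size < arc_c. */
>     static void anon_insert(uint64_t bytes)
>     {
>         if (arc_size < arc_c)
>             arc_p = MIN(arc_c, arc_p + bytes);
>         arc_size += bytes;
>     }
> 
>     int main(void)
>     {
>         mfu_ghost_hit(128);  /* the MFU is "hot": p should shrink...            */
>         anon_insert(256);    /* ...but a burst of dirty data pushes p back up   */
>         mru_ghost_hit(64);
>         printf("arc_p = %llu of arc_c = %llu\n",
>             (unsigned long long)arc_p, (unsigned long long)arc_c);
>         return 0;
>     }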
> 
>> 
>> 
>>     According to the paper the implementation was based on, "p" should
>>     increase as the MRU ghost list receives hits. The actual
>>     implementation diverges from the paper in a number of places.
>> 
>> If you see lots of ghost use, then something is imbalanced. For many 
>> workloads
>> you should see few or no ghost hits.
> 
> Yes, I agree. But that was not the case for the workloads I tested until
> I "fixed" things. Those workloads *should* have seen few to no ghost
> hits, but the ARC on ZoL was broken, causing useful data to be pitched.
> 
>> 
>> 
>>     In this case, "p" is incremented as we add new anonymous data in
>>     arc_get_data_buf(). It looks like this is an optimization that
>>     should only be done when the ARC is still "warming up", but that's
>>     not how it works in practice.
>> 
>>     What I've seen is that the new anonymous data pushes "p" up to the
>>     upper limit (due to a constant stream of new dirty data), and this
>>     throttles the MFU. So, even though the MFU will get an order of
>>     magnitude more ghost list hits, "p" won't properly adjust because
>>     the dirty data is pushing "p" up to its maximum. That's why I added
>>     that patch.
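>> 
>>     To put numbers on that: with "p" pinned at arc_c by the steady
>>     stream of dirty data, the MFU's share of the cache (arc_c - p)
>>     collapses to zero, so the adjustment pass keeps trimming the MFU no
>>     matter how many ghost hits it gets. A minimal sketch of that
>>     arithmetic (illustrative only, not the actual arc_adjust() code;
>>     all values are made up):
>> 
>>         #include <stdint.h>
>>         #include <stdio.h>
>> 
>>         int main(void)
>>         {
>>             uint64_t arc_c = 1024, arc_p = 1024; /* p pinned at its maximum */
>>             uint64_t mfu_size = 400;             /* bytes currently on the MFU list */
>>             uint64_t mfu_target = arc_c - arc_p; /* 0: no room left for the MFU */
>>             uint64_t overage = (mfu_size > mfu_target) ? mfu_size - mfu_target : 0;
>> 
>>             printf("MFU target = %llu, adjust wants to evict %llu from the MFU\n",
>>                 (unsigned long long)mfu_target, (unsigned long long)overage);
>>             return 0;
>>         }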
>> 
>> Again, this is a result of demand.
> 
> Uh, yes, but is it correct? Should "hot" MFU data be ostracized from the
> ARC because we're adding a lot of anonymous data at the same time?
> 
>> 
>> 
>> 
>>                 * 5694b53 Allow "arc_p" to drop to zero or grow to "arc_c"
>> 
>>          Zero p means zero demand?
>>          Also, can you explain the reasoning for not wanting anything
>>          in the MFU cache? I suppose if you totally disable the MFU
>>          cache, then you'll get the behaviour of most other file system
>>          caches, but that isn't a good thing.
>> 
>>     Again, what do you mean by "demand"?
>> 
>>     I definitely **do not** want to disable the MFU. I'm not sure where
>>     you got the idea that I'm trying to keep data out of the MFU,
>>     because that's not what I'm doing at all.
>> 
>>     This patch simply removes the arbitrary limits on the max and min
>>     size of "p". If the workload is driving "p" up or down, I don't
>>     understand why we need to try to override the adaptive logic by
>>     placing a min and max value on it.
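>> 
>>     For illustration, here is roughly what the clamp does versus
>>     letting "p" float freely. This is a sketch, not the exact ZoL code
>>     (the real bound is derived from the zfs_arc_p_min_shift tunable,
>>     and the shift and values below are made up):
>> 
>>         #include <stdint.h>
>>         #include <stdio.h>
>> 
>>         #define MIN(a, b) ((a) < (b) ? (a) : (b))
>>         #define MAX(a, b) ((a) > (b) ? (a) : (b))
>> 
>>         int main(void)
>>         {
>>             uint64_t arc_c = 1024;
>>             uint64_t arc_p_min = arc_c >> 4; /* e.g. 1/16th of arc_c */
>>             uint64_t wanted_p = 8;           /* what the adaptive logic asked for */
>> 
>>             uint64_t clamped = MIN(arc_c - arc_p_min, MAX(arc_p_min, wanted_p));
>>             uint64_t unclamped = MIN(arc_c, wanted_p); /* with the patch applied */
>> 
>>             printf("clamped p = %llu, unclamped p = %llu\n",
>>                 (unsigned long long)clamped, (unsigned long long)unclamped);
>>             return 0;
>>         }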
>> 
>> 
>>                 * 517a0bc Disable arc_p adapt dampener by default
>>                 * 2d1f779 Remove "arc_meta_used" from arc_adjust calculation
>>                 * 32a96d6 Prioritize "metadata" in arc_get_data_buf
>>                 * b3b7236 Split "data_size" into "meta" and "data"
>> 
>>               Keep in mind, my expertise with the ARC is still limited,
>>               so if anybody finds any of these patches "wrong" (for a
>>               particular workload, maybe) please let me know. The full
>>               patch stack I'm proposing on Linux is here:
>> 
>>                 * https://github.com/zfsonlinux/zfs/pull/2110
>> 
>>               I posted some graphs of useful arcstat parameters vs.
>>               time for each of the 14 unique tests run. Those are in
>>               this comment:
>> 
>>                 * https://github.com/zfsonlinux/zfs/pull/2110#issuecomment-34393733
>> 
>>               And here's a snippet from the pull request description
>>               with a summary of the benefits this patch stack has shown
>>               in my testing (go check out the pull request for more
>>               info on the tests run and results gathered):
>> 
>>                 Improve ARC hit rate with metadata-heavy workloads
>> 
>>                 This stack of patches has been empirically shown to
>>                 drastically improve the hit rate of the ARC for certain
>>                 workloads. As a result, fewer reads to disk are
>>                 required, which is generally a good thing and can
>>                 drastically improve performance if the workload is disk
>>                 limited.
>> 
>>                 For the impatient, I'll summarize the results of the
>>                 tests performed:
>> 
>>                     * Test 1 - Creating many empty directories. This
>>                       test saw 99.9% fewer reads and 12.8% more inodes
>>                       created when running *with* these changes.
>> 
>>                     * Test 2 - Creating many empty files. This test saw
>>                       4% fewer reads and 0% more inodes created when
>>                       running *with* these changes.
>> 
>>                     * Test 3 - Creating many 4 KiB files. This test saw
>>                       96.7% fewer reads and 4.9% more inodes created
>>                       when running *with* these changes.
>> 
>>                     * Test 4 - Creating many 4096 KiB files. This test
>>                       saw 99.4% fewer reads and 0% more inodes created
>>                       (but took 6.9% fewer seconds to complete) when
>>                       running *with* these changes.
>> 
>>                     * Test 5 - Rsync'ing a dataset with many empty
>>                       directories. This test saw 36.2% fewer reads and
>>                       66.2% more inodes created when running *with*
>>                       these changes.
>> 
>>                     * Test 6 - Rsync'ing a dataset with many empty
>>                       files. This test saw 30.9% fewer reads and 0%
>>                       more inodes created (but took 24.3% fewer seconds
>>                       to complete) when running *with* these changes.
>> 
>>                     * Test 7 - Rsync'ing a dataset with many 4 KiB
>>                       files. This test saw 30.8% fewer reads and 173.3%
>>                       more inodes created when running *with* these
>>                       changes.
>> 
>>          AIUI, the tests will work better with a large, MFU metadata
>>          cache. Yet the proposed changes can also result in small,
>>          MRU-only metadata caches -- which would be disastrous to most
>>          (all?) applications. I'd love to learn more about where you
>>          want to go with this.
>> 
>>     Hmm, I don't quite understand your comment? I'm not trying to
>>     disable the MFU at all? If anything, I'm trying to make sure it
>>     works on ZoL (which I don't think it does at the moment). Depending
>>     on the workload, the MFU is *very* useful, and it's especially
>>     useful in the tests I looked at.
>> 
>> Perhaps the way you are describing your changes is causing confusion.
>> There are some things I do like, such as separating the accounting for
>> data vs. metadata. But I'm not convinced the balancing changes are
>> generally applicable, and your tests are very specific to one type of
>> workload. Let's see how it works given wider exposure before pushing
>> upstream.
> 
> OK, sorry if my description is causing confusion. In that case, please
> see the 11 most recent commits to the ZoL tree; the code speaks for
> itself.
> 
> The ARC is complicated and subtle, and as far as I can tell, nobody in
> the open ZFS community really understands it well.

Disagree. There are a lot of people in the OpenZFS community who spend a
lot of time in the ARC. Some of us have also spent lots of time on other
cache designs, including CPU caches. We know the big problems that local
optimizations create and, because of those battle scars, are cautious
when changing cache policies.

I'm not 100% sure of the provenance of the ARC code in ZoL. Are you saying
that the breakage is specific to ZoL and does not affect upstream? Or do you 
believe the issues have been around for a long time?

> So if people are just
> unfamiliar with the code and afraid of change, I get that. But these
> changes are needed for use cases we care about on ZoL. They landed last
> Friday, so let's see how they hold up in the hands of the broader ZoL
> community.

yep, looking forward to seeing how it works for everyone in ZoL.
 -- richard

> 
> -- 
> Cheers, Prakash
> 
>>  -- richard
>> 
>> 
>>          -- richard
>> 
>> 
>>               So, in the interest of collaboration (and potentially
>>               getting much-needed input from people with more ARC
>>               expertise than I have), I wanted to give this work a
>>               broader audience.
>> 
>>               --
>>               Cheers, Prakash
>> 
>> 

--

richard.ell...@richardelling.com
+1-760-896-4422



_______________________________________________
developer mailing list
developer@open-zfs.org
http://lists.open-zfs.org/mailman/listinfo/developer
