I've been investigating the interaction between ZFS and UMA for a while.
You might remember that there is noticeable fragmentation in the ZFS UMA zones
when UMA use is not enabled for the actual data/metadata buffers.

I also noticed that when UMA use is enabled for the data/metadata buffers
(zio.use_uma=1), the amount of memory tied up in free items of the ZFS UMA
zones becomes really huge.  And this is despite the fact that the vast majority
of the data/metadata zones have items whose sizes are multiples of the page
size, so this can't really be explained by fragmentation.

Further checks show that the free items accumulate in the per-CPU cache
buckets.  uz_count for those buckets starts at 1, but over time, during bursts
of activity, it grows up to a maximum of 128.
The problem with those buckets is that they are not drained on low-memory
conditions, and uz_count never goes back down.

So, after a while, I observe about 300 free items (on a mere two-core system)
cached in the 4 per-CPU buckets of a single zone with a 128KB item size.
That's 30MB right there.
Across all the data and metadata zones the number goes as high as 500MB on my
machine with 4GB of physical RAM.
This seems like a bit too much to me.
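
Just to put an upper bound on it (assuming, as I read the current code, that
each CPU keeps an allocation bucket and a free bucket of up to uz_count items;
the helper below is purely illustrative, not a real UMA function):

#include <stddef.h>

/*
 * Rough upper bound on the memory parked in the per-CPU caches of a single
 * zone: two buckets (alloc + free) per CPU, each holding up to uz_count items.
 */
static size_t
pcpu_cache_bound(int ncpus, int uz_count, size_t item_size)
{
	return ((size_t)ncpus * 2 * uz_count * item_size);
}

/*
 * For the 128KB zone above, pcpu_cache_bound(2, 128, 128 * 1024) is 64MB
 * in the worst case -- for one zone.
 */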

Keeping free items around improves performance, but it also consumes memory,
and the fact that this memory is not freed on low-memory conditions makes the
situation worse.

So, I decided to take a look at how they handle this situation in (Open)Solaris.
There is this good book:
http://books.google.com/books?id=r_cecYD4AKkC&printsec=frontcover
Please see section 6.2.4.5 on page 225 and table 6-11 on page 226.
And also this code:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/os/kmem.c#971
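
Very roughly, the idea there is to pick the per-CPU magazine (bucket) size from
a small table keyed on the buffer size, so that big buffers get only small
magazines.  A sketch of that idea in C -- the thresholds and sizes below are
invented for illustration, the real values are in table 6-11 and in kmem.c:

#include <stddef.h>

/*
 * Illustrative sketch only: the thresholds and magazine sizes are made up,
 * see table 6-11 and the kmem.c link above for the real Solaris values.
 */
struct mag_limit {
	size_t	ml_maxbuf;	/* buffers up to this size ...            */
	int	ml_magsize;	/* ... get magazines of this many entries */
};

static const struct mag_limit mag_limits[] = {
	{ 256,		128 },
	{ 4096,		32 },
	{ 65536,	8 },
	{ 262144,	4 },
	{ 0,		0 },	/* anything larger: no per-CPU caching */
};

static int
magazine_size(size_t bufsize)
{
	const struct mag_limit *ml;

	for (ml = mag_limits; ml->ml_maxbuf != 0; ml++)
		if (bufsize <= ml->ml_maxbuf)
			return (ml->ml_magsize);
	return (0);
}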

It makes sense to me to limit the size of the per-CPU buckets depending on the
item size.  I even wrote a somewhat hackish patch [attached].
I didn't go as far as they did in Solaris, though, so the minimum bucket size
limit is 4.  But perhaps it would make sense to not use the per-CPU cache at
all starting with a certain item size.
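
What the patch does, restated as a standalone function (same constants and
logic as in the zone_ctor() hunk below): one full bucket is capped at roughly
BUCKET_SIZE_THRESHOLD bytes worth of items, clamped to the [4, 128] range.

#include <stddef.h>

/* Same constants as in the patch. */
#define	BUCKET_SIZE_THRESHOLD	131072
#define	BUCKET_MAX		128
#define	BUCKET_SHIFT		2

/*
 * Per-zone bucket size limit as computed in zone_ctor() by the patch:
 * a full bucket holds at most ~BUCKET_SIZE_THRESHOLD bytes of items,
 * clamped to [1 << BUCKET_SHIFT, BUCKET_MAX] entries.
 */
static int
bucket_limit(size_t item_size)
{
	int limit;

	limit = BUCKET_SIZE_THRESHOLD / item_size;
	if (limit > BUCKET_MAX)
		limit = BUCKET_MAX;
	else if (limit < (1 << BUCKET_SHIFT))
		limit = 1 << BUCKET_SHIFT;
	return (limit);
}

/*
 * Resulting limits: 512B items -> 128, 4KB -> 32, 16KB -> 8,
 * 64KB and 128KB -> 4 (the minimum).
 */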

Another attached hack removes the zio zones whose items are larger than the
page size but not a multiple of it.  Internally such items still consume a
multiple of the page size each, so we can potentially end up with two zones
that use the same number of pages per item but have different item sizes.
With the patch they are collapsed into a single zone.
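
To make that concrete (assuming 4KB pages, and going by how zio_init() falls
back to the next larger cache): a zio_buf_6144 item and a zio_buf_8192 item
both occupy two pages inside UMA, so a separate 6KB zone only adds another set
of cached free items without saving anything; with the patch the 6KB requests
are simply served from the 8KB zone.  The helper below is just for
illustration:

#include <stddef.h>

#define	EXAMPLE_PAGE_SIZE	4096	/* assuming 4KB pages */

/* Pages backing one item in a zone whose item size exceeds the page size. */
static size_t
pages_per_item(size_t item_size)
{
	return ((item_size + EXAMPLE_PAGE_SIZE - 1) / EXAMPLE_PAGE_SIZE);
}

/* pages_per_item(6144) == pages_per_item(8192) == 2 */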

-- 
Andriy Gapon
diff --git a/sys/vm/uma_core.c b/sys/vm/uma_core.c
index 3fc5b8a..3b8384b 100644
--- a/sys/vm/uma_core.c
+++ b/sys/vm/uma_core.c
@@ -179,9 +179,12 @@ struct uma_bucket_zone {
        int             ubz_entries;
 };
 
-#define        BUCKET_MAX      128
+#define        BUCKET_SIZE_THRESHOLD   131072
+#define        BUCKET_MAX              128
 
 struct uma_bucket_zone bucket_zones[] = {
+       { NULL, "4 Bucket", 4 },
+       { NULL, "8 Bucket", 8 },
        { NULL, "16 Bucket", 16 },
        { NULL, "32 Bucket", 32 },
        { NULL, "64 Bucket", 64 },
@@ -189,7 +192,7 @@ struct uma_bucket_zone bucket_zones[] = {
        { NULL, NULL, 0}
 };
 
-#define        BUCKET_SHIFT    4
+#define        BUCKET_SHIFT    2
 #define        BUCKET_ZONES    ((BUCKET_MAX >> BUCKET_SHIFT) + 1)
 
 /*
@@ -1463,6 +1466,13 @@ zone_ctor(void *mem, int size, void *udata, int flags)
                zone->uz_count = keg->uk_ipers;
        else
                zone->uz_count = BUCKET_MAX;
+
+       zone->uz_count_max = BUCKET_SIZE_THRESHOLD / zone->uz_size;
+       if (zone->uz_count_max > BUCKET_MAX)
+               zone->uz_count_max = BUCKET_MAX;
+       else if (zone->uz_count_max < (1 << BUCKET_SHIFT))
+               zone->uz_count_max = 1 << BUCKET_SHIFT;
+
        return (0);
 }
 
@@ -2076,7 +2086,7 @@ zalloc_start:
        critical_exit();
 
        /* Bump up our uz_count so we get here less */
-       if (zone->uz_count < BUCKET_MAX)
+       if (zone->uz_count < zone->uz_count_max)
                zone->uz_count++;
 
        /*
diff --git a/sys/vm/uma_int.h b/sys/vm/uma_int.h
index 7713593..6d81e3d 100644
--- a/sys/vm/uma_int.h
+++ b/sys/vm/uma_int.h
@@ -330,6 +330,7 @@ struct uma_zone {
        u_int64_t       uz_sleeps;      /* Total number of alloc sleeps */
        uint16_t        uz_fills;       /* Outstanding bucket fills */
        uint16_t        uz_count;       /* Highest value ub_ptr can have */
+       uint16_t        uz_count_max;   /* Highest value uz_count can have */
 
        /*
         * This HAS to be the last item because we adjust the zone size
diff --git a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
index 8ddf7cd..340f676 100644
--- a/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
+++ b/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
@@ -121,10 +121,11 @@ zio_init(void)
                        align = SPA_MINBLOCKSIZE;
                } else if (P2PHASE(size, PAGESIZE) == 0) {
                        align = PAGESIZE;
+#if 0
                } else if (P2PHASE(size, p2 >> 2) == 0) {
                        align = p2 >> 2;
+#endif
                }
-
                if (align != 0) {
                        char name[36];
                        (void) sprintf(name, "zio_buf_%lu", (ulong_t)size);