Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 9/11/13 3:36 PM, Thavatchai Makphaibulchoke wrote:
> I seem to be seeing the same thing as Eric is seeing.
...
> For both filesystems, the security xattrs are about 32.17 and 34.87 bytes
> respectively.

Can you triple-check the inode size on your fs, for good measure?

    dumpe2fs -h /dev/whatever | grep "Inode size"

> I also see a similar problem with filefrag.

Turns out it's not a problem, it's an undocumented & surprising "feature."  :(

    /* For inline data all offsets should be in bytes, not blocks */
    if (fm_extent->fe_flags & FIEMAP_EXTENT_DATA_INLINE)
        blk_shift = 0;

because ... ?  (The commit which added it didn't mention anything about it.)  But I guess it does mean that for at least those files, the xattr data is actually inline.

> At this point, I'm not sure why we get into the mbcache path when
> SELinux is enabled. As mentioned in one of my earlier replies to
> Andreas, I did see actual calls into ext4_xattr_cache.
>
> There seems to be one difference between the 3.11 kernel and the 2.6
> kernel in set_inode_init_security(). There is an additional attempt to
> initialize the evm xattr. But I do not seem to be seeing any evm xattr
> in any file.
>
> I will continue to try to find out how we get into the mbcache path.
> Please let me know if anyone has any suggestion.

Sorry we got so far off the thread of the original patches. But it seems like a mystery worth solving. Perhaps in ext4_xattr_set_handle() you can instrument the case(s) where it gets into ext4_xattr_block_set(). Or most simply, just printk the inode number in ext4_xattr_block_set() so you can look at them later via debugfs.
And in here,

    } else {
        error = ext4_xattr_ibody_set(handle, inode, &i, &is);
        if (!error && !bs.s.not_found) {
            i.value = NULL;
            error = ext4_xattr_block_set(handle, inode, &i, &bs);
        } else if (error == -ENOSPC) {
            if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
                error = ext4_xattr_block_find(inode, &i, &bs);
                if (error)
                    goto cleanup;
            }
            error = ext4_xattr_block_set(handle, inode, &i, &bs);

maybe print out in the ext4_xattr_ibody_set() error case what the size of the xattr is, and probably the inode number again for good measure, to get an idea of what's causing it to fail to land in the inode?

-Eric

> Thanks,
> Mak.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 09/11/2013 09:25 PM, Theodore Ts'o wrote:
> On Wed, Sep 11, 2013 at 03:48:57PM -0500, Eric Sandeen wrote:
>> So at this point I think it's up to Mak to figure out why on his system,
>> aim7 is triggering mbcache codepaths.
>
> Yes, the next thing is to see if on his systems, whether or not he's
> seeing external xattr blocks.
>
> - Ted

I seem to be seeing the same thing as Eric is seeing. On one of my systems,

    # find / -mount -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
    2725655
    # df -i /
    Filesystem                  Inodes  IUsed   IFree IUse% Mounted on
    /dev/mapper/vg_dhg1-lv_root 1974272 84737 1889535    5% /

    # find /home -mount -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
    274173
    # df -i /home
    Filesystem                  Inodes IUsed  IFree IUse% Mounted on
    /dev/mapper/vg_dhg1-lv_home 192384  7862 184522    5% /home

For both filesystems, the security xattr values average about 32.17 and 34.87 bytes respectively.

I also see a similar problem with filefrag.

    # filefrag -xv /bin/sh
    Filesystem type is: ef53
    File size of /bin/sh is 938736 (230 blocks, blocksize 4096)
     ext logical physical expected length flags
       0       0 23622459548            100 not_aligned,inline
    /bin/sh: 1 extent found

    # getfattr -m - -d /bin/sh
    getfattr: Removing leading '/' from absolute path names
    # file: bin/sh
    security.selinux="system_u:object_r:shell_exec_t:s0"

    debugfs: stat /bin/sh
    Inode: 1441795   Type: symlink   Mode: 0777   Flags: 0x0
    Generation: 3470616846   Version: 0x00000000:00000001
    User: 0   Group: 0   Size: 4
    File ACL: 0   Directory ACL: 0
    Links: 1   Blockcount: 0
    Fragment:  Address: 0   Number: 0   Size: 0
    ctime: 0x50c2779d:ad792a58 -- Fri Dec  7 16:11:25 2012
    atime: 0x52311211:006d1658 -- Wed Sep 11 19:00:01 2013
    mtime: 0x50c2779d:ad792a58 -- Fri Dec  7 16:11:25 2012
    crtime: 0x50c2779d:ad792a58 -- Fri Dec  7 16:11:25 2012
    Size of extra inode fields: 28
    Extended attributes stored in inode body:
      selinux = "system_u:object_r:bin_t:s0\000" (27)
    Fast_link_dest: bash

At this point, I'm not sure why we get into the mbcache path when SELinux is enabled. As mentioned in one of my earlier replies to Andreas, I did see actual calls into ext4_xattr_cache.

There seems to be one difference between the 3.11 kernel and the 2.6 kernel in set_inode_init_security(). There is an additional attempt to initialize the evm xattr. But I do not seem to be seeing any evm xattr in any file.

I will continue to try to find out how we get into the mbcache path. Please let me know if anyone has any suggestion.

Thanks,
Mak.
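The per-file averages quoted above can be reproduced with a quick back-of-the-envelope check. Note the inode-in-use counts are read off the (partly garbled) df -i output, so treat them as assumptions; they are consistent with the 32.17- and 34.87-byte averages stated in the email.

```python
# Bytes of security xattr values (getfattr --only-values ... | wc -c)
# paired with inodes in use (IUsed from df -i); the inode counts are
# reconstructed from the quoted df output, so treat them as assumptions.
samples = {
    "/":     (2725655, 84737),
    "/home": (274173, 7862),
}

averages = {mnt: round(total / used, 2) for mnt, (total, used) in samples.items()}
print(averages)  # {'/': 32.17, '/home': 34.87}
```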
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On Wed, Sep 11, 2013 at 03:48:57PM -0500, Eric Sandeen wrote:
> So at this point I think it's up to Mak to figure out why on his system,
> aim7 is triggering mbcache codepaths.

Yes, the next thing is to see whether or not he's seeing external xattr blocks on his systems.

- Ted
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 9/11/13 3:32 PM, David Lang wrote:
> On Wed, 11 Sep 2013, Eric Sandeen wrote:
>>> The reason why I'm pushing here is that mbcache shouldn't be showing
>>> up in the profiles at all if there is no external xattr block. And so
>>> if newer versions of SELinux (like Andreas, I've been burned by
>>> SELinux too many times in the past, so I don't use SELinux on any of
>>> my systems) are somehow causing mbcache to get triggered, we should
>>> figure this out and understand what's really going on.
>>
>> selinux, from an fs allocation behavior perspective, is simply setxattr.
>
> what you are missing is that Ted is saying that unless you are using
> xattrs, the mbcache should not show up at all.
>
> The fact that you are using SELinux, and SELinux sets the xattrs, is
> what makes this show up on your system, but other people who don't
> use SELinux (and so don't have any xattrs set) don't see the same
> bottleneck.

Sure, I understand that quite well. But Ted was also saying that perhaps selinux had "gotten piggy" and was causing this. I've shown that it hasn't.

This matters because unless the selinux xattrs go out of the inode into their own block, mbcache should STILL not come into it at all. And for attrs < 100 bytes, they stay in the inode. And on inspection, my SELinux boxes have no external attr blocks allocated.

mbcache only handles extended attributes that live in separately-allocated blocks, and selinux does not cause that on its own.

So... selinux by itself should not be triggering any mbcache codepaths. Ted suggested that "selinux had gotten piggy" so I checked, and showed that it hadn't. That's all.

So at this point I think it's up to Mak to figure out why on his system, aim7 is triggering mbcache codepaths.

-Eric

> David Lang
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On Wed, 11 Sep 2013, Eric Sandeen wrote:
>> The reason why I'm pushing here is that mbcache shouldn't be showing
>> up in the profiles at all if there is no external xattr block. And so
>> if newer versions of SELinux (like Andreas, I've been burned by
>> SELinux too many times in the past, so I don't use SELinux on any of
>> my systems) are somehow causing mbcache to get triggered, we should
>> figure this out and understand what's really going on.
>
> selinux, from an fs allocation behavior perspective, is simply setxattr.

what you are missing is that Ted is saying that unless you are using xattrs, the mbcache should not show up at all.

The fact that you are using SELinux, and SELinux sets the xattrs, is what makes this show up on your system, but other people who don't use SELinux (and so don't have any xattrs set) don't see the same bottleneck.

David Lang
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 9/11/13 11:49 AM, Eric Sandeen wrote:
> On 9/11/13 6:30 AM, Theodore Ts'o wrote:
>> On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>>> Above doesn't tell us the prevalence of various contexts on the actual
>>> system, but they are all under 100 bytes in any case.
>>
>> OK, so in other words, on your system i_file_acl and i_file_acl_high
>> (which is where we store the block # for the external xattr block)
>> should always be zero for all inodes, yes?
>
> Oh, hum - ok, so that would have been the better thing to check (or at
> least an additional thing).
>
> # find / -xdev -exec filefrag -x {} \; | awk -F : '{print $NF}' | sort | uniq -c
>
> Finds quite a lot that claim to have external blocks, but it seems broken:
>
> # filefrag -xv /var/lib/yum/repos/x86_64/6Server/epel
> Filesystem type is: ef53
> File size of /var/lib/yum/repos/x86_64/6Server/epel is 4096 (1 block, blocksize 4096)
>  ext logical physical expected length flags
>    0       0 32212996252            100 not_aligned,inline
> /var/lib/yum/repos/x86_64/6Server/epel: 1 extent found
>
> So _filefrag_ says it has a block (at a 120T physical address not on my fs!)

Oh, this is the special-but-not-documented "print inline extents in bytes not blocks" behavior.  :(

I'll send a patch to ignore inline extents on fiemap calls to make this easier, but in the meantime, neither my RHEL6 root nor my F17 root have any out-of-inode selinux xattrs on 256-byte-inode filesystems.

So selinux alone should not be exercising mbcache much, if at all, w/ 256-byte inodes.

-Eric
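The "bytes not blocks" surprise can be modeled in a few lines. This is a hypothetical sketch of filefrag's unit conversion, not the actual e2fsprogs source; FIEMAP_EXTENT_DATA_INLINE is 0x200 in linux/fiemap.h.

```python
FIEMAP_EXTENT_DATA_INLINE = 0x200  # from linux/fiemap.h

def reported_physical(fe_physical_bytes, blocksize, fe_flags):
    # fiemap returns byte offsets; filefrag normally right-shifts them
    # into blocksize units, but the undocumented special case keeps
    # inline extents in bytes (blk_shift = 0), so a raw byte offset is
    # printed where a block number is expected.
    blk_shift = 0
    if not (fe_flags & FIEMAP_EXTENT_DATA_INLINE):
        blk_shift = blocksize.bit_length() - 1  # log2(blocksize)
    return fe_physical_bytes >> blk_shift

# A normal extent at byte offset 409600 on a 4k fs reports block 100;
# the same offset flagged inline reports 409600 -- an absurdly large
# "block" like the 120T address seen above.
print(reported_physical(409600, 4096, 0))                          # 100
print(reported_physical(409600, 4096, FIEMAP_EXTENT_DATA_INLINE))  # 409600
```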
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 9/11/13 6:30 AM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>> Above doesn't tell us the prevalence of various contexts on the actual
>> system, but they are all under 100 bytes in any case.
>
> OK, so in other words, on your system i_file_acl and i_file_acl_high
> (which is where we store the block # for the external xattr block)
> should always be zero for all inodes, yes?

Oh, hum - ok, so that would have been the better thing to check (or at least an additional thing).

    # find / -xdev -exec filefrag -x {} \; | awk -F : '{print $NF}' | sort | uniq -c

This finds quite a lot that claim to have external blocks, but it seems broken:

    # filefrag -xv /var/lib/yum/repos/x86_64/6Server/epel
    Filesystem type is: ef53
    File size of /var/lib/yum/repos/x86_64/6Server/epel is 4096 (1 block, blocksize 4096)
     ext logical physical expected length flags
       0       0 32212996252            100 not_aligned,inline
    /var/lib/yum/repos/x86_64/6Server/epel: 1 extent found

So _filefrag_ says it has a block (at a 120T physical address not on my fs!)

And yet it's a small attr:

    # getfattr -m - -d /var/lib/yum/repos/x86_64/6Server/epel
    getfattr: Removing leading '/' from absolute path names
    # file: var/lib/yum/repos/x86_64/6Server/epel
    security.selinux="unconfined_u:object_r:rpm_var_lib_t:s0"

And debugfs thinks it's stored within the inode:

    debugfs: stat var/lib/yum/repos/x86_64/6Server/epel
    Inode: 1968466   Type: directory   Mode: 0755   Flags: 0x8
    Generation: 2728788146   Version: 0x00000000:00000001
    User: 0   Group: 0   Size: 4096
    File ACL: 0   Directory ACL: 0
    Links: 2   Blockcount: 8
    Fragment:  Address: 0   Number: 0   Size: 0
    ctime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
    atime: 0x522fc8fa:62eb2d90 -- Tue Sep 10 20:35:54 2013
    mtime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
    crtime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
    Size of extra inode fields: 28
    Extended attributes stored in inode body:
      selinux = "unconfined_u:object_r:rpm_var_lib_t:s0\000" (39)
    EXTENTS:
    (0): 7873422

Sooo... seems like filefrag -x is buggy and can't be trusted.  :(

> Thavatchai, can you check to see whether or not this is true on your
> system? You can use debugfs on the file system, and then use the
> "stat" command to sample various inodes in your system. Or I can make
> a version of e2fsck which counts the number of inodes with external
> xattr blocks --- it sounds like this is something we should do anyway.
>
> One difference might be that Eric ran this test on RHEL 6, and
> Thavatchai is using an upstream kernel, so maybe this bloat has
> been added recently?

It's a userspace policy, so the kernel shouldn't matter... "bloat" would only come from new, longer contexts (outside the kernel).

> The reason why I'm pushing here is that mbcache shouldn't be showing
> up in the profiles at all if there is no external xattr block. And so
> if newer versions of SELinux (like Andreas, I've been burned by
> SELinux too many times in the past, so I don't use SELinux on any of
> my systems) are somehow causing mbcache to get triggered, we should
> figure this out and understand what's really going on.

selinux, from an fs allocation behavior perspective, is simply setxattr. And as I showed earlier, name+value for all of the attrs set by at least the RHEL6 selinux policy are well under 100 bytes. (Add in a bunch of other non-selinux xattrs, and you'll go out of a 256b inode, sure, but selinux on its own should not.)

> Sigh, I suppose I should figure out how to create a minimal KVM setup
> which uses SELinux just so I can see what the heck is going on

http://fedoraproject.org/en/get-fedora  ;)

-Eric

> - Ted
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
> Above doesn't tell us the prevalence of various contexts on the actual
> system, but they are all under 100 bytes in any case.

OK, so in other words, on your system i_file_acl and i_file_acl_high (which is where we store the block # for the external xattr block) should always be zero for all inodes, yes?

Thavatchai, can you check to see whether or not this is true on your system? You can use debugfs on the file system, and then use the "stat" command to sample various inodes in your system. Or I can make a version of e2fsck which counts the number of inodes with external xattr blocks --- it sounds like this is something we should do anyway.

One difference might be that Eric ran this test on RHEL 6, and Thavatchai is using an upstream kernel, so maybe this bloat has been added recently?

The reason why I'm pushing here is that mbcache shouldn't be showing up in the profiles at all if there is no external xattr block. And so if newer versions of SELinux (like Andreas, I've been burned by SELinux too many times in the past, so I don't use SELinux on any of my systems) are somehow causing mbcache to get triggered, we should figure this out and understand what's really going on.

Sigh, I suppose I should figure out how to create a minimal KVM setup which uses SELinux just so I can see what the heck is going on.

- Ted
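Ted's "a bit over 100 bytes" can be sanity-checked with a rough model of where the in-inode xattr space comes from. This is a simplification that ignores per-entry headers and value padding, and assumes the 28-byte extra-fields size that debugfs reports elsewhere in this thread; the exact usable space varies.

```python
def approx_in_inode_xattr_space(inode_size, extra_isize=28):
    # Rough model: the first 128 bytes of an ext4 inode hold the
    # good-old inode, followed by i_extra_isize bytes of extra fields
    # and a 4-byte in-body xattr header; the remainder is available
    # for in-inode xattr entries and values.
    # (Ignores per-entry headers and 4-byte value alignment.)
    EXT4_GOOD_OLD_INODE_SIZE = 128
    IBODY_HEADER_SIZE = 4
    return inode_size - EXT4_GOOD_OLD_INODE_SIZE - extra_isize - IBODY_HEADER_SIZE

print(approx_in_inode_xattr_space(256))  # ~96 bytes with default 256-byte inodes
print(approx_in_inode_xattr_space(512))  # ~352 bytes with mke2fs -I 512
```

With 256-byte inodes this leaves roughly 100 bytes, which is why a single selinux label fits but a second sizable xattr pushes the set out to an external block.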
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 9/10/13 4:02 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
>> I agree that SELinux is enabled on enterprise distributions by default,
>> but I'm also interested to know how much overhead this imposes. I would
>> expect that writing large external xattrs for each file would have quite
>> a significant performance overhead that should not be ignored. Reducing
>> the mbcache overhead is good, but eliminating it entirely is better.
>
> I was under the impression that using a 256 byte inode (which gives a
> bit over 100 bytes worth of xattr space) was plenty for SELinux. If
> it turns out that SELinux's use of xattrs has gotten especially
> piggy, then we may need to revisit the recommended inode size for
> those systems who insist on using SELinux... even if we eliminate the
> overhead associated with mbcache, the fact that files are requiring a
> separate xattr is going to seriously degrade performance.

On my RHEL6 system,

    # find / -xdev -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
    11082179

bytes of values for:

    # df -i /
    Filesystem                  Inodes  IUsed   IFree IUse% Mounted on
    /dev/mapper/vg_bp05-lv_root 3276800 280785 2996015    9% /

280785 inodes used, so 11082179/280785 = ~39.5 bytes per value on average, plus:

    # echo -n "security.selinux" | wc -c
    16

16 bytes for the name gives only about 55-56 bytes per selinux attr on average. So nope, not "especially piggy" on average.

Another way to do it is this; list all possible file contexts, and make a histogram of sizes:

    # for CONTEXT in `semanage fcontext -l | awk '{print $NF}'`; do echo -n $CONTEXT | wc -c; done | sort -n | uniq -c
          1 7
         33 8
        356 26
         14 27
         14 28
         37 29
         75 30
        237 31
        295 32
        425 33
        324 34
        445 35
        548 36
        229 37
        193 38
        181 39
        259 40
         81 41
        108 42
         96 43
         55 44
         55 45
         16 46
         41 47
         23 48
         28 49
         36 50
         10 51
         10 52
          5 54
          2 57

So a 57-byte value is the max, but there aren't many of the larger values.
Above doesn't tell us the prevalence of various contexts on the actual system, but they are all under 100 bytes in any case.

-Eric

> - Ted
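Eric's arithmetic above, spelled out (numbers copied from the quoted output):

```python
total_value_bytes = 11082179            # getfattr --only-values ... | wc -c
inodes_used = 280785                    # IUsed from df -i /
name_bytes = len("security.selinux")    # 16, per echo -n ... | wc -c

avg_value = total_value_bytes / inodes_used   # average value size per inode
avg_attr = avg_value + name_bytes             # average name + value size

print(round(avg_value, 1))  # 39.5 bytes per value on average
print(round(avg_attr, 1))   # 55.5 bytes per selinux attr on average
```

Well under the ~100 bytes of in-inode xattr room in a 256-byte inode, which supports Eric's "not especially piggy" conclusion.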
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 09/10/2013 09:02 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
>> I agree that SELinux is enabled on enterprise distributions by default,
>> but I'm also interested to know how much overhead this imposes. I would
>> expect that writing large external xattrs for each file would have quite
>> a significant performance overhead that should not be ignored. Reducing
>> the mbcache overhead is good, but eliminating it entirely is better.
>
> I was under the impression that using a 256 byte inode (which gives a
> bit over 100 bytes worth of xattr space) was plenty for SELinux. If
> it turns out that SELinux's use of xattrs has gotten especially
> piggy, then we may need to revisit the recommended inode size for
> those systems who insist on using SELinux... even if we eliminate the
> overhead associated with mbcache, the fact that files are requiring a
> separate xattr is going to seriously degrade performance.
>
> - Ted

Thank you Andreas and Ted for the explanations and comments. Yes, I see both of your points now. Though we may reduce the mbcache overhead, due to the overhead of additional xattr I/O it would be better to provide some data to help users or distros determine whether they will be better off completely disabling SELinux or increasing the inode size.

I will go ahead and run the suggested experiments and get back with the results.

Thanks,
Mak.
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 2013-09-06, at 6:23 AM, Thavatchai Makphaibulchoke wrote:
> On 09/06/2013 05:10 AM, Andreas Dilger wrote:
>> On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
>>> No, I did not do anything special, including changing an inode's size.
>>> I just used the profile data, which indicated the mb_cache module as
>>> one of the bottlenecks. Please see below for perf data from one of the
>>> new_fserver runs, which also shows some mb_cache activities.
>>>
>>>     |--3.51%-- __mb_cache_entry_find
>>>     |          mb_cache_entry_find_first
>>>     |          ext4_xattr_cache_find
>>>     |          ext4_xattr_block_set
>>>     |          ext4_xattr_set_handle
>>>     |          ext4_initxattrs
>>>     |          security_inode_init_security
>>>     |          ext4_init_security
>>
>> Looks like this is some large security xattr, or enough smaller
>> xattrs to exceed the ~120 bytes of in-inode xattr storage. How
>> big is the SELinux xattr (assuming that is what it is)?
>>
>> You could try a few different things here:
>> - disable selinux completely (boot with "selinux=0" on the kernel
>>   command line) and see how much faster it is
>
> Sorry, I'm not familiar enough with SELinux to say how big its xattr
> is. Anyway, I'm positive that SELinux is what is generating these
> xattrs. With SELinux disabled, there seem to be no calls to
> ext4_xattr_cache_find().

What is the relative performance of your benchmark with SELinux disabled? While the oprofile graphs will be of passing interest to see that the mbcache overhead is gone, they will not show the reduction in disk IO from not writing/updating the external xattr blocks at all.

>> - format your ext4 filesystem with larger inodes (-I 512) and see
>>   if this is an improvement or not. That depends on the size of
>>   the selinux xattrs and if they will fit into the extra 256 bytes
>>   of xattr space these larger inodes will give you. The performance
>>   might also be worse, since there will be more data to read/write
>>   for each inode, but it would avoid seeking to the xattr blocks.
>
> Thanks for the above suggestions. Could you please clarify if we are
> attempting to look for a workaround here? Since we agree the way
> mb_cache uses one global spinlock is incorrect and SELinux exposes
> the problem (which is not uncommon with Enterprise installations),
> I believe we should look at fixing it (patch 1/2). As you also
> mentioned, this will also impact both the ext2 and ext3 filesystems.

I agree that SELinux is enabled on enterprise distributions by default, but I'm also interested to know how much overhead this imposes. I would expect that writing large external xattrs for each file would have quite a significant performance overhead that should not be ignored. Reducing the mbcache overhead is good, but eliminating it entirely is better.

Depending on how much overhead SELinux has, it might be important to spend more time to optimize it (not just the mbcache part), or users may consider disabling SELinux entirely on systems where they care about peak performance.

> Anyway, please let me know if you still think any of the above
> experiments is relevant.

You have already done one of the tests that I'm interested in (the above test which showed that disabling SELinux removed the mbcache overhead). What I'm interested in is the actual performance (or relative performance, if you are not allowed to publish the actual numbers) of your AIM7 benchmark with SELinux enabled versus SELinux disabled.

Next would be a new test that has SELinux enabled, but formatting the filesystem with 512-byte inodes instead of the ext4 default of 256-byte inodes. If this makes a significant improvement, it would potentially mean users and the upstream distros should use different formatting options along with SELinux. This is less clearly a win, since I don't know enough details of how SELinux uses xattrs (I always disable it, so I don't have any systems to check).

Cheers, Andreas
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
> I agree that SELinux is enabled on enterprise distributions by default,
> but I'm also interested to know how much overhead this imposes. I would
> expect that writing large external xattrs for each file would have quite
> a significant performance overhead that should not be ignored. Reducing
> the mbcache overhead is good, but eliminating it entirely is better.

I was under the impression that using a 256 byte inode (which gives a bit over 100 bytes worth of xattr space) was plenty for SELinux. If it turns out that SELinux's use of xattrs has gotten especially piggy, then we may need to revisit the recommended inode size for those systems who insist on using SELinux... even if we eliminate the overhead associated with mbcache, the fact that files are requiring a separate xattr block is going to seriously degrade performance.

- Ted
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 09/06/2013 05:10 AM, Andreas Dilger wrote:
> On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
>> No, I did not do anything special, including changing an inode's size.
>> I just used the profile data, which indicated the mb_cache module as
>> one of the bottlenecks. Please see below for perf data from one of the
>> new_fserver runs, which also shows some mb_cache activities.
>>
>>     |--3.51%-- __mb_cache_entry_find
>>     |          mb_cache_entry_find_first
>>     |          ext4_xattr_cache_find
>>     |          ext4_xattr_block_set
>>     |          ext4_xattr_set_handle
>>     |          ext4_initxattrs
>>     |          security_inode_init_security
>>     |          ext4_init_security
>
> Looks like this is some large security xattr, or enough smaller
> xattrs to exceed the ~120 bytes of in-inode xattr storage. How
> big is the SELinux xattr (assuming that is what it is)?

Sorry, I'm not familiar enough with SELinux to say how big its xattr is. Anyway, I'm positive that SELinux is what is generating these xattrs. With SELinux disabled, there seem to be no calls to ext4_xattr_cache_find().

>> Looks like it's a bit harder to disable mbcache than I thought.
>> I ended up adding code to collect the statistics.
>>
>> With selinux enabled, for the new_fserver workload of aim7, there
>> are a total of 0x7e054201 ext4_xattr_cache_find() calls
>> that result in a hit and 0xc100 calls that do not.
>> The numbers do not seem to favor completely disabling
>> mbcache in this case.
>
> This is about a 65% hit rate, which seems reasonable.
>
> You could try a few different things here:
> - disable selinux completely (boot with "selinux=0" on the kernel
>   command line) and see how much faster it is
> - format your ext4 filesystem with larger inodes (-I 512) and see
>   if this is an improvement or not. That depends on the size of
>   the selinux xattrs and if they will fit into the extra 256 bytes
>   of xattr space these larger inodes will give you. The performance
>   might also be worse, since there will be more data to read/write
>   for each inode, but it would avoid seeking to the xattr blocks.

Thanks for the above suggestions. Could you please clarify if we are attempting to look for a workaround here? Since we agree the way mb_cache uses one global spinlock is incorrect and SELinux exposes the problem (which is not uncommon with Enterprise installations), I believe we should look at fixing it (patch 1/2). As you also mentioned, this will also impact both the ext2 and ext3 filesystems.

Anyway, please let me know if you still think any of the above experiments is relevant.

Thanks,
Mak.

> Cheers, Andreas
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
> On 09/05/2013 02:35 AM, Theodore Ts'o wrote:
>> How did you gather these results? The mbcache is only used if you
>> are using extended attributes, and only if the extended attributes
>> don't fit in the inode's extra space.
>>
>> I checked aim7, and it doesn't do any extended attribute operations.
>> So why are you seeing differences? Are you doing something like
>> deliberately using 128 byte inodes (which is not the default inode
>> size), and then enabling SELinux, or some such?
>
> No, I did not do anything special, including changing an inode's size.
> I just used the profile data, which indicated the mb_cache module as
> one of the bottlenecks. Please see below for perf data from one of the
> new_fserver runs, which also shows some mb_cache activities.
>
>     |--3.51%-- __mb_cache_entry_find
>     |          mb_cache_entry_find_first
>     |          ext4_xattr_cache_find
>     |          ext4_xattr_block_set
>     |          ext4_xattr_set_handle
>     |          ext4_initxattrs
>     |          security_inode_init_security
>     |          ext4_init_security

Looks like this is some large security xattr, or enough smaller xattrs to exceed the ~120 bytes of in-inode xattr storage. How big is the SELinux xattr (assuming that is what it is)?

> Looks like it's a bit harder to disable mbcache than I thought.
> I ended up adding code to collect the statistics.
>
> With selinux enabled, for the new_fserver workload of aim7, there
> are a total of 0x7e054201 ext4_xattr_cache_find() calls
> that result in a hit and 0xc100 calls that do not.
> The numbers do not seem to favor completely disabling
> mbcache in this case.

This is about a 65% hit rate, which seems reasonable.

You could try a few different things here:
- disable selinux completely (boot with "selinux=0" on the kernel
  command line) and see how much faster it is
- format your ext4 filesystem with larger inodes (-I 512) and see
  if this is an improvement or not. That depends on the size of
  the selinux xattrs and if they will fit into the extra 256 bytes
  of xattr space these larger inodes will give you. The performance
  might also be worse, since there will be more data to read/write
  for each inode, but it would avoid seeking to the xattr blocks.

Cheers, Andreas
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 09/04/2013 08:00 PM, Andreas Dilger wrote:
> In the past, I've raised the question of whether mbcache is even
> useful on real-world systems. Essentially, this is providing a
> "deduplication" service for ext2/3/4 xattr blocks that are identical.
> The question is how often this is actually the case in modern use?
> The original design was for allowing external ACL blocks to be
> shared between inodes, at a time when ACLs were pretty much the
> only xattrs stored on inodes.
>
> The question now is whether there are common uses where all of the
> xattrs stored on multiple inodes are identical? If that is not the
> case, mbcache is just adding overhead and should just be disabled
> entirely instead of just adding less overhead.
>
> There aren't good statistics on the hit rate for mbcache, but it
> might be possible to generate some with systemtap or similar to
> see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.
>
> Cheers, Andreas

Looks like it's a bit harder to disable mbcache than I thought. I ended up adding code to collect the statistics.

With selinux enabled, for the new_fserver workload of aim7, there are a total of 0x7e054201 ext4_xattr_cache_find() calls that result in a hit and 0xc100 calls that do not. The numbers do not seem to favor completely disabling mbcache in this case.

Thanks,
Mak.
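Andreas's description of mbcache as a "deduplication service" can be illustrated with a toy model. This is a hypothetical simplification, not the fs/mbcache.c data structures (which key on a 32-bit hash and handle collisions): identical external xattr blocks, such as many files carrying the same SELinux context, share one on-disk block via a refcount, and that sharing is exactly what makes a cache hit.

```python
import hashlib
import itertools

class MbCacheToy:
    """Toy dedup cache for external xattr blocks (hypothetical model)."""

    def __init__(self):
        self._blocks = {}                    # content hash -> [block_no, refcount]
        self._alloc = itertools.count(1000)  # fake block allocator

    def find_or_create(self, xattr_bytes):
        """Return (block_no, hit): share an identical existing block if any."""
        key = hashlib.sha1(xattr_bytes).digest()
        entry = self._blocks.get(key)
        if entry is not None:                # cache hit: bump refcount, share block
            entry[1] += 1
            return entry[0], True
        block_no = next(self._alloc)         # cache miss: allocate a new block
        self._blocks[key] = [block_no, 1]
        return block_no, False

cache = MbCacheToy()
ctx = b'security.selinux=unconfined_u:object_r:user_home_t:s0'  # hypothetical label
b1, hit1 = cache.find_or_create(ctx)  # first file with this label: miss
b2, hit2 = cache.find_or_create(ctx)  # second file, same label: hit, shares b1
b3, hit3 = cache.find_or_create(b'security.selinux=system_u:object_r:bin_t:s0')
print(hit1, hit2, hit3, b1 == b2, b1 == b3)  # False True False True False
```

Since SELinux assigns the same context to large classes of files, a workload that pushes xattrs out of the inode will see many identical blocks, which is consistent with the high hit count reported above.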
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 09/05/2013 02:35 AM, Theodore Ts'o wrote:
> How did you gather these results?  The mbcache is only used if you are
> using extended attributes, and only if the extended attributes don't
> fit in the inode's extra space.
>
> I checked aim7, and it doesn't do any extended attribute operations.
> So why are you seeing differences?  Are you doing something like
> deliberately using 128 byte inodes (which is not the default inode
> size), and then enabling SELinux, or some such?
>
>    - Ted
>

No, I did not do anything special, including changing an inode's size. I
just used the profile data, which indicated the mb_cache module as one of
the bottlenecks. Please see below for perf data from one of the
new_fserver runs, which also shows some mb_cache activity.

|--3.51%-- __mb_cache_entry_find
|          mb_cache_entry_find_first
|          ext4_xattr_cache_find
|          ext4_xattr_block_set
|          ext4_xattr_set_handle
|          ext4_initxattrs
|          security_inode_init_security
|          ext4_init_security
|          __ext4_new_inode
|          ext4_create
|          vfs_create
|          lookup_open
|          do_last
|          path_openat
|          do_filp_open
|          do_sys_open
|          SyS_open
|          sys_creat
|          system_call
|          __creat_nocancel
|          |
|          |--16.67%-- 0x11fe2c0
|          |

Thanks,
Mak.
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On Wed, Sep 04, 2013 at 10:39:14AM -0600, T Makphaibulchoke wrote:
>
> Here are the performance improvements in some of the aim7 workloads,

How did you gather these results?  The mbcache is only used if you are
using extended attributes, and only if the extended attributes don't
fit in the inode's extra space.

I checked aim7, and it doesn't do any extended attribute operations.
So why are you seeing differences?  Are you doing something like
deliberately using 128 byte inodes (which is not the default inode
size), and then enabling SELinux, or some such?

   - Ted
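Ted's point about the inode's extra space comes down to simple
arithmetic. A rough sketch, assuming the 128-byte legacy inode core, an
i_extra_isize of 32 bytes, and a 4-byte in-inode xattr header (the first
is fixed by the on-disk format; the latter two are typical values, not
guaranteed for every filesystem):

```python
GOOD_OLD_INODE_SIZE = 128  # fixed legacy ext2/3/4 inode core
EXTRA_ISIZE = 32           # assumed i_extra_isize (set at mkfs time)
XATTR_HEADER = 4           # assumed in-inode xattr header size

def in_inode_xattr_space(inode_size):
    """Approximate bytes left inside the inode for xattr storage."""
    return max(0, inode_size - GOOD_OLD_INODE_SIZE - EXTRA_ISIZE - XATTR_HEADER)

for size in (128, 256, 512):
    print(f"inode size {size}: ~{in_inode_xattr_space(size)} bytes for in-inode xattrs")
```

With 128-byte inodes there is no in-inode xattr space at all, so every
SELinux label is forced out to an external xattr block and through
mbcache, which would explain the profile above.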
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 09/04/2013 08:00 PM, Andreas Dilger wrote:
>
> In the past, I've raised the question of whether mbcache is even
> useful on real-world systems.  Essentially, this is providing a
> "deduplication" service for ext2/3/4 xattr blocks that are identical.
> The question is how often this is actually the case in modern use?
> The original design was for allowing external ACL blocks to be
> shared between inodes, at a time when ACLs were pretty much the
> only xattrs stored on inodes.
>
> The question now is whether there are common uses where all of the
> xattrs stored on multiple inodes are identical?  If that is not the
> case, mbcache is just adding overhead and should just be disabled
> entirely instead of just adding less overhead.
>
> There aren't good statistics on the hit rate for mbcache, but it
> might be possible to generate some with systemtap or similar to
> see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.
>
> Cheers, Andreas
>

Thanks Andreas for the comments.

Since I'm not familiar with systemtap, I'm thinking probably the quickest
and simplest way is to re-run aim7 and Swingbench with mbcache disabled
for comparison. Please let me know if you have any other benchmark
suggestion, or if you think systemtap on ext4_xattr_cache_find() would
give a more accurate measurement.

Thanks,
Mak.
Re: [PATCH v3 0/2] ext4: increase mbcache scalability
On 2013-09-04, at 10:39 AM, T Makphaibulchoke wrote:
> This patch intends to improve the scalability of an ext filesystem,
> particularly ext4.

In the past, I've raised the question of whether mbcache is even
useful on real-world systems. Essentially, this is providing a
"deduplication" service for ext2/3/4 xattr blocks that are identical.
The question is how often this is actually the case in modern use?
The original design was for allowing external ACL blocks to be
shared between inodes, at a time when ACLs were pretty much the
only xattrs stored on inodes.

The question now is whether there are common uses where all of the
xattrs stored on multiple inodes are identical? If that is not the
case, mbcache is just adding overhead and should just be disabled
entirely instead of just adding less overhead.

There aren't good statistics on the hit rate for mbcache, but it
might be possible to generate some with systemtap or similar to
see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.

Cheers, Andreas

> The patch consists of two parts. The first part introduces a higher
> degree of parallelism to the usages of the mb_cache and
> mb_cache_entries and impacts all ext filesystems.
>
> The second part of the patch further increases the scalability of
> an ext4 filesystem by having each ext4 filesystem allocate and use
> its own private mbcache structure, instead of sharing a single
> mbcache structure across all ext4 filesystems.
>
> Here are some of the benchmark results with the changes.
>
> On a 90 core machine:
>
> Here are the performance improvements in some of the aim7 workloads:
>
> ----------------------------
> |             | % increase |
> ----------------------------
> | alltests    | 11.85      |
> ----------------------------
> | custom      | 14.42      |
> ----------------------------
> | fserver     | 21.36      |
> ----------------------------
> | new_dbase   | 5.59       |
> ----------------------------
> | new_fserver | 21.45      |
> ----------------------------
> | shared      | 12.84      |
> ----------------------------
>
> For the Swingbench dss workload, with a 16 GB database:
>
> --------------------------------------------------------------------------------
> | Users         | 100  | 200  | 300  | 400  | 500  | 600  | 700  | 800  | 900  |
> --------------------------------------------------------------------------------
> | % improvement | 8.46 | 8.00 | 7.35 | -.313| 1.09 | 0.69 | 0.30 | 2.18 | 5.23 |
> --------------------------------------------------------------------------------
> | % improvement |45.66 |47.62 |34.54 |25.15 |15.29 | 3.38 | -8.7 |-4.98 |-7.86 |
> | without using |      |      |      |      |      |      |      |      |      |
> | shared memory |      |      |      |      |      |      |      |      |      |
> --------------------------------------------------------------------------------
>
> For SPECjbb2013, composite run:
>
> ------------------------------------------------
> |               | max-jOPS | critical-jOPS     |
> ------------------------------------------------
> | % improvement | 5.99     | N/A               |
> ------------------------------------------------
>
> On an 80 core machine:
>
> The aim7 results for most of the workloads turn out to be the same.
>
> Here are the results of the Swingbench dss workload:
>
> --------------------------------------------------------------------------------
> | Users         | 100  | 200  | 300  | 400  | 500  | 600  | 700  | 800  | 900  |
> --------------------------------------------------------------------------------
> | % improvement |-1.79 | 0.37 | 1.36 | 0.08 | 1.66 | 2.09 | 1.16 | 1.48 | 1.92 |
> --------------------------------------------------------------------------------
>
> The changes have been tested with ext4 xfstests to verify that no
> regression has been introduced.
>
> Changed in v3:
> - New diff summary
>
> Changed in v2:
> - New performance data
> - New diff summary
>
> T Makphaibulchoke (2):
>   mbcache: decoupling the locking of local from global data
>   ext4: each filesystem creates and uses its own mb_cache
>
>  fs/ext4/ext4.h          |   1 +
>  fs/ext4/super.c         |  24 ++--
>  fs/ext4/xattr.c         |  51
>  fs/ext4/xattr.h         |   6 +-
>  fs/mbcache.c            | 306 +++-
>  include/linux/mbcache.h |  10 +-
>  6 files changed, 277 insertions(+), 121 deletions(-)
>
> --
> 1.7.11.3

Cheers, Andreas