Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-11 Thread Eric Sandeen
On 9/11/13 3:36 PM, Thavatchai Makphaibulchoke wrote:

> I seem to be seeing the same thing as Eric is seeing.

...
> For both filesystems, the security xattrs average about 32.17 and 34.87 bytes
> respectively.
...

Can you triple-check the inode size on your fs, for good measure?

dumpe2fs -h /dev/whatever | grep "Inode size"

> I also see a similar problem with filefrag.

turns out it's not a problem, it's an undocumented & surprising "feature."  :(

	/* For inline data all offsets should be in bytes, not blocks */
	if (fm_extent->fe_flags & FIEMAP_EXTENT_DATA_INLINE)
		blk_shift = 0;

because ... ?  (the commit which added it didn't mention anything about it).

But I guess it does mean for at least those files, the xattr data is actually 
inline.
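Concretely: FIEMAP always reports byte offsets, and filefrag normally shifts
them down to blocks, but with blk_shift forced to 0 for inline extents the raw
byte values leak through as if they were block numbers. A sketch of the
conversion it skips (flag constants believed to match
include/uapi/linux/fiemap.h; the helper name is made up):

```python
# Flag values believed to match include/uapi/linux/fiemap.h; worth
# double-checking against your kernel headers.
FIEMAP_EXTENT_NOT_ALIGNED = 0x00000100
FIEMAP_EXTENT_DATA_INLINE = 0x00000200

def fiemap_to_blocks(fe_logical, fe_physical, fe_length, blocksize=4096):
    """Convert raw FIEMAP byte offsets/lengths to block units.

    The FIEMAP ioctl reports bytes; filefrag divides by the blocksize
    for normal extents but skips that step (blk_shift = 0) for inline
    extents, which is why inline extents print huge "physical" numbers
    that look like addresses off the end of the filesystem.
    """
    return (fe_logical // blocksize,
            fe_physical // blocksize,
            fe_length // blocksize)

# A 4096-byte inline extent whose raw byte offsets were printed unconverted:
print(fiemap_to_blocks(0, 32212996252, 4096))  # -> (0, 7864501, 1)
```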

> At this point, I'm not sure why we get into the mbcache path when
> SELinux is enabled. As mentioned in one of my earlier replies to
> Andreas, I did see actual calls into ext4_xattr_cache.
> 
> There seems to be one difference between the 3.11 kernel and the 2.6
> kernel in security_inode_init_security(). There is an additional
> attempt to initialize the EVM xattr. But I do not seem to be seeing
> any EVM xattr on any file.
> 
> I will continue to try to find out how we get into the mbcache path.
> Please let me know if anyone has any suggestion.

Sorry we got so far off the thread of the original patches.

But it seems like a mystery worth solving.

Perhaps in ext4_xattr_set_handle() you can instrument the case(s) where it
gets into ext4_xattr_block_set().

Or most simply, just printk the inode number in ext4_xattr_block_set() so
you can look at them later via debugfs.

And in here,

	} else {
		error = ext4_xattr_ibody_set(handle, inode, &i, &is);
		if (!error && !bs.s.not_found) {
			i.value = NULL;
			error = ext4_xattr_block_set(handle, inode, &i, &bs);
		} else if (error == -ENOSPC) {
			if (EXT4_I(inode)->i_file_acl && !bs.s.base) {
				error = ext4_xattr_block_find(inode, &i, &bs);
				if (error)
					goto cleanup;
			}
			error = ext4_xattr_block_set(handle, inode, &i, &bs);

maybe print out, in the ext4_xattr_ibody_set() error case, what the size of
the xattr is, and probably the inode number again for good measure, to get
an idea of what's causing it to fail to land in the inode?
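For instance (an untested sketch against the 3.11-era fs/ext4/xattr.c; the
message text is purely illustrative):

```c
	/* At the top of ext4_xattr_block_set(): */
	printk(KERN_DEBUG "ext4_xattr_block_set: ino %lu\n", inode->i_ino);

	/* And in the -ENOSPC path above, log what failed to fit in-inode: */
	error = ext4_xattr_ibody_set(handle, inode, &i, &is);
	if (error == -ENOSPC)
		printk(KERN_DEBUG
		       "ext4 xattr: ino %lu: \"%s\" value_len %zu "
		       "did not fit in inode\n",
		       inode->i_ino, i.name, i.value_len);
```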

-Eric

> 
> Thanks,
> Mak.


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-11 Thread Thavatchai Makphaibulchoke
On 09/11/2013 09:25 PM, Theodore Ts'o wrote:
> On Wed, Sep 11, 2013 at 03:48:57PM -0500, Eric Sandeen wrote:
>>
>> So at this point I think it's up to Mak to figure out why on his system, 
>> aim7 is triggering mbcache codepaths.
>>
> 
> Yes, the next thing is to see whether or not he's seeing external
> xattr blocks on his systems.
> 
>   - Ted
> 

I seem to be seeing the same thing as Eric is seeing.

On one of my systems,

# find / -mount -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
2725655
# df -i /
Filesystem                  Inodes  IUsed    IFree IUse% Mounted on
/dev/mapper/vg_dhg1-lv_root
                           1974272  84737  1889535    5% /
# find /home -mount -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
274173
# df -i /home
Filesystem                  Inodes  IUsed    IFree IUse% Mounted on
/dev/mapper/vg_dhg1-lv_home
                            192384   7862   184522    5% /home

For both filesystems, the security xattrs average about 32.17 and 34.87 bytes, 
respectively.

I also see a similar problem with filefrag.

# filefrag -xv /bin/sh
Filesystem type is: ef53
File size of /bin/sh is 938736 (230 blocks, blocksize 4096)
 ext logical physical expected length flags
   0   0 23622459548 100 not_aligned,inline
/bin/sh: 1 extent found
 
# getfattr -m - -d /bin/sh
getfattr: Removing leading '/' from absolute path names
# file: bin/sh
security.selinux="system_u:object_r:shell_exec_t:s0"

debugfs:  stat /bin/sh
Inode: 1441795   Type: symlink    Mode:  0777   Flags: 0x0
Generation: 3470616846    Version: 0x00000000:00000001
User: 0   Group: 0   Size: 4
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 0
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x50c2779d:ad792a58 -- Fri Dec  7 16:11:25 2012
 atime: 0x52311211:006d1658 -- Wed Sep 11 19:00:01 2013
 mtime: 0x50c2779d:ad792a58 -- Fri Dec  7 16:11:25 2012
crtime: 0x50c2779d:ad792a58 -- Fri Dec  7 16:11:25 2012
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  selinux = "system_u:object_r:bin_t:s0\000" (27)
Fast_link_dest: bash

At this point, I'm not sure why we get into the mbcache path when SELinux is 
enabled.  As mentioned in one of my earlier replies to Andreas, I did see 
actual calls into ext4_xattr_cache.

There seems to be one difference between the 3.11 kernel and the 2.6 kernel in 
security_inode_init_security(). There is an additional attempt to initialize 
the EVM xattr.  But I do not seem to be seeing any EVM xattr on any file.

I will continue to try to find out how we get into the mbcache path.  Please 
let me know if anyone has any suggestion.

Thanks,
Mak.





Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-11 Thread Theodore Ts'o
On Wed, Sep 11, 2013 at 03:48:57PM -0500, Eric Sandeen wrote:
> 
> So at this point I think it's up to Mak to figure out why on his system, aim7 
> is triggering mbcache codepaths.
> 

Yes, the next thing is to see whether or not he's seeing external
xattr blocks on his systems.

- Ted


Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-11 Thread Eric Sandeen
On 9/11/13 3:32 PM, David Lang wrote:
> On Wed, 11 Sep 2013, Eric Sandeen wrote:
> 
>>> The reason why I'm pushing here is that mbcache shouldn't be showing
>>> up in the profiles at all if there is no external xattr block.  And so
>>> if newer versions of SELinux (like Andreas, I've been burned by
>>> SELinux too many times in the past, so I don't use SELinux on any of
>>> my systems) is somehow causing mbcache to get triggered, we should
>>> figure this out and understand what's really going on.
>>
>> selinux, from an fs allocation behavior perspective, is simply setxattr.
> 
> what you are missing is that Ted is saying that unless you are using xattrs, 
> the mbcache should not show up at all.
> 
> The fact that you are using SELinux, and SELinux sets the xattrs is
> what makes this show up on your system, but other people who don't
> use SELinux (and so don't have any xattrs set) don't see the same
> bottleneck.

Sure, I understand that quite well.  But Ted was also saying that perhaps 
selinux had "gotten piggy" and was causing this.  I've shown that it hasn't.


This matters because unless the selinux xattrs go out of the inode into their 
own block, mbcache should STILL not come into it at all.

And for attrs < 100 bytes, they stay in the inode.  And on inspection, my 
SELinux boxes have no external attr blocks allocated.

mbcache only handles extended attributes that live in separately-allocated 
blocks, and selinux does not cause that on its own.

Soo... selinux by itself should not be triggering any mbcache codepaths.

Ted suggested that "selinux had gotten piggy" so I checked, and showed that it 
hadn't.  That's all.

So at this point I think it's up to Mak to figure out why on his system, aim7 
is triggering mbcache codepaths.

-Eric

> David Lang



Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-11 Thread David Lang

On Wed, 11 Sep 2013, Eric Sandeen wrote:


The reason why I'm pushing here is that mbcache shouldn't be showing
up in the profiles at all if there is no external xattr block.  And so
if newer versions of SELinux (like Andreas, I've been burned by
SELinux too many times in the past, so I don't use SELinux on any of
my systems) is somehow causing mbcache to get triggered, we should
figure this out and understand what's really going on.


selinux, from an fs allocation behavior perspective, is simply setxattr.


what you are missing is that Ted is saying that unless you are using xattrs, the 
mbcache should not show up at all.


The fact that you are using SELinux, and SELinux sets the xattrs is what makes 
this show up on your system, but other people who don't use SELinux (and so 
don't have any xattrs set) don't see the same bottleneck.


David Lang


Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-11 Thread Eric Sandeen
On 9/11/13 11:49 AM, Eric Sandeen wrote:
> On 9/11/13 6:30 AM, Theodore Ts'o wrote:
>> On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>>>
>>> Above doesn't tell us the prevalence of various contexts on the actual 
>>> system,
>>> but they are all under 100 bytes in any case.
>>
>> OK, so in other words, on your system i_file_acl and i_file_acl_high
>> (which is where we store the block # for the external xattr block),
>> should always be zero for all inodes, yes?
> 
> Oh, hum - ok, so that would have been the better thing to check (or at
> least an additional thing).
> 
> # find / -xdev -exec filefrag -x {} \; | awk -F : '{print $NF}' | sort | uniq -c
> 
> Finds quite a lot that claim to have external blocks, but it seems broken:
> 
> # filefrag -xv /var/lib/yum/repos/x86_64/6Server/epel
> Filesystem type is: ef53
> File size of /var/lib/yum/repos/x86_64/6Server/epel is 4096 (1 block, 
> blocksize 4096)
>  ext logical physical expected length flags
>0   0 32212996252 100 not_aligned,inline
> /var/lib/yum/repos/x86_64/6Server/epel: 1 extent found
> 
> So _filefrag_ says it has a block (at a 120T physical address not on my fs!)

Oh, this is the special-but-not-documented "print inline extents in bytes
not blocks"  :(

I'll send a patch to ignore inline extents on fiemap calls to make this
easier, but in the meantime, neither my RHEL6 root nor my F17 root have
any out-of-inode selinux xattrs on 256-byte-inode filesystems.

So selinux alone should not be exercising mbcache much, if at all, w/ 256 byte
inodes.

-Eric


Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-11 Thread Eric Sandeen
On 9/11/13 6:30 AM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
>>
>> Above doesn't tell us the prevalence of various contexts on the actual 
>> system,
>> but they are all under 100 bytes in any case.
> 
> OK, so in other words, on your system i_file_acl and i_file_acl_high
> (which is where we store the block # for the external xattr block),
> should always be zero for all inodes, yes?

Oh, hum - ok, so that would have been the better thing to check (or at
least an additional thing).

# find / -xdev -exec filefrag -x {} \; | awk -F : '{print $NF}' | sort | uniq -c

Finds quite a lot that claim to have external blocks, but it seems broken:

# filefrag -xv /var/lib/yum/repos/x86_64/6Server/epel
Filesystem type is: ef53
File size of /var/lib/yum/repos/x86_64/6Server/epel is 4096 (1 block, blocksize 4096)
 ext logical physical expected length flags
   0   0 32212996252 100 not_aligned,inline
/var/lib/yum/repos/x86_64/6Server/epel: 1 extent found

So _filefrag_ says it has a block (at a 120T physical address not on my fs!)

And yet it's a small attr:

# getfattr -m - -d /var/lib/yum/repos/x86_64/6Server/epel
getfattr: Removing leading '/' from absolute path names
# file: var/lib/yum/repos/x86_64/6Server/epel
security.selinux="unconfined_u:object_r:rpm_var_lib_t:s0"

And debugfs thinks it's stored within the inode:

debugfs:  stat var/lib/yum/repos/x86_64/6Server/epel
Inode: 1968466   Type: directory    Mode:  0755   Flags: 0x8
Generation: 2728788146    Version: 0x00000000:00000001
User: 0   Group: 0   Size: 4096
File ACL: 0    Directory ACL: 0
Links: 2   Blockcount: 8
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
 atime: 0x522fc8fa:62eb2d90 -- Tue Sep 10 20:35:54 2013
 mtime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
crtime: 0x50b4d808:cb7dd9a8 -- Tue Nov 27 09:11:04 2012
Size of extra inode fields: 28
Extended attributes stored in inode body: 
  selinux = "unconfined_u:object_r:rpm_var_lib_t:s0\000" (39)
EXTENTS:
(0): 7873422

sooo seems like filefrag -x is buggy and can't be trusted.  :(

> Thavatchai, can you check to see whether or not this is true on your
> system?  You can use debugfs on the file system, and then use the
> "stat" command to sample various inodes in your system.  Or I can make
> a version of e2fsck which counts the number of inodes with external
> xattr blocks --- it sounds like this is something we should do anyway.
> 
> One difference might be that Eric ran this test on RHEL 6, and
> Thavatchai is using an upstream kernel, so maybe this bloat has
> been added recently?

It's a userspace policy so the kernel shouldn't matter... "bloat" would
only come from new, longer contexts (outside the kernel).

> The reason why I'm pushing here is that mbcache shouldn't be showing
> up in the profiles at all if there is no external xattr block.  And so
> if newer versions of SELinux (like Andreas, I've been burned by
> SELinux too many times in the past, so I don't use SELinux on any of
> my systems) is somehow causing mbcache to get triggered, we should
> figure this out and understand what's really going on.

selinux, from an fs allocation behavior perspective, is simply setxattr.

And as I showed earlier, name+value for all of the attrs set by at least RHEL6
selinux policy are well under 100 bytes.

(Add in a bunch of other non-selinux xattrs, and you'll go out of a 256b inode,
sure, but selinux on its own should not).

> Sigh, I suppose I should figure out how to create a minimal KVM setup
> which uses SELinux just so I can see what the heck is going on

http://fedoraproject.org/en/get-fedora ;)

-Eric

>- Ted
> 



Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-11 Thread Theodore Ts'o
On Tue, Sep 10, 2013 at 10:13:16PM -0500, Eric Sandeen wrote:
> 
> Above doesn't tell us the prevalence of various contexts on the actual system,
> but they are all under 100 bytes in any case.

OK, so in other words, on your system i_file_acl and i_file_acl_high
(which is where we store the block # for the external xattr block),
should always be zero for all inodes, yes?

Thavatchai, can you check to see whether or not this is true on your
system?  You can use debugfs on the file system, and then use the
"stat" command to sample various inodes in your system.  Or I can make
a version of e2fsck which counts the number of inodes with external
xattr blocks --- it sounds like this is something we should do anyway.
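Until such an e2fsck option exists, one rough way to bulk-check is to script
debugfs and test the "File ACL:" line (nonzero means an external xattr block).
A sketch of the parsing side, with sample lines shaped like the debugfs output
elsewhere in this thread (the second block number is made up):

```python
import re

def has_external_xattr_block(stat_output):
    """True if `debugfs stat <inode>` output shows a nonzero xattr block.

    debugfs reports i_file_acl (the external xattr block number) on the
    "File ACL:" line; 0 means all xattrs live in the inode body.
    """
    m = re.search(r"File ACL:\s*(\d+)", stat_output)
    return m is not None and int(m.group(1)) != 0

in_inode = "User: 0   Group: 0   Size: 4096\nFile ACL: 0    Directory ACL: 0\n"
external = "User: 0   Group: 0   Size: 4096\nFile ACL: 7873422    Directory ACL: 0\n"
print(has_external_xattr_block(in_inode), has_external_xattr_block(external))
# -> False True
```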

One difference might be that Eric ran this test on RHEL 6, and
Thavatchai is using an upstream kernel, so maybe this bloat has
been added recently?

The reason why I'm pushing here is that mbcache shouldn't be showing
up in the profiles at all if there is no external xattr block.  And so
if newer versions of SELinux (like Andreas, I've been burned by
SELinux too many times in the past, so I don't use SELinux on any of
my systems) is somehow causing mbcache to get triggered, we should
figure this out and understand what's really going on.

Sigh, I suppose I should figure out how to create a minimal KVM setup
which uses SELinux just so I can see what the heck is going on

 - Ted


Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-10 Thread Eric Sandeen
On 9/10/13 4:02 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
>> I agree that SELinux is enabled on enterprise distributions by default,
>> but I'm also interested to know how much overhead this imposes.  I would
>> expect that writing large external xattrs for each file would have quite
>> a significant performance overhead that should not be ignored.  Reducing
>> the mbcache overhead is good, but eliminating it entirely is better.
> 
> I was under the impression that using a 256 byte inode (which gives a
> bit over 100 bytes worth of xattr space) was plenty for SELinux.  If
> it turns out that SELinux's use of xattrs have gotten especially
> piggy, then we may need to revisit the recommended inode size for
> those systems who insist on using SELinux...  even if we eliminate the
> overhead associated with mbcache, the fact that files are requiring a
> separate xattr is going to seriously degrade performance.

On my RHEL6 system,

# find / -xdev -exec getfattr --only-values -m security.* {} 2>/dev/null \; | wc -c
11082179

bytes of values across:

# df -i /
Filesystem                  Inodes  IUsed    IFree IUse% Mounted on
/dev/mapper/vg_bp05-lv_root
                           3276800 280785  2996015    9% /

280785 inodes used,

so:
11082179/280785 = ~39.5 bytes per value on average, plus:

# echo -n "security.selinux" | wc -c
16

Adding 16 bytes for the name gives only about 55-56 bytes per selinux attr on 
average.
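The arithmetic, spelled out with the numbers above:

```python
total_value_bytes = 11082179          # getfattr --only-values ... | wc -c
inodes_used = 280785                  # from df -i
name_bytes = len("security.selinux")  # 16

avg_value = total_value_bytes / inodes_used  # bytes of value per inode
avg_attr = avg_value + name_bytes            # name + value, per attr
print(round(avg_value, 1), round(avg_attr, 1))  # -> 39.5 55.5
```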

So nope, not "especially piggy" on average.

Another way to do it is this; list all possible file contexts, and make
a histogram of sizes:

# for CONTEXT in `semanage fcontext -l | awk '{print $NF}'`; do echo -n $CONTEXT | wc -c; done | sort -n | uniq -c
  1 7
 33 8
356 26
 14 27
 14 28
 37 29
 75 30
237 31
295 32
425 33
324 34
445 35
548 36
229 37
193 38
181 39
259 40
 81 41
108 42
 96 43
 55 44
 55 45
 16 46
 41 47
 23 48
 28 49
 36 50
 10 51
 10 52
  5 54
  2 57

so a 57 byte value is max, but there aren't many of the larger values.

Above doesn't tell us the prevalence of various contexts on the actual system,
but they are all under 100 bytes in any case.

-Eric

>- Ted



Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-10 Thread Thavatchai Makphaibulchoke
On 09/10/2013 09:02 PM, Theodore Ts'o wrote:
> On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
>> I agree that SELinux is enabled on enterprise distributions by default,
>> but I'm also interested to know how much overhead this imposes.  I would
>> expect that writing large external xattrs for each file would have quite
>> a significant performance overhead that should not be ignored.  Reducing
>> the mbcache overhead is good, but eliminating it entirely is better.
> 
> I was under the impression that using a 256 byte inode (which gives a
> bit over 100 bytes worth of xattr space) was plenty for SELinux.  If
> it turns out that SELinux's use of xattrs has gotten especially
> piggy, then we may need to revisit the recommended inode size for
> those systems who insist on using SELinux...  even if we eliminate the
> overhead associated with mbcache, the fact that files are requiring a
> separate xattr is going to seriously degrade performance.
> 
>- Ted
> 

Thank you Andreas and Ted for the explanations and comments.  Yes, I see both 
of your points now.  Though we may reduce the mbcache overhead, the additional 
xattr I/O still has its own cost, so it would be good to provide some data to 
help users or distros determine whether they will be better off completely 
disabling SELinux or increasing the inode size.  I will go ahead and run the 
suggested experiments and get back with the results.

Thanks,
Mak.



Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-10 Thread Andreas Dilger
On 2013-09-06, at 6:23 AM, Thavatchai Makphaibulchoke wrote:
> On 09/06/2013 05:10 AM, Andreas Dilger wrote:
>> On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
>>> No, I did not do anything special, including changing an inode's size. I
>>> just used the profile data, which indicated the mb_cache module as one of
>>> the bottlenecks.  Please see below for perf data from one of the
>>> new_fserver runs, which also shows some mb_cache activity.
>>> 
>>> 
>>>   |--3.51%-- __mb_cache_entry_find
>>>   |  mb_cache_entry_find_first
>>>   |  ext4_xattr_cache_find
>>>   |  ext4_xattr_block_set
>>>   |  ext4_xattr_set_handle
>>>   |  ext4_initxattrs
>>>   |  security_inode_init_security
>>>   |  ext4_init_security
>> 
>> Looks like this is some large security xattr, or enough smaller
>> xattrs to exceed the ~120 bytes of in-inode xattr storage.  How
>> big is the SELinux xattr (assuming that is what it is)?
>> 
>> You could try a few different things here:
>> - disable selinux completely (boot with "selinux=0" on the kernel
>>  command line) and see how much faster it is

> Sorry I'm

not?

> familiar with SELinux enough to say how big its xattr is. Anyway, I'm 
> positive that SELinux is what is generating these xattrs.  With SELinux 
> disabled, there seems to be no call to ext4_xattr_cache_find().

What is the relative performance of your benchmark with SELinux disabled?
While the oprofile graphs will be of passing interest to see that the
mbcache overhead is gone, they will not show the reduction in disk IO
from not writing/updating the external xattr blocks at all.

>> - format your ext4 filesystem with larger inodes (-I 512) and see
>>  if this is an improvement or not.  That depends on the size of
>>  the selinux xattrs and if they will fit into the extra 256 bytes
>>  of xattr space these larger inodes will give you.  The performance
>>  might also be worse, since there will be more data to read/write
>>  for each inode, but it would avoid seeking to the xattr blocks.
> 
> Thanks for the above suggestions. Could you please clarify if we are
> attempting to look for a workaround here? Since we agree the way
> mb_cache uses one global spinlock is incorrect and SELinux exposes
> the problem (which is not uncommon with Enterprise installations),
> I believe we should look at fixing it (patch 1/2). As you also
> mentioned, this will also impact both ext2 and ext3 filesystems.

I agree that SELinux is enabled on enterprise distributions by default,
but I'm also interested to know how much overhead this imposes.  I would
expect that writing large external xattrs for each file would have quite
a significant performance overhead that should not be ignored.  Reducing
the mbcache overhead is good, but eliminating it entirely is better.

Depending on how much overhead SELinux has, it might be important to
spend more time to optimize it (not just the mbcache part), or users
may consider disabling SELinux entirely on systems where they care
about peak performance.

> Anyway, please let me know if you still think any of the above
> experiments is relevant.

You have already done one of the tests that I'm interested in (the above
test which showed that disabling SELinux removed the mbcache overhead).
What I'm interested in is the actual performance (or relative performance
if you are not allowed to publish the actual numbers) of your AIM7
benchmark between SELinux enabled and SELinux disabled.

Next would be a new test that has SELinux enabled, but formatting the
filesystem with 512-byte inodes instead of the ext4 default of 256-byte
inodes.  If this makes a significant improvement, it would potentially
mean users and the upstream distros should use different formatting
options along with SELinux.  This is less clearly a win, since I don't
know enough details of how SELinux uses xattrs (I always disable it,
so I don't have any systems to check).

Cheers, Andreas







Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-10 Thread Theodore Ts'o
On Tue, Sep 10, 2013 at 02:47:33PM -0600, Andreas Dilger wrote:
> I agree that SELinux is enabled on enterprise distributions by default,
> but I'm also interested to know how much overhead this imposes.  I would
> expect that writing large external xattrs for each file would have quite
> a significant performance overhead that should not be ignored.  Reducing
> the mbcache overhead is good, but eliminating it entirely is better.

I was under the impression that using a 256 byte inode (which gives a
bit over 100 bytes worth of xattr space) was plenty for SELinux.  If
it turns out that SELinux's use of xattrs has gotten especially
piggy, then we may need to revisit the recommended inode size for
those systems who insist on using SELinux...  even if we eliminate the
overhead associated with mbcache, the fact that files are requiring a
separate xattr is going to seriously degrade performance.

 - Ted


Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-06 Thread Thavatchai Makphaibulchoke
On 09/06/2013 05:10 AM, Andreas Dilger wrote:
> On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
>> No, I did not do anything special, including changing an inode's size. I
>> just used the profile data, which indicated the mb_cache module as one of
>> the bottlenecks.  Please see below for perf data from one of the
>> new_fserver runs, which also shows some mb_cache activity.
>>
>>
>>|--3.51%-- __mb_cache_entry_find
>>|  mb_cache_entry_find_first
>>|  ext4_xattr_cache_find
>>|  ext4_xattr_block_set
>>|  ext4_xattr_set_handle
>>|  ext4_initxattrs
>>|  security_inode_init_security
>>|  ext4_init_security
> 
> Looks like this is some large security xattr, or enough smaller
> xattrs to exceed the ~120 bytes of in-inode xattr storage.  How
> big is the SELinux xattr (assuming that is what it is)?
> 

Sorry, I'm not familiar with SELinux enough to say how big its xattr is. 
Anyway, I'm positive that SELinux is what is generating these xattrs.  With 
SELinux disabled, there seems to be no call to ext4_xattr_cache_find().

>> Looks like it's a bit harder to disable mbcache than I thought.
>> I ended up adding code to collect the statistics.
>>
>> With selinux enabled, for the new_fserver workload of aim7, there
>> are a total of 0x7e054201 ext4_xattr_cache_find() calls
>> that result in a hit and 0xc100 calls that do not.
>> The number does not seem to favor the complete disabling of
>> mbcache in this case.
> 
> This is about a 65% hit rate, which seems reasonable.
> 
> You could try a few different things here:
> - disable selinux completely (boot with "selinux=0" on the kernel
>   command line) and see how much faster it is
> - format your ext4 filesystem with larger inodes (-I 512) and see
>   if this is an improvement or not.  That depends on the size of
>   the selinux xattrs and if they will fit into the extra 256 bytes
>   of xattr space these larger inodes will give you.  The performance
>   might also be worse, since there will be more data to read/write
>   for each inode, but it would avoid seeking to the xattr blocks.
> 

Thanks for the above suggestions. Could you please clarify if we are attempting 
to look for a workaround here? Since we agree the way mb_cache uses one global 
spinlock is incorrect and SELinux exposes the problem (which is not uncommon 
with Enterprise installations), I believe we should look at fixing it (patch 
1/2). As you also mentioned, this will also impact both ext2 and ext3 
filesystems.

Anyway, please let me know if you still think any of the above experiments is 
relevant.

Thanks,
Mak.


> Cheers, Andreas
> 
> 
> 
> 
> 



Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-05 Thread Andreas Dilger
On 2013-09-05, at 3:49 AM, Thavatchai Makphaibulchoke wrote:
> On 09/05/2013 02:35 AM, Theodore Ts'o wrote:
>> How did you gather these results?  The mbcache is only used if you
>> are using extended attributes, and only if the extended attributes don't fit 
>> in the inode's extra space.
>> 
>> I checked aim7, and it doesn't do any extended attribute operations.
>> So why are you seeing differences?  Are you doing something like
>> deliberately using 128 byte inodes (which is not the default inode
>> size), and then enabling SELinux, or some such?
> 
> No, I did not do anything special, including changing an inode's size. I
> just used the profile data, which indicated the mb_cache module as one of
> the bottlenecks.  Please see below for perf data from one of the
> new_fserver runs, which also shows some mb_cache activity.
> 
> 
>|--3.51%-- __mb_cache_entry_find
>|  mb_cache_entry_find_first
>|  ext4_xattr_cache_find
>|  ext4_xattr_block_set
>|  ext4_xattr_set_handle
>|  ext4_initxattrs
>|  security_inode_init_security
>|  ext4_init_security

Looks like this is some large security xattr, or enough smaller
xattrs to exceed the ~120 bytes of in-inode xattr storage.  How
big is the SELinux xattr (assuming that is what it is)?

> Looks like it's a bit harder to disable mbcache than I thought.
> I ended up adding code to collect the statistics.
> 
> With selinux enabled, for the new_fserver workload of aim7, there
> are a total of 0x7e054201 ext4_xattr_cache_find() calls
> that result in a hit and 0xc100 calls that do not.
> The number does not seem to favor the
> complete disabling of mbcache in this case.

This is about a 65% hit rate, which seems reasonable.

You could try a few different things here:
- disable selinux completely (boot with "selinux=0" on the kernel
  command line) and see how much faster it is
- format your ext4 filesystem with larger inodes (-I 512) and see
  if this is an improvement or not.  That depends on the size of
  the selinux xattrs and if they will fit into the extra 256 bytes
  of xattr space these larger inodes will give you.  The performance
  might also be worse, since there will be more data to read/write
  for each inode, but it would avoid seeking to the xattr blocks.
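
A sketch of the corresponding commands (the device name is a placeholder, and
reformatting destroys all data on the device):

```shell
# Verify the current inode size of the filesystem under test:
dumpe2fs -h /dev/sdXN | grep "Inode size"

# For the selinux comparison, boot with "selinux=0" on the kernel command line.

# Reformat with 512-byte inodes (DESTROYS all data on /dev/sdXN):
mkfs.ext4 -I 512 /dev/sdXN
```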

Cheers, Andreas





--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-05 Thread Thavatchai Makphaibulchoke
On 09/04/2013 08:00 PM, Andreas Dilger wrote:
> 
> In the past, I've raised the question of whether mbcache is even
> useful on real-world systems.  Essentially, this is providing a
> "deduplication" service for ext2/3/4 xattr blocks that are identical.
> The question is how often this is actually the case in modern use?
> The original design was for allowing external ACL blocks to be
> shared between inodes, at a time when ACLs were pretty much the
> only xattrs stored on inodes.
> 
> The question now is whether there are common uses where all of the
> xattrs stored on multiple inodes are identical?  If that is not the
> case, mbcache is only adding overhead and should be disabled
> entirely, rather than merely having its overhead reduced.
> 
> There aren't good statistics on the hit rate for mbcache, but it
> might be possible to generate some with systemtap or similar to
> see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.
> 
> Cheers, Andreas
> 

Looks like it's a bit harder to disable mbcache than I thought. I ended up 
adding code to collect the statistics.

With selinux enabled, for the new_fserver workload of aim7, there are a total of 
0x7e054201 ext4_xattr_cache_find() calls that result in a hit and 
0xc100 calls that do not.  These numbers do not seem to favor 
completely disabling mbcache in this case.

Thanks,
Mak.




Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-05 Thread Thavatchai Makphaibulchoke
On 09/05/2013 02:35 AM, Theodore Ts'o wrote:
> How did you gather these results?  The mbcache is only used if you are
> using extended attributes, and only if the extended attributes don't
> fit in the inode's extra space.
> 
> I checked aim7, and it doesn't do any extended attribute operations.
> So why are you seeing differences?  Are you doing something like
> deliberately using 128 byte inodes (which is not the default inode
> size), and then enabling SELinux, or some such?
> 
>   - Ted
> 

No, I did not do anything special, including changing an inode's size. I just 
used the profile data, which indicated the mb_cache module as one of the 
bottlenecks.  Please see below for perf data from one of the new_fserver runs, 
which also shows some mb_cache activity.


|--3.51%-- __mb_cache_entry_find
|  mb_cache_entry_find_first
|  ext4_xattr_cache_find
|  ext4_xattr_block_set
|  ext4_xattr_set_handle
|  ext4_initxattrs
|  security_inode_init_security
|  ext4_init_security
|  __ext4_new_inode
|  ext4_create
|  vfs_create
|  lookup_open
|  do_last
|  path_openat
|  do_filp_open
|  do_sys_open
|  SyS_open
|  sys_creat
|  system_call
|  __creat_nocancel
|  |
|  |--16.67%-- 0x11fe2c0
|  |

Thanks,
Mak.



Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-04 Thread Theodore Ts'o
On Wed, Sep 04, 2013 at 10:39:14AM -0600, T Makphaibulchoke wrote:
> 
> Here are the performance improvements in some of the aim7 workloads,

How did you gather these results?  The mbcache is only used if you are
using extended attributes, and only if the extended attributes don't
fit in the inode's extra space.

I checked aim7, and it doesn't do any extended attribute operations.
So why are you seeing differences?  Are you doing something like
deliberately using 128 byte inodes (which is not the default inode
size), and then enabling SELinux, or some such?

- Ted


Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-04 Thread Thavatchai Makphaibulchoke
On 09/04/2013 08:00 PM, Andreas Dilger wrote:

> 
> In the past, I've raised the question of whether mbcache is even
> useful on real-world systems.  Essentially, this is providing a
> "deduplication" service for ext2/3/4 xattr blocks that are identical.
> The question is how often this is actually the case in modern use?
> The original design was for allowing external ACL blocks to be
> shared between inodes, at a time when ACLs were pretty much the
> only xattrs stored on inodes.
> 
> The question now is whether there are common uses where all of the
> xattrs stored on multiple inodes are identical?  If that is not the
> case, mbcache is only adding overhead and should be disabled
> entirely, rather than merely having its overhead reduced.
> 
> There aren't good statistics on the hit rate for mbcache, but it
> might be possible to generate some with systemtap or similar to
> see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.
> 
> Cheers, Andreas
> 

Thanks Andreas for the comments.  Since I'm not familiar with systemtap, I'm 
thinking probably the quickest and simplest way is to re-run aim7 and Swingbench 
with mbcache disabled for comparison. Please let me know if you have any 
other benchmark suggestions or if you think systemtap on ext4_xattr_cache_find() 
would give a more accurate measurement.

Thanks,
Mak.



Re: [PATCH v3 0/2] ext4: increase mbcache scalability

2013-09-04 Thread Andreas Dilger
On 2013-09-04, at 10:39 AM, T Makphaibulchoke wrote:
> This patch intends to improve the scalability of an ext filesystem,
> particularly ext4.

In the past, I've raised the question of whether mbcache is even
useful on real-world systems.  Essentially, this is providing a
"deduplication" service for ext2/3/4 xattr blocks that are identical.
The question is how often this is actually the case in modern use?
The original design was for allowing external ACL blocks to be
shared between inodes, at a time when ACLs were pretty much the
only xattrs stored on inodes.

The question now is whether there are common uses where all of the
xattrs stored on multiple inodes are identical?  If that is not the
case, mbcache is only adding overhead and should be disabled
entirely, rather than merely having its overhead reduced.

There aren't good statistics on the hit rate for mbcache, but it
might be possible to generate some with systemtap or similar to
see how often ext4_xattr_cache_find() returns NULL vs. non-NULL.
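
A minimal SystemTap sketch of that measurement might look like the following
(untested; ext4_xattr_cache_find() is a static function, so this assumes a
kernel with debuginfo available and that the symbol is actually probe-able):

```systemtap
global hits, misses

probe module("ext4").function("ext4_xattr_cache_find").return {
        if ($return)
                hits++
        else
                misses++
}

probe end {
        printf("ext4_xattr_cache_find: %d hits, %d misses\n", hits, misses)
}
```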

Cheers, Andreas

> The patch consists of two parts.  The first part introduces a higher
> degree of parallelism to the usage of the mb_cache and
> mb_cache_entry structures and impacts all ext filesystems.
> 
> The second part of the patch further increases the scalability of
> an ext4 filesystem by having each ext4 filesystem allocate and use
> its own private mbcache structure, instead of sharing a single
> mbcache structure across all ext4 filesystems.
> 
> Here are some of the benchmark results with the changes. 
> 
> On a 90 core machine:
> 
> Here are the performance improvements in some of the aim7 workloads,
> 
> -------------------------------
> | Workload     | % increase   |
> -------------------------------
> | alltests     | 11.85        |
> -------------------------------
> | custom       | 14.42        |
> -------------------------------
> | fserver      | 21.36        |
> -------------------------------
> | new_dbase    |  5.59        |
> -------------------------------
> | new_fserver  | 21.45        |
> -------------------------------
> | shared       | 12.84        |
> -------------------------------
> For the Swingbench dss workload, with a 16 GB database,
> 
> ----------------------------------------------------------------------------------------
> | Users                   | 100  | 200  | 300  | 400  | 500  | 600  | 700  | 800  | 900  |
> ----------------------------------------------------------------------------------------
> | % improvement           | 8.46 | 8.00 | 7.35 | -.313| 1.09 | 0.69 | 0.30 | 2.18 | 5.23 |
> ----------------------------------------------------------------------------------------
> | % improvement           |45.66 |47.62 |34.54 |25.15 |15.29 | 3.38 | -8.7 |-4.98 |-7.86 |
> | (without shared memory) |      |      |      |      |      |      |      |      |      |
> ----------------------------------------------------------------------------------------
> For SPECjbb2013, composite run,
> 
> ------------------------------------------------
> |               | max-jOPS | critical-jOPS     |
> ------------------------------------------------
> | % improvement |   5.99   | N/A               |
> ------------------------------------------------
> 
> On an 80 core machine:
> 
> The aim7 results for most of the workloads turn out to be the same.
> 
> Here are the results of the Swingbench dss workload,
> 
> ------------------------------------------------------------------------------
> | Users         | 100  | 200  | 300  | 400  | 500  | 600  | 700  | 800  | 900  |
> ------------------------------------------------------------------------------
> | % improvement |-1.79 | 0.37 | 1.36 | 0.08 | 1.66 | 2.09 | 1.16 | 1.48 | 1.92 |
> ------------------------------------------------------------------------------
> 
> The changes have been tested with ext4 xfstests to verify that no regression
> has been introduced. 
> 
> Changed in v3:
>   - New diff summary
> 
> Changed in v2:
>   - New performance data
>   - New diff summary
> 
> T Makphaibulchoke (2):
>  mbcache: decoupling the locking of local from global data
>  ext4: each filesystem creates and uses its own mb_cache
> 
> fs/ext4/ext4.h  |   1 +
> fs/ext4/super.c |  24 ++--
> fs/ext4/xattr.c |  51 
> fs/ext4/xattr.h |   6 +-
> fs/mbcache.c| 306 +++-
> include/linux/mbcache.h |  10 +-
> 6 files changed, 277 insertions(+), 121 deletions(-)
> 
> -- 
> 1.7.11.3
> 


Cheers, Andreas




