Instead of guessing, I took a look at one of my OSDs.

TL;DR: I’m going to bump the inode size to 512, which should fit the majority of 
xattrs; no need to touch the filestore parameters.

Short news first: I can’t find a file with more than 2 xattrs (and that’s good).

Then I extracted all the xattrs on all of the ~100K files, measured their sizes 
and counted the occurrences.
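
For reference, that kind of scan can be done with a short script along these lines 
(a minimal Python 3 sketch for Linux; the OSD path is only an example and error 
handling is kept to a minimum):

#!/usr/bin/env python3
# Sketch: tally xattr names and per-file xattr sizes under one OSD data dir.
import os
from collections import Counter

OSD_DIR = "/var/lib/ceph/osd/ceph-55/current"   # example path, adjust to your OSD

name_counts = Counter()   # how often each xattr name occurs
size_counts = Counter()   # how often each per-file total xattr size occurs

for root, _dirs, files in os.walk(OSD_DIR):
    for fname in files:
        path = os.path.join(root, fname)
        try:
            total = 0
            for name in os.listxattr(path):
                value = os.getxattr(path, name)
                name_counts[name] += 1
                total += len(name) + len(value)
        except OSError:
            continue          # skip anything unreadable or already gone
        size_counts[total] += 1

print("xattr names seen:", dict(name_counts))
print("most common per-file xattr sizes:", size_counts.most_common(10))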

The largest xattr I have is 705 chars in base64 (so roughly three quarters of 
that once decoded, i.e. somewhat over 500 B), which means that particular file 
has more than 512 B in xattrs in total (that’s more than was expected with an 
RBD-only workload, right?)

# file: var/lib/ceph/osd/ceph-55//current/4.1ad7_head/rbd134udata.1a785181f15746a.000000000005a578__head_E5C51AD7__4
 117
user.ceph._=0sCwjyAAAABANKAAAAAAAAACkAAAByYmRfZGF0YS4xYTc4NTE4MWYxNTc0NmEuMDAwMDAwMDAwMDA1YTU3OP7/////////1xrF5QAAAAAABAAAAAAAAAAFAxQAAAAEAAAAAAAAAP////8AAAAAAAAAAAAAAADrEKMAAAAAADB2DQAiDaMA
AAAAAG11DQACAhUAAAAI1xSoAQAAAAD9CwAMAAAAAAAAAAAAAEAAAAAAABAgpFWoa6QVAgIVAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA6xCjAAAAAAAwdg0AAAAAAAA=
 347
user.ceph.snapset=0sAgL5AQAAgt8HAAAAAAABBgAAAILfBwAAAAAAb94HAAAAAAC23AcAAAAAAEnPBwAAAAAA470HAAAAAAB4ugcAAAAAAAQAAAC1ugcAAAAAAOO9BwAAAAAAStAHAAAAAACC3wcAAAAAAAQAAAC1ugcAAAAAAAQAAAAAAAAAAAAAAABQFAAAAAAAAGAUAAAAAAAAwAoAAAAAAAAwHwAAAAAAAJAZAAAAAAAA4DgAAAAAAAAgBwAAAAAA470HAAAAAAAFAAAAAAAAAAAAAAAAEA8AAAAAAAAgDwAAAAAAACAFAAAAAAAASBQAAAAAAABADgAAAAAAAJAiAAAAAAAAoAIAAAAAAAA4JQAAAAAAAMgaAAAAAABK0AcAAAAAAAQAAAAAAAAAAAAAAADgAQAAAAAAAOgBAAAAAAAAeCYAAAAAAACAKAAAAAAAAHAAAAAAAAAAACkAAAAAAAAAFwAAAAAAgt8HAAAAAAAFAAAAAAAAAAAAAAAAoAEAAAAAAADAAQAAAAAAAIAMAAAAAAAAUA4AAAAAAAAQBgAAAAAAAIAUAAAAAAAA4AAAAAAAAACAFQAAAAAAAIAqAAAAAAAEAAAAtboHAAAAAAAAAEAAAAAAAOO9BwAAAAAAAABAAAAAAABK0AcAAAAAAAAAQAAAAAAAgt8HAAAAAAAAAEAAAAAAAA==
 705

(If anyone wants to enlighten me on the contents, that would be great. Is this 
expected to grow much?)
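
As a sanity check on the sizes quoted above, the raw length of any of these 
values can be verified by decoding the base64 dump, e.g. (the string below is 
just a truncated placeholder, not the full value):

import base64

# Truncated placeholder -- paste the full payload printed by getfattr here,
# without the leading "0s" that getfattr adds for base64-encoded values.
encoded = "CwjyAAAABANKAAAAAAAAACkAAABy"

raw = base64.b64decode(encoded)
print(f"{len(encoded)} base64 chars -> {len(raw)} raw bytes")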


BUT most of the files have much smaller xattrs, and if I researched it 
correctly, ext4 first uses the free space inside the inode (which should be 
roughly inode_size - 128 - 28 bytes) and, if that is not enough, allocates one 
additional block for the xattrs.

In other words, if I format ext4 with a 2048 B inode size and a 4096 B block 
size, there will be 2048 - (128 + 28) = 1892 bytes available inside the inode, 
plus up to 4096 bytes that can be allocated in one additional block. With the 
default format there are just 256 - (128 + 28) = 100 bytes in the inode, plus 
the 4096 bytes in the extra block.
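
Put as a quick calculation (a small sketch using the same rough formula; it 
ignores the xattr header and per-entry overhead, so treat the numbers as upper 
bounds):

# Approximate in-inode xattr space on ext4:
# inode_size - 128 (legacy inode) - 28 (extra fields).
for inode_size in (256, 512, 1024, 2048):
    free = inode_size - (128 + 28)
    print(f"{inode_size:4d} B inode -> ~{free:4d} B usable for in-inode xattrs")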


In my case, the majority of files have less than 200 B of xattrs, which is more 
than fits inside the default 256 B inode but not really that large, so it should 
be beneficial to bump the inode size to 512 B (that leaves a comfortable 356 
bytes for xattrs).

Jan


> On 14 Jul 2015, at 12:18, Gregory Farnum <g...@gregs42.com> wrote:
> 
> On Tue, Jul 14, 2015 at 10:53 AM, Jan Schermer <j...@schermer.cz> wrote:
>> Thank you for your reply.
>> Comments inline.
>> 
>> I’m still hoping to get some more input, but there are many people running 
>> ceph on ext4, and it sounds like it works pretty good out of the box. Maybe 
>> I’m overthinking this, then?
> 
> I think so — somebody did a lot of work making sure we were well-tuned
> on the standard filesystems; I believe it was David.
> -Greg
> 
>> 
>> Jan
>> 
>>> On 13 Jul 2015, at 21:04, Somnath Roy <somnath....@sandisk.com> wrote:
>>> 
>>> <<inline
>>> 
>>> -----Original Message-----
>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
>>> Jan Schermer
>>> Sent: Monday, July 13, 2015 2:32 AM
>>> To: ceph-users@lists.ceph.com
>>> Subject: Re: [ceph-users] xattrs vs omap
>>> 
>>> Sorry for reviving an old thread, but could I get some input on this, 
>>> pretty please?
>>> 
>>> ext4 has 256-byte inodes by default (at least according to docs) but the 
>>> fragment below says:
>>> OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512)
>>> 
>>> The default 512b is too much if the inode is just 256b, so shouldn’t that 
>>> be 256b in case people use the default ext4 inode size?
>>> 
>>> Anyway, is it better to format ext4 with larger inodes (say 2048b) and set 
>>> filestore_max_inline_xattr_size_other=1536, or leave it at defaults?
>>> [Somnath] Why 1536 ? why not 1024 or any power of 2 ? I am not seeing any 
>>> harm though, but, curious.
>> 
>> AFAIK the inode holds other information besides xattrs, and you also need to 
>> count the xattr names into this - so if I want to store 1536B of “values” it 
>> would cost more, and there still needs to be some space left.
>> 
>>> (As I understand it, on ext4 xattrs are limited to the space in the inode 
>>> plus one additional block they can spill into - maybe someone knows better.)
>>> 
>>> 
>>> [Somnath] The xattr ("_") size is now more than 256 bytes and it will spill 
>>> over, so a bigger inode size will be good. But I would suggest doing your 
>>> own benchmark before putting it into production.
>>> 
>> 
>> Good point, and I am going to do that, but I’d like to avoid the guesswork. 
>> Also, not all patterns are always replicable…
>> 
>>> Is filestore_max_inline_xattr_size an absolute limit, or is it 
>>> filestore_max_inline_xattr_size * filestore_max_inline_xattrs in reality?
>>> 
>>> [Somnath] The *_size tracks the xattr size per attribute, and *inline_xattrs 
>>> tracks the maximum number of inline attributes allowed. So if an xattr's 
>>> size is > *_size it will go to omap, and likewise if the total number of 
>>> xattrs is > *inline_xattrs they will go to omap.
>>> If you are only using rbd, the number of inline xattrs will always be 2 and 
>>> it will not cross that default max limit.
>> 
>> If I’m reading this correctly then with my setting of  
>> filestore_max_inline_xattr_size_other=1536, it could actually consume 3072B 
>> (2 xattrs), so I should in reality use 4K inodes…?
>> 
>> 
>>> 
>>> Does OSD do the sane thing if for some reason the xattrs do not fit? What 
>>> are the performance implications of storing the xattrs in leveldb?
>>> 
>>> [Somnath] I don't have the exact numbers, but there is a significant 
>>> overhead if the xattrs go to leveldb.
>>> 
>>> And lastly - what size of xattrs should I really expect if all I use is RBD 
>>> for OpenStack instances? (No radosgw, no cephfs, but heavy on rbd image and 
>>> pool snapshots). This overhead is quite large
>>> 
>>> [Somnath] It will be 2 xattrs: the default "_" will be a little bigger than 
>>> 256 bytes, and "_snapset" is small - it depends on the number of 
>>> snaps/clones, but it is unlikely to cross the 256-byte range.
>> 
>> I have few pool snapshots and lots (hundreds) of (nested) snapshots for rbd 
>> volumes. Does this come into play somehow?
>> 
>>> 
>>> My plan so far is to format the drives like this:
>>> mkfs.ext4 -I 2048 -b 4096 -i 524288 -E stride=32,stripe-width=256 (2048b 
>>> inode, 4096b block size, one inode per 512k of space) and set 
>>> filestore_max_inline_xattr_size_other=1536
>>> [Somnath] Not much idea on ext4, sorry..
>>> 
>>> Does that make sense?
>>> 
>>> Thanks!
>>> 
>>> Jan
>>> 
>>> 
>>> 
>>>> On 02 Jul 2015, at 12:18, Jan Schermer <j...@schermer.cz> wrote:
>>>> 
>>>> Does anyone have a known-good set of parameters for ext4? I want to try it 
>>>> as well but I’m a bit worried about what happens if I get it wrong.
>>>> 
>>>> Thanks
>>>> 
>>>> Jan
>>>> 
>>>> 
>>>> 
>>>>> On 02 Jul 2015, at 09:40, Nick Fisk <n...@fisk.me.uk> wrote:
>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>>>>>> Behalf Of Christian Balzer
>>>>>> Sent: 02 July 2015 02:23
>>>>>> To: Ceph Users
>>>>>> Subject: Re: [ceph-users] xattrs vs omap
>>>>>> 
>>>>>> On Thu, 2 Jul 2015 00:36:18 +0000 Somnath Roy wrote:
>>>>>> 
>>>>>>> It is replaced with the following config option..
>>>>>>> 
>>>>>>> // Use omap for xattrs for attrs over
>>>>>>> // filestore_max_inline_xattr_size or
>>>>>>> OPTION(filestore_max_inline_xattr_size, OPT_U32, 0)     //Override
>>>>>>> OPTION(filestore_max_inline_xattr_size_xfs, OPT_U32, 65536)
>>>>>>> OPTION(filestore_max_inline_xattr_size_btrfs, OPT_U32, 2048)
>>>>>>> OPTION(filestore_max_inline_xattr_size_other, OPT_U32, 512)
>>>>>>> 
>>>>>>> // for more than filestore_max_inline_xattrs attrs
>>>>>>> OPTION(filestore_max_inline_xattrs, OPT_U32, 0) //Override
>>>>>>> OPTION(filestore_max_inline_xattrs_xfs, OPT_U32, 10)
>>>>>>> OPTION(filestore_max_inline_xattrs_btrfs, OPT_U32, 10)
>>>>>>> OPTION(filestore_max_inline_xattrs_other, OPT_U32, 2)
>>>>>>> 
>>>>>>> 
>>>>>>> If these limits crossed, xattrs will be stored in omap..
>>>>>>> 
>>>>>> Sounds fair.
>>>>>> 
>>>>>> Since I only use RBD I don't think it will ever exceed this.
>>>>> 
>>>>> Possibly - see my thread about the performance difference between new and
>>>>> old pools. Still not quite sure what's going on, but for some reason
>>>>> some of the objects behind RBDs have larger xattrs, which is causing
>>>>> really poor performance.
>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Chibi
>>>>>>> For ext4, you can use either filestore_max*_other or
>>>>>>> filestore_max_inline_xattrs/ filestore_max_inline_xattr_size. In any
>>>>>>> case, the latter two will override everything.
>>>>>>> 
>>>>>>> Thanks & Regards
>>>>>>> Somnath
>>>>>>> 
>>>>>>> -----Original Message-----
>>>>>>> From: Christian Balzer [mailto:ch...@gol.com]
>>>>>>> Sent: Wednesday, July 01, 2015 5:26 PM
>>>>>>> To: Ceph Users
>>>>>>> Cc: Somnath Roy
>>>>>>> Subject: Re: [ceph-users] xattrs vs omap
>>>>>>> 
>>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> On Wed, 1 Jul 2015 15:24:13 +0000 Somnath Roy wrote:
>>>>>>> 
>>>>>>>> It doesn't matter, I think filestore_xattr_use_omap is a 'noop'
>>>>>>>> and not used in the Hammer.
>>>>>>>> 
>>>>>>> Then what was this functionality replaced with, esp. considering
>>>>>>> EXT4 based OSDs?
>>>>>>> 
>>>>>>> Chibi
>>>>>>>> Thanks & Regards
>>>>>>>> Somnath
>>>>>>>> 
>>>>>>>> -----Original Message-----
>>>>>>>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
>>>>>>>> Behalf Of Adam Tygart Sent: Wednesday, July 01, 2015 8:20 AM
>>>>>>>> To: Ceph Users
>>>>>>>> Subject: [ceph-users] xattrs vs omap
>>>>>>>> 
>>>>>>>> Hello all,
>>>>>>>> 
>>>>>>>> I've got a coworker who put "filestore_xattr_use_omap = true" in
>>>>>>>> the ceph.conf when we first started building the cluster. Now he
>>>>>>>> can't remember why. He thinks it may be a holdover from our first
>>>>>>>> Ceph cluster (running dumpling on ext4, iirc).
>>>>>>>> 
>>>>>>>> In the newly built cluster, we are using XFS with 2048 byte
>>>>>>>> inodes, running Ceph 0.94.2. It currently has production data in it.
>>>>>>>> 
>>>>>>>> From my reading of other threads, it looks like this is probably
>>>>>>>> not something you want set to true (at least on XFS), due to
>>>>>>>> performance implications. Is this something you can change on a 
>>>>>>>> running cluster?
>>>>>>>> Is it worth the hassle?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Adam
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>>> ch...@gol.com           Global OnLine Japan/Fusion Communications
>>>>>>> http://www.gol.com/
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Christian Balzer        Network/Systems Engineer
>>>>>> ch...@gol.com       Global OnLine Japan/Fusion Communications
>>>>>> http://www.gol.com/
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 
