Primary RGW usage.  270M objects, 857TB data/1195TB raw, EC 8+3 in the RGW
data pool, less than 200K objects in all other pools.  OSDs 366 and 367 are
NVMe OSDs; the rest are 10TB disks holding both data and DB, each with a 2GB
WAL partition on NVMe.  The only things on the NVMe OSDs are the RGW metadata
pools.  Only 2 servers in the cluster are on BlueStore so far; the rest are
still FileStore.

osd.319 onodes=164010 db_used_bytes=14433648640 avg_obj_size=23392454 overhead_per_obj=88004
osd.352 onodes=162395 db_used_bytes=12957253632 avg_obj_size=23440441 overhead_per_obj=79788
osd.357 onodes=159920 db_used_bytes=14039384064 avg_obj_size=24208736 overhead_per_obj=87790
osd.356 onodes=164420 db_used_bytes=13006536704 avg_obj_size=23155304 overhead_per_obj=79105
osd.355 onodes=164086 db_used_bytes=13021216768 avg_obj_size=23448898 overhead_per_obj=79356
osd.354 onodes=164665 db_used_bytes=13026459648 avg_obj_size=23357786 overhead_per_obj=79108
osd.353 onodes=164575 db_used_bytes=14099152896 avg_obj_size=23377114 overhead_per_obj=85670
osd.359 onodes=163922 db_used_bytes=13991149568 avg_obj_size=23397323 overhead_per_obj=85352
osd.358 onodes=164805 db_used_bytes=12706643968 avg_obj_size=23160121 overhead_per_obj=77101
osd.364 onodes=163009 db_used_bytes=14926479360 avg_obj_size=23552838 overhead_per_obj=91568
osd.365 onodes=163639 db_used_bytes=13615759360 avg_obj_size=23541130 overhead_per_obj=83206
osd.362 onodes=164505 db_used_bytes=13152288768 avg_obj_size=23324698 overhead_per_obj=79950
osd.363 onodes=164395 db_used_bytes=13104054272 avg_obj_size=23157437 overhead_per_obj=79710
osd.360 onodes=163484 db_used_bytes=14292090880 avg_obj_size=23347543 overhead_per_obj=87421
osd.361 onodes=164140 db_used_bytes=12977176576 avg_obj_size=23498778 overhead_per_obj=79061
osd.366 onodes=1516 db_used_bytes=7509901312 avg_obj_size=5743370 overhead_per_obj=4953760
osd.367 onodes=1435 db_used_bytes=7992246272 avg_obj_size=6419719 overhead_per_obj=5569509
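
For anyone cross-checking these numbers: overhead_per_obj is simply
db_used_bytes divided by onodes.  Below is a minimal sketch of pulling those
two values over the OSD admin socket.  The counter names I read
(bluefs/db_used_bytes and bluestore/bluestore_onodes) are an assumption on my
part and may not be exactly what the gist linked further down uses, and
bluestore_onodes may only count onodes currently held in cache, so treat the
result as an approximation.

#!/usr/bin/env python
# Minimal sketch: per-onode RocksDB overhead for a single OSD, read over the
# admin socket.  The counter names below are assumptions and may need to be
# adjusted for your Ceph release; bluestore_onodes may only reflect cached
# onodes, so the result is an approximation.
import json
import subprocess
import sys

def perf_dump(osd_id):
    out = subprocess.check_output(
        ['ceph', 'daemon', 'osd.%d' % osd_id, 'perf', 'dump'])
    return json.loads(out)

def overhead(osd_id):
    perf = perf_dump(osd_id)
    db_used = perf['bluefs']['db_used_bytes']
    onodes = perf['bluestore']['bluestore_onodes']
    return onodes, db_used, (db_used // onodes) if onodes else 0

if __name__ == '__main__':
    osd_id = int(sys.argv[1])
    onodes, db_used, per_obj = overhead(osd_id)
    print('osd.%d onodes=%d db_used_bytes=%d overhead_per_obj=%d'
          % (osd_id, onodes, db_used, per_obj))

Run it on the host that carries the OSD, e.g. "python bluestore_overhead.py
319" for osd.319 (the filename is just an example).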

On Tue, May 1, 2018 at 1:57 AM Wido den Hollander <w...@42on.com> wrote:

>
>
> On 04/30/2018 10:25 PM, Gregory Farnum wrote:
> >
> >
> > On Thu, Apr 26, 2018 at 11:36 AM Wido den Hollander <w...@42on.com> wrote:
> >
> >     Hi,
> >
> >     I've been investigating the per object overhead for BlueStore as I've
> >     seen this has become a topic for a lot of people who want to store a
> >     lot of small objects in Ceph using BlueStore.
> >
> >     I've written a piece of Python code which can be run on a server
> >     running OSDs and will print the overhead.
> >
> >     https://gist.github.com/wido/b1328dd45aae07c45cb8075a24de9f1f
> >
> >     Feedback on this script is welcome, but also the output of what
> >     people are observing.
> >
> >     The results from my tests are below, but what I see is that the
> >     overhead seems to range from 10kB to 30kB per object.
> >
> >     On RBD-only clusters the overhead seems to be around 11kB, but on
> >     clusters with an RGW workload the overhead goes up to around 20kB.
> >
> >
> > This difference seems implausible, as RGW always writes full objects,
> > whereas RBD will frequently write pieces of them and do overwrites.
> > I'm not sure what all knobs are available and which diagnostics
> > BlueStore exports, but is it possible you're looking at the total
> > RocksDB data store rather than the per-object overhead? The distinction
> > here being that the RocksDB instance will also store "client" (ie, RGW)
> > omap data and xattrs, in addition to the actual BlueStore onodes.
>
> Yes, that is possible. But in the end, the number of onodes is the number
> of objects you store, and then you want to know how many bytes the RocksDB
> database uses for them.
>
> I do agree that RGW doesn't do partial writes and has more metadata, but
> eventually that all has to be stored.
>
> We just need to come up with some good numbers on how to size the DB.
>
> Currently I assume a 10GB:1TB ratio and that is working out, but with
> people wanting to use 12TB disks we need to pin those numbers down even
> more. Otherwise you will need a lot of SSD space if you want to store the
> DB on SSD.
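
To put that rule of thumb next to the numbers at the top of this mail (a
back-of-the-envelope sketch only; the 12TB disk is a hypothetical example and
the osd.319 figures are taken from my listing above):

# Rule of thumb quoted above: 10 GB of DB per 1 TB of data.
disk_tb = 12                            # hypothetical 12 TB data disk
print(disk_tb * 10)                     # -> 120 GB of DB space to reserve

# Observed overhead on one of my 10 TB HDD OSDs (osd.319 above):
onodes = 164010
overhead_per_obj = 88004                # bytes of DB per onode
print(onodes * overhead_per_obj / 1e9)  # -> ~14.4 GB of DB actually in use
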
>
> Wido
>
> > -Greg
> >
> >
> >
> >     I know that partial overwrites and appends contribute to higher
> >     overhead on objects and I'm trying to investigate this and share my
> >     information with the community.
> >
> >     I have two use cases that want to store >2 billion objects with an
> >     average object size of 50kB (8 - 80kB), and the RocksDB overhead is
> >     likely to become a big problem.
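
For a rough sense of scale on that use case (back-of-the-envelope only, using
the 10-30kB per-object range reported above):

# 2 billion objects at an average size of 50 kB:
objects = 2e9
print(objects * 50e3 / 1e12)   # -> ~100 TB of object data
print(objects * 10e3 / 1e12)   # -> ~20 TB of RocksDB at 10 kB/object
print(objects * 30e3 / 1e12)   # -> ~60 TB of RocksDB at 30 kB/object

So the DB could end up at 20-60% of the size of the data itself, which is why
sizing it correctly matters so much here.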
> >
> >     Is anybody willing to share the overhead they are seeing, and for
> >     which use case?
> >
> >     The more data we have on this the better we can estimate how DBs
> >     need to be sized for BlueStore deployments.
> >
> >     Wido
> >
> >     # Cluster #1
> >     osd.25 onodes=178572 db_used_bytes=2188378112 avg_obj_size=6196529 overhead=12254
> >     osd.20 onodes=209871 db_used_bytes=2307915776 avg_obj_size=5452002 overhead=10996
> >     osd.10 onodes=195502 db_used_bytes=2395996160 avg_obj_size=6013645 overhead=12255
> >     osd.30 onodes=186172 db_used_bytes=2393899008 avg_obj_size=6359453 overhead=12858
> >     osd.1 onodes=169911 db_used_bytes=1799356416 avg_obj_size=4890883 overhead=10589
> >     osd.0 onodes=199658 db_used_bytes=2028994560 avg_obj_size=4835928 overhead=10162
> >     osd.15 onodes=204015 db_used_bytes=2384461824 avg_obj_size=5722715 overhead=11687
> >
> >     # Cluster #2
> >     osd.1 onodes=221735 db_used_bytes=2773483520 avg_obj_size=5742992 overhead_per_obj=12508
> >     osd.0 onodes=196817 db_used_bytes=2651848704 avg_obj_size=6454248 overhead_per_obj=13473
> >     osd.3 onodes=212401 db_used_bytes=2745171968 avg_obj_size=6004150 overhead_per_obj=12924
> >     osd.2 onodes=185757 db_used_bytes=3567255552 avg_obj_size=5359974 overhead_per_obj=19203
> >     osd.5 onodes=198822 db_used_bytes=3033530368 avg_obj_size=6765679 overhead_per_obj=15257
> >     osd.4 onodes=161142 db_used_bytes=2136997888 avg_obj_size=6377323 overhead_per_obj=13261
> >     osd.7 onodes=158951 db_used_bytes=1836056576 avg_obj_size=5247527 overhead_per_obj=11551
> >     osd.6 onodes=178874 db_used_bytes=2542796800 avg_obj_size=6539688 overhead_per_obj=14215
> >     osd.9 onodes=195166 db_used_bytes=2538602496 avg_obj_size=6237672 overhead_per_obj=13007
> >     osd.8 onodes=203946 db_used_bytes=3279945728 avg_obj_size=6523555 overhead_per_obj=16082
> >
> >     # Cluster 3
> >     osd.133 onodes=68558 db_used_bytes=15868100608 avg_obj_size=14743206 overhead_per_obj=231455
> >     osd.132 onodes=60164 db_used_bytes=13911457792 avg_obj_size=14539445 overhead_per_obj=231225
> >     osd.137 onodes=62259 db_used_bytes=15597568000 avg_obj_size=15138484 overhead_per_obj=250527
> >     osd.136 onodes=70361 db_used_bytes=14540603392 avg_obj_size=13729154 overhead_per_obj=206657
> >     osd.135 onodes=68003 db_used_bytes=12285116416 avg_obj_size=12877744 overhead_per_obj=180655
> >     osd.134 onodes=64962 db_used_bytes=14056161280 avg_obj_size=15923550 overhead_per_obj=216375
> >     osd.139 onodes=68016 db_used_bytes=20782776320 avg_obj_size=13619345 overhead_per_obj=305557
> >     osd.138 onodes=66209 db_used_bytes=12850298880 avg_obj_size=14593418 overhead_per_obj=194086
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
