Re: [ceph-users] Bluestore performance 50% of filestore
Just a couple of points. There is no way you can be writing over 7000 IOPS to 27x 7200rpm disks at a replica level of 3. As Mark has suggested, with a 1GB test file you are only touching a tiny area on each physical disk, so you are probably getting a combination of short stroking from the disks and Filestore/XFS buffering up your writes, coalescing them and actually writing a lot less out to the disks than the benchmark suggests.

I'm not 100% sure how the allocations work in Bluestore, especially when it comes to overwriting with tiny 4kB objects, but I'm wondering if Bluestore is starting to spread the data out further across the disk, so you lose some of the benefit of short stroking. There may be other factors coming into play with the deferred writes, which were implemented/fixed after the investigation Mark mentioned. The simple reproducer at the time was to coalesce a stream of small sequential writes; the scenario where a large number of small random writes potentially cover the same small area was not tested.

I would suggest trying fio with the librbd engine directly, creating an RBD of around a TB in size, to rule out any disk-locality issues first. If that brings the figures more into line, then that could potentially steer the investigation towards why Bluestore struggles to coalesce as well as the Linux filesystem stack does.

Nick

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Milanov, Radoslav Nikiforov
> Sent: 17 November 2017 22:56
> To: Mark Nelson <mnel...@redhat.com>; David Turner <drakonst...@gmail.com>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Bluestore performance 50% of filestore
>
> Here's some more results. I'm reading that 12.2.2 will have performance improvements for bluestore and should be released soon?
>
> Iodepth=not specified
> Filestore
> write: io=3511.9MB, bw=19978KB/s, iops=4994, runt=180001msec
> write: io=3525.6MB, bw=20057KB/s, iops=5014, runt=180001msec
> write: io=3554.1MB, bw=20222KB/s, iops=5055, runt=180016msec
>
> read : io=1995.7MB, bw=11353KB/s, iops=2838, runt=180001msec
> read : io=1824.5MB, bw=10379KB/s, iops=2594, runt=180001msec
> read : io=1966.5MB, bw=11187KB/s, iops=2796, runt=180001msec
>
> Bluestore
> write: io=1621.2MB, bw=9222.3KB/s, iops=2305, runt=180002msec
> write: io=1576.3MB, bw=8965.6KB/s, iops=2241, runt=180029msec
> write: io=1531.9MB, bw=8714.3KB/s, iops=2178, runt=180001msec
>
> read : io=1279.4MB, bw=7276.5KB/s, iops=1819, runt=180006msec
> read : io=773824KB, bw=4298.9KB/s, iops=1074, runt=180010msec
> read : io=1018.5MB, bw=5793.7KB/s, iops=1448, runt=180001msec
>
> Iodepth=10
> Filestore
> write: io=5045.1MB, bw=28706KB/s, iops=7176, runt=180001msec
> write: io=4764.7MB, bw=27099KB/s, iops=6774, runt=180021msec
> write: io=4626.2MB, bw=26318KB/s, iops=6579, runt=180031msec
>
> read : io=1745.3MB, bw=9928.6KB/s, iops=2482, runt=180001msec
> read : io=1933.7MB, bw=11000KB/s, iops=2749, runt=180001msec
> read : io=1952.7MB, bw=11108KB/s, iops=2777, runt=180001msec
>
> Bluestore
> write: io=1578.8MB, bw=8980.9KB/s, iops=2245, runt=180006msec
> write: io=1583.9MB, bw=9010.2KB/s, iops=2252, runt=180002msec
> write: io=1591.5MB, bw=9050.9KB/s, iops=2262, runt=180009msec
>
> read : io=412104KB, bw=2289.5KB/s, iops=572, runt=180002msec
> read : io=718108KB, bw=3989.5KB/s, iops=997, runt=180003msec
> read : io=968388KB, bw=5379.7KB/s, iops=1344, runt=180009msec
>
> Iodepth=20
> Filestore
> write: io=4671.2MB, bw=26574KB/s, iops=6643, runt=180001msec
> write: io=4583.4MB, bw=26066KB/s, iops=6516, runt=180054msec
> write: io=4641.6MB, bw=26347KB/s, iops=6586, runt=180395msec
>
> read : io=2094.3MB, bw=11914KB/s, iops=2978, runt=180001msec
> read : io=1997.6MB, bw=11364KB/s, iops=2840, runt=180001msec
> read : io=2028.4MB, bw=11539KB/s, iops=2884, runt=180001msec
>
> Bluestore
> write: io=1595.8MB, bw=9078.2KB/s, iops=2269, runt=180001msec
> write: io=1596.2MB, bw=9080.6KB/s, iops=2270, runt=180001msec
> write: io=1588.3MB, bw=9035.4KB/s, iops=2258, runt=180002msec
>
> read : io=1126.9MB, bw=6410.5KB/s, iops=1602, runt=180004msec
> read : io=1282.4MB, bw=7295.3KB/s, iops=1823, runt=180003msec
> read : io=1380.9MB, bw=7854.1KB/s, iops=1963, runt=180007msec
>
> - Rado
>
> -Original Message-
> From: Mark Nelson [mailto:mnel...@redhat.com]
> Sent: Thursday, November 16, 2017 2:04 PM
> To: Milanov, Radoslav Nikiforov <rad...@bu.edu>; David Turner <drakonst...@gmail.com>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Bluestore performance 50% of filestore
>
> It depends on what you expect your typical workload to be like.
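A sketch of the librbd-direct test Nick describes, assuming a dedicated ~1TB test image; the pool name, image name and client name below are placeholders rather than values from the thread, and the size suffix syntax may need adjusting on older rbd clients:

  rbd create rbd/fio-test --size 1T
  fio --name=rbd-direct --ioengine=rbd --clientname=admin --pool=rbd --rbdname=fio-test \
      --direct=1 --rw=randwrite --bs=4k --iodepth=32 \
      --time_based --runtime=180 --group_reporting

Because this drives librbd directly, it takes the VM, the guest filesystem and any guest-side caching out of the picture, and a 1TB image defeats the short-stroking effect of a 1GB file.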
Re: [ceph-users] Bluestore performance 50% of filestore
Here's some more results. I'm reading that 12.2.2 will have performance improvements for bluestore and should be released soon?

Iodepth=not specified
Filestore
write: io=3511.9MB, bw=19978KB/s, iops=4994, runt=180001msec
write: io=3525.6MB, bw=20057KB/s, iops=5014, runt=180001msec
write: io=3554.1MB, bw=20222KB/s, iops=5055, runt=180016msec

read : io=1995.7MB, bw=11353KB/s, iops=2838, runt=180001msec
read : io=1824.5MB, bw=10379KB/s, iops=2594, runt=180001msec
read : io=1966.5MB, bw=11187KB/s, iops=2796, runt=180001msec

Bluestore
write: io=1621.2MB, bw=9222.3KB/s, iops=2305, runt=180002msec
write: io=1576.3MB, bw=8965.6KB/s, iops=2241, runt=180029msec
write: io=1531.9MB, bw=8714.3KB/s, iops=2178, runt=180001msec

read : io=1279.4MB, bw=7276.5KB/s, iops=1819, runt=180006msec
read : io=773824KB, bw=4298.9KB/s, iops=1074, runt=180010msec
read : io=1018.5MB, bw=5793.7KB/s, iops=1448, runt=180001msec

Iodepth=10
Filestore
write: io=5045.1MB, bw=28706KB/s, iops=7176, runt=180001msec
write: io=4764.7MB, bw=27099KB/s, iops=6774, runt=180021msec
write: io=4626.2MB, bw=26318KB/s, iops=6579, runt=180031msec

read : io=1745.3MB, bw=9928.6KB/s, iops=2482, runt=180001msec
read : io=1933.7MB, bw=11000KB/s, iops=2749, runt=180001msec
read : io=1952.7MB, bw=11108KB/s, iops=2777, runt=180001msec

Bluestore
write: io=1578.8MB, bw=8980.9KB/s, iops=2245, runt=180006msec
write: io=1583.9MB, bw=9010.2KB/s, iops=2252, runt=180002msec
write: io=1591.5MB, bw=9050.9KB/s, iops=2262, runt=180009msec

read : io=412104KB, bw=2289.5KB/s, iops=572, runt=180002msec
read : io=718108KB, bw=3989.5KB/s, iops=997, runt=180003msec
read : io=968388KB, bw=5379.7KB/s, iops=1344, runt=180009msec

Iodepth=20
Filestore
write: io=4671.2MB, bw=26574KB/s, iops=6643, runt=180001msec
write: io=4583.4MB, bw=26066KB/s, iops=6516, runt=180054msec
write: io=4641.6MB, bw=26347KB/s, iops=6586, runt=180395msec

read : io=2094.3MB, bw=11914KB/s, iops=2978, runt=180001msec
read : io=1997.6MB, bw=11364KB/s, iops=2840, runt=180001msec
read : io=2028.4MB, bw=11539KB/s, iops=2884, runt=180001msec

Bluestore
write: io=1595.8MB, bw=9078.2KB/s, iops=2269, runt=180001msec
write: io=1596.2MB, bw=9080.6KB/s, iops=2270, runt=180001msec
write: io=1588.3MB, bw=9035.4KB/s, iops=2258, runt=180002msec

read : io=1126.9MB, bw=6410.5KB/s, iops=1602, runt=180004msec
read : io=1282.4MB, bw=7295.3KB/s, iops=1823, runt=180003msec
read : io=1380.9MB, bw=7854.1KB/s, iops=1963, runt=180007msec

- Rado

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: Thursday, November 16, 2017 2:04 PM
To: Milanov, Radoslav Nikiforov <rad...@bu.edu>; David Turner <drakonst...@gmail.com>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

It depends on what you expect your typical workload to be like. Ceph (and distributed storage in general) likes high IO depths so writes can hit all of the drives at the same time. There are tricks (like journals, write-ahead logs, centralized caches, etc.) that can help mitigate this, but I suspect you'll see much better performance with more concurrent writes. Regarding file size, the smaller the file, the more likely those tricks mentioned above are to help you. Based on your results, it appears filestore may be doing a better job of it than bluestore. The question you have to ask is whether or not this kind of test represents what you are likely to see for real on your cluster.
Doing writes over a much larger file, say 3-4x the total amount of RAM in all of the nodes, helps you get a better idea of what the behavior is like when those tricks are less effective. I think that's probably a more likely scenario in most production environments, but it's up to you which workload you think better represents what you are going to see in practice.

A while back Nick Fisk showed some results where bluestore was slower than filestore at small sync writes, and it could be that we simply have more work to do in this area. On the other hand, we pretty consistently see bluestore doing better than filestore with 4k random writes and higher IO depths, which is why I'd be curious to see how it goes if you try that.

Mark

On 11/16/2017 10:11 AM, Milanov, Radoslav Nikiforov wrote:
> No,
> What test parameters (iodepth/file size/numjobs) would make sense for 3 node/27OSD@4TB ?
>
> - Rado
>
> -Original Message-
> From: Mark Nelson [mailto:mnel...@redhat.com]
> Sent: Thursday, November 16, 2017 10:56 AM
> To: Milanov, Radoslav Nikiforov <rad...@bu.edu>; David Turner <drakonst...@gmail.com>
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Bluestore performance 50% of filestore
>
> Did you happen to have a chance to try with a higher io depth?
>
> Mark
>
> On 11/16/2017 09:53 AM, Milanov, Radoslav Nikiforov wrote:
Re: [ceph-users] Bluestore performance 50% of filestore
It depends on what you expect your typical workload to be like. Ceph (and distributed storage in general) likes high IO depths so writes can hit all of the drives at the same time. There are tricks (like journals, write-ahead logs, centralized caches, etc.) that can help mitigate this, but I suspect you'll see much better performance with more concurrent writes. Regarding file size, the smaller the file, the more likely those tricks mentioned above are to help you. Based on your results, it appears filestore may be doing a better job of it than bluestore. The question you have to ask is whether or not this kind of test represents what you are likely to see for real on your cluster.

Doing writes over a much larger file, say 3-4x the total amount of RAM in all of the nodes, helps you get a better idea of what the behavior is like when those tricks are less effective. I think that's probably a more likely scenario in most production environments, but it's up to you which workload you think better represents what you are going to see in practice.

A while back Nick Fisk showed some results where bluestore was slower than filestore at small sync writes, and it could be that we simply have more work to do in this area. On the other hand, we pretty consistently see bluestore doing better than filestore with 4k random writes and higher IO depths, which is why I'd be curious to see how it goes if you try that.

Mark

On 11/16/2017 10:11 AM, Milanov, Radoslav Nikiforov wrote:

No,
What test parameters (iodepth/file size/numjobs) would make sense for 3 node/27OSD@4TB ?

- Rado

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: Thursday, November 16, 2017 10:56 AM
To: Milanov, Radoslav Nikiforov <rad...@bu.edu>; David Turner <drakonst...@gmail.com>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

Did you happen to have a chance to try with a higher io depth?

Mark

On 11/16/2017 09:53 AM, Milanov, Radoslav Nikiforov wrote:

FYI Having 50GB block.db made no difference on the performance.

- Rado

*From:* David Turner [mailto:drakonst...@gmail.com]
*Sent:* Tuesday, November 14, 2017 6:13 PM
*To:* Milanov, Radoslav Nikiforov <rad...@bu.edu>
*Cc:* Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Bluestore performance 50% of filestore

I'd probably say 50GB to leave some extra space over-provisioned. 50GB should definitely prevent any DB operations from spilling over to the HDD.

On Tue, Nov 14, 2017, 5:43 PM Milanov, Radoslav Nikiforov <rad...@bu.edu> wrote:

Thank you,

It is 4TB OSDs and they might become full someday; I'll try a 60GB db partition – this is for the max OSD capacity.

- Rado

*From:* David Turner [mailto:drakonst...@gmail.com]
*Sent:* Tuesday, November 14, 2017 5:38 PM
*To:* Milanov, Radoslav Nikiforov <rad...@bu.edu>
*Cc:* Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Bluestore performance 50% of filestore

You have to configure the size of the db partition in the config file for the cluster. If your db partition is 1GB, then I can all but guarantee that you're using your HDD for your blocks.db very quickly into your testing. There have been multiple threads recently about what size the db partition should be, and it seems to be based on how many objects your OSD is likely to have on it.
The recommendation has been to err on the side of bigger. If you're running 10TB OSDs and anticipate filling them up, then you probably want closer to an 80GB+ db partition. That's why I asked how full your cluster was and how large your HDDs are.

Here's a link to one of the recent ML threads on this topic.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020822.html

On Tue, Nov 14, 2017 at 4:44 PM Milanov, Radoslav Nikiforov <rad...@bu.edu> wrote:

Block-db partition is the default 1GB (is there a way to modify this? journals are 5GB in the filestore case) and usage is low:

[root@kumo-ceph02 ~]# ceph df
GLOBAL:
    SIZE      AVAIL    RAW USED   %RAW USED
    100602G   99146G   1455G      1.45
POOLS:
    NAME            ID   USED     %USED   MAX AVAIL   OBJECTS
    kumo-vms        1    19757M   0.02    31147G      5067
    kumo-volumes    2    214G     0.18    31147G      55248
    kumo-images     3    203G     0.17    31147G      66486
    kumo-vms3       11   45824M   0.04    31147G      11643
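For reference, a sketch of the higher-iodepth, larger-working-set run Mark is suggesting, built from Rado's original command line; the libaio engine, the iodepth value, and the 200G size (several times the aggregate RAM of the 3 nodes) are my assumptions, so scale the size to 3-4x your actual total RAM:

  fio --name fio_big_test --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
      --size=200G --iodepth=32 --numjobs=2 \
      --time_based --runtime=180 --group_reporting

The same run with --rw=randread covers the read side; the point is that a working set this large defeats short stroking and write coalescing, so the result is closer to sustained behavior.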
Re: [ceph-users] Bluestore performance 50% of filestore
No,
What test parameters (iodepth/file size/numjobs) would make sense for 3 node/27OSD@4TB ?

- Rado

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: Thursday, November 16, 2017 10:56 AM
To: Milanov, Radoslav Nikiforov <rad...@bu.edu>; David Turner <drakonst...@gmail.com>
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

Did you happen to have a chance to try with a higher io depth?

Mark

On 11/16/2017 09:53 AM, Milanov, Radoslav Nikiforov wrote:
> FYI
>
> Having 50GB block.db made no difference on the performance.
>
> - Rado
>
> *From:* David Turner [mailto:drakonst...@gmail.com]
> *Sent:* Tuesday, November 14, 2017 6:13 PM
> *To:* Milanov, Radoslav Nikiforov <rad...@bu.edu>
> *Cc:* Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Bluestore performance 50% of filestore
>
> I'd probably say 50GB to leave some extra space over-provisioned.
> 50GB should definitely prevent any DB operations from spilling over to the HDD.
>
> On Tue, Nov 14, 2017, 5:43 PM Milanov, Radoslav Nikiforov <rad...@bu.edu> wrote:
>
> Thank you,
>
> It is 4TB OSDs and they might become full someday; I'll try a 60GB db partition – this is for the max OSD capacity.
>
> - Rado
>
> *From:* David Turner [mailto:drakonst...@gmail.com]
> *Sent:* Tuesday, November 14, 2017 5:38 PM
> *To:* Milanov, Radoslav Nikiforov <rad...@bu.edu>
> *Cc:* Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Bluestore performance 50% of filestore
>
> You have to configure the size of the db partition in the config file for the cluster. If your db partition is 1GB, then I can all but guarantee that you're using your HDD for your blocks.db very quickly into your testing. There have been multiple threads recently about what size the db partition should be, and it seems to be based on how many objects your OSD is likely to have on it. The recommendation has been to err on the side of bigger. If you're running 10TB OSDs and anticipate filling them up, then you probably want closer to an 80GB+ db partition. That's why I asked how full your cluster was and how large your HDDs are.
>
> Here's a link to one of the recent ML threads on this topic.
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020822.html
>
> On Tue, Nov 14, 2017 at 4:44 PM Milanov, Radoslav Nikiforov <rad...@bu.edu> wrote:
>
> Block-db partition is the default 1GB (is there a way to modify this?
> journals are 5GB in the filestore case) and usage is low:
>
> [root@kumo-ceph02 ~]# ceph df
> GLOBAL:
>     SIZE      AVAIL    RAW USED   %RAW USED
>     100602G   99146G   1455G      1.45
> POOLS:
>     NAME            ID   USED     %USED   MAX AVAIL   OBJECTS
>     kumo-vms        1    19757M   0.02    31147G      5067
>     kumo-volumes    2    214G     0.18    31147G      55248
>     kumo-images     3    203G     0.17    31147G      66486
>     kumo-vms3       11   45824M   0.04    31147G      11643
>     kumo-volumes3   13   10837M   0       31147G      2724
>     kumo-images3    15   82450M   0.09    31147G      10320
>
> - Rado
>
> *From:* David Turner [mailto:drakonst...@gmail.com]
> *Sent:* Tuesday, November 14, 2017 4:40 PM
> *To:* Mark Nelson <mnel...@redhat.com>
> *Cc:* Milanov, Radoslav Nikiforov <rad...@bu.edu>; ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Bluestore performance 50% of filestore
>
> How big was your blocks.db partition for each OSD and what size are your HDDs? Also how full is your cluster? It's possible that your blocks.db partition wasn't large enough to hold the entire db and it had to spill over onto the HDD, which would definitely impact performance.
Re: [ceph-users] Bluestore performance 50% of filestore
Did you happen to have a chance to try with a higher io depth?

Mark

On 11/16/2017 09:53 AM, Milanov, Radoslav Nikiforov wrote:

FYI Having 50GB block.db made no difference on the performance.

- Rado

*From:* David Turner [mailto:drakonst...@gmail.com]
*Sent:* Tuesday, November 14, 2017 6:13 PM
*To:* Milanov, Radoslav Nikiforov <rad...@bu.edu>
*Cc:* Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Bluestore performance 50% of filestore

I'd probably say 50GB to leave some extra space over-provisioned. 50GB should definitely prevent any DB operations from spilling over to the HDD.

On Tue, Nov 14, 2017, 5:43 PM Milanov, Radoslav Nikiforov <rad...@bu.edu> wrote:

Thank you,

It is 4TB OSDs and they might become full someday; I'll try a 60GB db partition – this is for the max OSD capacity.

- Rado

*From:* David Turner [mailto:drakonst...@gmail.com]
*Sent:* Tuesday, November 14, 2017 5:38 PM
*To:* Milanov, Radoslav Nikiforov <rad...@bu.edu>
*Cc:* Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Bluestore performance 50% of filestore

You have to configure the size of the db partition in the config file for the cluster. If your db partition is 1GB, then I can all but guarantee that you're using your HDD for your blocks.db very quickly into your testing. There have been multiple threads recently about what size the db partition should be, and it seems to be based on how many objects your OSD is likely to have on it. The recommendation has been to err on the side of bigger. If you're running 10TB OSDs and anticipate filling them up, then you probably want closer to an 80GB+ db partition. That's why I asked how full your cluster was and how large your HDDs are.

Here's a link to one of the recent ML threads on this topic.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020822.html

On Tue, Nov 14, 2017 at 4:44 PM Milanov, Radoslav Nikiforov <rad...@bu.edu> wrote:

Block-db partition is the default 1GB (is there a way to modify this? journals are 5GB in the filestore case) and usage is low:

[root@kumo-ceph02 ~]# ceph df
GLOBAL:
    SIZE      AVAIL    RAW USED   %RAW USED
    100602G   99146G   1455G      1.45
POOLS:
    NAME            ID   USED     %USED   MAX AVAIL   OBJECTS
    kumo-vms        1    19757M   0.02    31147G      5067
    kumo-volumes    2    214G     0.18    31147G      55248
    kumo-images     3    203G     0.17    31147G      66486
    kumo-vms3       11   45824M   0.04    31147G      11643
    kumo-volumes3   13   10837M   0       31147G      2724
    kumo-images3    15   82450M   0.09    31147G      10320

- Rado

*From:* David Turner [mailto:drakonst...@gmail.com]
*Sent:* Tuesday, November 14, 2017 4:40 PM
*To:* Mark Nelson <mnel...@redhat.com>
*Cc:* Milanov, Radoslav Nikiforov <rad...@bu.edu>; ceph-users@lists.ceph.com
*Subject:* Re: [ceph-users] Bluestore performance 50% of filestore

How big was your blocks.db partition for each OSD and what size are your HDDs? Also how full is your cluster? It's possible that your blocks.db partition wasn't large enough to hold the entire db and it had to spill over onto the HDD, which would definitely impact performance.

On Tue, Nov 14, 2017 at 4:36 PM Mark Nelson <mnel...@redhat.com> wrote:

How big were the writes in the windows test and how much concurrency was there?
Historically bluestore does pretty well for us with small random writes, so your write results surprise me a bit. I suspect it's the low queue depth. Sometimes bluestore does worse with reads, especially if readahead isn't enabled on the client.

Mark

On 11/14/2017 03:14 PM, Milanov, Radoslav Nikiforov wrote:
> Hi Mark,
> Yes RBD is in write back, and the only thing that changed was converting OSDs to bluestore. It is 7200 rpm drives and triple replication. I also get same results (bluestore 2 times slower) testing continuous writes on a 40GB partition on a Windows VM, completely different tool.
Re: [ceph-users] Bluestore performance 50% of filestore
FYI Having 50GB block.db made no difference on the performance.

- Rado

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Tuesday, November 14, 2017 6:13 PM
To: Milanov, Radoslav Nikiforov <rad...@bu.edu>
Cc: Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

I'd probably say 50GB to leave some extra space over-provisioned. 50GB should definitely prevent any DB operations from spilling over to the HDD.

On Tue, Nov 14, 2017, 5:43 PM Milanov, Radoslav Nikiforov <rad...@bu.edu> wrote:

Thank you,

It is 4TB OSDs and they might become full someday; I'll try a 60GB db partition – this is for the max OSD capacity.

- Rado

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Tuesday, November 14, 2017 5:38 PM
To: Milanov, Radoslav Nikiforov <rad...@bu.edu>
Cc: Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

You have to configure the size of the db partition in the config file for the cluster. If your db partition is 1GB, then I can all but guarantee that you're using your HDD for your blocks.db very quickly into your testing. There have been multiple threads recently about what size the db partition should be, and it seems to be based on how many objects your OSD is likely to have on it. The recommendation has been to err on the side of bigger. If you're running 10TB OSDs and anticipate filling them up, then you probably want closer to an 80GB+ db partition. That's why I asked how full your cluster was and how large your HDDs are.

Here's a link to one of the recent ML threads on this topic.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020822.html

On Tue, Nov 14, 2017 at 4:44 PM Milanov, Radoslav Nikiforov <rad...@bu.edu> wrote:

Block-db partition is the default 1GB (is there a way to modify this? journals are 5GB in the filestore case) and usage is low:

[root@kumo-ceph02 ~]# ceph df
GLOBAL:
    SIZE      AVAIL    RAW USED   %RAW USED
    100602G   99146G   1455G      1.45
POOLS:
    NAME            ID   USED     %USED   MAX AVAIL   OBJECTS
    kumo-vms        1    19757M   0.02    31147G      5067
    kumo-volumes    2    214G     0.18    31147G      55248
    kumo-images     3    203G     0.17    31147G      66486
    kumo-vms3       11   45824M   0.04    31147G      11643
    kumo-volumes3   13   10837M   0       31147G      2724
    kumo-images3    15   82450M   0.09    31147G      10320

- Rado

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Tuesday, November 14, 2017 4:40 PM
To: Mark Nelson <mnel...@redhat.com>
Cc: Milanov, Radoslav Nikiforov <rad...@bu.edu>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

How big was your blocks.db partition for each OSD and what size are your HDDs? Also how full is your cluster? It's possible that your blocks.db partition wasn't large enough to hold the entire db and it had to spill over onto the HDD, which would definitely impact performance.

On Tue, Nov 14, 2017 at 4:36 PM Mark Nelson <mnel...@redhat.com> wrote:

How big were the writes in the windows test and how much concurrency was there?

Historically bluestore does pretty well for us with small random writes, so your write results surprise me a bit. I suspect it's the low queue depth. Sometimes bluestore does worse with reads, especially if readahead isn't enabled on the client.
Mark

On 11/14/2017 03:14 PM, Milanov, Radoslav Nikiforov wrote:
> Hi Mark,
> Yes RBD is in write back, and the only thing that changed was converting OSDs to bluestore. It is 7200 rpm drives and triple replication. I also get same results (bluestore 2 times slower) testing continuous writes on a 40GB partition on a Windows VM, completely different tool.
>
> Right now I'm going back to filestore for the OSDs so additional tests are possible if that helps.
>
> - Rado
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
> Sent: Tuesday, November 14, 2017 4:04 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Bluestore performance 50% of filestore
>
> Hi Radoslav,
>
> Is RBD cache enabled and in writeback mode? Do you have client side readahead?
Re: [ceph-users] Bluestore performance 50% of filestore
On 2017-11-14 21:54, Milanov, Radoslav Nikiforov wrote:
> Hi
>
> We have a 3-node, 27-OSD cluster running Luminous 12.2.1
>
> In the filestore configuration there are 3 SSDs used for journals of 9 OSDs on each host (1 SSD has 3 journal partitions for 3 OSDs).
>
> I've converted filestore to bluestore by wiping 1 host at a time and waiting for recovery. SSDs now contain block-db - again one SSD serving 3 OSDs.
>
> Cluster is used as storage for Openstack.
>
> Running fio on a VM in that Openstack reveals bluestore performance almost twice slower than filestore.
>
> fio --name fio_test_file --direct=1 --rw=randwrite --bs=4k --size=1G --numjobs=2 --time_based --runtime=180 --group_reporting
>
> fio --name fio_test_file --direct=1 --rw=randread --bs=4k --size=1G --numjobs=2 --time_based --runtime=180 --group_reporting
>
> Filestore
> write: io=3511.9MB, bw=19978KB/s, iops=4994, runt=180001msec
> write: io=3525.6MB, bw=20057KB/s, iops=5014, runt=180001msec
> write: io=3554.1MB, bw=20222KB/s, iops=5055, runt=180016msec
>
> read : io=1995.7MB, bw=11353KB/s, iops=2838, runt=180001msec
> read : io=1824.5MB, bw=10379KB/s, iops=2594, runt=180001msec
> read : io=1966.5MB, bw=11187KB/s, iops=2796, runt=180001msec
>
> Bluestore
> write: io=1621.2MB, bw=9222.3KB/s, iops=2305, runt=180002msec
> write: io=1576.3MB, bw=8965.6KB/s, iops=2241, runt=180029msec
> write: io=1531.9MB, bw=8714.3KB/s, iops=2178, runt=180001msec
>
> read : io=1279.4MB, bw=7276.5KB/s, iops=1819, runt=180006msec
> read : io=773824KB, bw=4298.9KB/s, iops=1074, runt=180010msec
> read : io=1018.5MB, bw=5793.7KB/s, iops=1448, runt=180001msec
>
> - Rado
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

It would be useful to see how this filestore edge holds up when you increase your queue depth (threads/jobs), for example to 32 or 64; that would represent a more practical load. I can see an extreme case where, if you have a cluster with a large number of OSDs and only 1 client thread, filestore may be faster: in that case, when the client IO hits an OSD, the OSD is not as busy syncing its journal to HDD (which it is under normal load). But again, this is not a practical setup.

/Maged

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
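A sketch of the higher-parallelism run Maged suggests, derived from Rado's original command line; the libaio engine and the iodepth value are my additions (the original command used fio's default synchronous engine, where iodepth effectively stays at 1 per job):

  fio --name fio_test_file --ioengine=libaio --direct=1 --rw=randwrite --bs=4k \
      --size=1G --iodepth=64 --numjobs=2 \
      --time_based --runtime=180 --group_reporting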
Re: [ceph-users] Bluestore performance 50% of filestore
I'd probably say 50GB to leave some extra space over-provisioned. 50GB should definitely prevent any DB operations from spilling over to the HDD.

On Tue, Nov 14, 2017, 5:43 PM Milanov, Radoslav Nikiforov <rad...@bu.edu> wrote:

> Thank you,
>
> It is 4TB OSDs and they might become full someday; I'll try a 60GB db partition – this is for the max OSD capacity.
>
> - Rado
>
> *From:* David Turner [mailto:drakonst...@gmail.com]
> *Sent:* Tuesday, November 14, 2017 5:38 PM
> *To:* Milanov, Radoslav Nikiforov <rad...@bu.edu>
> *Cc:* Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Bluestore performance 50% of filestore
>
> You have to configure the size of the db partition in the config file for the cluster. If your db partition is 1GB, then I can all but guarantee that you're using your HDD for your blocks.db very quickly into your testing. There have been multiple threads recently about what size the db partition should be, and it seems to be based on how many objects your OSD is likely to have on it. The recommendation has been to err on the side of bigger. If you're running 10TB OSDs and anticipate filling them up, then you probably want closer to an 80GB+ db partition. That's why I asked how full your cluster was and how large your HDDs are.
>
> Here's a link to one of the recent ML threads on this topic.
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020822.html
>
> On Tue, Nov 14, 2017 at 4:44 PM Milanov, Radoslav Nikiforov <rad...@bu.edu> wrote:
>
> Block-db partition is the default 1GB (is there a way to modify this? journals are 5GB in the filestore case) and usage is low:
>
> [root@kumo-ceph02 ~]# ceph df
> GLOBAL:
>     SIZE      AVAIL    RAW USED   %RAW USED
>     100602G   99146G   1455G      1.45
> POOLS:
>     NAME            ID   USED     %USED   MAX AVAIL   OBJECTS
>     kumo-vms        1    19757M   0.02    31147G      5067
>     kumo-volumes    2    214G     0.18    31147G      55248
>     kumo-images     3    203G     0.17    31147G      66486
>     kumo-vms3       11   45824M   0.04    31147G      11643
>     kumo-volumes3   13   10837M   0       31147G      2724
>     kumo-images3    15   82450M   0.09    31147G      10320
>
> - Rado
>
> *From:* David Turner [mailto:drakonst...@gmail.com]
> *Sent:* Tuesday, November 14, 2017 4:40 PM
> *To:* Mark Nelson <mnel...@redhat.com>
> *Cc:* Milanov, Radoslav Nikiforov <rad...@bu.edu>; ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Bluestore performance 50% of filestore
>
> How big was your blocks.db partition for each OSD and what size are your HDDs? Also how full is your cluster? It's possible that your blocks.db partition wasn't large enough to hold the entire db and it had to spill over onto the HDD, which would definitely impact performance.
>
> On Tue, Nov 14, 2017 at 4:36 PM Mark Nelson <mnel...@redhat.com> wrote:
>
> How big were the writes in the windows test and how much concurrency was there?
>
> Historically bluestore does pretty well for us with small random writes, so your write results surprise me a bit. I suspect it's the low queue depth. Sometimes bluestore does worse with reads, especially if readahead isn't enabled on the client.
>
> Mark
>
> On 11/14/2017 03:14 PM, Milanov, Radoslav Nikiforov wrote:
> > Hi Mark,
> > Yes RBD is in write back, and the only thing that changed was converting OSDs to bluestore. It is 7200 rpm drives and triple replication. I also get same results (bluestore 2 times slower) testing continuous writes on a 40GB partition on a Windows VM, completely different tool.
> >
> > Right now I'm going back to filestore for the OSDs so additional tests are possible if that helps.
> >
> > - Rado
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
> > Sent: Tuesday, November 14, 2017 4:04 PM
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Bluestore performance 50% of filestore
> >
> > Hi Radoslav,
> >
> > Is RBD cache enabled and in writeback mode? Do you have client side readahead?
> >
> > Both are doing better for writes than you'd expect from the native performance of the disks assuming they are typical 7200RPM drives and you are using 3X replication (~150IOPS * 27 / 3 = ~1350 IOPS).
Re: [ceph-users] Bluestore performance 50% of filestore
Thank you,

It is 4TB OSDs and they might become full someday; I'll try a 60GB db partition – this is for the max OSD capacity.

- Rado

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Tuesday, November 14, 2017 5:38 PM
To: Milanov, Radoslav Nikiforov <rad...@bu.edu>
Cc: Mark Nelson <mnel...@redhat.com>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

You have to configure the size of the db partition in the config file for the cluster. If your db partition is 1GB, then I can all but guarantee that you're using your HDD for your blocks.db very quickly into your testing. There have been multiple threads recently about what size the db partition should be, and it seems to be based on how many objects your OSD is likely to have on it. The recommendation has been to err on the side of bigger. If you're running 10TB OSDs and anticipate filling them up, then you probably want closer to an 80GB+ db partition. That's why I asked how full your cluster was and how large your HDDs are.

Here's a link to one of the recent ML threads on this topic.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020822.html

On Tue, Nov 14, 2017 at 4:44 PM Milanov, Radoslav Nikiforov <rad...@bu.edu> wrote:

Block-db partition is the default 1GB (is there a way to modify this? journals are 5GB in the filestore case) and usage is low:

[root@kumo-ceph02 ~]# ceph df
GLOBAL:
    SIZE      AVAIL    RAW USED   %RAW USED
    100602G   99146G   1455G      1.45
POOLS:
    NAME            ID   USED     %USED   MAX AVAIL   OBJECTS
    kumo-vms        1    19757M   0.02    31147G      5067
    kumo-volumes    2    214G     0.18    31147G      55248
    kumo-images     3    203G     0.17    31147G      66486
    kumo-vms3       11   45824M   0.04    31147G      11643
    kumo-volumes3   13   10837M   0       31147G      2724
    kumo-images3    15   82450M   0.09    31147G      10320

- Rado

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Tuesday, November 14, 2017 4:40 PM
To: Mark Nelson <mnel...@redhat.com>
Cc: Milanov, Radoslav Nikiforov <rad...@bu.edu>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

How big was your blocks.db partition for each OSD and what size are your HDDs? Also how full is your cluster? It's possible that your blocks.db partition wasn't large enough to hold the entire db and it had to spill over onto the HDD, which would definitely impact performance.

On Tue, Nov 14, 2017 at 4:36 PM Mark Nelson <mnel...@redhat.com> wrote:

How big were the writes in the windows test and how much concurrency was there?

Historically bluestore does pretty well for us with small random writes, so your write results surprise me a bit. I suspect it's the low queue depth. Sometimes bluestore does worse with reads, especially if readahead isn't enabled on the client.

Mark

On 11/14/2017 03:14 PM, Milanov, Radoslav Nikiforov wrote:
> Hi Mark,
> Yes RBD is in write back, and the only thing that changed was converting OSDs to bluestore. It is 7200 rpm drives and triple replication. I also get same results (bluestore 2 times slower) testing continuous writes on a 40GB partition on a Windows VM, completely different tool.
>
> Right now I'm going back to filestore for the OSDs so additional tests are possible if that helps.
>
> - Rado
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
> Sent: Tuesday, November 14, 2017 4:04 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Bluestore performance 50% of filestore
>
> Hi Radoslav,
>
> Is RBD cache enabled and in writeback mode? Do you have client side readahead?
>
> Both are doing better for writes than you'd expect from the native performance of the disks assuming they are typical 7200RPM drives and you are using 3X replication (~150IOPS * 27 / 3 = ~1350 IOPS). Given the small file size, I'd expect that you might be getting better journal coalescing in filestore.
>
> Sadly I imagine you can't do a comparison test at this point, but I'd be curious how it would look if you used libaio with a high iodepth and a much bigger partition to do random writes over.
>
> Mark
>
> On 11/14/2017 01:54 PM, Milanov, Radoslav Nikiforov wrote:
>> Hi
>>
>> We have a 3-node, 27-OSD cluster running Luminous 12.2.1
>>
Re: [ceph-users] Bluestore performance 50% of filestore
You have to configure the size of the db partition in the config file for the cluster. If your db partition is 1GB, then I can all but guarantee that you're using your HDD for your blocks.db very quickly into your testing. There have been multiple threads recently about what size the db partition should be, and it seems to be based on how many objects your OSD is likely to have on it. The recommendation has been to err on the side of bigger. If you're running 10TB OSDs and anticipate filling them up, then you probably want closer to an 80GB+ db partition. That's why I asked how full your cluster was and how large your HDDs are.

Here's a link to one of the recent ML threads on this topic.
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-September/020822.html

On Tue, Nov 14, 2017 at 4:44 PM Milanov, Radoslav Nikiforov <rad...@bu.edu> wrote:

> Block-db partition is the default 1GB (is there a way to modify this? journals are 5GB in the filestore case) and usage is low:
>
> [root@kumo-ceph02 ~]# ceph df
> GLOBAL:
>     SIZE      AVAIL    RAW USED   %RAW USED
>     100602G   99146G   1455G      1.45
> POOLS:
>     NAME            ID   USED     %USED   MAX AVAIL   OBJECTS
>     kumo-vms        1    19757M   0.02    31147G      5067
>     kumo-volumes    2    214G     0.18    31147G      55248
>     kumo-images     3    203G     0.17    31147G      66486
>     kumo-vms3       11   45824M   0.04    31147G      11643
>     kumo-volumes3   13   10837M   0       31147G      2724
>     kumo-images3    15   82450M   0.09    31147G      10320
>
> - Rado
>
> *From:* David Turner [mailto:drakonst...@gmail.com]
> *Sent:* Tuesday, November 14, 2017 4:40 PM
> *To:* Mark Nelson <mnel...@redhat.com>
> *Cc:* Milanov, Radoslav Nikiforov <rad...@bu.edu>; ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Bluestore performance 50% of filestore
>
> How big was your blocks.db partition for each OSD and what size are your HDDs? Also how full is your cluster? It's possible that your blocks.db partition wasn't large enough to hold the entire db and it had to spill over onto the HDD, which would definitely impact performance.
>
> On Tue, Nov 14, 2017 at 4:36 PM Mark Nelson <mnel...@redhat.com> wrote:
>
> How big were the writes in the windows test and how much concurrency was there?
>
> Historically bluestore does pretty well for us with small random writes, so your write results surprise me a bit. I suspect it's the low queue depth. Sometimes bluestore does worse with reads, especially if readahead isn't enabled on the client.
>
> Mark
>
> On 11/14/2017 03:14 PM, Milanov, Radoslav Nikiforov wrote:
> > Hi Mark,
> > Yes RBD is in write back, and the only thing that changed was converting OSDs to bluestore. It is 7200 rpm drives and triple replication. I also get same results (bluestore 2 times slower) testing continuous writes on a 40GB partition on a Windows VM, completely different tool.
> >
> > Right now I'm going back to filestore for the OSDs so additional tests are possible if that helps.
> >
> > - Rado
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
> > Sent: Tuesday, November 14, 2017 4:04 PM
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Bluestore performance 50% of filestore
> >
> > Hi Radoslav,
> >
> > Is RBD cache enabled and in writeback mode? Do you have client side readahead?
> >
> > Both are doing better for writes than you'd expect from the native performance of the disks assuming they are typical 7200RPM drives and you are using 3X replication (~150IOPS * 27 / 3 = ~1350 IOPS).
Given the small file size, I'd expect that you might be getting better journal coalescing in filestore.
> >
> > Sadly I imagine you can't do a comparison test at this point, but I'd be curious how it would look if you used libaio with a high iodepth and a much bigger partition to do random writes over.
> >
> > Mark
> >
> > On 11/14/2017 01:54 PM, Milanov, Radoslav Nikiforov wrote:
> >> Hi
> >>
> >> We have a 3-node, 27-OSD cluster running Luminous 12.2.1
> >>
> >> In the filestore configuration there are 3 SSDs used for journals of 9 OSDs on each host (1 SSD has 3 journal partitions for 3 OSDs).
> >>
> >> I've converted filestore to bluestore by wiping 1 host at a time and waiting for recovery.
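For reference, the config-file knob David describes; a sketch of a ceph.conf snippet, assuming the option is read by ceph-disk/ceph-volume at OSD creation time (the value is in bytes, ~50GB here):

  [osd]
  # size of the block.db partition for newly provisioned bluestore OSDs
  bluestore_block_db_size = 53687091200

Existing OSDs keep their 1GB partition; they would have to be re-provisioned to pick this up.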
Re: [ceph-users] Bluestore performance 50% of filestore
Block-db partition is the default 1GB (is there a way to modify this? journals are 5GB in the filestore case) and usage is low:

[root@kumo-ceph02 ~]# ceph df
GLOBAL:
    SIZE      AVAIL    RAW USED   %RAW USED
    100602G   99146G   1455G      1.45
POOLS:
    NAME            ID   USED     %USED   MAX AVAIL   OBJECTS
    kumo-vms        1    19757M   0.02    31147G      5067
    kumo-volumes    2    214G     0.18    31147G      55248
    kumo-images     3    203G     0.17    31147G      66486
    kumo-vms3       11   45824M   0.04    31147G      11643
    kumo-volumes3   13   10837M   0       31147G      2724
    kumo-images3    15   82450M   0.09    31147G      10320

- Rado

From: David Turner [mailto:drakonst...@gmail.com]
Sent: Tuesday, November 14, 2017 4:40 PM
To: Mark Nelson <mnel...@redhat.com>
Cc: Milanov, Radoslav Nikiforov <rad...@bu.edu>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

How big was your blocks.db partition for each OSD and what size are your HDDs? Also how full is your cluster? It's possible that your blocks.db partition wasn't large enough to hold the entire db and it had to spill over onto the HDD, which would definitely impact performance.

On Tue, Nov 14, 2017 at 4:36 PM Mark Nelson <mnel...@redhat.com> wrote:

How big were the writes in the windows test and how much concurrency was there?

Historically bluestore does pretty well for us with small random writes, so your write results surprise me a bit. I suspect it's the low queue depth. Sometimes bluestore does worse with reads, especially if readahead isn't enabled on the client.

Mark

On 11/14/2017 03:14 PM, Milanov, Radoslav Nikiforov wrote:
> Hi Mark,
> Yes RBD is in write back, and the only thing that changed was converting OSDs to bluestore. It is 7200 rpm drives and triple replication. I also get same results (bluestore 2 times slower) testing continuous writes on a 40GB partition on a Windows VM, completely different tool.
>
> Right now I'm going back to filestore for the OSDs so additional tests are possible if that helps.
>
> - Rado
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
> Sent: Tuesday, November 14, 2017 4:04 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Bluestore performance 50% of filestore
>
> Hi Radoslav,
>
> Is RBD cache enabled and in writeback mode? Do you have client side readahead?
>
> Both are doing better for writes than you'd expect from the native performance of the disks assuming they are typical 7200RPM drives and you are using 3X replication (~150IOPS * 27 / 3 = ~1350 IOPS). Given the small file size, I'd expect that you might be getting better journal coalescing in filestore.
>
> Sadly I imagine you can't do a comparison test at this point, but I'd be curious how it would look if you used libaio with a high iodepth and a much bigger partition to do random writes over.
>
> Mark
>
> On 11/14/2017 01:54 PM, Milanov, Radoslav Nikiforov wrote:
>> Hi
>>
>> We have a 3-node, 27-OSD cluster running Luminous 12.2.1
>>
>> In the filestore configuration there are 3 SSDs used for journals of 9 OSDs on each host (1 SSD has 3 journal partitions for 3 OSDs).
>>
>> I've converted filestore to bluestore by wiping 1 host at a time and waiting for recovery. SSDs now contain block-db - again one SSD serving 3 OSDs.
>>
>> Cluster is used as storage for Openstack.
>>
>> Running fio on a VM in that Openstack reveals bluestore performance almost twice slower than filestore.
>>
>> fio --name fio_test_file --direct=1 --rw=randwrite --bs=4k --size=1G --numjobs=2 --time_based --runtime=180 --group_reporting
>>
>> fio --name fio_test_file --direct=1 --rw=randread --bs=4k --size=1G --numjobs=2 --time_based --runtime=180 --group_reporting
>>
>> Filestore
>> write: io=3511.9MB, bw=19978KB/s, iops=4994, runt=180001msec
>> write: io=3525.6MB, bw=20057KB/s, iops=5014, runt=180001msec
>> write: io=3554.1MB, bw=20222KB/s, iops=5055, runt=180016msec
>>
>> read : io=1995.7MB, bw=11353KB/s, iops=2838, runt=180001msec
>> read : io=1824.5MB, bw=10379KB/s, iops=2594, runt=180001msec
>> read : io=1966.5MB, bw=11187KB/s, iops=2796, runt=180001msec
>>
Re: [ceph-users] Bluestore performance 50% of filestore
16 MB block, single thread, sequential writes; this is the result:

[inline image attachment: image001.emz]

- Rado

-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: Tuesday, November 14, 2017 4:36 PM
To: Milanov, Radoslav Nikiforov <rad...@bu.edu>; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

How big were the writes in the windows test and how much concurrency was there?

Historically bluestore does pretty well for us with small random writes, so your write results surprise me a bit. I suspect it's the low queue depth. Sometimes bluestore does worse with reads, especially if readahead isn't enabled on the client.

Mark

On 11/14/2017 03:14 PM, Milanov, Radoslav Nikiforov wrote:
> Hi Mark,
> Yes RBD is in write back, and the only thing that changed was converting OSDs to bluestore. It is 7200 rpm drives and triple replication. I also get same results (bluestore 2 times slower) testing continuous writes on a 40GB partition on a Windows VM, completely different tool.
>
> Right now I'm going back to filestore for the OSDs so additional tests are possible if that helps.
>
> - Rado
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
> Sent: Tuesday, November 14, 2017 4:04 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Bluestore performance 50% of filestore
>
> Hi Radoslav,
>
> Is RBD cache enabled and in writeback mode? Do you have client side readahead?
>
> Both are doing better for writes than you'd expect from the native performance of the disks assuming they are typical 7200RPM drives and you are using 3X replication (~150IOPS * 27 / 3 = ~1350 IOPS). Given the small file size, I'd expect that you might be getting better journal coalescing in filestore.
>
> Sadly I imagine you can't do a comparison test at this point, but I'd be curious how it would look if you used libaio with a high iodepth and a much bigger partition to do random writes over.
>
> Mark
>
> On 11/14/2017 01:54 PM, Milanov, Radoslav Nikiforov wrote:
>> Hi
>>
>> We have a 3-node, 27-OSD cluster running Luminous 12.2.1
>>
>> In the filestore configuration there are 3 SSDs used for journals of 9 OSDs on each host (1 SSD has 3 journal partitions for 3 OSDs).
>>
>> I've converted filestore to bluestore by wiping 1 host at a time and waiting for recovery. SSDs now contain block-db - again one SSD serving 3 OSDs.
>>
>> Cluster is used as storage for Openstack.
>>
>> Running fio on a VM in that Openstack reveals bluestore performance almost twice slower than filestore.
>>
>> fio --name fio_test_file --direct=1 --rw=randwrite --bs=4k --size=1G --numjobs=2 --time_based --runtime=180 --group_reporting
>>
>> fio --name fio_test_file --direct=1 --rw=randread --bs=4k --size=1G --numjobs=2 --time_based --runtime=180 --group_reporting
>>
>> Filestore
>> write: io=3511.9MB, bw=19978KB/s, iops=4994, runt=180001msec
>> write: io=3525.6MB, bw=20057KB/s, iops=5014, runt=180001msec
>> write: io=3554.1MB, bw=20222KB/s, iops=5055, runt=180016msec
>>
>> read : io=1995.7MB, bw=11353KB/s, iops=2838, runt=180001msec
>> read : io=1824.5MB, bw=10379KB/s, iops=2594, runt=180001msec
>> read : io=1966.5MB, bw=11187KB/s, iops=2796, runt=180001msec
>>
>> Bluestore
>> write: io=1621.2MB, bw=9222.3KB/s, iops=2305, runt=180002msec
>> write: io=1576.3MB, bw=8965.6KB/s, iops=2241, runt=180029msec
>> write: io=1531.9MB, bw=8714.3KB/s, iops=2178, runt=180001msec
>>
>> read : io=1279.4MB, bw=7276.5KB/s, iops=1819, runt=180006msec
>> read : io=773824KB, bw=4298.9KB/s, iops=1074, runt=180010msec
>> read : io=1018.5MB, bw=5793.7KB/s, iops=1448, runt=180001msec
>>
>> - Rado
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bluestore performance 50% of filestore
How big was your blocks.db partition for each OSD and what size are your HDDs? Also how full is your cluster? It's possible that your blocks.db partition wasn't large enough to hold the entire db and it had to spill over onto the HDD, which would definitely impact performance.

On Tue, Nov 14, 2017 at 4:36 PM Mark Nelson <mnel...@redhat.com> wrote:

> How big were the writes in the windows test and how much concurrency was there?
>
> Historically bluestore does pretty well for us with small random writes, so your write results surprise me a bit. I suspect it's the low queue depth. Sometimes bluestore does worse with reads, especially if readahead isn't enabled on the client.
>
> Mark
>
> On 11/14/2017 03:14 PM, Milanov, Radoslav Nikiforov wrote:
> > Hi Mark,
> > Yes RBD is in write back, and the only thing that changed was converting OSDs to bluestore. It is 7200 rpm drives and triple replication. I also get same results (bluestore 2 times slower) testing continuous writes on a 40GB partition on a Windows VM, completely different tool.
> >
> > Right now I'm going back to filestore for the OSDs so additional tests are possible if that helps.
> >
> > - Rado
> >
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
> > Sent: Tuesday, November 14, 2017 4:04 PM
> > To: ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Bluestore performance 50% of filestore
> >
> > Hi Radoslav,
> >
> > Is RBD cache enabled and in writeback mode? Do you have client side readahead?
> >
> > Both are doing better for writes than you'd expect from the native performance of the disks assuming they are typical 7200RPM drives and you are using 3X replication (~150IOPS * 27 / 3 = ~1350 IOPS). Given the small file size, I'd expect that you might be getting better journal coalescing in filestore.
> >
> > Sadly I imagine you can't do a comparison test at this point, but I'd be curious how it would look if you used libaio with a high iodepth and a much bigger partition to do random writes over.
> >
> > Mark
> >
> > On 11/14/2017 01:54 PM, Milanov, Radoslav Nikiforov wrote:
> >> Hi
> >>
> >> We have a 3-node, 27-OSD cluster running Luminous 12.2.1
> >>
> >> In the filestore configuration there are 3 SSDs used for journals of 9 OSDs on each host (1 SSD has 3 journal partitions for 3 OSDs).
> >>
> >> I've converted filestore to bluestore by wiping 1 host at a time and waiting for recovery. SSDs now contain block-db - again one SSD serving 3 OSDs.
> >>
> >> Cluster is used as storage for Openstack.
> >>
> >> Running fio on a VM in that Openstack reveals bluestore performance almost twice slower than filestore.
> >>
> >> fio --name fio_test_file --direct=1 --rw=randwrite --bs=4k --size=1G --numjobs=2 --time_based --runtime=180 --group_reporting
> >>
> >> fio --name fio_test_file --direct=1 --rw=randread --bs=4k --size=1G --numjobs=2 --time_based --runtime=180 --group_reporting
> >>
> >> Filestore
> >> write: io=3511.9MB, bw=19978KB/s, iops=4994, runt=180001msec
> >> write: io=3525.6MB, bw=20057KB/s, iops=5014, runt=180001msec
> >> write: io=3554.1MB, bw=20222KB/s, iops=5055, runt=180016msec
> >>
> >> read : io=1995.7MB, bw=11353KB/s, iops=2838, runt=180001msec
> >> read : io=1824.5MB, bw=10379KB/s, iops=2594, runt=180001msec
> >> read : io=1966.5MB, bw=11187KB/s, iops=2796, runt=180001msec
> >>
> >> Bluestore
> >> write: io=1621.2MB, bw=9222.3KB/s, iops=2305, runt=180002msec
> >> write: io=1576.3MB, bw=8965.6KB/s, iops=2241, runt=180029msec
> >> write: io=1531.9MB, bw=8714.3KB/s, iops=2178, runt=180001msec
> >>
> >> read : io=1279.4MB, bw=7276.5KB/s, iops=1819, runt=180006msec
> >> read : io=773824KB, bw=4298.9KB/s, iops=1074, runt=180010msec
> >> read : io=1018.5MB, bw=5793.7KB/s, iops=1448, runt=180001msec
> >>
> >> - Rado
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
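One way to test David's spill-over theory on a live OSD: the bluefs perf counters report how much of the DB lives on the fast device versus the slow one. A sketch (the osd.0 id is a placeholder, and the counter names are as I recall them for Luminous bluestore, so verify against your build):

  ceph daemon osd.0 perf dump | python -m json.tool | \
      grep -E '"(db|slow)_(total|used)_bytes"'

A non-zero slow_used_bytes under the bluefs section would mean RocksDB has spilled off the 1GB block.db partition onto the HDD.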
Re: [ceph-users] Bluestore performance 50% of filestore
How big were the writes in the windows test and how much concurrency was there?

Historically bluestore does pretty well for us with small random writes, so your write results surprise me a bit. I suspect it's the low queue depth. Sometimes bluestore does worse with reads, especially if readahead isn't enabled on the client.

Mark

On 11/14/2017 03:14 PM, Milanov, Radoslav Nikiforov wrote:

Hi Mark,
Yes RBD is in write back, and the only thing that changed was converting OSDs to bluestore. It is 7200 rpm drives and triple replication. I also get same results (bluestore 2 times slower) testing continuous writes on a 40GB partition on a Windows VM, completely different tool.

Right now I'm going back to filestore for the OSDs so additional tests are possible if that helps.

- Rado

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
Sent: Tuesday, November 14, 2017 4:04 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

Hi Radoslav,

Is RBD cache enabled and in writeback mode? Do you have client side readahead?

Both are doing better for writes than you'd expect from the native performance of the disks assuming they are typical 7200RPM drives and you are using 3X replication (~150IOPS * 27 / 3 = ~1350 IOPS). Given the small file size, I'd expect that you might be getting better journal coalescing in filestore.

Sadly I imagine you can't do a comparison test at this point, but I'd be curious how it would look if you used libaio with a high iodepth and a much bigger partition to do random writes over.

Mark

On 11/14/2017 01:54 PM, Milanov, Radoslav Nikiforov wrote:

Hi

We have a 3-node, 27-OSD cluster running Luminous 12.2.1

In the filestore configuration there are 3 SSDs used for journals of 9 OSDs on each host (1 SSD has 3 journal partitions for 3 OSDs).

I've converted filestore to bluestore by wiping 1 host at a time and waiting for recovery. SSDs now contain block-db - again one SSD serving 3 OSDs.

Cluster is used as storage for Openstack.

Running fio on a VM in that Openstack reveals bluestore performance almost twice slower than filestore.

fio --name fio_test_file --direct=1 --rw=randwrite --bs=4k --size=1G --numjobs=2 --time_based --runtime=180 --group_reporting

fio --name fio_test_file --direct=1 --rw=randread --bs=4k --size=1G --numjobs=2 --time_based --runtime=180 --group_reporting

Filestore
write: io=3511.9MB, bw=19978KB/s, iops=4994, runt=180001msec
write: io=3525.6MB, bw=20057KB/s, iops=5014, runt=180001msec
write: io=3554.1MB, bw=20222KB/s, iops=5055, runt=180016msec

read : io=1995.7MB, bw=11353KB/s, iops=2838, runt=180001msec
read : io=1824.5MB, bw=10379KB/s, iops=2594, runt=180001msec
read : io=1966.5MB, bw=11187KB/s, iops=2796, runt=180001msec

Bluestore
write: io=1621.2MB, bw=9222.3KB/s, iops=2305, runt=180002msec
write: io=1576.3MB, bw=8965.6KB/s, iops=2241, runt=180029msec
write: io=1531.9MB, bw=8714.3KB/s, iops=2178, runt=180001msec

read : io=1279.4MB, bw=7276.5KB/s, iops=1819, runt=180006msec
read : io=773824KB, bw=4298.9KB/s, iops=1074, runt=180010msec
read : io=1018.5MB, bw=5793.7KB/s, iops=1448, runt=180001msec

- Rado

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Bluestore performance 50% of filestore
Hi Mark,

Yes, RBD is in writeback, and the only thing that changed was converting the OSDs to bluestore. These are 7200 rpm drives with triple replication. I also get the same results (bluestore twice as slow) testing continuous writes on a 40GB partition in a Windows VM, using a completely different tool.

Right now I'm going back to filestore for the OSDs, so additional tests are possible if that helps.

- Rado

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
Sent: Tuesday, November 14, 2017 4:04 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Bluestore performance 50% of filestore

Hi Radoslav,

Is RBD cache enabled and in writeback mode? Do you have client-side readahead? Both sets of results are better for writes than you'd expect from the native performance of the disks, assuming they are typical 7200RPM drives and you are using 3X replication (~150 IOPS * 27 / 3 = ~1350 IOPS). Given the small file size, I'd expect that you might be getting better journal coalescing in filestore.

Sadly I imagine you can't do a comparison test at this point, but I'd be curious how it would look if you used libaio with a high iodepth and a much bigger partition to do random writes over.

Mark
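For reference, the writeback behaviour confirmed above is controlled by the client-side librbd cache options; a minimal ceph.conf sketch, using what I believe are the stock defaults as illustrative values:

  [client]
  rbd cache = true                           # enable the librbd cache
  rbd cache writethrough until flush = true  # writethrough until the guest issues a flush
  rbd cache size = 33554432                  # 32MB cache
  rbd cache max dirty = 25165824             # writeback stalls once 24MB is dirty

The cache can only coalesce up to 'rbd cache max dirty' of outstanding data, so sustained 4k random writes with direct=1 in the guest still end up bounded by backend latency.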
Re: [ceph-users] Bluestore performance 50% of filestore
Hi Radoslav,

Is RBD cache enabled and in writeback mode? Do you have client-side readahead?

Both sets of results are better for writes than you'd expect from the native performance of the disks, assuming they are typical 7200RPM drives and you are using 3X replication (~150 IOPS * 27 / 3 = ~1350 IOPS). Given the small file size, I'd expect that you might be getting better journal coalescing in filestore.

Sadly I imagine you can't do a comparison test at this point, but I'd be curious how it would look if you used libaio with a high iodepth and a much bigger partition to do random writes over.

Mark
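A sketch of the kind of test Mark suggests, run with fio inside the VM; the 100G size and iodepth of 32 are illustrative placeholders, not recommendations:

  fio --name fio_test_file --direct=1 --rw=randwrite --bs=4k --size=100G --ioengine=libaio --iodepth=32 --numjobs=2 --time_based --runtime=180 --group_reporting

The larger --size spreads the I/O across much more of each disk, and libaio with a higher --iodepth keeps more requests in flight than the synchronous default, matching the "high iodepth, much bigger partition" suggestion above.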
[ceph-users] Bluestore performance 50% of filestore
Hi,

We have a 3-node, 27-OSD cluster running Luminous 12.2.1.

In the filestore configuration there are 3 SSDs per host used as journals for its 9 OSDs (each SSD holds 3 journal partitions serving 3 OSDs).

I've converted from filestore to bluestore by wiping one host at a time and waiting for recovery. The SSDs now contain the block-db, again with one SSD serving 3 OSDs.

The cluster is used as storage for OpenStack.

Running fio on a VM in that OpenStack shows bluestore performance at almost half that of filestore:

fio --name fio_test_file --direct=1 --rw=randwrite --bs=4k --size=1G --numjobs=2 --time_based --runtime=180 --group_reporting
fio --name fio_test_file --direct=1 --rw=randread --bs=4k --size=1G --numjobs=2 --time_based --runtime=180 --group_reporting

Filestore
write: io=3511.9MB, bw=19978KB/s, iops=4994, runt=180001msec
write: io=3525.6MB, bw=20057KB/s, iops=5014, runt=180001msec
write: io=3554.1MB, bw=20222KB/s, iops=5055, runt=180016msec

read : io=1995.7MB, bw=11353KB/s, iops=2838, runt=180001msec
read : io=1824.5MB, bw=10379KB/s, iops=2594, runt=180001msec
read : io=1966.5MB, bw=11187KB/s, iops=2796, runt=180001msec

Bluestore
write: io=1621.2MB, bw=9222.3KB/s, iops=2305, runt=180002msec
write: io=1576.3MB, bw=8965.6KB/s, iops=2241, runt=180029msec
write: io=1531.9MB, bw=8714.3KB/s, iops=2178, runt=180001msec

read : io=1279.4MB, bw=7276.5KB/s, iops=1819, runt=180006msec
read : io=773824KB, bw=4298.9KB/s, iops=1074, runt=180010msec
read : io=1018.5MB, bw=5793.7KB/s, iops=1448, runt=180001msec

- Rado
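As a sanity check on a layout like this, the OSD metadata should show the block-db living on the SSD; something along these lines, where the device paths are hypothetical examples:

  ceph osd metadata 0 | grep -E 'bluefs_db|bluestore_bdev'
  # expect e.g. "bluefs_db_partition_path": "/dev/sdj1"      (SSD partition holding block-db)
  #             "bluestore_bdev_partition_path": "/dev/sda2" (HDD data device)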