[ceph-users] [SSD NVM FOR JOURNAL] Performance issues

2017-08-23 Thread Guilherme Steinmüller
Hello!

I recently installed INTEL SSD 400GB 750 SERIES PCIE 3.0 X4 in 3 of my OSD
nodes.

First of all, here is a schema describing how my cluster is laid out:

[image: Inline image 1]

[image: Inline image 2]

I primarily use my Ceph cluster as a backend for OpenStack nova, glance,
swift and cinder. My crushmap is configured with rulesets for SAS disks, for
SATA disks, and with another ruleset for the HPE nodes, which also use SATA
disks.

Before installing the new journals in the HPE nodes, I was using one of the
disks that are OSDs today (osd.35, osd.34 and osd.33) as the journal. After
upgrading the journal, I noticed that a dd command writing 1 GB blocks inside
OpenStack nova instances doubled its throughput, but the expected gain was
actually 400% or 500%, since on the Dell nodes, where we have another nova
pool, the throughput is around that value.

Here is a demonstration of the scenario and the difference in performance
between Dell nodes and HPE nodes:



Scenario:


   - Using pools to store instance disks for OpenStack


   - Pool nova in "ruleset SAS" placed on c4-osd201, c4-osd202 and
   c4-osd203, with 5 OSDs per host


   - Pool nova_hpedl180 in "ruleset NOVA_HPEDL180" placed on c4-osd204,
   c4-osd205 and c4-osd206, with 3 OSDs per host


   - Every OSD has one 35 GB journal partition on an INTEL SSD 400GB 750
   SERIES PCIE 3.0 X4


   - 10 Gbps internal links for both the cluster and the public network


   - Deployment via ceph-ansible, with the same configuration defined in
   Ansible for every host in the cluster



*Instance on pool nova in ruleset SAS:*


   # dd if=/dev/zero of=/mnt/bench bs=1G count=1 oflag=direct
   1+0 records in
   1+0 records out
   1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.56255 s, 419 MB/s


*Instance on pool nova_hpedl180 in ruleset NOVA_HPEDL180:*

   # dd if=/dev/zero of=/mnt/bench bs=1G count=1 oflag=direct
   1+0 records in
   1+0 records out
   1073741824 bytes (1.1 GB, 1.0 GiB) copied, 11.8243 s, 90.8 MB/s
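
(For a comparison that takes the guest VM and QEMU out of the path entirely,
the same two pools can also be benchmarked directly at the RADOS level from
any client node; a minimal sketch, using rados bench's default 4 MB writes
over 60-second runs:

   # rados bench -p nova 60 write
   # rados bench -p nova_hpedl180 60 write

If the same gap shows up here, the difference lies in the OSD/journal path
rather than inside the instances.)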


I ran some FIO benchmarks as suggested by Sebastien (
https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
), and the command with 1 job gave me about 180 MB/s of throughput on the
recently installed nodes (the HPE nodes). I also ran hdparm benchmarks on all
the SSDs and everything looks normal.
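
For reference, the single-job test from that post looks roughly like the
following; it writes directly to the device, so it must only be pointed at an
unused partition (/dev/nvme0n1 here is a placeholder for the journal SSD):

   # fio --filename=/dev/nvme0n1 --direct=1 --sync=1 --rw=write --bs=4k \
     --numjobs=1 --iodepth=1 --runtime=60 --time_based \
     --group_reporting --name=journal-test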


I can't see what is causing this difference in throughput. The network is not
a problem, and I don't think CPU or memory are the bottleneck either, since I
was monitoring the cluster with atop and didn't notice any resource
saturation. My only thought is that I have less workload on the
nova_hpedl180 pool on the HPE nodes, and fewer disks per node, and that this
can influence the throughput of the journal.
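
(One way to double-check the per-disk picture, assuming sysstat is installed
on the OSD hosts, is extended iostat during a test run:

   # iostat -x 2 5

comparing %util and await on the HPE SATA disks against the Dell SAS disks
while the dd is running.)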


Any clue about what is missing or what is happening?

Thanks in advance.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [SSD NVM FOR JOURNAL] Performance issues

2017-08-24 Thread Guilherme Steinmüller
Hello Christian.

First of all, thanks for your considerations, I really appreciate it.

2017-08-23 21:34 GMT-03:00 Christian Balzer :

>
> Hello,
>
> On Wed, 23 Aug 2017 09:11:18 -0300 Guilherme Steinmüller wrote:
>
> > Hello!
> >
> > I recently installed INTEL SSD 400GB 750 SERIES PCIE 3.0 X4 in 3 of my
> > OSD nodes.
> >
> Well, you know what's coming now, don't you?
>
> That's a consumer device, with 70 GB writes per day endurance.
> Unless you essentially have a read-only cluster, you're throwing away
> money.
>

Yes, we knew we were buying a consumer device, due to our limited budget and
our goal of building a small-scale production cloud. This model seemed
acceptable; it was at the top of the consumer models in Sebastien's
benchmarks.

We are a lab that depends on different budget sources to acquire equipment,
so they vary, and most of the time we are constrained by different budget
ranges.

>
> > First of all, here is a schema describing how my cluster is laid out:
> >
> > [image: Inline image 1]
> >
> > [image: Inline image 2]
> >
> > I primarily use my Ceph cluster as a backend for OpenStack nova, glance,
> > swift and cinder. My crushmap is configured with rulesets for SAS disks,
> > for SATA disks, and with another ruleset for the HPE nodes, which also
> > use SATA disks.
> >
> > Before installing the new journals in the HPE nodes, I was using one of
> > the disks that are OSDs today (osd.35, osd.34 and osd.33) as the journal.
> > After upgrading the journal, I noticed that a dd command writing 1 GB
> > blocks inside OpenStack nova instances doubled its throughput, but the
> > expected gain was actually 400% or 500%, since on the Dell nodes, where
> > we have another nova pool, the throughput is around that value.
> >
> Apples, oranges and bananas.
> You're comparing different HW (and no, I'm not going to look this up)
> which may or may not have vastly different capabilities (like HW cache),
> RAM and (unlikely relevant) CPU.
>


Indeed, we took this into account. The HP servers were cheaper and have a
poorer configuration because of that limited budget.


> Your NVMe may also be plugged into a different, insufficient PCIe slot for
> all we know.
>

I checked this. I compared the slot information between the 3 Dell nodes and
the 3 HP nodes by running:

# ls -l /sys/block/nvme0n1
# lspci -vvv -s 0000:06:00.0   <- slot identifier

The only difference is that the Dell nodes report *Cache Line Size: 32
bytes*, while the HP nodes don't show this parameter.
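
For completeness, the negotiated link speed and width can also be compared
with something like:

# lspci -vvv -s 0000:06:00.0 | grep -E 'LnkCap|LnkSta'

On both vendors, LnkSta should report Speed 8GT/s, Width x4 if this card is
getting its full PCIe 3.0 x4 link.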



> You're also using very different HDDs, which definitely will be a factor.
>
>
I thought that the backend disks would not interfere that much. For example,
the Ceph journal has a parameter called filestore max sync interval, which
controls the interval at which the journal commits transactions to the
backend OSD disks; ours is set to 35. So client requests go to the SSD first
and are then committed to the OSDs.
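
For reference, this is roughly how that looks in our ceph.conf (the min value
shown is the filestore default of 0.01 s; only the max was raised on our
side):

[osd]
filestore max sync interval = 35
filestore min sync interval = 0.01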


> But most importantly, you're comparing 2 pools of vastly different OSD
> count; no wonder a pool with 15 OSDs is faster in sequential writes than
> one with 9.
>
> > Here is a demonstration of the scenario and the difference in performance
> > between Dell nodes and HPE nodes:
> >
> >
> >
> > Scenario:
> >
> >
> >    - Using pools to store instance disks for OpenStack
> >
> >
> >    - Pool nova in "ruleset SAS" placed on c4-osd201, c4-osd202 and
> >    c4-osd203, with 5 OSDs per host
> >
> SAS
> >
> >    - Pool nova_hpedl180 in "ruleset NOVA_HPEDL180" placed on c4-osd204,
> >    c4-osd205 and c4-osd206, with 3 OSDs per host
> >
> SATA
> >
> >    - Every OSD has one 35 GB journal partition on an INTEL SSD 400GB
> >    750 SERIES PCIE 3.0 X4
> >
> Overkill, but since your NVMe will die shortly anyway...
>
> With large sequential tests, the journal will have nearly NO impact on the
> result, even if tuned to that effect.
>
> >
> >    - 10 Gbps internal links for both the cluster and the public network
> >
> >
> >    - Deployment via ceph-ansible, with the same configuration defined
> >    in Ansible for every host in the cluster
> >
> >
> >
> > *Instance on pool nova in ruleset SAS:*
> >
> >
> >    # dd if=/dev/zero of=/mnt/bench bs=1G count=1 oflag=direct
> >    1+0 records in
> >    1+0 records out
> >    1073741824 bytes (1.1 GB, 1.0 GiB) copied, 2.56255 s, 419 MB/s
> >
> This is a very small test for what you're trying to determine and not
> going to be very representative.
> If f

Re: [ceph-users] [SSD NVM FOR JOURNAL] Performance issues

2017-08-25 Thread Guilherme Steinmüller
Hello Christian.

2017-08-24 22:43 GMT-03:00 Christian Balzer :

>
> Hello,
>
> On Thu, 24 Aug 2017 14:49:24 -0300 Guilherme Steinmüller wrote:
>
> > Hello Christian.
> >
> > First of all, thanks for your considerations, I really appreciate it.
> >
> > 2017-08-23 21:34 GMT-03:00 Christian Balzer :
> >
> > >
> > > Hello,
> > >
> > > On Wed, 23 Aug 2017 09:11:18 -0300 Guilherme Steinmüller wrote:
> > >
> > > > Hello!
> > > >
> > > > I recently installed INTEL SSD 400GB 750 SERIES PCIE 3.0 X4 in 3
> > > > of my OSD nodes.
> > > >
> > > Well, you know what's coming now, don't you?
> > >
> > > That's a consumer device, with 70 GB writes per day endurance.
> > > Unless you essentially have a read-only cluster, you're throwing away
> > > money.
> > >
> >
> > Yes, we knew we were buying a consumer device, due to our limited budget
> > and our goal of building a small-scale production cloud. This model
> > seemed acceptable; it was at the top of the consumer models in
> > Sebastien's benchmarks.
> >
> > We are a lab that depends on different budget sources to acquire
> > equipment, so they vary, and most of the time we are constrained by
> > different budget ranges.
> >
> Noted, I hope your tests won't last too long or move a lot of data. ^o^
>
> > >
> > > > First of all, here is a schema describing how my cluster is laid out:
> > > >
> > > > [image: Inline image 1]
> > > >
> > > > [image: Inline image 2]
> > > >
> > > > I primarily use my Ceph cluster as a backend for OpenStack nova,
> > > > glance, swift and cinder. My crushmap is configured with rulesets
> > > > for SAS disks, for SATA disks, and with another ruleset for the HPE
> > > > nodes, which also use SATA disks.
> > > >
> > > > Before installing the new journals in the HPE nodes, I was using
> > > > one of the disks that are OSDs today (osd.35, osd.34 and osd.33) as
> > > > the journal. After upgrading the journal, I noticed that a dd command
> > > > writing 1 GB blocks inside OpenStack nova instances doubled its
> > > > throughput, but the expected gain was actually 400% or 500%, since on
> > > > the Dell nodes, where we have another nova pool, the throughput is
> > > > around that value.
> > > >
> > > Apples, oranges and bananas.
> > > You're comparing different HW (and no, I'm not going to look this up)
> > > which may or may not have vastly different capabilities (like HW
> > > cache), RAM and (unlikely relevant) CPU.
> > >
> >
> >
> > Indeed, we took this into account. The HP servers were cheaper and have
> > a poorer configuration because of that limited budget.
> >
> >
> > > Your NVMe may also be plugged into a different, insufficient PCIe
> > > slot for all we know.
> > >
> >
> > I checked this. I compared the slot information between the 3 Dell
> > nodes and the 3 HP nodes by running:
> >
> > # ls -l /sys/block/nvme0n1
> > # lspci -vvv -s 0000:06:00.0   <- slot identifier
> >
> > The only difference is that the Dell nodes report *Cache Line Size: 32
> > bytes*, while the HP nodes don't show this parameter.
> >
> That shouldn't be relevant, AFAIK.
>
> >
> >
> > > You're also using very different HDDs, which definitely will be a
> > > factor.
> > >
> > >
> > I thought that the backend disks would not interfere that much. For
> > example, the Ceph journal has a parameter called filestore max sync
> > interval, which controls the interval at which the journal commits
> > transactions to the backend OSD disks; ours is set to 35. So client
> > requests go to the SSD first and are then committed to the OSDs.
> >
> As I wrote before, the journal does not come into play for any large
> amounts of data unless massively tuned and/or under extreme pressure.
>
> You would need to touch many more of the journal and filestore parameters
> than max_sync, which by itself does nothing to prevent min_sync and other
> values from triggering flushes more or less immediately.
>
> And tuning things so the journal is used extensively by default will
> result in I/O storms slowing things to a crawl when it finally flushes to
> the HDDs.
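
For concreteness, the family of options being referred to here looks like
this in ceph.conf; the names are FileStore-era options, and the values below
are purely illustrative rather than recommendations (defaults vary between
releases):

[osd]
filestore min sync interval = 0.01
filestore max sync interval = 5
filestore queue max ops = 500
filestore queue max bytes = 104857600
journal max write bytes = 10485760
journal max write entries = 100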

Re: [ceph-users] ceph cluster monitoring tool

2018-07-24 Thread Guilherme Steinmüller
Satish,

I'm currently working on monasca's roles for openstack-ansible.

We have plugins that monitor Ceph as well, and I use them in production.
Below you can see an example:

https://imgur.com/a/6l6Q2K6



On Tue, 24 Jul 2018 at 02:02, Satish Patel
wrote:

> My 5-node Ceph cluster is ready for production; now I am looking for a
> good open-source monitoring tool. What is the majority of folks using in
> their production clusters?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com