subject:"SSD journal suggestion"


On 11/07/2012 06:28 AM, Gandalf Corvotempesta wrote:

2012/11/7 Sage Weil :

On Wed, 7 Nov 2012, Gandalf Corvotempesta wrote:

I'm evaluating some SSD drives as journal.
Samsung 840 Pro seems to be the fastest in sequential reads and write.


The 840 Pro seems to reach 485MB/s in sequential write:
http://www.storagereview.com/samsung_ssd_840_pro_review
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



I'm using Intel 510s in a test node and can do about 450MB/s per drive. 
 Right now I'm doing 3 journals per SSD, but topping out at about 
1.2-1.4GB/s from the client perspective for the node with 15+ drives and 
5 SSDs.  It's possible newer versions of the code and tuning may 
increase that.


TV pointed me at the new Intel DC S3700 which looks like a very 
interesting option (the 100GB model for $240).


http://www.anandtech.com/show/6432/the-intel-ssd-dc-s3700-intels-3rd-generation-controller-analyzed

Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

On 11/07/2012 10:12 AM, Atchley, Scott wrote:

On Nov 7, 2012, at 10:01 AM, Mark Nelson wrote:

On 11/07/2012 06:28 AM, Gandalf Corvotempesta wrote:

2012/11/7 Sage Weil :

On Wed, 7 Nov 2012, Gandalf Corvotempesta wrote:

I'm evaluating some SSD drives as journal.
Samsung 840 Pro seems to be the fastest in sequential reads and write.

The 840 Pro seems to reach 485MB/s in sequential write:
http://www.storagereview.com/samsung_ssd_840_pro_review
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

I'm using Intel 510s in a test node and can do about 450MB/s per drive.

Is that sequential read or write? Intel lists them at 210-315 MB/s for
sequential write. The 520s are rated at 475-520 MB/s seq. write.

Doh, wrote that too early in the morning after staying all night
watching the elections. :) You are correct, it's the 520, not the 510.

Right now I'm doing 3 journals per SSD, but topping out at about
1.2-1.4GB/s from the client perspective for the node with 15+ drives and
5 SSDs. It's possible newer versions of the code and tuning may
increase that.

What interconnect is this? 10G Ethernet is 1.25 GB/s line rate and I would
expect your Sockets and Ceph overhead to eat into that. Or is it dual 10G
Ethernet?

Scott

This is 8 concurrent instances of rados bench running on localhost.
Ceph is configured with 1x replication. 1.2-1.4GB/s is the aggregate
throughput of all of the rados bench instances.

TV pointed me at the new Intel DC S3700 which looks like a very
interesting option (the 100GB model for $240).

http://www.anandtech.com/show/6432/the-intel-ssd-dc-s3700-intels-3rd-generation-controller-analyzed

Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

2012-11-07 Thread Atchley, Scott

On Nov 7, 2012, at 10:01 AM, Mark Nelson  wrote:

> On 11/07/2012 06:28 AM, Gandalf Corvotempesta wrote:
>> 2012/11/7 Sage Weil :
>>> On Wed, 7 Nov 2012, Gandalf Corvotempesta wrote:
 I'm evaluating some SSD drives as journal.
 Samsung 840 Pro seems to be the fastest in sequential reads and write.
>> 
>> The 840 Pro seems to reach 485MB/s in sequential write:
>> http://www.storagereview.com/samsung_ssd_840_pro_review
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
> 
> I'm using Intel 510s in a test node and can do about 450MB/s per drive. 

Is that sequential read or write? Intel lists them at 210-315 MB/s for 
sequential write. The 520s are rated at 475-520 MB/s seq. write.

>  Right now I'm doing 3 journals per SSD, but topping out at about 
> 1.2-1.4GB/s from the client perspective for the node with 15+ drives and 
> 5 SSDs.  It's possible newer versions of the code and tuning may 
> increase that.

What interconnect is this? 10G Ethernet is 1.25 GB/s line rate and I would 
expect your Sockets and Ceph overhead to eat into that. Or is it dual 10G 
Ethernet?

Scott

> TV pointed me at the new Intel DC S3700 which looks like a very 
> interesting option (the 100GB model for $240).
> 
> http://www.anandtech.com/show/6432/the-intel-ssd-dc-s3700-intels-3rd-generation-controller-analyzed
> 
> Mark
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

2012-11-07 Thread Atchley, Scott

On Nov 7, 2012, at 11:20 AM, Mark Nelson  wrote:

>>>  Right now I'm doing 3 journals per SSD, but topping out at about
>>> 1.2-1.4GB/s from the client perspective for the node with 15+ drives and
>>> 5 SSDs.  It's possible newer versions of the code and tuning may
>>> increase that.
>> 
>> What interconnect is this? 10G Ethernet is 1.25 GB/s line rate and I would 
>> expect your Sockets and Ceph overhead to eat into that. Or is it dual 10G 
>> Ethernet?
> 
> This is 8 concurrent instances of rados bench running on localhost. 
> Ceph is configured with 1x replication.  1.2-1.4GB/s is the aggregate 
> throughput of all of the rados bench instances.

Ok, all local with no communication. Given this level of local performance, 
what does that translate into when talking over the network?

Scott

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion


On 11/07/2012 10:35 AM, Atchley, Scott wrote:

On Nov 7, 2012, at 11:20 AM, Mark Nelson  wrote:


  Right now I'm doing 3 journals per SSD, but topping out at about
1.2-1.4GB/s from the client perspective for the node with 15+ drives and
5 SSDs.  It's possible newer versions of the code and tuning may
increase that.


What interconnect is this? 10G Ethernet is 1.25 GB/s line rate and I would 
expect your Sockets and Ceph overhead to eat into that. Or is it dual 10G 
Ethernet?


This is 8 concurrent instances of rados bench running on localhost.
Ceph is configured with 1x replication.  1.2-1.4GB/s is the aggregate
throughput of all of the rados bench instances.


Ok, all local with no communication. Given this level of local performance, 
what does that translate into when talking over the network?

Scott



Well, local, but still over tcp.  Right now I'm focusing on pushing the 
osds/filestores as far as I can, and after that I'm going to setup a 
bonded 10GbE network to see what kind of messenger bottlenecks I run 
into.  Sadly the testing is going slower than I would like.


Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion


Hi,

I have 16 SAS disk on a LSI 9266-8i and 4 Intel 520 SSD on a HBA, the 
node has dual 10G Ethernet. The clients are 4 nodes with dual 10GeB, as 
test I use rados bench on each client. The aggregated write speed is 
around 1,6GB/s with single replication.


In the first configuration, I had the SSDs on the raidcontroller as 
well, but then I saturated the PCIe 2.0 x8 interface of the 
raidcontroller, therefore I use a second controller for the SSDs.



-martin


Am 07.11.2012 17:41, schrieb Mark Nelson:

Well, local, but still over tcp.  Right now I'm focusing on pushing the
osds/filestores as far as I can, and after that I'm going to setup a
bonded 10GbE network to see what kind of messenger bottlenecks I run
into.  Sadly the testing is going slower than I would like.


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion


Hi,

I tested a Arista 7150S-24, a HP5900 and in a few weeks I will get a 
Mellanox MSX1016. ATM the Arista is may favourite.
For the dual 10GeB NICs I tested the Intel X520-DA2 and the Mellanox 
ConnectX-3. My favourite is the Intel X520-DA2.


-martin

Am 07.11.2012 22:14, schrieb Gandalf Corvotempesta:

2012/11/7 Martin Mailand :

I have 16 SAS disk on a LSI 9266-8i and 4 Intel 520 SSD on a HBA, the node
has dual 10G Ethernet. The clients are 4 nodes with dual 10GeB, as test I
use rados bench on each client. The aggregated write speed is around 1,6GB/s
with single replication.


Just for curiosity, which switches do you have?


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

2012-11-07 Thread Stefan Priebe


Am 07.11.2012 22:35, schrieb Martin Mailand:

Hi,

I tested a Arista 7150S-24, a HP5900 and in a few weeks I will get a
Mellanox MSX1016. ATM the Arista is may favourite.
For the dual 10GeB NICs I tested the Intel X520-DA2 and the Mellanox
ConnectX-3. My favourite is the Intel X520-DA2.


That's pretty interesting i'll get the HP5900 and HP5920 in a few weeks. 
HP told me the deep packet buffers of the HP5920 will burst the 
performance and should be used for storage related stuff.


Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion


Hi Stefan,

deep buffers means latency spikes, you should go for fast switching 
latency. The HP5900 has a latency of 1ms, the Arista and Mellanox of 250ns.

And I you should think at the price the HP5900 cost 3 times of the Mellanox.

-martin

Am 07.11.2012 22:44, schrieb Stefan Priebe:

Am 07.11.2012 22:35, schrieb Martin Mailand:

Hi,

I tested a Arista 7150S-24, a HP5900 and in a few weeks I will get a
Mellanox MSX1016. ATM the Arista is may favourite.
For the dual 10GeB NICs I tested the Intel X520-DA2 and the Mellanox
ConnectX-3. My favourite is the Intel X520-DA2.


That's pretty interesting i'll get the HP5900 and HP5920 in a few weeks.
HP told me the deep packet buffers of the HP5920 will burst the
performance and should be used for storage related stuff.

Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

2012-11-07 Thread Stefan Priebe


Am 07.11.2012 22:55, schrieb Martin Mailand:

Hi Stefan,

deep buffers means latency spikes, you should go for fast switching
latency. The HP5900 has a latency of 1ms, the Arista and Mellanox of 250ns.


HP told me they all use the same ships and Arista measures latency while 
only one port is in use. HP guarentees the latency when all ports are in 
use. If this is correct or just somehing hp told me - i don't know. They 
told me the arista is slower and the statistics are not comporable...


> And I you should think at the price the HP5900 cost 3 times of the
> Mellanox.
Don't know what the Mellanox coests. I get the HP for a really good 
price below 10.000 €.


Greets,
Stefan
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion


Hi,

I *think* the HP is Broadcom based, the Arista is Fulcrum based, and I 
don't know which chips Mellanox is using.


Our NOC tested both of them, an the Arista was the clear winner, at 
least in our workload.


-martin

Am 07.11.2012 22:59, schrieb Stefan Priebe:

HP told me they all use the same ships and Arista measures latency while
only one port is in use. HP guarentees the latency when all ports are in
use. If this is correct or just somehing hp told me - i don't know. They
told me the arista is slower and the statistics are not comporable...

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion


good question, probably we do not have enough experience with IPoIB.
But it looks good on paper, so it's definitely a try worth.

-martin

Am 07.11.2012 23:28, schrieb Gandalf Corvotempesta:

2012/11/7 Martin Mailand :

I tested a Arista 7150S-24, a HP5900 and in a few weeks I will get a
Mellanox MSX1016. ATM the Arista is may favourite.


Why not infiniband?


--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion


On 11/07/2012 04:51 PM, Gandalf Corvotempesta wrote:

2012/11/7 Martin Mailand :

But it looks good on paper, so it's definitely a try worth.


is at least 4x times faster than 10gbe and AFAIK should have a lower latency.
I'm planning to use infiniband as backend storage network, used for
OSD replication. 2 HBA for each OSD should give me 80Gbps and full
redundancy



I haven't done much with IPoIB (just RDMA), but my understanding is that 
it tends to top out at like 15Gb/s.  Some others on this mailing list 
can probably speak more authoritatively.  Even with RDMA you are going 
to top out at around 3.1-3.2GB/s.


This thread may be helpful/interesting:
http://comments.gmane.org/gmane.linux.drivers.rdma/12279

Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
 wrote:

> 2012/11/8 Mark Nelson :
>> I haven't done much with IPoIB (just RDMA), but my understanding is that it
>> tends to top out at like 15Gb/s.  Some others on this mailing list can
>> probably speak more authoritatively.  Even with RDMA you are going to top
>> out at around 3.1-3.2GB/s.
> 
> 15Gb/s is still faster than 10Gbe
> But this speed limit seems to be kernel-related and should be the same
> even in a 10Gbe environment, or not?

We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs 
(the native IB API), I see ~27 Gb/s between two hosts. When running Sockets 
over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use 
interrupt affinity and process binding.

For our Ceph testing, we will set the affinity of two of the mlx4 interrupt 
handlers to cores 0 and 1 and we will not using process binding. For single 
stream Netperf, we do use process binding and bind it to the same core (i.e. 0) 
and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use 
process binding but we still see ~22 Gb/s.

We used all of the Mellanox tuning recommendations for IPoIB available in their 
tuning pdf:

http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

We looked at their interrupt affinity setting scripts and then wrote our own.

Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected 
mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. 
Mellanox claims that we should get identical performance with both modes and we 
are looking into it.

We are getting a new test cluster with FDR HCAs and I will look into those as 
well.

Scott--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

2012-11-08 Thread Mark Nelson

On 11/08/2012 07:55 AM, Atchley, Scott wrote:

On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta
wrote:

2012/11/8 Mark Nelson :

I haven't done much with IPoIB (just RDMA), but my understanding is that it
tends to top out at like 15Gb/s. Some others on this mailing list can
probably speak more authoritatively. Even with RDMA you are going to top
out at around 3.1-3.2GB/s.

15Gb/s is still faster than 10Gbe
But this speed limit seems to be kernel-related and should be the same
even in a 10Gbe environment, or not?

We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs
(the native IB API), I see ~27 Gb/s between two hosts. When running Sockets
over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use
interrupt affinity and process binding.

For our Ceph testing, we will set the affinity of two of the mlx4 interrupt
handlers to cores 0 and 1 and we will not using process binding. For single
stream Netperf, we do use process binding and bind it to the same core (i.e. 0)
and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use
process binding but we still see ~22 Gb/s.

Scott, this is very interesting! Does setting the interrupt affinity
make the biggest difference then when you have concurrent netperf
processes going? For some reason I thought that setting interrupt
affinity wasn't even guaranteed in linux any more, but this is just some
half-remembered recollection from a year or two ago.

We used all of the Mellanox tuning recommendations for IPoIB available in their
tuning pdf:

http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

We looked at their interrupt affinity setting scripts and then wrote our own.

Our testing is with IPoIB in "connected" mode, not "datagram" mode. Connected
mode is less scalable, but currently I only get ~3 Gb/s with datagram mode. Mellanox claims that we
should get identical performance with both modes and we are looking into it.

We are getting a new test cluster with FDR HCAs and I will look into those as
well.

Nice! At some point I'll probably try to justify getting some FDR cards
in house. I'd definitely like to hear how FDR ends up working for you.

Scott

Mark
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:

> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
>>  wrote:
>> 
>>> 2012/11/8 Mark Nelson :
 I haven't done much with IPoIB (just RDMA), but my understanding is that it
 tends to top out at like 15Gb/s.  Some others on this mailing list can
 probably speak more authoritatively.  Even with RDMA you are going to top
 out at around 3.1-3.2GB/s.
>>> 
>>> 15Gb/s is still faster than 10Gbe
>>> But this speed limit seems to be kernel-related and should be the same
>>> even in a 10Gbe environment, or not?
>> 
>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs 
>> (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets 
>> over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use 
>> interrupt affinity and process binding.
>> 
>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt 
>> handlers to cores 0 and 1 and we will not using process binding. For single 
>> stream Netperf, we do use process binding and bind it to the same core (i.e. 
>> 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do not use 
>> process binding but we still see ~22 Gb/s.
> 
> Scott, this is very interesting!  Does setting the interrupt affinity 
> make the biggest difference then when you have concurrent netperf 
> processes going?  For some reason I thought that setting interrupt 
> affinity wasn't even guaranteed in linux any more, but this is just some 
> half-remembered recollection from a year or two ago.

We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with and 
without affinity:

Default (irqbalance running)   12.8 Gb/s
IRQ balance off13.0 Gb/s
Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script

When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get ~22 
Gb/s for a single stream.

>> We used all of the Mellanox tuning recommendations for IPoIB available in 
>> their tuning pdf:
>> 
>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>> 
>> We looked at their interrupt affinity setting scripts and then wrote our own.
>> 
>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
>> Connected mode is less scalable, but currently I only get ~3 Gb/s with 
>> datagram mode. Mellanox claims that we should get identical performance with 
>> both modes and we are looking into it.
>> 
>> We are getting a new test cluster with FDR HCAs and I will look into those 
>> as well.
> 
> Nice!  At some point I'll probably try to justify getting some FDR cards 
> in house.  I'd definitely like to hear how FDR ends up working for you.

I'll post the numbers when I get access after they are set up.

Scott

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

On Nov 8, 2012, at 10:00 AM, Scott Atchley  wrote:

> On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:
> 
>> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
>>> On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
>>>  wrote:
>>> 
 2012/11/8 Mark Nelson :
> I haven't done much with IPoIB (just RDMA), but my understanding is that 
> it
> tends to top out at like 15Gb/s.  Some others on this mailing list can
> probably speak more authoritatively.  Even with RDMA you are going to top
> out at around 3.1-3.2GB/s.
 
 15Gb/s is still faster than 10Gbe
 But this speed limit seems to be kernel-related and should be the same
 even in a 10Gbe environment, or not?
>>> 
>>> We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using Verbs 
>>> (the native IB API), I see ~27 Gb/s between two hosts. When running Sockets 
>>> over these devices using IPoIB, I see 13-22 Gb/s depending on whether I use 
>>> interrupt affinity and process binding.
>>> 
>>> For our Ceph testing, we will set the affinity of two of the mlx4 interrupt 
>>> handlers to cores 0 and 1 and we will not using process binding. For single 
>>> stream Netperf, we do use process binding and bind it to the same core 
>>> (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf runs, we do 
>>> not use process binding but we still see ~22 Gb/s.
>> 
>> Scott, this is very interesting!  Does setting the interrupt affinity 
>> make the biggest difference then when you have concurrent netperf 
>> processes going?  For some reason I thought that setting interrupt 
>> affinity wasn't even guaranteed in linux any more, but this is just some 
>> half-remembered recollection from a year or two ago.
> 
> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with 
> and without affinity:
> 
> Default (irqbalance running)   12.8 Gb/s
> IRQ balance off13.0 Gb/s
> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
> 
> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get 
> ~22 Gb/s for a single stream.

Note, I used hwloc to determine which socket was closer to the mlx4 device on 
our dual socket machines. On these nodes, hwloc reported that both sockets were 
equally close, but a colleague has machines where one socket is closer than the 
other. In that case, bind to the closer socket (or to cores within the closer 
socket).

> 
>>> We used all of the Mellanox tuning recommendations for IPoIB available in 
>>> their tuning pdf:
>>> 
>>> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>>> 
>>> We looked at their interrupt affinity setting scripts and then wrote our 
>>> own.
>>> 
>>> Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
>>> Connected mode is less scalable, but currently I only get ~3 Gb/s with 
>>> datagram mode. Mellanox claims that we should get identical performance 
>>> with both modes and we are looking into it.
>>> 
>>> We are getting a new test cluster with FDR HCAs and I will look into those 
>>> as well.
>> 
>> Nice!  At some point I'll probably try to justify getting some FDR cards 
>> in house.  I'd definitely like to hear how FDR ends up working for you.
> 
> I'll post the numbers when I get access after they are set up.
> 
> Scott
> 

--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion

2012-11-08 Thread Andrey Korolyov

On Thu, Nov 8, 2012 at 7:02 PM, Atchley, Scott  wrote:
> On Nov 8, 2012, at 10:00 AM, Scott Atchley  wrote:
>
>> On Nov 8, 2012, at 9:39 AM, Mark Nelson  wrote:
>>
>>> On 11/08/2012 07:55 AM, Atchley, Scott wrote:
 On Nov 8, 2012, at 3:22 AM, Gandalf Corvotempesta 
  wrote:

> 2012/11/8 Mark Nelson :
>> I haven't done much with IPoIB (just RDMA), but my understanding is that 
>> it
>> tends to top out at like 15Gb/s.  Some others on this mailing list can
>> probably speak more authoritatively.  Even with RDMA you are going to top
>> out at around 3.1-3.2GB/s.
>
> 15Gb/s is still faster than 10Gbe
> But this speed limit seems to be kernel-related and should be the same
> even in a 10Gbe environment, or not?

 We have a test cluster with Mellanox QDR HCAs (i.e. NICs). When using 
 Verbs (the native IB API), I see ~27 Gb/s between two hosts. When running 
 Sockets over these devices using IPoIB, I see 13-22 Gb/s depending on 
 whether I use interrupt affinity and process binding.

 For our Ceph testing, we will set the affinity of two of the mlx4 
 interrupt handlers to cores 0 and 1 and we will not using process binding. 
 For single stream Netperf, we do use process binding and bind it to the 
 same core (i.e. 0) and we see ~22 Gb/s. For multiple, concurrent Netperf 
 runs, we do not use process binding but we still see ~22 Gb/s.
>>>
>>> Scott, this is very interesting!  Does setting the interrupt affinity
>>> make the biggest difference then when you have concurrent netperf
>>> processes going?  For some reason I thought that setting interrupt
>>> affinity wasn't even guaranteed in linux any more, but this is just some
>>> half-remembered recollection from a year or two ago.
>>
>> We are using RHEL6 with a 3.5.1 kernel. I tested single stream Netperf with 
>> and without affinity:
>>
>> Default (irqbalance running)   12.8 Gb/s
>> IRQ balance off13.0 Gb/s
>> Set IRQ affinity to socket 0   17.3 Gb/s   # using the Mellanox script
>>
>> When I set the affinity to cores 0-1 _and_ I bind Netperf to core 0, I get 
>> ~22 Gb/s for a single stream.
>

Did you tried Mellanox-baked modules for 2.6.32 before that?

> Note, I used hwloc to determine which socket was closer to the mlx4 device on 
> our dual socket machines. On these nodes, hwloc reported that both sockets 
> were equally close, but a colleague has machines where one socket is closer 
> than the other. In that case, bind to the closer socket (or to cores within 
> the closer socket).
>
>>
 We used all of the Mellanox tuning recommendations for IPoIB available in 
 their tuning pdf:

 http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf

 We looked at their interrupt affinity setting scripts and then wrote our 
 own.

 Our testing is with IPoIB in "connected" mode, not "datagram" mode. 
 Connected mode is less scalable, but currently I only get ~3 Gb/s with 
 datagram mode. Mellanox claims that we should get identical performance 
 with both modes and we are looking into it.

 We are getting a new test cluster with FDR HCAs and I will look into those 
 as well.
>>>
>>> Nice!  At some point I'll probably try to justify getting some FDR cards
>>> in house.  I'd definitely like to hear how FDR ends up working for you.
>>
>> I'll post the numbers when I get access after they are set up.
>>
>> Scott
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: SSD journal suggestion