Re: What would a good OSD node hardware configuration look like?

2012-11-07 Thread Wido den Hollander



On 07-11-12 09:17, Gandalf Corvotempesta wrote:

2012/11/7 Wido den Hollander :

Except that SSDs will mainly fail due to the amount of write cycles they had
to endure.

So in RAID-1 your SSDs will fail at almost the same time.

With for example 8 OSDs in a server you better spread them out 50/50 over
two SSDs.


But the OS should be installed on SSD, to allow the use of the whole
SATA/SAS disk as OSDs.
Without RAID-1 between SSDs, a disk failure will take the whole node down.
--


True. You might want to use a small internal USB stick like those from 
Transcend for installing your OS. They are available in 4GB and 8GB.


Let syslog handle your logging to an external device and you have almost 
no write I/O on that thing.
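A rough sketch of that setup, assuming ceph's "log to syslog" option and a stock rsyslog forwarder (host name and port are placeholders):

  # ceph.conf on the node: log via syslog instead of local files
  [global]
      log to syslog = true
      err to syslog = true

  # /etc/rsyslog.d/remote.conf: forward everything to an external log host
  *.* @loghost.example.com:514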


This way you can use the SSDs completely for journaling and shrink them 
using HPA with hdparm.
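For reference, setting a Host Protected Area with hdparm looks roughly like this (device and sector count are placeholders; -N changes the visible capacity, so double-check the numbers before running it):

  # show the current and native max sector count
  hdparm -N /dev/sdb
  # permanently cap the visible capacity, leaving the rest as spare area
  # for the SSD controller (example: ~20GB worth of 512-byte sectors)
  hdparm -Np39062500 /dev/sdb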


Still, I think that given the distributed nature it shouldn't matter if you 
lose a node. If losing a node has too big an impact on your cluster, go 
with smaller nodes.


Wido




syncfs slower than without syncfs

2012-11-07 Thread Stefan Priebe - Profihost AG

Hello list,

syncfs is much slower than without syncfs to me.

If i compile latest ceph master with wip-rbd-read:

with syncfs:
rand 4K:
  write: io=1133MB, bw=12853KB/s, iops=3213, runt= 90277msec
  read : io=1239MB, bw=14046KB/s, iops=3511, runt= 90325msec
seq 4M:
  write: io=37560MB, bw=423874KB/s, iops=103, runt= 90738msec
  read : io=103852MB, bw=1151MB/s, iops=287, runt= 90237msec

without syncfs:
rand 4K:
  write: io=3733MB, bw=42459KB/s, iops=10614, runt= 90039msec
  read : io=6018MB, bw=68446KB/s, iops=17111, runt= 90038msec
seq 4M:
  write: io=51204MB, bw=577328KB/s, iops=140, runt= 90820msec
  read : io=150320MB, bw=1666MB/s, iops=416, runt= 90228msec

I thought syncfs should boost the performance.

Greets,
Stefan


Re: SSD journal suggestion

2012-11-07 Thread Sage Weil
On Wed, 7 Nov 2012, Gandalf Corvotempesta wrote:
> I'm evaluating some SSD drives as journal.
> Samsung 840 Pro seems to be the fastest in sequential reads and write.
> 
> What parameter should I consider for a journal? I think that none of
> the read benchmarks matter, because when dumping the journal to disk the
> bottleneck will always be the SAS/SATA write speed (in this case, the
> SSD will never reach its best read performance).
> So, should I evaluate only the write speed when data is written to the
> journal? Sequential or random?

Sequential write is the only thing that matters for the osd journal.  I'd 
look at both large writes and small writes.
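Something like the following fio runs would cover both ends (standard fio 
options; the device name is a placeholder and will be overwritten):

  # small sequential writes
  fio --name=journal-small --filename=/dev/sdX --rw=write --bs=4k \
      --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based
  # large sequential writes
  fio --name=journal-large --filename=/dev/sdX --rw=write --bs=4m \
      --ioengine=libaio --iodepth=4 --direct=1 --runtime=60 --time_based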

sage



Re: syncfs slower than without syncfs

2012-11-07 Thread Stefan Priebe - Profihost AG

On 07.11.2012 10:41, Stefan Priebe - Profihost AG wrote:

Hello list,

syncfs is much slower than without syncfs to me.

If i compile latest ceph master with wip-rbd-read:

with syncfs:
rand 4K:
   write: io=1133MB, bw=12853KB/s, iops=3213, runt= 90277msec
   read : io=1239MB, bw=14046KB/s, iops=3511, runt= 90325msec
seq 4M:
   write: io=37560MB, bw=423874KB/s, iops=103, runt= 90738msec
   read : io=103852MB, bw=1151MB/s, iops=287, runt= 90237msec

without syncfs:
rand 4K:
   write: io=3733MB, bw=42459KB/s, iops=10614, runt= 90039msec
   read : io=6018MB, bw=68446KB/s, iops=17111, runt= 90038msec
seq 4M:
   write: io=51204MB, bw=577328KB/s, iops=140, runt= 90820msec
   read : io=150320MB, bw=1666MB/s, iops=416, runt= 90228msec

I thought syncfs should boost the performance.


Fixed - my chroot env with glibc syncfs support didn't have libcrypto 
installed, and it isn't in the control dependencies. Now with libcrypto I 
get the same results as without syncfs.


Stefan


Unexpected behavior by ceph 0.48.2argonaut.

2012-11-07 Thread hemant surale
I am not sure about my conclusions, but please help me understand the
result of the following experiment:


Experiment: 3-node cluster, all nodes have ceph v0.48.2 argonaut (built
from source) + Ubuntu 12.04 + kernel 3.2.0
--
VM1 (mon.0 + osd.0 + mds.0)   VM2 (mon.1 + osd.1 + mds.1)   VM3 (mon.2)
 - Cluster is up and HEALTH_OK.
 - Replication factor is 2 (by default all pools have the
   replication factor set to 2).
 - After mounting with "mount.ceph mon_addr:port:/ ~/cephfs",
   I created a file inside the mounted dir "cephfs".
 - I was able to see data on both OSDs, i.e. VM1 (osd.0) and
   VM2 (osd.1), and the file was accessible.
 - Then VM2 was taken down, and its absence was verified with ceph -s.
 - Even though VM1 (osd.0, mds.0, mon.0) + VM3 (mon.2) were still
   live, I was unable to access the file.
 - I tried to remount the data on a different dir with
   mount.ceph currently_live_mons:/ /home/hemant/xyz
 - Even after that I was unable to access the file stored
   on the cluster.
--

-
Hemant Surale.


unexpected problem with radosgw fcgi

2012-11-07 Thread Sławomir Skowron
I have realized that requests served via fastcgi in nginx from radosgw return:

HTTP/1.1 200, not HTTP/1.1 200 OK

Any other CGI that I run, for example PHP via fastcgi, returns this the
way the RFC says, with OK.

Has anyone else experienced this problem?

I see in code:

./src/rgw/rgw_rest.cc line 36

const static struct rgw_html_errors RGW_HTML_ERRORS[] = {
{ 0, 200, "" },


What if i change this into:

{ 0, 200, "OK" },

--
-
Regards

Sławek Skowron


Re: SSD journal suggestion

2012-11-07 Thread Mark Nelson

On 11/07/2012 06:28 AM, Gandalf Corvotempesta wrote:

2012/11/7 Sage Weil :

On Wed, 7 Nov 2012, Gandalf Corvotempesta wrote:

I'm evaluating some SSD drives as journal.
Samsung 840 Pro seems to be the fastest in sequential reads and write.


The 840 Pro seems to reach 485MB/s in sequential write:
http://www.storagereview.com/samsung_ssd_840_pro_review



I'm using Intel 510s in a test node and can do about 450MB/s per drive. 
 Right now I'm doing 3 journals per SSD, but topping out at about 
1.2-1.4GB/s from the client perspective for the node with 15+ drives and 
5 SSDs.  It's possible newer versions of the code and tuning may 
increase that.
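For reference, spreading journals over SSD partitions is just a matter of 
pointing each OSD at its own partition in ceph.conf, roughly like this 
(device paths are placeholders for however the SSD partitions are named):

  [osd.0]
      osd journal = /dev/sda5
  [osd.1]
      osd journal = /dev/sda6
  [osd.2]
      osd journal = /dev/sda7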


TV pointed me at the new Intel DC S3700 which looks like a very 
interesting option (the 100GB model for $240).


http://www.anandtech.com/show/6432/the-intel-ssd-dc-s3700-intels-3rd-generation-controller-analyzed

Mark


Re: syncfs slower than without syncfs

2012-11-07 Thread Mark Nelson
Whew, glad you found the problem Stefan!  I was starting to wonder what 
was going on. :)  Do you mind filing a bug about the control dependencies?


Mark

On 11/07/2012 07:31 AM, Stefan Priebe - Profihost AG wrote:

On 07.11.2012 10:41, Stefan Priebe - Profihost AG wrote:

Hello list,

syncfs is much slower than without syncfs to me.

If i compile latest ceph master with wip-rbd-read:

with syncfs:
rand 4K:
   write: io=1133MB, bw=12853KB/s, iops=3213, runt= 90277msec
   read : io=1239MB, bw=14046KB/s, iops=3511, runt= 90325msec
seq 4M:
   write: io=37560MB, bw=423874KB/s, iops=103, runt= 90738msec
   read : io=103852MB, bw=1151MB/s, iops=287, runt= 90237msec

without syncfs:
rand 4K:
   write: io=3733MB, bw=42459KB/s, iops=10614, runt= 90039msec
   read : io=6018MB, bw=68446KB/s, iops=17111, runt= 90038msec
seq 4M:
   write: io=51204MB, bw=577328KB/s, iops=140, runt= 90820msec
   read : io=150320MB, bw=1666MB/s, iops=416, runt= 90228msec

I thought syncfs should boost the performance.


Fixed - my chroot env with glibc syncfs support didn't had the libcrypto
installed and it isn't in the control dependencies. Now with libcrypto i
get the same results as without syncfs.

Stefan


Re: Mons network

2012-11-07 Thread Gregory Farnum
On Wed, Nov 7, 2012 at 1:31 PM, Gandalf Corvotempesta
 wrote:
> Which kind of network should I plan for 3 or 4 MON nodes?
> Are these nodes contacted by the ceph clients (RBD, RGW, and so on) or only by the OSDs?
>
> Can I use some virtual machines distributed across multiple Xen nodes
> on a 100mbit/s network reachable by the whole storage cluster
> (client, server, osd, rbd, rgw), or should I use a gigabit network
> reachable only by the OSDs (for performance and security purposes)?

The mons need to be reachable by everybody. They don't do a ton of
network traffic, but 100Mb/s might be pushing it a bit low...
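In practice that just means every client and OSD carries the monitor 
addresses in its ceph.conf, along these lines (hosts and addresses are 
placeholders):

  [mon.a]
      host = mon-a
      mon addr = 192.168.1.10:6789
  [mon.b]
      host = mon-b
      mon addr = 192.168.1.11:6789
  [mon.c]
      host = mon-c
      mon addr = 192.168.1.12:6789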


Re: Mons network

2012-11-07 Thread Gregory Farnum
On Wed, Nov 7, 2012 at 4:20 PM, Gandalf Corvotempesta
 wrote:
> 2012/11/7 Gregory Farnum :
>> The mons need to be reachable by everybody. They don't do a ton of
>> network traffic, but 100Mb/s might be pushing it a bit low...
>
> Some portions of my network are 10/100 with gigabit uplinks.
> In this portion we have many XenServer nodes that we would like to use
> for mons, but the maximum speed of each is 100mbit.
> Moving the mons to the SAN would mean buying new hardware.

Well, you can experiment with it and see what happens.
Or just colocate the monitors with some of your OSDs.


trying to import crushmap results in max_devices > osdmap max_osd

2012-11-07 Thread Stefan Priebe - Profihost AG

Hello,

i've added two nodes with 4 devices each and modified the crushmap.

But importing the new map results in:
crushmap max_devices 55 > osdmap max_osd 35

What's wrong?
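For context, the usual round trip for editing a CRUSH map looks roughly 
like this; if the map references more devices than the osdmap knows about, 
max_osd has to be raised (or the new OSDs created) before the map will 
import. A sketch, with the device count as a placeholder:

  crushtool -d crushmap -o crushmap.txt      # decompile
  # ... edit crushmap.txt ...
  crushtool -c crushmap.txt -o crushmap.new  # recompile
  ceph osd setmaxosd 55                      # raise osdmap max_osd if needed
  ceph osd setcrushmap -i crushmap.new       # inject the new map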

Greets
Stefan


Re: SSD journal suggestion

2012-11-07 Thread Mark Nelson

On 11/07/2012 10:12 AM, Atchley, Scott wrote:

On Nov 7, 2012, at 10:01 AM, Mark Nelson  wrote:


On 11/07/2012 06:28 AM, Gandalf Corvotempesta wrote:

2012/11/7 Sage Weil :

On Wed, 7 Nov 2012, Gandalf Corvotempesta wrote:

I'm evaluating some SSD drives as journal.
Samsung 840 Pro seems to be the fastest in sequential reads and write.


The 840 Pro seems to reach 485MB/s in sequential write:
http://www.storagereview.com/samsung_ssd_840_pro_review



I'm using Intel 510s in a test node and can do about 450MB/s per drive.


Is that sequential read or write? Intel lists them at 210-315 MB/s for 
sequential write. The 520s are rated at 475-520 MB/s seq. write.


Doh, wrote that too early in the morning after staying up all night 
watching the elections. :)  You are correct, it's the 520, not the 510.





  Right now I'm doing 3 journals per SSD, but topping out at about
1.2-1.4GB/s from the client perspective for the node with 15+ drives and
5 SSDs.  It's possible newer versions of the code and tuning may
increase that.


What interconnect is this? 10G Ethernet is 1.25 GB/s line rate and I would 
expect your Sockets and Ceph overhead to eat into that. Or is it dual 10G 
Ethernet?

Scott



This is 8 concurrent instances of rados bench running on localhost. 
Ceph is configured with 1x replication.  1.2-1.4GB/s is the aggregate 
throughput of all of the rados bench instances.



TV pointed me at the new Intel DC S3700 which looks like a very
interesting option (the 100GB model for $240).

http://www.anandtech.com/show/6432/the-intel-ssd-dc-s3700-intels-3rd-generation-controller-analyzed

Mark


Re: SSD journal suggestion

2012-11-07 Thread Atchley, Scott
On Nov 7, 2012, at 10:01 AM, Mark Nelson  wrote:

> On 11/07/2012 06:28 AM, Gandalf Corvotempesta wrote:
>> 2012/11/7 Sage Weil :
>>> On Wed, 7 Nov 2012, Gandalf Corvotempesta wrote:
 I'm evaluating some SSD drives as journal.
 Samsung 840 Pro seems to be the fastest in sequential reads and write.
>> 
>> The 840 Pro seems to reach 485MB/s in sequential write:
>> http://www.storagereview.com/samsung_ssd_840_pro_review
>> 
> 
> I'm using Intel 510s in a test node and can do about 450MB/s per drive. 

Is that sequential read or write? Intel lists them at 210-315 MB/s for 
sequential write. The 520s are rated at 475-520 MB/s seq. write.

>  Right now I'm doing 3 journals per SSD, but topping out at about 
> 1.2-1.4GB/s from the client perspective for the node with 15+ drives and 
> 5 SSDs.  It's possible newer versions of the code and tuning may 
> increase that.

What interconnect is this? 10G Ethernet is 1.25 GB/s line rate and I would 
expect your Sockets and Ceph overhead to eat into that. Or is it dual 10G 
Ethernet?

Scott

> TV pointed me at the new Intel DC S3700 which looks like a very 
> interesting option (the 100GB model for $240).
> 
> http://www.anandtech.com/show/6432/the-intel-ssd-dc-s3700-intels-3rd-generation-controller-analyzed
> 
> Mark


Re: syncfs slower than without syncfs

2012-11-07 Thread Stefan Priebe

On 07.11.2012 16:04, Mark Nelson wrote:

Whew, glad you found the problem Stefan!  I was starting to wonder what
was going on. :)  Do you mind filling a bug about the control dependencies?


Sure, where should I file it?

Stefan


Re: SSD journal suggestion

2012-11-07 Thread Atchley, Scott
On Nov 7, 2012, at 11:20 AM, Mark Nelson  wrote:

>>>  Right now I'm doing 3 journals per SSD, but topping out at about
>>> 1.2-1.4GB/s from the client perspective for the node with 15+ drives and
>>> 5 SSDs.  It's possible newer versions of the code and tuning may
>>> increase that.
>> 
>> What interconnect is this? 10G Ethernet is 1.25 GB/s line rate and I would 
>> expect your Sockets and Ceph overhead to eat into that. Or is it dual 10G 
>> Ethernet?
> 
> This is 8 concurrent instances of rados bench running on localhost. 
> Ceph is configured with 1x replication.  1.2-1.4GB/s is the aggregate 
> throughput of all of the rados bench instances.

Ok, all local with no communication. Given this level of local performance, 
what does that translate into when talking over the network?

Scott



Re: SSD journal suggestion

2012-11-07 Thread Mark Nelson

On 11/07/2012 10:35 AM, Atchley, Scott wrote:

On Nov 7, 2012, at 11:20 AM, Mark Nelson  wrote:


  Right now I'm doing 3 journals per SSD, but topping out at about
1.2-1.4GB/s from the client perspective for the node with 15+ drives and
5 SSDs.  It's possible newer versions of the code and tuning may
increase that.


What interconnect is this? 10G Ethernet is 1.25 GB/s line rate and I would 
expect your Sockets and Ceph overhead to eat into that. Or is it dual 10G 
Ethernet?


This is 8 concurrent instances of rados bench running on localhost.
Ceph is configured with 1x replication.  1.2-1.4GB/s is the aggregate
throughput of all of the rados bench instances.


Ok, all local with no communication. Given this level of local performance, 
what does that translate into when talking over the network?

Scott



Well, local, but still over tcp.  Right now I'm focusing on pushing the 
osds/filestores as far as I can, and after that I'm going to setup a 
bonded 10GbE network to see what kind of messenger bottlenecks I run 
into.  Sadly the testing is going slower than I would like.


Mark


Re: Ubuntu 12.04.1 + xfs + syncfs is still not our friend

2012-11-07 Thread Josh Durgin

On 11/07/2012 12:14 AM, Gandalf Corvotempesta wrote:

2012/11/7 Dan Mick :

Resolution: installing the packages built for precise, rather than squeeze,
got versions that use syncfs.


Which packages, ceph or libc?


The ceph packages.


rbd striping format v1 / v2 benchmark

2012-11-07 Thread Stefan Priebe

Hello list,

i've done some benchmarks regarding striping / v1 / v2.

Results:
format 1:

  write: io=5739MB, bw=65278KB/s, iops=16319, runt= 90029msec
  read : io=5771MB, bw=65636KB/s, iops=16408, runt= 90030msec
  write: io=77224MB, bw=874044KB/s, iops=213, runt= 90473msec
  read : io=178840MB, bw=1983MB/s, iops=495, runt= 90168msec

format 2:

 --stripe-count 8 --stripe-unit 524288 -s 4194304

  write: io=5377MB, bw=61147KB/s, iops=15286, runt= 90041msec
  read : io=5332MB, bw=60617KB/s, iops=15154, runt= 90067msec
  write: io=75136MB, bw=849285KB/s, iops=207, runt= 90593msec
  read : io=160292MB, bw=1777MB/s, iops=444, runt= 90226msec

 --stripe-count 4 --stripe-unit 1048576 -s 4194304

  write: io=5301MB, bw=60281KB/s, iops=15070, runt= 90046msec
  read : io=5367MB, bw=61031KB/s, iops=15257, runt= 90057msec
  write: io=74448MB, bw=840293KB/s, iops=205, runt= 90724msec
  read : io=170616MB, bw=1891MB/s, iops=472, runt= 90227msec

So it seems right now that striping doesn't improve performance.
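For reference, a format 2 image with that striping layout would be created 
with something along these lines (the exact flags are an assumption based 
on the options quoted above):

  rbd create testimg --size 10240 --format 2 \
      --stripe-unit 524288 --stripe-count 8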

Greets,
Stefan


Re: rbd striping format v1 / v2 benchmark

2012-11-07 Thread Sage Weil
On Wed, 7 Nov 2012, Stefan Priebe wrote:
> Hello list,
> 
> i've done some benchmarks regarding striping / v1 / v2.
> 
> Results:
> format 1:
> 
>   write: io=5739MB, bw=65278KB/s, iops=16319, runt= 90029msec
>   read : io=5771MB, bw=65636KB/s, iops=16408, runt= 90030msec
>   write: io=77224MB, bw=874044KB/s, iops=213, runt= 90473msec
>   read : io=178840MB, bw=1983MB/s, iops=495, runt= 90168msec
> 
> format 2:
> 
>  --stripe-count 8 --stripe-unit 524288 -s 4194304
> 
>   write: io=5377MB, bw=61147KB/s, iops=15286, runt= 90041msec
>   read : io=5332MB, bw=60617KB/s, iops=15154, runt= 90067msec
>   write: io=75136MB, bw=849285KB/s, iops=207, runt= 90593msec
>   read : io=160292MB, bw=1777MB/s, iops=444, runt= 90226msec
> 
>  --stripe-count 4 --stripe-unit 1048576 -s 4194304
> 
>   write: io=5301MB, bw=60281KB/s, iops=15070, runt= 90046msec
>   read : io=5367MB, bw=61031KB/s, iops=15257, runt= 90057msec
>   write: io=74448MB, bw=840293KB/s, iops=205, runt= 90724msec
>   read : io=170616MB, bw=1891MB/s, iops=472, runt= 90227msec
> 
> So it seems right now that striping doesn't improve performance.

It's mainly going to help when you have a deep queue of small sequential 
IOs, and the fact that they are all piling up on a single rbd block is 
turning into a bottleneck.  The rest of the time the overhead of splitting 
things into smaller pieces will slow you down.

The intended use case is a database journal or something similar, where 
the latency requirements prevent us from making larger IOs, but things are 
still sequential.

rbd bench-write ...

will generate this workload.
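An equivalent way to generate that workload from inside a guest is a deep 
queue of small sequential writes with fio, e.g. (standard fio options; the 
target device is a placeholder):

  fio --name=seq-small --filename=/dev/vdb --rw=write --bs=4k \
      --iodepth=64 --ioengine=libaio --direct=1 --runtime=60 --time_based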

sage

> 
> Greets,
> Stefan


extreme ceph-osd cpu load for rand. 4k write

2012-11-07 Thread Stefan Priebe

Hello list,

while benchmarking I was wondering why the ceph-osd load is so 
extremely high during random 4k write I/O.


Here an example while benchmarking:

random 4k write: 16,000 iops, 180% CPU Load in top from EACH ceph-osd process

random 4k read: 16,000 iops, 19% CPU Load in top from EACH ceph-osd process

seq 4M write: 800MB/s, 14% CPU Load in top from EACH ceph-osd process

seq 4M read: 1600MB/s, 9% CPU Load in top from EACH ceph-osd process

I can't understand why in this single case the load is so EXTREMELY high.
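One way to see where that CPU time goes is to profile an osd process while 
the random-write benchmark runs, e.g. with the standard perf tool (the PID 
selection is just an example):

  # live view of the hottest functions in one ceph-osd process
  perf top -p $(pgrep -o ceph-osd)
  # or record for 30 seconds and inspect afterwards
  perf record -g -p $(pgrep -o ceph-osd) -- sleep 30
  perf report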

Greets
Stefan


Re: SSD journal suggestion

2012-11-07 Thread Martin Mailand

Hi,

I have 16 SAS disks on an LSI 9266-8i and 4 Intel 520 SSDs on an HBA; the 
node has dual 10G Ethernet. The clients are 4 nodes with dual 10GbE, and as 
a test I run rados bench on each client. The aggregate write speed is 
around 1.6GB/s with single replication.


In the first configuration I had the SSDs on the RAID controller as 
well, but then I saturated the PCIe 2.0 x8 interface of the RAID 
controller, so I now use a second controller for the SSDs.



-martin


On 07.11.2012 17:41, Mark Nelson wrote:

Well, local, but still over tcp.  Right now I'm focusing on pushing the
osds/filestores as far as I can, and after that I'm going to setup a
bonded 10GbE network to see what kind of messenger bottlenecks I run
into.  Sadly the testing is going slower than I would like.




Re: SSD journal suggestion

2012-11-07 Thread Martin Mailand

Hi,

I tested an Arista 7150S-24 and an HP5900, and in a few weeks I will get a 
Mellanox MSX1016. ATM the Arista is my favourite.
For the dual 10GbE NICs I tested the Intel X520-DA2 and the Mellanox 
ConnectX-3. My favourite is the Intel X520-DA2.


-martin

On 07.11.2012 22:14, Gandalf Corvotempesta wrote:

2012/11/7 Martin Mailand :

I have 16 SAS disk on a LSI 9266-8i and 4 Intel 520 SSD on a HBA, the node
has dual 10G Ethernet. The clients are 4 nodes with dual 10GeB, as test I
use rados bench on each client. The aggregated write speed is around 1,6GB/s
with single replication.


Just for curiosity, which switches do you have?




Re: SSD journal suggestion

2012-11-07 Thread Stefan Priebe

On 07.11.2012 22:35, Martin Mailand wrote:

Hi,

I tested a Arista 7150S-24, a HP5900 and in a few weeks I will get a
Mellanox MSX1016. ATM the Arista is may favourite.
For the dual 10GeB NICs I tested the Intel X520-DA2 and the Mellanox
ConnectX-3. My favourite is the Intel X520-DA2.


That's pretty interesting; I'll get the HP5900 and HP5920 in a few weeks. 
HP told me the deep packet buffers of the HP5920 will boost the 
performance and that it should be used for storage-related stuff.


Greets,
Stefan


Re: SSD journal suggestion

2012-11-07 Thread Martin Mailand

Hi Stefan,

Deep buffers mean latency spikes; you should go for low switching 
latency instead. The HP5900 has a latency of 1ms, the Arista and Mellanox of 250ns.

And you should also consider the price: the HP5900 costs 3 times as much as the Mellanox.

-martin

On 07.11.2012 22:44, Stefan Priebe wrote:

On 07.11.2012 22:35, Martin Mailand wrote:

Hi,

I tested a Arista 7150S-24, a HP5900 and in a few weeks I will get a
Mellanox MSX1016. ATM the Arista is may favourite.
For the dual 10GeB NICs I tested the Intel X520-DA2 and the Mellanox
ConnectX-3. My favourite is the Intel X520-DA2.


That's pretty interesting i'll get the HP5900 and HP5920 in a few weeks.
HP told me the deep packet buffers of the HP5920 will burst the
performance and should be used for storage related stuff.

Greets,
Stefan


Re: SSD journal suggestion

2012-11-07 Thread Stefan Priebe

On 07.11.2012 22:55, Martin Mailand wrote:

Hi Stefan,

deep buffers means latency spikes, you should go for fast switching
latency. The HP5900 has a latency of 1ms, the Arista and Mellanox of 250ns.


HP told me they all use the same chips and that Arista measures latency 
while only one port is in use; HP guarantees the latency when all ports are 
in use. Whether this is correct or just something HP told me - I don't know. 
They told me the Arista is slower and the statistics are not comparable...


> And I you should think at the price the HP5900 cost 3 times of the
> Mellanox.
Don't know what the Mellanox costs. I get the HP for a really good 
price, below 10,000 €.


Greets,
Stefan


less cores more iops / speed

2012-11-07 Thread Stefan Priebe

Hello again,

I've noticed something really interesting.

I get 5000 iops / VM for rand. 4k writes while assigning 4 cores on a 
2.5 GHz Xeon.


When I move this VM to another KVM host with 3.6 GHz CPUs I get 8000 iops 
(still 8 cores); when I then LOWER the assigned cores from 8 to 4 I get 
14,500 iops. If I assign only 2 cores I get 16,000 iops...


Why do fewer KVM cores mean more speed?

Greets,
Stefan


Re: SSD journal suggestion

2012-11-07 Thread Martin Mailand

Hi,

I *think* the HP is Broadcom based, the Arista is Fulcrum based, and I 
don't know which chips Mellanox is using.


Our NOC tested both of them, and the Arista was the clear winner, at 
least for our workload.


-martin

On 07.11.2012 22:59, Stefan Priebe wrote:

HP told me they all use the same ships and Arista measures latency while
only one port is in use. HP guarentees the latency when all ports are in
use. If this is correct or just somehing hp told me - i don't know. They
told me the arista is slower and the statistics are not comporable...



Re: SSD journal suggestion

2012-11-07 Thread Martin Mailand

Good question - probably we do not have enough experience with IPoIB.
But it looks good on paper, so it's definitely worth a try.

-martin

On 07.11.2012 23:28, Gandalf Corvotempesta wrote:

2012/11/7 Martin Mailand :

I tested a Arista 7150S-24, a HP5900 and in a few weeks I will get a
Mellanox MSX1016. ATM the Arista is may favourite.


Why not infiniband?




Re: SSD journal suggestion

2012-11-07 Thread Mark Nelson

On 11/07/2012 04:51 PM, Gandalf Corvotempesta wrote:

2012/11/7 Martin Mailand :

But it looks good on paper, so it's definitely a try worth.


is at least 4x faster than 10GbE and AFAIK should have a lower latency.
I'm planning to use InfiniBand as the backend storage network, used for
OSD replication. 2 HBAs for each OSD should give me 80Gbps and full
redundancy



I haven't done much with IPoIB (just RDMA), but my understanding is that 
it tends to top out at like 15Gb/s.  Some others on this mailing list 
can probably speak more authoritatively.  Even with RDMA you are going 
to top out at around 3.1-3.2GB/s.


This thread may be helpful/interesting:
http://comments.gmane.org/gmane.linux.drivers.rdma/12279

Mark


Re: less cores more iops / speed

2012-11-07 Thread Joao Eduardo Luis
On 11/07/2012 10:02 PM, Stefan Priebe wrote:
> Hello again,
> 
> I've noticed something really interesting.
> 
> I get 5000 iops / VM for rand. 4k writes while assigning 4 cores on a
> 2.5 Ghz Xeon.
> 
> When i move this VM to another kvm host with 3.6Ghz i get 8000 iops
> (still 8 cores) when i then LOWER the assigned cores from 8 to 4 i get
> 14.500 iops. If i assign only 2 cores i get 16.000 iops...
> 
> Why does less kvm cores mean more speed?

Totally going out on a limb here, but it might be related to the cache, maybe?
When you have more cores your threads may bounce around the cores and
invalidate cache entries as they go; with fewer cores you might end up
with some sort of twisted, forced CPU affinity that allows you to take
advantage of caching.

But I don't know, really. I would be amazed if what I just wrote had an
ounce of truth, and would be completely astonished if that was the cause
of such a sudden increase in iops.
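If the affinity theory is worth testing, one quick experiment is to pin the
qemu/kvm process (and all its threads) to a fixed set of cores and re-run
the benchmark, e.g. (core list and PID are placeholders):

  taskset -a -cp 0-3 <qemu_pid>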

  -Joao

> 
> Greets,
> Stefan


Openstack - Boot From New Volume

2012-11-07 Thread Quenten Grasso
Hi All,

I've been looking for this bit of code for a while: how to make OpenStack 
create a VM from the dashboard with an attached/imaged volume, to avoid the 
current multi-step process of creating a VM with a ceph volume.

Here's the code. Thanks go out to vishy on openstack-dev, and I've added a 
couple of bits to allow for download status and better volume naming.

Enjoy!

https://github.com/qgrasso/nova/commit/340210a7f62e9e34bcc15972d97e81eb282fddd6

https://github.com/qgrasso/nova/commit/31f6839a2c2dd72e2cc44c37cb316610b42598bd


** use at your own risk as always etc etc, :)

Regards,
Quenten 


Re: less cores more iops / speed

2012-11-07 Thread Mark Nelson

On 11/07/2012 06:00 PM, Joao Eduardo Luis wrote:

On 11/07/2012 10:02 PM, Stefan Priebe wrote:

Hello again,

I've noticed something really interesting.

I get 5000 iops / VM for rand. 4k writes while assigning 4 cores on a
2.5 Ghz Xeon.

When i move this VM to another kvm host with 3.6Ghz i get 8000 iops
(still 8 cores) when i then LOWER the assigned cores from 8 to 4 i get
14.500 iops. If i assign only 2 cores i get 16.000 iops...

Why does less kvm cores mean more speed?


Totally going on a limb here, but might be related to the cache maybe?
When you have more cores your threads may bounce around the cores and
invalidate cache entries as they go by; will less cores you might end up
with some sort of twisted, forced cpu affinity that allows you to take
advantage of caching.


There's also the context switching overhead.  It'd be interesting to 
know how much the writer processes were shifting around on cores. 
Stefan, what tool were you using to do writes?




But I don't know, really. I would be amazed if what I just wrote had an
ounce of truth, and would be completely astonished if that was the cause
for such a sudden increase on iops.


Yeah, it seems pretty surprising that there would be any significant 
effect at this level of performance.




   -Joao



Greets,
Stefan




Re: Unexpected behavior by ceph 0.48.2argonaut.

2012-11-07 Thread Josh Durgin

On 11/07/2012 05:34 AM, hemant surale wrote:

I am not sure about my judgments but please help me out understanding
the result of following experiment carried out : -


Experiment  : (3 node cluster,  all have ceph v.0.48.2agonaut( after
building ceph from src code) + UBUNTU 12.04 + kernel 3.2.0 )
--
 VM1 ( mon.0+ osd.0 +mds.0 )  VM2 (mon.1 + osd.1 + mds.1)  
VM3(mon.2)
  - Cluster is up and HEALTH_OK
  - Replication factor is 2. (by default all pools have
replication factor set to 2)
  - After mounting "mount.ceph mon_addr:port :/ ~/cephfs "
, I created file inside mounted Dir "cephfs" .
  - And able to see data on both OSD i.e. VM1(osd.0) and on
VM2(osd.1)  as well as file is accessible .
  - Then VM2 is made down & VM2 absence is verified with ceph -s  .
  - Even after VM1(osd.0 mds.0 mon.0) + VM3 (mon.2)  was
live , I am unable to access the file .


Was the cluster showing all pgs active+degraded after this, or did
some stay inactive?


  - I tried to remount the data on different Dir with
mount.ceph currently_live_mons:/ /home/hemant/xyz
  - Even after that I was unable to access the file stored
on cluster.
--

-
Hemant Surale.




Re: syncfs slower than without syncfs

2012-11-07 Thread Josh Durgin

On 11/07/2012 08:26 AM, Stefan Priebe wrote:

On 07.11.2012 16:04, Mark Nelson wrote:

Whew, glad you found the problem Stefan!  I was starting to wonder what
was going on. :)  Do you mind filling a bug about the control
dependencies?


Sure where should i fill it in?


http://www.tracker.newdream.net/projects/ceph/issues/new



RE: less cores more iops / speed

2012-11-07 Thread Dietmar Maurer
> I've noticed something really interesting.
> 
> I get 5000 iops / VM for rand. 4k writes while assigning 4 cores on a
> 2.5 Ghz Xeon.
> 
> When i move this VM to another kvm host with 3.6Ghz i get 8000 iops (still 8
> cores) when i then LOWER the assigned cores from 8 to 4 i get
> 14.500 iops. If i assign only 2 cores i get 16.000 iops...
> 
> Why does less kvm cores mean more speed?

There is a serious bug in the kvm vhost code. Do you use virtio-net with vhost?

see: http://lists.nongnu.org/archive/html/qemu-devel/2012-11/msg00579.html

Please test using the e1000 driver instead.
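A minimal way to test that on the qemu command line is to swap the NIC
model, e.g. (only the NIC-related options are shown, and a tap backend is
assumed; adapt to your management stack):

  # virtio-net with vhost (the suspect configuration)
  -netdev tap,id=net0,vhost=on -device virtio-net-pci,netdev=net0
  # e1000 for comparison
  -netdev tap,id=net0 -device e1000,netdev=net0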



RE: less cores more iops / speed

2012-11-07 Thread Dietmar Maurer
> > I've noticed something really interesting.
> >
> > I get 5000 iops / VM for rand. 4k writes while assigning 4 cores on a
> > 2.5 Ghz Xeon.
> >
> > When i move this VM to another kvm host with 3.6Ghz i get 8000 iops
> > (still 8
> > cores) when i then LOWER the assigned cores from 8 to 4 i get
> > 14.500 iops. If i assign only 2 cores i get 16.000 iops...
> >
> > Why does less kvm cores mean more speed?
> 
> There is a serious bug in the kvm vhost code. Do you use virtio-net with
> vhost?
> 
> see: http://lists.nongnu.org/archive/html/qemu-devel/2012-
> 11/msg00579.html
> 
> Please test using the e1000 driver instead.

Or update the guest kernel (which guest kernel do you use?). AFAIK 3.x kernels 
do not trigger the bug.



Re: less cores more iops / speed

2012-11-07 Thread Stefan Priebe - Profihost AG
On 08.11.2012 at 06:42, Dietmar Maurer wrote:

>> I've noticed something really interesting.
>> 
>> I get 5000 iops / VM for rand. 4k writes while assigning 4 cores on a
>> 2.5 Ghz Xeon.
>> 
>> When i move this VM to another kvm host with 3.6Ghz i get 8000 iops (still 8
>> cores) when i then LOWER the assigned cores from 8 to 4 i get
>> 14.500 iops. If i assign only 2 cores i get 16.000 iops...
>> 
>> Why does less kvm cores mean more speed?
> 
> There is a serious bug in the kvm vhost code. Do you use virtio-net with 
> vhost?
> 
> see: http://lists.nongnu.org/archive/html/qemu-devel/2012-11/msg00579.html
> 
> Please test using the e1000 driver instead.

Why is the vhost net driver involved here at all? The KVM guest only uses ssh here.

Stefan


RE: less cores more iops / speed

2012-11-07 Thread Dietmar Maurer
> Why is vhost net driver involved here at all? Kvm guest only uses ssh here.

I thought you were testing things (rbd) which depend on KVM network speed?



Re: less cores more iops / speed

2012-11-07 Thread Stefan Priebe - Profihost AG
On 08.11.2012 at 06:49, Dietmar Maurer wrote:

>>> I've noticed something really interesting.
>>> 
>>> I get 5000 iops / VM for rand. 4k writes while assigning 4 cores on a
>>> 2.5 Ghz Xeon.
>>> 
>>> When i move this VM to another kvm host with 3.6Ghz i get 8000 iops
>>> (still 8
>>> cores) when i then LOWER the assigned cores from 8 to 4 i get
>>> 14.500 iops. If i assign only 2 cores i get 16.000 iops...
>>> 
>>> Why does less kvm cores mean more speed?
>> 
>> There is a serious bug in the kvm vhost code. Do you use virtio-net with
>> vhost?
>> 
>> see: http://lists.nongnu.org/archive/html/qemu-devel/2012-
>> 11/msg00579.html
>> 
>> Please test using the e1000 driver instead.
> 
> Or update the guest kernel (what guest kernel do you use?). AFAIK 3.X kernels 
> does not trigger the bug.

Guest and host have 3.6.6 installed.




Re: less cores more iops / speed

2012-11-07 Thread Stefan Priebe - Profihost AG

On 08.11.2012 at 06:54, Dietmar Maurer wrote:

>> Why is vhost net driver involved here at all? Kvm guest only uses ssh here.
> 
> I though you are testing things (rdb) which depends on KVM network speed?

The KVM process uses librbd, and both are running on the host, not in the guest.

Stefan 
