Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-14 Thread Josef Johansson
Hi Christian,

I missed this thread, haven't been reading the list that well the last
weeks.

You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here though.

A thought struck me that I could test with a pool that consists of OSDs
that have tempfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.

I'll get back to you with the results, hopefully I'll manage to get them
done during this night.

Cheers,
Josef

On 13/05/14 11:03, Christian Balzer wrote:
> I'm clearly talking to myself, but whatever.
>
> For Greg, I've played with all the pertinent journal and filestore options
> and TCP nodelay, no changes at all.
>
> Is there anybody on this ML who's running a Ceph cluster with a fast
> network and FAST filestore, so like me with a big HW cache in front of a
> RAID/JBODs or using SSDs for final storage?
>
> If so, what results do you get out of the fio statement below per OSD?
> In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which
> is of course vastly faster than the normal indvidual HDDs could do.
>
> So I'm wondering if I'm hitting some inherent limitation of how fast a
> single OSD (as in the software) can handle IOPS, given that everything else
> has been ruled out from where I stand.
>
> This would also explain why none of the option changes or the use of
> RBD caching has any measurable effect in the test case below. 
> As in, a slow OSD aka single HDD with journal on the same disk would
> clearly benefit from even the small 32MB standard RBD cache, while in my
> test case the only time the caching becomes noticeable is if I increase
> the cache size to something larger than the test data size. ^o^
>
> On the other hand if people here regularly get thousands or tens of
> thousands IOPS per OSD with the appropriate HW I'm stumped. 
>
> Christian
>
> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>
>> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
>>
>>> Oh, I didn't notice that. I bet you aren't getting the expected
>>> throughput on the RAID array with OSD access patterns, and that's
>>> applying back pressure on the journal.
>>>
>> In the a "picture" being worth a thousand words tradition, I give you
>> this iostat -x output taken during a fio run:
>>
>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>   50.820.00   19.430.170.00   29.58
>>
>> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>> sda   0.0051.500.00 1633.50 0.00  7460.00
>> 9.13 0.180.110.000.11   0.01   1.40 sdb
>> 0.00 0.000.00 1240.50 0.00  5244.00 8.45 0.30
>> 0.250.000.25   0.02   2.00 sdc   0.00 5.00
>> 0.00 2468.50 0.00 13419.0010.87 0.240.100.00
>> 0.10   0.09  22.00 sdd   0.00 6.500.00 1913.00
>> 0.00 10313.0010.78 0.200.100.000.10   0.09  16.60
>>
>> The %user CPU utilization is pretty much entirely the 2 OSD processes,
>> note the nearly complete absence of iowait.
>>
>> sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
>> Look at these numbers, the lack of queues, the low wait and service
>> times (this is in ms) plus overall utilization.
>>
>> The only conclusion I can draw from these numbers and the network results
>> below is that the latency happens within the OSD processes.
>>
>> Regards,
>>
>> Christian
>>> When I suggested other tests, I meant with and without Ceph. One
>>> particular one is OSD bench. That should be interesting to try at a
>>> variety of block sizes. You could also try runnin RADOS bench and
>>> smalliobench at a few different sizes.
>>> -Greg
>>>
>>> On Wednesday, May 7, 2014, Alexandre DERUMIER 
>>> wrote:
>>>
 Hi Christian,

 Do you have tried without raid6, to have more osd ?
 (how many disks do you have begin the raid6 ?)


 Aslo, I known that direct ios can be quite slow with ceph,

 maybe can you try without --direct=1

 and also enable rbd_cache

 ceph.conf
 [client]
 rbd cache = true




 - Mail original -

 De: "Christian Balzer" >
 À: "Gregory Farnum" >,
 ceph-users@lists.ceph.com 
 Envoyé: Jeudi 8 Mai 2014 04:49:16
 Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
 backing devices

 On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:

> On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
> >
 wrote:
>> Hello,
>>
>> ceph 0.72 on Debian Jessie,

Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-14 Thread Stefan Priebe - Profihost AG
Am 14.05.2014 11:29, schrieb Josef Johansson:
> Hi Christian,
> 
> I missed this thread, haven't been reading the list that well the last
> weeks.
> 
> You already know my setup, since we discussed it in an earlier thread. I
> don't have a fast backing store, but I see the slow IOPS when doing
> randwrite inside the VM, with rbd cache. Still running dumpling here though.
> 
> A thought struck me that I could test with a pool that consists of OSDs
> that have tempfs-based disks, think I have a bit more latency than your
> IPoIB but I've pushed 100k IOPS with the same network devices before.
> This would verify if the problem is with the journal disks. I'll also
> try to run the journal devices in tempfs as well, as it would test
> purely Ceph itself.

i did the same with bobtail a year ago and was still limited to nearly
the same values. No idea what firefly will say. I'm pretty sure the
limit is in the ceph code itself.

There were a short discussion here:
http://www.spinics.net/lists/ceph-devel/msg18731.html

Stefan

> I'll get back to you with the results, hopefully I'll manage to get them
> done during this night.
> 
> Cheers,
> Josef
> 
> On 13/05/14 11:03, Christian Balzer wrote:
>> I'm clearly talking to myself, but whatever.
>>
>> For Greg, I've played with all the pertinent journal and filestore options
>> and TCP nodelay, no changes at all.
>>
>> Is there anybody on this ML who's running a Ceph cluster with a fast
>> network and FAST filestore, so like me with a big HW cache in front of a
>> RAID/JBODs or using SSDs for final storage?
>>
>> If so, what results do you get out of the fio statement below per OSD?
>> In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which
>> is of course vastly faster than the normal indvidual HDDs could do.
>>
>> So I'm wondering if I'm hitting some inherent limitation of how fast a
>> single OSD (as in the software) can handle IOPS, given that everything else
>> has been ruled out from where I stand.
>>
>> This would also explain why none of the option changes or the use of
>> RBD caching has any measurable effect in the test case below. 
>> As in, a slow OSD aka single HDD with journal on the same disk would
>> clearly benefit from even the small 32MB standard RBD cache, while in my
>> test case the only time the caching becomes noticeable is if I increase
>> the cache size to something larger than the test data size. ^o^
>>
>> On the other hand if people here regularly get thousands or tens of
>> thousands IOPS per OSD with the appropriate HW I'm stumped. 
>>
>> Christian
>>
>> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>>
>>> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
>>>
 Oh, I didn't notice that. I bet you aren't getting the expected
 throughput on the RAID array with OSD access patterns, and that's
 applying back pressure on the journal.

>>> In the a "picture" being worth a thousand words tradition, I give you
>>> this iostat -x output taken during a fio run:
>>>
>>> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>>>   50.820.00   19.430.170.00   29.58
>>>
>>> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
>>> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
>>> sda   0.0051.500.00 1633.50 0.00  7460.00
>>> 9.13 0.180.110.000.11   0.01   1.40 sdb
>>> 0.00 0.000.00 1240.50 0.00  5244.00 8.45 0.30
>>> 0.250.000.25   0.02   2.00 sdc   0.00 5.00
>>> 0.00 2468.50 0.00 13419.0010.87 0.240.100.00
>>> 0.10   0.09  22.00 sdd   0.00 6.500.00 1913.00
>>> 0.00 10313.0010.78 0.200.100.000.10   0.09  16.60
>>>
>>> The %user CPU utilization is pretty much entirely the 2 OSD processes,
>>> note the nearly complete absence of iowait.
>>>
>>> sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
>>> Look at these numbers, the lack of queues, the low wait and service
>>> times (this is in ms) plus overall utilization.
>>>
>>> The only conclusion I can draw from these numbers and the network results
>>> below is that the latency happens within the OSD processes.
>>>
>>> Regards,
>>>
>>> Christian
 When I suggested other tests, I meant with and without Ceph. One
 particular one is OSD bench. That should be interesting to try at a
 variety of block sizes. You could also try runnin RADOS bench and
 smalliobench at a few different sizes.
 -Greg

 On Wednesday, May 7, 2014, Alexandre DERUMIER 
 wrote:

> Hi Christian,
>
> Do you have tried without raid6, to have more osd ?
> (how many disks do you have begin the raid6 ?)
>
>
> Aslo, I known that direct ios can be quite slow with ceph,
>
> maybe can you try without --direct=1
>
> and also enable rbd_cache
>
> ceph.conf
> [client]
> rbd cache = true
>
>
>
>

[ceph-users] client: centos6.4 no rbd.ko

2014-05-14 Thread maoqi1982
Hi list 
our ceph(0.72) cluster use ubuntu12.04  is ok . client server run openstack 
install "CentOS6.4 final", the kernel is up to 
kernel-2.6.32-358.123.2.openstack.el6.x86_64.
the question is the kernel does not support the rbd.ko ceph.ko. can anyone  
help me to add the rbd.ko ceph.ko in 
kernel-2.6.32-358.123.2.openstack.el6.x86_64 or other way except up kernel


thanks.___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client: centos6.4 no rbd.ko

2014-05-14 Thread Andrija Panic
Try 3.x from elrepo repo...works for me, cloudstack/ceph...

Sent from Google Nexus 4
On May 14, 2014 11:56 AM, "maoqi1982"  wrote:

> Hi list
> our ceph(0.72) cluster use ubuntu12.04  is ok . client server run
> openstack install "CentOS6.4 final", the kernel is up to
> kernel-2.6.32-358.123.2.openstack.el6.x86_64.
> the question is the kernel does not support the rbd.ko ceph.ko. can anyone
>  help me to add the rbd.ko ceph.ko in 
> kernel-2.6.32-358.123.2.openstack.el6.x86_64
> or other way except up kernel
>
> thanks.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] sparse copy between pools

2014-05-14 Thread Erwin Lubbers
Hi,

I'm trying to copy a sparse provisioned rbd image from pool A to pool B (both 
are replicated three times). The image has a disksize of 8 GB and contains 
around 1.4 GB of data. I do use:

rbd cp PoolA/Image PoolB/Image

After copying "ceph -s" tells me that 24 GB diskspace extra is in use. Then I 
delete the original pool A image and only 8 GB of space is freed.

Does Ceph not sparse copy the image using cp? Is there another way to do so?

I'm using 0.67.7 dumpling on this cluster.

Regards,
Erwin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client: centos6.4 no rbd.ko

2014-05-14 Thread Cristian Falcas
Why don't you want to update to one of the elrepo kernels? If you
already went to the openstack kernel, you are using an unsupported
kernel.

I don't think anybody from redhat bothered to backport the ceph client
code to a 2.6.32 kernel.

Cristian Falcas

On Wed, May 14, 2014 at 12:56 PM, maoqi1982  wrote:
> Hi list
> our ceph(0.72) cluster use ubuntu12.04  is ok . client server run openstack
> install "CentOS6.4 final", the kernel is up to
> kernel-2.6.32-358.123.2.openstack.el6.x86_64.
> the question is the kernel does not support the rbd.ko ceph.ko. can anyone
> help me to add the rbd.ko ceph.ko in
> kernel-2.6.32-358.123.2.openstack.el6.x86_64 or other way except up kernel
>
> thanks.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitoring ceph statistics using rados python module

2014-05-14 Thread Adrian Banasiak
Thank you, that should do the trick.


2014-05-14 6:41 GMT+02:00 Kai Zhang :

> Hi Adrian,
>
> You may be interested in "rados -p poo_name df --format json", although
> it's pool oriented, you could probably add the values together :)
>
> Regards,
> Kai
>
> 在 2014-05-13 08:33:11,"Adrian Banasiak"  写道:
>
> Thanks for sugestion with admin daemon but it looks like single osd
> oriented. I have used perf dump on mon socket and it output some
> interesting data in case of monitoring whole cluster:
> { "cluster": { "num_mon": 4,
>   "num_mon_quorum": 4,
>   "num_osd": 29,
>   "num_osd_up": 29,
>   "num_osd_in": 29,
>   "osd_epoch": 1872,
>   "osd_kb": 20218112516,
>   "osd_kb_used": 5022202696,
>   "osd_kb_avail": 15195909820,
>   "num_pool": 4,
>   "num_pg": 3500,
>   "num_pg_active_clean": 3500,
>   "num_pg_active": 3500,
>   "num_pg_peering": 0,
>   "num_object": 400746,
>   "num_object_degraded": 0,
>   "num_object_unfound": 0,
>   "num_bytes": 1678788329609,
>   "num_mds_up": 0,
>   "num_mds_in": 0,
>   "num_mds_failed": 0,
>   "mds_epoch": 1},
>
> Unfortunately cluster wide IO statistics are still missing.
>
>
> 2014-05-13 17:17 GMT+02:00 Haomai Wang :
>
>> Not sure your demand.
>>
>> I use "ceph --admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump" to
>> get the monitor infos. And the result can be parsed by simplejson
>> easily via python.
>>
>> On Tue, May 13, 2014 at 10:56 PM, Adrian Banasiak 
>> wrote:
>> > Hi, i am working with test Ceph cluster and now I want to implement
>> Zabbix
>> > monitoring with items such as:
>> >
>> > - whoe cluster IO (for example ceph -s -> recovery io 143 MB/s, 35
>> > objects/s)
>> > - pg statistics
>> >
>> > I would like to create single script in python to retrive values using
>> rados
>> > python module, but there are only few informations in documentation
>> about
>> > module usage. I've created single function which calculates all pools
>> > current read/write statistics but i cant find out how to add recovery IO
>> > usage and pg statistics:
>> >
>> > read = 0
>> > write = 0
>> > for pool in conn.list_pools():
>> > io = conn.open_ioctx(pool)
>> > stats[pool] = io.get_stats()
>> > read+=int(stats[pool]['num_rd'])
>> > write+=int(stats[pool]['num_wr'])
>> >
>> > Could someone share his knowledge about rados module for retriving ceph
>> > statistics?
>> >
>> > BTW Ceph is awesome!
>> >
>> > --
>> > Best regards, Adrian Banasiak
>> > email: adr...@banasiak.it
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>>
>
>
>
> --
> Pozdrawiam, Adrian Banasiak
> email: adr...@banasiak.it
>
>


-- 
Pozdrawiam, Adrian Banasiak
email: adr...@banasiak.it
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados GW Method not allowed

2014-05-14 Thread Georg Höllrigl

Hello Everyone,

The important thing here is, to include the rgw_dns_name in ceph.conf 
and to restart radosgw. Also you need the DNS configured to point to 
your radosgw + a wildcard subdomain.
For example s3cmd handles the access this way, and you'll see the 
"Method Not Allowed" message if you miss anything!



Kind Regards,
Georg

On 13.05.2014 14:30, Georg Höllrigl wrote:

Hello,

System Ubuntu 14.04
Ceph 0.80

I'm getting either a 405 Method Not Allowed or a 403 Permission Denied
from Radosgw.


Here is what I get from radosgw:

HTTP/1.1 405 Method Not Allowed
Date: Tue, 13 May 2014 12:21:43 GMT
Server: Apache
Accept-Ranges: bytes
Content-Length: 82
Content-Type: application/xml

MethodNotAllowed

I can see that the user exists using:
"radosgw-admin --name client.radosgw.ceph-m-01 metadata list user"

I can get the credentials via:

#radosgw-admin user info --uid=test
{ "user_id": "test",
   "display_name": "test",
   "email": "",
   "suspended": 0,
   "max_buckets": 1000,
   "auid": 0,
   "subusers": [],
   "keys": [
 { "user": "test",
   "access_key": "95L2C7BFQ8492LVZ271N",
   "secret_key": "f2tqIet+LrD0kAXYAUrZXydL+1nsO6Gs+we+94U5"}],
   "swift_keys": [],
   "caps": [],
   "op_mask": "read, write, delete",
   "default_placement": "",
   "placement_tags": [],
   "bucket_quota": { "enabled": false,
   "max_size_kb": -1,
   "max_objects": -1},
   "user_quota": { "enabled": false,
   "max_size_kb": -1,
   "max_objects": -1},
   "temp_url_keys": []}

I've also found some hints about a broken redirect in apache - but not
really a working version.

Any hints? Any thoughts about how to solve that? Where to get more
detailed logs, why it's not supporting to create a bucket?


KInd Regards,
Georg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Plugin for Collectd

2014-05-14 Thread Christian Eichelmann
Hi Ceph User!

I had a look at the "official" collectd fork for ceph, which is quite
outdated and not compatible with the upstream version.

Since this was not an option for us, I've worte a Python Plugin for
Collectd, that gets all the precious informations out of the admin
sockets "perf dump" command. It runs on our productive cluster right now
and I'd like to share it with you:

https://github.com/Crapworks/collectd-ceph

Any feedback is welcome!

Regards,
Christian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-14 Thread Christian Balzer

Hello!

On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:

> Hi Christian,
> 
> I missed this thread, haven't been reading the list that well the last
> weeks.
> 
> You already know my setup, since we discussed it in an earlier thread. I
> don't have a fast backing store, but I see the slow IOPS when doing
> randwrite inside the VM, with rbd cache. Still running dumpling here
> though.
> 
Nods, I do recall that thread.

> A thought struck me that I could test with a pool that consists of OSDs
> that have tempfs-based disks, think I have a bit more latency than your
> IPoIB but I've pushed 100k IOPS with the same network devices before.
> This would verify if the problem is with the journal disks. I'll also
> try to run the journal devices in tempfs as well, as it would test
> purely Ceph itself.
>
That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore ataround 5%) I'd expect Ceph to be the culprit. 
 
> I'll get back to you with the results, hopefully I'll manage to get them
> done during this night.
>
Looking forward to that. ^^


Christian 
> Cheers,
> Josef
> 
> On 13/05/14 11:03, Christian Balzer wrote:
> > I'm clearly talking to myself, but whatever.
> >
> > For Greg, I've played with all the pertinent journal and filestore
> > options and TCP nodelay, no changes at all.
> >
> > Is there anybody on this ML who's running a Ceph cluster with a fast
> > network and FAST filestore, so like me with a big HW cache in front of
> > a RAID/JBODs or using SSDs for final storage?
> >
> > If so, what results do you get out of the fio statement below per OSD?
> > In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
> > which is of course vastly faster than the normal indvidual HDDs could
> > do.
> >
> > So I'm wondering if I'm hitting some inherent limitation of how fast a
> > single OSD (as in the software) can handle IOPS, given that everything
> > else has been ruled out from where I stand.
> >
> > This would also explain why none of the option changes or the use of
> > RBD caching has any measurable effect in the test case below. 
> > As in, a slow OSD aka single HDD with journal on the same disk would
> > clearly benefit from even the small 32MB standard RBD cache, while in
> > my test case the only time the caching becomes noticeable is if I
> > increase the cache size to something larger than the test data size.
> > ^o^
> >
> > On the other hand if people here regularly get thousands or tens of
> > thousands IOPS per OSD with the appropriate HW I'm stumped. 
> >
> > Christian
> >
> > On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
> >
> >> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
> >>
> >>> Oh, I didn't notice that. I bet you aren't getting the expected
> >>> throughput on the RAID array with OSD access patterns, and that's
> >>> applying back pressure on the journal.
> >>>
> >> In the a "picture" being worth a thousand words tradition, I give you
> >> this iostat -x output taken during a fio run:
> >>
> >> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
> >>   50.820.00   19.430.170.00   29.58
> >>
> >> Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
> >> avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
> >> sda   0.0051.500.00 1633.50 0.00  7460.00
> >> 9.13 0.180.110.000.11   0.01   1.40 sdb
> >> 0.00 0.000.00 1240.50 0.00  5244.00 8.45 0.30
> >> 0.250.000.25   0.02   2.00 sdc   0.00 5.00
> >> 0.00 2468.50 0.00 13419.0010.87 0.240.100.00
> >> 0.10   0.09  22.00 sdd   0.00 6.500.00 1913.00
> >> 0.00 10313.0010.78 0.200.100.000.10   0.09  16.60
> >>
> >> The %user CPU utilization is pretty much entirely the 2 OSD processes,
> >> note the nearly complete absence of iowait.
> >>
> >> sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
> >> Look at these numbers, the lack of queues, the low wait and service
> >> times (this is in ms) plus overall utilization.
> >>
> >> The only conclusion I can draw from these numbers and the network
> >> results below is that the latency happens within the OSD processes.
> >>
> >> Regards,
> >>
> >> Christian
> >>> When I suggested other tests, I meant with and without Ceph. One
> >>> particular one is OSD bench. That should be interesting to try at a
> >>> variety of block sizes. You could also try runnin RADOS bench and
> >>> smalliobench at a few different sizes.
> >>> -Greg
> >>>
> >>> On Wednesday, May 7, 2014, Alexandre DERUMIER 
> >>> wrote:
> >>>
>  Hi Christian,
> 
>  Do you have tried without raid6, to have more osd ?
>  (how many disks do you have begin the raid6 ?)
> 
> 
>  Aslo, I known that direct ios can be quite slow with ceph,
> 
>  maybe can you try without --direct=1
> 
>  and also enable rbd_cach

Re: [ceph-users] Slow IOPS on RBD compared to journal and backingdevices

2014-05-14 Thread German Anders
Someone could get a performance throughput on RBD of 600MB/s or more 
on (rw) with a block size of 32768k?




German Anders
Field Storage Support Engineer
Despegar.com - IT Team











--- Original message ---
Asunto: Re: [ceph-users] Slow IOPS on RBD compared to journal and 
backingdevices

De: Christian Balzer 
Para: Josef Johansson 
Cc: 
Fecha: Wednesday, 14/05/2014 09:33


Hello!

On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:



Hi Christian,

I missed this thread, haven't been reading the list that well the last
weeks.

You already know my setup, since we discussed it in an earlier thread. 
I

don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here
though.


Nods, I do recall that thread.



A thought struck me that I could test with a pool that consists of 
OSDs
that have tempfs-based disks, think I have a bit more latency than 
your

IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.


That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the 
actual

filestore ataround 5%) I'd expect Ceph to be the culprit.



I'll get back to you with the results, hopefully I'll manage to get 
them

done during this night.


Looking forward to that. ^^


Christian


Cheers,
Josef

On 13/05/14 11:03, Christian Balzer wrote:


I'm clearly talking to myself, but whatever.

For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.

Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of
a RAID/JBODs or using SSDs for final storage?

If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal indvidual HDDs could
do.

So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.

This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^

On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.

Christian

On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:



On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:



Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.


In the a "picture" being worth a thousand words tradition, I give you
this iostat -x output taken during a fio run:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
 50.820.00   19.430.170.00   29.58

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda   0.0051.500.00 1633.50 0.00  7460.00
9.13 0.180.110.000.11   0.01   1.40 sdb
0.00 0.000.00 1240.50 0.00  5244.00 8.45 0.30
0.250.000.25   0.02   2.00 sdc   0.00 5.00
0.00 2468.50 0.00 13419.0010.87 0.240.100.00
0.10   0.09  22.00 sdd   0.00 6.500.00 1913.00
0.00 10313.0010.78 0.200.100.000.10   0.09  16.60

The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.

sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.

The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.

Regards,

Christian


When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg

On Wednesday, May 7, 2014, Alexandre DERUMIER 
wrote:



Hi Christian,

Do you have tried without raid6, to have more osd ?
(how many disks do you have begin the raid6 ?)


Aslo, I known that direct ios can be quite slow with ceph,

maybe can you try without --direct=1

and also enable rbd_cache

ceph.conf
[client]
rbd cache = true




Re: [ceph-users] Slow IOPS on RBD compared to journal andbackingdevices

2014-05-14 Thread German Anders

I forgot to mention, of course on a 10GbE network



German Anders
Field Storage Support Engineer
Despegar.com - IT Team











--- Original message ---
Asunto: Re: [ceph-users] Slow IOPS on RBD compared to journal 
andbackingdevices

De: German Anders 
Para: Christian Balzer 
Cc: 
Fecha: Wednesday, 14/05/2014 09:41


Someone could get a performance throughput on RBD of 600MB/s or more 
on (rw) with a block size of 32768k?




German Anders
Field Storage Support Engineer
Despegar.com - IT Team











--- Original message ---
Asunto: Re: [ceph-users] Slow IOPS on RBD compared to journal and 
backingdevices

De: Christian Balzer 
Para: Josef Johansson 
Cc: 
Fecha: Wednesday, 14/05/2014 09:33


Hello!

On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:



Hi Christian,

I missed this thread, haven't been reading the list that well the last
weeks.

You already know my setup, since we discussed it in an earlier thread. 
I

don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here
though.


Nods, I do recall that thread.



A thought struck me that I could test with a pool that consists of 
OSDs
that have tempfs-based disks, think I have a bit more latency than 
your

IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.


That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the 
actual

filestore ataround 5%) I'd expect Ceph to be the culprit.



I'll get back to you with the results, hopefully I'll manage to get 
them

done during this night.


Looking forward to that. ^^


Christian


Cheers,
Josef

On 13/05/14 11:03, Christian Balzer wrote:


I'm clearly talking to myself, but whatever.

For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.

Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of
a RAID/JBODs or using SSDs for final storage?

If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal indvidual HDDs could
do.

So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.

This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^

On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.

Christian

On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:



On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:



Oh, I didn't notice that. I bet you aren't getting the expected
throughput on the RAID array with OSD access patterns, and that's
applying back pressure on the journal.


In the a "picture" being worth a thousand words tradition, I give you
this iostat -x output taken during a fio run:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
 50.820.00   19.430.170.00   29.58

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda   0.0051.500.00 1633.50 0.00  7460.00
9.13 0.180.110.000.11   0.01   1.40 sdb
0.00 0.000.00 1240.50 0.00  5244.00 8.45 0.30
0.250.000.25   0.02   2.00 sdc   0.00 5.00
0.00 2468.50 0.00 13419.0010.87 0.240.100.00
0.10   0.09  22.00 sdd   0.00 6.500.00 1913.00
0.00 10313.0010.78 0.200.100.000.10   0.09  16.60

The %user CPU utilization is pretty much entirely the 2 OSD processes,
note the nearly complete absence of iowait.

sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
Look at these numbers, the lack of queues, the low wait and service
times (this is in ms) plus overall utilization.

The only conclusion I can draw from these numbers and the network
results below is that the latency happens within the OSD processes.

Regards,

Christian


When I suggested other tests, I meant with and without Ceph. One
particular one is OSD bench. That should be interesting to try at a
variety of block sizes. You could also try runnin RADOS bench and
smalliobench at a few different sizes.
-Greg

On Wednesday

Re: [ceph-users] Slow IOPS on RBD compared to journal andbackingdevices

2014-05-14 Thread Josef Johansson
Hi,

On 14/05/14 14:45, German Anders wrote:
> I forgot to mention, of course on a 10GbE network
>  
>  
>
> *German Anders*
> /Field Storage Support Engineer/**
>
> Despegar.com - IT Team
>
>
>
>
>
>
>
>
>  
>> --- Original message ---
>> *Asunto:* Re: [ceph-users] Slow IOPS on RBD compared to journal
>> andbackingdevices
>> *De:* German Anders 
>> *Para:* Christian Balzer 
>> *Cc:* 
>> *Fecha:* Wednesday, 14/05/2014 09:41
>>
>> Someone could get a performance throughput on RBD of 600MB/s or more
>> on (rw) with a block size of 32768k?
>>  
Is that 32M then?
Sequential or randwrite?

I get about those speeds when doing (1M block size) buffered writes from
within a VM on 20GbE. The cluster max out at about 900MB/s.

Cheers,
Josef
>>  
>>
>> *German Anders*
>> /Field Storage Support Engineer/**
>>
>> Despegar.com - IT Team
>>
>>
>>
>>
>>
>>
>>
>>
>>  
>>
>> --- Original message ---
>> *Asunto:* Re: [ceph-users] Slow IOPS on RBD compared to journal
>> and backingdevices
>> *De:* Christian Balzer 
>> *Para:* Josef Johansson 
>> *Cc:* 
>> *Fecha:* Wednesday, 14/05/2014 09:33
>>
>>
>> Hello!
>>
>> On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>
>> Hi Christian,
>>
>> I missed this thread, haven't been reading the list that well
>> the last
>> weeks.
>>
>> You already know my setup, since we discussed it in an
>> earlier thread. I
>> don't have a fast backing store, but I see the slow IOPS when
>> doing
>> randwrite inside the VM, with rbd cache. Still running
>> dumpling here
>> though.
>>
>> Nods, I do recall that thread.
>>
>> A thought struck me that I could test with a pool that
>> consists of OSDs
>> that have tempfs-based disks, think I have a bit more latency
>> than your
>> IPoIB but I've pushed 100k IOPS with the same network devices
>> before.
>> This would verify if the problem is with the journal disks.
>> I'll also
>> try to run the journal devices in tempfs as well, as it would
>> test
>> purely Ceph itself.
>>
>> That would be interesting indeed.
>> Given what I've seen (with the journal at 20% utilization and the
>> actual
>> filestore ataround 5%) I'd expect Ceph to be the culprit.
>>
>> I'll get back to you with the results, hopefully I'll manage
>> to get them
>> done during this night.
>>
>> Looking forward to that. ^^
>>
>>
>> Christian
>>
>> Cheers,
>> Josef
>>
>> On 13/05/14 11:03, Christian Balzer wrote:
>>
>> I'm clearly talking to myself, but whatever.
>>
>> For Greg, I've played with all the pertinent journal and
>> filestore
>> options and TCP nodelay, no changes at all.
>>
>> Is there anybody on this ML who's running a Ceph cluster
>> with a fast
>> network and FAST filestore, so like me with a big HW
>> cache in front of
>> a RAID/JBODs or using SSDs for final storage?
>>
>> If so, what results do you get out of the fio statement
>> below per OSD?
>> In my case with 4 OSDs and 3200 IOPS that's about 800
>> IOPS per OSD,
>> which is of course vastly faster than the normal
>> indvidual HDDs could
>> do.
>>
>> So I'm wondering if I'm hitting some inherent limitation
>> of how fast a
>> single OSD (as in the software) can handle IOPS, given
>> that everything
>> else has been ruled out from where I stand.
>>
>> This would also explain why none of the option changes or
>> the use of
>> RBD caching has any measurable effect in the test case
>> below.
>> As in, a slow OSD aka single HDD with journal on the same
>> disk would
>> clearly benefit from even the small 32MB standard RBD
>> cache, while in
>> my test case the only time the caching becomes noticeable
>> is if I
>> increase the cache size to something larger than the test
>> data size.
>> ^o^
>>
>> On the other hand if people here regularly get thousands
>> or tens of
>> thousands IOPS per OSD with the appropriate HW I'm stumped.
>>
>> Christian
>>
>> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:
>>
>> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
>>
>> Oh, I didn't notice that. I bet you aren't
>> getting the expected
>> throughput on the RAID array with OSD access
>> patterns, and that's
>> applying back pressure on the journal.
>>
>> 

[ceph-users] Move osd disks between hosts

2014-05-14 Thread Dinu Vlad

I'm running a ceph cluster with 3 mon and 4 osd nodes (32 disks total) and I've 
been looking at the possibility to "migrate" the data to 2 new nodes. The 
operation should happen by relocating the disks - I'm not getting any new 
hard-drives. The cluster is used as a backend for an openstack cloud, so 
downtime should be as short as possible - preferably not more than 24 h during 
the week-end.

I'd like a second opinion on the process - since I do not have the resources to 
test the move scenario. I'm running emperor (0.72.1) at the moment. All pools 
in the cluster have size 2. Each existing OSD nodes have each an SSD for 
journals; /dev/disk/by-id paths were used. 

Here's what I think would work:
1 - stop ceph on the existing OSD nodes (all of them) and shutdown the node 1 & 
2;
2 - take drives 1-16/ssds 1-2 out and put them in the new node #1; start it up 
with ceph's upstart script set on manual and check/correct journal paths 
3 - edit the CRUSH map on the monitors to reflect the new situation
4 - start ceph on the new node #1 and old nodes 3 & 4; wait for the rebuild to 
happen
5 - repeat steps 1-4 for the rest of the nodes/drives;

Any opinions? Or a better path to follow? 

Thanks!






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Move osd disks between hosts

2014-05-14 Thread Sage Weil
Hi Dinu,

On Wed, 14 May 2014, Dinu Vlad wrote:
> 
> I'm running a ceph cluster with 3 mon and 4 osd nodes (32 disks total) and 
> I've been looking at the possibility to "migrate" the data to 2 new nodes. 
> The operation should happen by relocating the disks - I'm not getting any new 
> hard-drives. The cluster is used as a backend for an openstack cloud, so 
> downtime should be as short as possible - preferably not more than 24 h 
> during the week-end.
> 
> I'd like a second opinion on the process - since I do not have the resources 
> to test the move scenario. I'm running emperor (0.72.1) at the moment. All 
> pools in the cluster have size 2. Each existing OSD nodes have each an SSD 
> for journals; /dev/disk/by-id paths were used. 
> 
> Here's what I think would work:
> 1 - stop ceph on the existing OSD nodes (all of them) and shutdown the node 1 
> & 2;
> 2 - take drives 1-16/ssds 1-2 out and put them in the new node #1; start it 
> up with ceph's upstart script set on manual and check/correct journal paths 
> 3 - edit the CRUSH map on the monitors to reflect the new situation
> 4 - start ceph on the new node #1 and old nodes 3 & 4; wait for the rebuild 
> to happen
> 5 - repeat steps 1-4 for the rest of the nodes/drives;

If you used ceph-deploy and/or ceph-disk to set up these OSDs (that is, if 
they are stored on labeled GPT partitions such that upstart is 
automagically starting up the ceph-osd daemons for you without you putting 
anythign in /etc/fstab to manually mount the volumes) then all of this 
should be plug and play for you--including step #3.  By default, the 
startup process will 'fix' the CRUSH hierarchy position based on the 
hostname and (if present) other positional data configured for 'crush 
location' in ceph.conf.  The only real requirement is that both the osd 
data and journal volumes get moved so that the daemon has everything it 
needs to start up.

sage


> 
> Any opinions? Or a better path to follow? 
> 
> Thanks!
> 
> 
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journalandbackingdevices

2014-05-14 Thread German Anders

Hi Josef,
Thanks a lot for the quick answer.

yes 32M and rand writes

and also, do you get those values i guess with a MTU of 9000 or with 
the traditional and beloved MTU 1500?




German Anders
Field Storage Support Engineer
Despegar.com - IT Team











--- Original message ---
Asunto: Re: [ceph-users] Slow IOPS on RBD compared to 
journalandbackingdevices

De: Josef Johansson 
Para: 
Fecha: Wednesday, 14/05/2014 10:10


Hi,

On 14/05/14 14:45, German Anders wrote:


I forgot to mention, of course on a 10GbE network



German   Anders
Field   Storage Support Engineer
Despegar.com - IT Team











--- Original message ---
Asunto: Re: [ceph-users] Slow IOPS on RBD compared to journal 
andbackingdevices

De: German Anders 
Para: Christian Balzer 
Cc: 
Fecha: Wednesday, 14/05/2014 09:41


Someone could get a performance   throughput on RBD of 
600MB/s or more on (rw) with a block   size of 32768k?


Is that 32M then?

Sequential or randwrite?

I get about those speeds when doing (1M block size) buffered writes
 from within a VM on 20GbE. The cluster max out at about 900MB/s.


Cheers,
Josef







German   Anders
Field   Storage Support Engineer
Despegar.com - IT Team











--- Original message ---
Asunto: Re: [ceph-users] Slow IOPS on RBD compared to 
journal and backingdevices

De: Christian Balzer 
Para: Josef Johansson 
Cc: 
Fecha: Wednesday, 14/05/2014 09:33


Hello!

On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:



Hi Christian,

I missed this thread, haven't been reading the list that   
well the last

weeks.

You already know my setup, since we discussed it in an   
earlier thread. I
don't have a fast backing store, but I see the slow IOPS   
when doing
randwrite inside the VM, with rbd cache. Still running   
dumpling here

though.

 Nods, I do recall that thread.



A thought struck me that I could test with a   pool that 
consists of OSDs
that have tempfs-based disks, think I have a bit more   
latency than your
IPoIB but I've pushed 100k IOPS with the same network   
devices before.
This would verify if the problem is with the journal   
disks. I'll also
try to run the journal devices in tempfs as well, as it   
would test

purely Ceph itself.

 That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization 
and the actual

filestore ataround 5%) I'd expect Ceph to be the culprit.


I'll get back to you with the results,   hopefully I'll 
manage to get them

done during this night.

 Looking forward to that. ^^



Christian


Cheers,
Josef

On 13/05/14 11:03, Christian Balzer wrote:


I'm clearly talking to myself, but whatever.

For Greg, I've played with all the pertinent journal and   
  filestore

options and TCP nodelay, no changes at all.

Is there anybody on this ML who's running a Ceph cluster   
  with a fast
network and FAST filestore, so like me with a big HW 
cache in front of

a RAID/JBODs or using SSDs for final storage?

If so, what results do you get out of the fio statement
 below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 
IOPS per OSD,
which is of course vastly faster than the normal 
indvidual HDDs could

do.

So I'm wondering if I'm hitting some inherent limitation   
  of how fast a
single OSD (as in the software) can handle IOPS, given 
that everything

else has been ruled out from where I stand.

This would also explain why none of the option changes 
or the use of
RBD caching has any measurable effect in the test case 
below.
As in, a slow OSD aka single HDD with journal on the 
same disk would
clearly benefit from even the small 32MB standard RBD 
cache, while in
my test case the only time the caching becomes 
noticeable is if I
increase the cache size to something larger than the 
test data size.

^o^

On the other hand if people here regularly get thousands   
  or tens of
thousands IOPS per OSD with the appropriate HW I'm 
stumped.


Christian

On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer 
wrote:



On Wed, 7 May 2014 22:13:53 -0700 Gregory   Farnum 
wrote:



Oh, I didn't notice that. I bet you aren't getting 
the expected
throughput on the RAID array with OSD access 
patterns, and that's

applying back pressure on the journal.

 In the a "picture" being worth a thousand words   
tradition, I give you

this iostat -x output taken during a fio run:

avg-cpu: %user %nice %system %iowait %steal %idle
   

Re: [ceph-users] Bulk storage use case

2014-05-14 Thread Cedric Lemarchand
Hi Dan,

Le 13/05/2014 13:42, Dan van der Ster a écrit :
> Hi,
> I think you're not getting many replies simply because those are
> rather large servers and not many have such hardware in prod.
Good point.
> We run with 24x3TB drives, 64GB ram, one 10Gbit NIC. Memory-wise there
> are no problems. Throughput-wise, the bottleneck is somewhere between
> the NIC (~1GB/s) and the HBA / SAS backplane (~1.6GB/s). Since writes
> coming in over the network are multiplied by at least 2 times to the
> disks, in our case the HBA is the bottleneck (so we have a practical
> limit of ~800-900MBps).
Your hardware is pretty close from what I am looking for, thanks for info.
> The other factor which which makes it hard to judge your plan is how
> the erasure coding will perform, especially given only a 2Gig network
> between servers. I would guess there is very little prod experience
> with the EC code as of today -- and probably zero with boxes similar
> to what you propose. But my gut tells me that with your proposed
> stripe width of 12/3, combined with the slow network, getting good
> performance might be a challenge.
It would love to hear some advices / recommendation about EC from
Inktank's people ;-)
> I would suggest you start some smaller scale tests to get a feeling
> for the performance before committing to a large purchase of this
> hardware type.
Indeed, without some solid pointers, this is the only way left.

Cheers
> Cheers, Dan
>
> Cédric Lemarchand wrote:
>> Thanks for your answers Craig, it seems this is a niche use case for
>> Ceph, not a lot of replies from the ML.
>>
>> Cheers
>>
>> -- 
>> Cédric Lemarchand
>>
>> Le 11 mai 2014 à 00:35, Craig Lewis > > a écrit :
>>
>>> On 5/10/14 12:43 , Cédric Lemarchand wrote:
 Hi Craig,

 Thanks, I really appreciate the well detailed response.

 I carefully note your advices, specifically about the CPU starvation
 scenario, which as you said sounds scary.

 About IO, datas will be very resilient, in case of crash, loosing not
 fully written objects will not be a problem (they will be re uploaded
 later), so I think in this specific case, disabling journaling could
 be a way to improve IO.
 How Ceph will handle that, are there caveats other than just loosing
 objects that was in the data path when the crash occurs ? I know it
 could sounds weird, but clients workflow could support such thing.

 Thanks !

 -- 
 Cédric Lemarchand

 Le 10 mai 2014 à 04:30, Craig Lewis >>> > a écrit :
>>>
>>> Disabling the journal does make sense in some cases, like all the data
>>> is a backup copy.
>>>
>>> I don't know anything about how Ceph behaves in that setup. Maybe
>>> somebody else can chime in?
>>>
>>> -- 
>>>
>>> *Craig Lewis*
>>> Senior Systems Engineer
>>> Office +1.714.602.1309
>>> Email cle...@centraldesktop.com 
>>>
>>> *Central Desktop. Work together in ways you never thought possible.*
>>> Connect with us Website  | Twitter
>>>  | Facebook
>>>  | LinkedIn
>>>  | Blog
>>> 
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com 
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 
Cédric

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Plugin for Collectd

2014-05-14 Thread Mark Nelson

On 05/14/2014 07:24 AM, Christian Eichelmann wrote:

Hi Ceph User!

I had a look at the "official" collectd fork for ceph, which is quite
outdated and not compatible with the upstream version.

Since this was not an option for us, I've worte a Python Plugin for
Collectd, that gets all the precious informations out of the admin
sockets "perf dump" command. It runs on our productive cluster right now
and I'd like to share it with you:

https://github.com/Crapworks/collectd-ceph

Any feedback is welcome!


Nice!  I have to admit that I don't use collectd much, but it's nice to 
see the plugin updated!  Only feedback is, carry on! :D


Mark



Regards,
Christian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Move osd disks between hosts

2014-05-14 Thread Dinu Vlad
Hello Sage,

Yes, original deployment was done via ceph-deploy - and I am very happy to read 
this :)

Thank you!
Dinu


On May 14, 2014, at 4:17 PM, Sage Weil  wrote:

> Hi Dinu,
> 
> On Wed, 14 May 2014, Dinu Vlad wrote:
>> 
>> I'm running a ceph cluster with 3 mon and 4 osd nodes (32 disks total) and 
>> I've been looking at the possibility to "migrate" the data to 2 new nodes. 
>> The operation should happen by relocating the disks - I'm not getting any 
>> new hard-drives. The cluster is used as a backend for an openstack cloud, so 
>> downtime should be as short as possible - preferably not more than 24 h 
>> during the week-end.
>> 
>> I'd like a second opinion on the process - since I do not have the resources 
>> to test the move scenario. I'm running emperor (0.72.1) at the moment. All 
>> pools in the cluster have size 2. Each existing OSD nodes have each an SSD 
>> for journals; /dev/disk/by-id paths were used. 
>> 
>> Here's what I think would work:
>> 1 - stop ceph on the existing OSD nodes (all of them) and shutdown the node 
>> 1 & 2;
>> 2 - take drives 1-16/ssds 1-2 out and put them in the new node #1; start it 
>> up with ceph's upstart script set on manual and check/correct journal paths 
>> 3 - edit the CRUSH map on the monitors to reflect the new situation
>> 4 - start ceph on the new node #1 and old nodes 3 & 4; wait for the rebuild 
>> to happen
>> 5 - repeat steps 1-4 for the rest of the nodes/drives;
> 
> If you used ceph-deploy and/or ceph-disk to set up these OSDs (that is, if 
> they are stored on labeled GPT partitions such that upstart is 
> automagically starting up the ceph-osd daemons for you without you putting 
> anythign in /etc/fstab to manually mount the volumes) then all of this 
> should be plug and play for you--including step #3.  By default, the 
> startup process will 'fix' the CRUSH hierarchy position based on the 
> hostname and (if present) other positional data configured for 'crush 
> location' in ceph.conf.  The only real requirement is that both the osd 
> data and journal volumes get moved so that the daemon has everything it 
> needs to start up.
> 
> sage
> 
> 
>> 
>> Any opinions? Or a better path to follow? 
>> 
>> Thanks!
>> 
>> 
>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journalandbackingdevices

2014-05-14 Thread Josef Johansson
Hi,

Yeah, running with MTU 9000 here, but the test was with sequential.

Just ran rbd -p shared-1 bench-write test --io-size $((32*1024*1024))
--io-pattern rand

The cluster itself showed 700MB/s write (3x replicas), but the test just
45MB/s. But I think rbd is a little bit broken ;)

Cheers,
Josef

On 14/05/14 15:23, German Anders wrote:
> Hi Josef, 
> Thanks a lot for the quick answer.
>
> yes 32M and rand writes
>
> and also, do you get those values i guess with a MTU of 9000 or with
> the traditional and beloved MTU 1500?
>
>  
>
> *German Anders*
> /Field Storage Support Engineer/**
>
> Despegar.com - IT Team
>
>
>
>
>
>
>
>
>  
>> --- Original message ---
>> *Asunto:* Re: [ceph-users] Slow IOPS on RBD compared to
>> journalandbackingdevices
>> *De:* Josef Johansson 
>> *Para:* 
>> *Fecha:* Wednesday, 14/05/2014 10:10
>>
>> Hi,
>>
>> On 14/05/14 14:45, German Anders wrote:
>>
>> I forgot to mention, of course on a 10GbE network
>>  
>>  
>>
>> *German Anders*
>> /Field Storage Support Engineer/**
>>
>> Despegar.com - IT Team
>>
>>
>>
>>
>>
>>
>>
>>
>>  
>>
>> --- Original message ---
>> *Asunto:* Re: [ceph-users] Slow IOPS on RBD compared to
>> journal andbackingdevices
>> *De:* German Anders 
>> *Para:* Christian Balzer 
>> *Cc:* 
>> *Fecha:* Wednesday, 14/05/2014 09:41
>>
>> Someone could get a performance throughput on RBD of 600MB/s
>> or more on (rw) with a block size of 32768k?
>>  
>>
>> Is that 32M then?
>> Sequential or randwrite?
>>
>> I get about those speeds when doing (1M block size) buffered writes
>> from within a VM on 20GbE. The cluster max out at about 900MB/s.
>>
>> Cheers,
>> Josef
>>
>>  
>>
>> *German Anders*
>> /Field Storage Support Engineer/**
>>
>> Despegar.com - IT Team
>>
>>
>>
>>
>>
>>
>>
>>
>>  
>>
>> --- Original message ---
>> *Asunto:* Re: [ceph-users] Slow IOPS on RBD compared to
>> journal and backingdevices
>> *De:* Christian Balzer 
>> *Para:* Josef Johansson 
>> *Cc:* 
>> *Fecha:* Wednesday, 14/05/2014 09:33
>>
>>
>> Hello!
>>
>> On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:
>>
>> Hi Christian,
>>
>> I missed this thread, haven't been reading the list
>> that well the last
>> weeks.
>>
>> You already know my setup, since we discussed it in
>> an earlier thread. I
>> don't have a fast backing store, but I see the slow
>> IOPS when doing
>> randwrite inside the VM, with rbd cache. Still
>> running dumpling here
>> though.
>>
>> Nods, I do recall that thread.
>>
>> A thought struck me that I could test with a pool
>> that consists of OSDs
>> that have tempfs-based disks, think I have a bit more
>> latency than your
>> IPoIB but I've pushed 100k IOPS with the same network
>> devices before.
>> This would verify if the problem is with the journal
>> disks. I'll also
>> try to run the journal devices in tempfs as well, as
>> it would test
>> purely Ceph itself.
>>
>> That would be interesting indeed.
>> Given what I've seen (with the journal at 20% utilization
>> and the actual
>> filestore ataround 5%) I'd expect Ceph to be the culprit.
>>
>> I'll get back to you with the results, hopefully I'll
>> manage to get them
>> done during this night.
>>
>> Looking forward to that. ^^
>>
>>
>> Christian
>>
>> Cheers,
>> Josef
>>
>> On 13/05/14 11:03, Christian Balzer wrote:
>>
>> I'm clearly talking to myself, but whatever.
>>
>> For Greg, I've played with all the pertinent
>> journal and filestore
>> options and TCP nodelay, no changes at all.
>>
>> Is there anybody on this ML who's running a Ceph
>> cluster with a fast
>> network and FAST filestore, so like me with a big
>> HW cache in front of
>> a RAID/JBODs or using SSDs for final storage?
>>
>> If so, what results do you get out of the fio
>> statement below per OSD?
>> In my case with 4 OSDs and 3200 IOPS that's about
>> 800 IOPS per OSD,
>> which is of course vastly faster than the normal
>> indvidual HDDs could
>> 

Re: [ceph-users] crushmap question

2014-05-14 Thread Gregory Farnum
It won't pay any attention to the racks after you change the rule. So
some PGs may have all their OSDs in one rack, and others may be spread
across racks.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, May 13, 2014 at 10:54 PM, Cao, Buddy  wrote:
> BTW, I'd like to know, after I change the "from rack" to "from host", if I 
> add more racks with host/osds in the cluster, will ceph choose the osds for 
> pg only from one zone? or ceph will randomly choose from several different 
> zones?
>
>
> Wei Cao (Buddy)
>
> -Original Message-
> From: Cao, Buddy
> Sent: Wednesday, May 14, 2014 1:30 PM
> To: 'Gregory Farnum'
> Cc: ceph-users@lists.ceph.com
> Subject: RE: [ceph-users] crushmap question
>
> Thanks Gregory so much,it solved the problem!
>
>
> Wei Cao (Buddy)
>
> -Original Message-
> From: Gregory Farnum [mailto:g...@inktank.com]
> Sent: Wednesday, May 14, 2014 2:00 AM
> To: Cao, Buddy
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] crushmap question
>
> You just use a type other than "rack" in your chooseleaf rule. In your case, 
> "host". When using chooseleaf, the bucket type you specify is the failure 
> domain which it must segregate across.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, May 13, 2014 at 12:52 AM, Cao, Buddy  wrote:
>> Hi,
>>
>>
>>
>> I have a crushmap structure likes root->rack->host->osds. I designed
>> the rule below, since I used “chooseleaf…rack” in rule definition, if
>> there is only one rack in the cluster, the ceph gps will always stay
>> at stuck unclean state (that is because the default metadata/data/rbd pool 
>> set 2 replicas).
>> Could you let me know how do I configure the rule to let it can also
>> work in a cluster with only one rack?
>>
>>
>>
>> rule ssd{
>>
>> ruleset 1
>>
>> type replicated
>>
>> min_size 0
>>
>> max_size 10
>>
>> step take root
>>
>> step chooseleaf firstn 0 type rack
>>
>> step emit
>>
>> }
>>
>>
>>
>> BTW, if I add a new rack into the crushmap, the pg status will finally
>> get to active+clean. However, my customer do ONLY have one rack in
>> their env, so hard for me to have workaround to ask him setup several racks.
>>
>>
>>
>> Wei Cao (Buddy)
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Pool without Name

2014-05-14 Thread Georg Höllrigl

Hello List,

I see a pool without a name:

ceph> osd lspools
0 data,1 metadata,2 rbd,3 .rgw.root,4 .rgw.control,5 .rgw,6 .rgw.gc,7 
.users.uid,8 openstack-images,9 openstack-volumes,10 
openstack-backups,11 .users,12 .users.swift,13 .users.email,14 .log,15 
.rgw.buckets,16 .rgw.buckets.index,17 .usage,18 .intent-log,20 ,


I've already deleted one of those (with ID 19) with

rados rmpool "" "" --yes-i-really-really-mean-it

But now it's back with ID 20.

Where do they come from? What kind of data is in there?


Kind Regards,
Georg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Pool without Name

2014-05-14 Thread Wido den Hollander

On 05/14/2014 05:24 PM, Georg Höllrigl wrote:

Hello List,

I see a pool without a name:

ceph> osd lspools
0 data,1 metadata,2 rbd,3 .rgw.root,4 .rgw.control,5 .rgw,6 .rgw.gc,7
.users.uid,8 openstack-images,9 openstack-volumes,10
openstack-backups,11 .users,12 .users.swift,13 .users.email,14 .log,15
.rgw.buckets,16 .rgw.buckets.index,17 .usage,18 .intent-log,20 ,

I've already deleted one of those (with ID 19) with

rados rmpool "" "" --yes-i-really-really-mean-it

But now it's back with ID 20.

Where do they come from? What kind of data is in there?



You are running Dumpling 0.67.X with the RGW? It's something which is 
caused by the RGW.


There is a thread on this list from two weeks ago about this.



Kind Regards,
Georg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph firefly PGs in active+clean+scrubbing state

2014-05-14 Thread Fabrizio G. Ventola
By the way, just to report my experience: I've upgraded another
testing cluster. Both clusters (this one and the one in my previous
mail) are ok now and aren't facing the cyclical "scrubbing" -
"active+clean" state issue. They have automatically reached a steady
"active+clean" status.

Best regards,
Fabrizio

On 13 May 2014 23:12, Michael  wrote:
> Anyone still seeing this issue on 0.80.1 you'll probable need to dump out
> your scrub list "ceph pg dump | grep scrub" then find the OSD listed as the
> acting primary for the PG stuck scrubbing and stop it a bit more
> aggressively. I found that the acting primary for a PG stuck in scrub status
> was completely ignoring standard restart commands which prevented any
> scrubbing from continuing within the cluster even after update.
>
> -Michael
>
>
> On 13/05/2014 17:03, Fabrizio G. Ventola wrote:
>>
>> I've upgraded to 0.80.1 on a testing instance: the cluster gets
>> cyclically active+clean+deep scrubbing for a little while and then
>> reaches active+clean status. I'm not worried about this, I think it's
>> normal, but I didn't have this behaviour on emperor 0.72.2.
>>
>> Cheers,
>> Fabrizio
>>
>> On 13 May 2014 06:08, Alexandre DERUMIER  wrote:
>>>
>>> 0.80.1 update has fixed the problem.
>>>
>>> thanks to ceph team !
>>>
>>> - Mail original -
>>>
>>> De: "Simon Ironside" 
>>> À: ceph-users@lists.ceph.com
>>> Envoyé: Lundi 12 Mai 2014 18:13:32
>>> Objet: Re: [ceph-users] ceph firefly PGs in active+clean+scrubbing state
>>>
>>> Hi,
>>>
>>> I'm sure I saw on the IRC channel yesterday that this is a known problem
>>> with Firefly which is due to be fixed with the release (possibly today?)
>>> of 0.80.1.
>>>
>>> Simon
>>>
>>> On 12/05/14 14:53, Alexandre DERUMIER wrote:

 Hi, I observe the same behaviour on a test ceph cluster (upgrade from
 emperor to firefly)


 cluster 819ea8af-c5e2-4e92-81f5-4348e23ae9e8
 health HEALTH_OK
 monmap e3: 3 mons at ..., election epoch 12, quorum 0,1,2 0,1,2
 osdmap e94: 12 osds: 12 up, 12 in
 pgmap v19001: 592 pgs, 4 pools, 30160 MB data, 7682 objects
 89912 MB used, 22191 GB / 22279 GB avail
 588 active+clean
 4 active+clean+scrubbing

 - Mail original -

 De: "Fabrizio G. Ventola" 
 À: ceph-users@lists.ceph.com
 Envoyé: Lundi 12 Mai 2014 15:42:03
 Objet: [ceph-users] ceph firefly PGs in active+clean+scrubbing state

 Hello, last week I've upgraded from 0.72.2 to last stable firefly 0.80
 following the suggested procedure (upgrade in order monitors, OSDs,
 MDSs, clients) on my 2 different clusters.

 Everything is ok, I've HEALTH_OK on both, the only weird thing is that
 few PGs remain in active+clean+scrubbing. I've tried to query the PG
 and reboot the involved OSD daemons and hosts but the issue is still
 present and the involved PGs with +scrubbing state changes.

 I've tried as well to put noscrub on OSDs with "ceph osd set noscrub"
 nut nothing changed.

 What can I do? I attach the cluster statuses and their cluster maps:

 FIRST CLUSTER:

 health HEALTH_OK
 mdsmap e510: 1/1/1 up {0=ceph-mds1=up:active}, 1 up:standby
 osdmap e4604: 5 osds: 5 up, 5 in
 pgmap v138288: 1332 pgs, 4 pools, 117 GB data, 30178 objects
 353 GB used, 371 GB / 724 GB avail
 1331 active+clean
 1 active+clean+scrubbing

 # id weight type name up/down reweight
 -1 0.84 root default
 -7 0.28 rack rack1
 -2 0.14 host cephosd1-dev
 0 0.14 osd.0 up 1
 -3 0.14 host cephosd2-dev
 1 0.14 osd.1 up 1
 -8 0.28 rack rack2
 -4 0.14 host cephosd3-dev
 2 0.14 osd.2 up 1
 -5 0.14 host cephosd4-dev
 3 0.14 osd.3 up 1
 -9 0.28 rack rack3
 -6 0.28 host cephosd5-dev
 4 0.28 osd.4 up 1

 SECOND CLUSTER:

 health HEALTH_OK
 osdmap e158: 10 osds: 10 up, 10 in
 pgmap v9724: 2001 pgs, 6 pools, 395 MB data, 139 objects
 1192 MB used, 18569 GB / 18571 GB avail
 1998 active+clean
 3 active+clean+scrubbing

 # id weight type name up/down reweight
 -1 18.1 root default
 -2 9.05 host wn-recas-uniba-30
 0 1.81 osd.0 up 1
 1 1.81 osd.1 up 1
 2 1.81 osd.2 up 1
 3 1.81 osd.3 up 1
 4 1.81 osd.4 up 1
 -3 9.05 host wn-recas-uniba-32
 5 1.81 osd.5 up 1
 6 1.81 osd.6 up 1
 7 1.81 osd.7 up 1
 8 1.81 osd.8 up 1
 9 1.81 osd.9 up 1
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>> ___

[ceph-users] Advanced CRUSH map rules

2014-05-14 Thread Fabrizio G. Ventola
Hi everybody,

Is it possible with CRUSH map to make a rule that puts R-1 replicas on
a node and the remaining one on a different node of the same failure
domain (for example datacenter) putting the replicas considering a
deeper failure domain (e.g. room)? Statement "step emit" may help in
this?
Ideally I'm trying to put two of three replicas into same datacenter
but in different rooms (or similar) and the remaining one in another
datacenter. CRUSH map can do this or it's only achievable with
zone/region configuration with rados gw?

And it's even possible to specifiy the "primary affinity" in the sense
that for specified clients (or for specified pools) ceph has to store
the primary replica in the closest (to the client) datacenter and the
other replica in another datacenter.
Probably for this I will need zones/regions configuration.

Cheers,
Fabrizio
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Advanced CRUSH map rules

2014-05-14 Thread Gregory Farnum
On Wed, May 14, 2014 at 9:56 AM, Fabrizio G. Ventola
 wrote:
> Hi everybody,
>
> Is it possible with CRUSH map to make a rule that puts R-1 replicas on
> a node and the remaining one on a different node of the same failure
> domain (for example datacenter) putting the replicas considering a
> deeper failure domain (e.g. room)? Statement "step emit" may help in
> this?
> Ideally I'm trying to put two of three replicas into same datacenter
> but in different rooms (or similar) and the remaining one in another
> datacenter. CRUSH map can do this or it's only achievable with
> zone/region configuration with rados gw?

CRUSH can do this. You'd have two choose ...emit sequences;
the first of which would descend down to a host and then choose n-1
devices within the host; the second would descend once. I think
something like this should work:

step take default
step choose firstn 1 datacenter
step chooseleaf firstn -1 room
step emit
step chooseleaf firstn 1 datacenter
step emit

Would pick one datacenter, and put R-1 copies of the data in separate
rooms. Then it would pick another datacenter and put 1 copy of the
data somewhere in it. I haven't tested this and it's been a while so
there might be some sharp edges, though (I *think* that should work
just fine, but you might need to use choose statements instead of
chooseleaf all the way down or something).
-Greg

> And it's even possible to specifiy the "primary affinity" in the sense
> that for specified clients (or for specified pools) ceph has to store
> the primary replica in the closest (to the client) datacenter and the
> other replica in another datacenter.

To do something like this you'd want to set up pools with special
rules to do that. Instead of "step take default" you'd do "step take
".
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Advanced CRUSH map rules

2014-05-14 Thread Pavel V. Kaygorodov
Hi!

> CRUSH can do this. You'd have two choose ...emit sequences;
> the first of which would descend down to a host and then choose n-1
> devices within the host; the second would descend once. I think
> something like this should work:
> 
> step take default
> step choose firstn 1 datacenter
> step chooseleaf firstn -1 room
> step emit
> step chooseleaf firstn 1 datacenter
> step emit
> 

May be I'm wrong, but this will not guarantee choice of different datacenters 
for n-1 and remaining replica. 
I have experimented with rules like this, trying to put one replica to "main 
host" and other replicas to some other hosts.
Some OSDs was referenced two times in some of generated pg's.

Pavel.



> Would pick one datacenter, and put R-1 copies of the data in separate
> rooms. Then it would pick another datacenter and put 1 copy of the
> data somewhere in it. I haven't tested this and it's been a while so
> there might be some sharp edges, though (I *think* that should work
> just fine, but you might need to use choose statements instead of
> chooseleaf all the way down or something).
> -Greg
> 
>> And it's even possible to specifiy the "primary affinity" in the sense
>> that for specified clients (or for specified pools) ceph has to store
>> the primary replica in the closest (to the client) datacenter and the
>> other replica in another datacenter.
> 
> To do something like this you'd want to set up pools with special
> rules to do that. Instead of "step take default" you'd do "step take
> ".
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] sparse copy between pools

2014-05-14 Thread Andrey Korolyov
On 05/14/2014 02:13 PM, Erwin Lubbers wrote:
> Hi,
> 
> I'm trying to copy a sparse provisioned rbd image from pool A to pool B (both 
> are replicated three times). The image has a disksize of 8 GB and contains 
> around 1.4 GB of data. I do use:
> 
> rbd cp PoolA/Image PoolB/Image
> 
> After copying "ceph -s" tells me that 24 GB diskspace extra is in use. Then I 
> delete the original pool A image and only 8 GB of space is freed.
> 
> Does Ceph not sparse copy the image using cp? Is there another way to do so?
> 
> I'm using 0.67.7 dumpling on this cluster.

I believe that the
http://tracker.ceph.com/projects/ceph/repository/revisions/824da2029613a6f4b380b6b2f16a0bd0903f7e3c/diff/src/librbd/internal.cc
had to went to the dumpling as backport; github shows that it was not.

Josh, would you mind to add your fix there too?
> 
> Regards,
> Erwin
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Advanced CRUSH map rules

2014-05-14 Thread Gregory Farnum
On Wed, May 14, 2014 at 10:52 AM, Pavel V. Kaygorodov  wrote:
> Hi!
>
>> CRUSH can do this. You'd have two choose ...emit sequences;
>> the first of which would descend down to a host and then choose n-1
>> devices within the host; the second would descend once. I think
>> something like this should work:
>>
>> step take default
>> step choose firstn 1 datacenter
>> step chooseleaf firstn -1 room
>> step emit
>> step chooseleaf firstn 1 datacenter
>> step emit
>>
>
> May be I'm wrong, but this will not guarantee choice of different datacenters 
> for n-1 and remaining replica.
> I have experimented with rules like this, trying to put one replica to "main 
> host" and other replicas to some other hosts.
> Some OSDs was referenced two times in some of generated pg's.

Argh, I forgot about this, but you're right. :( So you can construct
these sorts of systems manually (by having different "step take...step
emit" blocks, but CRUSH won't do it for you in a generic way.

However, for *most* situations that people are interested in, you can
pull various tricks to accomplish what you're actually after. (I
haven't done this one myself, but I'm told others have.) For instance,
if you just want 1 copy segregated from the others, you can do this:

step take default
step choose firstn 2 datacenter
step chooseleaf firstn -1 room
step emit

That will generate an ordered list of 2(n-1) OSDs, but since you only
want n, you'll take n-1 from the first datacenter and only 1 from the
second. :) You can extend this to n-2, etc.

If you have the pools associated with particular datacenters, you can
set up rules which place a certain number of copies in the primary
datacenter, and then use parallel crush maps to choose one of the
other datacenters for a given number of replica copies. (That is, you
can have multiple root buckets; one for each datacenter that includes
everybody BUT the datacenter it is associated with.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Why number of objects increase when a PG is added

2014-05-14 Thread Shesha Sreenivasamurthy
Hi,
   I was experimenting with Ceph and found an interesting behavior  (at
least to me) : Number of objects doubled when a new placement group was
added.

Experiment Set Up:

   - 3 Nodes with one OSD per node
   - Replication = 1
   - ceph osd pool create $poolName 1;
  - ceph osd pool set $poolName size 1;
   - Set number of PG=30
   - ceph osd pool set $poolName pg_num 30;
  - ceph osd pool set $poolName pgp_num 30
   - Start creating objects of 1000 Bytes for a period of 120 seconds using
   rados bench with 1 thread
   - rados -p $poolName -b 1000 -t 1 bench 120 write &> bench.out
   - While the creation is going on gather df statistics every second
   - rados df -p $poolName &> df.out
   - After 75 seconds add a new placement group
  - ceph osd pool set $poolName pg_num 31;
  - ceph osd pool set $poolName pgp_num 31;
   - Plot the number of objects and data size from the above df command.

I was wondering why the number of object count doubled when we add an new
placement group.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why number of objects increase when a PG is added

2014-05-14 Thread Gregory Farnum
On Wed, May 14, 2014 at 12:12 PM, Shesha Sreenivasamurthy
 wrote:
> Hi,
>I was experimenting with Ceph and found an interesting behavior  (at
> least to me) : Number of objects doubled when a new placement group was
> added.
>
> Experiment Set Up:
>
> 3 Nodes with one OSD per node
> Replication = 1
>
> ceph osd pool create $poolName 1;
> ceph osd pool set $poolName size 1;
>
> Set number of PG=30
>
> ceph osd pool set $poolName pg_num 30;
> ceph osd pool set $poolName pgp_num 30
>
> Start creating objects of 1000 Bytes for a period of 120 seconds using rados
> bench with 1 thread
>
> rados -p $poolName -b 1000 -t 1 bench 120 write &> bench.out
>
> While the creation is going on gather df statistics every second
>
> rados df -p $poolName &> df.out
>
> After 75 seconds add a new placement group
>
> ceph osd pool set $poolName pg_num 31;
> ceph osd pool set $poolName pgp_num 31;
>
> Plot the number of objects and data size from the above df command.
>
> I was wondering why the number of object count doubled when we add an new
> placement group.

It's an accounting artifact. When you split PGs, each of the "child"
PGs will report the parents' statistics until it does a scrub and
knows how much data is actually present.
I believe this is changed in Firefly, so it splits up the data
proportionally between them.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephx authentication defaults

2014-05-14 Thread Brian Rak
Why are the defaults for 'cephx require signatures' and similar still 
false?  Is it still necessary to maintain backwards compatibility with 
very old clients by default?  It seems like from a security POV, you'd 
want everything to be more secure out of the box, and require the user 
to explicitly disable security if they need backwards compatibility with 
ancient clients.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] simultaneous access to ceph via librados and s3 gw

2014-05-14 Thread Lukac, Erik
Hi there,

does anybody have an idea, how I can access my files created via librados 
through the s3 gateway on my ceph-cluster?

Uploading via librados and then accessing via s3 seems to be impossible because 
I only see a bunch of entries but not the files I uploaded.

The perfect solution would be, if I can access all my content through s3 gw, 
librados and rbd.

Does anybody have a hint how I could do that? Or is it simply not possible?

Thanks in advance

Erik
--
Bayerischer Rundfunk; Rundfunkplatz 1; 80335 München
Telefon: +49 89 590001; E-Mail: i...@br.de; Website: http://www.BR.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] librados with java - who is using it?

2014-05-14 Thread Lukac, Erik
Hi there,



me again



is there anybody who uses librados in java? It seems like my company would be 
the first one who thinks about using it and if I (as a part of OPS-Team) cant 
convince our DEV-Team to use librados and improve performance they'll use 
radosgw :(



I'd like to know best practices. Maybe anybody wants to share knowledge with me.



Thanks in advance



Erik

--
Bayerischer Rundfunk; Rundfunkplatz 1; 80335 München
Telefon: +49 89 590001; E-Mail: i...@br.de; Website: http://www.BR.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] simultaneous access to ceph via librados and s3 gw

2014-05-14 Thread Gregory Farnum
On Wed, May 14, 2014 at 2:42 PM, Lukac, Erik  wrote:
> Hi there,
>
> does anybody have an idea, how I can access my files created via librados
> through the s3 gateway on my ceph-cluster?
>
> Uploading via librados and then accessing via s3 seems to be impossible
> because I only see a bunch of entries but not the files I uploaded.
>
> The perfect solution would be, if I can access all my content through s3 gw,
> librados and rbd.
>
> Does anybody have a hint how I could do that? Or is it simply not possible?

Sadly, this is not an option. The access semantics are just too
different, so each of those mechanisms is an independent pool of
storage. (Well, technically you can look at the s3 content via
librados, but there's not a good library for doing so.)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] simultaneous access to ceph via librados and s3 gw

2014-05-14 Thread Lukac, Erik
Hi Greg,



wow, that was a fast answer! Thanks a lot!



Okay, I suspected that :(



Good night from Munic

h



Erik


Von: Gregory Farnum [g...@inktank.com]
Gesendet: Mittwoch, 14. Mai 2014 23:55
An: Lukac, Erik
Cc: ceph-us...@ceph.com
Betreff: Re: [ceph-users] simultaneous access to ceph via librados and s3 gw

On Wed, May 14, 2014 at 2:42 PM, Lukac, Erik  wrote:
> Hi there,
>
> does anybody have an idea, how I can access my files created via librados
> through the s3 gateway on my ceph-cluster?
>
> Uploading via librados and then accessing via s3 seems to be impossible
> because I only see a bunch of entries but not the files I uploaded.
>
> The perfect solution would be, if I can access all my content through s3 gw,
> librados and rbd.
>
> Does anybody have a hint how I could do that? Or is it simply not possible?

Sadly, this is not an option. The access semantics are just too
different, so each of those mechanisms is an independent pool of
storage. (Well, technically you can look at the s3 content via
librados, but there's not a good library for doing so.)
-Greg
Software Engineer #42 @ http://inktank.com | 
http://ceph.com
--
Bayerischer Rundfunk; Rundfunkplatz 1; 80335 München
Telefon: +49 89 590001; E-Mail: i...@br.de; Website: http://www.BR.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-14 Thread Josef Johansson

Hi,

So, apparently tmpfs does not support non-root xattr due to a possible 
DoS-vector. There's configuration set for enabling it as far as I can see.


CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y

Anyone know a way around it? Saw that there's a patch for enabling it, 
but recompiling my kernel is out of reach right now ;)


Created the osd with following:

root@osd1:/# dd seek=6G if=/dev/zero of=/dev/shm/test-osd/img bs=1 count=1
root@osd1:/# losetup /dev/loop0 /dev/shm/test-osd/img
root@osd1:/# mkfs.xfs /dev/loop0
root@osd1:/# ceph osd create
50
root@osd1:/# mkdir /var/lib/ceph/osd/ceph-50
root@osd1:/# mount -t xfs /dev/loop0 /var/lib/ceph/osd/ceph-50
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey 
--osd-journal=/dev/sdc7 --mkjournal
2014-05-15 00:20:29.796822 7f40063bb780 -1 journal FileJournal::_open: 
aio not supported without directio; disabling aio
2014-05-15 00:20:29.798583 7f40063bb780 -1 journal check: ondisk fsid 
bc14ff30-e016-4e0d-9672-96262ee5f07e doesn't match expected 
b3f5b98b-e024-4153-875d-5c758a6060eb, invalid (someone else's?) journal
2014-05-15 00:20:29.802155 7f40063bb780 -1 journal FileJournal::_open: 
aio not supported without directio; disabling aio
2014-05-15 00:20:29.807237 7f40063bb780 -1 
filestore(/var/lib/ceph/osd/ceph-50) could not find 
23c2fcde/osd_superblock/0//-1 in index: (2) No such file or directory
2014-05-15 00:20:29.809083 7f40063bb780 -1 created object store 
/var/lib/ceph/osd/ceph-50 journal /dev/sdc7 for osd.50 fsid 
c51a2683-55dc-4634-9d9d-f0fec9a6f389
2014-05-15 00:20:29.809121 7f40063bb780 -1 auth: error reading file: 
/var/lib/ceph/osd/ceph-50/keyring: can't open 
/var/lib/ceph/osd/ceph-50/keyring: (2) No such file or directory
2014-05-15 00:20:29.809179 7f40063bb780 -1 created new key in keyring 
/var/lib/ceph/osd/ceph-50/keyring
root@osd1:/# ceph-osd --debug_ms 50 -i 50 --mkfs --mkkey 
--osd-journal=/dev/sdc7 --mkjournal
2014-05-15 00:20:51.122716 7ff813ba4780 -1 journal FileJournal::_open: 
aio not supported without directio; disabling aio
2014-05-15 00:20:51.126275 7ff813ba4780 -1 journal FileJournal::_open: 
aio not supported without directio; disabling aio
2014-05-15 00:20:51.129532 7ff813ba4780 -1 provided osd id 50 != 
superblock's -1
2014-05-15 00:20:51.129845 7ff813ba4780 -1  ** ERROR: error creating 
empty object store in /var/lib/ceph/osd/ceph-50: (22) Invalid argument


Cheers,
Josef

Christian Balzer skrev 2014-05-14 14:33:

Hello!

On Wed, 14 May 2014 11:29:47 +0200 Josef Johansson wrote:


Hi Christian,

I missed this thread, haven't been reading the list that well the last
weeks.

You already know my setup, since we discussed it in an earlier thread. I
don't have a fast backing store, but I see the slow IOPS when doing
randwrite inside the VM, with rbd cache. Still running dumpling here
though.


Nods, I do recall that thread.


A thought struck me that I could test with a pool that consists of OSDs
that have tempfs-based disks, think I have a bit more latency than your
IPoIB but I've pushed 100k IOPS with the same network devices before.
This would verify if the problem is with the journal disks. I'll also
try to run the journal devices in tempfs as well, as it would test
purely Ceph itself.


That would be interesting indeed.
Given what I've seen (with the journal at 20% utilization and the actual
filestore ataround 5%) I'd expect Ceph to be the culprit.
  

I'll get back to you with the results, hopefully I'll manage to get them
done during this night.


Looking forward to that. ^^


Christian

Cheers,
Josef

On 13/05/14 11:03, Christian Balzer wrote:

I'm clearly talking to myself, but whatever.

For Greg, I've played with all the pertinent journal and filestore
options and TCP nodelay, no changes at all.

Is there anybody on this ML who's running a Ceph cluster with a fast
network and FAST filestore, so like me with a big HW cache in front of
a RAID/JBODs or using SSDs for final storage?

If so, what results do you get out of the fio statement below per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
which is of course vastly faster than the normal indvidual HDDs could
do.

So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything
else has been ruled out from where I stand.

This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below.
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in
my test case the only time the caching becomes noticeable is if I
increase the cache size to something larger than the test data size.
^o^

On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped.

Christian

On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:


On Wed, 7 May 2014 22:13:53

[ceph-users] PCI-E SSD Journal for SSD-OSD Disks

2014-05-14 Thread Tyler Wilson
Hey All,

I am setting up a new storage cluster that absolutely must have the best
read/write sequential speed @ 128k and the highest IOps at 4k read/write as
possible.

My current specs for each storage node are currently;
CPU: 2x E5-2670V2
Motherboard: SM X9DRD-EF
OSD Disks: 20-30 Samsung 840 1TB
OSD Journal(s): 1-2 Micron RealSSD P320h
Network: 4x 10gb, Bridged
Memory: 32-96GB depending on need

Does anyone see any potential bottlenecks in the above specs? What kind of
improvements or configurations can we make on the OSD config side? We are
looking to run this with 2 replication.

Thanks for your guys assistance with this.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bulk storage use case

2014-05-14 Thread Craig Lewis



I would suggest you start some smaller scale tests to get a feeling
for the performance before committing to a large purchase of this
hardware type.

Indeed, without some solid pointers, this is the only way left.



Even with solid pointers, that's the best way.  :-)


--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website   | Twitter 
  | Facebook 
  | LinkedIn 
  | Blog 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] can i change the ruleset for the default pools (data, metadata, rbd)?

2014-05-14 Thread Cao, Buddy
Hi,

I notice after create ceph cluster, the ruleset for the default pools (data, 
metadata, rbd) are 0,1,2 respectively. After creating the cluster, are there 
any impact if I change the default ruleset to other ruleset?


Wei Cao (Buddy)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PCI-E SSD Journal for SSD-OSD Disks

2014-05-14 Thread Mark Nelson

On 05/14/2014 06:36 PM, Tyler Wilson wrote:

Hey All,


Hi!



I am setting up a new storage cluster that absolutely must have the best
read/write sequential speed @ 128k and the highest IOps at 4k read/write
as possible.


I assume random?



My current specs for each storage node are currently;
CPU: 2x E5-2670V2
Motherboard: SM X9DRD-EF
OSD Disks: 20-30 Samsung 840 1TB
OSD Journal(s): 1-2 Micron RealSSD P320h
Network: 4x 10gb, Bridged
Memory: 32-96GB depending on need

Does anyone see any potential bottlenecks in the above specs? What kind
of improvements or configurations can we make on the OSD config side? We
are looking to run this with 2 replication.


Likely you'll run into latency due to context switching and lock 
contention in the OSDs and maybe even some kernel slowness.  Potentially 
you could end up CPU limited too, even with E5-2670s given how fast all 
of those SSDs are.  I'd suggest considering a chassis without an 
expander backplane and using multiple controllers with the drives 
directly attached.


There's work going into improving things on the Ceph side but I don't 
know how much of it has even hit our wip branches in github yet.  So for 
now ymmv, but there's a lot of work going on in this area as it's 
something that lots of folks are interested in.


I'd also suggest testing whether or not putting all of the journals on 
the RealSSD cards actually helps you that much over just putting your 
journals on the other SSDs.  The advantage here is that by putting 
journals on the 2.5" SSDs, you don't lose a pile of OSDs if one of those 
PCIE cards fails.


The only other thing I would be careful about is making sure that your 
SSDs are good about dealing with power failure during writes.  Not all 
SSDs behave as you would expect.




Thanks for your guys assistance with this.


np, good luck!




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Move osd disks between hosts

2014-05-14 Thread Craig Lewis




On 5/14/14 06:36 , Dinu Vlad wrote:



Hi Dinu,

On Wed, 14 May 2014, Dinu Vlad wrote:

I'm running a ceph cluster with 3 mon and 4 osd nodes (32 disks total) and I've been 
looking at the possibility to "migrate" the data to 2 new nodes. The operation 
should happen by relocating the disks - I'm not getting any new hard-drives. The cluster 
is used as a backend for an openstack cloud, so downtime should be as short as possible - 
preferably not more than 24 h during the week-end.

I'd like a second opinion on the process - since I do not have the resources to 
test the move scenario. I'm running emperor (0.72.1) at the moment. All pools 
in the cluster have size 2. Each existing OSD nodes have each an SSD for 
journals; /dev/disk/by-id paths were used.

Here's what I think would work:
1 - stop ceph on the existing OSD nodes (all of them) and shutdown the node 1 & 
2;
2 - take drives 1-16/ssds 1-2 out and put them in the new node #1; start it up 
with ceph's upstart script set on manual and check/correct journal paths
3 - edit the CRUSH map on the monitors to reflect the new situation
4 - start ceph on the new node #1 and old nodes 3 & 4; wait for the rebuild to 
happen
5 - repeat steps 1-4 for the rest of the nodes/drives;

If you used ceph-deploy and/or ceph-disk to set up these OSDs (that is, if
they are stored on labeled GPT partitions such that upstart is
automagically starting up the ceph-osd daemons for you without you putting
anythign in /etc/fstab to manually mount the volumes) then all of this
should be plug and play for you--including step #3.  By default, the
startup process will 'fix' the CRUSH hierarchy position based on the
hostname and (if present) other positional data configured for 'crush
location' in ceph.conf.  The only real requirement is that both the osd
data and journal volumes get moved so that the daemon has everything it
needs to start up.

sage




If you used disk encryption (ceph-deploy --dmcrypt), you'll also need to 
copy the keys from /etc/ceph/dmcrypt-keys/.



--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website   | Twitter 
  | Facebook 
  | LinkedIn 
  | Blog 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Flapping OSDs. Safe to upgrade?

2014-05-14 Thread Craig Lewis
I have 4 OSDs that won't stay in the cluster.  I restart them, they join 
for a bit, then get kicked out because they stop responding to pings 
from the other OSDs.


I don't know what the issue is.  The disks look fine.  SMART reports no 
errors or reallocated sectors.  iostat says the disks are nearly idle 
when the OSD stops responding.  dmesg says it's restarting the process, 
but doesn't say anything else interesting.  kern.log doesn't say anything.


I'm out of ideas, and I'm ready to gamble.


So I have two ideas that might fix the issue.  I can upgrade Emperor to 
Firefly.  Or I can upgrade Ubuntu 12.04 (kernel 3.5.0-49-generic) to 
14.04 (kernel 3.13.0-24-generic).  If I upgrade to 14.04, I plan to hold 
Ceph on Emperor for the time being.




My PG states:
1989 active+clean
  17 active+remapped
  12 down+peering
 507 active+degraded
   1 active+degraded+remapped+wait_backfill
  28 stale+down+peering
   2 active+recovering+degraded+remapped
   1 down+remapped+peering
3 incomplete

If I upgrade to Firefly, am I going to make things worse?

Any opinions on which gamble is more likely to pay off?


I plan to do both upgrades, but I want to do them one at a time unless 
necessary.  I'm wondering which upgrade I should attempt first.







--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website   | Twitter 
  | Facebook 
  | LinkedIn 
  | Blog 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping OSDs. Safe to upgrade?

2014-05-14 Thread Brian Rak
Anything in dmesg?  When you say restart, do you mean a physical 
restart, or just restarting the daemon?  If it takes a physical restart 
and you're using intel NICs, it might be worth upgrading network 
drivers.  Old versions have some bugs that cause them to just drop traffic.


On 5/14/2014 9:06 PM, Craig Lewis wrote:
I have 4 OSDs that won't stay in the cluster.  I restart them, they 
join for a bit, then get kicked out because they stop responding to 
pings from the other OSDs.


I don't know what the issue is.  The disks look fine.  SMART reports 
no errors or reallocated sectors.  iostat says the disks are nearly 
idle when the OSD stops responding.  dmesg says it's restarting the 
process, but doesn't say anything else interesting.  kern.log doesn't 
say anything.


I'm out of ideas, and I'm ready to gamble.


So I have two ideas that might fix the issue.  I can upgrade Emperor 
to Firefly.  Or I can upgrade Ubuntu 12.04 (kernel 3.5.0-49-generic) 
to 14.04 (kernel 3.13.0-24-generic).  If I upgrade to 14.04, I plan to 
hold Ceph on Emperor for the time being.




My PG states:
1989 active+clean
  17 active+remapped
  12 down+peering
 507 active+degraded
   1 active+degraded+remapped+wait_backfill
  28 stale+down+peering
   2 active+recovering+degraded+remapped
   1 down+remapped+peering
3 incomplete

If I upgrade to Firefly, am I going to make things worse?

Any opinions on which gamble is more likely to pay off?


I plan to do both upgrades, but I want to do them one at a time unless 
necessary.  I'm wondering which upgrade I should attempt first.







--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website   | Twitter 
  | Facebook 
  | LinkedIn 
  | Blog 





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PCI-E SSD Journal for SSD-OSD Disks

2014-05-14 Thread Mark Kirkwood

On 15/05/14 11:36, Tyler Wilson wrote:

Hey All,

I am setting up a new storage cluster that absolutely must have the best
read/write sequential speed @ 128k and the highest IOps at 4k read/write
as possible.

My current specs for each storage node are currently;
CPU: 2x E5-2670V2
Motherboard: SM X9DRD-EF
OSD Disks: 20-30 Samsung 840 1TB
OSD Journal(s): 1-2 Micron RealSSD P320h
Network: 4x 10gb, Bridged
Memory: 32-96GB depending on need

Does anyone see any potential bottlenecks in the above specs? What kind
of improvements or configurations can we make on the OSD config side? We
are looking to run this with 2 replication.

Thanks for your guys assistance with this.


On thing that comes to mind is write endurance for the Samsung drives:

Samsung 840 Pro: 
(http://www.samsung.com/us/pdf/memory-storage/840PRO_25_SATA_III_Spec.pdf)


For enterprise applications, 5 years limited warranty assumes a
maximum average workload
of 40GB/day (calculated based on host
writes and on the industry standard of 3-month data retention).
Workloads in excess of 40GB/day are not covered under warrant

Compare with

Intel DC3700 
(http://www.intel.com/content/dam/www/public/us/en/documents/product-specifications/ssd-dc-s3700-spec.pdf)


10 drive writes per day for 5 years

I'm not trying to be an Intel sales guy here, but I'd be wary of using 
the Samsung 840 for a (busy) server based workload - 40G of data churn 
per day is not a great deal.


Regards

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpenStack Icehouse and ephemeral disks created from image

2014-05-14 Thread Maciej Gałkiewicz
On 28 April 2014 16:11, Sebastien Han  wrote:

> Yes yes, just restart cinder-api and cinder-volume.
> It worked for me.


In my case the image is still downloaded:(

 {"status": "active", "name": "instance.image-v0.98-1-cc2.img", "tags": [],
"container_format": "bare", "created_at": "2014-05-14T10:32:08Z",
"disk_format": "raw", "updated_at": "2014-05-14T10:34:26Z", "visibility":
"public", "self": "/v2/
images/d747645d-1d8c-4725-aa6b-a75878ca99b9", "min_disk": 0, "protected":
false, "id": "d747645d-1d8c-4725-aa6b-a75878ca99b9", "file":
"/v2/images/d747645d-1d8c-4725-aa6b-a75878ca99b9/file", "checksum":
"3a9ed9e7a37207b8da3e0c89ed467447",
 "owner": "66a545918dea4e62b83356179c60dd2f", "size": 4831838208,
"min_ram": 0, "schema": "/v2/schemas/image"}
 log_http_response
/usr/lib/python2.7/dist-packages/glanceclient/common/http.py:151
2014-05-15 03:49:51.936 6745 INFO cinder.volume.flows.manager.create_volume
[req-05877879-f875-4a69-893b-dde93c2a9267 3abc796d9c544d039fe7d5b90b206a30
e466feaf9a58472a86989156edc9acf4 - - -] Volume
f9b21cc6-db73-41c4-9c3b-04ef6217fb3c: be
ing created using CreateVolumeFromSpecTask._create_from_image with
specification: {'status': u'creating', 'image_location': (None, None),
'volume_size': 8, 'volume_name':
u'volume-f9b21cc6-db73-41c4-9c3b-04ef6217fb3c', 'image_id': u'd7476
45d-1d8c-4725-aa6b-a75878ca99b9', 'image_service':
,
'image_meta': {'status': u'active', 'name':
u'instance.image-v0.98-1-cc2.img', 'deleted': None, 'container_format': u'ba
re', 'created_at': datetime.datetime(2014, 5, 14, 10, 32, 8,
tzinfo=), 'disk_format':
u'raw', 'updated_at': datetime.datetime(2014, 5, 14, 10, 34, 26,
tzinfo=), 'id': u'd747645d-1d8c-4725-aa6b-a75878ca99b9', 'owner':
u'66a545918dea4e62b83356179c60dd2f', 'min_ram': 0, 'checksum':
u'3a9ed9e7a37207b8da3e0c89ed467447', 'min_disk': 0, 'is_public': None,
'deleted_at': None, 'properties':
 {}, 'size': 4831838208}}
2014-05-15 03:49:51.936 6745 DEBUG
cinder.volume.flows.manager.create_volume
[req-05877879-f875-4a69-893b-dde93c2a9267 3abc796d9c544d039fe7d5b90b206a30
e466feaf9a58472a86989156edc9acf4 - - -] Cloning
f9b21cc6-db73-41c4-9c3b-04ef6217fb3c f
rom image d747645d-1d8c-4725-aa6b-a75878ca99b9  at location (None, None).
_create_from_image
/usr/lib/python2.7/dist-packages/cinder/volume/flows/manager/create_volume.py:528
2014-05-15 03:49:51.936 6745 DEBUG cinder.volume.drivers.rbd
[req-05877879-f875-4a69-893b-dde93c2a9267 3abc796d9c544d039fe7d5b90b206a30
e466feaf9a58472a86989156edc9acf4 - - -] creating volume
'volume-f9b21cc6-db73-41c4-9c3b-04ef6217fb3c'
create_volume
/usr/lib/python2.7/dist-packages/cinder/volume/drivers/rbd.py:469
2014-05-15 03:49:52.244 6745 DEBUG
cinder.volume.flows.manager.create_volume
[req-05877879-f875-4a69-893b-dde93c2a9267 3abc796d9c544d039fe7d5b90b206a30
e466feaf9a58472a86989156edc9acf4 - - -] Attempting download of
d747645d-1d8c-4725-aa6b
-a75878ca99b9 ((None, None)) to volume
f9b21cc6-db73-41c4-9c3b-04ef6217fb3c. _copy_image_to_volume
/usr/lib/python2.7/dist-packages/cinder/volume/flows/manager/create_volume.py:450

Any suggestions?

regards
-- 
Maciej Gałkiewicz
Shelly Cloud Sp. z o. o., Co-founder, Sysadmin
http://shellycloud.com/, mac...@shellycloud.com
KRS: 440358 REGON: 101504426
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flapping OSDs. Safe to upgrade?

2014-05-14 Thread Craig Lewis



Anything in dmesg? 


Just
[188924.137100] init: ceph-osd (ceph/6) main process (8262) killed by 
ABRT signal

[188924.137138] init: ceph-osd (ceph/6) main process ended, respawning


When you say restart, do you mean a physical restart, or just 
restarting the daemon?  If it takes a physical restart and you're 
using intel NICs, it might be worth upgrading network drivers. Old 
versions have some bugs that cause them to just drop traffic.


Either a daemon restart, or a node reboot.

I am using Intel NICs. lspci says 'Intel Corporation 82576 Gigabit 
Network Connection'.  It doesn't appear to be dropped traffic.  It's too 
consistent to be randomly dropped traffic.



But I'll take that as a vote for the Ubuntu upgrade.


--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website   | Twitter 
  | Facebook 
  | LinkedIn 
  | Blog 



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PCI-E SSD Journal for SSD-OSD Disks

2014-05-14 Thread Christian Balzer
On Wed, 14 May 2014 19:28:17 -0500 Mark Nelson wrote:

> On 05/14/2014 06:36 PM, Tyler Wilson wrote:
> > Hey All,
> 
> Hi!
> 
> >
> > I am setting up a new storage cluster that absolutely must have the
> > best read/write sequential speed @ 128k and the highest IOps at 4k
> > read/write as possible.
> 
> I assume random?
> 
> >
> > My current specs for each storage node are currently;
> > CPU: 2x E5-2670V2
> > Motherboard: SM X9DRD-EF
> > OSD Disks: 20-30 Samsung 840 1TB
> > OSD Journal(s): 1-2 Micron RealSSD P320h
> > Network: 4x 10gb, Bridged
I assume you mean 2x10Gb bonded for public and 2x10Gb for cluster network?

The SSDs you specified would read at about 500MB/s, meaning that only 4 of
them would already saturate your network uplink.
For writes (assuming journal on SSDs, see below) you reach that point with
just 8 SSDs.

> > Memory: 32-96GB depending on need
RAM is pretty cheap these days and a large pagecache on the storage nodes
is always quite helpful.

> >

How many of these nodes are you planning to deploy initially?
As always and especially when going for performance, more and smaller
nodes tend to be better, also less impact if one goes down.
And in your case it is easier to balance storage and network bandwidth,
see above.

> > Does anyone see any potential bottlenecks in the above specs? What kind
> > of improvements or configurations can we make on the OSD config side?
> > We are looking to run this with 2 replication.
> 
> Likely you'll run into latency due to context switching and lock 
> contention in the OSDs and maybe even some kernel slowness.  Potentially 
> you could end up CPU limited too, even with E5-2670s given how fast all 
> of those SSDs are.  I'd suggest considering a chassis without an 
> expander backplane and using multiple controllers with the drives 
> directly attached.
> 

Indeed, I'd be worried about that as well, same with the
chassis/controller bit.

> There's work going into improving things on the Ceph side but I don't 
> know how much of it has even hit our wip branches in github yet.  So for 
> now ymmv, but there's a lot of work going on in this area as it's 
> something that lots of folks are interested in.
> 
If you look at the current "Slow IOPS on RBD compared to journal and
backing devices" thread and the Inktank document referenced in it

https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
 

you should probably assume no more than 800 random write IOPS and 4000
random read IOPS per OSD (4KB block size). 
That later number I can also reproduce with my cluster.

Now I expect those numbers to go up as Ceph is improved, but for the time
being those limits might influence your choice of hardware.

> I'd also suggest testing whether or not putting all of the journals on 
> the RealSSD cards actually helps you that much over just putting your 
> journals on the other SSDs.  The advantage here is that by putting 
> journals on the 2.5" SSDs, you don't lose a pile of OSDs if one of those 
> PCIE cards fails.
> 
More than seconded, I could only find READ values on the Micron site which
makes me very suspicious, as the journal's main role is to be able to
WRITE as fast as possible. Also all journals combined ought to be faster
than your final storage. 
Lastly there was no endurance data on the Micron site either and with ALL
your writes having to through those devices I'd be dead scared to deploy
them.

I'd spend that money on the case and controllers as mentioned above and
better storage SSDs.

I was going to pipe up about the Samsungs, but Mark Kirkwood did beat me
to it.
Unless you can be 100% certain that your workload per storage SSD
doesn't exceed 40GB/day I'd stay very clear of them.

Christian

> The only other thing I would be careful about is making sure that your 
> SSDs are good about dealing with power failure during writes.  Not all 
> SSDs behave as you would expect.
> 
> >
> > Thanks for your guys assistance with this.
> 
> np, good luck!
> 
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-14 Thread xan.peng
On Thu, May 8, 2014 at 9:37 AM, Gregory Farnum  wrote:
>
> Hmm, with 128 IOs at a time (I believe I'm reading that correctly?)
> that's about 40ms of latency per op (for userspace RBD), which seems
> awfully long.

Maybe this is off the topic, AFAIK "--iodepth=128" doesn't submits 128
IOs at a time.
There is a option of fio "iodepth_batch_submit=int",  which defaults
to 1, makes fio submit
each IO as soon as it is available.

See more: http://www.bluestop.org/fio/HOWTO.txt
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Performance stats

2014-05-14 Thread yalla.gnan.kumar
Hi All,

Is there a way by which we can measure the performance of Ceph block devices ? 
(Example :  I/O stats, data to identify bottlenecks etc).
Also what are the available ways in which we can compare Ceph storage 
performance with other storage solutions  ?


Thanks
Kumar




This message is for the designated recipient only and may contain privileged, 
proprietary, or otherwise confidential information. If you have received it in 
error, please notify the sender immediately and delete the original. Any other 
use of the e-mail by you is prohibited. Where allowed by local law, electronic 
communications with Accenture and its affiliates, including e-mail and instant 
messaging (including content), may be scanned by our systems for the purposes 
of information security and assessment of internal compliance with Accenture 
policy.
__

www.accenture.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com