[ceph-users] crushmap question

2014-05-13 Thread Cao, Buddy
Hi,

I have a crushmap structured as root->rack->host->osds. I designed the rule
below. Since I used "chooseleaf ... rack" in the rule definition, the Ceph PGs
will always stay stuck in the unclean state if there is only one rack in the
cluster (that is because the default metadata/data/rbd pools are set to 2
replicas). Could you let me know how to configure the rule so that it also
works in a cluster with only one rack?

rule ssd{
ruleset 1
type replicated
min_size 0
max_size 10
step take root
step chooseleaf firstn 0 type rack
step emit
}

BTW, if I add a new rack to the crushmap, the PG status does eventually reach
active+clean. However, my customer has ONLY one rack in their environment, so
asking them to set up several racks is not a workable workaround for me.

Wei Cao (Buddy)


Re: [ceph-users] crushmap question

2014-05-13 Thread Peter
Perhaps group sets of hosts into racks in the crushmap. The crushmap doesn't
have to strictly map the real world.
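
For illustration, a rough sketch of what that could look like in a decompiled
crushmap - host names, ids and weights below are just placeholders - where the
single physical rack is simply split into two logical racks, so the existing
"chooseleaf ... type rack" rule can still find two independent leaves:

rack rack1 {
        id -5                   # placeholder id
        alg straw
        hash 0                  # rjenkins1
        item host1 weight 1.000
        item host2 weight 1.000
}
rack rack2 {
        id -6                   # placeholder id
        alg straw
        hash 0                  # rjenkins1
        item host3 weight 1.000
        item host4 weight 1.000
}
root root {
        id -1
        alg straw
        hash 0                  # rjenkins1
        item rack1 weight 2.000
        item rack2 weight 2.000
}

The other option, if the cluster really will only ever have one rack, is to
leave the map alone and change the rule's failure domain to hosts
(step chooseleaf firstn 0 type host), so that 2 replicas can be placed on
different hosts within that single rack.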


On 05/13/2014 08:52 AM, Cao, Buddy wrote:


Hi,

I have a crushmap structured as root->rack->host->osds. I designed the rule
below. Since I used "chooseleaf ... rack" in the rule definition, the Ceph PGs
will always stay stuck in the unclean state if there is only one rack in the
cluster (that is because the default metadata/data/rbd pools are set to 2
replicas). Could you let me know how to configure the rule so that it also
works in a cluster with only one rack?

rule ssd{
ruleset 1
type replicated
min_size 0
max_size 10
step take root
step chooseleaf firstn 0 type rack
step emit
}

BTW, if I add a new rack to the crushmap, the PG status does eventually reach
active+clean. However, my customer has ONLY one rack in their environment, so
asking them to set up several racks is not a workable workaround for me.


Wei Cao (Buddy)





Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-13 Thread Christian Balzer

I'm clearly talking to myself, but whatever.

For Greg, I've played with all the pertinent journal and filestore options
and TCP nodelay, no changes at all.

Is there anybody on this ML who's running a Ceph cluster with a fast
network and a FAST filestore, like me with a big HW cache in front of
RAIDs/JBODs, or using SSDs for final storage?

If so, what results do you get out of the fio statement below, per OSD?
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which
is of course vastly faster than normal individual HDDs could do.

So I'm wondering if I'm hitting some inherent limitation of how fast a
single OSD (as in the software) can handle IOPS, given that everything else
has been ruled out from where I stand.

This would also explain why none of the option changes or the use of
RBD caching has any measurable effect in the test case below. 
As in, a slow OSD aka single HDD with journal on the same disk would
clearly benefit from even the small 32MB standard RBD cache, while in my
test case the only time the caching becomes noticeable is if I increase
the cache size to something larger than the test data size. ^o^

On the other hand if people here regularly get thousands or tens of
thousands IOPS per OSD with the appropriate HW I'm stumped. 

Christian

On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote:

> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote:
> 
> > Oh, I didn't notice that. I bet you aren't getting the expected
> > throughput on the RAID array with OSD access patterns, and that's
> > applying back pressure on the journal.
> > 
> 
> In the "a picture is worth a thousand words" tradition, I give you
> this iostat -x output taken during a fio run:
> 
> avg-cpu:  %user   %nice %system %iowait  %steal   %idle
>   50.820.00   19.430.170.00   29.58
> 
> Device:  rrqm/s  wrqm/s   r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sda        0.00   51.50  0.00  1633.50     0.00   7460.00     9.13     0.18   0.11    0.00    0.11   0.01   1.40
> sdb        0.00    0.00  0.00  1240.50     0.00   5244.00     8.45     0.30   0.25    0.00    0.25   0.02   2.00
> sdc        0.00    5.00  0.00  2468.50     0.00  13419.00    10.87     0.24   0.10    0.00    0.10   0.09  22.00
> sdd        0.00    6.50  0.00  1913.00     0.00  10313.00    10.78     0.20   0.10    0.00    0.10   0.09  16.60
> 
> The %user CPU utilization is pretty much entirely the 2 OSD processes,
> note the nearly complete absence of iowait.
> 
> sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs.
> Look at these numbers, the lack of queues, the low wait and service
> times (this is in ms) plus overall utilization.
> 
> The only conclusion I can draw from these numbers and the network results
> below is that the latency happens within the OSD processes.
> 
> Regards,
> 
> Christian
> > When I suggested other tests, I meant with and without Ceph. One
> > particular one is OSD bench. That should be interesting to try at a
> > variety of block sizes. You could also try running RADOS bench and
> > smalliobench at a few different sizes.
> > -Greg
> > 
> > On Wednesday, May 7, 2014, Alexandre DERUMIER 
> > wrote:
> > 
> > > Hi Christian,
> > >
> > > Do you have tried without raid6, to have more osd ?
> > > (how many disks do you have begin the raid6 ?)
> > >
> > >
> > > Aslo, I known that direct ios can be quite slow with ceph,
> > >
> > > maybe can you try without --direct=1
> > >
> > > and also enable rbd_cache
> > >
> > > ceph.conf
> > > [client]
> > > rbd cache = true
> > >
> > >
> > >
> > >
> > > - Mail original -
> > >
> > > De: "Christian Balzer" >
> > > À: "Gregory Farnum" >,
> > > ceph-users@lists.ceph.com 
> > > Envoyé: Jeudi 8 Mai 2014 04:49:16
> > > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
> > > backing devices
> > >
> > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote:
> > >
> > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer
> > > > >
> > > wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The
> > > > > journals are on (separate) DC 3700s, the actual OSDs are RAID6
> > > > > behind an Areca 1882 with 4GB of cache.
> > > > >
> > > > > Running this fio:
> > > > >
> > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> > > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k
> > > > > --iodepth=128
> > > > >
> > > > > results in:
> > > > >
> > > > > 30k IOPS on the journal SSD (as expected)
> > > > > 110k IOPS on the OSD (it fits neatly into the cache, no surprise there)
> > > > > 3200 IOPS from a VM using userspace RBD
> > > > > 2900 IOPS from a host kernelspace mounted RBD
> > > > >
> > > > > When running the fio from the VM RBD the utilization of the
> > > > > journals is about 20% (2400 IOPS) and the OSDs are bored at 2%
> > > > > (1500 IOPS after some obvious merging).
>

[ceph-users] Journal SSD durability

2014-05-13 Thread Christian Balzer

Hello,

No actual question, just some food for thought and something that later
generations can scour from the ML archive.

I'm planning another Ceph storage cluster, this time a "classic" Ceph
design, 3 storage nodes with 8 HDDs for OSDs and 4 SSDs for OS and journal.

When juggling the budget for it the 12 DC3700 200GB SSDs of my first
draft stood out like the proverbial sore thumb, nearly 1/6th of the total
budget. 
I really like those SSDs with their smooth performance and durability of
1TB/day writes (over 5 years, same for all the other numbers below), but
wondered if that was really needed. 

This cluster is supposed to provide the storage for VMs (Vservers
really) that are currently on 3 DRBD cluster pairs.
Not particularly write intensive, all of them together total about 20GB/day.
With 2 journals per SSD that's 5GB/day of writes, well within the Intel
specification of 20GB/day for their 530 drives (180GB version).

However, the uneven IOPS of the 530 and potential future changes in write
patterns make this 300% safety margin still too slim for my liking.

A DC3500 240GB SSD, though, will perform well enough at half the price of the
DC3700 and give me enough breathing room at about 80GB/day of writes, so this
is what I will order in the end.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-13 Thread Alexandre DERUMIER
Hi Christian,

I'm going to test a full SSD cluster in the coming months;
I'll send the results to the mailing list.


Have you tried using 1 OSD per physical disk (without RAID6)?

Maybe there is a bottleneck in the OSD daemon,
and using one OSD daemon per disk could help.




- Mail original - 

De: "Christian Balzer"  
À: ceph-users@lists.ceph.com 
Envoyé: Mardi 13 Mai 2014 11:03:47 
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing 
devices 


I'm clearly talking to myself, but whatever. 

For Greg, I've played with all the pertinent journal and filestore options 
and TCP nodelay, no changes at all. 

Is there anybody on this ML who's running a Ceph cluster with a fast 
network and FAST filestore, so like me with a big HW cache in front of a 
RAID/JBODs or using SSDs for final storage? 

If so, what results do you get out of the fio statement below per OSD? 
In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, which 
is of course vastly faster than the normal individual HDDs could do. 

So I'm wondering if I'm hitting some inherent limitation of how fast a 
single OSD (as in the software) can handle IOPS, given that everything else 
has been ruled out from where I stand. 

This would also explain why none of the option changes or the use of 
RBD caching has any measurable effect in the test case below. 
As in, a slow OSD aka single HDD with journal on the same disk would 
clearly benefit from even the small 32MB standard RBD cache, while in my 
test case the only time the caching becomes noticeable is if I increase 
the cache size to something larger than the test data size. ^o^ 

On the other hand if people here regularly get thousands or tens of 
thousands IOPS per OSD with the appropriate HW I'm stumped. 

Christian 

On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote: 

> On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: 
> 
> > Oh, I didn't notice that. I bet you aren't getting the expected 
> > throughput on the RAID array with OSD access patterns, and that's 
> > applying back pressure on the journal. 
> > 
> 
> In the "a picture is worth a thousand words" tradition, I give you 
> this iostat -x output taken during a fio run: 
> 
> avg-cpu: %user %nice %system %iowait %steal %idle 
> 50.82 0.00 19.43 0.17 0.00 29.58 
> 
> Device:  rrqm/s  wrqm/s   r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> sda        0.00   51.50  0.00  1633.50     0.00   7460.00     9.13     0.18   0.11    0.00    0.11   0.01   1.40
> sdb        0.00    0.00  0.00  1240.50     0.00   5244.00     8.45     0.30   0.25    0.00    0.25   0.02   2.00
> sdc        0.00    5.00  0.00  2468.50     0.00  13419.00    10.87     0.24   0.10    0.00    0.10   0.09  22.00
> sdd        0.00    6.50  0.00  1913.00     0.00  10313.00    10.78     0.20   0.10    0.00    0.10   0.09  16.60
> 
> The %user CPU utilization is pretty much entirely the 2 OSD processes, 
> note the nearly complete absence of iowait. 
> 
> sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs. 
> Look at these numbers, the lack of queues, the low wait and service 
> times (this is in ms) plus overall utilization. 
> 
> The only conclusion I can draw from these numbers and the network results 
> below is that the latency happens within the OSD processes. 
> 
> Regards, 
> 
> Christian 
> > When I suggested other tests, I meant with and without Ceph. One 
> > particular one is OSD bench. That should be interesting to try at a 
> > variety of block sizes. You could also try running RADOS bench and 
> > smalliobench at a few different sizes. 
> > -Greg 
> > 
> > On Wednesday, May 7, 2014, Alexandre DERUMIER  
> > wrote: 
> > 
> > > Hi Christian, 
> > > 
> > > Do you have tried without raid6, to have more osd ? 
> > > (how many disks do you have begin the raid6 ?) 
> > > 
> > > 
> > > Aslo, I known that direct ios can be quite slow with ceph, 
> > > 
> > > maybe can you try without --direct=1 
> > > 
> > > and also enable rbd_cache 
> > > 
> > > ceph.conf 
> > > [client] 
> > > rbd cache = true 
> > > 
> > > 
> > > 
> > > 
> > > - Mail original - 
> > > 
> > > De: "Christian Balzer" > 
> > > À: "Gregory Farnum" >, 
> > > ceph-users@lists.ceph.com  
> > > Envoyé: Jeudi 8 Mai 2014 04:49:16 
> > > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and 
> > > backing devices 
> > > 
> > > On Wed, 7 May 2014 18:37:48 -0700 Gregory Farnum wrote: 
> > > 
> > > > On Wed, May 7, 2014 at 5:57 PM, Christian Balzer 
> > > > > 
> > > wrote: 
> > > > > 
> > > > > Hello, 
> > > > > 
> > > > > ceph 0.72 on Debian Jessie, 2 storage nodes with 2 OSDs each. The 
> > > > > journals are on (separate) DC 3700s, the actual OSDs are RAID6 
> > > > > behind an Areca 1882 with 4GB of cache. 
> > > > > 
> > > > > Running this fio: 
> > > > > 
> > > > > fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> > > > > --numjobs=1 --rw=randwrite --name=fiojob --blocksize=4k 
> > > > > --iodepth=128 
> > > > > 
> > > > > results in: 
> > > > > 
> > > > > 30k IOPS on the journal SSD (as expected) 
> > > > > 110k IOPS on the OSD (it 

Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-13 Thread Christian Balzer

Hello,

On Tue, 13 May 2014 11:33:27 +0200 (CEST) Alexandre DERUMIER wrote:

> Hi Christian,
> 
> I'm going to test a full ssd cluster in coming months,
> I'll send result on the mailing.
>
Looking forward to that.
 
> 
> Do you have tried to use 1 osd by physical disk ? (without raid6)
>
No, if you look back to last December's "Sanity check..." thread
by me, it gives the reasons.
In short, highest density (thus replication of 2 and to make that safe
based on RAID6) and operational maintainability (it is a remote data
center, so replacing broken disks is a pain).   

That cluster is fast enough for my purposes and that fio test isn't a
typical load for it when it goes into production. 
But for designing a general purpose or high performance Ceph cluster in
the future I'd really love to have this mystery solved.

> Maybe they are bottleneck in osd daemon, 
> and using osd daemon by disk could help.
>
It might, but at the IOPS I'm seeing anybody using SSD for file storage
should have screamed out already. 
Also given the CPU usage I'm seeing during that test run such a setup
would probably require 32+ cores. 
 
Christian

> 
> 
> 
> - Mail original - 
> 
> De: "Christian Balzer"  
> À: ceph-users@lists.ceph.com 
> Envoyé: Mardi 13 Mai 2014 11:03:47 
> Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
> devices 
> 
> 
> I'm clearly talking to myself, but whatever. 
> 
> For Greg, I've played with all the pertinent journal and filestore
> options and TCP nodelay, no changes at all. 
> 
> Is there anybody on this ML who's running a Ceph cluster with a fast 
> network and FAST filestore, so like me with a big HW cache in front of a 
> RAID/JBODs or using SSDs for final storage? 
> 
> If so, what results do you get out of the fio statement below per OSD? 
> In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD,
> which is of course vastly faster than the normal individual HDDs could
> do. 
> 
> So I'm wondering if I'm hitting some inherent limitation of how fast a 
> single OSD (as in the software) can handle IOPS, given that everything
> else has been ruled out from where I stand. 
> 
> This would also explain why none of the option changes or the use of 
> RBD caching has any measurable effect in the test case below. 
> As in, a slow OSD aka single HDD with journal on the same disk would 
> clearly benefit from even the small 32MB standard RBD cache, while in my 
> test case the only time the caching becomes noticeable is if I increase 
> the cache size to something larger than the test data size. ^o^ 
> 
> On the other hand if people here regularly get thousands or tens of 
> thousands IOPS per OSD with the appropriate HW I'm stumped. 
> 
> Christian 
> 
> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote: 
> 
> > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: 
> > 
> > > Oh, I didn't notice that. I bet you aren't getting the expected 
> > > throughput on the RAID array with OSD access patterns, and that's 
> > > applying back pressure on the journal. 
> > > 
> > 
> > In the "a picture is worth a thousand words" tradition, I give you 
> > this iostat -x output taken during a fio run: 
> > 
> > avg-cpu: %user %nice %system %iowait %steal %idle 
> > 50.82 0.00 19.43 0.17 0.00 29.58 
> > 
> > Device:  rrqm/s  wrqm/s   r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > sda        0.00   51.50  0.00  1633.50     0.00   7460.00     9.13     0.18   0.11    0.00    0.11   0.01   1.40
> > sdb        0.00    0.00  0.00  1240.50     0.00   5244.00     8.45     0.30   0.25    0.00    0.25   0.02   2.00
> > sdc        0.00    5.00  0.00  2468.50     0.00  13419.00    10.87     0.24   0.10    0.00    0.10   0.09  22.00
> > sdd        0.00    6.50  0.00  1913.00     0.00  10313.00    10.78     0.20   0.10    0.00    0.10   0.09  16.60
> > 
> > The %user CPU utilization is pretty much entirely the 2 OSD processes, 
> > note the nearly complete absence of iowait. 
> > 
> > sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs. 
> > Look at these numbers, the lack of queues, the low wait and service 
> > times (this is in ms) plus overall utilization. 
> > 
> > The only conclusion I can draw from these numbers and the network
> > results below is that the latency happens within the OSD processes. 
> > 
> > Regards, 
> > 
> > Christian 
> > > When I suggested other tests, I meant with and without Ceph. One 
> > > particular one is OSD bench. That should be interesting to try at a 
> > > variety of block sizes. You could also try running RADOS bench and 
> > > smalliobench at a few different sizes. 
> > > -Greg 
> > > 
> > > On Wednesday, May 7, 2014, Alexandre DERUMIER  
> > > wrote: 
> > > 
> > > > Hi Christian, 
> > > > 
> > > > Do you have tried without raid6, to have more osd ? 
> > > > (how many disks do you have begin the raid6 ?) 
> > > > 
> > > > 
> > > > Aslo, I known that direct ios can be quite slow with ceph, 
> > > > 
> > > > maybe can you try without --direct=1 
> > > > 
> > > > and also enable rbd_cache 
> > > > 
> > > > ceph.conf 

Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Mark Kirkwood
One thing that would put me off the 530 is the lack of power-off safety
(capacitor or similar). Given the job of the journal, I think an SSD
that has some guarantee of write integrity is crucial - so yeah, the
DC3500 or DC3700 seem like the best choices.


Regards

Mark

On 13/05/14 21:31, Christian Balzer wrote:


Hello,

No actual question, just some food for thought and something that later
generations can scour from the ML archive.

I'm planning another Ceph storage cluster, this time a "classic" Ceph
design, 3 storage nodes with 8 HDDs for OSDs and 4 SSDs for OS and journal.

When juggling the budget for it the 12 DC3700 200GB SSDs of my first
draft stood out like the proverbial sore thumb, nearly 1/6th of the total
budget.
I really like those SSDs with their smooth performance and durability of
1TB/day writes (over 5 years, same for all the other numbers below), but
wondered if that was really needed.

This cluster is supposed to provide the storage for VMs (Vservers
really) that are currently on 3 DRBD cluster pairs.
Not particular write intensive, all of them just total about 20GB/day.
With 2 journals per SSD that's 5GB/day of writes, well within the Intel
specification of 20GB/day for their 530 drives (180GB version).

However the uneven IOPS of the 530 and potential future changes in write
patterns make this 300% safety margin still to slim for my liking.

Alas a DC3500 240GB SSD will perform well enough at half the price of the
DC3700 and give me enough breathing room at about 80GB/day writes, so this
is what I will order in the end.

Christian





Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Xabier Elkano
On 13/05/14 11:31, Christian Balzer wrote:
> Hello,
>
> No actual question, just some food for thought and something that later
> generations can scour from the ML archive.
>
> I'm planning another Ceph storage cluster, this time a "classic" Ceph
> design, 3 storage nodes with 8 HDDs for OSDs and 4 SSDs for OS and journal.
Christian, do you have many clusters in production? Are there any
advantages to many clusters vs. different pools in one cluster? What is
the right way to go: maintain one big cluster or several clusters?
>
> When juggling the budget for it the 12 DC3700 200GB SSDs of my first
> draft stood out like the proverbial sore thumb, nearly 1/6th of the total
> budget. 
> I really like those SSDs with their smooth performance and durability of
> 1TB/day writes (over 5 years, same for all the other numbers below), but
> wondered if that was really needed. 
>
> This cluster is supposed to provide the storage for VMs (Vservers
> really) that are currently on 3 DRBD cluster pairs.
> Not particular write intensive, all of them just total about 20GB/day.
> With 2 journals per SSD that's 5GB/day of writes, well within the Intel
> specification of 20GB/day for their 530 drives (180GB version).
>
> However the uneven IOPS of the 530 and potential future changes in write
> patterns make this 300% safety margin still to slim for my liking.
>
> Alas a DC3500 240GB SSD will perform well enough at half the price of the
> DC3700 and give me enough breathing room at about 80GB/day writes, so this
> is what I will order in the end.
Did you consider DC3700 100G with similar price?
>
> Christian



[ceph-users] Fwd: What is link and unlink options used for in radosgw-admin

2014-05-13 Thread Wenjun Huang


Begin forwarded message:

> From: Wenjun Huang 
> Subject: What is link and unlink options used for in radosgw-admin
> Date: May 13, 2014 at 2:55:18 PM GMT+8
> To: ceph-us...@ceph.com
> 
> Hello, everyone
> 
> I am now confused with the options of link & unlink in radosgw-admin utility.
> 
> In my opinion, if I link ownerA's bucketA to ownerB through the command 
> below:
> 
> radosgw-admin bucket link --uid=ownerB --bucket=bucketA
> 
> then, I think the owner of bucketA is ownerB.
> 
> But, in my test, nothing changed after I ran the command, except that 
> the displayed "owner:" has changed in the result of the command:
>  radosgw-admin bucket stats --bucket=bucketA
> 
> I can still do nothing to bucketA as user ownerB. It seems that the ACL 
> policy of the related bucket does not change.
> 
> Have I misunderstood the usage of link & unlink? What are they really for?
> 
> Thanks
> Wenjun



Re: [ceph-users] Bulk storage use case

2014-05-13 Thread Cédric Lemarchand
Thanks for your answers Craig; it seems this is a niche use case for Ceph,
judging by the lack of replies from the ML.

Cheers

--
Cédric Lemarchand

> Le 11 mai 2014 à 00:35, Craig Lewis  a écrit :
> 
>> On 5/10/14 12:43 , Cédric Lemarchand wrote:
>> Hi Craig,
>> 
>> Thanks, I really appreciate the well detailed response.
>> 
>> I carefully note your advices, specifically about the CPU starvation 
>> scenario, which as you said sounds scary.
>> 
>> About IO, datas will be very resilient, in case of crash, loosing not fully 
>> written objects will not be a problem (they will be re uploaded later), so I 
>> think in this specific case, disabling journaling could be a way to improve 
>> IO.
>> How Ceph will handle that, are there caveats other than just loosing objects 
>> that was in the data path when the crash occurs ? I know it could sounds 
>> weird, but clients workflow could support such thing. 
>> 
>> Thanks !
>> 
>> --
>> Cédric Lemarchand
>> 
>> Le 10 mai 2014 à 04:30, Craig Lewis  a écrit :
> 
> Disabling the journal does make sense in some cases, like all the data is a 
> backup copy.  
> 
> I don't know anything about how Ceph behaves in that setup.  Maybe somebody 
> else can chime in?
> 
> -- 
> Craig Lewis 
> Senior Systems Engineer
> Office +1.714.602.1309
> Email cle...@centraldesktop.com
> 
> Central Desktop. Work together in ways you never thought possible. 
> Connect with us   Website  |  Twitter  |  Facebook  |  LinkedIn  |  Blog 
> 


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-13 Thread Alexandre DERUMIER
>>It might, but at the IOPS I'm seeing anybody using SSD for file storage 
>>should have screamed out already. 
>>Also given the CPU usage I'm seeing during that test run such a setup 
>>would probably require 32+ cores. 

Just found this:

https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf

page12:

" Note: As of Ceph Dumpling release (10/2013), a per-OSD read performance is 
approximately 4,000 IOPS and a per node limit of around 
35,000 IOPS when doing reads directly from pagecache. This appears to indicate 
that Ceph can make good use of spinning disks for data 
storage and may benefit from SSD backed OSDs, though may also be limited on 
high performance SSDs."


Maybe Inktank could comment on the 4,000 IOPS per OSD?


- Mail original - 

De: "Christian Balzer"  
À: ceph-users@lists.ceph.com 
Cc: "Alexandre DERUMIER"  
Envoyé: Mardi 13 Mai 2014 11:51:37 
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing 
devices 


Hello, 

On Tue, 13 May 2014 11:33:27 +0200 (CEST) Alexandre DERUMIER wrote: 

> Hi Christian, 
> 
> I'm going to test a full ssd cluster in coming months, 
> I'll send result on the mailing. 
> 
Looking forward to that. 

> 
> Do you have tried to use 1 osd by physical disk ? (without raid6) 
> 
No, if you look back to the last year December "Sanity check..." thread 
by me, it gives the reasons. 
In short, highest density (thus replication of 2 and to make that safe 
based on RAID6) and operational maintainability (it is a remote data 
center, so replacing broken disks is a pain). 

That cluster is fast enough for my purposes and that fio test isn't a 
typical load for it when it goes into production. 
But for designing a general purpose or high performance Ceph cluster in 
the future I'd really love to have this mystery solved. 

> Maybe they are bottleneck in osd daemon, 
> and using osd daemon by disk could help. 
> 
It might, but at the IOPS I'm seeing anybody using SSD for file storage 
should have screamed out already. 
Also given the CPU usage I'm seeing during that test run such a setup 
would probably require 32+ cores. 

Christian 

> 
> 
> 
> - Mail original - 
> 
> De: "Christian Balzer"  
> À: ceph-users@lists.ceph.com 
> Envoyé: Mardi 13 Mai 2014 11:03:47 
> Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing 
> devices 
> 
> 
> I'm clearly talking to myself, but whatever. 
> 
> For Greg, I've played with all the pertinent journal and filestore 
> options and TCP nodelay, no changes at all. 
> 
> Is there anybody on this ML who's running a Ceph cluster with a fast 
> network and FAST filestore, so like me with a big HW cache in front of a 
> RAID/JBODs or using SSDs for final storage? 
> 
> If so, what results do you get out of the fio statement below per OSD? 
> In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, 
> which is of course vastly faster than the normal individual HDDs could 
> do. 
> 
> So I'm wondering if I'm hitting some inherent limitation of how fast a 
> single OSD (as in the software) can handle IOPS, given that everything 
> else has been ruled out from where I stand. 
> 
> This would also explain why none of the option changes or the use of 
> RBD caching has any measurable effect in the test case below. 
> As in, a slow OSD aka single HDD with journal on the same disk would 
> clearly benefit from even the small 32MB standard RBD cache, while in my 
> test case the only time the caching becomes noticeable is if I increase 
> the cache size to something larger than the test data size. ^o^ 
> 
> On the other hand if people here regularly get thousands or tens of 
> thousands IOPS per OSD with the appropriate HW I'm stumped. 
> 
> Christian 
> 
> On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote: 
> 
> > On Wed, 7 May 2014 22:13:53 -0700 Gregory Farnum wrote: 
> > 
> > > Oh, I didn't notice that. I bet you aren't getting the expected 
> > > throughput on the RAID array with OSD access patterns, and that's 
> > > applying back pressure on the journal. 
> > > 
> > 
> > In the "a picture is worth a thousand words" tradition, I give you 
> > this iostat -x output taken during a fio run: 
> > 
> > avg-cpu: %user %nice %system %iowait %steal %idle 
> > 50.82 0.00 19.43 0.17 0.00 29.58 
> > 
> > Device:  rrqm/s  wrqm/s   r/s      w/s    rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
> > sda        0.00   51.50  0.00  1633.50     0.00   7460.00     9.13     0.18   0.11    0.00    0.11   0.01   1.40
> > sdb        0.00    0.00  0.00  1240.50     0.00   5244.00     8.45     0.30   0.25    0.00    0.25   0.02   2.00
> > sdc        0.00    5.00  0.00  2468.50     0.00  13419.00    10.87     0.24   0.10    0.00    0.10   0.09  22.00
> > sdd        0.00    6.50  0.00  1913.00     0.00  10313.00    10.78     0.20   0.10    0.00    0.10   0.09  16.60
> > 
> > The %user CPU utilization is pretty much entirely the 2 OSD processes, 
> > note the nearly complete absence of iowait. 
> > 
> > sda and sdb are the OSDs RAIDs, sdc and sdd are the journal SSDs. 
> > Look at these nu

Re: [ceph-users] Bulk storage use case

2014-05-13 Thread Dan van der Ster

Hi,
I think you're not getting many replies simply because those are rather 
large servers and not many have such hardware in prod.


We run with 24x3TB drives, 64GB ram, one 10Gbit NIC. Memory-wise there 
are no problems. Throughput-wise, the bottleneck is somewhere between 
the NIC (~1GB/s) and the HBA / SAS backplane (~1.6GB/s). Since writes 
coming in over the network are multiplied by at least 2 times to the 
disks, in our case the HBA is the bottleneck (so we have a practical 
limit of ~800-900MBps).


Regarding IOPS, spinning disks with co-located journals leave a lot to 
be desired. But for your use-case without RBD depending on low 
latencies, I don't think this will be a problem most of the time. Re: 
running without a journal .. is that even possible? (unless you use the 
KV store, which is experimental and doesn't really show a big speedup 
anyway).


The other factor which makes it hard to judge your plan is how the 
erasure coding will perform, especially given only a 2Gig network 
between servers. I would guess there is very little prod experience with 
the EC code as of today -- and probably zero with boxes similar to what 
you propose. But my gut tells me that with your proposed stripe width of 
12/3, combined with the slow network, getting good performance might be 
a challenge.


I would suggest you start some smaller scale tests to get a feeling for 
the performance before committing to a large purchase of this hardware type.


Cheers, Dan

Cédric Lemarchand wrote:

Thanks for your answers Craig, it seems this is a niche use case for
Ceph, not a lot of replies from the ML.

Cheers

--
Cédric Lemarchand

On 11 May 2014 at 00:35, Craig Lewis <cle...@centraldesktop.com> wrote:


On 5/10/14 12:43 , Cédric Lemarchand wrote:

Hi Craig,

Thanks, I really appreciate the well detailed response.

I carefully note your advices, specifically about the CPU starvation
scenario, which as you said sounds scary.

About IO, datas will be very resilient, in case of crash, loosing not
fully written objects will not be a problem (they will be re uploaded
later), so I think in this specific case, disabling journaling could
be a way to improve IO.
How Ceph will handle that, are there caveats other than just loosing
objects that was in the data path when the crash occurs ? I know it
could sounds weird, but clients workflow could support such thing.

Thanks !

--
Cédric Lemarchand

On 10 May 2014 at 04:30, Craig Lewis <cle...@centraldesktop.com> wrote:


Disabling the journal does make sense in some cases, like all the data
is a backup copy.

I don't know anything about how Ceph behaves in that setup. Maybe
somebody else can chime in?

--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*
Connect with us: Website | Twitter | Facebook | LinkedIn | Blog




Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Christian Balzer
On Tue, 13 May 2014 22:03:11 +1200 Mark Kirkwood wrote:

> On thing that would put me off the 530 is lack on power off safety 
> (capacitor or similar). Given the job of the journal, I think an SSD 
> that has some guarantee of write integrity is crucial - so yeah the 
> DC3500 or DC3700 seem like the best choices.
> 

All my machines have redundant PSUs fed from redundant circuits in very
high-end datacenters, backed up by the usual gamut of batteries and
diesel monsters.
So while you (and the people whose first comment about RAID controllers is
to mention that one should get a BBU) certainly have a point, I'm happily
deploying 530s where they are useful.

If that power should ever fail, I'm most likely buried under a ton of
(optionally radioactive) rubble (Tokyo here) or if I'm lucky just that one
DC is flooded, in which case the data is lost as well. ^o^

My beef with the 530 is that it is spiky; you can't really rely on it for
consistent throughput and IOPS.

Christian

> Regards
> 
> Mark
> 
> On 13/05/14 21:31, Christian Balzer wrote:
> >
> > Hello,
> >
> > No actual question, just some food for thought and something that later
> > generations can scour from the ML archive.
> >
> > I'm planning another Ceph storage cluster, this time a "classic" Ceph
> > design, 3 storage nodes with 8 HDDs for OSDs and 4 SSDs for OS and
> > journal.
> >
> > When juggling the budget for it the 12 DC3700 200GB SSDs of my first
> > draft stood out like the proverbial sore thumb, nearly 1/6th of the
> > total budget.
> > I really like those SSDs with their smooth performance and durability
> > of 1TB/day writes (over 5 years, same for all the other numbers
> > below), but wondered if that was really needed.
> >
> > This cluster is supposed to provide the storage for VMs (Vservers
> > really) that are currently on 3 DRBD cluster pairs.
> > Not particular write intensive, all of them just total about 20GB/day.
> > With 2 journals per SSD that's 5GB/day of writes, well within the Intel
> > specification of 20GB/day for their 530 drives (180GB version).
> >
> > However the uneven IOPS of the 530 and potential future changes in
> > write patterns make this 300% safety margin still to slim for my
> > liking.
> >
> > Alas a DC3500 240GB SSD will perform well enough at half the price of
> > the DC3700 and give me enough breathing room at about 80GB/day writes,
> > so this is what I will order in the end.
> >
> > Christian
> >
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Christian Balzer
On Tue, 13 May 2014 12:07:12 +0200 Xabier Elkano wrote:

> El 13/05/14 11:31, Christian Balzer escribió:
> > Hello,
> >
> > No actual question, just some food for thought and something that later
> > generations can scour from the ML archive.
> >
> > I'm planning another Ceph storage cluster, this time a "classic" Ceph
> > design, 3 storage nodes with 8 HDDs for OSDs and 4 SSDs for OS and
> > journal.
> Christian, do yo have many clusters in production? Are there any
> advantages with many clusters vs different pools per cluster? What is
> the right way to go?, maintain a big cluster or different clusters?

Nope, I'm certainly a Ceph newb in many ways. That will be my third.

The reasons for having different clusters can be locality (one is not at
our main DC) and also special use cases (speed vs. size vs. cost vs.
density, etc).

Pools can pretty much cover a lot of the reasons why one would have
different clusters, and I think the lower administrative overhead makes
them quite attractive.

> >
> > When juggling the budget for it the 12 DC3700 200GB SSDs of my first
> > draft stood out like the proverbial sore thumb, nearly 1/6th of the
> > total budget. 
> > I really like those SSDs with their smooth performance and durability
> > of 1TB/day writes (over 5 years, same for all the other numbers
> > below), but wondered if that was really needed. 
> >
> > This cluster is supposed to provide the storage for VMs (Vservers
> > really) that are currently on 3 DRBD cluster pairs.
> > Not particular write intensive, all of them just total about 20GB/day.
> > With 2 journals per SSD that's 5GB/day of writes, well within the Intel
> > specification of 20GB/day for their 530 drives (180GB version).
> >
> > However the uneven IOPS of the 530 and potential future changes in
> > write patterns make this 300% safety margin still to slim for my
> > liking.
> >
> > Alas a DC3500 240GB SSD will perform well enough at half the price of
> > the DC3700 and give me enough breathing room at about 80GB/day writes,
> > so this is what I will order in the end.
> Did you consider DC3700 100G with similar price?

The 3500 is already potentially slower than the actual HDDs when doing
sequential writes, the 100GB 3700 most definitely so.

Christian.

> >
> > Christian
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


[ceph-users] Rados GW Method not allowed

2014-05-13 Thread Georg Höllrigl

Hello,

System Ubuntu 14.04
Ceph 0.80

I'm getting either a 405 Method Not Allowed or a 403 Permission Denied 
from Radosgw.



Here is what I get from radosgw:

HTTP/1.1 405 Method Not Allowed
Date: Tue, 13 May 2014 12:21:43 GMT
Server: Apache
Accept-Ranges: bytes
Content-Length: 82
Content-Type: application/xml

<?xml version="1.0" encoding="UTF-8"?><Error><Code>MethodNotAllowed</Code></Error>


I can see that the user exists using:
"radosgw-admin --name client.radosgw.ceph-m-01 metadata list user"

I can get the credentials via:

#radosgw-admin user info --uid=test
{ "user_id": "test",
  "display_name": "test",
  "email": "",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [],
  "keys": [
{ "user": "test",
  "access_key": "95L2C7BFQ8492LVZ271N",
  "secret_key": "f2tqIet+LrD0kAXYAUrZXydL+1nsO6Gs+we+94U5"}],
  "swift_keys": [],
  "caps": [],
  "op_mask": "read, write, delete",
  "default_placement": "",
  "placement_tags": [],
  "bucket_quota": { "enabled": false,
  "max_size_kb": -1,
  "max_objects": -1},
  "user_quota": { "enabled": false,
  "max_size_kb": -1,
  "max_objects": -1},
  "temp_url_keys": []}

I've also found some hints about a broken redirect in apache - but not 
really a working version.
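
For reference, the rewrite/vhost pattern from the radosgw documentation that
I'm comparing against looks roughly like this (hostname and socket path are
placeholders, not my actual config):

FastCgiExternalServer /var/www/s3gw.fcgi -socket /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock

<VirtualHost *:80>
        ServerName rgw.example.com
        ServerAlias *.rgw.example.com
        DocumentRoot /var/www
        RewriteEngine On
        RewriteRule ^/(.*) /s3gw.fcgi?%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
        <Directory /var/www>
                Options +ExecCGI
                AllowOverride All
                SetHandler fastcgi-script
                Order allow,deny
                Allow from all
                AuthBasicAuthoritative Off
        </Directory>
        AllowEncodedSlashes On
        ServerSignature Off
</VirtualHost>

If I understand correctly, a missing ServerAlias wildcard (or missing wildcard
DNS) would make bucket-style requests such as bucketname.rgw.example.com land
on the default vhost instead of radosgw, which could explain answers like the
405 above - but I may be wrong about that.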


Any hints? Any thoughts on how to solve this? Where can I get more
detailed logs on why it won't let me create a bucket?



Kind Regards,
Georg


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-13 Thread Christian Balzer

Hello,

On Tue, 13 May 2014 13:36:49 +0200 (CEST) Alexandre DERUMIER wrote:

> >>It might, but at the IOPS I'm seeing anybody using SSD for file
> >>storage should have screamed out already. 
> >>Also given the CPU usage I'm seeing during that test run such a setup 
> >>would probably require 32+ cores. 
> 
> Just found this:
> 
> https://objects.dreamhost.com/inktankweb/Inktank_Hardware_Configuration_Guide.pdf
> 
That's an interesting find indeed.

The CPU to OSD chart clearly assumes the OSD to be backed by spinning rust
or doing 4MB block transactions. 
As stated before, at the 4KB blocksize below one OSD eats up slightly over
2 cores on the 4332HE at full speed.

> page12:
> 
> " Note: As of Ceph Dumpling release (10/2013), a per-OSD read
> performance is approximately 4,000 IOPS and a per node limit of around
> 35,000 IOPS when doing reads directly from pagecache. This appears to
> indicate that Ceph can make good use of spinning disks for data storage
> and may benefit from SSD backed OSDs, though may also be limited on high
> performance SSDs."
> 
Note that this is a read test and, like nearly all IOPS statements, utterly
worthless unless qualified by things such as block size, working set size, and
type of I/O (random or sequential).

For what it's worth, my cluster gives me 4100 IOPS with the sequential fio
run below and 7200 when doing random reads (go figure). Of course I made
sure these came from the pagecache of the storage nodes; no disk I/O was
reported at all and the CPUs used just 1 core per OSD.
---
fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
--rw=read --name=fiojob --blocksize=4k --iodepth=64
---


Christian

> 
> Maybe Intank could comment about the 4000iops by osd ?
> 
> 
> - Mail original - 
> 
> De: "Christian Balzer"  
> À: ceph-users@lists.ceph.com 
> Cc: "Alexandre DERUMIER"  
> Envoyé: Mardi 13 Mai 2014 11:51:37 
> Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
> devices 
> 
> 
> Hello, 
> 
> On Tue, 13 May 2014 11:33:27 +0200 (CEST) Alexandre DERUMIER wrote: 
> 
> > Hi Christian, 
> > 
> > I'm going to test a full ssd cluster in coming months, 
> > I'll send result on the mailing. 
> > 
> Looking forward to that. 
> 
> > 
> > Do you have tried to use 1 osd by physical disk ? (without raid6) 
> > 
> No, if you look back to the last year December "Sanity check..." thread 
> by me, it gives the reasons. 
> In short, highest density (thus replication of 2 and to make that safe 
> based on RAID6) and operational maintainability (it is a remote data 
> center, so replacing broken disks is a pain). 
> 
> That cluster is fast enough for my purposes and that fio test isn't a 
> typical load for it when it goes into production. 
> But for designing a general purpose or high performance Ceph cluster in 
> the future I'd really love to have this mystery solved. 
> 
> > Maybe they are bottleneck in osd daemon, 
> > and using osd daemon by disk could help. 
> > 
> It might, but at the IOPS I'm seeing anybody using SSD for file storage 
> should have screamed out already. 
> Also given the CPU usage I'm seeing during that test run such a setup 
> would probably require 32+ cores. 
> 
> Christian 
> 
> > 
> > 
> > 
> > - Mail original - 
> > 
> > De: "Christian Balzer"  
> > À: ceph-users@lists.ceph.com 
> > Envoyé: Mardi 13 Mai 2014 11:03:47 
> > Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and
> > backing devices 
> > 
> > 
> > I'm clearly talking to myself, but whatever. 
> > 
> > For Greg, I've played with all the pertinent journal and filestore 
> > options and TCP nodelay, no changes at all. 
> > 
> > Is there anybody on this ML who's running a Ceph cluster with a fast 
> > network and FAST filestore, so like me with a big HW cache in front of
> > a RAID/JBODs or using SSDs for final storage? 
> > 
> > If so, what results do you get out of the fio statement below per OSD? 
> > In my case with 4 OSDs and 3200 IOPS that's about 800 IOPS per OSD, 
> > which is of course vastly faster than the normal indvidual HDDs could 
> > do. 
> > 
> > So I'm wondering if I'm hitting some inherent limitation of how fast a 
> > single OSD (as in the software) can handle IOPS, given that everything 
> > else has been ruled out from where I stand. 
> > 
> > This would also explain why none of the option changes or the use of 
> > RBD caching has any measurable effect in the test case below. 
> > As in, a slow OSD aka single HDD with journal on the same disk would 
> > clearly benefit from even the small 32MB standard RBD cache, while in
> > my test case the only time the caching becomes noticeable is if I
> > increase the cache size to something larger than the test data size.
> > ^o^ 
> > 
> > On the other hand if people here regularly get thousands or tens of 
> > thousands IOPS per OSD with the appropriate HW I'm stumped. 
> > 
> > Christian 
> > 
> > On Fri, 9 May 2014 11:01:26 +0900 Christian Balzer wrote: 
> > 
> > > On Wed, 7 May 2

Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Xabier Elkano
On 13/05/14 14:23, Christian Balzer wrote:
> On Tue, 13 May 2014 12:07:12 +0200 Xabier Elkano wrote:
>
>> El 13/05/14 11:31, Christian Balzer escribió:
>>> Hello,
>>>
>>> No actual question, just some food for thought and something that later
>>> generations can scour from the ML archive.
>>>
>>> I'm planning another Ceph storage cluster, this time a "classic" Ceph
>>> design, 3 storage nodes with 8 HDDs for OSDs and 4 SSDs for OS and
>>> journal.
>> Christian, do yo have many clusters in production? Are there any
>> advantages with many clusters vs different pools per cluster? What is
>> the right way to go?, maintain a big cluster or different clusters?
> Nope, I'm certainly a Ceph newb in many ways. That will be my third.
>
> The reasons for having different clusters can be locality (one is not at
> our main DC) and also special use cases (speed vs. size vs. cost vs.
> density, etc).
>
> Pools can do pretty much cover a lot of reasons why one would have
> different clusters and I think the lower administrative overhead makes
> them quite attractive.
>
>>> When juggling the budget for it the 12 DC3700 200GB SSDs of my first
>>> draft stood out like the proverbial sore thumb, nearly 1/6th of the
>>> total budget. 
>>> I really like those SSDs with their smooth performance and durability
>>> of 1TB/day writes (over 5 years, same for all the other numbers
>>> below), but wondered if that was really needed. 
>>>
>>> This cluster is supposed to provide the storage for VMs (Vservers
>>> really) that are currently on 3 DRBD cluster pairs.
>>> Not particular write intensive, all of them just total about 20GB/day.
>>> With 2 journals per SSD that's 5GB/day of writes, well within the Intel
>>> specification of 20GB/day for their 530 drives (180GB version).
>>>
>>> However the uneven IOPS of the 530 and potential future changes in
>>> write patterns make this 300% safety margin still to slim for my
>>> liking.
>>>
>>> Alas a DC3500 240GB SSD will perform well enough at half the price of
>>> the DC3700 and give me enough breathing room at about 80GB/day writes,
>>> so this is what I will order in the end.
>> Did you consider DC3700 100G with similar price?
> The 3500 is already potentially slower than the actual HDDs when doing
> sequential writes, the 100GB 3700 most definitely so.
>
> Christian.
What type of disks are you going to use for OSDs? The 3700 100G can handle
200MB/s in sequential writes. Is this not enough to journal for 2 SAS disks?

Xabier
>
>>> Christian
>>
>



Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Christian Balzer
On Tue, 13 May 2014 14:46:23 +0200 Xabier Elkano wrote:

> El 13/05/14 14:23, Christian Balzer escribió:
> > On Tue, 13 May 2014 12:07:12 +0200 Xabier Elkano wrote:
> >
> >> El 13/05/14 11:31, Christian Balzer escribió:
> >>> Hello,
> >>>
> >>> No actual question, just some food for thought and something that
> >>> later generations can scour from the ML archive.
> >>>
> >>> I'm planning another Ceph storage cluster, this time a "classic" Ceph
> >>> design, 3 storage nodes with 8 HDDs for OSDs and 4 SSDs for OS and
> >>> journal.
> >> Christian, do yo have many clusters in production? Are there any
> >> advantages with many clusters vs different pools per cluster? What is
> >> the right way to go?, maintain a big cluster or different clusters?
> > Nope, I'm certainly a Ceph newb in many ways. That will be my third.
> >
> > The reasons for having different clusters can be locality (one is not
> > at our main DC) and also special use cases (speed vs. size vs. cost vs.
> > density, etc).
> >
> > Pools can do pretty much cover a lot of reasons why one would have
> > different clusters and I think the lower administrative overhead makes
> > them quite attractive.
> >
> >>> When juggling the budget for it the 12 DC3700 200GB SSDs of my first
> >>> draft stood out like the proverbial sore thumb, nearly 1/6th of the
> >>> total budget. 
> >>> I really like those SSDs with their smooth performance and durability
> >>> of 1TB/day writes (over 5 years, same for all the other numbers
> >>> below), but wondered if that was really needed. 
> >>>
> >>> This cluster is supposed to provide the storage for VMs (Vservers
> >>> really) that are currently on 3 DRBD cluster pairs.
> >>> Not particular write intensive, all of them just total about
> >>> 20GB/day. With 2 journals per SSD that's 5GB/day of writes, well
> >>> within the Intel specification of 20GB/day for their 530 drives
> >>> (180GB version).
> >>>
> >>> However the uneven IOPS of the 530 and potential future changes in
> >>> write patterns make this 300% safety margin still to slim for my
> >>> liking.
> >>>
> >>> Alas a DC3500 240GB SSD will perform well enough at half the price of
> >>> the DC3700 and give me enough breathing room at about 80GB/day
> >>> writes, so this is what I will order in the end.
> >> Did you consider DC3700 100G with similar price?
> > The 3500 is already potentially slower than the actual HDDs when doing
> > sequential writes, the 100GB 3700 most definitely so.
> >
> > Christian.
> What type of disks are you going to use for OSDs? 3700 100G can handle
> 200MB/s in sequential writes. Is this not enought for 2 SAS disk journal?
> 
Toshiba DT01ACA300, which according to the link below and my own testing
can do sustained sequential writes of 140MB/s.

http://www.tomshardware.com/charts/hdd-charts-2013/-04-Write-Throughput-Average-h2benchw-3.16,2904.html

In the cluster I'm talking about in the other thread I basically (and
knowingly) crippled sequential writes by even having an SSD in front of the
storage device (I don't need that sequential speed).
So with this new one I will try to have as few bottlenecks as possible
to make sure I get a very good understanding of how fast one can make
things when planning for future large scale deployments.

Mind, the higher IOPS and WAY higher endurance of the 3700 may have me
reconsider my choice, but given this cluster will actually have 2GB/s
(Byte, not bit) network bandwidth (front and backend Infiniband) and
1.12GB/s of HDD bandwidth per storage node only the 200GB DC3700 would
really be "good enough". ^o^

Christian

> Xabier
> >
> >>> Christian
> >>
> >
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


[ceph-users] Performance stats

2014-05-13 Thread yalla.gnan.kumar
Hi All,

Is there a way to measure the performance of Ceph block devices?
(For example: I/O stats, data to identify bottlenecks, etc.)
Also, what are the available ways to compare Ceph storage
performance with other storage solutions?
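
For example, are the built-in tools below the right starting point, or is
there something better? (pool name and device path are just placeholders)

# Built-in RADOS benchmark against a pool (write, then sequential read):
rados bench -p rbd 60 write --no-cleanup
rados bench -p rbd 60 seq

# Per-OSD commit/apply latencies while a workload is running (recent releases):
ceph osd perf

# Watch cluster-wide client and recovery IO:
ceph -w

# fio against a mapped RBD device, as used elsewhere on this list:
fio --filename=/dev/rbd0 --size=400m --ioengine=libaio --direct=1 \
    --rw=randwrite --blocksize=4k --iodepth=128 --numjobs=1 --name=fiojob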


Thanks
Kumar





Re: [ceph-users] Ceph with VMWare / XenServer

2014-05-13 Thread Gilles Mocellin

On 12/05/2014 15:45, Uwe Grohnwaldt wrote:

Hi,

yes, we use it in production. I can stop/kill the tgt on one server and
XenServer goes to the second one. We enabled multipathing in XenServer. In our
setup we don't have multiple IP ranges, so we scan/log in to the second target
on XenServer startup with iscsiadm in rc.local.

That's based on history - we used Dell EqualLogic before Ceph came in and there
was no need to use multipathing (only LACP channels). Now we have enabled
multipathing and use tgt, but without different IP ranges.



So you use multipathing in failover mode, that's certainly why it works 
without state sharing between the tgtd servers.

Still, I think you need to deactivate all sorts of caching on the server side.
IO must be committed to Ceph when the iSCSI initiator thinks it is.

What are the multipath parameters in XenServer (timeout, retry, ...)?
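
For instance, would a plain failover stanza along these lines be enough on the
initiator side? (The vendor/product strings below are only my guess for
tgt-exported LUNs, please correct me.)

defaults {
        polling_interval        10
        no_path_retry           12
}
devices {
        device {
                vendor                  "IET"
                product                 "VIRTUAL-DISK"
                path_grouping_policy    failover
                path_checker            tur
                failback                immediate
        }
}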




[ceph-users] Monitoring ceph statistics

2014-05-13 Thread Adrian Banasiak
Hi, I am working with a test Ceph cluster and now I want to implement Zabbix
monitoring with items such as:

- whole cluster IO (for example ceph -s -> recovery io 143 MB/s, 35
objects/s)
- pg statistics

I would like to create a single script in Python to retrieve the values using
the rados python module, but there is only a little information in the
documentation about module usage. I've created a single function which
calculates all pools' current read/write statistics, but I can't find out how
to add recovery IO usage and pg statistics:

# Sum per-pool read/write op counters across all pools.
read = 0
write = 0
stats = {}
for pool in conn.list_pools():
    io = conn.open_ioctx(pool)
    stats[pool] = io.get_stats()
    read += int(stats[pool]['num_rd'])
    write += int(stats[pool]['num_wr'])
    io.close()

Could someone share their knowledge of the rados module for retrieving Ceph
statistics?
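
Would something along these lines be the right direction - sending monitor
commands through the bindings and parsing the JSON myself? A rough sketch
(the conffile path is whatever your client uses, mon_command needs reasonably
recent python-rados, and the JSON layout of "pg stat" may differ between
releases):

import json
import rados

# Connect like any other librados client.
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()

# Cluster-wide capacity counters come straight from the bindings.
print(cluster.get_cluster_stats())   # kb, kb_used, kb_avail, num_objects

# PG states and recovery/client IO rates: ask the monitors and parse the JSON.
cmd = json.dumps({'prefix': 'pg stat', 'format': 'json'})
ret, outbuf, errs = cluster.mon_command(cmd, '')
if ret == 0:
    print(json.loads(outbuf))

cluster.shutdown()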

BTW Ceph is awesome!

-- 
Best regards, Adrian Banasiak
email: adr...@banasiak.it


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-13 Thread Alexandre DERUMIER
>>For what it's worth, my cluster gives me 4100 IOPS with the sequential fio 
>>run below and 7200 when doing random reads (go figure). Of course I made 
>>sure these came come the pagecache of the storage nodes, no disk I/O 
>>reported at all and the CPUs used just 1 core per OSD. 
>>--- 
>>fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
>>--rw=read --name=fiojob --blocksize=4k --iodepth=64 
>>--- 

This seems pretty low.

I can get around 6000 IOPS sequential or random read
with a pretty old cluster:

a 3-node cluster (replication x3), Firefly, kernel 3.10, XFS, no tuning in
ceph.conf

each node:
--
-2x quad xeon E5430  @ 2.66GHz
-4 OSDs, Seagate 7.2k SAS (with 512MB cache on controller), journal on the same
disk as the OSD, no dedicated SSD
-2 gigabit link (lacp)
-switch cisco 2960



each OSD process is at around 30% of one core during the benchmark,
no disk access (pagecache on the Ceph nodes)



sequential
--
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
--rw=read --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb 
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64 
2.0.8 
Starting 1 process 
Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta 00m:00s] 
fiojob: (groupid=0, jobs=1): err= 0: pid=4158 
read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec 
slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72 
clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10 
lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10 
clat percentiles (msec): 
| 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10], 
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12], 
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15], 
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404], 
| 99.99th=[ 404] 
bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06, stdev=3341.21 
lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%, 50=0.23% 
lat (msec) : 250=0.13%, 500=0.06% 
cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% 
issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0 


Run status group 0 (all jobs): 
READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s, 
mint=18404msec, maxt=18404msec 


Disk stats (read/write): 
vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380, util=99.58% 


random read
---
# fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
--rw=rand-read --name=fiojob --blocksize=4k --iodepth=64 --filename=/dev/vdb 
valid values: read Sequential read 
: write Sequential write 
: randread Random read 
: randwrite Random write 
: rw Sequential read and write mix 
: readwrite Sequential read and write mix 
: randrw Random read and write mix 


fio: failed parsing rw=rand-read 
fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64 
2.0.8 
Starting 1 process 
Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta 00m:01s] 
fiojob: (groupid=0, jobs=1): err= 0: pid=4172 
read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec 
slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38 
clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24 
lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24 
clat percentiles (msec): 
| 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10], 
| 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12], 
| 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14], 
| 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359], 
| 99.99th=[ 404] 
bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77, stdev=2657.48 
lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%, 50=0.21% 
lat (msec) : 100=0.05%, 250=0.01%, 500=0.06% 
cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88 
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% 
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% 
issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0 


Run status group 0 (all jobs): 
READ: io=409600KB, aggrb=22886KB/s, minb=22886KB/s, maxb=22886KB/s, 
mint=17897msec, maxt=17897msec 


Disk stats (read/write): 
vdb: ios=100981/0, merge=0/0, ticks=1124768/0, in_queue=1125492, util=99.57% 






MonSiteEstLent.com - Blog dedicated to web performance and handling traffic
spikes

- Mail original -

De: "Christian Balzer"  
À: "Alexandre DERUMIER"  
Cc: ceph-users@lists.ceph.com 
Envoyé: Mardi 13 Mai 2014 14:38:57 
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing 
devices 


Hello, 

On Tue, 13 May 2014 13:36:49 +0200 (CEST) Alexandre DERUMIER wrote: 

> >>It might, but at the IOPS I'm seeing anybody using SSD for file 
> 

Re: [ceph-users] Migrate whole clusters

2014-05-13 Thread Kyle Bader
> Anyway replacing set of monitors means downtime for every client, so
> I`m in doubt if 'no outage' word is still applicable there.

Taking the entire quorum down for migration would be bad. It's better
to add one in the new location, remove one at the old, ad infinitum.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-13 Thread Christian Balzer
On Tue, 13 May 2014 16:09:28 +0200 (CEST) Alexandre DERUMIER wrote:

> >>For what it's worth, my cluster gives me 4100 IOPS with the sequential
> >>fio run below and 7200 when doing random reads (go figure). Of course
> >>I made sure these came come the pagecache of the storage nodes, no
> >>disk I/O reported at all and the CPUs used just 1 core per OSD. 
> >>--- 
> >>fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> >>--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 --- 
> 
> This seem pretty low,
> 
> I can get around 6000iops seq or rand read,
Actually check your random read output again, you gave it the wrong
parameter, it needs to be randread, not rand-read.

> with a pretty old cluster
> 
Your cluster isn't that old (the CPUs are in the same ballpark) and has 12
OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^

Remember, all this is coming from RAM, so what it boils down to is CPU
(memory and bus transfer speeds) and of course your network.
Which is probably why your cluster isn't much faster than mine.

Either way, that number isn't anywhere near 4000 read IOPS per OSD either,
yours is about 500, mine about 1000...

Christian 

> 3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning
> in ceph.conf
> 
> each node:
> --
> -2x quad xeon E5430  @ 2.66GHz
> -4 osd, seageate 7,2k sas   (with 512MB cache on controller).  (journal
> on same disk than osd, no dedicated ssd) -2 gigabit link (lacp)
> -switch cisco 2960
> 
> 
> 
> each osd process are around 30% 1core during benchmark
> no disk access (pagecache on ceph nodes)
> 
> 
> 
> sequential
> --
> # fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64
> --filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K,
> ioengine=libaio, iodepth=64 2.0.8 Starting 1 process 
> Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta
> 00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158 
> read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec 
> slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72 
> clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10 
> lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10 
> clat percentiles (msec): 
> | 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10], 
> | 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12], 
> | 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15], 
> | 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404], 
> | 99.99th=[ 404] 
> bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06,
> stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%,
> 50=0.23% lat (msec) : 250=0.13%, 500=0.06% 
> cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88 
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% 
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%,
> >=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0 
> 
> 
> Run status group 0 (all jobs): 
> READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s,
> mint=18404msec, maxt=18404msec 
> 
> 
> Disk stats (read/write): 
> vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380,
> util=99.58% 
> 
> 
> random read
> ---
> # fio --size=400m --ioengine=libaio --invalidate=1 --direct=1
> --numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64
> --filename=/dev/vdb valid values: read Sequential read : write
> Sequential write : randread Random read 
> : randwrite Random write 
> : rw Sequential read and write mix 
> : readwrite Sequential read and write mix 
> : randrw Random read and write mix 
> 
> 
> fio: failed parsing rw=rand-read 
> fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64 
> 2.0.8 
> Starting 1 process 
> Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta
> 00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172 
> read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec 
> slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38 
> clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24 
> lat (msec): min=1 , max=407 , avg=11.18, stdev= 9.24 
> clat percentiles (msec): 
> | 1.00th=[ 6], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10], 
> | 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12], 
> | 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 14], 
> | 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 60], 99.95th=[ 359], 
> | 99.99th=[ 404] 
> bw (KB/s) : min= 8112, max=25120, per=100.00%, avg=22967.77,
> stdev=2657.48 lat (msec) : 2=0.05%, 4=0.46%, 10=22.83%, 20=76.34%,
> 50=0.21% lat (msec) : 100=0.05%, 250=0.01%, 500=0.06% 
> cpu : usr=4.14%, sys=10.01%, ctx=44760, majf=0, minf=88 
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% 
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
> complete

Re: [ceph-users] NFS over CEPH - best practice

2014-05-13 Thread Andrei Mikhailovsky
Dima, do you have any examples / howtos for this? I would love to give it a go. 

Cheers 
- Original Message -

From: "Dimitri Maziuk"  
To: ceph-users@lists.ceph.com 
Sent: Monday, 12 May, 2014 3:38:11 PM 
Subject: Re: [ceph-users] NFS over CEPH - best practice 

On 5/12/2014 4:52 AM, Andrei Mikhailovsky wrote: 
> Leen, 
> 
> thanks for explaining things. I does make sense now. 
> 
> Unfortunately, it does look like this technology would not fulfill my 
> requirements as I do need to have an ability to perform maintenance 
> without shutting down vms. 

I've no idea how much state you need to share for iscsi failover; with 
nfs you put the "cluster" ip address, the lock directories & the daemons 
on a heartbeat'ed pair of machines. With automount you don't need 
multiple active servers, you can do (much simpler) active-passive. 

Dima 


___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph with VMWare / XenServer

2014-05-13 Thread Andrei Mikhailovsky
Uwe, do you mind sharing your storage and xenserver iscsi config files? 

Also, what is your performance like? 

Thanks 

- Original Message -

From: "Uwe Grohnwaldt"  
To: ceph-users@lists.ceph.com 
Sent: Monday, 12 May, 2014 2:45:43 PM 
Subject: Re: [ceph-users] Ceph with VMWare / XenServer 

Hi, 

yes, we use it in production. I can stop/kill the tgt on one server and 
XenServer goes to the second one. We enabled multipathing in xenserver. In our 
setup we haven't multiple ip-ranges so we scan/login the second target on 
xenserverstartup with iscsiadm in rc.local. 

Thats based on history - we used Dell Equallogic before ceph came in and there 
was no need to use multipathing (only LACP-channels). No we enabled 
multipathing and use tgt, but without diffent ip-ranges. 

Mit freundlichen Grüßen / Best Regards, 
-- 
Consultant 
Dipl.-Inf. Uwe Grohnwaldt 
Gutleutstr. 351 
60327 Frankfurt a. M. 

eMail: u...@grohnwaldt.eu 
Telefon: +49-69-34878906 
Mobil: +49-172-3209285 
Fax: +49-69-348789069 

- Original Message - 
> From: "Andrei Mikhailovsky"  
> To: "Uwe Grohnwaldt"  
> Cc: ceph-users@lists.ceph.com 
> Sent: Montag, 12. Mai 2014 14:48:58 
> Subject: Re: [ceph-users] Ceph with VMWare / XenServer 
> 
> 
> Uwe, thanks for your quick reply. 
> 
> Do you run the Xenserver setup on production env and have you tried 
> to test some failover scenarios to see if the xenserver guest vms 
> are working during the failover of storage servers? 
> 
> Also, how did you set up the xenserver iscsi? Have you used the 
> multipath option to set up the LUNs? 
> 
> Cheers 
> 
> 
> 
> 
> - Original Message - 
> 
> From: "Uwe Grohnwaldt"  
> To: ceph-users@lists.ceph.com 
> Sent: Monday, 12 May, 2014 12:57:48 PM 
> Subject: Re: [ceph-users] Ceph with VMWare / XenServer 
> 
> Hi, 
> 
> at the moment we are using tgt with RBD backend compiled from source 
> on Ubuntu 12.04 and 14.04 LTS. We have two machines within two 
> ip-ranges (e.g. 192.168.1.0/24 and 192.168.2.0/24). One machine in 
> 192.168.1.0/24 and one machine in 192.168.2.0/24. The config for tgt 
> is the same on both machines, they export the same rbd. This works 
> well for XenServer. 
> 
> For VMWare you have to disable VAAI to use it with tgt 
> (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1033665)
>  
> If you don't disable it, ESXi becomes very slow and unresponsive. 
> 
> I think the problem is the iSCSI Write Same Support but I haven't 
> tried which of the settings of VAAI is responsible for this 
> behavior. 
> 
> Mit freundlichen Grüßen / Best Regards, 
> -- 
> Consultant 
> Dipl.-Inf. Uwe Grohnwaldt 
> Gutleutstr. 351 
> 60327 Frankfurt a. M. 
> 
> eMail: u...@grohnwaldt.eu 
> Telefon: +49-69-34878906 
> Mobil: +49-172-3209285 
> Fax: +49-69-348789069 
> 
> - Original Message - 
> > From: "Andrei Mikhailovsky"  
> > To: ceph-users@lists.ceph.com 
> > Sent: Montag, 12. Mai 2014 12:00:48 
> > Subject: [ceph-users] Ceph with VMWare / XenServer 
> > 
> > 
> > 
> > Hello guys, 
> > 
> > I am currently running a ceph cluster for running vms with qemu + 
> > rbd. It works pretty well and provides a good degree of failover. I 
> > am able to run maintenance tasks on the ceph nodes without 
> > interrupting vms IO. 
> > 
> > I would like to do the same with VMWare / XenServer hypervisors, 
> > but 
> > I am not really sure how to achieve this. Initially I thought of 
> > using iscsi multipathing, however, as it turns out, multipathing is 
> > more for load balancing and nic/switch failure. It does not allow 
> > me 
> > to perform maintenance on the iscsi target without interrupting 
> > service to vms. 
> > 
> > Has anyone done either a PoC or better a production environment 
> > where 
> > they've used ceph as a backend storage with vmware / xenserver? The 
> > important element for me is to have the ability of performing 
> > maintenance tasks and resilience to failovers without interrupting 
> > IO to vms. Are there any recommendations or howtos on how this 
> > could 
> > be achieved? 
> > 
> > Many thanks 
> > 
> > Andrei 
> > 
> > 
> > ___ 
> > ceph-users mailing list 
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> > 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 
> 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Monitoring ceph statistics using rados python module

2014-05-13 Thread Adrian Banasiak
Hi, I am working with a test Ceph cluster and now I want to implement Zabbix
monitoring with items such as:

- whole cluster IO (for example ceph -s -> recovery io 143 MB/s, 35
objects/s)
- pg statistics

I would like to create a single script in Python to retrieve values using the
rados python module, but there is only a little information in the documentation
about module usage. I've created a single function which calculates all pools'
current read/write statistics, but I can't find out how to add recovery IO
usage and pg statistics:

import rados

conn = rados.Rados(conffile='/etc/ceph/ceph.conf')
conn.connect()

stats = {}
read = 0
write = 0
for pool in conn.list_pools():
    io = conn.open_ioctx(pool)
    stats[pool] = io.get_stats()  # per-pool totals since pool creation
    read += int(stats[pool]['num_rd'])
    write += int(stats[pool]['num_wr'])

Could someone share their knowledge about the rados module for retrieving Ceph
statistics?

BTW Ceph is awesome!

-- 
Best regards, Adrian Banasiak
email: adr...@banasiak.it
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS over CEPH - best practice

2014-05-13 Thread Dimitri Maziuk

On 5/13/2014 9:43 AM, Andrei Mikhailovsky wrote:

Dima, do you have any examples / howtos for this? I would love to give
it a go.


Not really: I haven't done this myself. Google for "tgtd failover with 
heartbeat", you should find something useful.


The setups I have are heartbeat (3.0.x) managing drbd (8.4), "cluster 
ip", and nfs daemons. With nothing else failover takes a couple of 
seconds & it's practically seamless. When I also throw ldap and dns in 
the mix, it's more like 30 seconds and clients with /home on nfs freeze 
for a bit -- but not long enough to break things.


This is a pretty standard HA setup, in theory it should work for tgtd.

Dima (what happens in practice, OTOH...)

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-13 Thread Alexandre DERUMIER
>>Actually check your random read output again, you gave it the wrong
>>parameter, it needs to be randread, not rand-read.

oops, sorry. I got around 7500iops with randread.

>>Your cluster isn't that old (the CPUs are in the same ballpark)
Yes, these are 6-7 year old servers (these Xeons were released in 2007...).

So they miss some features like CRC32 and SSE4, for example, which can help
Ceph a lot.



I'll try to do some OSD tuning (threads, ...) to see if I can improve
performance.


- Mail original - 

De: "Christian Balzer"  
À: "Alexandre DERUMIER"  
Cc: ceph-users@lists.ceph.com 
Envoyé: Mardi 13 Mai 2014 16:39:58 
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing 
devices 

On Tue, 13 May 2014 16:09:28 +0200 (CEST) Alexandre DERUMIER wrote: 

> >>For what it's worth, my cluster gives me 4100 IOPS with the sequential 
> >>fio run below and 7200 when doing random reads (go figure). Of course 
> >>I made sure these came come the pagecache of the storage nodes, no 
> >>disk I/O reported at all and the CPUs used just 1 core per OSD. 
> >>--- 
> >>fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> >>--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 --- 
> 
> This seem pretty low, 
> 
> I can get around 6000iops seq or rand read, 
Actually check your random read output again, you gave it the wrong 
parameter, it needs to be randread, not rand-read. 

> with a pretty old cluster 
> 
Your cluster isn't that old (the CPUs are in the same ballpark) and has 12 
OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^ 

Remember, all this is coming from RAM, so what it boils down is CPU 
(memory and bus transfer speeds) and of course your network. 
Which is probably why your cluster isn't even more faster than mine. 

Either way, that number isn't anywhere near 4000 read IOPS per OSD either, 
yours is about 500, mine about 1000... 

Christian 

> 3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning 
> in ceph.conf 
> 
> each node: 
> -- 
> -2x quad xeon E5430 @ 2.66GHz 
> -4 osd, seageate 7,2k sas (with 512MB cache on controller). (journal 
> on same disk than osd, no dedicated ssd) -2 gigabit link (lacp) 
> -switch cisco 2960 
> 
> 
> 
> each osd process are around 30% 1core during benchmark 
> no disk access (pagecache on ceph nodes) 
> 
> 
> 
> sequential 
> -- 
> # fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 
> --filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, 
> ioengine=libaio, iodepth=64 2.0.8 Starting 1 process 
> Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta 
> 00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158 
> read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec 
> slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72 
> clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10 
> lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10 
> clat percentiles (msec): 
> | 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10], 
> | 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12], 
> | 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15], 
> | 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404], 
> | 99.99th=[ 404] 
> bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06, 
> stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%, 
> 50=0.23% lat (msec) : 250=0.13%, 500=0.06% 
> cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88 
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% 
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, 
> >=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0 
> 
> 
> Run status group 0 (all jobs): 
> READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s, 
> mint=18404msec, maxt=18404msec 
> 
> 
> Disk stats (read/write): 
> vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380, 
> util=99.58% 
> 
> 
> random read 
> --- 
> # fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> --numjobs=1 --rw=rand-read --name=fiojob --blocksize=4k --iodepth=64 
> --filename=/dev/vdb valid values: read Sequential read : write 
> Sequential write : randread Random read 
> : randwrite Random write 
> : rw Sequential read and write mix 
> : readwrite Sequential read and write mix 
> : randrw Random read and write mix 
> 
> 
> fio: failed parsing rw=rand-read 
> fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64 
> 2.0.8 
> Starting 1 process 
> Jobs: 1 (f=1): [R] [94.7% done] [23752K/0K /s] [5938 /0 iops] [eta 
> 00m:01s] fiojob: (groupid=0, jobs=1): err= 0: pid=4172 
> read : io=409600KB, bw=22887KB/s, iops=5721 , runt= 17897msec 
> slat (usec): min=3 , max=929 , avg=11.75, stdev=11.38 
> clat (msec): min=1 , max=407 , avg=11.17, stdev= 9.24 
> lat (mse

Re: [ceph-users] Monitoring ceph statistics using rados python module

2014-05-13 Thread Haomai Wang
Not sure exactly what you need.

I use "ceph --admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump" to
get the monitor infos. And the result can be parsed by simplejson
easily via python.
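
For a whole host you could loop over the daemon sockets; a minimal sketch
(assuming the default /var/run/ceph asok paths, and note that the exact
counter names depend on the Ceph version):

import glob
import json
import subprocess

# dump perf counters from every OSD admin socket on this host
for sock in glob.glob('/var/run/ceph/ceph-osd.*.asok'):
    out = subprocess.check_output(['ceph', '--admin-daemon', sock, 'perf', 'dump'])
    perf = json.loads(out)
    # e.g. total op count for this OSD; key names vary between releases
    print(sock, perf.get('osd', {}).get('op'))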

On Tue, May 13, 2014 at 10:56 PM, Adrian Banasiak  wrote:
> Hi, i am working with test Ceph cluster and now I want to implement Zabbix
> monitoring with items such as:
>
> - whoe cluster IO (for example ceph -s -> recovery io 143 MB/s, 35
> objects/s)
> - pg statistics
>
> I would like to create single script in python to retrive values using rados
> python module, but there are only few informations in documentation about
> module usage. I've created single function which calculates all pools
> current read/write statistics but i cant find out how to add recovery IO
> usage and pg statistics:
>
> read = 0
> write = 0
> for pool in conn.list_pools():
> io = conn.open_ioctx(pool)
> stats[pool] = io.get_stats()
> read+=int(stats[pool]['num_rd'])
> write+=int(stats[pool]['num_wr'])
>
> Could someone share his knowledge about rados module for retriving ceph
> statistics?
>
> BTW Ceph is awesome!
>
> --
> Best regards, Adrian Banasiak
> email: adr...@banasiak.it
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Lost access to radosgw after crash?

2014-05-13 Thread Brian Rak

I hit a "bug" where radosgw crashed with

-101> 2014-05-13 15:26:07.188494 7fde82886820  0 ERROR: FCGX_Accept_r 
returned -24


0> 2014-05-13 15:26:07.193772 7fde82886820 -1 rgw/rgw_main.cc: In 
function 'virtual void RGWProcess::RGWWQ::_clear()' thread 7fde82886820 
time 2014-05-13 15:26:07.192212

rgw/rgw_main.cc: 181: FAILED assert(process->m_req_queue.empty())

 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: (ThreadPool::WorkQueue::_process(RGWRequest*)+0) [0x4ae9a0]
 2: (ThreadPool::stop(bool)+0x1f5) [0x7fde8193ee05]
 3: (RGWProcess::run()+0x358) [0x4ab808]
 4: (main()+0x866) [0x4ac796]
 5: (__libc_start_main()+0xfd) [0x7fde7fe02d1d]
 6: radosgw() [0x45d239]


It seems it ran out of fds, so I increased the limit and restarted it.  
However, now the user I was using can no longer access anything.  Every 
action the user attempts is rejected with a 403 error.


How do I enable logging of authentication issues?  I don't see anything 
obvious in the config reference, and running radosgw with --debug 
doesn't produce output that makes sense.


Even after creating an entirely new user, that user lacks permission to 
do anything (create bucket fails, list buckets fails).


I'm still on ceph version 0.72.2 
(a913ded2ff138aefb8cb84d347d72164099cfd60), is this a known issue?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph firefly PGs in active+clean+scrubbing state

2014-05-13 Thread Fabrizio G. Ventola
I've upgraded to 0.80.1 on a testing instance: the cluster cyclically goes
into active+clean+deep scrubbing for a little while and then returns to
active+clean status. I'm not worried about this (I think it's normal), but I
didn't have this behaviour on emperor 0.72.2.

Cheers,
Fabrizio

On 13 May 2014 06:08, Alexandre DERUMIER  wrote:
> 0.80.1 update has fixed the problem.
>
> thanks to ceph team !
>
> - Mail original -
>
> De: "Simon Ironside" 
> À: ceph-users@lists.ceph.com
> Envoyé: Lundi 12 Mai 2014 18:13:32
> Objet: Re: [ceph-users] ceph firefly PGs in active+clean+scrubbing state
>
> Hi,
>
> I'm sure I saw on the IRC channel yesterday that this is a known problem
> with Firefly which is due to be fixed with the release (possibly today?)
> of 0.80.1.
>
> Simon
>
> On 12/05/14 14:53, Alexandre DERUMIER wrote:
>> Hi, I observe the same behaviour on a test ceph cluster (upgrade from 
>> emperor to firefly)
>>
>>
>> cluster 819ea8af-c5e2-4e92-81f5-4348e23ae9e8
>> health HEALTH_OK
>> monmap e3: 3 mons at ..., election epoch 12, quorum 0,1,2 0,1,2
>> osdmap e94: 12 osds: 12 up, 12 in
>> pgmap v19001: 592 pgs, 4 pools, 30160 MB data, 7682 objects
>> 89912 MB used, 22191 GB / 22279 GB avail
>> 588 active+clean
>> 4 active+clean+scrubbing
>>
>> - Mail original -
>>
>> De: "Fabrizio G. Ventola" 
>> À: ceph-users@lists.ceph.com
>> Envoyé: Lundi 12 Mai 2014 15:42:03
>> Objet: [ceph-users] ceph firefly PGs in active+clean+scrubbing state
>>
>> Hello, last week I've upgraded from 0.72.2 to last stable firefly 0.80
>> following the suggested procedure (upgrade in order monitors, OSDs,
>> MDSs, clients) on my 2 different clusters.
>>
>> Everything is ok, I've HEALTH_OK on both, the only weird thing is that
>> few PGs remain in active+clean+scrubbing. I've tried to query the PG
>> and reboot the involved OSD daemons and hosts but the issue is still
>> present and the involved PGs with +scrubbing state changes.
>>
>> I've tried as well to put noscrub on OSDs with "ceph osd set noscrub"
>> nut nothing changed.
>>
>> What can I do? I attach the cluster statuses and their cluster maps:
>>
>> FIRST CLUSTER:
>>
>> health HEALTH_OK
>> mdsmap e510: 1/1/1 up {0=ceph-mds1=up:active}, 1 up:standby
>> osdmap e4604: 5 osds: 5 up, 5 in
>> pgmap v138288: 1332 pgs, 4 pools, 117 GB data, 30178 objects
>> 353 GB used, 371 GB / 724 GB avail
>> 1331 active+clean
>> 1 active+clean+scrubbing
>>
>> # id weight type name up/down reweight
>> -1 0.84 root default
>> -7 0.28 rack rack1
>> -2 0.14 host cephosd1-dev
>> 0 0.14 osd.0 up 1
>> -3 0.14 host cephosd2-dev
>> 1 0.14 osd.1 up 1
>> -8 0.28 rack rack2
>> -4 0.14 host cephosd3-dev
>> 2 0.14 osd.2 up 1
>> -5 0.14 host cephosd4-dev
>> 3 0.14 osd.3 up 1
>> -9 0.28 rack rack3
>> -6 0.28 host cephosd5-dev
>> 4 0.28 osd.4 up 1
>>
>> SECOND CLUSTER:
>>
>> health HEALTH_OK
>> osdmap e158: 10 osds: 10 up, 10 in
>> pgmap v9724: 2001 pgs, 6 pools, 395 MB data, 139 objects
>> 1192 MB used, 18569 GB / 18571 GB avail
>> 1998 active+clean
>> 3 active+clean+scrubbing
>>
>> # id weight type name up/down reweight
>> -1 18.1 root default
>> -2 9.05 host wn-recas-uniba-30
>> 0 1.81 osd.0 up 1
>> 1 1.81 osd.1 up 1
>> 2 1.81 osd.2 up 1
>> 3 1.81 osd.3 up 1
>> 4 1.81 osd.4 up 1
>> -3 9.05 host wn-recas-uniba-32
>> 5 1.81 osd.5 up 1
>> 6 1.81 osd.6 up 1
>> 7 1.81 osd.7 up 1
>> 8 1.81 osd.8 up 1
>> 9 1.81 osd.9 up 1
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Occasional Missing Admin Sockets

2014-05-13 Thread Mike Dawson

All,

I have a recurring issue where the admin sockets 
(/var/run/ceph/ceph-*.*.asok) may vanish on a running cluster while the 
daemons keep running (or restart without my knowledge). I see this issue 
on a dev cluster running Ubuntu and Ceph Emperor/Firefly, deployed with 
ceph-deploy using Upstart to control daemons. I never see this issue on 
Ubuntu / Dumpling / sysvinit.


Has anyone else seen this issue or know the likely cause?

--
Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC
6330 East 75th Street, Suite 170
Indianapolis, IN 46250
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-13 Thread Alexandre DERUMIER
I have just done some tests,

with fio-rbd,
(http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html)

directly from the kvm host,(not from the vm).


1 fio job: around 8000iops
2 different parallel fio jobs (on different rbd volumes): around 8000iops per
fio job!

cpu on the client is at 100%
cpu of each osd is around 70% of 1 core now.


So, there seems to be a bottleneck client-side somewhere.

(I remember some tests from Stefan Priebe on this mailing list, with a full ssd
cluster, having almost the same results)



- Mail original - 

De: "Alexandre DERUMIER"  
À: "Christian Balzer"  
Cc: ceph-users@lists.ceph.com 
Envoyé: Mardi 13 Mai 2014 17:16:25 
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing 
devices 

>>Actually check your random read output again, you gave it the wrong 
>>parameter, it needs to be randread, not rand-read. 

oops, sorry. I got around 7500iops with randread. 

>>Your cluster isn't that old (the CPUs are in the same ballpark) 
Yes, this is 6-7 year old server. (this xeons were released in 2007...) 

So, it miss some features like crc32 and sse4 for examples, which can help a 
lot ceph 



(I'll try to do some osd tuning (threads,...) to see if I can improve 
performance. 


- Mail original - 

De: "Christian Balzer"  
À: "Alexandre DERUMIER"  
Cc: ceph-users@lists.ceph.com 
Envoyé: Mardi 13 Mai 2014 16:39:58 
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing 
devices 

On Tue, 13 May 2014 16:09:28 +0200 (CEST) Alexandre DERUMIER wrote: 

> >>For what it's worth, my cluster gives me 4100 IOPS with the sequential 
> >>fio run below and 7200 when doing random reads (go figure). Of course 
> >>I made sure these came come the pagecache of the storage nodes, no 
> >>disk I/O reported at all and the CPUs used just 1 core per OSD. 
> >>--- 
> >>fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> >>--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 --- 
> 
> This seem pretty low, 
> 
> I can get around 6000iops seq or rand read, 
Actually check your random read output again, you gave it the wrong 
parameter, it needs to be randread, not rand-read. 

> with a pretty old cluster 
> 
Your cluster isn't that old (the CPUs are in the same ballpark) and has 12 
OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^ 

Remember, all this is coming from RAM, so what it boils down is CPU 
(memory and bus transfer speeds) and of course your network. 
Which is probably why your cluster isn't even more faster than mine. 

Either way, that number isn't anywhere near 4000 read IOPS per OSD either, 
yours is about 500, mine about 1000... 

Christian 

> 3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning 
> in ceph.conf 
> 
> each node: 
> -- 
> -2x quad xeon E5430 @ 2.66GHz 
> -4 osd, seageate 7,2k sas (with 512MB cache on controller). (journal 
> on same disk than osd, no dedicated ssd) -2 gigabit link (lacp) 
> -switch cisco 2960 
> 
> 
> 
> each osd process are around 30% 1core during benchmark 
> no disk access (pagecache on ceph nodes) 
> 
> 
> 
> sequential 
> -- 
> # fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 
> --filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, 
> ioengine=libaio, iodepth=64 2.0.8 Starting 1 process 
> Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta 
> 00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158 
> read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec 
> slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72 
> clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10 
> lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10 
> clat percentiles (msec): 
> | 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10], 
> | 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12], 
> | 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15], 
> | 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404], 
> | 99.99th=[ 404] 
> bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06, 
> stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%, 
> 50=0.23% lat (msec) : 250=0.13%, 500=0.06% 
> cpu : usr=3.76%, sys=10.32%, ctx=45280, majf=0, minf=88 
> IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=99.9% 
> submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% 
> complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, 
> >=64=0.0% issued : total=r=102400/w=0/d=0, short=r=0/w=0/d=0 
> 
> 
> Run status group 0 (all jobs): 
> READ: io=409600KB, aggrb=22256KB/s, minb=22256KB/s, maxb=22256KB/s, 
> mint=18404msec, maxt=18404msec 
> 
> 
> Disk stats (read/write): 
> vdb: ios=101076/0, merge=0/0, ticks=1157172/0, in_queue=1157380, 
> util=99.58% 
> 
> 
> random read 
> --- 
> # fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> --numjobs=1 --rw=rand-

Re: [ceph-users] Lost access to radosgw after crash?

2014-05-13 Thread Yehuda Sadeh
On Tue, May 13, 2014 at 8:52 AM, Brian Rak  wrote:
> I hit a "bug" where radosgw crashed with
>
> -101> 2014-05-13 15:26:07.188494 7fde82886820  0 ERROR: FCGX_Accept_r
> returned -24

too many files opened. You probably need to adjust your limits.


> 
> 0> 2014-05-13 15:26:07.193772 7fde82886820 -1 rgw/rgw_main.cc: In function
> 'virtual void RGWProcess::RGWWQ::_clear()' thread 7fde82886820 time
> 2014-05-13 15:26:07.192212
> rgw/rgw_main.cc: 181: FAILED assert(process->m_req_queue.empty())
>
>  ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
>  1: (ThreadPool::WorkQueue::_process(RGWRequest*)+0) [0x4ae9a0]
>  2: (ThreadPool::stop(bool)+0x1f5) [0x7fde8193ee05]
>  3: (RGWProcess::run()+0x358) [0x4ab808]
>  4: (main()+0x866) [0x4ac796]
>  5: (__libc_start_main()+0xfd) [0x7fde7fe02d1d]
>  6: radosgw() [0x45d239]
>
>
> It seems it ran out of fds, so I increased the limit and restarted it.
> However, now the user I was using can no longer access anything.  Every
> action the user attempts is rejected with a 403 error.
>
> How do I enable logging of authentication issues?  I don't see anything
> obvious in the config reference, and running radosgw with --debug doesn't
> produce output that makes sense.


I usually set "debug rgw = 20" and "debug ms = 1".


>
> Even after creating an entirely new user, that user lacks permission to do
> anything (create bucket fails, list buckets fails).
>
> I'm still on ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60),
> is this a known issue?
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lost access to radosgw after crash?

2014-05-13 Thread Brian Rak
I upgraded to 0.80.1 to see if that helped.  It didn't change anything, 
but I'm now seeing more useful errors:


2014-05-13 16:27:32.954007 7f5183cfc700  0 RGWGC::process() failed to 
acquire lock on gc.10
2014-05-13 16:27:48.098428 7f5183cfc700  0 RGWGC::process() failed to 
acquire lock on gc.14
2014-05-13 16:27:49.792050 7f51828fa700  0 ERROR: can't read user 
header: ret=-2
2014-05-13 16:27:49.792055 7f51828fa700  0 ERROR: sync_user() failed, 
user=centosmirror2 ret=-2


Any ideas?

On 5/13/2014 11:52 AM, Brian Rak wrote:

I hit a "bug" where radosgw crashed with

-101> 2014-05-13 15:26:07.188494 7fde82886820  0 ERROR: FCGX_Accept_r 
returned -24


0> 2014-05-13 15:26:07.193772 7fde82886820 -1 rgw/rgw_main.cc: In 
function 'virtual void RGWProcess::RGWWQ::_clear()' thread 
7fde82886820 time 2014-05-13 15:26:07.192212

rgw/rgw_main.cc: 181: FAILED assert(process->m_req_queue.empty())

 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 1: (ThreadPool::WorkQueue::_process(RGWRequest*)+0) 
[0x4ae9a0]

 2: (ThreadPool::stop(bool)+0x1f5) [0x7fde8193ee05]
 3: (RGWProcess::run()+0x358) [0x4ab808]
 4: (main()+0x866) [0x4ac796]
 5: (__libc_start_main()+0xfd) [0x7fde7fe02d1d]
 6: radosgw() [0x45d239]


It seems it ran out of fds, so I increased the limit and restarted 
it.  However, now the user I was using can no longer access anything.  
Every action the user attempts is rejected with a 403 error.


How do I enable logging of authentication issues?  I don't see 
anything obvious in the config reference, and running radosgw with 
--debug doesn't produce output that makes sense.


Even after creating an entirely new user, that user lacks permission 
to do anything (create bucket fails, list buckets fails).


I'm still on ceph version 0.72.2 
(a913ded2ff138aefb8cb84d347d72164099cfd60), is this a known issue?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-13 Thread Christian Balzer

On Tue, 13 May 2014 18:10:25 +0200 (CEST) Alexandre DERUMIER wrote:

> I have just done some test,
> 
> with fio-rbd,
> (http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html)
> 
> directly from the kvm host,(not from the vm).
> 
> 
> 1 fio job: around 8000iops
> 2 differents parralel fio job (on different rbd volume) : around
> 8000iops by fio job !
> 
> cpu on client is at 100%
> cpu of osd are around 70%/1core now.
> 
> 
> So, seem to have a bottleneck client side somewhere.
> 
You didn't specify what you did, but I assume you did a read test.
Those scale, as in running fio in multiple VMs in parallel gives me about
6200 IOPS each, so much better than the 7200 for a single one.
And yes, the client CPU is quite busy.

However my real, original question is about writes. And they are stuck at
3200 IOPS, cluster wide, no matter how many parallel VMs are running fio...

Christian

> (I remember some tests from Stefan Priebe on this mailing, with a full
> ssd cluster, having almost same results)
> 
> 
> 
> - Mail original - 
> 
> De: "Alexandre DERUMIER"  
> À: "Christian Balzer"  
> Cc: ceph-users@lists.ceph.com 
> Envoyé: Mardi 13 Mai 2014 17:16:25 
> Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
> devices 
> 
> >>Actually check your random read output again, you gave it the wrong 
> >>parameter, it needs to be randread, not rand-read. 
> 
> oops, sorry. I got around 7500iops with randread. 
> 
> >>Your cluster isn't that old (the CPUs are in the same ballpark) 
> Yes, this is 6-7 year old server. (this xeons were released in 2007...) 
> 
> So, it miss some features like crc32 and sse4 for examples, which can
> help a lot ceph 
> 
> 
> 
> (I'll try to do some osd tuning (threads,...) to see if I can improve
> performance. 
> 
> 
> - Mail original - 
> 
> De: "Christian Balzer"  
> À: "Alexandre DERUMIER"  
> Cc: ceph-users@lists.ceph.com 
> Envoyé: Mardi 13 Mai 2014 16:39:58 
> Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing
> devices 
> 
> On Tue, 13 May 2014 16:09:28 +0200 (CEST) Alexandre DERUMIER wrote: 
> 
> > >>For what it's worth, my cluster gives me 4100 IOPS with the
> > >>sequential fio run below and 7200 when doing random reads (go
> > >>figure). Of course I made sure these came come the pagecache of the
> > >>storage nodes, no disk I/O reported at all and the CPUs used just 1
> > >>core per OSD. --- 
> > >>fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> > >>--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 --- 
> > 
> > This seem pretty low, 
> > 
> > I can get around 6000iops seq or rand read, 
> Actually check your random read output again, you gave it the wrong 
> parameter, it needs to be randread, not rand-read. 
> 
> > with a pretty old cluster 
> > 
> Your cluster isn't that old (the CPUs are in the same ballpark) and has
> 12 OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^ 
> 
> Remember, all this is coming from RAM, so what it boils down is CPU 
> (memory and bus transfer speeds) and of course your network. 
> Which is probably why your cluster isn't even more faster than mine. 
> 
> Either way, that number isn't anywhere near 4000 read IOPS per OSD
> either, yours is about 500, mine about 1000... 
> 
> Christian 
> 
> > 3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning 
> > in ceph.conf 
> > 
> > each node: 
> > -- 
> > -2x quad xeon E5430 @ 2.66GHz 
> > -4 osd, seageate 7,2k sas (with 512MB cache on controller). (journal 
> > on same disk than osd, no dedicated ssd) -2 gigabit link (lacp) 
> > -switch cisco 2960 
> > 
> > 
> > 
> > each osd process are around 30% 1core during benchmark 
> > no disk access (pagecache on ceph nodes) 
> > 
> > 
> > 
> > sequential 
> > -- 
> > # fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> > --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 
> > --filename=/dev/vdb fiojob: (g=0): rw=read, bs=4K-4K/4K-4K, 
> > ioengine=libaio, iodepth=64 2.0.8 Starting 1 process 
> > Jobs: 1 (f=1): [R] [100.0% done] [23968K/0K /s] [5992 /0 iops] [eta 
> > 00m:00s] fiojob: (groupid=0, jobs=1): err= 0: pid=4158 
> > read : io=409600KB, bw=22256KB/s, iops=5564 , runt= 18404msec 
> > slat (usec): min=3 , max=1124 , avg=12.03, stdev=12.72 
> > clat (msec): min=1 , max=405 , avg=11.48, stdev=12.10 
> > lat (msec): min=1 , max=405 , avg=11.50, stdev=12.10 
> > clat percentiles (msec): 
> > | 1.00th=[ 5], 5.00th=[ 9], 10.00th=[ 10], 20.00th=[ 10], 
> > | 30.00th=[ 11], 40.00th=[ 11], 50.00th=[ 11], 60.00th=[ 12], 
> > | 70.00th=[ 12], 80.00th=[ 12], 90.00th=[ 13], 95.00th=[ 15], 
> > | 99.00th=[ 19], 99.50th=[ 20], 99.90th=[ 206], 99.95th=[ 404], 
> > | 99.99th=[ 404] 
> > bw (KB/s) : min= 7542, max=24720, per=100.00%, avg=22321.06, 
> > stdev=3341.21 lat (msec) : 2=0.04%, 4=0.60%, 10=21.40%, 20=77.54%, 
> > 50=0.23% lat (msec) : 250=0.13%, 500=0.06% 
> > cpu : usr=3

Re: [ceph-users] Monitoring ceph statistics using rados python module

2014-05-13 Thread Adrian Banasiak
Thanks for the suggestion about the admin daemon, but it looks single-OSD
oriented. I have used perf dump on the mon socket and it outputs some
interesting data for monitoring the whole cluster:
{ "cluster": { "num_mon": 4,
  "num_mon_quorum": 4,
  "num_osd": 29,
  "num_osd_up": 29,
  "num_osd_in": 29,
  "osd_epoch": 1872,
  "osd_kb": 20218112516,
  "osd_kb_used": 5022202696,
  "osd_kb_avail": 15195909820,
  "num_pool": 4,
  "num_pg": 3500,
  "num_pg_active_clean": 3500,
  "num_pg_active": 3500,
  "num_pg_peering": 0,
  "num_object": 400746,
  "num_object_degraded": 0,
  "num_object_unfound": 0,
  "num_bytes": 1678788329609,
  "num_mds_up": 0,
  "num_mds_in": 0,
  "num_mds_failed": 0,
  "mds_epoch": 1},

Unfortunately, cluster-wide IO statistics are still missing.
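
One thing I may try (just a sketch, assuming the rados bindings expose
mon_command() the way I think they do, and the pgmap key names can differ
between versions) is to ask the monitors for the same JSON that "ceph -s"
prints:

import json
import rados

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
# "status" returns the same data as "ceph -s", including the pgmap section
cmd = json.dumps({'prefix': 'status', 'format': 'json'})
ret, outbuf, outs = cluster.mon_command(cmd, b'')
pgmap = json.loads(outbuf)['pgmap']
# client and recovery IO; the *_sec keys only appear while there is activity
print(pgmap.get('read_bytes_sec', 0), pgmap.get('write_bytes_sec', 0))
print(pgmap.get('recovering_bytes_per_sec', 0), pgmap.get('num_pgs'))
cluster.shutdown()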


2014-05-13 17:17 GMT+02:00 Haomai Wang :

> Not sure your demand.
>
> I use "ceph --admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump" to
> get the monitor infos. And the result can be parsed by simplejson
> easily via python.
>
> On Tue, May 13, 2014 at 10:56 PM, Adrian Banasiak 
> wrote:
> > Hi, i am working with test Ceph cluster and now I want to implement
> Zabbix
> > monitoring with items such as:
> >
> > - whoe cluster IO (for example ceph -s -> recovery io 143 MB/s, 35
> > objects/s)
> > - pg statistics
> >
> > I would like to create single script in python to retrive values using
> rados
> > python module, but there are only few informations in documentation about
> > module usage. I've created single function which calculates all pools
> > current read/write statistics but i cant find out how to add recovery IO
> > usage and pg statistics:
> >
> > read = 0
> > write = 0
> > for pool in conn.list_pools():
> > io = conn.open_ioctx(pool)
> > stats[pool] = io.get_stats()
> > read+=int(stats[pool]['num_rd'])
> > write+=int(stats[pool]['num_wr'])
> >
> > Could someone share his knowledge about rados module for retriving ceph
> > statistics?
> >
> > BTW Ceph is awesome!
> >
> > --
> > Best regards, Adrian Banasiak
> > email: adr...@banasiak.it
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Best Regards,
>
> Wheat
>



-- 
Pozdrawiam, Adrian Banasiak
email: adr...@banasiak.it
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lost access to radosgw after crash?

2014-05-13 Thread Brian Rak


On 5/13/2014 12:29 PM, Yehuda Sadeh wrote:

On Tue, May 13, 2014 at 8:52 AM, Brian Rak  wrote:

I hit a "bug" where radosgw crashed with

-101> 2014-05-13 15:26:07.188494 7fde82886820  0 ERROR: FCGX_Accept_r
returned -24

too many files opened. You probably need to adjust your limits.




0> 2014-05-13 15:26:07.193772 7fde82886820 -1 rgw/rgw_main.cc: In function
'virtual void RGWProcess::RGWWQ::_clear()' thread 7fde82886820 time
2014-05-13 15:26:07.192212
rgw/rgw_main.cc: 181: FAILED assert(process->m_req_queue.empty())

  ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
  1: (ThreadPool::WorkQueue::_process(RGWRequest*)+0) [0x4ae9a0]
  2: (ThreadPool::stop(bool)+0x1f5) [0x7fde8193ee05]
  3: (RGWProcess::run()+0x358) [0x4ab808]
  4: (main()+0x866) [0x4ac796]
  5: (__libc_start_main()+0xfd) [0x7fde7fe02d1d]
  6: radosgw() [0x45d239]


It seems it ran out of fds, so I increased the limit and restarted it.
However, now the user I was using can no longer access anything.  Every
action the user attempts is rejected with a 403 error.

How do I enable logging of authentication issues?  I don't see anything
obvious in the config reference, and running radosgw with --debug doesn't
produce output that makes sense.


I usualy set "debug rgw = 20", and "debug ms = 1".
Ah, that helped.  It seems that s3cmd and ceph are calculating different 
signatures for some reason.   Aside from a mismatch between 
public/secret keys, is there something else that could be causing this?



2014-05-13 16:39:30.895180 7fb0db5fe700 20 enqueued request 
req=0x7fb0f0012ce0

2014-05-13 16:39:30.895201 7fb0db5fe700 20 RGWWQ:
2014-05-13 16:39:30.895202 7fb0db5fe700 20 req: 0x7fb0f0012ce0
2014-05-13 16:39:30.895206 7fb0db5fe700 10 allocated request 
req=0x7fb0f00148d0
2014-05-13 16:39:30.895281 7fb0cb1e4700 20 dequeued request 
req=0x7fb0f0012ce0

2014-05-13 16:39:30.895286 7fb0cb1e4700 20 RGWWQ: empty
2014-05-13 16:39:30.895310 7fb0cb1e4700 20 CONTENT_LENGTH=0
2014-05-13 16:39:30.895311 7fb0cb1e4700 20 CONTENT_TYPE=
2014-05-13 16:39:30.895313 7fb0cb1e4700 20 FCGI_ROLE=RESPONDER
2014-05-13 16:39:30.895314 7fb0cb1e4700 20 HTTP_ACCEPT_ENCODING=identity
2014-05-13 16:39:30.895315 7fb0cb1e4700 20 HTTP_AUTHORIZATION=AWS 
MYACCESSKEY:FmglMmhJONIhpRHw7z9DgtvdnDI=

2014-05-13 16:39:30.895316 7fb0cb1e4700 20 HTTP_CONTENT_LENGTH=0
2014-05-13 16:39:30.895317 7fb0cb1e4700 20 HTTP_HOST=MYHOST
2014-05-13 16:39:30.895318 7fb0cb1e4700 20 HTTP_X_AMZ_DATE=Tue, 13 May 
2014 16:39:30 +

2014-05-13 16:39:30.895319 7fb0cb1e4700 20 QUERY_STRING=
2014-05-13 16:39:30.895320 7fb0cb1e4700 20 REQUEST_METHOD=GET
2014-05-13 16:39:30.895322 7fb0cb1e4700  1 == starting new request 
req=0x7fb0f0012ce0 =
2014-05-13 16:39:30.895335 7fb0cb1e4700  2 req 5:0.13::GET 
::initializing

2014-05-13 16:39:30.895340 7fb0cb1e4700 10 host=MYHOST rgw_dns_name=MYHOST
2014-05-13 16:39:30.895353 7fb0cb1e4700 10 meta>> HTTP_X_AMZ_DATE
2014-05-13 16:39:30.895358 7fb0cb1e4700 10 x>> x-amz-date:Tue, 13 May 
2014 16:39:30 +

2014-05-13 16:39:30.895374 7fb0cb1e4700 10 s->object= s->bucket=
2014-05-13 16:39:30.895380 7fb0cb1e4700  2 req 5:0.59:s3:GET 
::getting op
2014-05-13 16:39:30.895384 7fb0cb1e4700  2 req 5:0.62:s3:GET 
:list_buckets:authorizing
2014-05-13 16:39:30.895423 7fb0cb1e4700 20 get_obj_state: 
rctx=0x7fb0f00141f0 obj=.users:MYACCESSKEY state=0x7fb0f0017d48 
s->prefetch_data=0
2014-05-13 16:39:30.895432 7fb0cb1e4700 10 cache get: 
name=.users+MYACCESSKEY : hit
2014-05-13 16:39:30.895439 7fb0cb1e4700 20 get_obj_state: s->obj_tag was 
set empty
2014-05-13 16:39:30.895445 7fb0cb1e4700 10 cache get: 
name=.users+MYACCESSKEY : hit
2014-05-13 16:39:30.895478 7fb0cb1e4700 20 get_obj_state: 
rctx=0x7fb0f00141f0 obj=.users.uid:centosmirror state=0x7fb0f00186f8 
s->prefetch_data=0
2014-05-13 16:39:30.895483 7fb0cb1e4700 10 cache get: 
name=.users.uid+centosmirror : hit
2014-05-13 16:39:30.895486 7fb0cb1e4700 20 get_obj_state: s->obj_tag was 
set empty
2014-05-13 16:39:30.895490 7fb0cb1e4700 10 cache get: 
name=.users.uid+centosmirror : hit

2014-05-13 16:39:30.895579 7fb0cb1e4700 10 get_canon_resource(): dest=
2014-05-13 16:39:30.895583 7fb0cb1e4700 10 auth_hdr:
GET



x-amz-date:Tue, 13 May 2014 16:39:30 +

2014-05-13 16:39:30.895650 7fb0cb1e4700 15 calculated 
digest=ck/6o9TgR73JLPT43SxIt39KgBI=
2014-05-13 16:39:30.895653 7fb0cb1e4700 15 
auth_sign=FmglMmhJONIhpRHw7z9DgtvdnDI=

2014-05-13 16:39:30.895654 7fb0cb1e4700 15 compare=-1
2014-05-13 16:39:30.895657 7fb0cb1e4700 10 failed to authorize request
2014-05-13 16:39:30.895681 7fb0cb1e4700  5 nothing to log for operation
2014-05-13 16:39:30.895684 7fb0cb1e4700  2 req 5:0.000362:s3:GET 
:list_buckets:http status=403
2014-05-13 16:39:30.895688 7fb0cb1e4700  1 == req done 
req=0x7fb0f0012ce0 http_status=403 ==

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow IOPS on RBD compared to journal and backing devices

2014-05-13 Thread Alexandre DERUMIER
>>You didn't specify what you did, but i assume you did read test. 
yes, indeed

>>Those scale, as in running fio in multiple VMs in parallel gives me about 
>>6200 IOPS each, so much better than the 7200 for a single one. 
>>And yes, the client CPU is quite busy. 

oh ok !

>>However my real, original question is about writes. And they are stuck at 
>>3200 IOPS, cluster wide, no matter how many parallel VMs are running fio... 

Sorry, I can't test writes, I don't have an ssd journal for now.
I'll try to send results when I have my ssd cluster.

(But I remember some talk from Sage saying that indeed small direct writes could
be pretty slow; that's why rbd_cache is recommended, to aggregate small writes
into bigger ones)



- Mail original - 

De: "Christian Balzer"  
À: "Alexandre DERUMIER"  
Cc: ceph-users@lists.ceph.com 
Envoyé: Mardi 13 Mai 2014 18:31:18 
Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing 
devices 


On Tue, 13 May 2014 18:10:25 +0200 (CEST) Alexandre DERUMIER wrote: 

> I have just done some test, 
> 
> with fio-rbd, 
> (http://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html)
>  
> 
> directly from the kvm host,(not from the vm). 
> 
> 
> 1 fio job: around 8000iops 
> 2 differents parralel fio job (on different rbd volume) : around 
> 8000iops by fio job ! 
> 
> cpu on client is at 100% 
> cpu of osd are around 70%/1core now. 
> 
> 
> So, seem to have a bottleneck client side somewhere. 
> 
You didn't specify what you did, but i assume you did read test. 
Those scale, as in running fio in multiple VMs in parallel gives me about 
6200 IOPS each, so much better than the 7200 for a single one. 
And yes, the client CPU is quite busy. 

However my real, original question is about writes. And they are stuck at 
3200 IOPS, cluster wide, no matter how many parallel VMs are running fio... 

Christian 

> (I remember some tests from Stefan Priebe on this mailing, with a full 
> ssd cluster, having almost same results) 
> 
> 
> 
> - Mail original - 
> 
> De: "Alexandre DERUMIER"  
> À: "Christian Balzer"  
> Cc: ceph-users@lists.ceph.com 
> Envoyé: Mardi 13 Mai 2014 17:16:25 
> Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing 
> devices 
> 
> >>Actually check your random read output again, you gave it the wrong 
> >>parameter, it needs to be randread, not rand-read. 
> 
> oops, sorry. I got around 7500iops with randread. 
> 
> >>Your cluster isn't that old (the CPUs are in the same ballpark) 
> Yes, this is 6-7 year old server. (this xeons were released in 2007...) 
> 
> So, it miss some features like crc32 and sse4 for examples, which can 
> help a lot ceph 
> 
> 
> 
> (I'll try to do some osd tuning (threads,...) to see if I can improve 
> performance. 
> 
> 
> - Mail original - 
> 
> De: "Christian Balzer"  
> À: "Alexandre DERUMIER"  
> Cc: ceph-users@lists.ceph.com 
> Envoyé: Mardi 13 Mai 2014 16:39:58 
> Objet: Re: [ceph-users] Slow IOPS on RBD compared to journal and backing 
> devices 
> 
> On Tue, 13 May 2014 16:09:28 +0200 (CEST) Alexandre DERUMIER wrote: 
> 
> > >>For what it's worth, my cluster gives me 4100 IOPS with the 
> > >>sequential fio run below and 7200 when doing random reads (go 
> > >>figure). Of course I made sure these came come the pagecache of the 
> > >>storage nodes, no disk I/O reported at all and the CPUs used just 1 
> > >>core per OSD. --- 
> > >>fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> > >>--numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 --- 
> > 
> > This seem pretty low, 
> > 
> > I can get around 6000iops seq or rand read, 
> Actually check your random read output again, you gave it the wrong 
> parameter, it needs to be randread, not rand-read. 
> 
> > with a pretty old cluster 
> > 
> Your cluster isn't that old (the CPUs are in the same ballpark) and has 
> 12 OSDs instead of my 4. Plus it has the supposedly faster firefly. ^o^ 
> 
> Remember, all this is coming from RAM, so what it boils down is CPU 
> (memory and bus transfer speeds) and of course your network. 
> Which is probably why your cluster isn't even more faster than mine. 
> 
> Either way, that number isn't anywhere near 4000 read IOPS per OSD 
> either, yours is about 500, mine about 1000... 
> 
> Christian 
> 
> > 3 nodes cluster (replication x3), firefly, kernel 3.10, xfs, no tuning 
> > in ceph.conf 
> > 
> > each node: 
> > -- 
> > -2x quad xeon E5430 @ 2.66GHz 
> > -4 osd, seageate 7,2k sas (with 512MB cache on controller). (journal 
> > on same disk than osd, no dedicated ssd) -2 gigabit link (lacp) 
> > -switch cisco 2960 
> > 
> > 
> > 
> > each osd process are around 30% 1core during benchmark 
> > no disk access (pagecache on ceph nodes) 
> > 
> > 
> > 
> > sequential 
> > -- 
> > # fio --size=400m --ioengine=libaio --invalidate=1 --direct=1 
> > --numjobs=1 --rw=read --name=fiojob --blocksize=4k --iodepth=64 
> > --filename=/dev/vdb fiojob:

Re: [ceph-users] Lost access to radosgw after crash?

2014-05-13 Thread Brian Rak
This turns out to have been a configuration change to nginx that I 
forgot I had made.  It wasn't passing all the http options through any 
more, so authentication was failing.


On 5/13/2014 12:43 PM, Brian Rak wrote:


On 5/13/2014 12:29 PM, Yehuda Sadeh wrote:

On Tue, May 13, 2014 at 8:52 AM, Brian Rak  wrote:

I hit a "bug" where radosgw crashed with

-101> 2014-05-13 15:26:07.188494 7fde82886820  0 ERROR: FCGX_Accept_r
returned -24

too many files opened. You probably need to adjust your limits.




0> 2014-05-13 15:26:07.193772 7fde82886820 -1 rgw/rgw_main.cc: In 
function

'virtual void RGWProcess::RGWWQ::_clear()' thread 7fde82886820 time
2014-05-13 15:26:07.192212
rgw/rgw_main.cc: 181: FAILED assert(process->m_req_queue.empty())

  ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
  1: (ThreadPool::WorkQueue::_process(RGWRequest*)+0) 
[0x4ae9a0]

  2: (ThreadPool::stop(bool)+0x1f5) [0x7fde8193ee05]
  3: (RGWProcess::run()+0x358) [0x4ab808]
  4: (main()+0x866) [0x4ac796]
  5: (__libc_start_main()+0xfd) [0x7fde7fe02d1d]
  6: radosgw() [0x45d239]


It seems it ran out of fds, so I increased the limit and restarted it.
However, now the user I was using can no longer access anything.  Every
action the user attempts is rejected with a 403 error.

How do I enable logging of authentication issues?  I don't see anything
obvious in the config reference, and running radosgw with --debug 
doesn't

produce output that makes sense.


I usually set "debug rgw = 20" and "debug ms = 1".
Ah, that helped.  It seems that s3cmd and ceph are calculating 
different signatures for some reason.   Aside from a mismatch between 
public/secret keys, is there something else that could be causing this?



2014-05-13 16:39:30.895180 7fb0db5fe700 20 enqueued request 
req=0x7fb0f0012ce0

2014-05-13 16:39:30.895201 7fb0db5fe700 20 RGWWQ:
2014-05-13 16:39:30.895202 7fb0db5fe700 20 req: 0x7fb0f0012ce0
2014-05-13 16:39:30.895206 7fb0db5fe700 10 allocated request 
req=0x7fb0f00148d0
2014-05-13 16:39:30.895281 7fb0cb1e4700 20 dequeued request 
req=0x7fb0f0012ce0

2014-05-13 16:39:30.895286 7fb0cb1e4700 20 RGWWQ: empty
2014-05-13 16:39:30.895310 7fb0cb1e4700 20 CONTENT_LENGTH=0
2014-05-13 16:39:30.895311 7fb0cb1e4700 20 CONTENT_TYPE=
2014-05-13 16:39:30.895313 7fb0cb1e4700 20 FCGI_ROLE=RESPONDER
2014-05-13 16:39:30.895314 7fb0cb1e4700 20 HTTP_ACCEPT_ENCODING=identity
2014-05-13 16:39:30.895315 7fb0cb1e4700 20 HTTP_AUTHORIZATION=AWS 
MYACCESSKEY:FmglMmhJONIhpRHw7z9DgtvdnDI=

2014-05-13 16:39:30.895316 7fb0cb1e4700 20 HTTP_CONTENT_LENGTH=0
2014-05-13 16:39:30.895317 7fb0cb1e4700 20 HTTP_HOST=MYHOST
2014-05-13 16:39:30.895318 7fb0cb1e4700 20 HTTP_X_AMZ_DATE=Tue, 13 May 
2014 16:39:30 +

2014-05-13 16:39:30.895319 7fb0cb1e4700 20 QUERY_STRING=
2014-05-13 16:39:30.895320 7fb0cb1e4700 20 REQUEST_METHOD=GET
2014-05-13 16:39:30.895322 7fb0cb1e4700  1 == starting new request 
req=0x7fb0f0012ce0 =
2014-05-13 16:39:30.895335 7fb0cb1e4700  2 req 5:0.13::GET 
::initializing
2014-05-13 16:39:30.895340 7fb0cb1e4700 10 host=MYHOST 
rgw_dns_name=MYHOST

2014-05-13 16:39:30.895353 7fb0cb1e4700 10 meta>> HTTP_X_AMZ_DATE
2014-05-13 16:39:30.895358 7fb0cb1e4700 10 x>> x-amz-date:Tue, 13 May 
2014 16:39:30 +
2014-05-13 16:39:30.895374 7fb0cb1e4700 10 s->object= 
s->bucket=
2014-05-13 16:39:30.895380 7fb0cb1e4700  2 req 5:0.59:s3:GET 
::getting op
2014-05-13 16:39:30.895384 7fb0cb1e4700  2 req 5:0.62:s3:GET 
:list_buckets:authorizing
2014-05-13 16:39:30.895423 7fb0cb1e4700 20 get_obj_state: 
rctx=0x7fb0f00141f0 obj=.users:MYACCESSKEY state=0x7fb0f0017d48 
s->prefetch_data=0
2014-05-13 16:39:30.895432 7fb0cb1e4700 10 cache get: 
name=.users+MYACCESSKEY : hit
2014-05-13 16:39:30.895439 7fb0cb1e4700 20 get_obj_state: s->obj_tag 
was set empty
2014-05-13 16:39:30.895445 7fb0cb1e4700 10 cache get: 
name=.users+MYACCESSKEY : hit
2014-05-13 16:39:30.895478 7fb0cb1e4700 20 get_obj_state: 
rctx=0x7fb0f00141f0 obj=.users.uid:centosmirror state=0x7fb0f00186f8 
s->prefetch_data=0
2014-05-13 16:39:30.895483 7fb0cb1e4700 10 cache get: 
name=.users.uid+centosmirror : hit
2014-05-13 16:39:30.895486 7fb0cb1e4700 20 get_obj_state: s->obj_tag 
was set empty
2014-05-13 16:39:30.895490 7fb0cb1e4700 10 cache get: 
name=.users.uid+centosmirror : hit

2014-05-13 16:39:30.895579 7fb0cb1e4700 10 get_canon_resource(): dest=
2014-05-13 16:39:30.895583 7fb0cb1e4700 10 auth_hdr:
GET



x-amz-date:Tue, 13 May 2014 16:39:30 +

2014-05-13 16:39:30.895650 7fb0cb1e4700 15 calculated 
digest=ck/6o9TgR73JLPT43SxIt39KgBI=
2014-05-13 16:39:30.895653 7fb0cb1e4700 15 
auth_sign=FmglMmhJONIhpRHw7z9DgtvdnDI=

2014-05-13 16:39:30.895654 7fb0cb1e4700 15 compare=-1
2014-05-13 16:39:30.895657 7fb0cb1e4700 10 failed to authorize request
2014-05-13 16:39:30.895681 7fb0cb1e4700  5 nothing to log for operation
2014-05-13 16:39:30.895684 7fb0cb1e4700  2 req 5:0.000362:s3:GET 
:list_buckets:http status=403
2014-05-13 16:39:30.895688 7fb0cb1e4700  1 == req do
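
For reference, the "calculated digest" radosgw prints above is just a base64-encoded
HMAC-SHA1 of that auth_hdr block with the user's secret key, so anything that drops or
rewrites a signed header on its way through the proxy changes it. A rough sketch of the
calculation in Python (the key and header values are placeholders, and the canonicalized
resource is assumed to be "/"):

import base64
import hashlib
import hmac

secret_key = "MYSECRETKEY"  # placeholder, not a real key

# String-to-sign, mirroring the auth_hdr radosgw logged above:
# METHOD, Content-MD5, Content-Type, Date (empty when x-amz-date is used),
# canonicalized x-amz-* headers, canonicalized resource.
string_to_sign = "\n".join([
    "GET",
    "",                                            # Content-MD5
    "",                                            # Content-Type
    "",                                            # Date (x-amz-date used instead)
    "x-amz-date:Tue, 13 May 2014 16:39:30 +0000",
    "/",                                           # assumed resource
])

mac = hmac.new(secret_key.encode(), string_to_sign.encode(), hashlib.sha1)
print(base64.b64encode(mac.digest()).decode())
# This must match the signature the client sent in HTTP_AUTHORIZATION; if the
# proxy strips or rewrites a signed header, the two sides disagree -> 403.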

Re: [ceph-users] Monitoring ceph statistics using rados python module

2014-05-13 Thread Mike Dawson

Adrian,

Yes, it is single OSD oriented.

Like Haomai, we monitor perf dumps from individual OSD admin sockets. On 
new enough versions of ceph, you can do 'ceph daemon osd.x perf dump', 
which is a shorter way to ask for the same output as 'ceph 
--admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump'. Keep in mind, 
either version has to be run locally on the host where osd.x is running.
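
For instance, a small sketch of pulling one sample in Python (counter names vary
between Ceph versions, so the ones picked out here are only examples):

import json
import subprocess

def osd_perf_dump(osd_id):
    # Same as running 'ceph daemon osd.<id> perf dump' by hand; it has to
    # run on the host where that OSD lives.
    out = subprocess.check_output(["ceph", "daemon", "osd.%d" % osd_id,
                                   "perf", "dump"])
    return json.loads(out)

counters = osd_perf_dump(0)
osd = counters.get("osd", {})
# These are cumulative counters; a monitoring agent keeps the previous
# sample and graphs the delta.
print("osd.0 reads so far:  %s" % osd.get("op_r"))
print("osd.0 writes so far: %s" % osd.get("op_w"))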


We use Sensu to take samples and push them to Graphite. We then have the 
ability to build dashboards showing the whole cluster, units in our 
CRUSH tree, hosts, or individual OSDs.


I have found that monitoring each OSD's admin daemon is critical. Often 
times a single OSD can affect performance of the entire cluster. Without 
individual data, these types of issues can be quite difficult to pinpoint.


Also, note that Inktank has developed Calamari. There are rumors that it 
may be open sourced at some point in the future.


Cheers,
Mike Dawson


On 5/13/2014 12:33 PM, Adrian Banasiak wrote:

Thanks for the suggestion about the admin daemon, but it looks single-OSD
oriented. I have used perf dump on the mon socket and it outputs some
interesting data for monitoring the whole cluster:
{ "cluster": { "num_mon": 4,
   "num_mon_quorum": 4,
   "num_osd": 29,
   "num_osd_up": 29,
   "num_osd_in": 29,
   "osd_epoch": 1872,
   "osd_kb": 20218112516,
   "osd_kb_used": 5022202696,
   "osd_kb_avail": 15195909820,
   "num_pool": 4,
   "num_pg": 3500,
   "num_pg_active_clean": 3500,
   "num_pg_active": 3500,
   "num_pg_peering": 0,
   "num_object": 400746,
   "num_object_degraded": 0,
   "num_object_unfound": 0,
   "num_bytes": 1678788329609,
   "num_mds_up": 0,
   "num_mds_in": 0,
   "num_mds_failed": 0,
   "mds_epoch": 1},

Unfortunately cluster wide IO statistics are still missing.


2014-05-13 17:17 GMT+02:00 Haomai Wang mailto:haomaiw...@gmail.com>>:

Not sure your demand.

I use "ceph --admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump" to
get the monitor infos. And the result can be parsed by simplejson
easily via python.

On Tue, May 13, 2014 at 10:56 PM, Adrian Banasiak
mailto:adr...@banasiak.it>> wrote:
 > Hi, I am working with a test Ceph cluster and now I want to implement Zabbix
 > monitoring with items such as:
 >
 > - whole cluster IO (for example ceph -s -> recovery io 143 MB/s, 35
 > objects/s)
 > - pg statistics
 >
 > I would like to create a single script in Python to retrieve values using
 > the rados python module, but there is only a little information in the
 > documentation about module usage. I've created a single function which
 > calculates the current read/write statistics for all pools, but I can't
 > find out how to add recovery IO usage and pg statistics:
 >
 > read = 0
 > write = 0
 > for pool in conn.list_pools():
 >     io = conn.open_ioctx(pool)
 >     stats[pool] = io.get_stats()
 >     read+=int(stats[pool]['num_rd'])
 >     write+=int(stats[pool]['num_wr'])
 >
 > Could someone share their knowledge about the rados module for retrieving
 > ceph statistics?
 >
 > BTW Ceph is awesome!
 >
 > --
 > Best regards, Adrian Banasiak
 > email: adr...@banasiak.it 
 >
 > ___
 > ceph-users mailing list
 > ceph-users@lists.ceph.com 
 > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 >



--
Best Regards,

Wheat




--
Pozdrawiam, Adrian Banasiak
email: adr...@banasiak.it 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitoring ceph statistics using rados python module

2014-05-13 Thread Don Talton (dotalton)
python-cephclient may be of some use to you

https://github.com/dmsimard/python-cephclient



> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mike Dawson
> Sent: Tuesday, May 13, 2014 10:04 AM
> To: Adrian Banasiak; Haomai Wang
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] Monitoring ceph statistics using rados python module
> 
> Adrian,
> 
> Yes, it is single OSD oriented.
> 
> Like Haomai, we monitor perf dumps from individual OSD admin sockets. On
> new enough versions of ceph, you can do 'ceph daemon osd.x perf dump',
> which is a shorter way to ask for the same output as 'ceph --admin-daemon
> /var/run/ceph/ceph-osd.x.asok perf dump'. Keep in mind, either version has to
> be run locally on the host where osd.x is running.
> 
> We use Sensu to take samples and push them to Graphite. We have the ability
> to then build dashboards showing the whole cluster, units in our CRUSH tree,
> hosts, or an individual OSDs.
> 
> I have found that monitoring each OSD's admin daemon is critical. Often times
> a single OSD can affect performance of the entire cluster. Without individual
> data, these types of issues can be quite difficult to pinpoint.
> 
> Also, note that Inktank has developed Calamari. There are rumors that it may
> be open sourced at some point in the future.
> 
> Cheers,
> Mike Dawson
> 
> 
> On 5/13/2014 12:33 PM, Adrian Banasiak wrote:
> > Thanks for sugestion with admin daemon but it looks like single osd
> > oriented. I have used perf dump on mon socket and it output some
> > interesting data in case of monitoring whole cluster:
> > { "cluster": { "num_mon": 4,
> >"num_mon_quorum": 4,
> >"num_osd": 29,
> >"num_osd_up": 29,
> >"num_osd_in": 29,
> >"osd_epoch": 1872,
> >"osd_kb": 20218112516,
> >"osd_kb_used": 5022202696,
> >"osd_kb_avail": 15195909820,
> >"num_pool": 4,
> >"num_pg": 3500,
> >"num_pg_active_clean": 3500,
> >"num_pg_active": 3500,
> >"num_pg_peering": 0,
> >"num_object": 400746,
> >"num_object_degraded": 0,
> >"num_object_unfound": 0,
> >"num_bytes": 1678788329609,
> >"num_mds_up": 0,
> >"num_mds_in": 0,
> >"num_mds_failed": 0,
> >"mds_epoch": 1},
> >
> > Unfortunately cluster wide IO statistics are still missing.
> >
> >
> > 2014-05-13 17:17 GMT+02:00 Haomai Wang  > >:
> >
> > Not sure your demand.
> >
> > I use "ceph --admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump" to
> > get the monitor infos. And the result can be parsed by simplejson
> > easily via python.
> >
> > On Tue, May 13, 2014 at 10:56 PM, Adrian Banasiak
> > mailto:adr...@banasiak.it>> wrote:
> >  > Hi, i am working with test Ceph cluster and now I want to
> > implement Zabbix
> >  > monitoring with items such as:
> >  >
> >  > - whoe cluster IO (for example ceph -s -> recovery io 143 MB/s, 35
> >  > objects/s)
> >  > - pg statistics
> >  >
> >  > I would like to create single script in python to retrive values
> > using rados
> >  > python module, but there are only few informations in
> > documentation about
> >  > module usage. I've created single function which calculates all pools
> >  > current read/write statistics but i cant find out how to add
> > recovery IO
> >  > usage and pg statistics:
> >  >
> >  > read = 0
> >  > write = 0
> >  > for pool in conn.list_pools():
> >  > io = conn.open_ioctx(pool)
> >  > stats[pool] = io.get_stats()
> >  > read+=int(stats[pool]['num_rd'])
> >  > write+=int(stats[pool]['num_wr'])
> >  >
> >  > Could someone share his knowledge about rados module for
> > retriving ceph
> >  > statistics?
> >  >
> >  > BTW Ceph is awesome!
> >  >
> >  > --
> >  > Best regards, Adrian Banasiak
> >  > email: adr...@banasiak.it 
> >  >
> >  > ___
> >  > ceph-users mailing list
> >  > ceph-users@lists.ceph.com 
> >  > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >  >
> >
> >
> >
> > --
> > Best Regards,
> >
> > Wheat
> >
> >
> >
> >
> > --
> > Pozdrawiam, Adrian Banasiak
> > email: adr...@banasiak.it 
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] too slowly upload on ceph object storage

2014-05-13 Thread Stephen Taylor
By way of follow-up, I have done quite a bit more testing here, and my problem seems 
to be exclusive to rbd-fuse on Ubuntu 14.04. So it is probably not related to this 
thread I chimed in on last week.

I am able to get read and write speeds in the hundreds of megabytes per second 
with this setup via the kernel driver and librbd, but rbd-fuse writes are still 
at 5MB/s. I also downgraded to 0.72 on Ubuntu 14.04 and observed the same. 
Reads with rbd-fuse are at about 150MB/s, still slower than any other channel, 
but obviously MUCH better than my 5MB/s writes.

Steve

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Stephen Taylor
Sent: Friday, May 09, 2014 11:36 AM
To: wsnote; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] too slowly upload on ceph object storage

+1 with 0.80 on Ubuntu 14.04. I have a 7-node cluster with 10 OSDs per node on 
a 10Gbps network. 3 of the nodes are also acting as monitors. Each node has a 
single 6-core CPU, 32GB of memory, 1 SSD for journals, and 36 3TB hard drives. 
I'm currently using 11 of the hard drives in each node, one for the OS and 10 
for OSDs.

I do have one SSD (75% full) providing journals for all 10 OSDs on each node, 
but the same setup used to be much faster. I had previously been testing writes 
with rbd-fuse and librbd on 0.72 and Ubuntu 12.04 on this cluster. I was 
averaging about 200MB/s with peaks well beyond that.

This week I upgraded (clean install, not a true upgrade) all 7 nodes to Ubuntu 
14.04 and Ceph 0.80, and now I'm getting 5MB/s writes with rbd-fuse.

I think my next step is to go back to 0.72 on 14.04 to see if it's somehow 
related to the OS, but this feels like an issue related to the new Firefly 
features so far. I have tested both with an EC pool using an overlaid, 
writeback cache pool and with a 2x replicated pool directly with the same 
results.

Any suggestions are welcome. I'll let you know if I find anything.

Steve

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of wsnote
Sent: Thursday, May 08, 2014 7:45 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] too slowly upload on ceph object storage

Hi, everyone!
I am testing Ceph RGW and found that uploads are slow. I don't know where the 
bottleneck may be.

OS: CentOS 6.5
Version: Ceph 0.79

Hardware:
CPU: 2 * quad-core
Mem: 32GB
Disk: 2TB*1+3TB*11
Network: 1*1GB Ethernet NIC

Ceph Cluster:
My cluster was composed of 4 servers (called ceph1-4).
I installed a monitor, a radosgw and 11 OSDs on every server.
So the cluster had 4 servers (ceph1-4), 4 monitors, 4 radosgw instances and 44 OSDs.
I configured ceph as the ceph.com documentation says and didn't do any special config.

Then I use s3cmd to test the Ceph Cluster.
Test 1:
From ceph1, upload a big file to the rgw on ceph1. Repeat the test several times.
The speed is about 10MB/s!
That's too slow! I upload files from ceph1 to ceph1,
so there is no network latency at all.

Test 2:
From ceph1, upload a big file to the rgw on ceph2. Repeat the test several times.
The speed is also about 10MB/s!

Test3:
On each ceph server, upload a big file to its own rgw at the same time. Repeat the 
test several times.
The speed on each ceph server is about 1-3MB/s. The sum of the speeds is about 
10MB/s!

I use the command "iostat -kx 3" to watch the load on the disks.
While testing, iowait is lower than 1%, and %util is lower than 1% too.

There may be some problem. The speed on one server is too slow, and the total 
speed didn't increase with the number of rgw instances.
Can anyone give a suggestion?
Thanks!


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Where is the SDK of ceph object storage

2014-05-13 Thread Gregory Farnum
On Mon, May 12, 2014 at 11:55 PM, wsnote  wrote:
> Hi, everyone!
> Where can I find the SDK of ceph object storage?
> Python: boto
> C++: libs3 which I found in the src of ceph and github.com/ceph/libs3.
> Where are those for other languages? Does Ceph supply them?
> Or should I use the Amazon S3 SDKs directly?

If you're using the RADOS Gateway, you should just use the normal S3
or Swift client libraries. Since we're borrowing their APIs to begin
with, we don't maintain our own client libraries for it. :)
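
For radosgw the only non-default bits are the endpoint and the calling format; a
minimal boto sketch (host and keys are placeholders) looks something like:

import boto
import boto.s3.connection

conn = boto.connect_s3(
    aws_access_key_id="MYACCESSKEY",        # placeholder
    aws_secret_access_key="MYSECRETKEY",    # placeholder
    host="rgw.example.com",                 # your radosgw endpoint
    is_secure=False,
    calling_format=boto.s3.connection.OrdinaryCallingFormat(),
)
for bucket in conn.get_all_buckets():
    print("%s\t%s" % (bucket.name, bucket.creation_date))
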
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] crushmap question

2014-05-13 Thread Gregory Farnum
You just use a type other than "rack" in your chooseleaf rule. In your
case, "host". When using chooseleaf, the bucket type you specify is
the failure domain which it must segregate across.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, May 13, 2014 at 12:52 AM, Cao, Buddy  wrote:
> Hi,
>
>
>
> I have a crushmap structure likes root->rack->host->osds. I designed the
> rule below, since I used “chooseleaf…rack” in rule definition, if there is
> only one rack in the cluster, the ceph gps will always stay at stuck unclean
> state (that is because the default metadata/data/rbd pool set 2 replicas).
> Could you let me know how do I configure the rule to let it can also work in
> a cluster with only one rack?
>
>
>
> rule ssd{
>
> ruleset 1
>
> type replicated
>
> min_size 0
>
> max_size 10
>
> step take root
>
> step chooseleaf firstn 0 type rack
>
> step emit
>
> }
>
>
>
> BTW, if I add a new rack into the crushmap, the pg status will finally get
> to active+clean. However, my customer do ONLY have one rack in their env, so
> hard for me to have workaround to ask him setup several racks.
>
>
>
> Wei Cao (Buddy)
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Occasional Missing Admin Sockets

2014-05-13 Thread Gregory Farnum
On Tue, May 13, 2014 at 9:06 AM, Mike Dawson  wrote:
> All,
>
> I have a recurring issue where the admin sockets
> (/var/run/ceph/ceph-*.*.asok) may vanish on a running cluster while the
> daemons keep running

Hmm.

>(or restart without my knowledge).

I'm guessing this might be involved:

> I see this issue on
> a dev cluster running Ubuntu and Ceph Emperor/Firefly, deployed with
> ceph-deploy using Upstart to control daemons. I never see this issue on
> Ubuntu / Dumpling / sysvinit.

*goes and greps the git log*

I'm betting it was commit 45600789f1ca399dddc5870254e5db883fb29b38
(which has, in fact, been backported to dumpling and emperor),
intended so that turning on a new daemon wouldn't remove the admin
socket of an existing one. But I think that means that if you activate
the new daemon before the old one has finished shutting down and
unlinking, you would end up with a daemon that had no admin socket.
Perhaps it's an incomplete fix and we need a tracker ticket?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate whole clusters

2014-05-13 Thread Fred Yang
I have to say I'm shocked to see the suggestion is rbd import/export if
'you care about the data'. This kind of operation is a common use case and should
be an essential part of any distributed storage system. What if I have a
hundred-node cluster running for years and need to do a hardware refresh? That
there is no clear procedure for it sounds scary to me...

Sent from my Samsung Galaxy S3
On May 9, 2014 1:31 PM, "Gregory Farnum"  wrote:

> I don't think anybody's done this before, but that will functionally
> work, yes. Depending on how much of the data in the cluster you
> actually care about, you might be better off just taking it out (rbd
> export/import or something) instead of trying to incrementally move
> all the data over, but...*shrug*
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Fri, May 9, 2014 at 3:31 AM, Gandalf Corvotempesta
>  wrote:
> > Let's assume a test cluster up and running with real data on it.
> > Which is the best way to migrate everything to a production (and
> > larger) cluster?
> >
> > I'm thinking to add production MONs to the test cluster, after that,
> > add productions OSDs to the test cluster, waiting for a full rebalance
> > and then starting to remove test OSDs and test mons.
> >
> > This should migrate everything with no outage.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Occasional Missing Admin Sockets

2014-05-13 Thread Mike Dawson

Greg/Loic,

I can confirm that "logrotate --force /etc/logrotate.d/ceph" removes the 
monitor admin socket on my boxes running 0.80.1 just like the 
description in Issue 7188 [0].


0: http://tracker.ceph.com/issues/7188

Should that bug be reopened?

Thanks,
Mike Dawson


On 5/13/2014 2:10 PM, Gregory Farnum wrote:

On Tue, May 13, 2014 at 9:06 AM, Mike Dawson  wrote:

All,

I have a recurring issue where the admin sockets
(/var/run/ceph/ceph-*.*.asok) may vanish on a running cluster while the
daemons keep running


Hmm.


(or restart without my knowledge).


I'm guessing this might be involved:


I see this issue on
a dev cluster running Ubuntu and Ceph Emperor/Firefly, deployed with
ceph-deploy using Upstart to control daemons. I never see this issue on
Ubuntu / Dumpling / sysvinit.


*goes and greps the git log*

I'm betting it was commit 45600789f1ca399dddc5870254e5db883fb29b38
(which has, in fact, been backported to dumpling and emperor),
intended so that turning on a new daemon wouldn't remove the admin
socket of an existing one. But I think that means that if you activate
the new daemon before the old one has finished shutting down and
unlinking, you would end up with a daemon that had no admin socket.
Perhaps it's an incomplete fix and we need a tracker ticket?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate whole clusters

2014-05-13 Thread Gregory Farnum
On Tue, May 13, 2014 at 11:36 AM, Fred Yang  wrote:
> I have to say I'm shocked to see the suggestion is rbd import/export if 'you
> care the data'. These kind of operation is common use case and should be an
> essential part of any distributed storage. What if I have a hundred node
> cluster running for years and need to do hardware refresh? There are no
> clear procedure itself sounds scary to me..

You misunderstand. Migrating between machines for incrementally
upgrading your hardware is normal behavior and well-tested (likewise
for swapping in all-new hardware, as long as you understand the IO
requirements involved). So is decommissioning old hardware. But if you
only care about (for instance, numbers pulled out of thin air) 30GB
out of 100TB of data in the cluster, it will be *faster* to move only
the 30GB you care about, instead of rebalancing all the data in the
cluster across to new machines. :)
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Occasional Missing Admin Sockets

2014-05-13 Thread Gregory Farnum
Yeah, I just did so. :(
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, May 13, 2014 at 11:41 AM, Mike Dawson  wrote:
> Greg/Loic,
>
> I can confirm that "logrotate --force /etc/logrotate.d/ceph" removes the
> monitor admin socket on my boxes running 0.80.1 just like the description in
> Issue 7188 [0].
>
> 0: http://tracker.ceph.com/issues/7188
>
> Should that bug be reopened?
>
> Thanks,
> Mike Dawson
>
>
>
> On 5/13/2014 2:10 PM, Gregory Farnum wrote:
>>
>> On Tue, May 13, 2014 at 9:06 AM, Mike Dawson 
>> wrote:
>>>
>>> All,
>>>
>>> I have a recurring issue where the admin sockets
>>> (/var/run/ceph/ceph-*.*.asok) may vanish on a running cluster while the
>>> daemons keep running
>>
>>
>> Hmm.
>>
>>> (or restart without my knowledge).
>>
>>
>> I'm guessing this might be involved:
>>
>>> I see this issue on
>>> a dev cluster running Ubuntu and Ceph Emperor/Firefly, deployed with
>>> ceph-deploy using Upstart to control daemons. I never see this issue on
>>> Ubuntu / Dumpling / sysvinit.
>>
>>
>> *goes and greps the git log*
>>
>> I'm betting it was commit 45600789f1ca399dddc5870254e5db883fb29b38
>> (which has, in fact, been backported to dumpling and emperor),
>> intended so that turning on a new daemon wouldn't remove the admin
>> socket of an existing one. But I think that means that if you activate
>> the new daemon before the old one has finished shutting down and
>> unlinking, you would end up with a daemon that had no admin socket.
>> Perhaps it's an incomplete fix and we need a tracker ticket?
>> -Greg
>> Software Engineer #42 @ http://inktank.com | http://ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph firefly PGs in active+clean+scrubbing state

2014-05-13 Thread Michael
Anyone still seeing this issue on 0.80.1: you'll probably need to dump 
out your scrub list ("ceph pg dump | grep scrub"), then find the OSD listed 
as the acting primary for the PG stuck scrubbing and stop it a bit more 
aggressively. I found that the acting primary for a PG stuck in scrub 
status was completely ignoring standard restart commands, which prevented 
any scrubbing from continuing within the cluster even after the update.
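
Something along these lines automates that lookup from the JSON output (field
names follow the pg dump format of this era, so treat them as approximate):

import json
import subprocess

dump = json.loads(subprocess.check_output(
    ["ceph", "pg", "dump", "--format", "json"]))

for pg in dump.get("pg_stats", []):
    if "scrubbing" in pg.get("state", ""):
        acting = pg.get("acting", [])
        primary = acting[0] if acting else "?"
        # The first OSD in the acting set is the acting primary; that's the
        # daemon to stop/restart more aggressively.
        print("pg %s is %s, acting primary osd.%s"
              % (pg.get("pgid"), pg.get("state"), primary))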


-Michael

On 13/05/2014 17:03, Fabrizio G. Ventola wrote:

I've upgraded to 0.80.1 on a testing instance: the cluster gets
cyclically active+clean+deep scrubbing for a little while and then
reaches active+clean status. I'm not worried about this, I think it's
normal, but I didn't have this behaviour on emperor 0.72.2.

Cheers,
Fabrizio

On 13 May 2014 06:08, Alexandre DERUMIER  wrote:

0.80.1 update has fixed the problem.

thanks to ceph team !

- Mail original -

De: "Simon Ironside" 
À: ceph-users@lists.ceph.com
Envoyé: Lundi 12 Mai 2014 18:13:32
Objet: Re: [ceph-users] ceph firefly PGs in active+clean+scrubbing state

Hi,

I'm sure I saw on the IRC channel yesterday that this is a known problem
with Firefly which is due to be fixed with the release (possibly today?)
of 0.80.1.

Simon

On 12/05/14 14:53, Alexandre DERUMIER wrote:

Hi, I observe the same behaviour on a test ceph cluster (upgrade from emperor 
to firefly)


cluster 819ea8af-c5e2-4e92-81f5-4348e23ae9e8
health HEALTH_OK
monmap e3: 3 mons at ..., election epoch 12, quorum 0,1,2 0,1,2
osdmap e94: 12 osds: 12 up, 12 in
pgmap v19001: 592 pgs, 4 pools, 30160 MB data, 7682 objects
89912 MB used, 22191 GB / 22279 GB avail
588 active+clean
4 active+clean+scrubbing

- Mail original -

De: "Fabrizio G. Ventola" 
À: ceph-users@lists.ceph.com
Envoyé: Lundi 12 Mai 2014 15:42:03
Objet: [ceph-users] ceph firefly PGs in active+clean+scrubbing state

Hello, last week I've upgraded from 0.72.2 to last stable firefly 0.80
following the suggested procedure (upgrade in order monitors, OSDs,
MDSs, clients) on my 2 different clusters.

Everything is ok, I've HEALTH_OK on both; the only weird thing is that a
few PGs remain in active+clean+scrubbing. I've tried to query the PGs
and reboot the involved OSD daemons and hosts, but the issue is still
present and the set of PGs in +scrubbing state changes.

I've tried as well to set noscrub on the OSDs with "ceph osd set noscrub",
but nothing changed.

What can I do? I attach the cluster statuses and their cluster maps:

FIRST CLUSTER:

health HEALTH_OK
mdsmap e510: 1/1/1 up {0=ceph-mds1=up:active}, 1 up:standby
osdmap e4604: 5 osds: 5 up, 5 in
pgmap v138288: 1332 pgs, 4 pools, 117 GB data, 30178 objects
353 GB used, 371 GB / 724 GB avail
1331 active+clean
1 active+clean+scrubbing

# id weight type name up/down reweight
-1 0.84 root default
-7 0.28 rack rack1
-2 0.14 host cephosd1-dev
0 0.14 osd.0 up 1
-3 0.14 host cephosd2-dev
1 0.14 osd.1 up 1
-8 0.28 rack rack2
-4 0.14 host cephosd3-dev
2 0.14 osd.2 up 1
-5 0.14 host cephosd4-dev
3 0.14 osd.3 up 1
-9 0.28 rack rack3
-6 0.28 host cephosd5-dev
4 0.28 osd.4 up 1

SECOND CLUSTER:

health HEALTH_OK
osdmap e158: 10 osds: 10 up, 10 in
pgmap v9724: 2001 pgs, 6 pools, 395 MB data, 139 objects
1192 MB used, 18569 GB / 18571 GB avail
1998 active+clean
3 active+clean+scrubbing

# id weight type name up/down reweight
-1 18.1 root default
-2 9.05 host wn-recas-uniba-30
0 1.81 osd.0 up 1
1 1.81 osd.1 up 1
2 1.81 osd.2 up 1
3 1.81 osd.3 up 1
4 1.81 osd.4 up 1
-3 9.05 host wn-recas-uniba-32
5 1.81 osd.5 up 1
6 1.81 osd.6 up 1
7 1.81 osd.7 up 1
8 1.81 osd.8 up 1
9 1.81 osd.9 up 1
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph 0.80.1 delete/recreate data/metadata pools

2014-05-13 Thread Michael

Hi All,

Seems commit 2adc534a72cc199c8b11dbdf436258cbe147101b has removed the 
ability to delete and recreate the data and metadata pools using osd 
pool delete (Returns Error EBUSY - Is in use by CephFS). Currently have 
no mds running as I'm no longer using CephFS and so it's not mounted 
anywhere either. Any way I can clean out these pools now and reset the 
pgp num etc?


Thanks,
-Michael
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.80.1 delete/recreate data/metadata pools

2014-05-13 Thread Michael
Answered my own question. Created two new pools, used mds newfs on them 
and then deleted the original pools and renamed the new ones.


-Michael

On 13/05/2014 22:20, Michael wrote:

Hi All,

Seems commit 2adc534a72cc199c8b11dbdf436258cbe147101b has removed the 
ability to delete and recreate the data and metadata pools using osd 
pool delete (Returns Error EBUSY - Is in use by CephFS). Currently 
have no mds running as I'm no longer using CephFS and so it's not 
mounted anywhere either. Any way I can clean out these pools now and 
reset the pgp num etc?


Thanks,
-Michael
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate whole clusters

2014-05-13 Thread Gandalf Corvotempesta
2014-05-13 21:21 GMT+02:00 Gregory Farnum :
> You misunderstand. Migrating between machines for incrementally
> upgrading your hardware is normal behavior and well-tested (likewise
> for swapping in all-new hardware, as long as you understand the IO
> requirements involved). So is decommissioning old hardware. But if you
> only care about (for instance, numbers pulled out of thin air) 30GB
> out of 100TB of data in the cluster, it will be *faster* to move only
> the 30GB you care about, instead of rebalancing all the data in the
> cluster across to new machines. :)

The subject of this thread is "migrate WHOLE cluster", so I meant migrating
THE WHOLE CLUSTER, not only a part of it.

If my cluster holds 100TB, I have to migrate 100TB of data.

So, can I manually replace all mons and osds one at a time?
For example: add 1 mon, remove 1 mon, add 1 mon, remove 1 mon and so
on until all mons are replaced.
Then: add 1 osd, wait for rebalance, remove 1 osd, wait for rebalance
and so on until all OSDs are migrated.

This should work with no downtime and no data loss.
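
For the OSD part, a rough sketch of draining one old OSD at a time (the reweight
step size and the health polling are only illustrative, and the final removal
commands assume the daemon on the old host has already been stopped):

import subprocess
import time

def run(*cmd):
    return subprocess.check_output(cmd).decode()

def wait_for_health_ok():
    while "HEALTH_OK" not in run("ceph", "health"):
        time.sleep(60)

def drain_osd(osd_id, current_weight):
    weight = current_weight
    while weight > 0:
        weight = max(0.0, weight - 0.2)            # illustrative step size
        run("ceph", "osd", "crush", "reweight", "osd.%d" % osd_id, str(weight))
        wait_for_health_ok()
    run("ceph", "osd", "out", str(osd_id))
    wait_for_health_ok()
    # after stopping the ceph-osd daemon on the old host:
    run("ceph", "osd", "crush", "remove", "osd.%d" % osd_id)
    run("ceph", "auth", "del", "osd.%d" % osd_id)
    run("ceph", "osd", "rm", str(osd_id))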
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Migrate whole clusters

2014-05-13 Thread Gregory Farnum
Assuming you have the spare throughput-/IOPS for Ceph to do its thing
without disturbing your clients, this will work fine.
-Greg

On Tuesday, May 13, 2014, Gandalf Corvotempesta <
gandalf.corvotempe...@gmail.com> wrote:

> 2014-05-13 21:21 GMT+02:00 Gregory Farnum 
> >:
> > You misunderstand. Migrating between machines for incrementally
> > upgrading your hardware is normal behavior and well-tested (likewise
> > for swapping in all-new hardware, as long as you understand the IO
> > requirements involved). So is decommissioning old hardware. But if you
> > only care about (for instance, numbers pulled out of thin air) 30GB
> > out of 100TB of data in the cluster, it will be *faster* to move only
> > the 30GB you care about, instead of rebalancing all the data in the
> > cluster across to new machines. :)
>
> Subject on this thread is : "migrate WHOLE cluster", so, I meant to migrate
> THE WHOLE CLUSTER not only a part of it.
>
> If my cluster is made by 100TB, I have to migrate 100TB of datas.
>
> So, can I manually replace all mons and osds one per time?
> For example: add 1 mon, remove 1 mon, add 1 mon, remove 1 mon and so
> on until all mons are replace.
> Then: add 1 osd, wait for rebalance, remove 1 osd, wait for rebalance
> and so on ultil all OSD are migrated.
>
> This should work with no downtime and no data loss.
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Craig Lewis

On 5/13/14 05:15 , Christian Balzer wrote:

On Tue, 13 May 2014 22:03:11 +1200 Mark Kirkwood wrote:


On thing that would put me off the 530 is lack on power off safety
(capacitor or similar). Given the job of the journal, I think an SSD
that has some guarantee of write integrity is crucial - so yeah the
DC3500 or DC3700 seem like the best choices.


All my machines have redundant PSUs fed from redundant circuits in very
high end datacenters backed up by the usual gambit of batteries and
diesel monsters.
So while you (and the people who's first comment about RAID controllers is
to mention that one should get a BBU) certainly have a point I'm happily
deploying 530s where they are useful.

If that power should ever fail, I'm most likely buried under a ton of
(optionally radioactive) rubble (Tokyo here) or if I'm lucky just that one
DC is flooded, in which case the data is lost as well. ^o^

Christian



TL;DR: Power outages are more common than your colo facility will admit.


In my 15 years, I've experienced 4 data center power outages in 3 
different facilities.


I don't remember the reason for the first one... maybe a power company 
transformer failure that went way outside of voltage specs? This was in 
a "Tier 1" data-center, the day after a press release bragging about 
their uptime.


For the second one, the facility was doing a pressure test of the Halon 
system.  Some pipes failed and dumped halon, which triggered facility 
power cut.  This was in a "Tier 1" data-center.


The third and forth were the same root cause in the same facility. The 
electrical switch that cuts from utility power to UPS power failed half 
way through the cut over.  While replacing the switch (before everything 
was wired up), a fail-over was accidentally triggered.  This was a "Tier 
2" data-center.



The first two happened before SSDs (although I lost a bunch of HDDs 
during the Halon deployment).  The 3rd power failure had no problems.  
The 4th power failure killed a mirrored pair of Intel 320's that were 
connected to a battery backed RAID controller.  Both lost their sector 
map, and had to be erased using the Intel SSD tool before they could be 
used again.




I'm using the DC3700's for my journals.




--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rados GW Method not allowed

2014-05-13 Thread Andrei Mikhailovsky
Georg, 

I've had similar issues when I had a "+" character in my secret key. Not all 
clients support it. You might need to escape this with \ and see if it works. 
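
A tiny illustration of why a mangled "+" breaks things: if anything in the chain turns 
the "+" in the secret key into a space (URL decoding, shell quoting, a config parser), 
the client signs with a different key than radosgw derives from the stored one, and 
every request comes back 403. The request string below is made up:

import base64
import hashlib
import hmac

string_to_sign = "GET\n\n\nTue, 13 May 2014 12:21:43 GMT\n/bucket/"  # made-up request

def sign(secret):
    mac = hmac.new(secret.encode(), string_to_sign.encode(), hashlib.sha1)
    return base64.b64encode(mac.digest()).decode()

secret = "f2tqIet+LrD0kAXYAUrZXydL+1nsO6Gs+we+94U5"   # the key from the user info quoted below
print(sign(secret))                    # what radosgw expects
print(sign(secret.replace("+", " ")))  # what a '+'-mangling client sends -> mismatch, 403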

Andrei 



- Original Message -

From: "Georg Höllrigl"  
To: ceph-users@lists.ceph.com 
Sent: Tuesday, 13 May, 2014 1:30:14 PM 
Subject: [ceph-users] Rados GW Method not allowed 

Hello, 

System Ubuntu 14.04 
Ceph 0.80 

I'm getting either a 405 Method Not Allowed or a 403 Permission Denied 
from Radosgw. 


Here is what I get from radosgw: 

HTTP/1.1 405 Method Not Allowed 
Date: Tue, 13 May 2014 12:21:43 GMT 
Server: Apache 
Accept-Ranges: bytes 
Content-Length: 82 
Content-Type: application/xml 

MethodNotAllowed 

I can see that the user exists using: 
"radosgw-admin --name client.radosgw.ceph-m-01 metadata list user" 

I can get the credentials via: 

#radosgw-admin user info --uid=test 
{ "user_id": "test", 
"display_name": "test", 
"email": "", 
"suspended": 0, 
"max_buckets": 1000, 
"auid": 0, 
"subusers": [], 
"keys": [ 
{ "user": "test", 
"access_key": "95L2C7BFQ8492LVZ271N", 
"secret_key": "f2tqIet+LrD0kAXYAUrZXydL+1nsO6Gs+we+94U5"}], 
"swift_keys": [], 
"caps": [], 
"op_mask": "read, write, delete", 
"default_placement": "", 
"placement_tags": [], 
"bucket_quota": { "enabled": false, 
"max_size_kb": -1, 
"max_objects": -1}, 
"user_quota": { "enabled": false, 
"max_size_kb": -1, 
"max_objects": -1}, 
"temp_url_keys": []} 

I've also found some hints about a broken redirect in apache - but not 
really a working version. 

Any hints? Any thoughts about how to solve that? Where to get more 
detailed logs, why it's not supporting to create a bucket? 


KInd Regards, 
Georg 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Kyle Bader
> TL;DR: Power outages are more common than your colo facility will admit.

Seconded. I've seen power failures in at least 4 different facilities
and all of them had the usual gamut of batteries/generators/etc. Some
of those facilities I've seen problems multiple times in a single
year. Even a datacenter with five nines power availability is going to
see > 5m of downtime per year, and that would qualify for the highest
rating from the Uptime Institute (Tier IV)! I've lost power to Ceph
clusters on several occasions, in all cases the journals were on
spinning media.

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Journal SSD durability

2014-05-13 Thread Dimitri Maziuk
On 05/13/2014 05:49 PM, Kyle Bader wrote:
>> TL;DR: Power outages are more common than your colo facility will admit.
> 
> Seconded. I've seen power failures in at least 4 different facilities
> and all of them had the usual gamut of batteries/generators/etc. Some
> of those facilities I've seen problems multiple times in a single
> year.

We have (as long as we're swapping horror stories) a building here with
redundant power feeds from 2 different substations. Since you can't have
both actually connected to each other, there is a room with a Very Big
Switch in it. In case of a maintenance shutdown on one side, somebody must
manually throw the switch.

The first time powerco had to do maintenance it turned out nobody there
knew they needed to call the building first. Which was just as well
since nobody in the building knew to take that call. Or was certified to
throw that switch.

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitoring ceph statistics using rados python module

2014-05-13 Thread Craig Lewis

On 5/13/14 09:33 , Adrian Banasiak wrote:
Thanks for the suggestion about the admin daemon, but it looks single-OSD 
oriented. I have used perf dump on the mon socket and it outputs some 
interesting data for monitoring the whole cluster:

{ "cluster": { "num_mon": 4,
  "num_mon_quorum": 4,
  "num_osd": 29,
  "num_osd_up": 29,
  "num_osd_in": 29,
  "osd_epoch": 1872,
  "osd_kb": 20218112516,
  "osd_kb_used": 5022202696,
  "osd_kb_avail": 15195909820,
  "num_pool": 4,
  "num_pg": 3500,
  "num_pg_active_clean": 3500,
  "num_pg_active": 3500,
  "num_pg_peering": 0,
  "num_object": 400746,
  "num_object_degraded": 0,
  "num_object_unfound": 0,
  "num_bytes": 1678788329609,
  "num_mds_up": 0,
  "num_mds_in": 0,
  "num_mds_failed": 0,
  "mds_epoch": 1},

Unfortunately cluster wide IO statistics are still missing.



I'm getting cluster wide OPs and Bandwidth from ceph pg stat -f json.  
I'm using this section:

{
  "pg_stats_delta": {
"stat_sum": {
"num_bytes": 0,
  "num_objects": 31851793,
  "num_object_clones": 0,
  "num_object_copies": 100208267,
  "num_objects_missing_on_primary": 0,
  "num_objects_degraded": 4687903,
  "num_objects_unfound": 0,
  "num_read": 315072058,
  "num_read_kb": 55549447422,
  "num_write": 223701235,
  "num_write_kb": 20457441876,
  "num_scrub_errors": 0,
  "num_shallow_scrub_errors": 0,
  "num_deep_scrub_errors": 0,
  "num_objects_recovered": 74138172,
  "num_bytes_recovered": 62776621391330,
  "num_keys_recovered": 1129447173},
  "stat_cat_sum": {},
  "log_size": 7191821,
  "ondisk_log_size": 7191821},

I'm tracking num_write, num_write_kb, num_read, and num_read_kb, although I see 
some other things that I should be tracking too.


Those values appear to be counters, so you probably want to track the 
change from the previous sample rather than the absolute value.
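
For example, something like this turns two samples into rates (the 
pg_stats_delta/stat_sum key path just mirrors the output pasted above and may 
differ between versions):

import json
import subprocess
import time

def sample():
    out = subprocess.check_output(["ceph", "pg", "stat", "-f", "json"])
    return json.loads(out)["pg_stats_delta"]["stat_sum"]

interval = 10.0
a = sample()
time.sleep(interval)
b = sample()

print("read  ops/s: %.1f" % ((b["num_read"] - a["num_read"]) / interval))
print("write ops/s: %.1f" % ((b["num_write"] - a["num_write"]) / interval))
print("read  MB/s:  %.1f" % ((b["num_read_kb"] - a["num_read_kb"]) / 1024.0 / interval))
print("write MB/s:  %.1f" % ((b["num_write_kb"] - a["num_write_kb"]) / 1024.0 / interval))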



--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Monitoring ceph statistics using rados python module

2014-05-13 Thread Kai Zhang
Hi Adrian,

You may be interested in "rados -p pool_name df --format json"; although it's 
pool oriented, you could probably add the values together :)
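
A sketch of "adding the values together" with the rados bindings Adrian is already 
using: walk every pool and sum its counters into cluster-wide totals (the conffile 
path is an assumption):

import rados

totals = {"num_rd": 0, "num_rd_kb": 0, "num_wr": 0, "num_wr_kb": 0}

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
cluster.connect()
try:
    for pool in cluster.list_pools():
        ioctx = cluster.open_ioctx(pool)
        try:
            stats = ioctx.get_stats()
            for key in totals:
                # get_stats() returns the same counters Adrian's snippet reads
                totals[key] += int(stats.get(key, 0))
        finally:
            ioctx.close()
finally:
    cluster.shutdown()

print(totals)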

Regards,
Kai

On 2014-05-13 08:33:11, "Adrian Banasiak"  wrote:

Thanks for the suggestion about the admin daemon, but it looks single-OSD oriented. I 
have used perf dump on the mon socket and it outputs some interesting data for 
monitoring the whole cluster:
{ "cluster": { "num_mon": 4,
  "num_mon_quorum": 4,
  "num_osd": 29,
  "num_osd_up": 29,
  "num_osd_in": 29,
  "osd_epoch": 1872,
  "osd_kb": 20218112516,
  "osd_kb_used": 5022202696,
  "osd_kb_avail": 15195909820,
  "num_pool": 4,
  "num_pg": 3500,
  "num_pg_active_clean": 3500,
  "num_pg_active": 3500,
  "num_pg_peering": 0,
  "num_object": 400746,
  "num_object_degraded": 0,
  "num_object_unfound": 0,
  "num_bytes": 1678788329609,
  "num_mds_up": 0,
  "num_mds_in": 0,
  "num_mds_failed": 0,
  "mds_epoch": 1},


Unfortunately cluster wide IO statistics are still missing.



2014-05-13 17:17 GMT+02:00 Haomai Wang :
Not sure your demand.

I use "ceph --admin-daemon /var/run/ceph/ceph-osd.x.asok perf dump" to
get the monitor infos. And the result can be parsed by simplejson
easily via python.


On Tue, May 13, 2014 at 10:56 PM, Adrian Banasiak  wrote:
> Hi, I am working with a test Ceph cluster and now I want to implement Zabbix
> monitoring with items such as:
>
> - whole cluster IO (for example ceph -s -> recovery io 143 MB/s, 35
> objects/s)
> - pg statistics
>
> I would like to create a single script in Python to retrieve values using the
> rados python module, but there is only a little information in the documentation
> about module usage. I've created a single function which calculates the current
> read/write statistics for all pools, but I can't find out how to add recovery IO
> usage and pg statistics:
>
> read = 0
> write = 0
> for pool in conn.list_pools():
>     io = conn.open_ioctx(pool)
>     stats[pool] = io.get_stats()
>     read+=int(stats[pool]['num_rd'])
>     write+=int(stats[pool]['num_wr'])
>
> Could someone share their knowledge about the rados module for retrieving ceph
> statistics?
>
> BTW Ceph is awesome!
>
> --
> Best regards, Adrian Banasiak
> email: adr...@banasiak.it
>

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



--
Best Regards,

Wheat






--

Pozdrawiam, Adrian Banasiak
email: adr...@banasiak.it
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Error while initializing OSD directory

2014-05-13 Thread Srinivasa Rao Ragolu
Hi All,

I am following the manual steps to create an OSD node.

While executing the command below, I am facing the following error:

#ceph-osd -i 1 --mkfs --mkkey
2014-05-14 05:04:12.097585 7f91c99007c0 -1  ** ERROR: error creating empty
object store in /var/lib/ceph/osd/-: (2) No such file or directory

But the directory /var/lib/ceph/osd/ceph-1 is already available.

Please help me in fixing this issue.

Thanks,
Srinivas.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] crushmap question

2014-05-13 Thread Cao, Buddy
Thanks so much, Gregory, it solved the problem!


Wei Cao (Buddy)

-Original Message-
From: Gregory Farnum [mailto:g...@inktank.com] 
Sent: Wednesday, May 14, 2014 2:00 AM
To: Cao, Buddy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] crushmap question

You just use a type other than "rack" in your chooseleaf rule. In your case, 
"host". When using chooseleaf, the bucket type you specify is the failure 
domain which it must segregate across.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, May 13, 2014 at 12:52 AM, Cao, Buddy  wrote:
> Hi,
>
>
>
> I have a crushmap structure likes root->rack->host->osds. I designed 
> the rule below, since I used “chooseleaf…rack” in rule definition, if 
> there is only one rack in the cluster, the ceph gps will always stay 
> at stuck unclean state (that is because the default metadata/data/rbd pool 
> set 2 replicas).
> Could you let me know how do I configure the rule to let it can also 
> work in a cluster with only one rack?
>
>
>
> rule ssd{
>
> ruleset 1
>
> type replicated
>
> min_size 0
>
> max_size 10
>
> step take root
>
> step chooseleaf firstn 0 type rack
>
> step emit
>
> }
>
>
>
> BTW, if I add a new rack into the crushmap, the pg status will finally 
> get to active+clean. However, my customer do ONLY have one rack in 
> their env, so hard for me to have workaround to ask him setup several racks.
>
>
>
> Wei Cao (Buddy)
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] crushmap question

2014-05-13 Thread Cao, Buddy
BTW, I'd like to know: after I change "from rack" to "from host", if I add 
more racks with hosts/OSDs to the cluster, will Ceph choose the OSDs for a PG only 
from one zone, or will it randomly choose from several different zones?


Wei Cao (Buddy)

-Original Message-
From: Cao, Buddy 
Sent: Wednesday, May 14, 2014 1:30 PM
To: 'Gregory Farnum'
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] crushmap question

Thanks so much, Gregory, it solved the problem!


Wei Cao (Buddy)

-Original Message-
From: Gregory Farnum [mailto:g...@inktank.com]
Sent: Wednesday, May 14, 2014 2:00 AM
To: Cao, Buddy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] crushmap question

You just use a type other than "rack" in your chooseleaf rule. In your case, 
"host". When using chooseleaf, the bucket type you specify is the failure 
domain which it must segregate across.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, May 13, 2014 at 12:52 AM, Cao, Buddy  wrote:
> Hi,
>
>
>
> I have a crushmap structure likes root->rack->host->osds. I designed 
> the rule below, since I used “chooseleaf…rack” in rule definition, if 
> there is only one rack in the cluster, the ceph gps will always stay 
> at stuck unclean state (that is because the default metadata/data/rbd pool 
> set 2 replicas).
> Could you let me know how do I configure the rule to let it can also 
> work in a cluster with only one rack?
>
>
>
> rule ssd{
>
> ruleset 1
>
> type replicated
>
> min_size 0
>
> max_size 10
>
> step take root
>
> step chooseleaf firstn 0 type rack
>
> step emit
>
> }
>
>
>
> BTW, if I add a new rack into the crushmap, the pg status will finally 
> get to active+clean. However, my customer do ONLY have one rack in 
> their env, so hard for me to have workaround to ask him setup several racks.
>
>
>
> Wei Cao (Buddy)
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com