[ceph-users] [SOLVED] RE: failing on 0.67.1 radosgw install

2013-08-22 Thread Fuchs, Andreas (SwissTXT)
My radosgw is up now.
There were two problems in my config

1.) I forgot to copy the "FastCgiExternalServer /var/www/s3gw.fcgi -socket
/tmp/radosgw.sock" entry from the instructions into my apache config.
2.) I made a mistake in ceph.conf; I entered:
[client.radosgw.gateway]
host = radosgw01, radosgw02
...
in ceph.conf, similar to what is done for mon host or mon initial members,
but the correct form is:
[client.radosgw.gateway]
host = radosgw01
...
[client.radosgw.gateway]
host = radosgw02
...
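
For anyone hitting the same thing: the apache side and ceph.conf also have to agree
on the socket path. A minimal sketch using the paths from this thread (the rest of
the apache vhost is omitted here):

# apache config: the line I had missed (the -socket value must match rgw_socket_path)
FastCgiExternalServer /var/www/s3gw.fcgi -socket /tmp/radosgw.sock

# matching line in each host's [client.radosgw.gateway] section in ceph.conf
rgw_socket_path = /tmp/radosgw.sock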

Now the real fun can start: testing.

Regards
Andi

-Original Message-
From: Fuchs, Andreas (SwissTXT) 
Sent: Dienstag, 20. August 2013 16:34
To: 'ceph-users@lists.ceph.com'
Subject: failing on 0.67.1 radosgw install

Hi

I successfully set up a ceph cluster with 0.67.1; now I am trying to get the radosgw
running on a separate node.

OS = Ubuntu 12.04 LTS
The install seemed successful, but if I try to access the API I see the following
in the logs:

Apache2/error.log

2013-08-20 16:22:17.029064 7f927abdf780 -1 warning: unable to create 
/var/run/ceph: (13) Permission denied
2013-08-20 16:22:17.029343 7f927abdf780 -1 WARNING: libcurl doesn't support 
curl_multi_wait()
2013-08-20 16:22:17.029348 7f927abdf780 -1 WARNING: cross zone / region transfer performance may be affected
[Tue Aug 20 16:22:17 2013] [warn] FastCGI: (dynamic) server "/var/www/s3gw.fcgi" (pid 20793) terminated by calling exit with status '0'

I can fix the /var/run/ceph permission error by running:

sudo mkdir /var/run/ceph
sudo chown www-data /var/run/ceph

but after restarting the services I still get:

[Tue Aug 20 16:24:45 2013] [notice] FastCGI: process manager initialized (pid 24276)
[Tue Aug 20 16:24:45 2013] [notice] Apache/2.2.22 (Ubuntu) mod_fastcgi/mod_fastcgi-SNAP-0910052141 mod_ssl/2.2.22 OpenSSL/1.0.1 configured -- resuming normal operations
[Tue Aug 20 16:25:25 2013] [warn] FastCGI: (dynamic) server "/var/www/s3gw.fcgi" started (pid 24373)
2013-08-20 16:25:25.760621 7f2a63aeb780 -1 WARNING: libcurl doesn't support curl_multi_wait()
2013-08-20 16:25:25.760628 7f2a63aeb780 -1 WARNING: cross zone / region transfer performance may be affected
[Tue Aug 20 16:25:25 2013] [warn] FastCGI: (dynamic) server "/var/www/s3gw.fcgi" (pid 24373) terminated by calling exit with status '0'

Also, over time I accumulate more and more of these processes:

/usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway

Ceph auth list shows

client.radosgw.gateway
key: checked with keyfile and correct
caps: [mon] allow rw
caps: [osd] allow rwx

config is:

[client.radosgw.gateway]
host = radosgw01, radosgw02
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_socket_path = /tmp/radosgw.sock
log_file = /var/log/ceph/radosgw.log
rgw_dns_name = radosgw01.swisstxt.ch


/var/log/ceph/radosgw.log is empty

Any idea what could be wrong?

Regards
Andi

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] (no subject)

2013-08-22 Thread Rong Zhang

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Destroyed Ceph Cluster

2013-08-22 Thread Georg Höllrigl

Thank you - It works now as expected.
I've removed the MDS. As soon as the 2nd osd machine came up, it fixed 
the other errors!?


On 19.08.2013 18:28, Gregory Farnum wrote:

Have you ever used the FS? It's missing an object which we're
intermittently seeing failures to create (on initial setup) when the
cluster is unstable.
If so, clear out the metadata pool and check the docs for "newfs".
-Greg

On Monday, August 19, 2013, Georg Höllrigl wrote:

Hello List,

The trouble with fixing such a cluster continues... I now get output like
this:

# ceph health
HEALTH_WARN 192 pgs degraded; 192 pgs stuck unclean; mds cluster is
degraded; mds vvx-ceph-m-03 is laggy


When checking for the ceph-mds processes, there are now none left...
no matter which server I check. And they won't start up again!?

The log starts up with:
2013-08-19 11:23:30.503214 7f7e9dfbd780  0 ceph version 0.67
(e3b7bc5bce8ab330ec1661381072368af3c218a0), process ceph-mds,
pid 27636
2013-08-19 11:23:30.523314 7f7e9904b700  1 mds.-1.0 handle_mds_map
standby
2013-08-19 11:23:30.529418 7f7e9904b700  1 mds.0.26 handle_mds_map i
am now mds.0.26
2013-08-19 11:23:30.529423 7f7e9904b700  1 mds.0.26 handle_mds_map
state change up:standby --> up:replay
2013-08-19 11:23:30.529426 7f7e9904b700  1 mds.0.26 replay_start
2013-08-19 11:23:30.529434 7f7e9904b700  1 mds.0.26  recovery set is
2013-08-19 11:23:30.529436 7f7e9904b700  1 mds.0.26  need osdmap
epoch 277, have 276
2013-08-19 11:23:30.529438 7f7e9904b700  1 mds.0.26  waiting for
osdmap 277 (which blacklists prior instance)
2013-08-19 11:23:30.534090 7f7e9904b700 -1 mds.0.sessionmap
_load_finish got (2) No such file or directory
2013-08-19 11:23:30.535483 7f7e9904b700 -1 mds/SessionMap.cc: In
function 'void SessionMap::_load_finish(int, ceph::bufferlist&)'
thread 7f7e9904b700 time 2013-08-19 11:23:30.534107
mds/SessionMap.cc: 83: FAILED assert(0 == "failed to load sessionmap")


Anyone an idea how to get the cluster back running?





Georg




On 16.08.2013 16:23, Mark Nelson wrote:

Hi Georg,

I'm not an expert on the monitors, but that's probably where I would
start.  Take a look at your monitor logs and see if you can get
a sense
for why one of your monitors is down.  Some of the other devs will
probably be around later who might know if there are any known issues
with recreating the OSDs and missing PGs.

Mark

On 08/16/2013 08:21 AM, Georg Höllrigl wrote:

Hello,

I'm still evaluating ceph - now a test cluster with the 0.67
dumpling.
I've created the setup with ceph-deploy from GIT.
I've recreated a bunch of OSDs, to give them another journal.
There already was some test data on these OSDs.
I've already recreated the missing PGs with "ceph pg
force_create_pg"


HEALTH_WARN 192 pgs stuck inactive; 192 pgs stuck unclean; 5
requests
are blocked > 32 sec; mds cluster is degraded; 1 mons down,
quorum
0,1,2 vvx-ceph-m-01,vvx-ceph-m-02,vvx-ceph-m-03

Any idea how to fix the cluster, besides completely rebuilding the
cluster from scratch? What if such a thing happens in a production
environment...

The pgs from "ceph pg dump" looks all like creating for some
time now:

2.3d0   0   0   0   0   0   0
creating
   2013-08-16 13:43:08.186537   0'0 0:0 []
  [] 0'0
0.000'0 0.00

Is there a way to just dump the data, that was on the
discarded OSDs?




Kind Regards,
Georg
_
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



_
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


_
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Software Engineer #42 @ http://inktank.com | http://ceph.com


--
Dipl.-Ing. (FH) Georg Höllrigl
Technik



Xidras GmbH
Stockern 47
3744 Stockern
Austria

Tel: +43 (0) 2983 201 - 30505
Fax: +43 (

Re: [ceph-users] Network failure scenarios

2013-08-22 Thread Sage Weil
On Fri, 23 Aug 2013, Keith Phua wrote:
> Hi,
> 
> It was mentioned in the devel mailing list that for 2 networks setup, if 
> the cluster network failed, the cluster behave pretty badly. Ref: 
> http://article.gmane.org/gmane.comp.file-systems.ceph.devel/12285/match=cluster+network+fail
> 
> May I know if this problem still exist in cuttlefish or dumpling?

This is fixed in dumpling.  When an osd is marked down, it verifies that 
it is able to connect to other hosts on both its public and cluster 
network before trying to add itself back into the cluster.
 
> If I have 2 racks of servers in a cluster and a total of 5 mons. Rack1 
> contains 3 mons, 120 osds and rack2 contains 2 mons, 120 osds. In a 2 
> networks setup, May I know what will happen when the following problem 
> occurs:
> 
> 1. Public network links between rack1 and rack2 failed resulting rack1 
> mons uncontactable with rack2 mons. osds of both racks still connected.  
> Will the cluster see it as 2 out of 5 mons failed or 3 out of 5 mons 
> failed?

This is a classic partition.  One rack will see 3 working and 2 failed 
mons, and the cluster will appear "up".  The other rack will see 2 working 
and 3 failed mons, and will be effectively down.

> 2. Cluster network links between rack1 and rack2 failed resulting osds 
> in rack1 and osds in rack2 disconnected as mentioned above.

Here all the mons are available.  OSDs will get marked down by peers in 
the opposite rack because the cluster network link has failed.  They will 
only try to mark themselves back up if they are able to reach 1/3 of their 
peers.  This value is currently hard-coded; we can easily make it tunable.  
(https://github.com/ceph/ceph/pull/533)

> 3. Both network links between rack1 and rack2 failed. Split-brain seems 
> to occur.  Will the cluster halt? Or rack 1 starts to self-healed and 
> replicate data in rack1 since rack1 will have 3 mons out of 5 mons 
> working?

This is really the same as 1.  Only the half with a majority of 
communicating monitors will be 'up'; the other part of the cluster will 
not be allowed to do anything.

sage

> In the above scenarios, all links within the rack are all working.
> 
> Your valuable comments are greatly appreciated.
> 
> Keith
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Network failure scenarios

2013-08-22 Thread Keith Phua
Hi,

It was mentioned in the devel mailing list that for a 2-network setup, if the
cluster network fails, the cluster behaves pretty badly. Ref:
http://article.gmane.org/gmane.comp.file-systems.ceph.devel/12285/match=cluster+network+fail

May I know if this problem still exists in cuttlefish or dumpling?

If I have 2 racks of servers in a cluster and a total of 5 mons:
rack1 contains 3 mons and 120 osds, and rack2 contains 2 mons and 120 osds. In a
2-network setup, may I know what will happen when the following problems occur:

1. Public network links between rack1 and rack2 fail, so rack1 mons cannot
contact rack2 mons; osds of both racks are still connected.  Will the
cluster see it as 2 out of 5 mons failed or 3 out of 5 mons failed?

2. Cluster network links between rack1 and rack2 fail, so osds in rack1
and osds in rack2 are disconnected from each other, as mentioned above.

3. Both network links between rack1 and rack2 fail. Split-brain seems to
occur.  Will the cluster halt? Or will rack1 start to self-heal and replicate
data within rack1, since rack1 will have 3 of the 5 mons working?

In the above scenarios, all links within each rack are working.

Your valuable comments are greatly appreciated.

Keith

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Greg Poirier
On Thu, Aug 22, 2013 at 2:34 PM, Gregory Farnum  wrote:

> You don't appear to have accounted for the 2x replication (where all
>  writes go to two OSDs) in these calculations. I assume your pools have
>

Ah. Right. So I should then be looking at:

# OSDs * Throughput per disk / 2 / repl factor ?

Which makes 300-400 MB/s aggregate throughput actually sort of reasonable.
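
As a rough back-of-the-envelope check with the numbers from this thread (33 OSDs,
~47 MB/s per-OSD bench, journal on the same disks, 2x replication), purely as a
sketch and not a measurement:

# 33 OSDs x ~47 MB/s each, halved once for the co-located journal
# and once more for 2x replication:
echo "33 * 47 / 2 / 2" | bc    # => 387 (MB/s), close to the ~386 MB/s peak above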


> size 2 (or 3?) for these tests. 3 would explain the performance
> difference entirely; 2x replication leaves it still a bit low but
> takes the difference down to ~350/600 instead of ~350/1200. :)
>

Yeah. We're doing 2x repl now, and haven't yet made the decision if we're
going to move to 3x repl or not.


> You mentioned that your average osd bench throughput was ~50MB/s;
> what's the range?


41.9 - 54.7 MB/s

The actual average is 47.1 MB/s


> Have you run any rados bench tests?


Yessir.

rados bench write:

2013-08-23 00:18:51.933594 min lat: 0.071682 max lat: 1.77006 avg lat: 0.196411
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   900  14 73322 73308   325.764   316   0.13978  0.196411
 Total time run: 900.239317
Total writes made:  73322
Write size: 4194304
Bandwidth (MB/sec): 325.789

Stddev Bandwidth:   35.102
Max bandwidth (MB/sec): 440
Min bandwidth (MB/sec): 0
Average Latency:0.196436
Stddev Latency: 0.121463
Max latency:1.77006
Min latency:0.071682

I haven't had any luck with the seq bench. It just errors every time.



> What is your PG count across the cluster?
>

pgmap v18263: 1650 pgs: 1650 active+clean; 946 GB data, 1894 GB used,
28523 GB / 30417 GB avail; 498MB/s wr, 124op/s

Thanks again.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Greg Poirier
I should have also said that we experienced similar performance on
Cuttlefish. I have run identical benchmarks on both.


On Thu, Aug 22, 2013 at 2:23 PM, Oliver Daudey  wrote:

> Hey Greg,
>
> I encountered a similar problem and we're just in the process of
> tracking it down here on the list.  Try downgrading your OSD-binaries to
> 0.61.8 Cuttlefish and re-test.  If it's significantly faster on RBD,
> you're probably experiencing the same problem I have with Dumpling.
>
> PS: Only downgrade your OSDs.  Cuttlefish-monitors don't seem to want to
> start with a database that has been touched by a Dumpling-monitor and
> don't talk to them, either.
>
> PPS: I've also had OSDs no longer start with an assert while processing
> the journal during these upgrade/downgrade-tests, mostly when coming
> down from Dumpling to Cuttlefish.  If you encounter those, delete your
> journal and re-create with `ceph-osd -i  --mkjournal'.  Your
> data-store will be OK, as far as I can tell.
>
>
>Regards,
>
>  Oliver
>
> On Thu, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
> > I have been benchmarking our Ceph installation for the last week or
> > so, and I've come across an issue that I'm having some difficulty
> > with.
> >
> >
> > Ceph bench reports reasonable write throughput at the OSD level:
> >
> >
> > ceph tell osd.0 bench
> > { "bytes_written": 1073741824,
> >   "blocksize": 4194304,
> >   "bytes_per_sec": "47288267.00"}
> >
> >
> > Running this across all OSDs produces on average 50-55 MB/s, which is
> > fine with us. We were expecting around 100 MB/s / 2 (journal and OSD
> > on same disk, separate partitions).
> >
> >
> > What I wasn't expecting was the following:
> >
> >
> > I tested 1, 2, 4, 8, 16, 24, and 32 VMSs simultaneously writing
> > against 33 OSDs. Aggregate write throughput peaked under 400 MB/s:
> >
> >
> > 1  196.013671875
> > 2  285.8759765625
> > 4  351.9169921875
> > 8  386.455078125
> > 16 363.8583984375
> > 24 353.6298828125
> > 32 348.9697265625
> >
> >
> >
> > I was hoping to see something closer to # OSDs * Average value for
> > ceph bench (approximately 1.2 GB/s peak aggregate write throughput).
> >
> >
> > We're seeing excellent read, randread performance, but writes are a
> > bit of a bother.
> >
> >
> > Does anyone have any suggestions?
> >
> >
> > We have 20 Gb/s network
> > I used Fio w/ 16 thread concurrency
> > We're running Scientific Linux 6.4
> > 2.6.32 kernel
> > Ceph Dumpling 0.67.1-0.el6
> > OpenStack Grizzly
> > Libvirt 0.10.2
> > qemu-kvm 0.12.1.2-2.355.el6.2.cuttlefish
> >
> > (I'm using qemu-kvm from the ceph-extras repository, which doesn't
> > appear to have a -.dumpling version yet).
> >
> >
> > Thanks very much for any assistance.
> >
> >
> > Greg
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Oliver Daudey
Hey Greg,

I didn't know that option, but I'm always careful to downgrade and
upgrade the OSDs one by one and wait for the cluster to report healthy
again before proceeding to the next, so, as you said, chances of losing
data should have been minimal.  Will flush the journals too next time.
Thanks!


   Regards,

 Oliver

On Thu, 2013-08-22 at 14:52 -0700, Gregory Farnum wrote:
> On Thu, Aug 22, 2013 at 2:47 PM, Oliver Daudey  wrote:
> > Hey Greg,
> >
> > Thanks for the tip!  I was assuming a clean shutdown of the OSD should
> > flush the journal for you and have the OSD try to exit with it's
> > data-store in a clean state?  Otherwise, I would first have to stop
> > updates a that particular OSD, then flush the journal, then stop it?
> 
> Nope, clean shutdown doesn't force a flush as it could potentially
> block on the filesystem. --flush-journal is a CLI option, so you would
> turn off the OSD, then run it with that option (it won't join the
> cluster or anything, just look at and update local disk state), then
> downgrade the binary.
> In all likelihood this won't have caused you to lose any data because
> in many/most situations the OSD actually will have written out
> everything in the journal to the local FS before you tell it to shut
> down, and as long as one of the other OSDs either did that or turned
> back on without crashing then it will propagate the newer updates to
> everybody. But wiping the journal without flushing is certainly not
> the sort of thing you should get in the habit of doing.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Gregory Farnum
On Thu, Aug 22, 2013 at 2:47 PM, Oliver Daudey  wrote:
> Hey Greg,
>
> Thanks for the tip!  I was assuming a clean shutdown of the OSD should
> flush the journal for you and have the OSD try to exit with it's
> data-store in a clean state?  Otherwise, I would first have to stop
> updates a that particular OSD, then flush the journal, then stop it?

Nope, clean shutdown doesn't force a flush as it could potentially
block on the filesystem. --flush-journal is a CLI option, so you would
turn off the OSD, then run it with that option (it won't join the
cluster or anything, just look at and update local disk state), then
downgrade the binary.
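
Roughly, as a sketch (assuming a sysvinit setup and osd.0 purely for illustration;
adjust the id and init system to your environment):

service ceph stop osd.0            # stop the daemon cleanly
ceph-osd -i 0 --flush-journal      # flush with the dumpling binary before touching anything
# ...downgrade the ceph packages to 0.61.8 here...
service ceph start osd.0           # bring it back up on the older binary
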
In all likelihood this won't have caused you to lose any data because
in many/most situations the OSD actually will have written out
everything in the journal to the local FS before you tell it to shut
down, and as long as one of the other OSDs either did that or turned
back on without crashing then it will propagate the newer updates to
everybody. But wiping the journal without flushing is certainly not
the sort of thing you should get in the habit of doing.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Oliver Daudey
Hey Greg,

Thanks for the tip!  I was assuming a clean shutdown of the OSD should
flush the journal for you and have the OSD try to exit with its
data-store in a clean state?  Otherwise, I would first have to stop
updates to that particular OSD, then flush the journal, then stop it?


   Regards,

  Oliver

On Thu, 2013-08-22 at 14:34 -0700, Gregory Farnum wrote:
> On Thu, Aug 22, 2013 at 2:23 PM, Oliver Daudey  wrote:
> > Hey Greg,
> >
> > I encountered a similar problem and we're just in the process of
> > tracking it down here on the list.  Try downgrading your OSD-binaries to
> > 0.61.8 Cuttlefish and re-test.  If it's significantly faster on RBD,
> > you're probably experiencing the same problem I have with Dumpling.
> >
> > PS: Only downgrade your OSDs.  Cuttlefish-monitors don't seem to want to
> > start with a database that has been touched by a Dumpling-monitor and
> > don't talk to them, either.
> >
> > PPS: I've also had OSDs no longer start with an assert while processing
> > the journal during these upgrade/downgrade-tests, mostly when coming
> > down from Dumpling to Cuttlefish.  If you encounter those, delete your
> > journal and re-create with `ceph-osd -i  --mkjournal'.  Your
> > data-store will be OK, as far as I can tell.
> 
> Careful — deleting the journal is potentially throwing away updates to
> your data store! If this is a problem you should flush the journal
> with the dumpling binary before downgrading.
> 
> >
> >
> >Regards,
> >
> >  Oliver
> >
> On Thu, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
> >> I have been benchmarking our Ceph installation for the last week or
> >> so, and I've come across an issue that I'm having some difficulty
> >> with.
> >>
> >>
> >> Ceph bench reports reasonable write throughput at the OSD level:
> >>
> >>
> >> ceph tell osd.0 bench
> >> { "bytes_written": 1073741824,
> >>   "blocksize": 4194304,
> >>   "bytes_per_sec": "47288267.00"}
> >>
> >>
> >> Running this across all OSDs produces on average 50-55 MB/s, which is
> >> fine with us. We were expecting around 100 MB/s / 2 (journal and OSD
> >> on same disk, separate partitions).
> >>
> >>
> >> What I wasn't expecting was the following:
> >>
> >>
> >> I tested 1, 2, 4, 8, 16, 24, and 32 VMSs simultaneously writing
> >> against 33 OSDs. Aggregate write throughput peaked under 400 MB/s:
> >>
> >>
> >> 1  196.013671875
> >> 2  285.8759765625
> >> 4  351.9169921875
> >> 8  386.455078125
> >> 16 363.8583984375
> >> 24 353.6298828125
> >> 32 348.9697265625
> >>
> >>
> >>
> >> I was hoping to see something closer to # OSDs * Average value for
> >> ceph bench (approximately 1.2 GB/s peak aggregate write throughput).
> >>
> >>
> >> We're seeing excellent read, randread performance, but writes are a
> >> bit of a bother.
> >>
> >>
> >> Does anyone have any suggestions?
> You don't appear to have accounted for the 2x replication (where all
> writes go to two OSDs) in these calculations. I assume your pools have
> size 2 (or 3?) for these tests. 3 would explain the performance
> difference entirely; 2x replication leaves it still a bit low but
> takes the difference down to ~350/600 instead of ~350/1200. :)
> You mentioned that your average osd bench throughput was ~50MB/s;
> what's the range? Have you run any rados bench tests? What is your PG
> count across the cluster?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Gregory Farnum
On Thu, Aug 22, 2013 at 2:23 PM, Oliver Daudey  wrote:
> Hey Greg,
>
> I encountered a similar problem and we're just in the process of
> tracking it down here on the list.  Try downgrading your OSD-binaries to
> 0.61.8 Cuttlefish and re-test.  If it's significantly faster on RBD,
> you're probably experiencing the same problem I have with Dumpling.
>
> PS: Only downgrade your OSDs.  Cuttlefish-monitors don't seem to want to
> start with a database that has been touched by a Dumpling-monitor and
> don't talk to them, either.
>
> PPS: I've also had OSDs no longer start with an assert while processing
> the journal during these upgrade/downgrade-tests, mostly when coming
> down from Dumpling to Cuttlefish.  If you encounter those, delete your
> journal and re-create with `ceph-osd -i  --mkjournal'.  Your
> data-store will be OK, as far as I can tell.

Careful — deleting the journal is potentially throwing away updates to
your data store! If this is a problem you should flush the journal
with the dumpling binary before downgrading.

>
>
>Regards,
>
>  Oliver
>
> On Thu, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
>> I have been benchmarking our Ceph installation for the last week or
>> so, and I've come across an issue that I'm having some difficulty
>> with.
>>
>>
>> Ceph bench reports reasonable write throughput at the OSD level:
>>
>>
>> ceph tell osd.0 bench
>> { "bytes_written": 1073741824,
>>   "blocksize": 4194304,
>>   "bytes_per_sec": "47288267.00"}
>>
>>
>> Running this across all OSDs produces on average 50-55 MB/s, which is
>> fine with us. We were expecting around 100 MB/s / 2 (journal and OSD
>> on same disk, separate partitions).
>>
>>
>> What I wasn't expecting was the following:
>>
>>
>> I tested 1, 2, 4, 8, 16, 24, and 32 VMSs simultaneously writing
>> against 33 OSDs. Aggregate write throughput peaked under 400 MB/s:
>>
>>
>> 1  196.013671875
>> 2  285.8759765625
>> 4  351.9169921875
>> 8  386.455078125
>> 16 363.8583984375
>> 24 353.6298828125
>> 32 348.9697265625
>>
>>
>>
>> I was hoping to see something closer to # OSDs * Average value for
>> ceph bench (approximately 1.2 GB/s peak aggregate write throughput).
>>
>>
>> We're seeing excellent read, randread performance, but writes are a
>> bit of a bother.
>>
>>
>> Does anyone have any suggestions?
You don't appear to have accounted for the 2x replication (where all
writes go to two OSDs) in these calculations. I assume your pools have
size 2 (or 3?) for these tests. 3 would explain the performance
difference entirely; 2x replication leaves it still a bit low but
takes the difference down to ~350/600 instead of ~350/1200. :)
You mentioned that your average osd bench throughput was ~50MB/s;
what's the range? Have you run any rados bench tests? What is your PG
count across the cluster?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Oliver Daudey
Hey Greg,

I encountered a similar problem and we're just in the process of
tracking it down here on the list.  Try downgrading your OSD-binaries to
0.61.8 Cuttlefish and re-test.  If it's significantly faster on RBD,
you're probably experiencing the same problem I have with Dumpling.

PS: Only downgrade your OSDs.  Cuttlefish-monitors don't seem to want to
start with a database that has been touched by a Dumpling-monitor and
don't talk to them, either.

PPS: I've also had OSDs no longer start with an assert while processing
the journal during these upgrade/downgrade-tests, mostly when coming
down from Dumpling to Cuttlefish.  If you encounter those, delete your
journal and re-create with `ceph-osd -i  --mkjournal'.  Your
data-store will be OK, as far as I can tell.


   Regards,

 Oliver

On Thu, 2013-08-22 at 10:55 -0700, Greg Poirier wrote:
> I have been benchmarking our Ceph installation for the last week or
> so, and I've come across an issue that I'm having some difficulty
> with.
> 
> 
> Ceph bench reports reasonable write throughput at the OSD level:
> 
> 
> ceph tell osd.0 bench
> { "bytes_written": 1073741824,
>   "blocksize": 4194304,
>   "bytes_per_sec": "47288267.00"}
> 
> 
> Running this across all OSDs produces on average 50-55 MB/s, which is
> fine with us. We were expecting around 100 MB/s / 2 (journal and OSD
> on same disk, separate partitions).
> 
> 
> What I wasn't expecting was the following:
> 
> 
> I tested 1, 2, 4, 8, 16, 24, and 32 VMSs simultaneously writing
> against 33 OSDs. Aggregate write throughput peaked under 400 MB/s:
> 
> 
> 1  196.013671875
> 2  285.8759765625
> 4  351.9169921875
> 8  386.455078125
> 16 363.8583984375
> 24 353.6298828125
> 32 348.9697265625
> 
> 
> 
> I was hoping to see something closer to # OSDs * Average value for
> ceph bench (approximately 1.2 GB/s peak aggregate write throughput).
> 
> 
> We're seeing excellent read, randread performance, but writes are a
> bit of a bother.
> 
> 
> Does anyone have any suggestions?
> 
> 
> We have 20 Gb/s network
> I used Fio w/ 16 thread concurrency
> We're running Scientific Linux 6.4
> 2.6.32 kernel
> Ceph Dumpling 0.67.1-0.el6
> OpenStack Grizzly
> Libvirt 0.10.2
> qemu-kvm 0.12.1.2-2.355.el6.2.cuttlefish
> 
> (I'm using qemu-kvm from the ceph-extras repository, which doesn't
> appear to have a -.dumpling version yet).
> 
> 
> Thanks very much for any assistance.
> 
> 
> Greg
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Snapshot a KVM VM with RBD backend and libvirt

2013-08-22 Thread Tobias Brunner

Hi,

I'm trying to create a snapshot from a KVM VM:

# virsh snapshot-create one-5
error: unsupported configuration: internal checkpoints require at least 
one disk to be selected for snapshot


RBD should support such snapshots, according to the wiki:
http://ceph.com/w/index.php?title=QEMU-RBD#Snapshotting
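
In the meantime I could of course snapshot at the RBD layer directly (the pool/image
names below are made up, just a sketch of the commands; without quiescing the guest
this is only crash-consistent):

rbd snap create rbd/one-5-disk@before-change
rbd snap ls rbd/one-5-disk
rbd snap rollback rbd/one-5-disk@before-change   # if I ever need to go back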


Some system information:
* Ubuntu 13.04 64Bit
* virsh --version: 1.0.2
* ceph version 0.67.1 (e23b817ad0cf1ea19c0a7b7cb30bed37d533)
* The XML of the VM is below

Thanks for all help...

Cheers,
Tobias


[The domain XML did not survive the list archive (its tags were stripped), so only fragments remain. Recoverable details: domain name one-5, UUID cff38c6a-2996-7709-c4f4-0ed826c2fb02, 524288 KiB of memory, 1 vCPU, emulator /usr/bin/kvm, and RBD-backed disks referencing the libvirt secret usage name libvirt-cff38c6a-2996-7709-c4f4-0ed826c2fb02.]


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Unexpectedly slow write performance (RBD cinder volumes)

2013-08-22 Thread Greg Poirier
I have been benchmarking our Ceph installation for the last week or so, and
I've come across an issue that I'm having some difficulty with.

Ceph bench reports reasonable write throughput at the OSD level:

ceph tell osd.0 bench
{ "bytes_written": 1073741824,
  "blocksize": 4194304,
  "bytes_per_sec": "47288267.00"}

Running this across all OSDs produces on average 50-55 MB/s, which is fine
with us. We were expecting around 100 MB/s / 2 (journal and OSD on same
disk, separate partitions).

What I wasn't expecting was the following:

I tested 1, 2, 4, 8, 16, 24, and 32 VMs simultaneously writing against 33
OSDs. Aggregate write throughput peaked under 400 MB/s:

1  196.013671875
2  285.8759765625
4  351.9169921875
8  386.455078125
16 363.8583984375
24 353.6298828125
32 348.9697265625

I was hoping to see something closer to # OSDs * Average value for ceph
bench (approximately 1.2 GB/s peak aggregate write throughput).

We're seeing excellent read, randread performance, but writes are a bit of
a bother.

Does anyone have any suggestions?

We have 20 Gb/s network
I used Fio w/ 16 thread concurrency
We're running Scientific Linux 6.4
2.6.32 kernel
Ceph Dumpling 0.67.1-0.el6
OpenStack Grizzly
Libvirt 0.10.2
qemu-kvm 0.12.1.2-2.355.el6.2.cuttlefish
(I'm using qemu-kvm from the ceph-extras repository, which doesn't appear
to have a -.dumpling version yet).

Thanks very much for any assistance.

Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failed to create a single mon using "ceph-deploy mon create **"

2013-08-22 Thread Nico Massenberg
Same problem here. Adding the public network parameter to all the ceph.conf
files got me one step further. However, ceph-deploy tells me the mons are
created, but they won't show up in the ceph -w output.
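
For reference, what I mean is something like this sketch in each ceph.conf (the
subnet is just an example, not my real one):

[global]
# example subnet only; use the network your mons actually listen on
public network = 192.168.10.0/24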

On 22.08.2013, at 18:43, Alfredo Deza wrote:

> On Wed, Aug 21, 2013 at 10:05 PM, SOLO  wrote:
>> Hi!
>> 
>> I am trying ceph on RHEL 6.4
>> My ceph version is cuttlefish
>> I followed the intro and ceph-deploy new ..  ceph-deploy instal ..
>> --stable cuttlefish
>> It didn't appear an error until here.
>> And then I typed ceph-deploy mon create ..
>> Here comes the error as bellow
>> 
>> .
>> .
>> .
>> 
>> [ceph@cephadmin my-clusters]$ ceph-deploy mon create cephs1
>> === mon.cephs1 ===
>> Starting Ceph mon.cephs1 on cephs1...
>> failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i cephs1 --pid-file
>> /var/run/ceph/mon.cephs1.pid -c /etc/ceph/ceph.conf '
>> Starting ceph-create-keys on cephs1...
>> Traceback (most recent call last):
>>  File "/usr/bin/ceph-deploy", line 21, in 
>>main()
>>  File "/usr/lib/python2.6/site-packages/ceph_deploy/cli.py", line 112, in
>> main
>>return args.func(args)
>>  File "/usr/lib/python2.6/site-packages/ceph_deploy/mon.py", line 234, in
>> mon
>>mon_create(args)
>>  File "/usr/lib/python2.6/site-packages/ceph_deploy/mon.py", line 138, in
>> mon_create
>>init=init,
>>  File "/usr/lib/python2.6/site-packages/pushy/protocol/proxy.py", line 255,
>> in 
>>(conn.operator(type_, self, args, kwargs))
>>  File "/usr/lib/python2.6/site-packages/pushy/protocol/connection.py", line
>> 66, in operator
>>return self.send_request(type_, (object, args, kwargs))
>>  File "/usr/lib/python2.6/site-packages/pushy/protocol/baseconnection.py",
>> line 329, in send_request
>>return self.__handle(m)
>>  File "/usr/lib/python2.6/site-packages/pushy/protocol/baseconnection.py",
>> line 645, in __handle
>>raise e
>> pushy.protocol.proxy.ExceptionProxy: Command '['service', 'ceph', 'start',
>> 'mon.cephs1']' returned non-zero exit status 1
>> 
>> .
> What happens when you try both failing commands directly on that host?
> 
> First the `sudo service ceph start mon.cephfs1` and then
> 
> `ulimit -n 8192;  /usr/bin/ceph-mon -i cephs1 --pid-file
> /var/run/ceph/mon.cephs1.pid -c /etc/ceph/ceph.conf `
> 
> 
> 
>> .
>> .
>> And I check my /etc/sudoers.d/ceph and my /etc/sudoers is as bellow
>> ..
>> ## Allow root to run any commands anywhere
>> rootALL=(ALL)   ALL
>> cephALL=(ALL)   ALL
>> ## Allows members of the 'sys' group to run networking, software,
>> ## service management apps and more.
>> # %sys ALL = NETWORKING, SOFTWARE, SERVICES, STORAGE, DELEGATING, PROCESSES,
>> LOCATE, DRIVERS
>> 
>> ## Allows people in group wheel to run all commands
>> # %wheelALL=(ALL)   ALL
>> 
>> ## Same thing without a password
>> %ceph  ALL=(ALL)   NOPASSWD: ALL
>> ..
>> ## Read drop-in files from /etc/sudoers.d (the # here does not mean a
>> comment)
>> #includedir /etc/sudoers.d
>> .
>> .
>> .
>> 
>> What shall I do now?
>> THX
>> FingerLiu
>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Failed to create a single mon using "ceph-deploy mon create **"

2013-08-22 Thread Alfredo Deza
On Wed, Aug 21, 2013 at 10:05 PM, SOLO  wrote:
> Hi!
>
> I am trying ceph on RHEL 6.4
> My ceph version is cuttlefish
> I followed the intro and ceph-deploy new ..  ceph-deploy instal ..
> --stable cuttlefish
> It didn't appear an error until here.
> And then I typed ceph-deploy mon create ..
> Here comes the error as bellow
>
> .
> .
> .
>
> [ceph@cephadmin my-clusters]$ ceph-deploy mon create cephs1
> === mon.cephs1 ===
> Starting Ceph mon.cephs1 on cephs1...
> failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i cephs1 --pid-file
> /var/run/ceph/mon.cephs1.pid -c /etc/ceph/ceph.conf '
> Starting ceph-create-keys on cephs1...
> Traceback (most recent call last):
>   File "/usr/bin/ceph-deploy", line 21, in 
> main()
>   File "/usr/lib/python2.6/site-packages/ceph_deploy/cli.py", line 112, in
> main
> return args.func(args)
>   File "/usr/lib/python2.6/site-packages/ceph_deploy/mon.py", line 234, in
> mon
> mon_create(args)
>   File "/usr/lib/python2.6/site-packages/ceph_deploy/mon.py", line 138, in
> mon_create
> init=init,
>   File "/usr/lib/python2.6/site-packages/pushy/protocol/proxy.py", line 255,
> in 
> (conn.operator(type_, self, args, kwargs))
>   File "/usr/lib/python2.6/site-packages/pushy/protocol/connection.py", line
> 66, in operator
> return self.send_request(type_, (object, args, kwargs))
>   File "/usr/lib/python2.6/site-packages/pushy/protocol/baseconnection.py",
> line 329, in send_request
> return self.__handle(m)
>   File "/usr/lib/python2.6/site-packages/pushy/protocol/baseconnection.py",
> line 645, in __handle
> raise e
> pushy.protocol.proxy.ExceptionProxy: Command '['service', 'ceph', 'start',
> 'mon.cephs1']' returned non-zero exit status 1
>
> .
What happens when you try both failing commands directly on that host?

First the `sudo service ceph start mon.cephfs1` and then

`ulimit -n 8192;  /usr/bin/ceph-mon -i cephs1 --pid-file
/var/run/ceph/mon.cephs1.pid -c /etc/ceph/ceph.conf `



> .
> .
>  And I check my /etc/sudoers.d/ceph and my /etc/sudoers is as bellow
> ..
> ## Allow root to run any commands anywhere
> rootALL=(ALL)   ALL
> cephALL=(ALL)   ALL
> ## Allows members of the 'sys' group to run networking, software,
> ## service management apps and more.
> # %sys ALL = NETWORKING, SOFTWARE, SERVICES, STORAGE, DELEGATING, PROCESSES,
> LOCATE, DRIVERS
>
> ## Allows people in group wheel to run all commands
> # %wheelALL=(ALL)   ALL
>
> ## Same thing without a password
>  %ceph  ALL=(ALL)   NOPASSWD: ALL
> ..
> ## Read drop-in files from /etc/sudoers.d (the # here does not mean a
> comment)
> #includedir /etc/sudoers.d
> .
> .
> .
>
> What shall I do now?
> THX
> FingerLiu
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD hole punching

2013-08-22 Thread Michael Lowe
I use the virtio-scsi driver.

On Aug 22, 2013, at 12:05 PM, David Blundell  
wrote:

>> I see yet another caveat: According to that documentation, it only works with
>> the IDE driver, not with virtio.
>> 
>>Guido
> 
> I've just been looking into this but have not yet tested.  It looks like 
> discard is supported in the newer virtio-scsi devices but not virtio-blk.
> 
> This Sheepdog page has a little more information on qemu discard 
> https://github.com/sheepdog/sheepdog/wiki/Discard-Support
> 
> David
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD hole punching

2013-08-22 Thread David Blundell
> I see yet another caveat: According to that documentation, it only works with
> the IDE driver, not with virtio.
> 
>   Guido

I've just been looking into this but have not yet tested.  It looks like 
discard is supported in the newer virtio-scsi devices but not virtio-blk.

This Sheepdog page has a little more information on qemu discard 
https://github.com/sheepdog/sheepdog/wiki/Discard-Support
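
If it does work, the libvirt side would presumably look something like this sketch
(untested, names illustrative; the RBD auth and monitor host elements are omitted):

<controller type='scsi' model='virtio-scsi'/>
<disk type='network' device='disk'>
  <!-- discard='unmap' passes TRIM through; needs a new enough libvirt/qemu -->
  <driver name='qemu' type='raw' cache='writeback' discard='unmap'/>
  <source protocol='rbd' name='rbd/myimage'/>
  <target dev='sda' bus='scsi'/>
</disk>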

David
 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NFS vs. CephFS for /var/lib/nova/instances

2013-08-22 Thread Gregory Farnum
On Thursday, August 22, 2013, Amit Vijairania wrote:

> Hello!
>
> We, in our environment, need a shared file system for
> /var/lib/nova/instances and Glance image cache (_base)..
>
> Is anyone using CephFS for this purpose?
> When folks say CephFS is not production ready, is the primary concern
> stability/data-integrity or performance?
> Is NFS (with NFS-Ganesha) is better solution?  Is anyone using it today?
>

Our primary concern about CephFS is its stability; there are a couple
of important known bugs and it has yet to see the strong QA that would
qualify it for general production use. Storing VM images is one of the use
cases it might be okay for, but:
Why not use RBD? It sounds like that's what you want, and RBD is
purpose-built for managing VM images and volumes!
-Greg



>
> Please let us know..
>
> Thanks!
> Amit
>
> Amit Vijairania  |  978.319.3684
> --*--
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-22 Thread Mike Dawson
Jumping in pretty late on this thread, but I can confirm much higher CPU 
load on ceph-osd using 0.67.1 compared to 0.61.7 under a write-heavy RBD 
workload. Under my workload, it seems like it might be 2x-5x higher CPU 
load per process.


Thanks,
Mike Dawson


On 8/22/2013 4:41 AM, Oliver Daudey wrote:

Hey Samuel,

On Wed, 2013-08-21 at 20:27 -0700, Samuel Just wrote:

I think the rbd cache one you'd need to run for a few minutes to get
meaningful results.  It should stabilize somewhere around the actual
throughput of your hardware.


Ok, I now also ran this test on Cuttlefish as well as Dumpling.

Cuttlefish:
# rbd bench-write test --rbd-cache
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
   SEC   OPS   OPS/SEC   BYTES/SEC
 1 13265  13252.45  3466029.67
 2 25956  12975.60  3589479.95
 3 38475  12818.61  3598590.70
 4 50184  12545.16  3530516.34
 5 59263  11852.22  3292258.13
<...>
   300   3421530  11405.08  3191555.35
   301   3430755  11397.83  3189251.09
   302   3443345  11401.73  3190694.98
   303   3455230  11403.37  3191478.97
   304   3467014  11404.62  3192136.82
   305   3475355  11394.57  3189525.71
   306   3488067  11398.90  3190553.96
   307   3499789  11399.96  3190770.21
   308   3510566  11397.93  3190289.49
   309   3519829  11390.98  3188620.93
   310   3532539  11395.25  3189544.03

Dumpling:
# rbd bench-write test --rbd-cache
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
   SEC   OPS   OPS/SEC   BYTES/SEC
 1 13201  13194.63  3353004.50
 2 25926  12957.05  3379695.03
 3 36624  12206.06  3182087.11
 4 46547  11635.35  3035794.95
 5 59290  11856.27  3090389.79
<...>
   300   3405215  11350.66  3130092.00
   301   3417789  11354.76  3131106.34
   302   3430067  11357.83  3131933.41
   303   3438792  11349.14  3129734.88
   304   3450237  11349.45  3129689.62
   305   3462840  11353.53  3130406.43
   306   3473151  11350.17  3128942.32
   307   3482327  11343.00  3126771.34
   308   3495020  11347.44  3127502.07
   309   3506894  11349.13  3127781.70
   310   3516532  11343.65  3126714.62

As you can see, the result is virtually identical.  What jumps out
during the cached tests, is that the CPU used by the OSDs is negligible
in both cases, while without caching, the OSDs get loaded quite well.
Perhaps the cache masks the problem we're seeing in Dumpling somehow?
And I'm not changing anything but the OSD-binary during my tests, so
cache-settings used in VMs are identical in both scenarios.



Hmm, 10k ios I guess is only 10 rbd chunks.  What replication level
are you using?  Try setting them to 1000 (you only need to set the
xfs ones).

For the rand test, try increasing
filestore_wbthrottle_xfs_inodes_hard_limit and
filestore_wbthrottle_xfs_inodes_start_flusher to 1 as well as
setting the above ios limits.


Ok, my current config:
 filestore wbthrottle xfs ios start flusher = 1000
 filestore wbthrottle xfs ios hard limit = 1000
 filestore wbthrottle xfs inodes hard limit = 1
 filestore wbthrottle xfs inodes start flusher = 1

Unfortunately, that still makes no difference at all in the original
standard-tests.

Random IO on Dumpling, after 120 secs of runtime:
# rbd bench-write test --io-pattern=rand
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
   SEC   OPS   OPS/SEC   BYTES/SEC
      1     545    534.98  1515804.02
      2    1162    580.80  1662416.60
      3    1731    576.52  1662966.61
      4    2317    579.04  1695129.94
      5    2817    562.56  1672754.87
<...>
    120   43564    362.91  1080512.00
    121   43774    361.76  1077368.28
    122   44419    364.06  1083894.31
    123   45046    366.22  1090518.68
    124   45287    364.01  1084437.37
    125   45334    361.54  1077035.12
    126   45336    359.40  1070678.36
    127   45797    360.60  1073985.78
    128   46388    362.40  1080056.75
    129   46984    364.21  1086068.63
    130   47604    366.11  1092712.51

Random IO on Cuttlefish, after 120 secs of runtime:
rbd bench-write test --io-pattern=rand
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
   SEC   OPS   OPS/SEC   BYTES/SEC
 1  1066   1065.54  3115713.13
 2  2099   1049.31  2936300.53
 3  3218   1072.32  3028707.50
 4  4026   1003.23  2807859.15
      5    4272    793.80  2226962.63
<...>
    120   66935    557.79  1612483.74
    121   68011    562.01  1625419.34
    122   68428    558.59  1615376.62
    123   68579    557.06  1610780.38
    125   68777    549.73  1589816.94
    126   69745    553.52  1601671.46
    127   70855    557.91  1614293.12
    128   71962    562.20  1627070.81
    129   72529    562.22  1627120.59
    130   73146    562.66  1628818.79

Confirming your setting took properly:
# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config sho

Re: [ceph-users] RBD hole punching

2013-08-22 Thread Guido Winkelmann
On Thursday, 22 August 2013, 10:32:30, Mike Lowe wrote:
> There is TRIM/discard support and I use it with some success. There are some
> details here http://ceph.com/docs/master/rbd/qemu-rbd/  The one caveat I
> have is that I've sometimes been able to crash an osd by doing fstrim
> inside a guest.

I see yet another caveat: According to that documentation, it only works with 
the IDE driver, not with virtio.

Guido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-22 Thread Mark Nelson
For what it's worth, I was still seeing some small sequential write 
degradation with kernel RBD with dumpling, though random writes were not 
consistently slower in the testing I did.  There was also some variation 
in performance between 0.61.2 and 0.61.7 likely due to the workaround we 
had to implement for XFS.


Mark

On 08/22/2013 10:23 AM, Sage Weil wrote:

We should perhaps hack the old (cuttlefish and earlier) flushing behavior
into the new code so that we can confirm that it is really the writeback
that is causing the problem and not something else...

sage

On Thu, 22 Aug 2013, Oliver Daudey wrote:


Hey Samuel,

On Wed, 2013-08-21 at 20:27 -0700, Samuel Just wrote:

I think the rbd cache one you'd need to run for a few minutes to get
meaningful results.  It should stabilize somewhere around the actual
throughput of your hardware.


Ok, I now also ran this test on Cuttlefish as well as Dumpling.

Cuttlefish:
# rbd bench-write test --rbd-cache
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
   SEC   OPS   OPS/SEC   BYTES/SEC
 1 13265  13252.45  3466029.67
 2 25956  12975.60  3589479.95
 3 38475  12818.61  3598590.70
 4 50184  12545.16  3530516.34
 5 59263  11852.22  3292258.13
<...>
   300   3421530  11405.08  3191555.35
   301   3430755  11397.83  3189251.09
   302   3443345  11401.73  3190694.98
   303   3455230  11403.37  3191478.97
   304   3467014  11404.62  3192136.82
   305   3475355  11394.57  3189525.71
   306   3488067  11398.90  3190553.96
   307   3499789  11399.96  3190770.21
   308   3510566  11397.93  3190289.49
   309   3519829  11390.98  3188620.93
   310   3532539  11395.25  3189544.03

Dumpling:
# rbd bench-write test --rbd-cache
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
   SEC   OPS   OPS/SEC   BYTES/SEC
 1 13201  13194.63  3353004.50
 2 25926  12957.05  3379695.03
 3 36624  12206.06  3182087.11
 4 46547  11635.35  3035794.95
 5 59290  11856.27  3090389.79
<...>
   300   3405215  11350.66  3130092.00
   301   3417789  11354.76  3131106.34
   302   3430067  11357.83  3131933.41
   303   3438792  11349.14  3129734.88
   304   3450237  11349.45  3129689.62
   305   3462840  11353.53  3130406.43
   306   3473151  11350.17  3128942.32
   307   3482327  11343.00  3126771.34
   308   3495020  11347.44  3127502.07
   309   3506894  11349.13  3127781.70
   310   3516532  11343.65  3126714.62

As you can see, the result is virtually identical.  What jumps out
during the cached tests, is that the CPU used by the OSDs is negligible
in both cases, while without caching, the OSDs get loaded quite well.
Perhaps the cache masks the problem we're seeing in Dumpling somehow?
And I'm not changing anything but the OSD-binary during my tests, so
cache-settings used in VMs are identical in both scenarios.



Hmm, 10k ios I guess is only 10 rbd chunks.  What replication level
are you using?  Try setting them to 1000 (you only need to set the
xfs ones).

For the rand test, try increasing
filestore_wbthrottle_xfs_inodes_hard_limit and
filestore_wbthrottle_xfs_inodes_start_flusher to 1 as well as
setting the above ios limits.


Ok, my current config:
 filestore wbthrottle xfs ios start flusher = 1000
 filestore wbthrottle xfs ios hard limit = 1000
 filestore wbthrottle xfs inodes hard limit = 1
 filestore wbthrottle xfs inodes start flusher = 1

Unfortunately, that still makes no difference at all in the original
standard-tests.

Random IO on Dumpling, after 120 secs of runtime:
# rbd bench-write test --io-pattern=rand
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
   SEC   OPS   OPS/SEC   BYTES/SEC
      1     545    534.98  1515804.02
      2    1162    580.80  1662416.60
      3    1731    576.52  1662966.61
      4    2317    579.04  1695129.94
      5    2817    562.56  1672754.87
<...>
    120   43564    362.91  1080512.00
    121   43774    361.76  1077368.28
    122   44419    364.06  1083894.31
    123   45046    366.22  1090518.68
    124   45287    364.01  1084437.37
    125   45334    361.54  1077035.12
    126   45336    359.40  1070678.36
    127   45797    360.60  1073985.78
    128   46388    362.40  1080056.75
    129   46984    364.21  1086068.63
    130   47604    366.11  1092712.51

Random IO on Cuttlefish, after 120 secs of runtime:
rbd bench-write test --io-pattern=rand
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
   SEC   OPS   OPS/SEC   BYTES/SEC
 1  1066   1065.54  3115713.13
 2  2099   1049.31  2936300.53
 3  3218   1072.32  3028707.50
 4  4026   1003.23  2807859.15
      5    4272    793.80  2226962.63
<...>
    120   66935    557.79  1612483.74
    121   68011    562.01  1625419.34
    122   68428    558.59  1615376.62
    123   68579    557.06  1610780.38
    125   68777    549.

Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-22 Thread Sage Weil
We should perhaps hack the old (cuttlefish and earlier) flushing behavior 
into the new code so that we can confirm that it is really the writeback 
that is causing the problem and not something else...

sage

On Thu, 22 Aug 2013, Oliver Daudey wrote:

> Hey Samuel,
> 
> On Wed, 2013-08-21 at 20:27 -0700, Samuel Just wrote:
> > I think the rbd cache one you'd need to run for a few minutes to get
> > meaningful results.  It should stabilize somewhere around the actual
> > throughput of your hardware.
> 
> Ok, I now also ran this test on Cuttlefish as well as Dumpling.
> 
> Cuttlefish:
> # rbd bench-write test --rbd-cache
> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
>   SEC   OPS   OPS/SEC   BYTES/SEC
> 1 13265  13252.45  3466029.67
> 2 25956  12975.60  3589479.95
> 3 38475  12818.61  3598590.70
> 4 50184  12545.16  3530516.34
> 5 59263  11852.22  3292258.13
> <...>
>   300   3421530  11405.08  3191555.35
>   301   3430755  11397.83  3189251.09
>   302   3443345  11401.73  3190694.98
>   303   3455230  11403.37  3191478.97
>   304   3467014  11404.62  3192136.82
>   305   3475355  11394.57  3189525.71
>   306   3488067  11398.90  3190553.96
>   307   3499789  11399.96  3190770.21
>   308   3510566  11397.93  3190289.49
>   309   3519829  11390.98  3188620.93
>   310   3532539  11395.25  3189544.03
> 
> Dumpling:
> # rbd bench-write test --rbd-cache
> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
>   SEC   OPS   OPS/SEC   BYTES/SEC
> 1 13201  13194.63  3353004.50
> 2 25926  12957.05  3379695.03
> 3 36624  12206.06  3182087.11
> 4 46547  11635.35  3035794.95
> 5 59290  11856.27  3090389.79
> <...>
>   300   3405215  11350.66  3130092.00
>   301   3417789  11354.76  3131106.34
>   302   3430067  11357.83  3131933.41
>   303   3438792  11349.14  3129734.88
>   304   3450237  11349.45  3129689.62
>   305   3462840  11353.53  3130406.43
>   306   3473151  11350.17  3128942.32
>   307   3482327  11343.00  3126771.34
>   308   3495020  11347.44  3127502.07
>   309   3506894  11349.13  3127781.70
>   310   3516532  11343.65  3126714.62
> 
> As you can see, the result is virtually identical.  What jumps out
> during the cached tests, is that the CPU used by the OSDs is negligible
> in both cases, while without caching, the OSDs get loaded quite well.
> Perhaps the cache masks the problem we're seeing in Dumpling somehow?
> And I'm not changing anything but the OSD-binary during my tests, so
> cache-settings used in VMs are identical in both scenarios.
> 
> > 
> > Hmm, 10k ios I guess is only 10 rbd chunks.  What replication level
> > are you using?  Try setting them to 1000 (you only need to set the
> > xfs ones).
> > 
> > For the rand test, try increasing
> > filestore_wbthrottle_xfs_inodes_hard_limit and
> > filestore_wbthrottle_xfs_inodes_start_flusher to 1 as well as
> > setting the above ios limits.
> 
> Ok, my current config:
> filestore wbthrottle xfs ios start flusher = 1000
> filestore wbthrottle xfs ios hard limit = 1000
> filestore wbthrottle xfs inodes hard limit = 1
> filestore wbthrottle xfs inodes start flusher = 1
> 
> Unfortunately, that still makes no difference at all in the original
> standard-tests.
> 
> Random IO on Dumpling, after 120 secs of runtime:
> # rbd bench-write test --io-pattern=rand
> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
>   SEC   OPS   OPS/SEC   BYTES/SEC
>      1     545    534.98  1515804.02
>      2    1162    580.80  1662416.60
>      3    1731    576.52  1662966.61
>      4    2317    579.04  1695129.94
>      5    2817    562.56  1672754.87
> <...>
>    120   43564    362.91  1080512.00
>    121   43774    361.76  1077368.28
>    122   44419    364.06  1083894.31
>    123   45046    366.22  1090518.68
>    124   45287    364.01  1084437.37
>    125   45334    361.54  1077035.12
>    126   45336    359.40  1070678.36
>    127   45797    360.60  1073985.78
>    128   46388    362.40  1080056.75
>    129   46984    364.21  1086068.63
>    130   47604    366.11  1092712.51
> 
> Random IO on Cuttlefish, after 120 secs of runtime:
> rbd bench-write test --io-pattern=rand
> bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
>   SEC   OPS   OPS/SEC   BYTES/SEC
> 1  1066   1065.54  3115713.13
> 2  2099   1049.31  2936300.53
> 3  3218   1072.32  3028707.50
> 4  4026   1003.23  2807859.15
>      5    4272    793.80  2226962.63
> <...>
>    120   66935    557.79  1612483.74
>    121   68011    562.01  1625419.34
>    122   68428    558.59  1615376.62
>    123   68579    557.06  1610780.38
>    125   68777    549.73  1589816.94
>    126   69745    553.52  1601671.46
>    127   70855    557.91  1614293.12
>    128   71962    562.20  1627070.81
>    129   72529    562.22  1627120.59
>   1

[ceph-users] rbd in centos6.4

2013-08-22 Thread raj kumar
The ceph cluster is running fine on CentOS 6.4.

Now I would like to export a block device to a client using rbd.

My questions are:

1. I tried to modprobe rbd on one of the monitor hosts, but I got the error:
   FATAL: Module rbd not found
   I could not find the rbd module. How can I fix this?

2. Once the rbd image is created, do we need to create an iSCSI target on one of the
monitor hosts and present the LUN to the client? If so, what happens if that monitor
host goes down? What is the best practice for providing a LUN to clients?

thanks
Raj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Failed to create a single mon using "ceph-deploy mon create **"

2013-08-22 Thread SOLO
Hi!


I am trying ceph on RHEL 6.4.
My ceph version is cuttlefish.
I followed the intro: ceph-deploy new .., then ceph-deploy install ..
--stable cuttlefish.
No error appeared up to this point.
Then I typed ceph-deploy mon create ..
and the error below came up:


.
.
.


[ceph@cephadmin my-clusters]$ ceph-deploy mon create cephs1
=== mon.cephs1 ===
Starting Ceph mon.cephs1 on cephs1...
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i cephs1 --pid-file 
/var/run/ceph/mon.cephs1.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on cephs1...
Traceback (most recent call last):
  File "/usr/bin/ceph-deploy", line 21, in 
main()
  File "/usr/lib/python2.6/site-packages/ceph_deploy/cli.py", line 112, in main
return args.func(args)
  File "/usr/lib/python2.6/site-packages/ceph_deploy/mon.py", line 234, in mon
mon_create(args)
  File "/usr/lib/python2.6/site-packages/ceph_deploy/mon.py", line 138, in 
mon_create
init=init,
  File "/usr/lib/python2.6/site-packages/pushy/protocol/proxy.py", line 255, in 

(conn.operator(type_, self, args, kwargs))
  File "/usr/lib/python2.6/site-packages/pushy/protocol/connection.py", line 
66, in operator
return self.send_request(type_, (object, args, kwargs))
  File "/usr/lib/python2.6/site-packages/pushy/protocol/baseconnection.py", 
line 329, in send_request
return self.__handle(m)
  File "/usr/lib/python2.6/site-packages/pushy/protocol/baseconnection.py", 
line 645, in __handle
raise e
pushy.protocol.proxy.ExceptionProxy: Command '['service', 'ceph', 'start', 
'mon.cephs1']' returned non-zero exit status 1
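
(One way to surface the real error hiding behind the generic "exit status 1" above,
not part of the original message, is to run the monitor start command by hand in the
foreground; -d keeps ceph-mon attached to the terminal and sends its log to stderr.)

sudo /usr/bin/ceph-mon -i cephs1 -c /etc/ceph/ceph.conf -d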



.
.
.
 And I checked my /etc/sudoers.d/ceph, and my /etc/sudoers is as below:
..
## Allow root to run any commands anywhere
root    ALL=(ALL)   ALL
ceph    ALL=(ALL)   ALL
## Allows members of the 'sys' group to run networking, software,
## service management apps and more.
# %sys ALL = NETWORKING, SOFTWARE, SERVICES, STORAGE, DELEGATING, PROCESSES, 
LOCATE, DRIVERS


## Allows people in group wheel to run all commands
# %wheel    ALL=(ALL)   ALL


## Same thing without a password
 %ceph  ALL=(ALL)   NOPASSWD: ALL

..
## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

.
.
.


What shall I do now?
THX
FingerLiu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re: Failed to create a single mon using "ceph-deploy mon create **"?

2013-08-22 Thread SOLO
And here is my ceph.log
.
.
.
[ceph@cephadmin my-clusters]$ less ceph.log
2013-08-22 09:01:27,375 ceph_deploy.new DEBUG Creating new cluster named ceph
2013-08-22 09:01:27,375 ceph_deploy.new DEBUG Resolving host cephs1
2013-08-22 09:01:27,382 ceph_deploy.new DEBUG Monitor cephs1 at 10.2.9.223
2013-08-22 09:01:27,382 ceph_deploy.new DEBUG Monitor initial members are 
['cephs1']
2013-08-22 09:01:27,382 ceph_deploy.new DEBUG Monitor addrs are ['10.2.9.223']
2013-08-22 09:01:27,382 ceph_deploy.new DEBUG Creating a random mon key...
2013-08-22 09:01:27,383 ceph_deploy.new DEBUG Writing initial config to 
ceph.conf...
2013-08-22 09:01:27,383 ceph_deploy.new DEBUG Writing monitor keyring to 
ceph.conf...
2013-08-22 09:02:02,926 ceph_deploy.install DEBUG Installing stable version 
cuttlefish on cluster ceph hosts cephs1
2013-08-22 09:02:02,926 ceph_deploy.install DEBUG Detecting platform for host 
cephs1 ...
2013-08-22 09:02:03,916 ceph_deploy.install DEBUG Distro RedHatEnterpriseServer 
release 6.4 codename Santiago
2013-08-22 09:02:03,925 ceph_deploy.install DEBUG Installing on host cephs1 ...
2013-08-22 09:04:13,713 ceph_deploy.install DEBUG Installing stable version 
cuttlefish on cluster ceph hosts cephs1
2013-08-22 09:04:13,714 ceph_deploy.install DEBUG Detecting platform for host 
cephs1 ...
2013-08-22 09:04:14,575 ceph_deploy.install DEBUG Distro RedHatEnterpriseServer 
release 6.4 codename Santiago
2013-08-22 09:04:13,714 ceph_deploy.install DEBUG Detecting platform for host 
cephs1 ...
2013-08-22 09:04:14,575 ceph_deploy.install DEBUG Distro RedHatEnterpriseServer 
release 6.4 codename
 Santiago
2013-08-22 09:04:14,582 ceph_deploy.install DEBUG Installing on host cephs1 ...
2013-08-22 09:04:57,044 ceph_deploy.install DEBUG Installing stable version 
cuttlefish on cluster ceph hosts cephs2
2013-08-22 09:04:57,045 ceph_deploy.install DEBUG Detecting platform for host 
cephs2 ...
2013-08-22 09:04:57,595 ceph_deploy.install DEBUG Distro RedHatEnterpriseServer 
release 6.4 codename Santiago
2013-08-22 09:04:57,603 ceph_deploy.install DEBUG Installing on host cephs2 ...
2013-08-22 09:07:04,398 ceph_deploy.mon DEBUG Deploying mon, cluster ceph hosts 
cephs1
2013-08-22 09:07:04,399 ceph_deploy.mon DEBUG Deploying mon to cephs1
2013-08-22 09:07:05,360 ceph_deploy.mon DEBUG Distro RedHatEnterpriseServer 
codename Santiago, will use sysvinit



.
.
.






------------------ Original Message ------------------
From: "SOLO";
Date: Thursday, August 22, 2013, 10:05 AM
To: "ceph-users";

Subject: Failed to create a single mon using "ceph-deploy mon create **"?



Hi!


I am trying Ceph on RHEL 6.4.
My Ceph version is Cuttlefish.
I followed the intro and ran ceph-deploy new ..  and ceph-deploy install .. 
--stable cuttlefish.
No error appeared up to this point.
Then I typed ceph-deploy mon create ..
and here comes the error, as below:


.
.
.


[ceph@cephadmin my-clusters]$ ceph-deploy mon create cephs1
=== mon.cephs1 ===
Starting Ceph mon.cephs1 on cephs1...
failed: 'ulimit -n 8192;  /usr/bin/ceph-mon -i cephs1 --pid-file 
/var/run/ceph/mon.cephs1.pid -c /etc/ceph/ceph.conf '
Starting ceph-create-keys on cephs1...
Traceback (most recent call last):
  File "/usr/bin/ceph-deploy", line 21, in 
main()
  File "/usr/lib/python2.6/site-packages/ceph_deploy/cli.py", line 112, in main
return args.func(args)
  File "/usr/lib/python2.6/site-packages/ceph_deploy/mon.py", line 234, in mon
mon_create(args)
  File "/usr/lib/python2.6/site-packages/ceph_deploy/mon.py", line 138, in 
mon_create
init=init,
  File "/usr/lib/python2.6/site-packages/pushy/protocol/proxy.py", line 255, in 

(conn.operator(type_, self, args, kwargs))
  File "/usr/lib/python2.6/site-packages/pushy/protocol/connection.py", line 
66, in operator
return self.send_request(type_, (object, args, kwargs))
  File "/usr/lib/python2.6/site-packages/pushy/protocol/baseconnection.py", 
line 329, in send_request
return self.__handle(m)
  File "/usr/lib/python2.6/site-packages/pushy/protocol/baseconnection.py", 
line 645, in __handle
raise e
pushy.protocol.proxy.ExceptionProxy: Command '['service', 'ceph', 'start', 
'mon.cephs1']' returned non-zero exit status 1



.
.
.
 And I checked my /etc/sudoers.d/ceph, and my /etc/sudoers is as below:
..
## Allow root to run any commands anywhere
root    ALL=(ALL)   ALL
ceph    ALL=(ALL)   ALL
## Allows members of the 'sys' group to run networking, software,
## service management apps and more.
# %sys ALL = NETWORKING, SOFTWARE, SERVICES, STORAGE, DELEGATING, PROCESSES, 
LOCATE, DRIVERS


## Allows people in group wheel to run all commands
# %wheel    ALL=(ALL)   ALL


## Same thing without a password
 %ceph  ALL=(ALL)   NOPASSWD: ALL

..
## Read drop-in files from /etc/sudoers.d (the # here does not mean a comment)
#includedir /etc/sudoers.d

.
.
.


What shall I do now?
THX
FingerLiu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] bucket count limit

2013-08-22 Thread Mostowiec Dominik
Hi,
I am thinking about sharding S3 buckets in the Ceph cluster: creating one bucket
per XX (256 buckets) or even one bucket per XXX (4096 buckets), where XX/XXX are
hex characters taken from the MD5 of the object URL.
Could this be a problem (performance, or some limits)?
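
(For illustration, the kind of sharding scheme described above could look roughly
like this in shell; the bucket prefix and object key are made up.)

key="some/object/name.jpg"
shard=$(printf '%s' "$key" | md5sum | cut -c1-2)   # first two hex chars of the MD5
bucket="mydata-${shard}"                           # e.g. mydata-a3 -> 256 buckets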

--
Regards
Dominik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] One rados account, more S3 API keyes

2013-08-22 Thread Sage Weil
On Thu, 22 Aug 2013, Mihály Árva-Tóth wrote:
> Hello,
> 
> Is there any way for one radosgw user to have more than one access/secret key?

Yes, you can have multiple keys for each user:

 radosgw-admin key create ...
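
(For reference, a full invocation would look something like the line below; the uid
is a placeholder. Each run with --gen-access-key/--gen-secret adds another S3 key
pair to the same user.)

 radosgw-admin key create --uid=johndoe --key-type=s3 --gen-access-key --gen-secret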

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] failing on 0.67.1 radosgw install

2013-08-22 Thread Yehuda Sadeh
On Thu, Aug 22, 2013 at 12:36 AM, Fuchs, Andreas (SwissTXT)
 wrote:
> My apache conf is as follows
>
> cat /etc/apache2/httpd.conf
> ServerName radosgw01.swisstxt.ch
>
> cat /etc/apache2/sites-enabled/000_radosgw
> 
>
> ServerName *.radosgw01.swisstxt.ch
> # ServerAdmin {email.address}
> ServerAdmin serviced...@swisstxt.ch
> DocumentRoot /var/www
> RewriteEngine On
> RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*) 
> /s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING} 
> [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]
>
> 
> 
> Options +ExecCGI
> AllowOverride All
> SetHandler fastcgi-script
> Order allow,deny
> Allow from all
> AuthBasicAuthoritative Off
> 
> 
>
> AllowEncodedSlashes On
> ErrorLog /var/log/apache2/error.log
> CustomLog /var/log/apache2/access.log combined
> ServerSignature Off
>
> 
>
> Default site is disabled
>
> cat /var/www/s3gw.fcgi
> #!/bin/sh
> exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway
>
>
> There are NO dns entries at the moment
> radosgw01.swisstxt.ch is entered in hosts file


You're not setting up your gateway as an external fcgi server. Look at
the instructions in here:

http://ceph.com/docs/master/radosgw/config/

Specifically, you're missing the FastCgiExternalServer entry.
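
(For reference, the entry from those instructions looks like the line below, using
the socket path already configured as rgw_socket_path in this thread's ceph.conf; it
goes in the radosgw virtual host / FastCGI configuration.)

FastCgiExternalServer /var/www/s3gw.fcgi -socket /tmp/radosgw.sock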


Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bucket count limit

2013-08-22 Thread Dominik Mostowiec
Thanks for your answer.

--
Regards
Dominik

2013/8/22 Yehuda Sadeh :
> On Thu, Aug 22, 2013 at 7:11 AM, Dominik Mostowiec
>  wrote:
>> Hi,
>> I think about sharding s3 buckets in CEPH cluster, create
>> bucket-per-XX (256 buckets) or even bucket-per-XXX (4096 buckets)
>> where XXX is sign from object md5 url.
>> Could this be the problem? (performance, or some limits)
>>
>
> The two issues that I can think of. One is that there's usually a 1000
> buckets per user limitation. This can be easily modified though.
> The second issue is that you might end up with a huge number of
> buckets per user, and at that point listing buckets may just take too
> long. We've seen in the past cases where listing large number of
> buckets (> 500k) took more than the client timeout period, which in
> turn retried and retried, which snowballed into a very high load on
> the gateway (as the original requests were still processing
> internally). However, this might not be an issue anymore, I do
> remember we had a problem with streamlining the list bucket responses
> which I think is already fixed. In any case, if you're not planning to
> list all the user's buckets then this is moot.
>
> Yehuda



-- 
Pozdrawiam
Dominik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD hole punching

2013-08-22 Thread Mike Lowe
There is TRIM/discard support and I use it with some success. There are some 
details here http://ceph.com/docs/master/rbd/qemu-rbd/  The one caveat I have 
is that I've sometimes been able to crash an osd by doing fstrim inside a guest.
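
(Roughly the shape of a qemu invocation with discard enabled, per those docs; the
pool and image names are placeholders. Discard needs an ide-hd or scsi-hd guest
device -- virtio-blk did not pass TRIM through at the time.)

qemu -m 1024 -drive format=raw,file=rbd:rbd/myimage,id=drive1,if=none \
 -device driver=ide-hd,drive=drive1,discard_granularity=512

Inside the guest, space is then released with something like "fstrim /" or by
mounting the filesystem with the discard option.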

On Aug 22, 2013, at 10:24 AM, Guido Winkelmann  
wrote:

> Hi,
> 
> RBD has had support for sparse allocation for some time now. However, when 
> using an RBD volume as a virtual disk for a virtual machine, the RBD volume 
> will inevitably grow until it reaches its actual nominal size, even if the 
> filesystem in the guest machine never reaches full utilization.
> 
> Is there some way to reverse this? Like going through the whole image, 
> looking 
> for large consecutive areas of zeroes and just deleting the objects for that 
> area? How about support for TRIM/discard commands used by some modern 
> filesystems?
> 
>   Guido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw crash

2013-08-22 Thread Yehuda Sadeh
On Thu, Aug 22, 2013 at 5:18 AM, Pawel Stefanski  wrote:
> hello!
>
> Today our radosgw crashed while running multiple deletions via s3 api.
>
> Is this known bug ?
>
> POST
> WSTtobXBlBrm2r78B67LtQ==
>
> Thu, 22 Aug 2013 11:38:34 GMT
> /inna-a/?delete
>-11> 2013-08-22 13:39:26.650499 7f36347d8700  2 req 95:0.000555:s3:POST
> /inna-a/:multi_object_delete:reading permissions
>-10> 2013-08-22 13:39:26.650561 7f36347d8700 10 moving .rgw+inna-a to
> cache LRU end
> -9> 2013-08-22 13:39:26.650568 7f36347d8700 10 cache get:
> name=.rgw+inna-a : hit
> -8> 2013-08-22 13:39:26.650585 7f36347d8700 10 moving .rgw+inna-a to
> cache LRU end
> -7> 2013-08-22 13:39:26.650590 7f36347d8700 10 cache get:
> name=.rgw+inna-a : hit
> -6> 2013-08-22 13:39:26.650636 7f36347d8700  2 req 95:0.000692:s3:POST
> /inna-a/:multi_object_delete:verifying op permissions
> -5> 2013-08-22 13:39:26.650644 7f36347d8700  5 Searching permissions for
> uid=7f0c560a-c405-422b-975c-98674543c0c1 mask=2
> -4> 2013-08-22 13:39:26.650648 7f36347d8700  5 Found permission: 15
> -3> 2013-08-22 13:39:26.650651 7f36347d8700 10
> uid=7f0c560a-c405-422b-975c-98674543c0c1 requested perm (type)=2, policy
> perm=2, user_perm_mask=2, acl perm=2
> -2> 2013-08-22 13:39:26.650657 7f36347d8700  2 req 95:0.000714:s3:POST
> /inna-a/:multi_object_delete:verifying op params
> -1> 2013-08-22 13:39:26.650663 7f36347d8700  2 req 95:0.000720:s3:POST
> /inna-a/:multi_object_delete:executing
>  0> 2013-08-22 13:39:26.653482 7f36347d8700 -1 *** Caught signal
> (Segmentation fault) **
>  in thread 7f36347d8700
>
>  ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
>  1: /usr/bin/radosgw() [0x4786da]
>  2: (()+0xfcb0) [0x7f36843b7cb0]
>  3: (std::basic_string, std::allocator
>>::basic_string(std::string const&)+0xb) [0x7f3683907f2b]
>  4: (RGWMultiDelDelete::xml_end(char const*)+0x13a) [0x54060a]
>  5: (RGWXMLParser::xml_end(char const*)+0x22) [0x4d88d2]
>  6: /usr/bin/radosgw() [0x4d889a]
>  7: (()+0xa6f4) [0x7f3684d426f4]
>  8: (()+0xb951) [0x7f3684d43951]
>  9: (()+0x87c7) [0x7f3684d407c7]
>  10: (()+0xa17b) [0x7f3684d4217b]
>  11: (XML_ParseBuffer()+0x6d) [0x7f3684d4575d]
>  12: (RGWXMLParser::parse(char const*, int, int)+0x90) [0x4d8e70]
>  13: (RGWDeleteMultiObj::execute()+0xef) [0x51640f]
>  14: (RGWProcess::handle_request(RGWRequest*)+0x3e6) [0x4735f6]
>  15: (RGWProcess::RGWWQ::_process(RGWRequest*)+0x36) [0x475266]
>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x491b56]
>  17: (ThreadPool::WorkThread::entry()+0x10) [0x493990]
>  18: (()+0x7e9a) [0x7f36843afe9a]
>  19: (clone()+0x6d) [0x7f3683083ccd]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
>
> best regards!

Yeah, that's a known issue (#5931). A fix for it was pushed not too
long ago into the bobtail branch, although there hasn't been a new
build since.

Thanks,
Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bucket count limit

2013-08-22 Thread Yehuda Sadeh
On Thu, Aug 22, 2013 at 7:11 AM, Dominik Mostowiec
 wrote:
> Hi,
> I think about sharding s3 buckets in CEPH cluster, create
> bucket-per-XX (256 buckets) or even bucket-per-XXX (4096 buckets)
> where XXX is sign from object md5 url.
> Could this be the problem? (performance, or some limits)
>

There are two issues that I can think of. One is that there's usually a 1000
buckets per user limitation. This can be easily modified though.
The second issue is that you might end up with a huge number of
buckets per user, and at that point listing buckets may just take too
long. We've seen in the past cases where listing a large number of
buckets (> 500k) took longer than the client timeout period, so the
client retried and retried, which snowballed into a very high load on
the gateway (as the original requests were still processing
internally). However, this might not be an issue anymore; I do
remember we had a problem with streamlining the list bucket responses
which I think has already been fixed. In any case, if you're not planning to
list all the user's buckets then this is moot.
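
(For reference, the per-user cap can be raised with something like the command
below; the uid and the new limit are placeholders.)

radosgw-admin user modify --uid=johndoe --max-buckets=5000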

Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD hole punching

2013-08-22 Thread Guido Winkelmann
Hi,

RBD has had support for sparse allocation for some time now. However, when 
using an RBD volume as a virtual disk for a virtual machine, the RBD volume 
will inevitably grow until it reaches its actual nominal size, even if the 
filesystem in the guest machine never reaches full utilization.

Is there some way to reverse this? Like going through the whole image, looking 
for large consecutive areas of zeroes and just deleting the objects for that 
area? How about support for TRIM/discard commands used by some modern 
filesystems?

Guido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bucket count limit

2013-08-22 Thread Dominik Mostowiec
I'm sorry for the spam :-(

--
Dominik

2013/8/22 Dominik Mostowiec :
> Hi,
> I think about sharding s3 buckets in CEPH cluster, create
> bucket-per-XX (256 buckets) or even bucket-per-XXX (4096 buckets)
> where XXX is sign from object md5 url.
> Could this be the problem? (performance, or some limits)
>
> --
> Regards
> Dominik



-- 
Pozdrawiam
Dominik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NFS vs. CephFS for /var/lib/nova/instances

2013-08-22 Thread Amit Vijairania
Hello!

We, in our environment, need a shared file system for
/var/lib/nova/instances and the Glance image cache (_base).

Is anyone using CephFS for this purpose?
When folks say CephFS is not production ready, is the primary concern
stability/data-integrity or performance?
Is NFS (with NFS-Ganesha) a better solution?  Is anyone using it today?

Please let us know..
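
(For context, using CephFS for that directory boils down to a mount like the one
below on every compute node; the monitor address, client name and secret file path
are placeholders.)

mount -t ceph 192.168.0.1:6789:/ /var/lib/nova/instances \
 -o name=admin,secretfile=/etc/ceph/admin.secret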

Thanks!
Amit

Amit Vijairania  |  978.319.3684
--*--
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy and journal on separate disk

2013-08-22 Thread Alfredo Deza
On Thu, Aug 22, 2013 at 4:36 AM, Pavel Timoschenkov
 wrote:
> Hi.
> With this patch - is all ok.
> Thanks for help!
>

Thanks for confirming this, I have opened a ticket
(http://tracker.ceph.com/issues/6085 ) and will work on this patch to
get it merged.

> -Original Message-
> From: Alfredo Deza [mailto:alfredo.d...@inktank.com]
> Sent: Wednesday, August 21, 2013 7:16 PM
> To: Pavel Timoschenkov
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] ceph-deploy and journal on separate disk
>
> On Wed, Aug 21, 2013 at 9:33 AM, Pavel Timoschenkov 
>  wrote:
>> Hi. Thanks for patch. But after patched ceph src and install it, I have not 
>> ceph-disk or ceph-deploy command.
>> I did the following steps:
>> git clone --recursive https://github.com/ceph/ceph.git patch -p0 <
>>  ./autogen.sh ./configure make make install What am I
>> doing wrong?
>
> Oh I meant to patch it directly, there was no need to rebuild/make/install 
> again because the file is a plain Python file (no compilation needed).
>
> Can you try that instead?
>>
>> -Original Message-
>> From: Alfredo Deza [mailto:alfredo.d...@inktank.com]
>> Sent: Monday, August 19, 2013 3:38 PM
>> To: Pavel Timoschenkov
>> Cc: ceph-us...@ceph.com
>> Subject: Re: [ceph-users] ceph-deploy and journal on separate disk
>>
>> On Fri, Aug 16, 2013 at 8:32 AM, Pavel Timoschenkov 
>>  wrote:
>>> <<>> are causing this to <<>> flag with the filesystem and prevent this.
>>>
>>> Hi. Any changes (
>>>
>>> Can you create a build that passes the -t flag with mount?
>>>
>>
>> I tried going through these steps again and could not get any other ideas 
>> except to pass in that flag for mounting. Would you be willing to try a 
>> patch?
>> (http://fpaste.org/33099/37691580/)
>>
>> You would need to apply it to the `ceph-disk` executable.
>>
>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> From: Pavel Timoschenkov
>>> Sent: Thursday, August 15, 2013 3:43 PM
>>> To: 'Alfredo Deza'
>>> Cc: Samuel Just; ceph-us...@ceph.com
>>> Subject: RE: [ceph-users] ceph-deploy and journal on separate disk
>>>
>>>
>>>
>>> The separate commands (e.g. `ceph-disk -v prepare /dev/sda1`) works
>>> because then the journal is on the same device as the OSD data, so
>>> the execution is different to get them to a working state.
>>>
>>> I suspect that there are left over partitions in /dev/sdaa that are
>>> causing this to fail, I *think* that we could pass the `-t` flag with
>>> the filesystem and prevent this.
>>>
>>> Just to be sure, could you list all the partitions on /dev/sdaa (if
>>> /dev/sdaa is the whole device)?
>>>
>>> Something like:
>>>
>>> sudo parted /dev/sdaa print
>>>
>>> Or if you prefer any other way that could tell use what are all the
>>> partitions in that device.
>>>
>>>
>>>
>>>
>>>
>>> After
>>>
>>> ceph-deploy disk zap ceph001:sdaa ceph001:sda1
>>>
>>>
>>>
>>> root@ceph001:~# parted /dev/sdaa print
>>>
>>> Model: ATA ST3000DM001-1CH1 (scsi)
>>>
>>> Disk /dev/sdaa: 3001GB
>>>
>>> Sector size (logical/physical): 512B/4096B
>>>
>>> Partition Table: gpt
>>>
>>>
>>>
>>> Number  Start  End  Size  File system  Name  Flags
>>>
>>>
>>>
>>> root@ceph001:~# parted /dev/sda1 print
>>>
>>> Model: Unknown (unknown)
>>>
>>> Disk /dev/sda1: 10.7GB
>>>
>>> Sector size (logical/physical): 512B/512B
>>>
>>> Partition Table: gpt
>>>
>>> So that is after running `disk zap`. What does it say after using
>>> ceph-deploy and failing?
>>>
>>>
>>>
>>> Number  Start  End  Size  File system  Name  Flags
>>>
>>>
>>>
>>> After ceph-disk -v prepare /dev/sdaa /dev/sda1:
>>>
>>>
>>>
>>> root@ceph001:~# parted /dev/sdaa print
>>>
>>> Model: ATA ST3000DM001-1CH1 (scsi)
>>>
>>> Disk /dev/sdaa: 3001GB
>>>
>>> Sector size (logical/physical): 512B/4096B
>>>
>>> Partition Table: gpt
>>>
>>>
>>>
>>> Number  Start   End SizeFile system  Name   Flags
>>>
>>> 1  1049kB  3001GB  3001GB  xfs  ceph data
>>>
>>>
>>>
>>> And
>>>
>>>
>>>
>>> root@ceph001:~# parted /dev/sda1 print
>>>
>>> Model: Unknown (unknown)
>>>
>>> Disk /dev/sda1: 10.7GB
>>>
>>> Sector size (logical/physical): 512B/512B
>>>
>>> Partition Table: gpt
>>>
>>>
>>>
>>> Number  Start  End  Size  File system  Name  Flags
>>>
>>>
>>>
>>> With the same errors:
>>>
>>>
>>>
>>> root@ceph001:~# ceph-disk -v prepare /dev/sdaa /dev/sda1
>>>
>>> DEBUG:ceph-disk:Journal /dev/sda1 is a partition
>>>
>>> WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the
>>> same device as the osd data
>>>
>>> DEBUG:ceph-disk:Creating osd partition on /dev/sdaa
>>>
>>> Information: Moved requested sector from 34 to 2048 in
>>>
>>> order to align on 2048-sector boundaries.
>>>
>>> The operation has completed successfully.
>>>
>>> DEBUG:ceph-disk:Creating xfs fs on /dev/sdaa1
>>>
>>> meta-data=/dev/sdaa1 isize=2048   agcount=32, agsize=22892700
>>> blks
>>>
>>>  =   sectsz=512   attr=2, projid32bit=0
>>>
>>> data =   bsize=4096   blocks=732566385, imaxpct=5
>>>
>>>  =  

[ceph-users] bucket count limit

2013-08-22 Thread Dominik Mostowiec
Hi,
I am thinking about sharding S3 buckets in the Ceph cluster: creating one bucket
per XX (256 buckets) or even one bucket per XXX (4096 buckets), where XX/XXX are
hex characters taken from the MD5 of the object URL.
Could this be a problem (performance, or some limits)?

--
Regards
Dominik
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw crash

2013-08-22 Thread Pawel Stefanski
hello!

Today our radosgw crashed while running multiple deletions via the S3 API.

Is this a known bug?

POST
WSTtobXBlBrm2r78B67LtQ==

Thu, 22 Aug 2013 11:38:34 GMT
/inna-a/?delete
   -11> 2013-08-22 13:39:26.650499 7f36347d8700  2 req 95:0.000555:s3:POST
/inna-a/:multi_object_delete:reading permissions
   -10> 2013-08-22 13:39:26.650561 7f36347d8700 10 moving .rgw+inna-a to
cache LRU end
-9> 2013-08-22 13:39:26.650568 7f36347d8700 10 cache get:
name=.rgw+inna-a : hit
-8> 2013-08-22 13:39:26.650585 7f36347d8700 10 moving .rgw+inna-a to
cache LRU end
-7> 2013-08-22 13:39:26.650590 7f36347d8700 10 cache get:
name=.rgw+inna-a : hit
-6> 2013-08-22 13:39:26.650636 7f36347d8700  2 req 95:0.000692:s3:POST
/inna-a/:multi_object_delete:verifying op permissions
-5> 2013-08-22 13:39:26.650644 7f36347d8700  5 Searching permissions
for uid=7f0c560a-c405-422b-975c-98674543c0c1 mask=2
-4> 2013-08-22 13:39:26.650648 7f36347d8700  5 Found permission: 15
-3> 2013-08-22 13:39:26.650651 7f36347d8700 10
 uid=7f0c560a-c405-422b-975c-98674543c0c1 requested perm (type)=2, policy
perm=2, user_perm_mask=2, acl perm=2
-2> 2013-08-22 13:39:26.650657 7f36347d8700  2 req 95:0.000714:s3:POST
/inna-a/:multi_object_delete:verifying op params
-1> 2013-08-22 13:39:26.650663 7f36347d8700  2 req 95:0.000720:s3:POST
/inna-a/:multi_object_delete:executing
 0> 2013-08-22 13:39:26.653482 7f36347d8700 -1 *** Caught signal
(Segmentation fault) **
 in thread 7f36347d8700

 ceph version 0.56.6 (95a0bda7f007a33b0dc7adf4b330778fa1e5d70c)
 1: /usr/bin/radosgw() [0x4786da]
 2: (()+0xfcb0) [0x7f36843b7cb0]
 3: (std::basic_string<char, std::char_traits<char>, std::allocator<char>
>::basic_string(std::string const&)+0xb) [0x7f3683907f2b]
 4: (RGWMultiDelDelete::xml_end(char const*)+0x13a) [0x54060a]
 5: (RGWXMLParser::xml_end(char const*)+0x22) [0x4d88d2]
 6: /usr/bin/radosgw() [0x4d889a]
 7: (()+0xa6f4) [0x7f3684d426f4]
 8: (()+0xb951) [0x7f3684d43951]
 9: (()+0x87c7) [0x7f3684d407c7]
 10: (()+0xa17b) [0x7f3684d4217b]
 11: (XML_ParseBuffer()+0x6d) [0x7f3684d4575d]
 12: (RGWXMLParser::parse(char const*, int, int)+0x90) [0x4d8e70]
 13: (RGWDeleteMultiObj::execute()+0xef) [0x51640f]
 14: (RGWProcess::handle_request(RGWRequest*)+0x3e6) [0x4735f6]
 15: (RGWProcess::RGWWQ::_process(RGWRequest*)+0x36) [0x475266]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0x491b56]
 17: (ThreadPool::WorkThread::entry()+0x10) [0x493990]
 18: (()+0x7e9a) [0x7f36843afe9a]
 19: (clone()+0x6d) [0x7f3683083ccd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.

best regards!
-- 
pawel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RE : OpenStack Cinder + Ceph, unable to remove unattached volumes, still watchers

2013-08-22 Thread HURTEVENT VINCENT
Hi Josh,

thank you for your answer, but I was on Bobtail, so there was no listwatchers command :)

I scheduled a reboot of the affected compute nodes and everything went fine afterwards.
I also updated Ceph to the latest stable release.





From: Josh Durgin [josh.dur...@inktank.com]
Sent: Tuesday, August 20, 2013 22:40
To: HURTEVENT VINCENT
Cc: Maciej Gałkiewicz; ceph-us...@ceph.com
Objet : Re: [ceph-users] OpenStack Cinder + Ceph, unable to remove unattached 
volumes, still watchers

On 08/20/2013 11:20 AM, Vincent Hurtevent wrote:
>
>
> I'm not the end user. It's possible that the volume has been detached
> without unmounting.
>
> As the volume is unattached and the initial kvm instance is down, I was
> expecting the rbd volume is properly unlocked even if the guest unmount
> hasn't been done, like a physical disk in fact.

Yes, detaching the volume will remove the watch regardless of the guest
having it mounted.

> Which part of the Ceph thing is allways locked or marked in use ? Do we
> have to go to the rados object level ?
> The data can be destroy.

It's a watch on the rbd header object, registered when the rbd volume
is attached, and unregistered when it is detached or 30 seconds after
the qemu/kvm process using it dies.

 From rbd info you can get the id of the image (part of the
block_name_prefix), and use the rados tool to see what ip is watching
the volume's header object, i.e.:

$ rbd info volume-name | grep prefix
 block_name_prefix: rbd_data.102f74b0dc51
$ rados -p rbd listwatchers rbd_header.102f74b0dc51
watcher=192.168.106.222:0/1029129 client.4152 cookie=1

> Reboot compute nodes could clean librbd layer and clean watchers ?

Yes, because this would kill all the qemu/kvm processes.

Josh

> 
> From: Don Talton (dotalton) [dotal...@cisco.com]
> Sent: Tuesday, August 20, 2013 19:57
> To: HURTEVENT VINCENT
> Objet : RE: [ceph-users] OpenStack Cinder + Ceph, unable to remove
> unattached volumes, still watchers
>
> Did you unmounts them in the guest before detaching?
>
>  > -Original Message-
>  > From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
>  > boun...@lists.ceph.com] On Behalf Of Vincent Hurtevent
>  > Sent: Tuesday, August 20, 2013 10:33 AM
>  > To: ceph-us...@ceph.com
>  > Subject: [ceph-users] OpenStack Cinder + Ceph, unable to remove
>  > unattached volumes, still watchers
>  >
>  > Hello,
>  >
>  > I'm using Ceph as Cinder backend. Actually it's working pretty well
> and some
>  > users are using this cloud platform for few weeks, but I come back from
>  > vacation and I've got some errors removing volumes, errors I didn't
> have few
>  > weeks ago.
>  >
>  > Here's the situation :
>  >
>  > Volumes are unattached, but Ceph is telling Cinder or I, when I try
> to remove
>  > trough rbd tools, that the volume still has watchers.
>  >
>  > rbd --pool cinder rm volume-46e241ee-ed3f-446a-87c7-1c9df560d770
>  > Removing image: 99% complete...failed.
>  > rbd: error: image still has watchers
>  > This means the image is still open or the client using it crashed.
> Try again after
>  > closing/unmapping it or waiting 30s for the crashed client to timeout.
>  > 2013-08-20 19:17:36.075524 7fedbc7e1780 -1 librbd: error removing
>  > header: (16) Device or resource busy
>  >
>  >
>  > The kvm instances on which the volumes have been attached are now
>  > terminated. There's no lock on the volume using 'rbd lock list'.
>  >
>  > I restarted all the monitors (3) one by one, with no better success.
>  >
>  >  From Openstack PoV, these volumes are well unattached.
>  >
>  > How can I unlock the volumes or trace back the watcher/process ? These
>  > could be on several and different compute nodes.
>  >
>  >
>  > Thank you for any hint,
>  >
>  >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Significant slowdown of osds since v0.67 Dumpling

2013-08-22 Thread Oliver Daudey
Hey Samuel,

On wo, 2013-08-21 at 20:27 -0700, Samuel Just wrote:
> I think the rbd cache one you'd need to run for a few minutes to get
> meaningful results.  It should stabilize somewhere around the actual
> throughput of your hardware.

Ok, I now also ran this test on Cuttlefish as well as Dumpling.

Cuttlefish:
# rbd bench-write test --rbd-cache
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
  SEC   OPS   OPS/SEC   BYTES/SEC
1 13265  13252.45  3466029.67
2 25956  12975.60  3589479.95
3 38475  12818.61  3598590.70
4 50184  12545.16  3530516.34
5 59263  11852.22  3292258.13
<...>
  300   3421530  11405.08  3191555.35
  301   3430755  11397.83  3189251.09
  302   3443345  11401.73  3190694.98
  303   3455230  11403.37  3191478.97
  304   3467014  11404.62  3192136.82
  305   3475355  11394.57  3189525.71
  306   3488067  11398.90  3190553.96
  307   3499789  11399.96  3190770.21
  308   3510566  11397.93  3190289.49
  309   3519829  11390.98  3188620.93
  310   3532539  11395.25  3189544.03

Dumpling:
# rbd bench-write test --rbd-cache
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern seq
  SEC   OPS   OPS/SEC   BYTES/SEC
1 13201  13194.63  3353004.50
2 25926  12957.05  3379695.03
3 36624  12206.06  3182087.11
4 46547  11635.35  3035794.95
5 59290  11856.27  3090389.79
<...>
  300   3405215  11350.66  3130092.00
  301   3417789  11354.76  3131106.34
  302   3430067  11357.83  3131933.41
  303   3438792  11349.14  3129734.88
  304   3450237  11349.45  3129689.62
  305   3462840  11353.53  3130406.43
  306   3473151  11350.17  3128942.32
  307   3482327  11343.00  3126771.34
  308   3495020  11347.44  3127502.07
  309   3506894  11349.13  3127781.70
  310   3516532  11343.65  3126714.62

As you can see, the result is virtually identical.  What jumps out
during the cached tests is that the CPU used by the OSDs is negligible
in both cases, while without caching, the OSDs get loaded quite heavily.
Perhaps the cache masks the problem we're seeing in Dumpling somehow?
And I'm not changing anything but the OSD binary during my tests, so the
cache settings used in the VMs are identical in both scenarios.
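
(A quick way to put numbers on that CPU observation, not something from the original
exchange: sample the OSD processes while a bench run is active.)

top -b -n 1 | grep ceph-osd                      # per-OSD CPU snapshot
sudo perf top -p "$(pgrep -d, -f ceph-osd)"      # where that CPU is being spent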

> 
> Hmm, 10k ios I guess is only 10 rbd chunks.  What replication level
> are you using?  Try setting them to 1000 (you only need to set the
> xfs ones).
> 
> For the rand test, try increasing
> filestore_wbthrottle_xfs_inodes_hard_limit and
> filestore_wbthrottle_xfs_inodes_start_flusher to 1 as well as
> setting the above ios limits.

Ok, my current config:
filestore wbthrottle xfs ios start flusher = 1000
filestore wbthrottle xfs ios hard limit = 1000
filestore wbthrottle xfs inodes hard limit = 1
filestore wbthrottle xfs inodes start flusher = 1

Unfortunately, that still makes no difference at all in the original
standard-tests.

Random IO on Dumpling, after 120 secs of runtime:
# rbd bench-write test --io-pattern=rand
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
  SEC   OPS   OPS/SEC   BYTES/SEC
1   545534.98  1515804.02
2  1162580.80  1662416.60
3  1731576.52  1662966.61
4  2317579.04  1695129.94
5  2817562.56  1672754.87
<...>
  120 43564362.91  1080512.00
  121 43774361.76  1077368.28
  122 44419364.06  1083894.31
  123 45046366.22  1090518.68
  124 45287364.01  1084437.37
  125 45334361.54  1077035.12
  126 45336359.40  1070678.36
  127 45797360.60  1073985.78
  128 46388362.40  1080056.75
  129 46984364.21  1086068.63
  130 47604366.11  1092712.51

Random IO on Cuttlefish, after 120 secs of runtime:
rbd bench-write test --io-pattern=rand
bench-write  io_size 4096 io_threads 16 bytes 1073741824 pattern rand
  SEC   OPS   OPS/SEC   BYTES/SEC
1  1066   1065.54  3115713.13
2  2099   1049.31  2936300.53
3  3218   1072.32  3028707.50
4  4026   1003.23  2807859.15
5  4272793.80  2226962.63
<...>
  120 66935557.79  1612483.74
  121 68011562.01  1625419.34
  122 68428558.59  1615376.62
  123 68579557.06  1610780.38
  125 68777549.73  1589816.94
  126 69745553.52  1601671.46
  127 70855557.91  1614293.12
  128 71962562.20  1627070.81
  129 72529562.22  1627120.59
  130 73146562.66  1628818.79

Confirming your setting took properly:
# ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config show | grep
wbthrottle
  "filestore_wbthrottle_btrfs_bytes_start_flusher": "41943040",
  "filestore_wbthrottle_btrfs_bytes_hard_limit": "419430400",
  "filestore_wbthrottle_btrfs_ios_start_flusher": "500",
  "filestore_wbthrottle_btrfs_ios_hard_limit": "5000",
  "filestore_wbthrottle_btrfs_inodes_start_flusher": "500",
  "filestore_wbthrottle_xfs_bytes_start_

Re: [ceph-users] ceph-deploy and journal on separate disk

2013-08-22 Thread Pavel Timoschenkov
Hi.
With this patch everything is OK.
Thanks for the help!

-Original Message-
From: Alfredo Deza [mailto:alfredo.d...@inktank.com] 
Sent: Wednesday, August 21, 2013 7:16 PM
To: Pavel Timoschenkov
Cc: ceph-us...@ceph.com
Subject: Re: [ceph-users] ceph-deploy and journal on separate disk

On Wed, Aug 21, 2013 at 9:33 AM, Pavel Timoschenkov 
 wrote:
> Hi. Thanks for patch. But after patched ceph src and install it, I have not 
> ceph-disk or ceph-deploy command.
> I did the following steps:
> git clone --recursive https://github.com/ceph/ceph.git patch -p0 < 
>  ./autogen.sh ./configure make make install What am I 
> doing wrong?

Oh I meant to patch it directly, there was no need to rebuild/make/install 
again because the file is a plain Python file (no compilation needed).

Can you try that instead?
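
(In practice that means applying the patch to the installed script in place; the
patch file name below is a placeholder and the script's path can differ per distro;
`which ceph-disk` shows where it lives.)

sudo patch /usr/sbin/ceph-disk < ceph-disk-mount.patch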
>
> -Original Message-
> From: Alfredo Deza [mailto:alfredo.d...@inktank.com]
> Sent: Monday, August 19, 2013 3:38 PM
> To: Pavel Timoschenkov
> Cc: ceph-us...@ceph.com
> Subject: Re: [ceph-users] ceph-deploy and journal on separate disk
>
> On Fri, Aug 16, 2013 at 8:32 AM, Pavel Timoschenkov 
>  wrote:
>> <<> are causing this to <<> flag with the filesystem and prevent this.
>>
>> Hi. Any changes (
>>
>> Can you create a build that passes the -t flag with mount?
>>
>
> I tried going through these steps again and could not get any other ideas 
> except to pass in that flag for mounting. Would you be willing to try a patch?
> (http://fpaste.org/33099/37691580/)
>
> You would need to apply it to the `ceph-disk` executable.
>
>
>>
>>
>>
>>
>>
>>
>> From: Pavel Timoschenkov
>> Sent: Thursday, August 15, 2013 3:43 PM
>> To: 'Alfredo Deza'
>> Cc: Samuel Just; ceph-us...@ceph.com
>> Subject: RE: [ceph-users] ceph-deploy and journal on separate disk
>>
>>
>>
>> The separate commands (e.g. `ceph-disk -v prepare /dev/sda1`) works 
>> because then the journal is on the same device as the OSD data, so 
>> the execution is different to get them to a working state.
>>
>> I suspect that there are left over partitions in /dev/sdaa that are 
>> causing this to fail, I *think* that we could pass the `-t` flag with 
>> the filesystem and prevent this.
>>
>> Just to be sure, could you list all the partitions on /dev/sdaa (if 
>> /dev/sdaa is the whole device)?
>>
>> Something like:
>>
>> sudo parted /dev/sdaa print
>>
>> Or if you prefer any other way that could tell use what are all the 
>> partitions in that device.
>>
>>
>>
>>
>>
>> After
>>
>> ceph-deploy disk zap ceph001:sdaa ceph001:sda1
>>
>>
>>
>> root@ceph001:~# parted /dev/sdaa print
>>
>> Model: ATA ST3000DM001-1CH1 (scsi)
>>
>> Disk /dev/sdaa: 3001GB
>>
>> Sector size (logical/physical): 512B/4096B
>>
>> Partition Table: gpt
>>
>>
>>
>> Number  Start  End  Size  File system  Name  Flags
>>
>>
>>
>> root@ceph001:~# parted /dev/sda1 print
>>
>> Model: Unknown (unknown)
>>
>> Disk /dev/sda1: 10.7GB
>>
>> Sector size (logical/physical): 512B/512B
>>
>> Partition Table: gpt
>>
>> So that is after running `disk zap`. What does it say after using 
>> ceph-deploy and failing?
>>
>>
>>
>> Number  Start  End  Size  File system  Name  Flags
>>
>>
>>
>> After ceph-disk -v prepare /dev/sdaa /dev/sda1:
>>
>>
>>
>> root@ceph001:~# parted /dev/sdaa print
>>
>> Model: ATA ST3000DM001-1CH1 (scsi)
>>
>> Disk /dev/sdaa: 3001GB
>>
>> Sector size (logical/physical): 512B/4096B
>>
>> Partition Table: gpt
>>
>>
>>
>> Number  Start   End SizeFile system  Name   Flags
>>
>> 1  1049kB  3001GB  3001GB  xfs  ceph data
>>
>>
>>
>> And
>>
>>
>>
>> root@ceph001:~# parted /dev/sda1 print
>>
>> Model: Unknown (unknown)
>>
>> Disk /dev/sda1: 10.7GB
>>
>> Sector size (logical/physical): 512B/512B
>>
>> Partition Table: gpt
>>
>>
>>
>> Number  Start  End  Size  File system  Name  Flags
>>
>>
>>
>> With the same errors:
>>
>>
>>
>> root@ceph001:~# ceph-disk -v prepare /dev/sdaa /dev/sda1
>>
>> DEBUG:ceph-disk:Journal /dev/sda1 is a partition
>>
>> WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the 
>> same device as the osd data
>>
>> DEBUG:ceph-disk:Creating osd partition on /dev/sdaa
>>
>> Information: Moved requested sector from 34 to 2048 in
>>
>> order to align on 2048-sector boundaries.
>>
>> The operation has completed successfully.
>>
>> DEBUG:ceph-disk:Creating xfs fs on /dev/sdaa1
>>
>> meta-data=/dev/sdaa1 isize=2048   agcount=32, agsize=22892700
>> blks
>>
>>  =   sectsz=512   attr=2, projid32bit=0
>>
>> data =   bsize=4096   blocks=732566385, imaxpct=5
>>
>>  =   sunit=0  swidth=0 blks
>>
>> naming   =version 2  bsize=4096   ascii-ci=0
>>
>> log  =internal log   bsize=4096   blocks=357698, version=2
>>
>>  =   sectsz=512   sunit=0 blks, lazy-count=1
>>
>> realtime =none   extsz=4096   blocks=0, rtextents=0
>>
>> DEBUG:ceph-disk:Mounting /dev/sdaa1 on /var/lib/ceph/tmp/mnt.UkJbwx 

Re: [ceph-users] failing on 0.67.1 radosgw install

2013-08-22 Thread Fuchs, Andreas (SwissTXT)
My apache conf is as follows

cat /etc/apache2/httpd.conf
ServerName radosgw01.swisstxt.ch

cat /etc/apache2/sites-enabled/000_radosgw
<VirtualHost *:80>

ServerName *.radosgw01.swisstxt.ch
# ServerAdmin {email.address}
ServerAdmin serviced...@swisstxt.ch
DocumentRoot /var/www
RewriteEngine On
RewriteRule ^/([a-zA-Z0-9-_.]*)([/]?.*) 
/s3gw.fcgi?page=$1&params=$2&%{QUERY_STRING} 
[E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

<IfModule mod_fastcgi.c>
<Directory /var/www>
Options +ExecCGI
AllowOverride All
SetHandler fastcgi-script
Order allow,deny
Allow from all
AuthBasicAuthoritative Off
</Directory>
</IfModule>

AllowEncodedSlashes On
ErrorLog /var/log/apache2/error.log
CustomLog /var/log/apache2/access.log combined
ServerSignature Off

</VirtualHost>

Default site is disabled

cat /var/www/s3gw.fcgi
#!/bin/sh
exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway


There are NO dns entries at the moment
radosgw01.swisstxt.ch is entered in hosts file


cheers
Andi

-Original Message-
From: Yehuda Sadeh [mailto:yeh...@inktank.com] 
Sent: Dienstag, 20. August 2013 17:09
To: Fuchs, Andreas (SwissTXT)
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] failing on 0.67.1 radosgw install

On Tue, Aug 20, 2013 at 7:34 AM, Fuchs, Andreas (SwissTXT) 
 wrote:
> Hi
>
> I succesfully setup a ceph cluster with 0.67.1, now I try to get the 
> radosgw on a separate node running
>
> Os=ubuntu 12.04 lts
> Install seemed succesfull, but if  I try to access the api I see the 
> following in the logs
>
> Apache2/error.log
>
> 2013-08-20 16:22:17.029064 7f927abdf780 -1 warning: unable to create 
> /var/run/ceph: (13) Permission denied
> 2013-08-20 16:22:17.029343 7f927abdf780 -1 WARNING: libcurl doesn't 
> support curl_multi_wait()
> 2013-08-20 16:22:17.029348 7f927abdf780 -1 WARNING: cross zone / 
> region transfer performance may be affected [Tue Aug 20 16:22:17 2013] [warn] 
> FastCGI: (dynamic) server "/var/www/s3gw.fcgi" (pid 20793) terminated by 
> calling exit with status '0'
>
> I can fix the /var/run/ceph permission error by
>
> sudo mkdir /var/run/ceph
> sudo chown www-data /var/run/ceph
>
> but after restarting the services I still get:
>
> [Tue Aug 20 16:24:45 2013] [notice] FastCGI: process manager 
> initialized (pid 24276) [Tue Aug 20 16:24:45 2013] [notice] 
> Apache/2.2.22 (Ubuntu) mod_fastcgi/mod_fastcgi-SNAP-0910052141 
> mod_ssl/2.2.22 OpenSSL/1.0.1 configured -- resuming normal operations 
> [Tue Aug 20 16:25:25 2013] [warn] FastCGI: (dynamic) server 
> "/var/www/s3gw.fcgi" started (pid 24373)
> 2013-08-20 16:25:25.760621 7f2a63aeb780 -1 WARNING: libcurl doesn't 
> support curl_multi_wait()
> 2013-08-20 16:25:25.760628 7f2a63aeb780 -1 WARNING: cross zone / 
> region transfer performance may be affected [Tue Aug 20 16:25:25 2013] [warn] 
> FastCGI: (dynamic) server "/var/www/s3gw.fcgi" (pid 24373) terminated by 
> calling exit with status '0'
>
> Also I have over time more and more of those processes:
>
> /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway
>
> Ceph auth list shows
>
> client.radosgw.gateway
> key: checked witj keyfile and correct
> caps: [mon] allow rw
> caps: [osd] allow rwx
>
> config is:
>
> [client.radosgw.gateway]
> host = radosgw01, radosgw02
> keyring = /etc/ceph/keyring.radosgw.gateway rgw_socket_path = 
> /tmp/radosgw.sock log_file = /var/log/ceph/radosgw.log rgw_dns_name = 
> radosgw01.swisstxt.ch
>
>
Can you provide your apache site setup?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com