[ceph-users] OSDs going down during radosbench benchmark
Trying to understand why some OSDs (6 out of 21) went down in my cluster while running a CBT radosbench benchmark.  From the logs below, is this a networking problem between systems, or is it some kind of FileStore problem?

Looking at one crashed OSD log, I see the following crash error:

    2016-09-09 21:30:29.757792 7efc6f5f1700 -1 FileStore: sync_entry timed out after 600 seconds.
    ceph version 10.2.1-13.el7cp (f15ca93643fee5f7d32e62c3e8a7016c1fc1e6f4)

Just before that I see things like:

    2016-09-09 21:18:07.391760 7efc755fd700 -1 osd.12 165 heartbeat_check: no reply from osd.6 since back 2016-09-09 21:17:47.261601 front 2016-09-09 21:17:47.261601 (cutoff 2016-09-09 21:17:47.391758)

and also

    2016-09-09 19:03:45.788327 7efc53905700 0 -- 10.0.1.2:6826/58682 >> 10.0.1.1:6832/19713 pipe(0x7efc8bfbc800 sd=65 :52000 s=1 pgs=12 cs=1 l=0 c=0x7efc8bef5b00).connect got RESETSESSION

and many warnings for slow requests.

All the other OSDs that died seem to have died with:

    2016-09-09 19:11:01.663262 7f2157e65700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f2157e65700 time 2016-09-09 19:11:01.660671
    common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

-- Tom Deneau, AMD
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
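[Editor's note] For anyone hitting the same signatures: the 600 seconds in the sync_entry message is the default filestore_commit_timeout, and the HeartbeatMap suicide assert comes from the per-thread suicide timeouts, so together they usually point at the OSDs' backing disks falling badly behind under the benchmark load rather than at the network alone.  A rough sketch of what can be checked or loosened while investigating -- osd.12 and the values below are only examples, not tuning advice:

    # run on the node hosting the OSD -- what was it stuck on?
    ceph daemon osd.12 dump_historic_ops
    ceph daemon osd.12 perf dump | grep -A 6 '"filestore"'

    # runtime overrides while diagnosing (example values):
    ceph tell osd.* injectargs '--filestore-commit-timeout 1200'
    ceph tell osd.* injectargs '--osd-op-thread-suicide-timeout 300 --filestore-op-thread-suicide-timeout 300'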
[ceph-users] mounting a VM rbd image as a /dev/rbd0 device
If I have an rbd image that is being used by a VM and I want to mount it as a read-only /dev/rbd0 kernel device, is that possible?  When I try it I get:

    mount: /dev/rbd0 is write-protected, mounting read-only
    mount: wrong fs type, bad option, bad superblock on /dev/rbd0, missing codepage or helper program, or other error

The rbd image when viewed from the VM has a /dev/vda disk with 2 partitions:

    Number  Start   End     Size    Type     File system  Flags
     1      1049kB  525MB   524MB   primary  xfs          boot
     2      525MB   12.9GB  12.4GB  primary               lvm

I wanted to view it thru the /dev/rbd0 mount because on one of my systems, the VM is not booting from the image.

-- Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
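[Editor's note] The image holds a whole-disk image with its own partition table, so /dev/rbd0 itself has no filesystem superblock to mount; the individual partitions have to be mounted instead.  A sketch of one way to do it (pool/image names are placeholders, and since a running VM may still be writing to the image, only read-only access is sane here):

    rbd map --read-only rbd/vm-image
    # the kernel exposes the partitions as /dev/rbd0p1, /dev/rbd0p2;
    # if they don't appear, re-read the partition table:
    partprobe /dev/rbd0

    # partition 1 is the xfs boot partition:
    mount -o ro,norecovery /dev/rbd0p1 /mnt/vm-boot

    # partition 2 is an LVM physical volume, so activate its VG to reach
    # the filesystems inside it:
    pvscan --cache
    vgchange -ay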
Re: [ceph-users] rbd cache command thru admin socket
Thanks, Jason--

Turns out AppArmor was indeed enabled (I was not aware of that).  Disabled it and now I see the socket, but it seems to only be there temporarily while some client app is running.

The original reason I wanted to use this socket was that I am also using an rbd image thru kvm and I wanted to be able to flush and invalidate the rbd cache as an experiment.  I would have thought the socket would get created and stay there as long as kvm is active (since kvm is using librbd).  But even when I access the rbd disk from the VM, I don't see any socket created at all.

-- Tom

> -----Original Message-----
> From: Jason Dillaman [mailto:jdill...@redhat.com]
> Sent: Thursday, June 30, 2016 6:15 PM
> To: Deneau, Tom <tom.den...@amd.com>
> Cc: ceph-users <ceph-us...@ceph.com>
> Subject: Re: [ceph-users] rbd cache command thru admin socket
>
> Can you check the permissions on "/var/run/ceph/" and ensure that the user
> your client runs under has permissions to access the directory?
> If the permissions are OK, do you have SElinux or AppArmor enabled and
> enforcing?
>
> On Thu, Jun 30, 2016 at 5:37 PM, Deneau, Tom <tom.den...@amd.com> wrote:
> > I was following the instructions in
> >
> > https://www.sebastien-han.fr/blog/2015/09/02/ceph-validate-that-the-rbd-cache-is-active/
> >
> > because I wanted to look at some of the rbd cache state and possibly
> > flush and invalidate it
> >
> > My ceph.conf has
> > [client]
> >     rbd default features = 1
> >     rbd default format = 2
> >     admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
> >     log file = /var/log/ceph/
> >
> > I have a client only node (no osds) and on that node I ran fio with the
> > librbd engine, which worked fine.  But I did not see anything in
> > /var/run/ceph on that client node.  Is there something else I have to do?
> >
> > -- Tom Deneau, AMD
>
> --
> Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
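[Editor's note] In the qemu/kvm case the librbd client lives inside the qemu process, which typically runs as a different user (e.g. libvirt-qemu), so /var/run/ceph also has to be writable by that user and the admin socket setting has to sit in a [client] section that qemu's ceph client name actually reads -- those details are assumptions about a typical libvirt setup, not something confirmed in this thread.  Once a socket does appear, the cache can be inspected and, if the librbd build exposes the commands, flushed and invalidated through it:

    ls -l /var/run/ceph/*.asok
    SOCK=/var/run/ceph/ceph-client.admin.12345.140234.asok   # use whatever name actually appears

    ceph --admin-daemon $SOCK config get rbd_cache
    ceph --admin-daemon $SOCK perf dump | grep -A 12 '"librbd'
    # if available in this librbd version:
    ceph --admin-daemon $SOCK rbd cache flush
    ceph --admin-daemon $SOCK rbd cache invalidate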
[ceph-users] rbd cache command thru admin socket
I was following the instructions in

https://www.sebastien-han.fr/blog/2015/09/02/ceph-validate-that-the-rbd-cache-is-active/

because I wanted to look at some of the rbd cache state and possibly flush and invalidate it.

My ceph.conf has

    [client]
        rbd default features = 1
        rbd default format = 2
        admin socket = /var/run/ceph/$cluster-$type.$id.$pid.$cctid.asok
        log file = /var/log/ceph/

I have a client only node (no osds) and on that node I ran fio with the librbd engine, which worked fine.  But I did not see anything in /var/run/ceph on that client node.  Is there something else I have to do?

-- Tom Deneau, AMD
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rgw pool names
Ah that makes sense.  The places where it was not adding the "default" prefix were all pre-jewel.

-- Tom

> -----Original Message-----
> From: Yehuda Sadeh-Weinraub [mailto:yeh...@redhat.com]
> Sent: Friday, June 10, 2016 2:36 PM
> To: Deneau, Tom <tom.den...@amd.com>
> Cc: ceph-users <ceph-us...@ceph.com>
> Subject: Re: [ceph-users] rgw pool names
>
> On Fri, Jun 10, 2016 at 11:44 AM, Deneau, Tom <tom.den...@amd.com> wrote:
> > When I start radosgw, I create the pool .rgw.buckets manually to
> > control whether it is replicated or erasure coded and I let the other
> > pools be created automatically.
> >
> > However, I have noticed that sometimes the pools get created with the "default"
> > prefix, thus
> >
> >     rados lspools
> >     .rgw.root
> >     default.rgw.control
> >     default.rgw.data.root
> >     default.rgw.gc
> >     default.rgw.log
> >     .rgw.buckets                 # the one I created
> >     default.rgw.users.uid
> >     default.rgw.users.keys
> >     default.rgw.meta
> >     default.rgw.buckets.index
> >     default.rgw.buckets.data     # the one actually being used
> >
> > What controls whether these pools have the "default" prefix or not?
> >
>
> The prefix is the name of the zone ('default' by default). This was added
> for the jewel release, as well as dropping the requirement of having the
> pool names start with a dot.
>
> Yehuda
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rgw pool names
When I start radosgw, I create the pool .rgw.buckets manually to control whether it is replicated or erasure coded and I let the other pools be created automatically.

However, I have noticed that sometimes the pools get created with the "default" prefix, thus

    rados lspools
    .rgw.root
    default.rgw.control
    default.rgw.data.root
    default.rgw.gc
    default.rgw.log
    .rgw.buckets                 # the one I created
    default.rgw.users.uid
    default.rgw.users.keys
    default.rgw.meta
    default.rgw.buckets.index
    default.rgw.buckets.data     # the one actually being used

What controls whether these pools have the "default" prefix or not?

-- Tom Deneau, AMD
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
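[Editor's note] On jewel the zone's pool names can be inspected (and edited) directly, which also shows where the prefix comes from.  A sketch assuming the stock zone named 'default':

    radosgw-admin zone get --rgw-zone=default > zone.json
    # zone.json lists domain_root, control_pool, gc_pool, log_pool, the user
    # pools and the placement target's index_pool / data_pool, all carrying
    # the zone-name prefix.

    # hypothetical edit: point data_pool at a pre-created EC or replicated
    # pool, then push the change back and restart radosgw
    radosgw-admin zone set --rgw-zone=default < zone.json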
Re: [ceph-users] mount -t ceph
I was using SLES 12 SP1, which has 3.12.49.  It did have a /usr/sbin/mount.ceph command but using it gave

    modprobe: FATAL: Module ceph not found.
    failed to load ceph kernel module (1)

-- Tom

> -----Original Message-----
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: Wednesday, April 27, 2016 2:59 PM
> To: Deneau, Tom <tom.den...@amd.com>
> Cc: ceph-users <ceph-us...@ceph.com>
> Subject: Re: [ceph-users] mount -t ceph
>
> On Wed, Apr 27, 2016 at 2:55 PM, Deneau, Tom <tom.den...@amd.com> wrote:
> > What kernel versions are required to be able to use CephFS thru mount -t ceph?
>
> The CephFS kernel client has been in for ages (2.6.34, I think?), but you
> want the absolute latest you can make happen if you're going to try it out.
> The actual mount command requires you have mount.ceph, which is in
> different places/availabilities depending on your distro.
> -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
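[Editor's note] That modprobe failure means the running kernel simply does not have the ceph filesystem module built or installed, so no userspace package will help until that changes.  A quick sketch for checking, plus what a basic kernel-client mount looks like once the module loads (monitor address and keyring path are placeholders):

    # is cephfs built for this kernel?
    grep CEPH_FS /boot/config-$(uname -r)      # want CONFIG_CEPH_FS=m or =y
    modprobe ceph && lsmod | grep '^ceph'

    # basic kernel mount once the module loads:
    mount -t ceph mon1.example.com:6789:/ /mnt/cephfs \
          -o name=admin,secret=$(ceph-authtool -p /etc/ceph/ceph.client.admin.keyring)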
[ceph-users] mount -t ceph
What kernel versions are required to be able to use CephFS thru mount -t ceph? -- Tom Deneau ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] yum install ceph on RHEL 7.2
> -Original Message- > From: Ken Dreyer [mailto:kdre...@redhat.com] > Sent: Tuesday, March 08, 2016 10:24 PM > To: Shinobu Kinjo > Cc: Deneau, Tom; ceph-users > Subject: Re: [ceph-users] yum install ceph on RHEL 7.2 > > On Tue, Mar 8, 2016 at 4:11 PM, Shinobu Kinjo <shinobu...@gmail.com> > wrote: > > If you register subscription properly, you should be able to install > > the Ceph without the EPEL. > > The opposite is true (when installing upstream / ceph.com). > > We rely on EPEL for several things, like leveldb and xmlstarlet. > > - Ken Ken -- What about when you just do yum install from the preconfigured repos? Should EPEL be required for that? -- Tom ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
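[Editor's note] For what it's worth, on a plain RHEL 7 box the upstream ceph.com packages won't install until EPEL (and usually the "optional" channel) is enabled, because of the dependencies like leveldb and xmlstarlet mentioned above.  A sketch of the usual sequence -- the repo id may differ by subscription type:

    yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    subscription-manager repos --enable=rhel-7-server-optional-rpms
    yum install -y ceph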
Re: [ceph-users] yum install ceph on RHEL 7.2
Yes, that is what lsb_release is showing... > -Original Message- > From: Shinobu Kinjo [mailto:shinobu...@gmail.com] > Sent: Tuesday, March 08, 2016 5:01 PM > To: Deneau, Tom > Cc: ceph-users > Subject: Re: [ceph-users] yum install ceph on RHEL 7.2 > > On Wed, Mar 9, 2016 at 7:52 AM, Deneau, Tom <tom.den...@amd.com> wrote: > > Just checking... > > > > On vanilla RHEL 7.2 (x64), should I be able to yum install ceph without > adding the EPEL repository? > > Are you talking about? > > # lsb_release -a > ... > Description:Red Hat Enterprise Linux Server release 7.2 (Maipo) > Release:7.2 > Codename:Maipo > > Cheers, > S > > > (looks like the version being installed is 0.94.6) > > > > -- Tom Deneau, AMD > > > > > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > -- > Email: > shin...@linux.com > GitHub: > shinobu-x > Blog: > Life with Distributed Computational System based on OpenSource ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] yum install ceph on RHEL 7.2
Just checking... On vanilla RHEL 7.2 (x64), should I be able to yum install ceph without adding the EPEL repository? (looks like the version being installed is 0.94.6) -- Tom Deneau, AMD ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rbd kernel mapping on 3.13
The commands shown below had successfully mapped rbd images in the past on kernel version 4.1.

Now I need to map one on a system running the 3.13 kernel.  Ceph version is 9.2.0.  Rados bench operations work with no problem.  I get the same error message whether I use format 1 or format 2 or --image-shared.  Is there something different I need to do with the 3.13 kernel?

-- Tom

    # rbd create --size 1000 --image-format 1 rbd/rbddemo
    # rbd info rbddemo
    rbd image 'rbddemo':
            size 1000 MB in 250 objects
            order 22 (4096 kB objects)
            block_name_prefix: rb.0.4f08.77bd73c7
            format: 1

    # rbd map rbd/rbddemo
    rbd: sysfs write failed
    rbd: map failed: (5) Input/output error
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rbd kernel mapping on 3.13
Ah, yes I see this... feature set mismatch, my 4a042a42 < server's 104a042a42, missing 10 which looks like CEPH_FEATURE_CRUSH_V2 Is there any workaround for that? Or what ceph version would I have to back up to? The cbt librbdfio benchmark worked fine (once I had installed librbd-dev on the client). -- Tom > -Original Message- > From: Ilya Dryomov [mailto:idryo...@gmail.com] > Sent: Friday, January 29, 2016 4:53 PM > To: Deneau, Tom > Cc: ceph-users; c...@lists.ceph.com > Subject: Re: [ceph-users] rbd kernel mapping on 3.13 > > On Fri, Jan 29, 2016 at 11:43 PM, Deneau, Tom <tom.den...@amd.com> wrote: > > The commands shown below had successfully mapped rbd images in the past > on kernel version 4.1. > > > > Now I need to map one on a system running the 3.13 kernel. > > Ceph version is 9.2.0. Rados bench operations work with no problem. > > I get the same error message whether I use format 1 or format 2 or -- > image-shared. > > Is there something different I need to with the 3.13 kernel? > > > > -- Tom > > > > # rbd create --size 1000 --image-format 1 rbd/rbddemo > > # rbd info rbddemo > > rbd image 'rbddemo': > > size 1000 MB in 250 objects > > order 22 (4096 kB objects) > > block_name_prefix: rb.0.4f08.77bd73c7 > > format: 1 > > > > # rbd map rbd/rbddemo > > rbd: sysfs write failed > > rbd: map failed: (5) Input/output error > > You are likely missing feature bits - 3.13 was released way before 9.2.0. > The exact error is printed to the kernel log - do dmesg | tail or so. > > Thanks, > > Ilya ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
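[Editor's note] In case it helps anyone searching later: the missing bit maps to CRUSH_V2, which a 3.13 kernel predates, so the choices are basically a newer kernel on the client or rolling the cluster's crush tunables/features back to something the old kernel understands.  A sketch, with the strong caveats that changing tunables triggers data movement, and that whether this alone clears the CRUSH_V2 requirement depends on what in the crush map is demanding it (erasure-code rules, for instance, can also require it):

    dmesg | tail                       # shows the exact feature-set mismatch
    ceph osd crush show-tunables       # which profile the cluster is using

    # roll the tunables back to a profile an old kernel client can speak
    # (causes rebalancing; 'bobtail' is only an example target):
    ceph osd crush tunables bobtail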
[ceph-users] s3cmd --disable-multipart
If using s3cmd to radosgw and using s3cmd's --disable-multipart option, is there any limit to the size of the object that can be stored thru radosgw? Also, is there a recommendation for multipart chunk size for radosgw? -- Tom ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
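[Editor's note] Two data points that may help frame this, with the caveat that exact defaults vary by release: with multipart disabled the whole object goes through a single S3 PUT, so the effective ceiling is radosgw's rgw_max_put_size (5 GB by default on recent releases); with multipart enabled the chunk size is chosen by s3cmd, not by radosgw.  A sketch of the relevant knobs (values are examples only):

    # s3cmd side: pick the multipart chunk size (15 MB is s3cmd's default)
    s3cmd put --multipart-chunk-size-mb=64 bigfile s3://mybucket/bigfile

    # or force a single PUT for this transfer
    s3cmd put --disable-multipart bigfile s3://mybucket/bigfile

    # radosgw side: the single-PUT ceiling, shown at its usual 5 GB default
    #   rgw max put size = 5368709120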
[ceph-users] pgs per OSD
I have the following 4 pools:

    pool 1 'rep2host' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 128 pgp_num 128 last_change 88 flags hashpspool stripe_width 0
    pool 17 'rep2osd' replicated size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 256 pgp_num 256 last_change 154 flags hashpspool stripe_width 0
    pool 20 'ec104osd' erasure size 14 min_size 10 crush_ruleset 7 object_hash rjenkins pg_num 256 pgp_num 256 last_change 163 flags hashpspool stripe_width 4160
    pool 21 'ec32osd' erasure size 5 min_size 3 crush_ruleset 6 object_hash rjenkins pg_num 256 pgp_num 256 last_change 165 flags hashpspool stripe_width 4128

with 15 up osds, and ceph health tells me I have

    too many PGs per OSD (375 > 300)

I'm not sure where the 375 comes from, since there are 896 pgs and 15 osds = approx. 60 pgs per OSD.

-- Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
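[Editor's note] My understanding (worth double-checking) is that the warning counts every PG instance an OSD holds -- pg_num multiplied by the pool's size, i.e. the replica count or k+m for erasure pools -- not just the primaries, and with these four pools that works out to exactly 375:

    # 128*2 (rep2host) + 256*2 (rep2osd) + 256*14 (ec104osd) + 256*5 (ec32osd)
    # spread over 15 OSDs:
    echo $(( (128*2 + 256*2 + 256*14 + 256*5) / 15 ))    # -> 375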
[ceph-users] erasure pool, ruleset-root
I see that I can create a crush rule that only selects osds from a certain node by this: ceph osd crush rule create-simple byosdn1 myhostname osd and if I then create a replicated pool that uses that rule, it does indeed select osds only from that node. I would like to do a similar thing with an erasure pool. When creating the ec-profile, I have successfully used ruleset-failure-domain=osd but when I try to use ruleset-root=myhostname and then use that profile to create an erasure pool, the resulting pool doesn't seem to limit to that node. What is the correct syntax for creating such an erasure pool? -- Tom Deneau ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
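[Editor's note] For reference, here is the kind of sequence that is supposed to work -- a sketch under two assumptions: 'myhostname' must be the name of an existing CRUSH bucket, and the profile has to be set before the pool is created, since the crush rule is generated from the profile at pool-creation time and later profile edits don't change existing rules:

    ceph osd erasure-code-profile set ec-onehost \
        k=2 m=1 \
        ruleset-root=myhostname \
        ruleset-failure-domain=osd
    ceph osd erasure-code-profile get ec-onehost

    ceph osd pool create ecpool-onehost 128 128 erasure ec-onehost

    # verify the generated rule (usually named after the pool) actually
    # takes myhostname as its root:
    ceph osd crush rule dump ecpool-onehost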
Re: [ceph-users] rados bench seq throttling
> -----Original Message-----
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: Monday, September 14, 2015 5:32 PM
> To: Deneau, Tom
> Cc: ceph-users
> Subject: Re: [ceph-users] rados bench seq throttling
>
> On Thu, Sep 10, 2015 at 1:02 PM, Deneau, Tom <tom.den...@amd.com> wrote:
> > Running 9.0.3 rados bench on a 9.0.3 cluster...
> > In the following experiments this cluster is only 2 osd nodes, 6 osds
> > each and a separate mon node (and a separate client running rados bench).
> >
> > I have two pools populated with 4M objects.  The pools are replicated
> > x2 with identical parameters.  The objects appear to be spread evenly across the 12 osds.
> >
> > In all cases I drop caches on all nodes before doing a rados bench seq test.
> > In all cases I run rados bench seq for identical times (30 seconds)
> > and in that time we do not run out of objects to read from the pool.
> >
> > I am seeing significant bandwidth differences between the following:
> >
> >    * running a single instance of rados bench reading from one pool with 32 threads
> >      (bandwidth approx 300)
> >
> >    * running two instances rados bench each reading from one of the two pools
> >      with 16 threads per instance (combined bandwidth approx. 450)
> >
> > I have already increased the following:
> >     objecter_inflight_op_bytes = 10485760
> >     objecter_inflight_ops = 8192
> >     ms_dispatch_throttle_bytes = 1048576000   # didn't seem to have any effect
> >
> > The disks and network are not reaching anywhere near 100% utilization
> >
> > What is the best way to diagnose what is throttling things in the one-instance case?
>
> Pretty sure the rados bench main threads are just running into their
> limits.  There's some work that Piotr (I think?) has been doing to make it
> more efficient if you want to browse the PRs, but I don't think they're
> even in a dev release yet.
> -Greg

Some further experiments with numbers of rados-bench clients:

   * All of the following are reading 4M sized objects with dropped caches as described above.
   * When we run multiple clients, they are run on different pools but from the same
     separate client node, which is not anywhere near CPU or network-limited.
   * threads is the total across all clients, as is BW.

Case 1: two node cluster, 3 osds on each node

    total       BW       BW       BW
    threads     1 cli    2 cli    4 cli
    -------     -----    -----    -----
       4         174      185      194
       8         214      273      301
      16         198      309      399
      32         226      309      409
      64         246      341      421

Case 2: one node cluster, 6 osds on one node

    total       BW       BW       BW
    threads     1 cli    2 cli    4 cli
    -------     -----    -----    -----
       4         339      262      236
       8         465      426      383
      16         467      433      353
      32         470      432      339
      64         471      429      345

So, from the above data, having multiple clients definitely helps in the 2-node case (Case 1) but hurts in the single-node case.

Still interested in any tools that would help analyze this more deeply...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rados bench seq throttling
Running 9.0.3 rados bench on a 9.0.3 cluster...

In the following experiments this cluster is only 2 osd nodes, 6 osds each and a separate mon node (and a separate client running rados bench).

I have two pools populated with 4M objects.  The pools are replicated x2 with identical parameters.  The objects appear to be spread evenly across the 12 osds.

In all cases I drop caches on all nodes before doing a rados bench seq test.  In all cases I run rados bench seq for identical times (30 seconds) and in that time we do not run out of objects to read from the pool.

I am seeing significant bandwidth differences between the following:

   * running a single instance of rados bench reading from one pool with 32 threads
     (bandwidth approx 300)

   * running two instances of rados bench, each reading from one of the two pools
     with 16 threads per instance (combined bandwidth approx. 450)

I have already increased the following:

    objecter_inflight_op_bytes = 10485760
    objecter_inflight_ops = 8192
    ms_dispatch_throttle_bytes = 1048576000   # didn't seem to have any effect

The disks and network are not reaching anywhere near 100% utilization.

What is the best way to diagnose what is throttling things in the one-instance case?

-- Tom Deneau, AMD
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
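[Editor's note] One way to get visibility into where the single client stalls is to give the rados bench process its own admin socket and watch the objecter and throttle counters while it runs; the OSD-side message throttles can be checked the same way.  The socket name below is a placeholder and the exact counter names can vary by release:

    # ceph.conf on the client node:
    #   [client]
    #       admin socket = /var/run/ceph/$cluster-$name.$pid.asok

    # while rados bench is running:
    SOCK=$(ls /var/run/ceph/*.asok | head -1)
    ceph --admin-daemon $SOCK perf dump | grep -A 20 '"objecter"'
    ceph --admin-daemon $SOCK perf dump | grep -A 6 '"throttle-objecter'

    # on an OSD node, check whether the dispatch throttler ever fills up:
    ceph daemon osd.0 perf dump | grep -A 6 '"throttle-msgr_dispatch_throttler'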
[ceph-users] ensuring write activity is finished
When measuring read bandwidth using rados bench, I've been doing the following:

   * write some objects using rados bench write --no-cleanup
   * drop caches on the osd nodes
   * use rados bench seq to read.

I've noticed that on the first rados bench seq immediately following the rados bench write, there is often activity on the journal partitions which must be a carry over from the rados bench write.

What is the preferred way to ensure that all write activity is finished before starting to use rados bench seq?

-- Tom Deneau
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
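[Editor's note] Lacking a built-in "wait for everything to settle" command, one crude approach is to poll the data and journal devices on each OSD node until they go idle before dropping caches; the device names below are examples:

    # keep sampling 5-second iostat windows until no device shows tps > 1
    until [ -z "$(iostat -y -d 5 1 /dev/sdb /dev/sdc | egrep '^sd[bc]' | awk '$2 > 1')" ]; do
        echo "still flushing..."
    done
    echo 3 > /proc/sys/vm/drop_caches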
Re: [ceph-users] osds on 2 nodes vs. on one node
Rewording to remove confusion...

Config 1: set up a cluster with 1 node with 6 OSDs
Config 2: identical hardware, set up a cluster with 2 nodes with 3 OSDs each

In each case I do the following:
   1) rados bench write --no-cleanup the same number of 4M size objects
   2) drop caches on all osd nodes
   3) rados bench seq -t 4 to sequentially read the objects and record the read bandwidth

Rados bench is running on a separate client, not on an OSD node.  The client has plenty of spare CPU power and the network and disk utilization are not limiting factors.

With Config 1, I see approximately 70% more sequential read bandwidth than with Config 2.

In both cases the primary OSDs of the objects appear evenly distributed across OSDs.

Yes, replication factor is 2 but since we are only measuring read performance, I don't think that matters.

Question is whether there is a ceph parameter that might be throttling the 2 node configuration?

-- Tom

> -----Original Message-----
> From: Christian Balzer [mailto:ch...@gol.com]
> Sent: Wednesday, September 02, 2015 7:29 PM
> To: ceph-users
> Cc: Deneau, Tom
> Subject: Re: [ceph-users] osds on 2 nodes vs. on one node
>
> Hello,
>
> On Wed, 2 Sep 2015 22:38:12 +0000 Deneau, Tom wrote:
>
> > In a small cluster I have 2 OSD nodes with identical hardware, each
> > with 6 osds.
> >
> > * Configuration 1: I shut down the osds on one node so I am using 6
> > OSDS on a single node
>
> Shut down how?
> Just a "service blah stop" or actually removing them from the cluster aka
> CRUSH map?
>
> > * Configuration 2: I shut down 3 osds on each node so now I have 6
> > total OSDS but 3 on each node.
>
> Same as above.
> And in this case even more relevant, because just shutting down random OSDs
> on both nodes would result in massive recovery action at best and more likely
> a broken cluster.
>
> > I measure read performance using rados bench from a separate client node.
> Default parameters?
>
> > The client has plenty of spare CPU power and the network and disk
> > utilization are not limiting factors.  In all cases, the pool type is
> > replicated so we're just reading from the primary.
>
> Replicated as in size 2?
> We can guess/assume that from your cluster size, but w/o you telling us or
> giving us all the various config/crush outputs that is only a guess.
>
> > With Configuration 1, I see approximately 70% more bandwidth than with
> > configuration 2.
>
> Never mind that bandwidth is mostly irrelevant in real life, which bandwidth,
> read or write?
>
> > In general, any configuration where the osds span 2 nodes gets poorer
> > performance but in particular when the 2 nodes have equal amounts of
> > traffic.
>
> Again, guessing from what you're actually doing this isn't particular
> surprising.
> Because with a single node, default rules and replication of 2 your OSDs
> never have to replicate anything when it comes to writes.
> Whereas with 2 nodes replication happens and takes more time (latency) and
> might also saturate your network (we have of course no idea how your cluster
> looks like).
>
> Christian
>
> > Is there any ceph parameter that might be throttling the cases where
> > osds span 2 nodes?
> > > > -- Tom Deneau, AMD > > ___ > > ceph-users mailing list > > ceph-users@lists.ceph.com > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > -- > Christian BalzerNetwork/Systems Engineer > ch...@gol.com Global OnLine Japan/Fusion Communications > http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
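[Editor's note] As an aside, for anyone wanting to verify the "primaries are evenly distributed" part themselves, the mapping of each benchmark object to its acting set and primary can be dumped directly; the pool name here is just a placeholder and the object names are whatever rados bench created:

    # list a few of the benchmark objects and show where each one maps;
    # the first OSD in the acting set is the primary
    rados -p testpool ls | head -20 | while read obj; do
        ceph osd map testpool "$obj"
    done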
Re: [ceph-users] osds on 2 nodes vs. on one node
After running some other experiments, I see now that the high single-node bandwidth only occurs when ceph-mon is also running on that same node.  (In these small clusters I only had one ceph-mon running).

If I compare to a single-node where ceph-mon is not running, I see basically identical performance to the two-node arrangement.

So now my question is: Is it expected that there would be such a large performance difference between using osds on a single node where ceph-mon is running vs. using osds on a single node where ceph-mon is not running?

-- Tom

> -Original Message- > From: Deneau, Tom > Sent: Thursday, September 03, 2015 10:39 AM > To: 'Christian Balzer'; ceph-users > Subject: RE: [ceph-users] osds on 2 nodes vs. on one node > > Rewording to remove confusion... > > Config 1: set up a cluster with 1 node with 6 OSDs Config 2: identical > hardware, set up a cluster with 2 nodes with 3 OSDs each > > In each case I do the following: >1) rados bench write --no-cleanup the same number of 4M size objects >2) drop caches on all osd nodes >3) rados bench seq -t 4 to sequentially read the objects > and record the read bandwidth > > Rados bench is running on a separate client, not on an OSD node. > The client has plenty of spare CPU power and the network and disk utilization > are not limiting factors. > > With Config 1, I see approximately 70% more sequential read bandwidth than > with Config 2. > > In both cases the primary OSDs of the objects appear evenly distributed > across OSDs. > > Yes, replication factor is 2 but since we are only measuring read > performance, I don't think that matters. > > Question is whether there is a ceph parameter that might be throttling the > 2 node configuration? > > -- Tom > > > -Original Message- > > From: Christian Balzer [mailto:ch...@gol.com] > > Sent: Wednesday, September 02, 2015 7:29 PM > > To: ceph-users > > Cc: Deneau, Tom > > Subject: Re: [ceph-users] osds on 2 nodes vs. on one node > > > > > > Hello, > > > > On Wed, 2 Sep 2015 22:38:12 +0000 Deneau, Tom wrote: > > > > > In a small cluster I have 2 OSD nodes with identical hardware, each > > > with > > > 6 osds. > > > > > > * Configuration 1: I shut down the osds on one node so I am using 6 > > > OSDS on a single node > > > > > Shut down how? > > Just a "service blah stop" or actually removing them from the cluster > > aka CRUSH map? > > > > > * Configuration 2: I shut down 3 osds on each node so now I have 6 > > > total OSDS but 3 on each node. > > > > > Same as above. > > And in this case even more relevant, because just shutting down random > > OSDs on both nodes would result in massive recovery action at best and > > more likely a broken cluster. > > > > > I measure read performance using rados bench from a separate client node. > > Default parameters? > > > > > The client has plenty of spare CPU power and the network and disk > > > utilization are not limiting factors.  In all cases, the pool type is > > > replicated so we're just reading from the primary. > > > > > Replicated as in size 2? > > We can guess/assume that from your cluster size, but w/o you telling > > us or giving us all the various config/crush outputs that is only a guess. > > > > > With Configuration 1, I see approximately 70% more bandwidth than > > > with configuration 2. > > > > Never mind that bandwidth is mostly irrelevant in real life, which > > bandwidth, read or write?
> > > > > In general, any configuration where the osds span 2 nodes gets > > > poorer performance but in particular when the 2 nodes have equal > > > amounts of traffic. > > > > > > > Again, guessing from what you're actually doing this isn't particular > > surprising. > > Because with a single node, default rules and replication of 2 your > > OSDs never have to replicate anything when it comes to writes. > > Whereas with 2 nodes replication happens and takes more time (latency) > > and might also saturate your network (we have of course no idea how > > your cluster looks like). > > > > Christian > > > > > Is there any ceph parameter that might be throttling the cases where > > > osds span 2 nodes? > > > > > > -- Tom Deneau, AMD > > > ___ > > > ceph-users mailing list > > > ceph-users@lists.ceph.com > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com > > > > > > > > > -- > > Christian BalzerNetwork/Systems Engineer > > ch...@gol.com Global OnLine Japan/Fusion Communications > > http://www.gol.com/ ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] osds on 2 nodes vs. on one node
In a small cluster I have 2 OSD nodes with identical hardware, each with 6 osds.

   * Configuration 1: I shut down the osds on one node so I am using 6 OSDS on a single node

   * Configuration 2: I shut down 3 osds on each node so now I have 6 total OSDS but 3 on each node.

I measure read performance using rados bench from a separate client node.  The client has plenty of spare CPU power and the network and disk utilization are not limiting factors.  In all cases, the pool type is replicated so we're just reading from the primary.

With Configuration 1, I see approximately 70% more bandwidth than with configuration 2.  In general, any configuration where the osds span 2 nodes gets poorer performance but in particular when the 2 nodes have equal amounts of traffic.

Is there any ceph parameter that might be throttling the cases where osds span 2 nodes?

-- Tom Deneau, AMD
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] a couple of radosgw questions
I see that the objects that were deleted last Friday are indeed gone now (via gc I guess).

gc list does not show anything even right after objects are deleted.  I couldn't get temp remove to do anything.

-- Tom

> -----Original Message-----
> From: Ben Hines [mailto:bhi...@gmail.com]
> Sent: Saturday, August 29, 2015 5:27 PM
> To: Brad Hubbard
> Cc: Deneau, Tom; ceph-users
> Subject: Re: [ceph-users] a couple of radosgw questions
>
> I'm not the OP, but in my particular case, gc is proceeding normally
> (since 94.2, i think) -- i just have millions of older objects
> (months-old) which will not go away.
>
> (see my other post --
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-August/003967.html )
>
> -Ben
>
> On Fri, Aug 28, 2015 at 5:14 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
> > - Original Message -
> >> From: "Ben Hines" <bhi...@gmail.com>
> >> To: "Brad Hubbard" <bhubb...@redhat.com>
> >> Cc: "Tom Deneau" <tom.den...@amd.com>, "ceph-users" <ceph-us...@ceph.com>
> >> Sent: Saturday, 29 August, 2015 9:49:00 AM
> >> Subject: Re: [ceph-users] a couple of radosgw questions
> >>
> >> 16:22:38 root@sm-cephrgw4 /etc/ceph $ radosgw-admin temp remove
> >> unrecognized arg remove
> >> usage: radosgw-admin [options...]
> >> commands:
> >>
> >> temp remove    remove temporary objects that were created up to
> >> specified date (and optional time)
> >
> > Looking into this ambiguity, thanks.
> >
> >>
> >> On Fri, Aug 28, 2015 at 4:24 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
> >> > emove an object, it is no longer visible
> >> >> from the S3 API, but the objects
> >> >>    that comprised it are still there in .rgw.buckets pool.  When do they
> >> >>    get removed?
> >> >
> >> > Does the following command remove them?
> >> >
> >> > http://ceph.com/docs/master/radosgw/purge-temp/
> >> >
> >
> > Does "radosgw-admin gc list" show anything?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] a couple of radosgw questions
A couple of questions on the radosgw...

1. I noticed when I use s3cmd to put a 10M object into a bucket in the rados object gateway, I get the following objects created in .rgw.buckets:

       0.5M
       4M
       4M
       1.5M

   I assume the 4M breakdown is controlled by rgw obj stripe size.  What causes the small initial 0.5M piece?  Also, is there any diagram showing which parts of this striping, if any, occur in parallel?

2. I noticed when I use s3cmd to remove an object, it is no longer visible from the S3 API, but the objects that comprised it are still there in .rgw.buckets pool.  When do they get removed?

-- Tom Deneau, AMD
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
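[Editor's note] Two pieces of background that may be useful here, with the caveat that exact defaults vary by release: the first ~0.5M of data lives in the object's "head" (bounded by rgw_max_chunk_size, 512 KB by default), with the remainder striped in rgw_obj_stripe_size pieces; and the deleted tail objects are removed later by the radosgw garbage collector, whose timing is governed by the rgw_gc_* settings.  A sketch of the relevant knobs and commands:

    # ceph.conf ([client.radosgw.*] section) -- values shown are the usual defaults
    #   rgw max chunk size      = 524288      # size of the head object's data portion
    #   rgw obj stripe size     = 4194304     # 4M stripes for the tail
    #   rgw gc obj min wait     = 7200        # seconds a deleted tail waits before GC eligibility
    #   rgw gc processor period = 3600        # how often the GC runs

    # inspect / drive the garbage collector:
    radosgw-admin gc list --include-all
    radosgw-admin gc process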
Re: [ceph-users] rados bench object not correct errors on v9.0.3
> -----Original Message-----
> From: Dałek, Piotr [mailto:piotr.da...@ts.fujitsu.com]
> Sent: Wednesday, August 26, 2015 2:02 AM
> To: Sage Weil; Deneau, Tom
> Cc: ceph-de...@vger.kernel.org; ceph-us...@ceph.com
> Subject: RE: rados bench object not correct errors on v9.0.3
>
> > -----Original Message-----
> > From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> > Sent: Tuesday, August 25, 2015 7:43 PM
> >
> > > I have built rpms from the tarball http://ceph.com/download/ceph-9.0.3.tar.bz2.
> > > Have done this for fedora 21 x86_64 and for aarch64.
> > >
> > > On both platforms when I run a single node cluster with a few osds and run
> > > rados bench read tests (either seq or rand) I get occasional reports like
> > >
> > >     benchmark_data_myhost_20729_object73 is not correct!
> > >
> > > I never saw these with similar rpm builds on these platforms from 9.0.2 sources.
> > > Also, if I go to an x86-64 system running Ubuntu trusty for which I am able to
> > > install prebuilt binary packages via ceph-deploy install --dev v9.0.3
> > > I do not see the errors there.
> >
> > Hrm.. haven't seen it on this end, but we're running/testing master and not
> > 9.0.2 specifically.  If you can reproduce this on master, that'd be very helpful!
> > There have been some recent changes to rados bench...
> >
> > Piotr, does this seem like it might be caused by your changes?
>
> Yes. My PR #4690 (https://github.com/ceph/ceph/pull/4690) caused rados bench
> to be fast enough to sometimes run into a race condition between librados's AIO
> and objbencher processing.  That was fixed in PR #5152
> (https://github.com/ceph/ceph/pull/5152) which didn't make it into 9.0.3.
> Tom, you can confirm this by inspecting the contents of objects questioned
> (their contents should be perfectly fine and in line with other objects).
> In the meantime you can either apply the patch from PR #5152 on your own or
> use --no-verify.
>
> With best regards / Pozdrawiam
> Piotr Dałek

Piotr --

Thank you.  Yes, when I looked at the contents of the objects they always looked correct.  And yes a single object would sometimes report an error and sometimes not.  So a race condition makes sense.

A couple of questions:

   * Why would I not see this behavior using the pre-built 9.0.3 binaries that get installed using ceph-deploy install --dev v9.0.3?  I would assume this is built from the same sources as the 9.0.3 tarball.

   * So I assume one should not compare pre 9.0.3 rados bench numbers with 9.0.3 and after?  The pull request https://github.com/ceph/ceph/pull/4690 did not mention the effect on final bandwidth numbers, did you notice what that effect was?

-- Tom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] rados bench object not correct errors on v9.0.3
-Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, August 24, 2015 12:45 PM To: ceph-annou...@ceph.com; ceph-de...@vger.kernel.org; ceph-us...@ceph.com; ceph-maintain...@ceph.com Subject: v9.0.3 released This is the second to last batch of development work for the Infernalis cycle. The most intrusive change is an internal (non user-visible) change to the OSD's ObjectStore interface. Many fixes and improvements elsewhere across RGW, RBD, and another big pile of CephFS scrub/repair improvements. Getting Ceph * Git at git://github.com/ceph/ceph.git * Tarball at http://ceph.com/download/ceph-9.0.3.tar.gz * For packages, see http://ceph.com/docs/master/install/get-packages * For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph- deploy -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html I have built rpms from the tarball http://ceph.com/download/ceph-9.0.3.tar.bz2. Have done this for fedora 21 x86_64 and for aarch64. On both platforms when I run a single node cluster with a few osds and run rados bench read tests (either seq or rand) I get occasional reports like benchmark_data_myhost_20729_object73 is not correct! I never saw these with similar rpm builds on these platforms from 9.0.2 sources. Also, if I go to an x86-64 system running Ubuntu trusty for which I am able to install prebuilt binary packages via ceph-deploy install --dev v9.0.3 I do not see the errors there. Any suggestions welcome. -- Tom Deneau, AMD ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rados bench object not correct errors on v9.0.3
-Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Tuesday, August 25, 2015 12:43 PM To: Deneau, Tom Cc: ceph-de...@vger.kernel.org; ceph-us...@ceph.com; piotr.da...@ts.fujitsu.com Subject: Re: rados bench object not correct errors on v9.0.3 On Tue, 25 Aug 2015, Deneau, Tom wrote: -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, August 24, 2015 12:45 PM To: ceph-annou...@ceph.com; ceph-de...@vger.kernel.org; ceph-us...@ceph.com; ceph-maintain...@ceph.com Subject: v9.0.3 released This is the second to last batch of development work for the Infernalis cycle. The most intrusive change is an internal (non user-visible) change to the OSD's ObjectStore interface. Many fixes and improvements elsewhere across RGW, RBD, and another big pile of CephFS scrub/repair improvements. Getting Ceph * Git at git://github.com/ceph/ceph.git * Tarball at http://ceph.com/download/ceph-9.0.3.tar.gz * For packages, see http://ceph.com/docs/master/install/get-packages * For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph- deploy -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html I have built rpms from the tarball http://ceph.com/download/ceph- 9.0.3.tar.bz2. Have done this for fedora 21 x86_64 and for aarch64. On both platforms when I run a single node cluster with a few osds and run rados bench read tests (either seq or rand) I get occasional reports like benchmark_data_myhost_20729_object73 is not correct! I never saw these with similar rpm builds on these platforms from 9.0.2 sources. Also, if I go to an x86-64 system running Ubuntu trusty for which I am able to install prebuilt binary packages via ceph-deploy install --dev v9.0.3 I do not see the errors there. Hrm.. haven't seen it on this end, but we're running/testing master and not 9.0.2 specifically. If you can reproduce this on master, that'd be very helpful! There have been some recent changes to rados bench... Piotr, does this seem like it might be caused by your changes? sage Just as a reminder this is with 9.0.3, not 9.0.2. I just tried with the osds running on the fedora machine (with rpms that I built from the tarball) and rados bench running on the Ubuntu machine (with pre-built binary packages) and I do not see the errors with that combination. Will see what happens with master. -- Tom ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] rados bench object not correct errors on v9.0.3
-Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- ow...@vger.kernel.org] On Behalf Of Deneau, Tom Sent: Tuesday, August 25, 2015 1:24 PM To: Sage Weil Cc: ceph-de...@vger.kernel.org; ceph-us...@ceph.com; piotr.da...@ts.fujitsu.com Subject: RE: rados bench object not correct errors on v9.0.3 -Original Message- From: Sage Weil [mailto:sw...@redhat.com] Sent: Tuesday, August 25, 2015 12:43 PM To: Deneau, Tom Cc: ceph-de...@vger.kernel.org; ceph-us...@ceph.com; piotr.da...@ts.fujitsu.com Subject: Re: rados bench object not correct errors on v9.0.3 On Tue, 25 Aug 2015, Deneau, Tom wrote: -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel- ow...@vger.kernel.org] On Behalf Of Sage Weil Sent: Monday, August 24, 2015 12:45 PM To: ceph-annou...@ceph.com; ceph-de...@vger.kernel.org; ceph-us...@ceph.com; ceph-maintain...@ceph.com Subject: v9.0.3 released This is the second to last batch of development work for the Infernalis cycle. The most intrusive change is an internal (non user-visible) change to the OSD's ObjectStore interface. Many fixes and improvements elsewhere across RGW, RBD, and another big pile of CephFS scrub/repair improvements. Getting Ceph * Git at git://github.com/ceph/ceph.git * Tarball at http://ceph.com/download/ceph-9.0.3.tar.gz * For packages, see http://ceph.com/docs/master/install/get-packages * For ceph-deploy, see http://ceph.com/docs/master/install/install-ceph- deploy -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html I have built rpms from the tarball http://ceph.com/download/ceph- 9.0.3.tar.bz2. Have done this for fedora 21 x86_64 and for aarch64. On both platforms when I run a single node cluster with a few osds and run rados bench read tests (either seq or rand) I get occasional reports like benchmark_data_myhost_20729_object73 is not correct! I never saw these with similar rpm builds on these platforms from 9.0.2 sources. Also, if I go to an x86-64 system running Ubuntu trusty for which I am able to install prebuilt binary packages via ceph-deploy install --dev v9.0.3 I do not see the errors there. Hrm.. haven't seen it on this end, but we're running/testing master and not 9.0.2 specifically. If you can reproduce this on master, that'd be very helpful! There have been some recent changes to rados bench... Piotr, does this seem like it might be caused by your changes? sage Just as a reminder this is with 9.0.3, not 9.0.2. I just tried with the osds running on the fedora machine (with rpms that I built from the tarball) and rados bench running on the Ubuntu machine (with pre-built binary packages) and I do not see the errors with that combination. Will see what happens with master. -- Tom For making a tarball to build rpms from master, I did the following steps: # git checkout master # ./autogen.sh # ./configure # make dist-bzip2 then put the .bz2 file in the rpmbuild/SOURCES and put the spec file in rpmbuild/SPECS Are those the correct steps? Asking because when I do rpmbuild from the above I eventually get Processing files: ceph-9.0.3-0.fc21.x86_64 error: File not found: /root/rpmbuild/BUILDROOT/ceph-9.0.3-0.fc21.x86_64/usr/sbin/ceph-disk-activate error: File not found: /root/rpmbuild/BUILDROOT/ceph-9.0.3-0.fc21.x86_64/usr/sbin/ceph-disk-prepare -- Tom ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] load-gen throughput numbers
If I run rados load-gen with the following parameters: --num-objects 50 --max-ops 16 --min-object-size 4M --max-object-size 4M --min-op-len 4M --max-op-len 4M --percent 100 --target-throughput 2000 So every object is 4M in size and all the ops are reads of the entire 4M. I would assume this is equivalent to running rados bench rand on that pool if the pool has been previously filled with 50 4M objects. And I am assuming the --max-ops=16 is equivalent to having 16 concurrent threads in rados bench. And I have set the target throughput higher than is possible with my network. But when I run both rados load-gen and rados bench as described, I see that rados bench gets about twice the throughput of rados load-gen. Why would that be? I see there is a --max-backlog parameter, is there some setting of that parameter that would help the throughput? -- Tom Deneau ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] load-gen throughput numbers
Ah, I see that --max-backlog must be expressed in bytes/sec, in spite of what the --help message says. -- Tom -Original Message- From: Deneau, Tom Sent: Wednesday, July 22, 2015 5:09 PM To: 'ceph-users@lists.ceph.com' Subject: load-gen throughput numbers If I run rados load-gen with the following parameters: --num-objects 50 --max-ops 16 --min-object-size 4M --max-object-size 4M --min-op-len 4M --max-op-len 4M --percent 100 --target-throughput 2000 So every object is 4M in size and all the ops are reads of the entire 4M. I would assume this is equivalent to running rados bench rand on that pool if the pool has been previously filled with 50 4M objects. And I am assuming the --max-ops=16 is equivalent to having 16 concurrent threads in rados bench. And I have set the target throughput higher than is possible with my network. But when I run both rados load-gen and rados bench as described, I see that rados bench gets about twice the throughput of rados load-gen. Why would that be? I see there is a --max-backlog parameter, is there some setting of that parameter that would help the throughput? -- Tom Deneau ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
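[Editor's note] For the record, an invocation equivalent to the earlier one but with the backlog expressed in bytes/sec (so it no longer caps the run below the target) would look roughly like this -- a sketch mirroring the flags from the first message, with the throughput figure still being a cap rather than a promise:

    rados -p mypool load-gen \
        --num-objects 50 \
        --min-object-size 4M --max-object-size 4M \
        --min-op-len 4M --max-op-len 4M \
        --max-ops 16 \
        --percent 100 \
        --target-throughput 2000 \
        --max-backlog $((2000 * 1024 * 1024))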
Re: [ceph-users] Any workaround for ImportError: No module named ceph_argparse?
False alarm, things seem to be fine now. -- Tom -Original Message- From: Deneau, Tom Sent: Wednesday, July 15, 2015 1:11 PM To: ceph-users@lists.ceph.com Subject: Any workaround for ImportError: No module named ceph_argparse? I just installed 9.0.2 on Trusty using ceph-deploy install --testing and I am hitting the ImportError: No module named ceph_argparse issue. What is the best way to get around this issue and still run a version that is compatible with other (non-Ubuntu) nodes in the cluster that are running 9.0.1? -- Tom Deneau ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] Any workaround for ImportError: No module named ceph_argparse?
I just installed 9.0.2 on Trusty using ceph-deploy install --testing and I am hitting the ImportError: No module named ceph_argparse issue. What is the best way to get around this issue and still run a version that is compatible with other (non-Ubuntu) nodes in the cluster that are running 9.0.1? -- Tom Deneau ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] slow requests going up and down
I don't think there were any stale or unclean PGs, (when there are, I have seen health detail list them and it did not in this case). I have since restarted the 2 osds and the health went immediately to HEALTH_OK. -- Tom -Original Message- From: Will.Boege [mailto:will.bo...@target.com] Sent: Monday, July 13, 2015 10:19 PM To: Deneau, Tom; ceph-users@lists.ceph.com Subject: Re: [ceph-users] slow requests going up and down Does the ceph health detail show anything about stale or unclean PGs, or are you just getting the blocked ops messages? On 7/13/15, 5:38 PM, Deneau, Tom tom.den...@amd.com wrote: I have a cluster where over the weekend something happened and successive calls to ceph health detail show things like below. What does it mean when the number of blocked requests goes up and down like this? Some clients are still running successfully. -- Tom Deneau, AMD HEALTH_WARN 20 requests are blocked 32 sec; 2 osds have slow requests 20 ops are blocked 536871 sec 2 ops are blocked 536871 sec on osd.5 18 ops are blocked 536871 sec on osd.7 2 osds have slow requests HEALTH_WARN 4 requests are blocked 32 sec; 2 osds have slow requests 4 ops are blocked 536871 sec 2 ops are blocked 536871 sec on osd.5 2 ops are blocked 536871 sec on osd.7 2 osds have slow requests HEALTH_WARN 27 requests are blocked 32 sec; 2 osds have slow requests 27 ops are blocked 536871 sec 2 ops are blocked 536871 sec on osd.5 25 ops are blocked 536871 sec on osd.7 2 osds have slow requests HEALTH_WARN 34 requests are blocked 32 sec; 2 osds have slow requests 34 ops are blocked 536871 sec 9 ops are blocked 536871 sec on osd.5 25 ops are blocked 536871 sec on osd.7 2 osds have slow requests ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] slow requests going up and down
I have a cluster where over the weekend something happened and successive calls to ceph health detail show things like below.  What does it mean when the number of blocked requests goes up and down like this?  Some clients are still running successfully.

-- Tom Deneau, AMD

    HEALTH_WARN 20 requests are blocked > 32 sec; 2 osds have slow requests
    20 ops are blocked > 536871 sec
    2 ops are blocked > 536871 sec on osd.5
    18 ops are blocked > 536871 sec on osd.7
    2 osds have slow requests

    HEALTH_WARN 4 requests are blocked > 32 sec; 2 osds have slow requests
    4 ops are blocked > 536871 sec
    2 ops are blocked > 536871 sec on osd.5
    2 ops are blocked > 536871 sec on osd.7
    2 osds have slow requests

    HEALTH_WARN 27 requests are blocked > 32 sec; 2 osds have slow requests
    27 ops are blocked > 536871 sec
    2 ops are blocked > 536871 sec on osd.5
    25 ops are blocked > 536871 sec on osd.7
    2 osds have slow requests

    HEALTH_WARN 34 requests are blocked > 32 sec; 2 osds have slow requests
    34 ops are blocked > 536871 sec
    9 ops are blocked > 536871 sec on osd.5
    25 ops are blocked > 536871 sec on osd.7
    2 osds have slow requests
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
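[Editor's note] When the counts bounce around like this, the individual stuck ops can be inspected on the OSDs named in the warning; a small sketch using the admin socket on the nodes hosting osd.5 and osd.7:

    ceph daemon osd.5 dump_ops_in_flight    # currently blocked ops, with age and current step
    ceph daemon osd.7 dump_historic_ops     # recently completed slow ops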
[ceph-users] rados gateway to use ec pools
what is the correct way to make radosgw create its pools as erasure coded pools? -- Tom Deneau, AMD ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
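[Editor's note] The usual approach is to create the data pool yourself as erasure-coded before radosgw first writes to it, using the name radosgw will look for, and let the gateway create the remaining small pools as replicated (bucket index and metadata pools generally should not be EC).  A sketch using the pre-jewel pool naming that this thread era uses:

    ceph osd erasure-code-profile set rgwec k=4 m=2 ruleset-failure-domain=host
    ceph osd pool create .rgw.buckets 256 256 erasure rgwec
    # on jewel and later the data pool is named <zone>.rgw.buckets.data
    # (e.g. default.rgw.buckets.data) and can also be wired in via
    # radosgw-admin zone set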
[ceph-users] journal raw partition
I am experimenting with different external journal partitions as raw partitions (no file system), using

    ceph-deploy osd prepare foo:/mount-point-for-data-partition:journal-partition

followed by ceph-deploy osd activate (same arguments).

When the specified journal-partition is on an ssd drive I notice that osd prepare reports

    [WARNIN] DEBUG:ceph-disk:Journal /dev/sdb2 is a partition
    [WARNIN] WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same device as the osd data
    [WARNIN] INFO:ceph-disk:Running command: /usr/sbin/blkid -p -o udev /dev/sdb2
    [WARNIN] WARNING:ceph-disk:Journal /dev/sdb2 was not prepared with ceph-disk. Symlinking directly.
    [WARNIN] DEBUG:ceph-disk:Preparing osd data dir /var/local//dev/sdc1
    [WARNIN] DEBUG:ceph-disk:Creating symlink /var/local//dev/sdc1/journal -> /dev/sdb2

and the later ceph-deploy osd activate works fine.

But when the journal-partition is a small partition on the beginning of the data drive, osd prepare reports

    [WARNIN] DEBUG:ceph-disk:Journal is file /dev/sdc2    <=
    [WARNIN] WARNING:ceph-disk:OSD will not be hot-swappable if journal is not the same device as the osd data
    [WARNIN] DEBUG:ceph-disk:Preparing osd data dir /var/local//dev/sdc1
    [WARNIN] DEBUG:ceph-disk:Creating symlink /var/local//dev/sdc1/journal -> /dev/sdc2

and then the later ceph-deploy osd activate fails with

    [WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd --cluster ceph --mkfs --mkkey -i 0 --monmap /var/local//dev/sdc1/activate.monmap --osd-data /var/local//dev/sdc1 --osd-journal /var/local//dev/sdc1/journal --osd-uuid 05b3933e-9ac4-453a-9072-c7ebf242ba70 --keyring /var/local//dev/sdc1/keyring
    [WARNIN] 2015-04-30 13:09:30.869745 3ff9df5cec0 -1 journal FileJournal::_open: disabling aio for non-block journal.  Use journal_force_aio to force use of aio anyway
    [WARNIN] 2015-04-30 13:09:30.869810 3ff9df5cec0 -1 journal check: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected 05b3933e-9ac4-453a-9072-c7ebf242ba70, invalid (someone else's?) journal
    [WARNIN] 2015-04-30 13:09:30.869863 3ff9df5cec0 -1 filestore(/var/local//dev/sdc1) mkjournal error creating journal on /var/local//dev/sdc1/journal: (22) Invalid argument
    [WARNIN] 2015-04-30 13:09:30.869883 3ff9df5cec0 -1 OSD::mkfs: ObjectStore::mkfs failed with error -22
    [WARNIN] 2015-04-30 13:09:30.869934 3ff9df5cec0 -1 ** ERROR: error creating empty object store in /var/local//dev/sdc1: (22) Invalid argument

I'm assuming the problem started with osd prepare thinking that /dev/sdc2 was a file rather than a partition.  Is there some partition table thing I am missing here?  parted /dev/sdc print gives

    Disk /dev/sdc: 2000GB
    Sector size (logical/physical): 512B/512B
    Partition Table: gpt
    Disk Flags:

    Number  Start   End     Size    File system  Name     Flags
     2      1049kB  5000MB  4999MB               primary
     1      5000MB  2000GB  1995GB  xfs          primary

Not sure if it is related but I do know that in the past I had created a single partition on /dev/sdc and used that as an xfs data partition.

-- Tom Deneau, AMD
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
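[Editor's note] One possibility (an assumption, not confirmed in the thread) is that after the journal partition was carved out later, the kernel had not re-read the disk's partition table, so /dev/sdc2 was not yet a real block device node and ceph-disk fell back to treating the path as a plain file; leftover signatures from the partition's previous life can also confuse it.  A sketch of cleaning the partition up and tagging it with the GPT type code ceph-disk expects before retrying (device names and partition numbers are examples; this is destructive to whatever is on /dev/sdc2):

    # wipe any old signatures on the intended journal partition
    wipefs -a /dev/sdc2

    # tag it with the GPT type code used for ceph journals
    sgdisk --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 /dev/sdc
    partprobe /dev/sdc

    # then retry
    ceph-deploy osd prepare myhost:/dev/sdc1:/dev/sdc2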
[ceph-users] switching journal location
If my cluster is quiet and on one node I want to switch the location of the journal from the default location to a file on an SSD drive (or vice versa), what is the quickest way to do that? Can I make a soft link to the new location and do it without restarting the OSDs? -- Tom Deneau, AMD ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
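[Editor's note] A symlink alone is not quite enough, because the existing on-disk journal has to be flushed and a fresh one initialized at the new location, and the OSD does need to be stopped for that.  The usual sequence, as a sketch with N standing for the osd id and the target path as a placeholder:

    # stop the OSD and flush its journal into the filestore
    service ceph stop osd.N            # or: systemctl stop ceph-osd@N
    ceph-osd -i N --flush-journal

    # point the journal at the new location (raw partition or file)
    rm /var/lib/ceph/osd/ceph-N/journal
    ln -s /dev/disk/by-partuuid/<journal-partition-uuid> /var/lib/ceph/osd/ceph-N/journal

    # create the new journal and restart
    ceph-osd -i N --mkjournal
    service ceph start osd.N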
[ceph-users] installing and updating while leaving osd drive data intact
Referencing this old thread below, I am wondering what is the proper way to install say new versions of ceph and start up daemons but keep all the data on the osd drives. I had been using ceph-deploy new which I guess creates a new cluster fsid. Normally for my testing I had been starting with clean osd drives but I would also like to be able to restart and leave the osd drives as is. -- Tom Hi, I have faced a similar issue. This happens if the ceph disks aren't purged/cleaned completely. Clear of the contents in the /dev/sdb1 device. There is a file named ceph_fsid in the disk which would have the old cluster's fsid. This needs to be deleted for it to work. Hope it helps. Sharmila On Mon, May 26, 2014 at 2:52 PM, JinHwan Hwang calanchue at gmail.com wrote: I'm trying to install ceph 0.80.1 on ubuntu 14.04. All other things goes well except 'activate osd' phase. It tells me they can't find proper fsid when i do 'activate osd'. This is not my first time of installing ceph, and all the process i did was ok when i did on other(though they were ubuntu 12.04 , virtual machines, ceph-emperor) ceph at ceph-mon:~$ ceph-deploy osd activate ceph-osd0:/dev/sdb1 ceph-osd0:/dev/sdc1 ceph-osd1:/dev/sdb1 ceph-osd1:/dev/sdc1 ... [ceph-osd0][WARNIN] ceph-disk: Error: No cluster conf found in /etc/ceph with fsid 05b994a0-20f9-48d7-8d34-107ffcb39e5b .. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
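[Editor's note] One piece of this, as the quoted thread hints, is that ceph-deploy new invents a brand new fsid while every OSD data partition remembers the old one in its ceph_fsid file; reusing the old fsid (along with the original mon data and keyrings -- the fsid alone is not sufficient) is what lets existing OSDs activate again.  A small sketch:

    # the old cluster's fsid is recorded on each OSD data partition:
    cat /var/lib/ceph/osd/ceph-0/ceph_fsid

    # reuse that value instead of the one ceph-deploy new generated:
    #   [global]
    #   fsid = <value printed above>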
[ceph-users] object size in rados bench write
I've noticed when I use large object sizes like 100M with rados bench write, I get

    rados -p data2 bench 60 write --no-cleanup -b 100M
    Maintaining 16 concurrent writes of 104857600 bytes for up to 60 seconds or 0 objects
    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
      0       0         0         0         0         0         -         0
      1       3         3         0         0         0         -         0
      2       5         5         0         0         0         -         0
      3       8         8         0         0         0         -         0
      4      10        10         0         0         0         -         0
      5      13        13         0         0         0         -         0
      6      15        15         0         0         0         -         0
    error during benchmark: -5
    error 5: (5) Input/output error

An object_size of 32M works fine and the cluster seems otherwise fine.

Seems related to this issue
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-March/028288.html
But I didn't see a resolution for that.

Is there a timeout that is kicking in?

-- Tom Deneau
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] object size in rados bench write
Ah, I see there is an osd parameter for this

    osd max write size
    Description:  The maximum size of a write in megabytes.
    Default:      90

-----Original Message-----
From: Deneau, Tom
Sent: Wednesday, April 08, 2015 3:57 PM
To: 'ceph-users@lists.ceph.com'
Subject: object size in rados bench write

I've noticed when I use large object sizes like 100M with rados bench write, I get

    rados -p data2 bench 60 write --no-cleanup -b 100M
    Maintaining 16 concurrent writes of 104857600 bytes for up to 60 seconds or 0 objects
    sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
      0       0         0         0         0         0         -         0
      1       3         3         0         0         0         -         0
      2       5         5         0         0         0         -         0
      3       8         8         0         0         0         -         0
      4      10        10         0         0         0         -         0
      5      13        13         0         0         0         -         0
      6      15        15         0         0         0         -         0
    error during benchmark: -5
    error 5: (5) Input/output error

An object_size of 32M works fine and the cluster seems otherwise fine.

Seems related to this issue http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-March/028288.html
But I didn't see a resolution for that.

Is there a timeout that is kicking in?

-- Tom Deneau
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
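[Editor's note] For completeness, the limit can be raised if large single objects are really needed; a sketch (the 256 MB value is only an example, and very large single OSD writes have their own costs):

    # runtime, all OSDs:
    ceph tell osd.* injectargs '--osd-max-write-size 256'

    # or persistently in ceph.conf, [osd] section:
    #   osd max write size = 256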
[ceph-users] rados bench seq read with single thread
Say I have a single node cluster with 5 disks.  And using dd iflag=direct on that node, I can see disk read bandwidth at 160 MB/s.

I populate a pool with 4M objects.  And then on that same single node, I run

   $ drop-caches using /proc/sys/vm/drop_caches
   $ rados -p mypool bench nn seq -t 1

What bandwidth should I expect for the rados bench seq command here?
(I am seeing approximately 70 MB/s with -t 1)

-- Tom Deneau
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
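[Editor's note] One way to sanity-check the single-thread number: with -t 1 there is only ever one 4M read in flight, so the bandwidth is simply 4 MB divided by the average per-op latency (request round trip plus the OSD's read), which rados bench reports as avg lat.  The arithmetic for the numbers above, just as a sketch:

    # 70 MB/s at 4 MB per object implies roughly 57 ms per object:
    echo "4 / 70 * 1000" | bc -l       # -> ~57 ms average latency
    # a single thread would only reach the raw 160 MB/s of the disk if
    # each 4M object completed in ~25 ms end to end.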
[ceph-users] clients and monitors
A couple of client-monitor questions: 1) When a client contacts a monitor to get the cluster map, how does it decide which monitor to try to contact? 2) Having gotten the cluster map, assuming a client wants to do multiple reads and writes, does the client have to re-contact the monitor to get the latest cluster map for each operation? -- Tom Deneau, AMD ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] who is using radosgw with civetweb?
Robert -- We are still having trouble with this. Can you share your [client.radosgw.gateway] section of ceph.conf and were there any other special things to be aware of? -- Tom -Original Message- From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Robert LeBlanc Sent: Thursday, February 26, 2015 12:27 PM To: Sage Weil Cc: Ceph-User; ceph-devel Subject: Re: [ceph-users] who is using radosgw with civetweb? Thanks, we were able to get it up and running very quickly. If it performs well, I don't see any reason to use Apache+fast_cgi. I don't have any problems just focusing on civetweb. On Wed, Feb 25, 2015 at 2:49 PM, Sage Weil sw...@redhat.com wrote: On Wed, 25 Feb 2015, Robert LeBlanc wrote: We tried to get radosgw working with Apache + mod_fastcgi, but due to the changes in radosgw, Apache, mode_*cgi, etc and the documentation lagging and not having a lot of time to devote to it, we abandoned it. Where it the documentation for civetweb? If it is appliance like and easy to set-up, we would like to try it to offer some feedback on your question. In giant and hammer, it is enabled by default on port 7480. On firefly, you need to add the line rgw frontends = fastcgi, civetweb port=7480 to ceph.conf (you can of course adjust the port number if you like) and radosgw will run standalone w/ no apache or anything else. sage Thanks, Robert LeBlanc On Wed, Feb 25, 2015 at 12:31 PM, Sage Weil sw...@redhat.com wrote: Hey, We are considering switching to civetweb (the embedded/standalone rgw web server) as the primary supported RGW frontend instead of the current apache + mod-fastcgi or mod-proxy-fcgi approach. Supported here means both the primary platform the upstream development focuses on and what the downstream Red Hat product will officially support. How many people are using RGW standalone using the embedded civetweb server instead of apache? In production? At what scale? What version(s) (civetweb first appeared in firefly and we've backported most fixes). Have you seen any problems? Any other feedback? The hope is to (vastly) simplify deployment. Thanks! sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
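[Editor's note] Not Robert's actual config, but for reference this is roughly the minimal shape such a section takes when the embedded civetweb frontend is used; hostnames, keyring path, log file and port are placeholders, and on firefly the explicit rgw frontends line is required, as Sage notes in the quoted message above:

    [client.radosgw.gateway]
        host = gatewayhost
        keyring = /etc/ceph/ceph.client.radosgw.keyring
        rgw frontends = fastcgi, civetweb port=7480
        log file = /var/log/ceph/radosgw.log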
[ceph-users] mixed ceph versions
I need to set up a cluster where the rados client (for running rados bench) may be on a different architecture and hence running a different ceph version from the osd/mon nodes. Is there a list of which ceph versions work together for a situation like this? -- Tom ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] erasure coded pool
Is it possible to run an erasure coded pool using default k=2, m=2 profile on a single node? (this is just for functionality testing). The single node has 3 OSDs. Replicated pools run fine. ceph.conf does contain: osd crush chooseleaf type = 0 -- Tom Deneau ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
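[Editor's note] Two things usually bite here, so a sketch covering both (the profile name is arbitrary): the default profile's failure domain is 'host', which a single node can never satisfy, and placement needs at least k+m OSDs, so the default k=2,m=2 cannot go active+clean with only 3 OSDs even when the failure domain is dropped to 'osd' -- k=2,m=1 does fit 3 OSDs:

    ceph osd erasure-code-profile set onehost k=2 m=1 ruleset-failure-domain=osd
    ceph osd pool create ectest 64 64 erasure onehost
    ceph pg dump pgs_brief | head        # PGs should map to 3 distinct OSDs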