[ceph-users] Purpose of the s3gw.fcgi script?
From my observation, the s3gw.fcgi script seems to be completely superfluous to the operation of Ceph. With or without the script, Swift requests execute correctly as long as a radosgw daemon is running. Is there something I'm missing here?
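For context, the s3gw.fcgi wrapper shown in the Ceph documentation of this era is a one-line shell script that just execs radosgw; the paths and keyring name below follow the docs' defaults and may differ in your setup:

    #!/bin/sh
    # Hand the FastCGI request straight to radosgw; Apache's mod_fastcgi
    # only runs this when it spawns the gateway process itself.
    exec /usr/bin/radosgw -c /etc/ceph/ceph.conf -n client.radosgw.gateway

If Apache is instead configured with FastCgiExternalServer pointing at an already-running radosgw socket, the script is never executed at all, which would explain the observation above.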
[ceph-users] Auth URL not found when using object gateway
Hi,

I'm having trouble setting up an object gateway on an existing cluster. The cluster I'm trying to add the gateway to is running on an Ubuntu 12.04 (Precise) virtual machine. The cluster is up and running, with a monitor, two OSDs, and a metadata server. It reports HEALTH_OK and active+clean, so I am somewhat assured that it is running correctly.

I've:
- set up an apache2 webserver with the fastcgi mod installed
- created an rgw.conf file
- added an s3gw.fcgi script
- enabled the rgw.conf site and disabled the default
- created a keyring and gateway user with appropriate caps
- restarted ceph, apache2, and the radosgw daemon
- created a user and subuser
- tested both S3 and Swift calls

Unfortunately, both S3 and Swift fail to authorize. An attempt to create a new bucket with S3 using a python script returns:

    Traceback (most recent call last):
      File "s3test.py", line 13, in <module>
        bucket = conn.create_bucket('my-new-bucket')
      File "/usr/lib/python2.7/dist-packages/boto/s3/connection.py", line 422, in create_bucket
        response.status, response.reason, body)
    boto.exception.S3ResponseError: S3ResponseError: 404 Not Found
    None

And an attempt to post a container using the python-swiftclient from the command line with:

    swift --debug --info -A http://localhost/auth/1.0 -U gatewayuser:swift -K key post new_container

returns:

    INFO:urllib3.connectionpool:Starting new HTTP connection (1): localhost
    DEBUG:urllib3.connectionpool:"GET /auth/1.0 HTTP/1.1" 404 180
    INFO:swiftclient:REQ: curl -i http://localhost/auth/1.0 -X GET
    INFO:swiftclient:RESP STATUS: 404 Not Found
    INFO:swiftclient:RESP HEADERS: [('content-length', '180'), ('content-encoding', 'gzip'), ('date', 'Tue, 24 Mar 2015 23:19:50 GMT'), ('content-type', 'text/html; charset=iso-8859-1'), ('vary', 'Accept-Encoding'), ('server', 'Apache/2.2.22 (Ubuntu)')]
    INFO:swiftclient:RESP BODY: [gzip-compressed Apache error page, binary omitted]
    ERROR:swiftclient:Auth GET failed: http://localhost/auth/1.0 404 Not Found
    Traceback (most recent call last):
      File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 1181, in _retry
        self.url, self.token = self.get_auth()
      File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 1155, in get_auth
        insecure=self.insecure)
      File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 318, in get_auth
        insecure=insecure)
      File "/usr/lib/python2.7/dist-packages/swiftclient/client.py", line 241, in get_auth_1_0
        http_reason=resp.reason)
    ClientException: Auth GET failed: http://localhost/auth/1.0 404 Not Found

The client then retries once and fails identically, and finally prints:

    Auth GET failed: http://localhost/auth/1.0 404 Not Found

I'm not at all sure why it doesn't work when I've followed the documentation for setting it up. Please find attached the config files: rgw.conf, ceph.conf, and apache2.conf.
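A 404 served by Apache itself on /auth/1.0, with an Apache-branded gzip error body as above, usually means the rewrite rule never handed the request to radosgw. For comparison, a minimal rgw.conf in the shape the documentation of that era suggests looks roughly like the sketch below; the ServerName and socket path are placeholders and must match the rgw socket path in ceph.conf:

    FastCgiExternalServer /var/www/s3gw.fcgi -socket /var/run/ceph/ceph.radosgw.gateway.fastcgi.sock

    <VirtualHost *:80>
        ServerName gateway.example.com
        DocumentRoot /var/www

        RewriteEngine On
        # Forward every request (including /auth/1.0) to the FastCGI script,
        # preserving the Authorization header for radosgw:
        RewriteRule ^/(.*) /s3gw.fcgi?%{QUERY_STRING} [E=HTTP_AUTHORIZATION:%{HTTP:Authorization},L]

        <Directory /var/www>
            Options +ExecCGI
            AllowOverride All
            SetHandler fastcgi-script
            Order allow,deny
            Allow from all
            AuthBasicAuthoritative Off
        </Directory>

        AllowEncodedSlashes On
        ServerSignature Off
    </VirtualHost>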
Re: [ceph-users] Monitor failure after series of traumatic network failures
This was excellent advice. It should be on some official Ceph troubleshooting page. It takes a while for the monitors to deal with new info, but it works. Thanks again!

--Greg

On Wed, Mar 18, 2015 at 5:24 PM, Sage Weil s...@newdream.net wrote:

On Wed, 18 Mar 2015, Greg Chavez wrote:

We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network availability several times since this past Thursday and whose nodes were all rebooted twice (hastily and inadvisably each time). The final reboot, which was supposed to be the last thing before recovery according to our data center team, resulted in a failure of the cluster's 4 monitors. This happened yesterday afternoon.

[By the way, we use Ceph to back Cinder and Glance in our OpenStack Cloud, block storage only; also, these network problems were the result of our data center team executing maintenance on our switches that was supposed to be quick and painless.]

After working all day on various troubleshooting techniques found here and there, we have this situation on our monitor nodes (debug 20):

node-10: dead. ceph-mon will not start.
node-14: Seemed to rebuild its monmap. The log has stopped reporting with this final tail -100: http://pastebin.com/tLiq2ewV
node-16: Same as 14, similar outcome in the log: http://pastebin.com/W87eT7Mw
node-15: ceph-mon starts, but even at debug 20 it will only output this line, over and over again:

    2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0) AdminSocket: request 'mon_status' not defined

node-02: I added this guy to replace node-10. I updated ceph.conf and pushed it to all the monitor nodes (the osd nodes without monitors did not get the config push). Since he's a new guy the log output is obviously different, but again, here are the last 50 lines: http://pastebin.com/pfixdD3d

I run my ceph client from my OpenStack controller. All ceph -s shows me is faults, albeit only to node-15:

    2015-03-18 16:47:27.145194 7ff762cff700 0 -- 192.168.241.100:0/15112 >> 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault

Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S

So that's where we stand. Did we kill our Ceph Cluster (and thus our OpenStack Cloud)?

Unlikely! You have 5 copies, and I doubt all of them are unrecoverable.

Or is there hope? Any suggestions would be greatly appreciated.

Stop all mons. Make a backup copy of each mon data dir. Copy the node-14 data dir over the node-15 and/or node-10 and/or node-02. Start all mons, see if they form a quorum.

Once things are working again, at the *very* least upgrade to dumpling, and preferably then upgrade to firefly!! Cuttlefish was EOL more than a year ago, and dumpling is EOL in a couple months.

sage
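As a sketch, Sage's procedure translates to roughly the following shell steps; the paths assume the default /var/lib/ceph/mon layout and sysvinit, the mon IDs are this thread's, and all of it should be adjusted to the actual deployment:

    # 1. Stop the mon on every monitor node:
    service ceph stop mon

    # 2. Back up each mon data dir before touching anything:
    cp -a /var/lib/ceph/mon/ceph-node-14 /root/mon-ceph-node-14.bak

    # 3. Copy the healthiest store (node-14 here) over the broken ones:
    rsync -a --delete /var/lib/ceph/mon/ceph-node-14/ node-15:/var/lib/ceph/mon/ceph-node-15/

    # 4. Start the mons again and watch for quorum:
    service ceph start mon
    ceph -s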
[ceph-users] Monitor failure after series of traumatic network failures
We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network availability several times since this past Thursday and whose nodes were all rebooted twice (hastily and inadvisably each time). The final reboot, which was supposed to be the last thing before recovery according to our data center team, resulted in a failure of the cluster's 4 monitors. This happened yesterday afternoon.

[By the way, we use Ceph to back Cinder and Glance in our OpenStack Cloud, block storage only; also, these network problems were the result of our data center team executing maintenance on our switches that was supposed to be quick and painless.]

After working all day on various troubleshooting techniques found here and there, we have this situation on our monitor nodes (debug 20):

node-10: dead. ceph-mon will not start.
node-14: Seemed to rebuild its monmap. The log has stopped reporting with this final tail -100: http://pastebin.com/tLiq2ewV
node-16: Same as 14, similar outcome in the log: http://pastebin.com/W87eT7Mw
node-15: ceph-mon starts, but even at debug 20 it will only output this line, over and over again:

    2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0) AdminSocket: request 'mon_status' not defined

node-02: I added this guy to replace node-10. I updated ceph.conf and pushed it to all the monitor nodes (the osd nodes without monitors did not get the config push). Since he's a new guy the log output is obviously different, but again, here are the last 50 lines: http://pastebin.com/pfixdD3d

I run my ceph client from my OpenStack controller. All ceph -s shows me is faults, albeit only to node-15:

    2015-03-18 16:47:27.145194 7ff762cff700 0 -- 192.168.241.100:0/15112 >> 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault

Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S

So that's where we stand. Did we kill our Ceph Cluster (and thus our OpenStack Cloud)? Or is there hope? Any suggestions would be greatly appreciated.

--Greg Chavez
Re: [ceph-users] OSD turned itself off
    .../2007323, failed lossy con, dropping message 0x12989400
    -855 2015-01-10 22:01:36.589036 7f6d5b954700 0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(727627 rbd_data.1cc69413d1b58ba.0055 [stat,write 2289664~4096] ondisk = 0) v4 remote, 10.168.7.54:0/1007323, failed lossy con, dropping message 0x24f68800
    -819 2015-01-12 05:25:06.229753 7f6d3646c700 0 -- 10.168.7.23:6819/10217 >> 10.168.7.53:0/2019809 pipe(0x1f0e9680 sd=460 :6819 s=0 pgs=0 cs=0 l=1 c=0x13090420).accept replacing existing (lossy) channel (new one lossy=1)
    -818 2015-01-12 05:25:06.581703 7f6d37534700 0 -- 10.168.7.23:6819/10217 >> 10.168.7.53:0/1025252 pipe(0x1b67a780 sd=71 :6819 s=0 pgs=0 cs=0 l=1 c=0x16311e40).accept replacing existing (lossy) channel (new one lossy=1)
    -817 2015-01-12 05:25:21.342998 7f6d41167700 0 -- 10.168.7.23:6819/10217 >> 10.168.7.53:0/1025579 pipe(0x114e8000 sd=502 :6819 s=0 pgs=0 cs=0 l=1 c=0x16310160).accept replacing existing (lossy) channel (new one lossy=1)
    -808 2015-01-12 16:01:35.783534 7f6d5b954700 0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(752034 rbd_data.1cc69413d1b58ba.0055 [stat,write 2387968~8192] ondisk = 0) v4 remote, 10.168.7.54:0/1007323, failed lossy con, dropping message 0x1fde9a00
    -515 2015-01-25 18:44:23.303855 7f6d5b954700 0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(46402240 rbd_data.4b8e9b3d1b58ba.0471 [read 1310720~4096] ondisk = 0) v4 remote, 10.168.7.51:0/1017204, failed lossy con, dropping message 0x250bce00
    -303 2015-02-02 22:30:03.140599 7f6d5c155700 0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(17710313 rbd_data.1cc69562eb141f2.03ce [stat,write 4145152~4096] ondisk = 0) v4 remote, 10.168.7.54:0/2007323, failed lossy con, dropping message 0x1c5d4200
    -236 2015-02-05 15:29:04.945660 7f6d3d357700 0 -- 10.168.7.23:6819/10217 >> 10.168.7.51:0/1026961 pipe(0x1c63e780 sd=203 :6819 s=0 pgs=0 cs=0 l=1 c=0x11dc8dc0).accept replacing existing (lossy) channel (new one lossy=1)
    -66 2015-02-10 20:20:36.673969 7f6d5b954700 0 -- 10.168.7.23:6819/10217 submit_message osd_op_reply(11088 rbd_data.10b8c82eb141f2.4459 [stat,write 749568~8192] ondisk = 0) v4 remote, 10.168.7.55:0/1005630, failed lossy con, dropping message 0x138db200

Could this have led to the data being erroneous, or is the -5 return code just a sign of a broken hard drive?

These are the OSDs creating new connections to each other because the previous ones failed. That's not necessarily a problem (although here it's probably a symptom of some kind of issue, given the frequency) and cannot introduce data corruption of any kind.

I'm not seeing any -5 return codes as part of that messenger debug output, so unless you were referring to your EIO from last June I'm not sure what that's about? (If you do mean EIOs, yes, they're still a sign of a broken hard drive or local FS.)

Cheers,
Josef

On 14 Jun 2014, at 02:38, Josef Johansson jo...@oderland.se wrote:

Thanks for the quick response.
Cheers,
Josef

Gregory Farnum skrev 2014-06-14 02:36:

On Fri, Jun 13, 2014 at 5:25 PM, Josef Johansson jo...@oderland.se wrote:

Hi Greg,

Thanks for the clarification. I believe the OSD was in the middle of a deep scrub (sorry for not mentioning this straight away), so it could have been a silent error that the scrub then uncovered?

Yeah.

What's best practice when the store is corrupted like this?

Remove the OSD from the cluster, and either reformat the disk or replace as you judge appropriate.
-Greg

Cheers,
Josef

Gregory Farnum skrev 2014-06-14 02:21:

The OSD did a read off of the local filesystem and it got back the EIO error code. That means the store got corrupted or something, so it killed itself to avoid spreading bad data to the rest of the cluster.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Fri, Jun 13, 2014 at 5:16 PM, Josef Johansson jo...@oderland.se wrote:

Hey,

Just examining what happened to an OSD that was just turned off. Data has been moved away from it, so I'm hesitant to turn it back on. Got the below in the logs; any clues to what the assert talks about?

Cheers,
Josef

    -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const hobject_t&, uint64_t, size_t, ceph::bufferlist&, bool)' thread 7fdacb88c700 time 2014-06-11 21:13:54.036982
    os/FileStore.cc: 2992: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
    ceph version 0.67.7 (d7ab4244396b57aac8b7e80812115bbd079e6b73)
    1: (FileStore::read(coll_t, hobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, bool)+0x653) [0x8ab6c3]
    2: (ReplicatedPG::do_osd_ops(ReplicatedPG::OpContext*, std::vector<OSDOp, std::allocator<OSDOp> >&)+0x350) [0x708230]
    3: (ReplicatedPG::prepare_transaction(ReplicatedPG::OpContext*)+0x86) [0x713366]
    4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x3095
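For reference, "remove the OSD from the cluster" translates to roughly the following sequence; osd.12 is a placeholder id, and the rebalance should finish before the final removal:

    # Mark the OSD out so data migrates off it:
    ceph osd out 12
    # Once the cluster is active+clean again, remove it completely:
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12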
Re: [ceph-users] Poor performance on all SSD cluster
How does RBD cache work? I wasn't able to find an adequate explanation in the docs.

On Sunday, June 22, 2014, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

Good point, I had neglected to do that. So, amending my ceph.conf [1]:

    [client]
    rbd cache = true
    rbd cache size = 2147483648
    rbd cache max dirty = 1073741824
    rbd cache max dirty age = 100

and also the VM's xml definition to set cache to writeback:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='writeback' io='native'/>
      <auth username='admin'>
        <secret type='ceph' uuid='cd2d3ab1-2d31-41e0-ab08-3d0c6e2fafa0'/>
      </auth>
      <source protocol='rbd' name='rbd/vol1'>
        <host name='192.168.1.64' port='6789'/>
      </source>
      <target dev='vdb' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>

Retesting from inside the VM:

    $ dd if=/dev/zero of=/mnt/vol1/scratch/file bs=16k count=65535 oflag=direct
    65535+0 records in
    65535+0 records out
    1073725440 bytes (1.1 GB) copied, 8.1686 s, 131 MB/s

Which is much better, so certainly for the librbd case enabling the rbd cache seems to nail this particular issue.

Regards
Mark

[1] possibly somewhat aggressively set, but at least a noticeable difference :-)

On 22/06/14 19:02, Haomai Wang wrote:

Hi Mark,
Do you enable rbdcache? I tested on my ssd cluster (only one ssd); it seemed ok.

    dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
    82.3MB/s
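As a hypothetical way to confirm librbd actually picked those settings up, the client admin socket can be queried; this assumes an "admin socket = /var/run/ceph/$name.$pid.asok" entry in the [client] section, and the socket path below is a placeholder:

    # Dump the running client's config and check the rbd cache knobs:
    ceph --admin-daemon /var/run/ceph/client.admin.12345.asok config show | grep rbd_cache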
Re: [ceph-users] Poor performance on all SSD cluster
On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson mark.nel...@inktank.com wrote:

RBD Cache is definitely going to help in this use case. This test is basically just sequentially writing a single 16k chunk of data out, one at a time. IE, entirely latency bound. At least on OSDs backed by XFS, you have to wait for that data to hit the journals of every OSD associated with the object before the acknowledgement gets sent back to the client.

Again, I can reproduce this with replication disabled.

If you are using the default 4MB block size, you'll hit the same OSDs over and over again and your other OSDs will sit there twiddling their thumbs waiting for IO until you hit the next block, but then it will just be a different set of OSDs getting hit. You should be able to verify this by using iostat or collectl or something to look at the behaviour of the SSDs during the test. Since this is all sequential though, switching to buffered IO (ie coalesce IOs at the buffercache layer) or using RBD cache for direct IO (coalesce IOs below the block device) will dramatically improve things.

This makes sense. Given the following scenario (the sketch after this message makes the arithmetic explicit):

- No replication
- osd_op time average is .015 seconds (stddev ~.003 seconds)
- Network latency is approximately .000237 seconds on avg

I should be getting 60 IOPS from the OSD reporting this time, right? So 60 * 16 kB = 960 kB/s. That's slightly slower than we're actually getting, because I'm only able to sample the slowest ops; we're getting closer to 100 IOPS. But that does make sense, I suppose. So the only way to improve performance would be to not use O_DIRECT (as this should bypass rbd cache as well, right?).

Ceph is pretty good at small random IO with lots of parallelism on spinning disk backed OSDs (so long as you use SSD journals or controllers with WB cache). It's much harder to get native-level IOPS rates with SSD backed OSDs though. The latency involved in distributing and processing all of that data becomes a much bigger deal. Having said that, we are actively working on improving latency as much as we can. :)

And this is true because flushing from the journal to spinning disks is going to coalesce the writes into the appropriate blocks in a meaningful way, right? Or I guess... why is this? Why doesn't that happen with SSD journals and SSD OSDs?
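A throwaway sketch of the latency-bound estimate above, with the numbers copied from the message (the post rounds 1/0.015 down to 60 IOPS, giving 60 * 16 kB = 960 kB/s):

    # One synchronous 16k write at a time: throughput = (1/latency) * block size
    awk 'BEGIN { lat = 0.015; bs = 16 * 1024; iops = 1 / lat;
                 printf "%.0f IOPS -> %.0f kB/s\n", iops, iops * bs / 1000 }'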
Re: [ceph-users] Poor performance on all SSD cluster
I'm using Crucial M500s.

On Sat, Jun 21, 2014 at 7:09 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

I can reproduce this in ceph version 0.81-423-g1fb4574 on Ubuntu 14.04. I have a two-osd cluster with data on two sata spinners (WD Blacks) and journals on two ssds (Crucial M4s). I'm getting about 3.5 MB/s (kernel and librbd) using your dd command with direct on. Leaving off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel 3.11 [2]). The ssds can do writes at about 180 MB/s each... which is something to look at another day [1].

It would be interesting to know what version of Ceph Tyler is using, as his setup seems not nearly as impacted by adding direct. Also it might be useful to know what make and model of ssd you both are using (some of 'em do not like a series of essentially sync writes). Having said that, testing my Crucial M4s shows they can do the dd command (with direct *on*) at about 180 MB/s... hmmm... so it *is* the Ceph layer, it seems.

Regards
Mark

[1] I set filestore_max_sync_interval = 100 (30G journal... ssd able to do 180 MB/s etc), however I am still seeing writes to the spinners during the 8s or so that the above dd tests take.
[2] Ubuntu 13.10 VM - I'll upgrade it to 14.04 and see if that helps at all.

On 21/06/14 09:17, Greg Poirier wrote:

Thanks Tyler. So, I'm not totally crazy. There is something weird going on. I've looked into things about as much as I can:

- We have tested with collocated journals and dedicated journal disks.
- We have bonded 10Gb nics and have verified network configuration and connectivity is sound.
- We have run dd independently on the SSDs in the cluster and they are performing fine.
- We have tested both in a VM and with the RBD kernel module and get identical performance.
- We have pool size = 3, pool min size = 2 and have tested with min size of 2 and 3 -- the performance impact is not bad.
- osd_op times are approximately 6-12ms.
- osd_sub_op times are 6-12ms.
- iostat reports service time of 6-12ms.
- Latency between the storage and rbd client is approximately .1-.2ms.
- Disabling replication entirely did not help significantly.

On Fri, Jun 20, 2014 at 2:13 PM, Tyler Wilson k...@linuxdigital.net wrote:

Greg,

Not a real fix for you, but I too run a full-ssd cluster and am able to get 112 MB/s with your command:

    [root@plesk-test ~]# dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct
    65535+0 records in
    65535+0 records out
    1073725440 bytes (1.1 GB) copied, 9.59092 s, 112 MB/s

This of course is in a VM. Here is my ceph config:

    [global]
    fsid = hidden
    mon_initial_members = node-1 node-2 node-3
    mon_host = 192.168.0.3 192.168.0.4 192.168.0.5
    auth_supported = cephx
    osd_journal_size = 2048
    filestore_xattr_use_omap = true
    osd_pool_default_size = 2
    osd_pool_default_min_size = 1
    osd_pool_default_pg_num = 1024
    public_network = 192.168.0.0/24
    osd_mkfs_type = xfs
    cluster_network = 192.168.1.0/24

On Fri, Jun 20, 2014 at 11:08 AM, Greg Poirier greg.poir...@opower.com wrote:

I recently created a 9-node Firefly cluster backed by all SSDs. We have had some pretty severe performance degradation when using O_DIRECT in our tests (as this is how MySQL will be interacting with RBD volumes, this makes the most sense for a preliminary test).

Running the following test:

    dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct
    779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s

Shows us only about 1.5 MB/s throughput and 100 IOPS from the single dd thread. Running a second dd process does show increased throughput, which is encouraging, but I am still concerned by the low throughput of a single thread w/ O_DIRECT.

Two threads:

    779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
    126271488 bytes (126 MB) copied, 99.2069 s, 1.3 MB/s

I am testing with an RBD volume mounted with the kernel module (I have also tested from within KVM, similar performance). If I allow caching, we start to see reasonable numbers from a single dd process:

    dd if=/dev/zero of=testfilasde bs=16k count=65535
    65535+0 records in
    65535+0 records out
    1073725440 bytes (1.1 GB) copied, 2.05356 s, 523 MB/s

I can get 1 GB/s from a single host with three threads. Rados bench produces similar results.

Is there something I can do to increase the performance of O_DIRECT? I expect performance degradation, but so much? If I increase the blocksize to 4M, I'm able to get significantly higher throughput:

    3833593856 bytes (3.8 GB) copied, 44.2964 s, 86.5 MB/s
Re: [ceph-users] Poor performance on all SSD cluster
We actually do have a use pattern of large batch sequential writes, and this dd is pretty similar to that use case. A round-trip write with replication takes approximately 10-15ms to complete. I've been looking at dump_historic_ops on a number of OSDs and getting mean, min, and max for sub_op and ops. If these were on the order of 1-2 seconds, I could understand this throughput... but we're talking about fairly fast SSDs and a 20Gbps network with 1ms latency for TCP round-trip between the client machine and all of the OSD hosts.

I've gone so far as disabling replication entirely (which had almost no impact) and putting journals on separate SSDs from the data disks (which are ALSO SSDs). This just doesn't make sense to me.

On Sun, Jun 22, 2014 at 6:44 AM, Mark Nelson mark.nel...@inktank.com wrote:

On 06/22/2014 02:02 AM, Haomai Wang wrote:

Hi Mark,
Do you enable rbdcache? I tested on my ssd cluster (only one ssd); it seemed ok.

    dd if=/dev/zero of=test bs=16k count=65536 oflag=direct
    82.3MB/s

RBD Cache is definitely going to help in this use case. This test is basically just sequentially writing a single 16k chunk of data out, one at a time. IE, entirely latency bound. At least on OSDs backed by XFS, you have to wait for that data to hit the journals of every OSD associated with the object before the acknowledgement gets sent back to the client. If you are using the default 4MB block size, you'll hit the same OSDs over and over again and your other OSDs will sit there twiddling their thumbs waiting for IO until you hit the next block, but then it will just be a different set of OSDs getting hit. You should be able to verify this by using iostat or collectl or something to look at the behaviour of the SSDs during the test. Since this is all sequential though, switching to buffered IO (ie coalesce IOs at the buffercache layer) or using RBD cache for direct IO (coalesce IOs below the block device) will dramatically improve things.

The real question here though, is whether or not a synchronous stream of sequential 16k writes is even remotely close to the IO patterns that would be seen in actual use for MySQL. Most likely in actual use you'll be seeing a big mix of random reads and writes, and hopefully at least some parallelism (though this depends on the number of databases, number of users, and the workload!). Ceph is pretty good at small random IO with lots of parallelism on spinning disk backed OSDs (so long as you use SSD journals or controllers with WB cache). It's much harder to get native-level IOPS rates with SSD backed OSDs though. The latency involved in distributing and processing all of that data becomes a much bigger deal. Having said that, we are actively working on improving latency as much as we can. :)

Mark

On Sun, Jun 22, 2014 at 11:50 AM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:

On 22/06/14 14:09, Mark Kirkwood wrote:

Upgrading the VM to 14.04 and retesting the case *without* direct I get:

- 164 MB/s (librbd)
- 115 MB/s (kernel 3.13)

So managing to almost get native performance out of the librbd case. I tweaked both filestore max and min sync intervals (100 and 10 respectively) just to see if I could actually avoid writing to the spinners while the test was in progress (still seeing some, but clearly fewer). However, no improvement at all *with* direct enabled.

The output of iostat on the host while the direct test is in progress is interesting:

    avg-cpu:  %user   %nice %system %iowait  %steal   %idle
              11.73    0.00    5.04    0.76    0.00   82.47

    Device: rrqm/s wrqm/s    r/s     w/s   rMB/s   wMB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
    sda       0.00   0.00   0.00   11.00    0.00    4.02   749.09     0.14  12.36    0.00   12.36   6.55   7.20
    sdb       0.00   0.00   0.00   11.00    0.00    4.02   749.09     0.14  12.36    0.00   12.36   5.82   6.40
    sdc       0.00   0.00   0.00  435.00    0.00    4.29    20.21     0.53   1.21    0.00    1.21   1.21  52.80
    sdd       0.00   0.00   0.00  435.00    0.00    4.29    20.21     0.52   1.20    0.00    1.20   1.20  52.40

(sda, sdb are the spinners; sdc, sdd the ssds.) Something is making the journal work very hard for its 4.29 MB/s!

regards
Mark

Leaving off direct I'm seeing about 140 MB/s (librbd) and 90 MB/s (kernel 3.11 [2]). The ssds can do writes at about 180 MB/s each... which is something to look at another day [1].
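For reference, the sync-interval tweak mentioned above lives in ceph.conf; a sketch with this thread's values, which are aggressive and sized to a journal that can absorb the extra buffering:

    [osd]
    # Let the journal absorb up to 100 s of writes before forcing a
    # filestore sync; the defaults are far lower (0.01 / 5 s).
    filestore min sync interval = 10
    filestore max sync interval = 100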
[ceph-users] Poor performance on all SSD cluster
I recently created a 9-node Firefly cluster backed by all SSDs. We have had some pretty severe performance degradation when using O_DIRECT in our tests (as this is how MySQL will be interacting with RBD volumes, this makes the most sense for a preliminary test).

Running the following test:

    dd if=/dev/zero of=testfilasde bs=16k count=65535 oflag=direct
    779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s

Shows us only about 1.5 MB/s throughput and 100 IOPS from the single dd thread. Running a second dd process does show increased throughput, which is encouraging, but I am still concerned by the low throughput of a single thread w/ O_DIRECT.

Two threads:

    779829248 bytes (780 MB) copied, 604.333 s, 1.3 MB/s
    126271488 bytes (126 MB) copied, 99.2069 s, 1.3 MB/s

I am testing with an RBD volume mounted with the kernel module (I have also tested from within KVM, similar performance). If I allow caching, we start to see reasonable numbers from a single dd process:

    dd if=/dev/zero of=testfilasde bs=16k count=65535
    65535+0 records in
    65535+0 records out
    1073725440 bytes (1.1 GB) copied, 2.05356 s, 523 MB/s

I can get 1 GB/s from a single host with three threads. Rados bench produces similar results.

Is there something I can do to increase the performance of O_DIRECT? I expect performance degradation, but so much? If I increase the blocksize to 4M, I'm able to get significantly higher throughput:

    3833593856 bytes (3.8 GB) copied, 44.2964 s, 86.5 MB/s

This still seems very low. I'm using the deadline scheduler in all places. With the noop scheduler, I do not see a performance improvement. Suggestions?
[ceph-users] Backfill and Recovery traffic shaping
We have a cluster in a sub-optimal configuration with data and journal colocated on OSDs (that coincidentally are spinning disks). During recovery/backfill, the entire cluster suffers degraded performance because of the IO storm that backfills cause. Client IO becomes extremely latent.

I've tried to decrease the impact that recovery/backfill has with the following:

    ceph tell osd.* injectargs '--osd-max-backfills 1'
    ceph tell osd.* injectargs '--osd-max-recovery-threads 1'
    ceph tell osd.* injectargs '--osd-recovery-op-priority 1'
    ceph tell osd.* injectargs '--osd-client-op-priority 63'
    ceph tell osd.* injectargs '--osd-recovery-max-active 1'

The only other option I have left would be to use linux traffic shaping to artificially reduce the bandwidth available to the interface tagged for cluster traffic (instead of separate physical networks, we use VLAN tagging). We are nowhere _near_ the point where network saturation would cause the latency we're seeing, so I am left to believe that it is simply disk IO saturation. I could be wrong about this assumption, though, as iostat doesn't terrify me. This could be suboptimal network configuration on the cluster as well. I'm still looking into that possibility, but I wanted to get feedback on what I'd done already first -- as well as the proposed traffic shaping idea. Thoughts?
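A sketch of the traffic-shaping idea, assuming the cluster network rides a tagged VLAN sub-interface; the interface name and rate are placeholders and untested:

    # Cap egress on the cluster-network VLAN with a token bucket filter:
    tc qdisc add dev eth0.100 root tbf rate 2gbit burst 256k latency 50ms
    # Revert once recovery finishes:
    tc qdisc del dev eth0.100 root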
Re: [ceph-users] Backfill and Recovery traffic shaping
On Saturday, April 19, 2014, Mike Dawson mike.daw...@cloudapt.com wrote:

With a workload consisting of lots of small writes, I've seen client IO starved with as little as 5Mbps of traffic per host due to spindle contention once deep-scrub and/or recovery/backfill start. Co-locating OSD journals on the same spinners as you have will double that likelihood.

Yeah. We're working on addressing the collocation issues.

Possible solutions include moving OSD journals to SSD (with a reasonable ratio), expanding the cluster, or increasing the performance of underlying storage.

We are considering an all-SSD cluster. If I'm not mistaken, at that point journal collocation isn't as much of an issue, since IOPS and seek time stop being the bottleneck.
[ceph-users] Useful visualizations / metrics
I'm in the process of building a dashboard for our Ceph nodes. I was wondering if anyone out there had instrumented their OSD / MON clusters and found particularly useful visualizations.

At first, I was trying to do ridiculous things (like graphing % used for every disk in every OSD host), but I realized quickly that that is simply too many metrics and far too visually dense to be useful. I am attempting to put together a few simpler, more dense visualizations like... overall cluster utilization, aggregate cpu and memory utilization per osd host, etc.

Just looking for some suggestions. Thanks!
Re: [ceph-users] Useful visualizations / metrics
Curious as to how you define cluster latency.

On Sat, Apr 12, 2014 at 7:21 AM, Jason Villalta ja...@rubixnet.com wrote:

Hi, I have not done anything with metrics yet, but the only ones I personally would be interested in are total capacity utilization and cluster latency. Just my 2 cents.

On Sat, Apr 12, 2014 at 10:02 AM, Greg Poirier greg.poir...@opower.com wrote:

I'm in the process of building a dashboard for our Ceph nodes. I was wondering if anyone out there had instrumented their OSD / MON clusters and found particularly useful visualizations. At first, I was trying to do ridiculous things (like graphing % used for every disk in every OSD host), but I realized quickly that that is simply too many metrics and far too visually dense to be useful. I am attempting to put together a few simpler, more dense visualizations like... overall cluster utilization, aggregate cpu and memory utilization per osd host, etc. Just looking for some suggestions. Thanks!

--
Jason Villalta
Co-founder
800.799.4407 x1230 | www.RubixTechnology.com
Re: [ceph-users] Useful visualizations / metrics
We are collecting system metrics through sysstat every minute and getting those to OpenTSDB via Sensu. We have a plethora of metrics, but I am finding it difficult to create meaningful visualizations. We have alerting for things like individual OSDs reaching capacity thresholds and memory spikes on OSD or MON hosts. I am just trying to come up with some visualizations that could become solid indicators that something is wrong with the cluster in general, or with a particular host (besides CPU or memory utilization).

This morning, I have thought of things like (a toy sketch of the stddev computation follows this message):

- Stddev of bytes used on all disks in the cluster and individual OSD hosts
- 1st and 2nd derivative of bytes used on all disks in the cluster and individual OSD hosts
- Bytes used in the entire cluster
- % usage of cluster capacity

Stddev should help us identify hotspots. Velocity and acceleration of bytes used should help us with capacity planning. Bytes used in general is just a neat thing to see, but doesn't tell us all that much. % usage of cluster capacity is another thing that's just kind of neat to see.

What would you suggest looking for in dump_historic_ops? Maybe get regular metrics on things like total transaction length? The only problem is that dump_historic_ops may not always contain relevant/recent data. It is not as easily translated into time series data as some other things.

On Sat, Apr 12, 2014 at 9:23 AM, Mark Nelson mark.nel...@inktank.com wrote:

One thing I do right now for ceph performance testing is run a copy of collectl during every test. This gives you a TON of information about CPU usage, network stats, disk stats, etc. It's pretty easy to import the output data into gnuplot. Mark Seger (the creator of collectl) also has some tools to gather aggregate statistics across multiple nodes.

Beyond collectl, you can get a ton of useful data out of the ceph admin socket. I especially like dump_historic_ops, as it sometimes is enough to avoid having to parse through debug 20 logs.

While the following tools have too much overhead to be really useful for general system monitoring, they are really useful for specific performance investigations:

1) perf with the dwarf/unwind support
2) blktrace (optionally with seekwatcher)
3) valgrind (cachegrind, callgrind, massif)

Beyond that, there are some collectd plugins for Ceph, and last time I checked DreamHost was using Graphite for a lot of visualizations. There's always ganglia too. :)

Mark

On 04/12/2014 09:41 AM, Jason Villalta wrote:

I know ceph throws some warnings if there is high write latency. But I would be most interested in the delay for IO requests, linking directly to IOPS. If IOPS start to drop because the disks are overwhelmed, then latency for requests would be increasing. This would tell me that I need to add more OSDs/nodes. I am not sure there is a specific metric in ceph for this, but it would be awesome if there was.

On Sat, Apr 12, 2014 at 10:37 AM, Greg Poirier greg.poir...@opower.com wrote:

Curious as to how you define cluster latency.

On Sat, Apr 12, 2014 at 7:21 AM, Jason Villalta ja...@rubixnet.com wrote:

Hi, I have not done anything with metrics yet, but the only ones I personally would be interested in are total capacity utilization and cluster latency. Just my 2 cents.

On Sat, Apr 12, 2014 at 10:02 AM, Greg Poirier greg.poir...@opower.com wrote:

I'm in the process of building a dashboard for our Ceph nodes. I was wondering if anyone out there had instrumented their OSD / MON clusters and found particularly useful visualizations. At first, I was trying to do ridiculous things (like graphing % used for every disk in every OSD host), but I realized quickly that that is simply too many metrics and far too visually dense to be useful. I am attempting to put together a few simpler, more dense visualizations like... overall cluster utilization, aggregate cpu and memory utilization per osd host, etc. Just looking for some suggestions. Thanks!

--
Jason Villalta
Co-founder
800.799.4407 x1230 | www.RubixTechnology.com
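A toy sketch of the stddev metric described above; it reads one bytes-used sample per OSD per line, and how those samples are collected (sysstat, df, the admin socket) is left to the pipeline:

    # Population mean and stddev of per-OSD bytes used:
    awk '{ n++; sum += $1; sumsq += $1 * $1 }
         END { mean = sum / n;
               printf "mean=%.0f stddev=%.0f\n", mean, sqrt(sumsq / n - mean * mean) }' osd_bytes_used.txt

A spike in the stddev relative to the mean is the hotspot signal; the first derivative of the mean over time gives the capacity-planning velocity.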
Re: [ceph-users] OSD full - All RBD Volumes stopped responding
So... our storage problems persisted for about 45 minutes. I gave an entire hypervisor worth of VMs time to recover (approx. 30 VMs), and none of them recovered on their own. In the end, we had to stop and start every VM (easily done, it was just alarming). Once rebooted, the VMs of course were fine.

I marked the two full OSDs as down and out. I am a little concerned that these two are full while the cluster, in general, is only at 50% capacity. It appears we may have a hot spot. I'm going to look into that later today.

Also, I'm not sure how it happened, but pgp_num is lower than pg_num. I had not noticed that until last night. Will address that as well. This probably happened when I last resized placement groups, or potentially when I set up object storage pools.

On Fri, Apr 11, 2014 at 3:49 AM, Wido den Hollander w...@42on.com wrote:

On 04/11/2014 09:23 AM, Josef Johansson wrote:

On 11/04/14 09:07, Wido den Hollander wrote:

Op 11 april 2014 om 8:50 schreef Josef Johansson jo...@oderland.se:

Hi,

On 11/04/14 07:29, Wido den Hollander wrote:

Op 11 april 2014 om 7:13 schreef Greg Poirier greg.poir...@opower.com:

One thing to note: all of our kvm VMs have to be rebooted. This is something I wasn't expecting. Tried waiting for them to recover on their own, but that's not happening. Rebooting them restores service immediately. :/ Not ideal.

A reboot isn't really required though. It could be that the VM itself is in trouble, but from a librados/librbd perspective I/O should simply continue as soon as an osdmap has been received without the full flag. It could be that you have to wait some time before the VM continues. This can take up to 15 minutes.

With other storage solutions you would have to change the timeout value for each disk, i.e. changing to 180 secs from 60 secs, for the VMs to survive storage problems. Does Ceph handle this differently somehow?

It's not that RBD does it differently. Librados simply blocks the I/O, and thus so does librbd, which then causes Qemu to block. I've seen VMs survive RBD issues for longer periods than 60 seconds. Gave them some time and they continued again. Which exact setting are you talking about? I'm talking about a Qemu/KVM VM running with a VirtIO drive.

cat /sys/block/*/device/timeout (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465) This file is non-existent for my Ceph-VirtIO drive, however, so it seems RBD handles this.

Well, I don't think it's handled by RBD; VirtIO simply doesn't have the timeout. That's probably only in the SCSI driver.

Wido

I have just para-virtualized VMs to compare with right now, and they don't have it inside the VM, but that's expected. From my understanding it should have been there if it was a HVM. Whenever the timeout was reached, an error occurred and the disk was set in read-only mode.

Cheers,
Josef

Wido

Cheers,
Josef

Wido

On Thu, Apr 10, 2014 at 10:12 PM, Greg Poirier greg.poir...@opower.com wrote:

Going to try increasing the full ratio. Disk utilization wasn't really growing at an unreasonable pace. I'm going to keep an eye on it for the next couple of hours and down/out the OSDs if necessary. We have four more machines that we're in the process of adding (which doubles the number of OSDs), but got held up by some networking nonsense. Thanks for the tips.

On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil s...@inktank.com wrote:

On Thu, 10 Apr 2014, Greg Poirier wrote:

Hi, I have about 200 VMs with a common RBD volume as their root filesystem and a number of additional filesystems on Ceph. All of them have stopped responding. One of the OSDs in my cluster is marked full. I tried stopping that OSD to force things to rebalance or at least go to degraded mode, but nothing is responding still. I'm not exactly sure what to do or how to investigate. Suggestions?

Try marking the osd out or partially out (ceph osd reweight N .9) to move some data off, and/or adjust the full ratio up (ceph pg set_full_ratio .95). Note that this becomes increasingly dangerous as OSDs get closer to full; add some disks.

sage

--
Wido den Hollander
42on B.V.
Ceph trainer and consultant
Phone: +31 (0)20 700 9902
Skype: contact42on
Re: [ceph-users] OSD full - All RBD Volumes stopped responding
So, setting pgp_num to 2048 to match pg_num had a more serious impact than I expected. The cluster is rebalancing quite substantially (8.5% of objects being rebalanced)... which makes sense... Disk utilization is evening out fairly well, which is encouraging. (A sketch of the pg_num/pgp_num check follows this message.)

We are a little stumped as to why a few OSDs being full would cause the entire cluster to stop serving IO. Is this a configuration issue that we have? We're slowly recovering:

    health HEALTH_WARN 135 pgs backfill; 187 pgs backfill_toofull; 151 pgs backfilling; 2 pgs degraded; 369 pgs stuck unclean; 29 requests are blocked > 32 sec; recovery 2563902/52390259 objects degraded (4.894%); 4 near full osd(s)
    pgmap v8363400: 5120 pgs, 3 pools, 22635 GB data, 23872 kobjects
          48889 GB used, 45022 GB / 93911 GB avail
          2563902/52390259 objects degraded (4.894%)
              4751 active+clean
                31 active+remapped+wait_backfill
                 1 active+backfill_toofull
               103 active+remapped+wait_backfill+backfill_toofull
                 1 active+degraded+wait_backfill+backfill_toofull
               150 active+remapped+backfilling
                82 active+remapped+backfill_toofull
                 1 active+degraded+remapped+backfilling
    recovery io 362 MB/s, 365 objects/s
    client io 1643 kB/s rd, 6001 kB/s wr, 911 op/s

On Fri, Apr 11, 2014 at 5:45 AM, Greg Poirier greg.poir...@opower.com wrote:

So... our storage problems persisted for about 45 minutes. I gave an entire hypervisor worth of VMs time to recover (approx. 30 VMs), and none of them recovered on their own. In the end, we had to stop and start every VM (easily done, it was just alarming). Once rebooted, the VMs of course were fine.

I marked the two full OSDs as down and out. I am a little concerned that these two are full while the cluster, in general, is only at 50% capacity. It appears we may have a hot spot. I'm going to look into that later today.

Also, I'm not sure how it happened, but pgp_num is lower than pg_num. I had not noticed that until last night. Will address that as well. This probably happened when I last resized placement groups, or potentially when I set up object storage pools.

On Fri, Apr 11, 2014 at 3:49 AM, Wido den Hollander w...@42on.com wrote:

On 04/11/2014 09:23 AM, Josef Johansson wrote:

On 11/04/14 09:07, Wido den Hollander wrote:

Op 11 april 2014 om 8:50 schreef Josef Johansson jo...@oderland.se:

Hi,

On 11/04/14 07:29, Wido den Hollander wrote:

Op 11 april 2014 om 7:13 schreef Greg Poirier greg.poir...@opower.com:

One thing to note: all of our kvm VMs have to be rebooted. This is something I wasn't expecting. Tried waiting for them to recover on their own, but that's not happening. Rebooting them restores service immediately. :/ Not ideal.

A reboot isn't really required though. It could be that the VM itself is in trouble, but from a librados/librbd perspective I/O should simply continue as soon as an osdmap has been received without the full flag. It could be that you have to wait some time before the VM continues. This can take up to 15 minutes.

With other storage solutions you would have to change the timeout value for each disk, i.e. changing to 180 secs from 60 secs, for the VMs to survive storage problems. Does Ceph handle this differently somehow?

It's not that RBD does it differently. Librados simply blocks the I/O, and thus so does librbd, which then causes Qemu to block. I've seen VMs survive RBD issues for longer periods than 60 seconds. Gave them some time and they continued again. Which exact setting are you talking about? I'm talking about a Qemu/KVM VM running with a VirtIO drive.

cat /sys/block/*/device/timeout (http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465) This file is non-existent for my Ceph-VirtIO drive, however, so it seems RBD handles this.

Well, I don't think it's handled by RBD; VirtIO simply doesn't have the timeout. That's probably only in the SCSI driver.

Wido

I have just para-virtualized VMs to compare with right now, and they don't have it inside the VM, but that's expected. From my understanding it should have been there if it was a HVM. Whenever the timeout was reached, an error occurred and the disk was set in read-only mode.

Cheers,
Josef

Wido

Cheers,
Josef

Wido

On Thu, Apr 10, 2014 at 10:12 PM, Greg Poirier greg.poir...@opower.com wrote:

Going to try increasing the full ratio. Disk utilization wasn't really growing at an unreasonable pace. I'm going to keep an eye on it for the next couple of hours and down/out the OSDs if necessary. We have four more machines that we're in the process of adding (which doubles the number of OSDs), but got held up by some networking nonsense. Thanks for the tips.

On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil s...@inktank.com wrote:

On Thu, 10 Apr 2014
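For reference, a sketch of checking and fixing the pg_num/pgp_num mismatch described above; the pool name is a placeholder, and raising pgp_num is exactly what triggered the 8.5% rebalance:

    ceph osd pool get rbd pg_num
    ceph osd pool get rbd pgp_num
    # Bring placement up to match the PG count; expect significant data movement:
    ceph osd pool set rbd pgp_num 2048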
[ceph-users] OSD full - All RBD Volumes stopped responding
Hi,

I have about 200 VMs with a common RBD volume as their root filesystem and a number of additional filesystems on Ceph. All of them have stopped responding. One of the OSDs in my cluster is marked full. I tried stopping that OSD to force things to rebalance or at least go to degraded mode, but nothing is responding still. I'm not exactly sure what to do or how to investigate. Suggestions?
Re: [ceph-users] OSD full - All RBD Volumes stopped responding
Going to try increasing the full ratio. Disk utilization wasn't really growing at an unreasonable pace. I'm going to keep an eye on it for the next couple of hours and down/out the OSDs if necessary. We have four more machines that we're in the process of adding (which doubles the number of OSDs), but got held up by some networking nonsense. Thanks for the tips.

On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil s...@inktank.com wrote:

On Thu, 10 Apr 2014, Greg Poirier wrote:

Hi, I have about 200 VMs with a common RBD volume as their root filesystem and a number of additional filesystems on Ceph. All of them have stopped responding. One of the OSDs in my cluster is marked full. I tried stopping that OSD to force things to rebalance or at least go to degraded mode, but nothing is responding still. I'm not exactly sure what to do or how to investigate. Suggestions?

Try marking the osd out or partially out (ceph osd reweight N .9) to move some data off, and/or adjust the full ratio up (ceph pg set_full_ratio .95). Note that this becomes increasingly dangerous as OSDs get closer to full; add some disks.

sage
Re: [ceph-users] OSD full - All RBD Volumes stopped responding
One thing to note: all of our kvm VMs have to be rebooted. This is something I wasn't expecting. Tried waiting for them to recover on their own, but that's not happening. Rebooting them restores service immediately. :/ Not ideal.

On Thu, Apr 10, 2014 at 10:12 PM, Greg Poirier greg.poir...@opower.com wrote:

Going to try increasing the full ratio. Disk utilization wasn't really growing at an unreasonable pace. I'm going to keep an eye on it for the next couple of hours and down/out the OSDs if necessary. We have four more machines that we're in the process of adding (which doubles the number of OSDs), but got held up by some networking nonsense. Thanks for the tips.

On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil s...@inktank.com wrote:

On Thu, 10 Apr 2014, Greg Poirier wrote:

Hi, I have about 200 VMs with a common RBD volume as their root filesystem and a number of additional filesystems on Ceph. All of them have stopped responding. One of the OSDs in my cluster is marked full. I tried stopping that OSD to force things to rebalance or at least go to degraded mode, but nothing is responding still. I'm not exactly sure what to do or how to investigate. Suggestions?

Try marking the osd out or partially out (ceph osd reweight N .9) to move some data off, and/or adjust the full ratio up (ceph pg set_full_ratio .95). Note that this becomes increasingly dangerous as OSDs get closer to full; add some disks.

sage
Re: [ceph-users] Replication lag in block storage
So, on the cluster that I _expect_ to be slow, it appears that we are waiting on journal commits. I want to make sure that I am reading this correctly:

    "received_at": "2014-03-14 12:14:22.659170",
    ...
    { "time": "2014-03-14 12:14:22.660191",
      "event": "write_thread_in_journal_buffer"},

At this point we have received the write and are attempting to write the transaction to the OSD's journal, yes? Then:

    { "time": "2014-03-14 12:14:22.900779",
      "event": "journaled_completion_queued"},

240ms later we have successfully written to the journal?

I expect this particular slowness due to colocation of journal and data on the same disk (and it's a spinning disk, not an SSD). I expect some of this could be alleviated by migrating journals to SSDs, but I am looking to rebuild in the near future -- so am willing to hobble in the meantime.

I am surprised that our all-SSD cluster is also underperforming. I am trying colocating the journal on the same disk with all SSDs at the moment and will see if the performance degradation is of the same nature.

On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum g...@inktank.com wrote:

Right. So which is the interval that's taking all the time? Probably it's waiting for the journal commit, but maybe there's something else blocking progress. If it is the journal commit, check out how busy the disk is (is it just saturated?) and what its normal performance characteristics are (is it breaking?).
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier greg.poir...@opower.com wrote:

Many of the sub ops look like this, with significant lag between received_at and commit_sent:

    { "description": "osd_op(client.6869831.0:1192491 rbd_data.67b14a2ae8944a.9105 [write 507904~3686400] 6.556a4db0 e660)",
      "received_at": "2014-03-13 20:42:05.811936",
      "age": "46.088198",
      "duration": "0.038328",
      <snip>
      { "time": "2014-03-13 20:42:05.850215",
        "event": "commit_sent"},
      { "time": "2014-03-13 20:42:05.850264",
        "event": "done"}]]},

In this case, almost 39ms between received_at and commit_sent. A particularly egregious example of 80+ms lag between received_at and commit_sent:

    { "description": "osd_op(client.6869831.0:1190526 rbd_data.67b14a2ae8944a.8fac [write 3325952~868352] 6.5255f5fd e660)",
      "received_at": "2014-03-13 20:41:40.227813",
      "age": "320.017087",
      "duration": "0.086852",
      <snip>
      { "time": "2014-03-13 20:41:40.314633",
        "event": "commit_sent"},
      { "time": "2014-03-13 20:41:40.314665",
        "event": "done"}]]},

On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum g...@inktank.com wrote:

On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier greg.poir...@opower.com wrote:

We've been seeing this issue on all of our dumpling clusters, and I'm wondering what might be the cause of it. In dump_historic_ops, the time between op_applied and sub_op_commit_rec, or the time between commit_sent and sub_op_applied, is extremely high. Some of the osd_sub_ops are as long as 100 ms. A sample dump_historic_ops is included at the bottom.

It's important to understand what each of those timestamps is reporting.

- op_applied: the point at which an OSD has applied an operation to its readable backing filesystem in-memory (which for xfs or ext4 will be after it's committed to the journal)
- sub_op_commit_rec: the point at which an OSD has gotten commits from the replica OSDs
- commit_sent: the point at which a replica OSD has sent a commit back to its primary
- sub_op_applied: the point at which a replica OSD has applied a particular operation to its backing filesystem in-memory (again, after the journal if using xfs)

Reads are never served from replicas, so a long time between commit_sent and sub_op_applied should not in itself be an issue. A lag time between op_applied and sub_op_commit_rec means that the OSD is waiting on its replicas. A long time there indicates either that the replica is processing slowly, or that there's some issue in the communications stack (all the way from the raw ethernet up to the message handling in the OSD itself).

So the first thing to look for are sub ops which have a lag time between the received_at and commit_sent timestamps. If none of those ever turn up, but unusually long waits for sub_op_commit_rec are still present, then it'll take more effort to correlate particular subops on replicas with the op on the primary they correspond to, and see where the time lag is coming into it.

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
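For reference, the dumps quoted in this thread come from the OSD admin socket; a sketch, with the default socket path and a placeholder OSD id:

    # Show the slowest recent ops on osd.0, with per-event timestamps:
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops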
Re: [ceph-users] Replication lag in block storage
We are stressing these boxes pretty spectacularly at the moment. On every box I have one OSD that is pegged for IO almost constantly. ceph-1: Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdv 0.00 0.00 104.00 160.00 748.00 1000.0013.24 1.154.369.461.05 3.70 97.60 ceph-2: Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdq 0.0025.00 109.00 218.00 844.00 1773.5016.01 1.374.209.031.78 3.01 98.40 ceph-3: Device: rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await r_await w_await svctm %util sdm 0.00 0.00 126.00 56.00 996.00 540.0016.88 1.015.588.060.00 5.43 98.80 These are all disks in my block storage pool. osdmap e26698: 102 osds: 102 up, 102 in pgmap v6752413: 4624 pgs, 3 pools, 14151 GB data, 21729 kobjects 28517 GB used, 65393 GB / 93911 GB avail 4624 active+clean client io 1915 kB/s rd, 59690 kB/s wr, 1464 op/s I don't see any smart errors, but i'm slowly working my way through all of the disks on these machines with smartctl to see if anything stands out. On Fri, Mar 14, 2014 at 9:52 AM, Gregory Farnum g...@inktank.com wrote: On Fri, Mar 14, 2014 at 9:37 AM, Greg Poirier greg.poir...@opower.com wrote: So, on the cluster that I _expect_ to be slow, it appears that we are waiting on journal commits. I want to make sure that I am reading this correctly: received_at: 2014-03-14 12:14:22.659170, { time: 2014-03-14 12:14:22.660191, event: write_thread_in_journal_buffer}, At this point we have received the write and are attempting to write the transaction to the OSD's journal, yes? Then: { time: 2014-03-14 12:14:22.900779, event: journaled_completion_queued}, 240ms later we have successfully written to the journal? Correct. That seems an awfully long time for a 16K write, although I don't know how much data I have on co-located journals. (At least, I'm assuming it's in the 16K range based on the others, although I'm just now realizing that subops aren't providing that information...I've created a ticket to include that diagnostic info in future.) -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com I expect this particular slowness due to colocation of journal and data on the same disk (and it's a spinning disk, not an SSD). I expect some of this could be alleviated by migrating journals to SSDs, but I am looking to rebuild in the near future--so am willing to hobble in the meantime. I am surprised that our all SSD cluster is also underperforming. I am trying colocating the journal on the same disk with all SSDs at the moment and will see if the performance degradation is of the same nature. On Thu, Mar 13, 2014 at 6:25 PM, Gregory Farnum g...@inktank.com wrote: Right. So which is the interval that's taking all the time? Probably it's waiting for the journal commit, but maybe there's something else blocking progress. If it is the journal commit, check out how busy the disk is (is it just saturated?) and what its normal performance characteristics are (is it breaking?). 
-Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Thu, Mar 13, 2014 at 5:48 PM, Greg Poirier greg.poir...@opower.com wrote:
Many of the sub ops look like this, with significant lag between received_at and commit_sent:
{ description: osd_op(client.6869831.0:1192491 rbd_data.67b14a2ae8944a.9105 [write 507904~3686400] 6.556a4db0 e660),
received_at: 2014-03-13 20:42:05.811936,
age: 46.088198,
duration: 0.038328,
[snip]
{ time: 2014-03-13 20:42:05.850215, event: commit_sent},
{ time: 2014-03-13 20:42:05.850264, event: done}]]},
In this case almost 39ms between received_at and commit_sent. A particularly egregious example of 80+ms lag between received_at and commit_sent:
{ description: osd_op(client.6869831.0:1190526 rbd_data.67b14a2ae8944a.8fac [write 3325952~868352] 6.5255f5fd e660),
received_at: 2014-03-13 20:41:40.227813,
age: 320.017087,
duration: 0.086852,
[snip]
{ time: 2014-03-13 20:41:40.314633, event: commit_sent},
{ time: 2014-03-13 20:41:40.314665, event: done}]]},
On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum g...@inktank.com wrote:
On Thu, Mar 13, 2014 at 3:56 PM
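The smartctl sweep mentioned above can be scripted; a rough sketch (assumes plain SATA/SAS devices visible as /dev/sd*; disks behind RAID controllers need smartctl's -d flag):

  # Print the SMART overall-health verdict for every sd* device.
  for dev in /dev/sd?; do
      echo "== $dev =="
      smartctl -H "$dev" | grep -i 'test result'
  done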
Re: [ceph-users] Replication lag in block storage
Many of the sub ops look like this, with significant lag between received_at and commit_sent:
[snip: sub op examples quoted in full above]
On Thu, Mar 13, 2014 at 4:17 PM, Gregory Farnum g...@inktank.com wrote:
On Thu, Mar 13, 2014 at 3:56 PM, Greg Poirier greg.poir...@opower.com wrote:
We've been seeing this issue on all of our dumpling clusters, and I'm wondering what might be the cause of it. In dump_historic_ops, the time between op_applied and sub_op_commit_rec, or the time between commit_sent and sub_op_applied, is extremely high. Some of the osd_sub_ops take as long as 100 ms. A sample dump_historic_ops is included at the bottom.
It's important to understand what each of those timestamps is reporting.
[snip: timestamp definitions quoted in full above]
-Greg Software Engineer #42 @ http://inktank.com | http://ceph.com
Re: [ceph-users] no user info saved after user creation / can't create buckets
And the debug log, because that last log was obviously not helpful...

2014-03-12 23:57:49.497780 7ff97e7dd700 1 == starting new request req=0x23bc650 =
2014-03-12 23:57:49.498198 7ff97e7dd700 2 req 1:0.000419::PUT /test::initializing
2014-03-12 23:57:49.498233 7ff97e7dd700 10 host=s3.amazonaws.com rgw_dns_name=us-west-1.domain
2014-03-12 23:57:49.498366 7ff97e7dd700 10 s->object=NULL s->bucket=test
2014-03-12 23:57:49.498437 7ff97e7dd700 2 req 1:0.000659:s3:PUT /test::getting op
2014-03-12 23:57:49.498448 7ff97e7dd700 2 req 1:0.000670:s3:PUT /test:create_bucket:authorizing
2014-03-12 23:57:49.498508 7ff97e7dd700 10 cache get: name=.us-west-1.users+BLAHBLAHBLAH : miss
2014-03-12 23:57:49.500852 7ff97e7dd700 10 cache put: name=.us-west-1.users+BLAHBLAHBLAH
2014-03-12 23:57:49.500865 7ff97e7dd700 10 adding .us-west-1.users+BLAHBLAHBLAH to cache LRU end
2014-03-12 23:57:49.500886 7ff97e7dd700 10 moving .us-west-1.users+BLAHBLAHBLAH to cache LRU end
2014-03-12 23:57:49.500889 7ff97e7dd700 10 cache get: name=.us-west-1.users+BLAHBLAHBLAH : type miss (requested=1, cached=6)
2014-03-12 23:57:49.500907 7ff97e7dd700 10 moving .us-west-1.users+BLAHBLAHBLAH to cache LRU end
2014-03-12 23:57:49.500910 7ff97e7dd700 10 cache get: name=.us-west-1.users+BLAHBLAHBLAH : hit
2014-03-12 23:57:49.502663 7ff97e7dd700 10 cache put: name=.us-west-1.users+BLAHBLAHBLAH
2014-03-12 23:57:49.502667 7ff97e7dd700 10 moving .us-west-1.users+BLAHBLAHBLAH to cache LRU end
2014-03-12 23:57:49.502700 7ff97e7dd700 10 cache get: name=.us-west-1.users.uid+test : miss
2014-03-12 23:57:49.505128 7ff97e7dd700 10 cache put: name=.us-west-1.users.uid+test
2014-03-12 23:57:49.505138 7ff97e7dd700 10 adding .us-west-1.users.uid+test to cache LRU end
2014-03-12 23:57:49.505157 7ff97e7dd700 10 moving .us-west-1.users.uid+test to cache LRU end
2014-03-12 23:57:49.505160 7ff97e7dd700 10 cache get: name=.us-west-1.users.uid+test : type miss (requested=1, cached=6)
2014-03-12 23:57:49.505176 7ff97e7dd700 10 moving .us-west-1.users.uid+test to cache LRU end
2014-03-12 23:57:49.505178 7ff97e7dd700 10 cache get: name=.us-west-1.users.uid+test : hit
2014-03-12 23:57:49.507401 7ff97e7dd700 10 cache put: name=.us-west-1.users.uid+test
2014-03-12 23:57:49.507406 7ff97e7dd700 10 moving .us-west-1.users.uid+test to cache LRU end
2014-03-12 23:57:49.507521 7ff97e7dd700 10 get_canon_resource(): dest=/test
2014-03-12 23:57:49.507529 7ff97e7dd700 10 auth_hdr: PUT binary/octet-stream Wed, 12 Mar 2014 23:57:51 GMT /test
2014-03-12 23:57:49.507674 7ff97e7dd700 2 req 1:0.009895:s3:PUT /test:create_bucket:reading permissions
2014-03-12 23:57:49.507682 7ff97e7dd700 2 req 1:0.009904:s3:PUT /test:create_bucket:verifying op mask
2014-03-12 23:57:49.507695 7ff97e7dd700 2 req 1:0.009917:s3:PUT /test:create_bucket:verifying op permissions
2014-03-12 23:57:49.509604 7ff97e7dd700 2 req 1:0.011826:s3:PUT /test:create_bucket:verifying op params
2014-03-12 23:57:49.509615 7ff97e7dd700 2 req 1:0.011836:s3:PUT /test:create_bucket:executing
2014-03-12 23:57:49.509694 7ff97e7dd700 10 cache get: name=.us-west-1.domain.rgw+test : miss
2014-03-12 23:57:49.512229 7ff97e7dd700 10 cache put: name=.us-west-1.domain.rgw+test
2014-03-12 23:57:49.512259 7ff97e7dd700 10 adding .us-west-1.domain.rgw+test to cache LRU end
2014-03-12 23:57:49.512333 7ff97e7dd700 10 cache get: name=.us-west-1.domain.rgw+.pools.avail : miss
2014-03-12 23:57:49.518216 7ff97e7dd700 10 cache put: name=.us-west-1.domain.rgw+.pools.avail
2014-03-12 23:57:49.518228 7ff97e7dd700 10 adding .us-west-1.domain.rgw+.pools.avail to cache LRU end
2014-03-12 23:57:49.518248 7ff97e7dd700 10 moving .us-west-1.domain.rgw+.pools.avail to cache LRU end
2014-03-12 23:57:49.518251 7ff97e7dd700 10 cache get: name=.us-west-1.domain.rgw+.pools.avail : type miss (requested=1, cached=6)
2014-03-12 23:57:49.518270 7ff97e7dd700 10 moving .us-west-1.domain.rgw+.pools.avail to cache LRU end
2014-03-12 23:57:49.518272 7ff97e7dd700 10 cache get: name=.us-west-1.domain.rgw+.pools.avail : hit
2014-03-12 23:57:49.520295 7ff97e7dd700 10 cache put: name=.us-west-1.domain.rgw+.pools.avail
2014-03-12 23:57:49.520348 7ff97e7dd700 10 moving .us-west-1.domain.rgw+.pools.avail to cache LRU end
2014-03-12 23:57:49.522672 7ff97e7dd700 2 req 1:0.024893:s3:PUT /test:create_bucket:http status=403
2014-03-12 23:57:49.523204 7ff97e7dd700 1 == req done req=0x23bc650 http_status=403 ==

On Wed, Mar 12, 2014 at 7:36 PM, Greg Poirier greg.poir...@opower.com wrote:
The saga continues... So, after fiddling with haproxy a bit, I managed to make sure that my requests were hitting the RADOS Gateway. NOW, I get a 403 from my ruby script:
2014-03-12 23:34:08.289670 7fda9bfbf700 1 == starting new request req=0x215a780 =
2014-03-12 23:34:08.305105 7fda9bfbf700 1 == req done req=0x215a780 http_status=403 ==
The aws-s3 gem forces the Host header to be set to s3.amazonaws.com -- and I am wondering if this could potentially cause
Re: [ceph-users] no user info saved after user creation / can't create buckets
Increasing the logging further, I notice the following:
2014-03-13 00:27:28.617100 7f6036ffd700 20 rgw_create_bucket returned ret=-1 bucket=test(@.rgw.buckets[us-west-1.15849318.1])
I hope that .rgw.buckets doesn't have to exist... though that @.rgw.buckets is perhaps telling of something? I did notice that .us-west-1.rgw.buckets and .us-west-1.rgw.buckets.index weren't created. I created those, restarted radosgw, and still get 403 errors.
On Wed, Mar 12, 2014 at 8:00 PM, Greg Poirier greg.poir...@opower.com wrote:
And the debug log, because that last log was obviously not helpful...
[snip: debug log quoted in full above]
Re: [ceph-users] no user info saved after user creation / can't create buckets
And, I figured out the issue. The utility I was using to create pools, zones, and regions automatically failed to do two things:
- create rgw.buckets and rgw.buckets.index for each zone
- set up placement pools for each zone
I did both of those, and now everything is working. Thanks, me, for the commitment to figuring this poo out.
On Wed, Mar 12, 2014 at 8:31 PM, Greg Poirier greg.poir...@opower.com wrote:
Increasing the logging further, I notice the following:
2014-03-13 00:27:28.617100 7f6036ffd700 20 rgw_create_bucket returned ret=-1 bucket=test(@.rgw.buckets[us-west-1.15849318.1])
[snip: remainder of quoted thread above]
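For anyone who lands here with the same 403s: a hedged sketch of creating the missing per-zone pools by hand (pool names follow the .us-west-1 zone naming in the logs above; the pg count is an arbitrary placeholder):

  # Create the zone's bucket data and index pools.
  ceph osd pool create .us-west-1.rgw.buckets 128
  ceph osd pool create .us-west-1.rgw.buckets.index 128

  # Restart the gateway so it picks up the new pools.
  service radosgw restart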
Re: [ceph-users] RBD Snapshots
Interesting. I think this may not be a bad idea. Thanks for the info.

On Monday, March 3, 2014, Jean-Tiare LE BIGOT jean-tiare.le-bi...@ovh.net wrote:
To get consistent RBD live snapshots, you may want to first freeze the guest filesystem (ext4, btrfs, xfs) with a tool like [fsfreeze]. It will basically flush the FS state to disk and block any future write access while still allowing reads.
[fsfreeze] http://manpages.courier-mta.org/htmlman8/fsfreeze.8.html

On 02/28/2014 11:27 PM, Gregory Farnum wrote:
RBD itself will behave fine whenever you take the snapshot. The thing to worry about is that it's a snapshot at the block device layer, not the filesystem layer, so if you don't quiesce IO and sync to disk, the filesystem might not be entirely happy with you, for the same reasons that it won't be happy if you pull the power plug on it.
-Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On Fri, Feb 28, 2014 at 2:12 PM, Greg Poirier greg.poir...@opower.com wrote:
According to the documentation at https://ceph.com/docs/master/rbd/rbd-snapshot/ -- snapshots require that all I/O to a block device be stopped prior to making the snapshot. Is there any plan to allow for online snapshotting so that we could do incremental snapshots of running VMs on a regular basis?

-- Jean-Tiare, shared-hosting team
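A minimal sketch of the freeze/snapshot/thaw sequence being described (mount point, pool, image, and snapshot names are all placeholders):

  # Inside the guest: flush dirty data and block new writes.
  fsfreeze --freeze /mnt/data

  # From a client with access to the pool: take the RBD snapshot.
  rbd snap create volumes/myimage@nightly

  # Inside the guest again: thaw the filesystem.
  fsfreeze --unfreeze /mnt/data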
[ceph-users] RBD Snapshots
According to the documentation at https://ceph.com/docs/master/rbd/rbd-snapshot/ -- snapshots require that all I/O to a block device be stopped prior to making the snapshot. Is there any plan to allow for online snapshotting so that we could do incremental snapshots of running VMs on a regular basis?
Re: [ceph-users] Ceph MON can no longer join quorum
Hi Karan, I resolved it the same way you did. We had a network partition that caused the MON to die, it appears. I'm running 0.72.1. It would be nice if redeploying weren't the solution, but if it's simply cleaner to do so, then I will continue along that route. I think what's more troubling is that when this occurred we lost all connectivity to the Ceph cluster.

On Wed, Feb 5, 2014 at 1:11 AM, Karan Singh ksi...@csc.fi wrote:
Hi Greg,
I have seen this problem before in my cluster.
- What Ceph version are you running?
- Did you make any change recently in the cluster that resulted in this problem?
You identified it correctly: the only problem is that ceph-mon-2003 is listening on an incorrect port; it should listen on port 6789 (like the other two monitors). The way I resolved it was to cleanly remove the affected monitor node and add it back to the cluster.
Regards, Karan
--
From: Greg Poirier greg.poir...@opower.com
To: ceph-users@lists.ceph.com
Sent: Tuesday, 4 February, 2014 10:50:21 PM
Subject: [ceph-users] Ceph MON can no longer join quorum
I have a MON that at some point lost connectivity to the rest of the cluster and now cannot rejoin. Each time I restart it, it looks like it's attempting to create a new MON and join the cluster, but the rest of the cluster rejects it, because the new one isn't in the monmap. I don't know why it suddenly decided it needed to be a new MON. I am not really sure where to start.

root@ceph-mon-2003:/var/log/ceph# ceph -s
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2 pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1 ceph-mon-2001,ceph-mon-2002
monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002

Notice ceph-mon-2003:6800. If I try to start ceph-mon-all, it will be listening on some other port...

root@ceph-mon-2003:/var/log/ceph# start ceph-mon-all
ceph-mon-all start/running
root@ceph-mon-2003:/var/log/ceph# ps -ef | grep ceph
root 6930 1 31 15:49 ? 00:00:00 /usr/bin/ceph-mon --cluster=ceph -i ceph-mon-2003 -f
root 6931 1 3 15:49 ? 00:00:00 python /usr/sbin/ceph-create-keys --cluster=ceph -i ceph-mon-2003
root@ceph-mon-2003:/var/log/ceph# ceph -s
2014-02-04 15:49:56.854866 7f9cf422d700 0 -- :/1007028 >> 10.30.66.15:6789/0 pipe(0x7f9cf0021370 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f9cf00215d0).fault
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2 pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1 ceph-mon-2001,ceph-mon-2002
monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002
Suggestions?
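For the archives, the redeploy amounts to roughly the following (a sketch only -- hostname and IP taken from the output above, keyring path is a placeholder; check the monitor removal/addition docs for your release before following this):

  # On a healthy node: drop the broken monitor from the monmap.
  ceph mon remove ceph-mon-2003

  # On ceph-mon-2003: stop it, set aside its data dir, and rebuild
  # it from the current monmap before rejoining.
  stop ceph-mon id=ceph-mon-2003
  mv /var/lib/ceph/mon/ceph-ceph-mon-2003{,.old}
  ceph mon getmap -o /tmp/monmap
  ceph-mon -i ceph-mon-2003 --mkfs --monmap /tmp/monmap --keyring /path/to/mon.keyring
  ceph mon add ceph-mon-2003 10.30.66.15:6789
  start ceph-mon id=ceph-mon-2003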
[ceph-users] Ceph MON can no longer join quorum
I have a MON that at some point lost connectivity to the rest of the cluster and now cannot rejoin. Each time I restart it, it looks like it's attempting to create a new MON and join the cluster, but the rest of the cluster rejects it, because the new one isn't in the monmap. I don't know why it suddenly decided it needed to be a new MON. I am not really sure where to start.

root@ceph-mon-2003:/var/log/ceph# ceph -s
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2 pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1 ceph-mon-2001,ceph-mon-2002
monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002

Notice ceph-mon-2003:6800. If I try to start ceph-mon-all, it will be listening on some other port...

root@ceph-mon-2003:/var/log/ceph# start ceph-mon-all
ceph-mon-all start/running
root@ceph-mon-2003:/var/log/ceph# ps -ef | grep ceph
root 6930 1 31 15:49 ? 00:00:00 /usr/bin/ceph-mon --cluster=ceph -i ceph-mon-2003 -f
root 6931 1 3 15:49 ? 00:00:00 python /usr/sbin/ceph-create-keys --cluster=ceph -i ceph-mon-2003
root@ceph-mon-2003:/var/log/ceph# ceph -s
2014-02-04 15:49:56.854866 7f9cf422d700 0 -- :/1007028 >> 10.30.66.15:6789/0 pipe(0x7f9cf0021370 sd=3 :0 s=1 pgs=0 cs=0 l=1 c=0x7f9cf00215d0).fault
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
health HEALTH_ERR 1 pgs inconsistent; 2 pgs peering; 126 pgs stale; 2 pgs stuck inactive; 126 pgs stuck stale; 2 pgs stuck unclean; 10 requests are blocked > 32 sec; 1 scrub errors; 1 mons down, quorum 0,1 ceph-mon-2001,ceph-mon-2002
monmap e2: 3 mons at {ceph-mon-2001=10.30.66.13:6789/0,ceph-mon-2002=10.30.66.14:6789/0,ceph-mon-2003=10.30.66.15:6800/0}, election epoch 12964, quorum 0,1 ceph-mon-2001,ceph-mon-2002
Suggestions?
Re: [ceph-users] RadosGW S3 API - Bucket Versions
On Fri, Jan 24, 2014 at 4:28 PM, Yehuda Sadeh yeh...@inktank.com wrote:
For each object that rgw stores it keeps a version tag. However, this version is not ascending; it's just used for identifying whether an object has changed. I'm not completely sure what problem you're trying to solve, though.

We have two datacenters. I want to have two regions that are split across both datacenters. Let's say us-west and us-east are our regions: us-east-1 would live in one datacenter and be the primary zone for the us-east region, while us-east-2 would live in the other datacenter and be the secondary zone. We then do the opposite for us-west. What I was envisioning, I think, will not work. For example:
- write object A.0 to bucket X in us-west-1 (master)
- us-west-1 (master) goes down
- write to us-west-2 (secondary) a _new_ version of object A.1 to bucket X
- us-west-1 comes back up
- read object A.1 from us-west-1
The idea being that if you are versioning objects, you are never updating them, so it doesn't matter that the copy of the object that is now in us-west-1 is read-only. I'm not even sure if this is an accurate description of how replication operates, but I thought I'd discussed a master-master scenario with someone who said this _might_ be possible... assuming you had versioned objects.
Re: [ceph-users] 1MB/s throughput to 33-ssd test cluster
On Sun, Dec 8, 2013 at 8:33 PM, Mark Kirkwood mark.kirkw...@catalyst.net.nz wrote:
I'd suggest testing the components separately -- try to rule out NIC (and switch) issues and SSD performance issues, then when you are sure the bits all go fast individually, test how Ceph performs again. What make and model of SSD? I'd check that the firmware is up to date (sometimes makes a huge difference). I'm also wondering if you might get better performance by having (say) 7 OSDs and using 4 of the SSDs as journals for them.

Thanks, Mark. In my haste, I left out part of a paragraph... probably really a whole paragraph... that contains a pretty crucial detail. I had previously run rados bench on this hardware with some success (24-26 MBps throughput w/ 4k blocks). ceph osd bench looks great. iperf on the network looks great. After my last round of testing (with a few aborted rados bench tests), I deleted the pool and recreated it (same name, crush ruleset, pg num, size, etc.). That is when I started to notice the degraded performance.
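The component-level checks mentioned here look roughly like this (a sketch; the peer address, stream count, and osd id are placeholders):

  # Raw network throughput between two cluster nodes.
  iperf -s                      # on the receiving node
  iperf -c 10.20.69.101 -P 8    # on the sender, 8 parallel streams

  # Per-OSD write benchmark (1 GB in 4 MB blocks by default),
  # exercising just the disk/journal path with no client involved.
  ceph tell osd.0 bench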
[ceph-users] 1MB/s throughput to 33-ssd test cluster
Hi. So, I have a test cluster made up of ludicrously overpowered machines with nothing but SSDs in them. Bonded 10Gbps NICs (802.3ad layer 2+3 xmit hash policy, confirmed ~19.8 Gbps throughput with 32+ threads). I'm running rados bench, and I am currently getting less than 1 MBps throughput:

sudo rados -N `hostname` bench 600 write -b 4096 -p volumes --no-cleanup -t 32 > bench_write_4096_volumes_1_32.out 2>&1

Colocated journals on the same disk, so I'm not expecting optimum throughput, but previous tests on spinning disks have shown reasonable speeds (23 MB/s, 4000-6000 iops) as opposed to the 150-450 iops I'm currently getting.

ceph_deploy@ssd-1001:~$ sudo ceph -s
cluster 4167d5f2-2b9e-4bde-a653-f24af68a45f8
health HEALTH_WARN clock skew detected on mon.ssd-1003
monmap e1: 3 mons at {ssd-1001=10.20.69.101:6789/0,ssd-1002=10.20.69.102:6789/0,ssd-1003=10.20.69.103:6789/0}, election epoch 20, quorum 0,1,2 ssd-1001,ssd-1002,ssd-1003
osdmap e344: 33 osds: 33 up, 33 in
pgmap v10600: 1650 pgs, 6 pools, 289 MB data, 74029 objects
466 GB used, 17621 GB / 18088 GB avail
1650 active+clean
client io 1263 kB/s wr, 315 op/s

ceph_deploy@ssd-1001:~$ sudo ceph osd tree
# id  weight  type name        up/down  reweight
-1    30.03   root default
-2    10.01     host ssd-1001
0     0.91        osd.0   up   1
1     0.91        osd.1   up   1
2     0.91        osd.2   up   1
3     0.91        osd.3   up   1
4     0.91        osd.4   up   1
5     0.91        osd.5   up   1
6     0.91        osd.6   up   1
7     0.91        osd.7   up   1
8     0.91        osd.8   up   1
9     0.91        osd.9   up   1
10    0.91        osd.10  up   1
-3    10.01     host ssd-1002
11    0.91        osd.11  up   1
12    0.91        osd.12  up   1
13    0.91        osd.13  up   1
14    0.91        osd.14  up   1
15    0.91        osd.15  up   1
16    0.91        osd.16  up   1
17    0.91        osd.17  up   1
18    0.91        osd.18  up   1
19    0.91        osd.19  up   1
20    0.91        osd.20  up   1
21    0.91        osd.21  up   1
-4    10.01     host ssd-1003
22    0.91        osd.22  up   1
23    0.91        osd.23  up   1
24    0.91        osd.24  up   1
25    0.91        osd.25  up   1
26    0.91        osd.26  up   1
27    0.91        osd.27  up   1
28    0.91        osd.28  up   1
29    0.91        osd.29  up   1
30    0.91        osd.30  up   1
31    0.91        osd.31  up   1
32    0.91        osd.32  up   1

The clock skew error can safely be ignored. It's something like 2-3 ms of skew; I just haven't bothered configuring away the warning. This is with a newly-created pool, after deleting the last pool used for testing. Any suggestions on where to start debugging? Thanks.
Re: [ceph-users] near full osd
Kevin, in my experience that usually indicates a bad or underperforming disk, or a too-high weight. Try running ceph osd crush reweight osd.## 1.0. If that doesn't do the trick, you may want to just out that guy. I don't think the CRUSH algorithm guarantees balancing things out in the way you're expecting.
--Greg

On Tue, Nov 5, 2013 at 11:11 AM, Kevin Weiler kevin.wei...@imc-chicago.com wrote:
Hi guys,
I have an OSD in my cluster that is near full at 90%, but we're using a little less than half the available storage in the cluster. Shouldn't this be balanced out?
--
Kevin Weiler
IT
IMC Financial Markets | 233 S. Wacker Drive, Suite 4300 | Chicago, IL 60606 | http://imc-chicago.com/
Phone: +1 312-204-7439 | Fax: +1 312-244-3301 | E-Mail: kevin.wei...@imc-chicago.com
Re: [ceph-users] near full osd
Erik, it's utterly non-intuitive, and I'd love another explanation than the one I've provided. Nevertheless, the OSDs on my slower PE2970 nodes fill up much faster than those on HP585s or Dell R820s. I've handled this by dropping weights and, in a couple of cases, outing or removing the OSD.
Kevin, generally speaking, the OSDs that fill up on me are the same ones. Once I lower the weights, they stay low, or they fill back up again within days or hours of re-raising the weight. Please try lifting them back up, though -- maybe you'll have better luck than me.
--Greg

On Tue, Nov 5, 2013 at 11:30 AM, Kevin Weiler kevin.wei...@imc-chicago.com wrote:
All of the disks in my cluster are identical and therefore all have the same weight (each drive is 2TB and the automatically generated weight is 1.82 for each one). Would the procedure here be to reduce the weight, let it rebalance, and then put the weight back to where it was?

From: Aronesty, Erik earone...@expressionanalysis.com
Date: Tuesday, November 5, 2013 10:27 AM
Subject: RE: [ceph-users] near full osd
If there's an underperforming disk, why on earth would *more* data be put on it? You'd think it would be less... I would think an *overperforming* disk should (desirably) cause that case, right?

[snip: earlier messages in this thread, quoted above]
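The reweight-and-watch cycle being described is roughly this (osd id and weights are placeholders; 1.82 is the auto-generated weight mentioned above):

  # Temporarily lower the CRUSH weight of the full OSD so data moves off it.
  ceph osd crush reweight osd.12 1.2

  # Watch the rebalance until the cluster settles back to active+clean.
  ceph -w

  # Optionally restore the original weight afterwards.
  ceph osd crush reweight osd.12 1.82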
Re: [ceph-users] PG repair failing when object missing
I was also able to reproduce this, guys, but I believe it's specific to the mode of testing rather than to anything being wrong with the OSD. In particular, after restarting the OSD whose file I removed and running repair, it repaired successfully.
The OSD has an "fd cacher" which caches open file handles, and we believe this is what causes the observed behavior: if the removed object is among the most recent n objects touched, the FileStore (an OSD subsystem) has an open fd cached, so when you manually delete the file, the FileStore now has a deleted file open. When the repair happens, it finds that open file descriptor and applies the repair to it -- which of course doesn't help put it back into place!
-Greg Software Engineer #42 @ http://inktank.com | http://ceph.com

On October 24, 2013 at 2:52:54 AM, Matt Thompson (watering...@gmail.com) wrote:
Hi Harry,
I was able to replicate this. What does appear to work (for me) is to do an osd scrub followed by a pg repair. I've tried this 2x now, and in each case the deleted file gets copied over to the OSD from which it was removed. However, I've tried a few pg scrub / pg repairs after manually deleting a file and have yet to see the file get copied back to the OSD on which it was deleted. Like you said, the pg repair sets the health of the PG back to active+clean, but then re-running the pg scrub detects the file as missing again and sets it back to active+clean+inconsistent.
Regards, Matt

On Wed, Oct 23, 2013 at 3:45 PM, Harry Harrington wrote:
Hi,
I've been taking a look at the repair functionality in Ceph. As I understand it, the OSDs should try to copy an object from another member of the PG if it is missing. I have been attempting to test this by manually removing a file from one of the OSDs; however, each time the repair completes, the file has not been restored. If I run another scrub on the PG, it gets flagged as inconsistent. See below for the output from my testing. I assume I'm missing something obvious; any insight into this process would be greatly appreciated.
Thanks, Harry

# ceph --version
ceph version 0.67.4 (ad85b8bfafea6232d64cb7ba76a8b6e8252fa0c7)
# ceph status
cluster a4e417fe-0386-46a5-4475-ca7e10294273
health HEALTH_OK
monmap e1: 1 mons at {ceph1=1.2.3.4:6789/0}, election epoch 2, quorum 0 ceph1
osdmap e13: 3 osds: 3 up, 3 in
pgmap v232: 192 pgs: 192 active+clean; 44 bytes data, 15465 MB used, 164 GB / 179 GB avail
mdsmap e1: 0/0/1 up

file removed from osd.2

# ceph pg scrub 0.b
instructing pg 0.b on osd.1 to scrub
# ceph status
cluster a4e417fe-0386-46a5-4475-ca7e10294273
health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
monmap e1: 1 mons at {ceph1=1.2.3.4:6789/0}, election epoch 2, quorum 0 ceph1
osdmap e13: 3 osds: 3 up, 3 in
pgmap v233: 192 pgs: 191 active+clean, 1 active+clean+inconsistent; 44 bytes data, 15465 MB used, 164 GB / 179 GB avail
mdsmap e1: 0/0/1 up
# ceph pg repair 0.b
instructing pg 0.b on osd.1 to repair
# ceph status
cluster a4e417fe-0386-46a5-4475-ca7e10294273
health HEALTH_OK
monmap e1: 1 mons at {ceph1=1.2.3.4:6789/0}, election epoch 2, quorum 0 ceph1
osdmap e13: 3 osds: 3 up, 3 in
pgmap v234: 192 pgs: 192 active+clean; 44 bytes data, 15465 MB used, 164 GB / 179 GB avail
mdsmap e1: 0/0/1 up
# ceph pg scrub 0.b
instructing pg 0.b on osd.1 to scrub
# ceph status
cluster a4e417fe-0386-46a5-4475-ca7e10294273
health HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
monmap e1: 1 mons at {ceph1=1.2.3.4:6789/0}, election epoch 2, quorum 0 ceph1
osdmap e13: 3 osds: 3 up, 3 in
pgmap v236: 192 pgs: 191 active+clean, 1 active+clean+inconsistent; 44 bytes data, 15465 MB used, 164 GB / 179 GB avail
mdsmap e1: 0/0/1 up

The logs from osd.1:
2013-10-23 14:12:31.188281 7f02a5161700 0 log [ERR] : 0.b osd.2 missing 3a643fcb/testfile1/head//0
2013-10-23 14:12:31.188312 7f02a5161700 0 log [ERR] : 0.b scrub 1 missing, 0 inconsistent objects
2013-10-23 14:12:31.188319 7f02a5161700 0 log [ERR] : 0.b scrub 1 errors
2013-10-23 14:13:03.197802 7f02a5161700 0 log [ERR] : 0.b osd.2 missing 3a643fcb/testfile1/head//0
2013-10-23 14:13:03.197837 7f02a5161700 0 log [ERR] : 0.b repair 1 missing, 0 inconsistent objects
2013-10-23 14:13:03.197850 7f02a5161700 0 log [ERR] : 0.b repair 1 errors, 1 fixed
2013-10-23 14:14:47.232953 7f02a5161700 0 log [ERR] : 0.b osd.2 missing 3a643fcb/testfile1/head//0
2013-10-23 14:14:47.232985 7f02a5161700 0 log [ERR] : 0.b scrub 1 missing, 0 inconsistent objects
2013-10-23 14:14:47.232991 7f02a5161700 0 log [ERR] : 0.b scrub 1 errors
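Given Greg's fd-cacher explanation above, the workaround for this test scenario is simply to bounce the OSD before repairing; a sketch using the osd and pg from the output above:

  # Restart the OSD whose file was removed, so it drops the stale
  # open file descriptor from its fd cache...
  service ceph restart osd.2

  # ...then repair can actually restore the missing object.
  ceph pg repair 0.b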
[ceph-users] Cluster stuck at 15% degraded
We have an 84-osd cluster with volumes and images pools for OpenStack. I was having trouble with full OSDs, so I increased the pg count from the default of 128 to 2700. This balanced out the OSDs, but the cluster is stuck at 15% degraded: http://hastebin.com/wixarubebe.dos
That's the output of ceph health detail. I've never seen a pg with the state active+remapped+wait_backfill+backfill_toofull. Clearly I should have increased the pg count more gradually, but here I am. I'm frozen, afraid to do anything. Any suggestions? Thanks.
--Greg Chavez
[ceph-users] librados pthread_create failure
So, in doing some testing last week, I believe I managed to exhaust the number of threads available to nova-compute. After some investigation, I found the pthread_create failure and increased nproc for our Nova user to, what I considered, a ridiculous 120,000 threads, after reading that librados requires a thread per OSD, plus a few for overhead, per VM on our compute nodes.
This made me wonder: how many threads could Ceph possibly need on one of our compute nodes? 32 cores * an overcommit ratio of 16 (assuming each VM is booted from a Ceph volume) * 300 (the approximate number of disks in our soon-to-go-live Ceph cluster) = 153,600 threads.
So this is where I started to put the truck in reverse. Am I right? What about when we triple the size of our Ceph cluster? I can easily see a future where we have 1,000 disks, if not many, many more, in our cluster. How do people scale this? Do you RAID to increase the density of your Ceph cluster? I can only imagine that this will also drastically increase the amount of resources required on my data nodes as well. So... suggestions? Reading?
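For reference, the nproc bump described above looked something like this (the file name and the 120,000 value are just what I used, not a recommendation):

  # /etc/security/limits.d/91-nova.conf
  # Raise the per-user process/thread cap for the nova user.
  nova  soft  nproc  120000
  nova  hard  nproc  120000

Verify with ulimit -u from a shell running as the nova user.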
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
Ah thanks, Brian. I will do that. I was going off the wiki instructions on performing rados benchmarks. If I have the time later, I will change it there.

On Fri, Aug 23, 2013 at 9:37 AM, Brian Andrus brian.and...@inktank.com wrote:
Hi Greg,
I haven't had any luck with the seq bench. It just errors every time.
Can you confirm you are using the --no-cleanup flag with rados write? This will ensure there is actually data to read for subsequent seq tests.
~Brian
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
On Fri, Aug 23, 2013 at 9:53 AM, Gregory Farnum g...@inktank.com wrote:
Okay. It's important to realize that because Ceph distributes data pseudorandomly, each OSD is going to end up with about the same amount of data going to it. If one of your drives is slower than the others, the fast ones can get backed up waiting on the slow one to acknowledge writes, so they end up impacting the cluster throughput a disproportionate amount. :( Anyway, I'm guessing you have 24 OSDs from your math earlier?
47MB/s * 24 / 2 = 564MB/s
41MB/s * 24 / 2 = 492MB/s

33 OSDs and 3 hosts in the cluster.

So taking out or reducing the weight on the slow ones might improve things a little. But that's still quite a ways off from what you're seeing -- there are a lot of things that could be impacting this, but there's probably something fairly obvious with that much of a gap. What is the exact benchmark you're running? What do your nodes look like?

The write benchmark I am running is fio with the following configuration:
ioengine: libaio
iodepth: 16
runtime: 180
numjobs: 16
- name: 128k-500M-write
  description: 128K block 500M write
  bs: 128K
  size: 500M
  rw: write
Sorry for the weird yaml formatting, but I'm copying it from the config file of my automation stuff (a native fio job file equivalent is sketched below). I run that on powers-of-2 VM counts, up to 32. Each VM is qemu-kvm with a 50 GB RBD-backed Cinder volume attached. They are 2-VCPU, 4 GB RAM VMs. The host machines are Dell C6220s: 16 cores, hyperthreaded, 128 GB RAM, with bonded 10 Gbps NICs (mode 4, 20 Gbps throughput -- tested and verified that's working correctly). There are 2 host machines with 16 VMs each. The Ceph cluster is made up of Dell C6220s, same NIC setup, 256 GB RAM, same CPU, 12 disks each (one for the OS, 11 for OSDs).
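For anyone wanting to reproduce this, the same job in fio's native format would look roughly like the following (an untested transcription of the yaml above):

  [global]
  ioengine=libaio
  iodepth=16
  runtime=180
  numjobs=16

  [128k-500M-write]
  ; 128K sequential writes over a 500M file
  bs=128k
  size=500m
  rw=write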
Re: [ceph-users] Unexpectedly slow write performance (RBD cinder volumes)
On Thu, Aug 22, 2013 at 2:34 PM, Gregory Farnum g...@inktank.com wrote:
You don't appear to have accounted for the 2x replication (where all writes go to two OSDs) in these calculations. I assume your pools have

Ah. Right. So I should then be looking at: # OSDs * throughput per disk / 2 / repl factor? Which makes 300-400 MB/s aggregate throughput actually sort of reasonable.

size 2 (or 3?) for these tests. 3 would explain the performance difference entirely; 2x replication leaves it still a bit low but takes the difference down to ~350/600 instead of ~350/1200. :)

Yeah. We're doing 2x repl now, and haven't yet made the decision if we're going to move to 3x repl or not.

You mentioned that your average osd bench throughput was ~50MB/s; what's the range?

41.9 - 54.7 MB/s. The actual average is 47.1 MB/s.

Have you run any rados bench tests?

Yessir. rados bench write:
2013-08-23 00:18:51.933594 min lat: 0.071682 max lat: 1.77006 avg lat: 0.196411
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
900 14 73322 73308 325.764 316 0.13978 0.196411
Total time run: 900.239317
Total writes made: 73322
Write size: 4194304
Bandwidth (MB/sec): 325.789
Stddev Bandwidth: 35.102
Max bandwidth (MB/sec): 440
Min bandwidth (MB/sec): 0
Average Latency: 0.196436
Stddev Latency: 0.121463
Max latency: 1.77006
Min latency: 0.071682

I haven't had any luck with the seq bench. It just errors every time.

What is your PG count across the cluster?

pgmap v18263: 1650 pgs: 1650 active+clean; 946 GB data, 1894 GB used, 28523 GB / 30417 GB avail; 498 MB/s wr, 124 op/s

Thanks again.
[ceph-users] Production/Non-production segmentation
Does anyone here have multiple clusters, or segment their single cluster in such a way as to try to maintain different SLAs for production vs. non-production services? We have been toying with the idea of running separate clusters (on the same hardware, but reserving a portion of the OSDs for the production cluster), but I'd rather have a single cluster in order to more evenly distribute load across all of the spindles. Thoughts or observations from people with Ceph in production would be greatly appreciated.
Greg
[ceph-users] Defective ceph startup script
I am running on Ubuntu 13.04. There is something amiss with /etc/init.d/ceph on all of my Ceph nodes. I was upgrading to 0.61.7 from what I *thought* was 0.61.5 today when I realized that "service ceph-all restart" wasn't actually doing anything. I saw nothing in /var/log/ceph.log -- it just kept printing pg statuses -- and the PIDs of the osd and mon daemons did not change. Stops failed as well. Then, when I tried to do individual osd restarts like this:

root@kvm-cs-sn-14i:/var/lib/ceph/osd# service ceph -v status osd.10
/etc/init.d/ceph: osd.10 not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

Despite the fact that I have this directory: /var/lib/ceph/osd/ceph-10/. I have the same issue with mon restarts:

root@kvm-cs-sn-14i:/var/lib/ceph/mon# ls
ceph-kvm-cs-sn-14i
root@kvm-cs-sn-14i:/var/lib/ceph/mon# service ceph -v status mon.kvm-cs-sn-14i
/etc/init.d/ceph: mon.kvm-cs-sn-14i not found (/etc/ceph/ceph.conf defines , /var/lib/ceph defines )

I'm very worried that I have all my packages at 0.61.7 while my osd and mon daemons could be running something as old as 0.61.1! Can anyone help me figure this out? Thanks.
-- \*..+.- --Greg Chavez +//..;};
Re: [ceph-users] Production/Non-production segmentation
On Wed, Jul 31, 2013 at 12:19 PM, Mike Dawson mike.daw...@cloudapt.com wrote:
Due to the speed of releases in the Ceph project, I feel having separate physical hardware is the safer way to go, especially in light of your mention of an SLA for your production services.

Ah. I guess I should offer a little more background as to what I mean by production vs. non-production: customer-facing, and not. We're using Ceph primarily for volume storage with OpenStack at the moment and operate two OpenStack clusters: one for all of our customer-facing services (which require a higher SLA) and one for all of our internal services. The idea being that all of the customer-facing stuff is segmented physically from anything our developers might be testing internally.
What I'm wondering: does anyone else here do this? If so, do you run multiple Ceph clusters? Do you let Ceph sort itself out? Can this be done with a single physical cluster but multiple logical clusters? Should it be?
I know that, mathematically speaking, the larger your Ceph cluster is, the more evenly distributed the load (thanks to CRUSH). I'm wondering if, in practice, RBD can still create hotspots (say, from a runaway service with multiple instances and volumes that is suddenly doing a ton of IO). This would increase IO latency across the Ceph cluster, I'd assume, and could impact the performance of customer-facing services. So, to some degree, physical segmentation makes sense to me. But can we simply reserve some OSDs per physical host for a production logical cluster and then use the rest for the development logical cluster (separate MON clusters for each, but all running on the same hardware)? Or, given a sufficiently large cluster, is this not even a concern?
I'm also interested in hearing about experience using CephFS, Swift, and RBD all on a single cluster, or whether people have chosen to use multiple clusters for these as well. For example, if you need faster volume storage in RBD, you might go for more spindles and smaller disks, vs. larger disks with fewer spindles for object storage, which can tolerate higher latency than volume storage.

A separate non-production cluster will allow you to test and validate new versions (including point releases within a stable series) before you attempt to upgrade your production cluster.

Oh yeah. I'm doing that for sure. Thanks, Greg
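To make the "single physical cluster, multiple logical tiers" idea concrete, one approach is separate CRUSH roots with a placement rule per tier; a hedged sketch (dumpling-era syntax; all bucket, host, rule, and pool names here are made up):

  # Separate CRUSH roots for production and development.
  ceph osd crush add-bucket prod root
  ceph osd crush add-bucket dev root
  ceph osd crush move host-prod-01 root=prod
  ceph osd crush move host-dev-01 root=dev

  # One placement rule per root, then point each pool at its rule.
  ceph osd crush rule create-simple prod-rule prod host
  ceph osd crush rule create-simple dev-rule dev host
  ceph osd pool set volumes crush_ruleset 3   # ruleset id from 'ceph osd crush rule dump'

Note that this separates data placement (and failure domains) but not the monitors, so it isn't a full answer to the SLA question on its own.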
Re: [ceph-users] Defective ceph startup script
Blast and gadzooks. This is a bug then. What's worse is that none of my three mon nodes have anything in /var/run/ceph. The directory is empty! I can't believe I've basically been running a busy Ceph cluster like this for the last month. I'll try what you suggested, thank you.

On Wed, Jul 31, 2013 at 3:48 PM, Eric Eastman eri...@aol.com wrote:
Hi Greg,
I saw about the same thing on Ubuntu 13.04 as you did. I used
apt-get -y update
apt-get -y upgrade
on all my cluster nodes to upgrade from 0.61.5 to 0.61.7, and then noticed that some of my systems did not restart all the daemons. I tried:
stop ceph-all
start ceph-all
on those nodes, but that did not kill all the old processes on the systems still running old daemons, so I ended up doing a:
ps auxww | grep ceph
on every node, and for any ceph process that was older than when I upgraded, I hand-killed all the ceph processes on that node and then did a:
start ceph-all
which seemed to fix the issue.
Eric
[snip: original report quoted above]
-- \*..+.- --Greg Chavez +//..;};
Re: [ceph-users] Defective ceph startup script
After I did what Eric Eastman suggested, my mon and osd sockets showed up in /var/run/ceph:

root@kvm-cs-sn-10i:/etc/ceph# ls /var/run/ceph/
ceph-osd.0.asok ceph-osd.1.asok ceph-osd.2.asok ceph-osd.3.asok ceph-osd.4.asok ceph-osd.5.asok ceph-osd.6.asok ceph-osd.7.asok

However, while the osd daemons came back online, the mon did not. As it happens, the cause is covered in another thread from today (Subject: Problem with MON after reboot). The solution is to upgrade and restart the other mon nodes. This worked. Now the status/stop/start commands work each and every time. Somewhere along the line this got goofed up and the osd and mon sockets either weren't created or were deleted. I started my cluster with a devel version of cuttlefish, so who knows?
Craig, that's good advice re: starting the mon daemons first, but it's no good if the sockets are missing from /var/run/ceph. I'll keep an eye on these directories moving forward to make sure they don't get lost again. Thanks, everyone, for your help. Now I hope to engage in some drama-free upgrading on my osd-only nodes. Ceph is great!

On Wed, Jul 31, 2013 at 4:31 PM, Craig Lewis cle...@centraldesktop.com wrote:
You do need to use the stop script, not service stop. If you use service stop, Upstart will restart the service. It's OK for start and restart, because that's what you want anyway, but service stop is effectively a restart.
I wouldn't recommend doing stop ceph-all and start ceph-all after an upgrade anyway, at least not with the latest 0.61 upgrades. Due to the MON issues between 61.4, 61.5, and 61.6, it seemed safer to follow the major version upgrade procedure (http://ceph.com/docs/next/install/upgrading-ceph/). So I've been restarting MONs on all nodes, then all OSDs on all nodes, then the remaining services.
That said, stop ceph-all should stop all the daemons. I just wouldn't use this upgrade procedure.
[snip: Eric's message quoted above]
-- \*..+.- --Greg Chavez +//..;};
Re: [ceph-users] Problem about capacity when mount using CephFS?
This is interesting. So there are no built-in ceph commands that can calculate your usable space? It just so happened that I was going to try and figure that out today (new OpenStack block cluster, 20TB total capacity) by skimming through the documentation. I figured that there had to be a command that would do this. Blast and gadzooks. On Tue, Jul 16, 2013 at 10:37 AM, Ta Ba Tuan tua...@vccloud.vn wrote: Thank Sage, tuantaba On 07/16/2013 09:24 PM, Sage Weil wrote: On Tue, 16 Jul 2013, Ta Ba Tuan wrote: Thanks Sage, I worried about the capacity returned when mounting CephFS, but when the disk is full, will capacity show 50% or 100% Used? 100%. sage On 07/16/2013 11:01 AM, Sage Weil wrote: On Tue, 16 Jul 2013, Ta Ba Tuan wrote: Hi everyone. I have 83 OSDs, and every OSD has the same 2TB (capacity summary is 166TB). I'm using replicate 3 for the pools ('data', 'metadata'). But when mounting the Ceph filesystem from somewhere (using: mount -t ceph Monitor_IP:/ /ceph -o name=admin,secret=xx), the capacity summary shown is 160TB? I used replicate 3, so I think it should return about 160TB/3 = 53TB?

Filesystem       Size  Used  Avail  Use%  Mounted on
192.168.32.90:/  160T  500G  156T   1%    /tmp/ceph_mount

Please explain this to me? statfs/df show the raw capacity of the cluster, not the usable capacity. How much data you can store is a (potentially) complex function of your CRUSH rules and replication layout. If you store 1TB, you'll notice the available space will go down by about 2TB (if you're using the default 2x). sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- \*..+.- --Greg Chavez +//..;}; ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Problem about capacity when mount using CephFS?
Watching. Thanks, Neil. On Tue, Jul 16, 2013 at 12:43 PM, Neil Levine neil.lev...@inktank.com wrote: This seems like a good feature to have. I've created http://tracker.ceph.com/issues/5642 N On Tue, Jul 16, 2013 at 8:05 AM, Greg Chavez greg.cha...@gmail.com wrote: This is interesting. So there are no built-in ceph commands that can calculate your usable space? It just so happened that I was going to try and figure that out today (new OpenStack block cluster, 20TB total capacity) by skimming through the documentation. I figured that there had to be a command that would do this. Blast and gadzooks. On Tue, Jul 16, 2013 at 10:37 AM, Ta Ba Tuan tua...@vccloud.vn wrote: Thank Sage, tuantaba On 07/16/2013 09:24 PM, Sage Weil wrote: On Tue, 16 Jul 2013, Ta Ba Tuan wrote: Thanks Sage, I worried about the capacity returned when mounting CephFS, but when the disk is full, will capacity show 50% or 100% Used? 100%. sage On 07/16/2013 11:01 AM, Sage Weil wrote: On Tue, 16 Jul 2013, Ta Ba Tuan wrote: Hi everyone. I have 83 OSDs, and every OSD has the same 2TB (capacity summary is 166TB). I'm using replicate 3 for the pools ('data', 'metadata'). But when mounting the Ceph filesystem from somewhere (using: mount -t ceph Monitor_IP:/ /ceph -o name=admin,secret=xx), the capacity summary shown is 160TB? I used replicate 3, so I think it should return about 160TB/3 = 53TB?

Filesystem       Size  Used  Avail  Use%  Mounted on
192.168.32.90:/  160T  500G  156T   1%    /tmp/ceph_mount

Please explain this to me? statfs/df show the raw capacity of the cluster, not the usable capacity. How much data you can store is a (potentially) complex function of your CRUSH rules and replication layout. If you store 1TB, you'll notice the available space will go down by about 2TB (if you're using the default 2x). sage ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- \*..+.- --Greg Chavez +//..;}; ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- \*..+.- --Greg Chavez +//..;}; ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
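For what it's worth, a rough sketch of the back-of-the-envelope calculation Sage describes, assuming a uniformly replicated cluster (usable space is at most raw capacity divided by the pool's replica count; CRUSH rules can reduce it further):

# replica count of a pool
ceph osd pool get data size
# df on the CephFS mount reports raw capacity; divide it by the replica
# count for a ceiling, e.g. 166TB raw / 3 replicas ~ 55TB usable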
[ceph-users] Ceph on mixed AMD/Intel architecture
I could have sworn that I read somewhere, very early on in my investigation of Ceph, that your OSDs need to run on the same processor architecture. Only it suddenly occurred to me that for the last month, I've been running a small 3-node cluster with two Intel systems and one AMD system. I thought they were all AMD! So... is this a problem? It seems to be running well. Thanks. -- \*..+.- --Greg Chavez +//..;}; ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] RBD image copying
Hello, I found some oddity when attempting to copy an rbd image in my pool (using bobtail 0.56.4), please see this. I have a built, working RBD image named p1b16:

root@nas16:~# rbd -p sp ls
p1b16

Copying the image:

root@nas16:~# rbd -p sp cp p1b16 p2b16
Image copy: 100% complete...done.

Great, seems to go fine, it went super fast (a few seconds), let's check:

root@nas16:~# rbd -p sp ls
p1b16

Uh? Let's try again:

root@nas16:~# rbd -p sp cp p1b16 p2b16
2013-05-14 09:30:42.369917 400b8000 -1 Image copy: 0% complete...failed. librbd: rbd image p2b16 already exists
rbd: copy failed: (17) File exists
2013-05-14 09:30:42.369969 400b8000 -1 librbd: header creation failed

Doh! Really?

root@nas16:~# rbd -p sp ls
p1b16

Hmmm, something hidden? Let's try to restart:

root@nas16:~# rbd -p sp rm p2b16
2013-05-14 09:30:19.445336 400c7000 -1 librbd::ImageCtx: error finding header: (2) No such file or directory
2013-05-14 09:30:19.644381 400c7000 -1 Removing image: librbd: error removing img from new-style directory: (2) No such file or directory 0% complete...failed.
rbd: delete error: (2) No such file or directory

Damned, let's look at the rados level:

root@nas16:~# rados -p sp ls | grep -v rb\\.
p1b16.rbd
rbd_directory

I downloaded the rbd_directory file and took a look inside; I see p1b16 (along with binary data) but no trace of p2b16. I must have missed something somewhere... Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD image copying
Wolfgang, you are perfectly right, the '-p' switch only applies to the source image, this is subtle! Thanks a lot. On 14/05/2013 11:52, Wolfgang Hennerbichler wrote: Hi, I believe this went into the pool named 'rbd'. If you rbd copy, it's maybe easier to do it with an explicit destination pool name: rbd cp sp/p1b16 sp/p2b16 hth wolfgang On 05/14/2013 11:47 AM, Greg wrote: Hello, I found some oddity when attempting to copy an rbd image in my pool (using bobtail 0.56.4), please see this. I have a built, working RBD image named p1b16:

root@nas16:~# rbd -p sp ls
p1b16

Copying the image:

root@nas16:~# rbd -p sp cp p1b16 p2b16
Image copy: 100% complete...done.

Great, seems to go fine, it went super fast (a few seconds), let's check:

root@nas16:~# rbd -p sp ls
p1b16

Uh? Let's try again:

root@nas16:~# rbd -p sp cp p1b16 p2b16
2013-05-14 09:30:42.369917 400b8000 -1 Image copy: 0% complete...failed. librbd: rbd image p2b16 already exists
rbd: copy failed: (17) File exists
2013-05-14 09:30:42.369969 400b8000 -1 librbd: header creation failed

Doh! Really?

root@nas16:~# rbd -p sp ls
p1b16

Hmmm, something hidden? Let's try to restart:

root@nas16:~# rbd -p sp rm p2b16
2013-05-14 09:30:19.445336 400c7000 -1 librbd::ImageCtx: error finding header: (2) No such file or directory
2013-05-14 09:30:19.644381 400c7000 -1 Removing image: librbd: error removing img from new-style directory: (2) No such file or directory 0% complete...failed.
rbd: delete error: (2) No such file or directory

Damned, let's look at the rados level:

root@nas16:~# rados -p sp ls | grep -v rb\\.
p1b16.rbd
rbd_directory

I downloaded the rbd_directory file and took a look inside; I see p1b16 (along with binary data) but no trace of p2b16. I must have missed something somewhere... Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
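A quick way to see where a copy landed, as a sketch (assuming the default 'rbd' pool exists alongside 'sp'):

root@nas16:~# rbd -p sp ls               # source pool: still only p1b16
root@nas16:~# rbd -p rbd ls              # the unqualified destination ends up here
root@nas16:~# rbd cp sp/p1b16 sp/p2b16   # pool-qualified names avoid the surprise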
Re: [ceph-users] RBD image copying
OK, now the copy is done to the right pool, but the data isn't there. I mapped both the original and the copy to try and compare:

root@client2:~# rbd showmapped
id  pool  image  snap  device
1   sp    p1b16  -     /dev/rbd1
2   sp    p2b16  -     /dev/rbd2

And try to mount:

root@client2:~# mount /dev/rbd1 /mnt/
root@client2:~# umount /mnt/
root@client2:~# mount /dev/rbd2 /mnt/
mount: you must specify the filesystem type

What strikes me is that the copy is super fast, and I'm in a pool in format 1 which, as far as I understand, is not supposed to support copy-on-write. I tried listing the pool (with the rados tool) and it shows the p2b16.rbd file is there, but no rb.0.X.Y.offset objects are present for p2b16 (the name can be found from p2b16.rbd), while there are for p1b16. Did I misunderstand the copy mechanism? Thanks! On 14/05/2013 11:52, Wolfgang Hennerbichler wrote: Hi, I believe this went into the pool named 'rbd'. If you rbd copy, it's maybe easier to do it with an explicit destination pool name: rbd cp sp/p1b16 sp/p2b16 hth wolfgang On 05/14/2013 11:47 AM, Greg wrote: Hello, I found some oddity when attempting to copy an rbd image in my pool (using bobtail 0.56.4), please see this. I have a built, working RBD image named p1b16:

root@nas16:~# rbd -p sp ls
p1b16

Copying the image:

root@nas16:~# rbd -p sp cp p1b16 p2b16
Image copy: 100% complete...done.

Great, seems to go fine, it went super fast (a few seconds), let's check:

root@nas16:~# rbd -p sp ls
p1b16

Uh? Let's try again:

root@nas16:~# rbd -p sp cp p1b16 p2b16
2013-05-14 09:30:42.369917 400b8000 -1 Image copy: 0% complete...failed. librbd: rbd image p2b16 already exists
rbd: copy failed: (17) File exists
2013-05-14 09:30:42.369969 400b8000 -1 librbd: header creation failed

Doh! Really?

root@nas16:~# rbd -p sp ls
p1b16

Hmmm, something hidden? Let's try to restart:

root@nas16:~# rbd -p sp rm p2b16
2013-05-14 09:30:19.445336 400c7000 -1 librbd::ImageCtx: error finding header: (2) No such file or directory
2013-05-14 09:30:19.644381 400c7000 -1 Removing image: librbd: error removing img from new-style directory: (2) No such file or directory 0% complete...failed.
rbd: delete error: (2) No such file or directory

Damned, let's look at the rados level:

root@nas16:~# rados -p sp ls | grep -v rb\\.
p1b16.rbd
rbd_directory

I downloaded the rbd_directory file and took a look inside; I see p1b16 (along with binary data) but no trace of p2b16. I must have missed something somewhere... Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD image copying
On 14/05/2013 13:00, Wolfgang Hennerbichler wrote: On 05/14/2013 12:16 PM, Greg wrote: ... And try to mount:

root@client2:~# mount /dev/rbd1 /mnt/
root@client2:~# umount /mnt/
root@client2:~# mount /dev/rbd2 /mnt/
mount: you must specify the filesystem type

What strikes me is that the copy is super fast, and I'm in a pool in format 1 which, as far as I understand, is not supposed to support copy-on-write. I tried listing the pool (with the rados tool) and it shows the p2b16.rbd file is there, but no rb.0.X.Y.offset objects are present for p2b16 (the name can be found from p2b16.rbd), while there are for p1b16. Did I misunderstand the copy mechanism? You sure did understand it the way it is supposed to be. Something's wrong here. What happens if you dd bs=1024 count=1 | hexdump your devices, do you see differences there? Is your cluster healthy? Wolfgang, after a copy, there is an index file (the .rbd file) but no data file. When I map the block device, I can read/write from/to it; when writing, the data files are created and I can read them back. Regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
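To check whether the copy's data objects exist at all, a small sketch at the rados level (for format 1 images; the block name prefix is reported by rbd info, though field names may vary by release, and the grep pattern below is a hypothetical placeholder to fill in):

root@nas16:~# rbd -p sp info p1b16 | grep block_name_prefix
root@nas16:~# rbd -p sp info p2b16 | grep block_name_prefix
# count the data objects behind each prefix; zero for p2b16 would confirm
# that only the header was written
root@nas16:~# rados -p sp ls | grep -c '^<prefix-from-rbd-info>'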
Re: [ceph-users] RBD vs RADOS benchmark performance
On 13/05/2013 07:38, Olivier Bonvalet wrote: On Friday, May 10, 2013 at 19:16 +0200, Greg wrote: Hello folks, I'm in the process of testing CEPH and RBD. I have set up a small cluster of hosts, each running a MON and an OSD with both journal and data on the same SSD (ok, this is stupid, but it is simple to verify the disks are not the bottleneck for 1 client). All nodes are connected on a 1Gb network (no dedicated network for OSDs, shame on me :). Summary: the RBD performance is poor compared to the benchmark. A 5-second seq read benchmark shows something like this:

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712
 Total time run:        4.165747
Total reads made:       95
Read size:              4194304
Bandwidth (MB/sec):     91.220
Average Latency:        0.678901
Max latency:            1.80038
Min latency:            0.104719

91MB/s read performance, quite good! Now the RBD performance:

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s

There is a 3x performance factor (same for write: ~60MB/s benchmark, ~20MB/s dd on the block device). The network is ok, the CPU is also ok on all OSDs. Ceph is bobtail 0.56.4, Linux is 3.8.1 arm (vanilla release + some patches for the SoC being used). Can you show me the starting point for digging into this? You should try to increase read_ahead to 512K instead of the default 128K (/sys/block/*/queue/read_ahead_kb). I have seen a huge difference on reads with that. Olivier, thanks a lot for pointing this out, it indeed makes a *huge* difference!

# dd if=/mnt/t/1 of=/dev/zero bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 5.12768 s, 81.8 MB/s

(caches dropped before each test, of course) Mark, this is probably something you will want to investigate and explain in a tweaking topic of the documentation. Regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
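For anyone wanting to reproduce the tuning, a minimal sketch (rbd1 is an example device name; the setting is per block device and does not persist across reboots unless you add a udev rule):

cat /sys/block/rbd1/queue/read_ahead_kb        # default is 128
echo 512 > /sys/block/rbd1/queue/read_ahead_kb
echo 3 > /proc/sys/vm/drop_caches              # drop caches before re-testing
dd if=/dev/rbd1 of=/dev/null bs=4M count=100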
Re: [ceph-users] RBD vs RADOS benchmark performance
On 13/05/2013 15:55, Mark Nelson wrote: On 05/13/2013 07:26 AM, Greg wrote: On 13/05/2013 07:38, Olivier Bonvalet wrote: On Friday, May 10, 2013 at 19:16 +0200, Greg wrote: Hello folks, I'm in the process of testing CEPH and RBD. I have set up a small cluster of hosts, each running a MON and an OSD with both journal and data on the same SSD (ok, this is stupid, but it is simple to verify the disks are not the bottleneck for 1 client). All nodes are connected on a 1Gb network (no dedicated network for OSDs, shame on me :). Summary: the RBD performance is poor compared to the benchmark. A 5-second seq read benchmark shows something like this:

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712
 Total time run:        4.165747
Total reads made:       95
Read size:              4194304
Bandwidth (MB/sec):     91.220
Average Latency:        0.678901
Max latency:            1.80038
Min latency:            0.104719

91MB/s read performance, quite good! Now the RBD performance:

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s

There is a 3x performance factor (same for write: ~60MB/s benchmark, ~20MB/s dd on the block device). The network is ok, the CPU is also ok on all OSDs. Ceph is bobtail 0.56.4, Linux is 3.8.1 arm (vanilla release + some patches for the SoC being used). Can you show me the starting point for digging into this? You should try to increase read_ahead to 512K instead of the default 128K (/sys/block/*/queue/read_ahead_kb). I have seen a huge difference on reads with that. Olivier, thanks a lot for pointing this out, it indeed makes a *huge* difference!

# dd if=/mnt/t/1 of=/dev/zero bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 5.12768 s, 81.8 MB/s

(caches dropped before each test, of course) Mark, this is probably something you will want to investigate and explain in a tweaking topic of the documentation. Regards, Out of curiosity, has your rados bench performance improved as well? We've also seen improvements in sequential read throughput when increasing read_ahead_kb (it may decrease random iops in some cases though!). The reason I didn't think to mention it here is that I was just focused on the difference between rados bench and rbd. It would be interesting to know if rbd has improved more dramatically than rados bench. Mark, the read ahead is set on the RBD block device (on the client), so it doesn't improve benchmark results, as the benchmark doesn't use the block layer. One question remains: why did I have poor performance with a single writing thread? Regards, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] RBD vs RADOS benchmark performance
On 11/05/2013 02:52, Mark Nelson wrote: On 05/10/2013 07:20 PM, Greg wrote: On 11/05/2013 00:56, Mark Nelson wrote: On 05/10/2013 12:16 PM, Greg wrote: Hello folks, I'm in the process of testing CEPH and RBD. I have set up a small cluster of hosts, each running a MON and an OSD with both journal and data on the same SSD (ok, this is stupid, but it is simple to verify the disks are not the bottleneck for 1 client). All nodes are connected on a 1Gb network (no dedicated network for OSDs, shame on me :). Summary: the RBD performance is poor compared to the benchmark. A 5-second seq read benchmark shows something like this:

  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
    0       0         0         0         0         0         -         0
    1      16        39        23   91.9586        92  0.966117  0.431249
    2      16        64        48   95.9602       100  0.513435   0.53849
    3      16        90        74   98.6317       104   0.25631   0.55494
    4      11        95        84   83.9735        40   1.80038   0.58712
 Total time run:        4.165747
Total reads made:       95
Read size:              4194304
Bandwidth (MB/sec):     91.220
Average Latency:        0.678901
Max latency:            1.80038
Min latency:            0.104719

91MB/s read performance, quite good! Now the RBD performance:

root@client:~# dd if=/dev/rbd1 of=/dev/null bs=4M count=100
100+0 records in
100+0 records out
419430400 bytes (419 MB) copied, 13.0568 s, 32.1 MB/s

There is a 3x performance factor (same for write: ~60MB/s benchmark, ~20MB/s dd on the block device). The network is ok, the CPU is also ok on all OSDs. Ceph is bobtail 0.56.4, Linux is 3.8.1 arm (vanilla release + some patches for the SoC being used). Can you show me the starting point for digging into this? Hi Greg, First things first, are you doing kernel rbd or qemu/kvm? If you are doing qemu/kvm, make sure you are using virtio disks. This can have a pretty big performance impact. Next, are you using RBD cache? With 0.56.4 there are some performance issues with large sequential writes if cache is on, but it does provide benefit for small sequential writes. In general RBD cache behaviour has improved with Cuttlefish. Beyond that, are the pools being targeted by RBD and rados bench set up the same way? Same number of PGs? Same replication? Mark, thanks for your prompt reply. I'm doing kernel RBD and so I have not enabled the cache (default setting?). Sorry, I forgot to mention that the pool used for bench and RBD is the same. Interesting. Does your rados bench performance change if you run a longer test? So far I've been seeing about a 20-30% performance overhead for kernel RBD, but 3x is excessive! It might be worth watching the underlying IO sizes to the OSDs in each case with something like collectl -sD -oT to see if there are any significant differences. Mark, I'll gather you some more data with collectl; meanwhile I realized a difference: the benchmark performs 16 concurrent reads while RBD only does 1. Shouldn't be a problem, but still, these are 2 different usage patterns. Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
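To compare like with like, a sketch of driving both access patterns (the pool name 'sp' is an example; older rados versions may lack the --no-cleanup flag, in which case run the seq test immediately after the write test):

rados -p sp bench 30 write --no-cleanup   # populate objects first
rados -p sp bench 30 seq -t 16            # 16 concurrent reads, as in the numbers above
rados -p sp bench 30 seq -t 1             # single-threaded, closer to what dd does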
Re: [ceph-users] problem re-adding an osd
On 06/05/2013 20:41, Glen Aidukas wrote: New post below... From: Greg [mailto:it...@itooo.com] Sent: Monday, May 06, 2013 2:31 PM To: Glen Aidukas Subject: Re: [ceph-users] problem re-adding an osd On 06/05/2013 20:05, Glen Aidukas wrote: Greg, Not sure where to use the --d switch. I tried the following:

service ceph start --d
service ceph --d start

Both do not work. I did see an error in my log though...

2013-05-06 13:03:38.432479 7f0007ef2780 -1 filestore(/srv/ceph/osd/osd.2) limited size xattrs -- filestore_xattr_use_omap enabled
2013-05-06 13:03:38.438563 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount FIEMAP ioctl is supported and appears to work
2013-05-06 13:03:38.438591 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-05-06 13:03:38.438804 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount did NOT detect btrfs
2013-05-06 13:03:38.484841 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount syncfs(2) syscall fully supported (by glibc and kernel)
2013-05-06 13:03:38.485010 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount found snaps
2013-05-06 13:03:38.488631 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-05-06 13:03:38.488936 7f0007ef2780 1 journal _open /srv/ceph/osd/osd.2/journal fd 19: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-05-06 13:03:38.489095 7f0007ef2780 1 journal _open /srv/ceph/osd/osd.2/journal fd 19: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-05-06 13:03:38.490116 7f0007ef2780 1 journal close /srv/ceph/osd/osd.2/journal
2013-05-06 13:03:38.538302 7f0007ef2780 -1 filestore(/srv/ceph/osd/osd.2) limited size xattrs -- filestore_xattr_use_omap enabled
2013-05-06 13:03:38.559813 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount FIEMAP ioctl is supported and appears to work
2013-05-06 13:03:38.559848 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2013-05-06 13:03:38.560082 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount did NOT detect btrfs
2013-05-06 13:03:38.566015 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount syncfs(2) syscall fully supported (by glibc and kernel)
2013-05-06 13:03:38.566106 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount found snaps
2013-05-06 13:03:38.569047 7f0007ef2780 0 filestore(/srv/ceph/osd/osd.2) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2013-05-06 13:03:38.569237 7f0007ef2780 1 journal _open /srv/ceph/osd/osd.2/journal fd 27: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-05-06 13:03:38.569316 7f0007ef2780 1 journal _open /srv/ceph/osd/osd.2/journal fd 27: 1048576000 bytes, block size 4096 bytes, directio = 1, aio = 0
2013-05-06 13:03:38.574317 7f0007ef2780 1 journal close /srv/ceph/osd/osd.2/journal
2013-05-06 13:03:38.574801 7f0007ef2780 -1 ** ERROR: osd init failed: (1) Operation not permitted

Glen Aidukas [Manager IT Infrastructure] From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Greg Sent: Monday, May 06, 2013 1:47 PM To: ceph-users@lists.ceph.com Subject: Re: [ceph-users] problem re-adding an osd On 06/05/2013 19:23, Glen Aidukas wrote: Hello, I think this is a newbie question but I tested everything and, yes, I FTFM as best I could.
I'm evaluating ceph, so I set up a cluster of 4 nodes. The nodes are KVM virtual machines named ceph01 to ceph04, all running Ubuntu 12.04.2 LTS, each with a single osd named osd.1 through osd.4, respective to the host they run on. Each host also has a 1TB disk for ceph to use ('/dev/vdb1'). After some work I was able to get the cluster up and running and even mounted it on a test client host (named ceph00). I ran into issues when I was testing a failure. I shut off ceph02 and watched it (via ceph -w) recover and move the data around. At this point all is fine. When I turned the host back on, it did not auto-reconnect. I expected this. I then went through many attempts to re-add it, but all failed. Here is the output from ceph osd tree:

# id    weight  type name       up/down reweight
-1      4       root default
-3      4       rack unknownrack
-2      1       host ceph01
1       1       osd.1   up      1
-4
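The 'osd init failed: (1) Operation not permitted' at the end of the log above is usually an authentication mismatch: the rebuilt OSD's key no longer matches what the cluster has registered. A sketch of re-registering it along the lines of the add-or-rm-osds doc (osd.2 and the data path are taken from this thread; the crush set argument order varies by release, so check your version's docs):

# drop the stale key, then register the key from the OSD's local keyring
ceph auth del osd.2
ceph auth add osd.2 osd 'allow *' mon 'allow rwx' -i /srv/ceph/osd/osd.2/keyring
# make sure the OSD has a CRUSH position, then start it
ceph osd crush set 2 osd.2 1.0 root=default rack=unknownrack host=ceph02
service ceph start osd.2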
Re: [ceph-users] ceph osd tell bench
On 03/05/2013 16:34, Travis Rhoden wrote: I have a question about the tell bench command. When I run this, is it behaving more or less like a dd on the drive? It appears to be, but I wanted to confirm whether or not it is bypassing all the normal Ceph stack that would be writing metadata, calculating checksums, etc. One bit of behavior I noticed a while back that I was not expecting is that this command does write to the journal. It made sense when I thought about it, but when I have an SSD journal in front of an OSD, I can't get the tell bench command to really show me accurate numbers for the raw speed of the OSD -- instead I get the write speed of the SSD. Just a small caveat there. The upside is that when you do something like tell \* bench, you are able to see if that SSD becomes a bottleneck by hosting multiple journals, so I'm not really complaining. But it does make it a bit tough to see if perhaps one OSD is performing much differently than the others. But really, I'm mainly curious if it skips any normal metadata/checksum overhead that may be there otherwise. Travis, I'm no expert but, to me, the bench doesn't bypass the ceph stack. On a test setup, I set up the journal on the same drive as the data drive; when I tell bench I can see ~160MB/s throughput on the SSD block device while the benchmark result is ~80MB/s, which leads me to think the data is written twice: once to the journal and once to the permanent storage. I see almost no reads on the block device, but the written data probably is in the page cache. Cheers, ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
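For reference, a sketch of kicking off the bench and reading the result (osd 0 is an example id; on releases of this vintage the result is reported asynchronously in the cluster log):

ceph osd tell 0 bench   # writes ~1GB through the normal journal + filestore path
ceph -w                 # the MB/s figure shows up here when the bench completes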
Re: [ceph-users] Replacement hardware
The MDS doesn't have any local state. You just need to start up the daemon somewhere with a name and key that are known to the cluster (these can be different from or the same as the ones that existed on the dead node; it doesn't matter!). -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wednesday, March 20, 2013 at 10:40 AM, Igor Laskovy wrote: Actually, I have already recovered the OSDs and MON daemon back into the cluster according to http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ and http://ceph.com/docs/master/rados/operations/add-or-rm-mons/. But the doc is missing info about removing/adding an MDS. How can I recover the MDS daemon for the failed node? On Wed, Mar 20, 2013 at 3:23 PM, Dave (Bob) d...@bob-the-boat.me.uk wrote: Igor, I am sure that I'm right in saying that you just have to create a new filesystem (btrfs?) on the new block device, mount it, and then initialise the osd with: ceph-osd -i <the osd number> --mkfs Then you can start the osd with: ceph-osd -i <the osd number> Since you are replacing an osd that already existed, the cluster knows about it, and there is a key for it that is known. I don't claim any great expertise, but this is what I've been doing, and the cluster seems to adopt the new osd and sort everything out. David ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Igor Laskovy facebook.com/igor.laskovy Kiev, Ukraine ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Replacement hardware
Yeah. If you run ceph auth list you'll get a dump of all the users and keys the cluster knows about; each of your daemons has that key stored somewhere locally (generally under /var/lib/ceph/[osd|mds|mon]/ceph-$id). You can create more or copy an unused MDS one. I believe the docs include information on how this works. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wednesday, March 20, 2013 at 10:48 AM, Igor Laskovy wrote: Well, can you please clarify exactly what key I must use? Do I need to get/generate it somehow from the working cluster? On Wed, Mar 20, 2013 at 7:41 PM, Greg Farnum g...@inktank.com wrote: The MDS doesn't have any local state. You just need to start up the daemon somewhere with a name and key that are known to the cluster (these can be different from or the same as the ones that existed on the dead node; it doesn't matter!). -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Wednesday, March 20, 2013 at 10:40 AM, Igor Laskovy wrote: Actually, I have already recovered the OSDs and MON daemon back into the cluster according to http://ceph.com/docs/master/rados/operations/add-or-rm-osds/ and http://ceph.com/docs/master/rados/operations/add-or-rm-mons/. But the doc is missing info about removing/adding an MDS. How can I recover the MDS daemon for the failed node? On Wed, Mar 20, 2013 at 3:23 PM, Dave (Bob) d...@bob-the-boat.me.uk wrote: Igor, I am sure that I'm right in saying that you just have to create a new filesystem (btrfs?) on the new block device, mount it, and then initialise the osd with: ceph-osd -i <the osd number> --mkfs Then you can start the osd with: ceph-osd -i <the osd number> Since you are replacing an osd that already existed, the cluster knows about it, and there is a key for it that is known. I don't claim any great expertise, but this is what I've been doing, and the cluster seems to adopt the new osd and sort everything out. David ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Igor Laskovy facebook.com/igor.laskovy Kiev, Ukraine ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Igor Laskovy facebook.com/igor.laskovy Kiev, Ukraine ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
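A minimal sketch of bringing a replacement MDS up with a fresh key (the daemon name 'a', the cap strings, and the keyring path are assumptions; compare against ceph auth list and your release's docs):

# create (or fetch) a key for the new mds and store it where the daemon looks
mkdir -p /var/lib/ceph/mds/ceph-a
ceph auth get-or-create mds.a mon 'allow rwx' osd 'allow *' mds 'allow' -o /var/lib/ceph/mds/ceph-a/keyring
# start the daemon under that name
ceph-mds -i a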
Re: [ceph-users] Status of Mac OS and Windows PC client
At various times the ceph-fuse client has worked on OS X. Noah was the last one to do this, and the branch for it is sitting in my long-term really-like-to-get-this-mainlined-someday queue. OS X is a lot easier than Windows, though, and nobody's done any planning around that beyond noting that there are FUSE-like systems for Windows, and that Samba is a workaround. Sorry. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Tuesday, March 19, 2013 at 8:58 AM, Igor Laskovy wrote: Thanks for the reply! Actually, I would like to find some way to use one large scalable central storage across multiple PCs and Macs. CephFS would be most suitable here, but you provide only Linux support. Really no planning here? On Tue, Mar 19, 2013 at 3:52 PM, Patrick McGarry patr...@inktank.com wrote: Hey Igor, Currently there are no plans to develop an OS X or Windows-specific client per se. We do provide a number of different ways to expose the cluster so that you could use it from these machines, however. The most recent example of this is the work being done on tgt that can expose Ceph via iSCSI. For reference see: http://www.mail-archive.com/ceph-devel@vger.kernel.org/msg11662.html Keep an eye out for more details in the near future. Best Regards, Patrick McGarry Director, Community || Inktank http://ceph.com || http://inktank.com @scuttlemonkey || @ceph || @inktank On Tue, Mar 19, 2013 at 8:30 AM, Igor Laskovy igor.lask...@gmail.com wrote: Anybody? :) Igor Laskovy facebook.com/igor.laskovy Kiev, Ukraine On Mar 17, 2013 6:37 PM, Igor Laskovy igor.lask...@gmail.com wrote: Hi there! Could you please clarify the current status of development of a client for OS X and Windows desktop editions? -- Igor Laskovy facebook.com/igor.laskovy Kiev, Ukraine ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- Igor Laskovy facebook.com/igor.laskovy Kiev, Ukraine ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Crash and strange things on MDS
On Friday, March 8, 2013 at 3:29 PM, Kevin Decherf wrote: On Fri, Mar 01, 2013 at 11:12:17AM -0800, Gregory Farnum wrote: On Tue, Feb 26, 2013 at 4:49 PM, Kevin Decherf ke...@kdecherf.com wrote: You will find the archive here: <snip> The data is not anonymized. Interesting folders/files here are /user_309bbd38-3cff-468d-a465-dc17c260de0c/* Sorry for the delay, but I have retrieved this archive locally at least, so if you want to remove it from your webserver you can do so. :) Also, I notice when I untar it that the file name includes 'filtered'; what filters did you run it through? Hi Gregory, do you have any news about it? I wrote a couple of tools to do log analysis and created a number of bugs to make the MDS more amenable to analysis as a result of this. Having spot-checked some of your longer-running requests, they're all getattrs or setattrs contending on files in what look to be shared cache and PHP libraries. These cover a range from ~40 milliseconds to ~150 milliseconds. I'd look into what your split applications are sharing across those spaces. On the up side for Ceph, 80% of your requests take 0 milliseconds and ~95% of them take less than 2 milliseconds. Hurray, it's not ridiculously slow most of the time. :) -Greg ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Crash and strange things on MDS
On Friday, March 15, 2013 at 3:40 PM, Marc-Antoine Perennou wrote: Thank you a lot for these explanations; looking forward to these fixes! Do you have any public bug reports regarding this to link us to? Good luck, thank you for your great work, and have a nice weekend. Marc-Antoine Perennou Well, for now the fixes are for stuff like making analysis take less time and exporting timing information more easily. The most immediately applicable one is probably http://tracker.ceph.com/issues/4354, which I hope to start on next week and should be done by the end of the sprint. -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com