RE: High-availability testing of ceph
Hi, Josh:

Thanks for your reply. However, I had asked a question about the replica setting before: http://www.spinics.net/lists/ceph-devel/msg07346.html

If the performance of an rbd device is n MB/s under replica=2, I used to think that the total I/O throughput on the hard disks would be over 3 * n MB/s, because I assumed the total number of copies was 3. It now seems that was not correct: the total number of copies is only 2, so the total I/O throughput on disk should be 2 * n MB/s. Right?

-----Original Message-----
From: Josh Durgin [mailto:josh.dur...@inktank.com]
Sent: Tuesday, July 31, 2012 1:56 PM
To: Eric YH Chen/WYHQ/Wiwynn
Cc: ceph-devel@vger.kernel.org; Chris YT Huang/WYHQ/Wiwynn; Victor CY Chang/WYHQ/Wiwynn
Subject: Re: High-availability testing of ceph

On 07/30/2012 07:46 PM, eric_yh_c...@wiwynn.com wrote:
> Hi, all:
>
> I am testing the high availability of Ceph.
>
> Environment: two servers, with 12 hard disks on each server.
> Version: Ceph 0.48
> Kernel: 3.2.0-27
>
> We created a Ceph cluster with 24 OSDs:
> osd.0 ~ osd.11 are on server1
> osd.12 ~ osd.23 are on server2
>
> The CRUSH rule is the default rule:
>
> rule rbd {
>         ruleset 2
>         type replicated
>         min_size 1
>         max_size 10
>         step take default
>         step chooseleaf firstn 0 type host
>         step emit
> }
>
> pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 1536 pgp_num 1536 last_change 1172 owner 0
>
> Test case 1:
> 1. Create an rbd device and read/write to it
> 2. Randomly turn off one OSD on server1 (service ceph stop osd.0)
> 3. Check the read/write of the rbd device
>
> Test case 2:
> 1. Create an rbd device and read/write to it
> 2. Randomly turn off one OSD on server1 (service ceph stop osd.0)
> 3. Randomly turn off one OSD on server2 (service ceph stop osd.12)
> 4. Check the read/write of the rbd device
>
> In test case 1, we can access the rbd device as normal. But in test case 2, it hangs with no response. Is this the correct behavior? I imagined that we could turn off any two OSDs when we set the replication to 2, because besides the master data we would have two other copies on two different OSDs; even after turning off two OSDs, the data could be found on a third OSD. Any misunderstanding? Thanks!

rep size is the total number of copies, so stopping two osds with rep size 2 may cause you to lose access to some objects.

Josh
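As a quick sanity check on the arithmetic above, the pool's actual replication factor can be read back from the osdmap; a sketch (pool id and name taken from the message above):

    # show the replication factor recorded in the osdmap for the rbd pool
    ceph osd dump | grep "rep size"
    # => pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins ...

    # with rep size 2, a client writing n MB/s generates roughly 2*n MB/s
    # of writes across the osd data disks, plus about the same again to
    # the osd journals, before any filesystem overhead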
The cluster does not notice that some OSDs have disappeared
Dear All:

My environment: two servers, with 12 hard disks on each server.
Version: Ceph 0.48, Kernel: 3.2.0-27

We created a Ceph cluster with 24 OSDs and 3 monitors:
osd.0 ~ osd.11 are on server1
osd.12 ~ osd.23 are on server2
mon.0 is on server1
mon.1 is on server2
mon.2 is on server3, which has no OSDs

When I turn off the network of server1, we expect server2 to notice that the 12 OSDs on server1 have disappeared. However, when I type ceph -s, it still shows 24 OSDs there. And in the logs of osd.0 and osd.11 (on server1) we can see heartbeat checks, but there is nothing similar on server2. What happened to server2? Can we restart the heartbeat mechanism? Thanks!

root@wistor-002:~# ceph -s
   health HEALTH_WARN 1 mons down, quorum 1,2 008,009
   monmap e1: 3 mons at {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}, election epoch 522, quorum 1,2 008,009
   osdmap e1388: 24 osds: 24 up, 24 in
   pgmap v288663: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
   mdsmap e1: 0/0/1 up

Log of ceph -w (we turned off server1 around 15:20, which caused the new monitor election):

2012-07-31 15:21:25.966572 mon.0 [INF] pgmap v288658: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:20:10.400566 mon.1 [INF] mon.008 calling new monitor election
2012-07-31 15:21:36.030473 mon.1 [INF] mon.008 calling new monitor election
2012-07-31 15:21:36.079772 mon.2 [INF] mon.009 calling new monitor election
2012-07-31 15:21:46.102587 mon.1 [INF] mon.008@1 won leader election with quorum 1,2
2012-07-31 15:21:46.273253 mon.1 [INF] pgmap v288659: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:21:46.273379 mon.1 [INF] mdsmap e1: 0/0/1 up
2012-07-31 15:21:46.273495 mon.1 [INF] osdmap e1388: 24 osds: 24 up, 24 in
2012-07-31 15:21:46.273814 mon.1 [INF] monmap e1: 3 mons at {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}
2012-07-31 15:21:46.587679 mon.1 [INF] pgmap v288660: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:22:01.245813 mon.1 [INF] pgmap v288661: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail
2012-07-31 15:22:33.970838 mon.1 [INF] pgmap v288662: 4608 pgs: 4608 active+clean; 257 GB data, 988 GB used, 20214 GB / 22320 GB avail

Log of osd.0 (on server1):

2012-07-31 15:20:25.309264 7fdc06470700 0 -- 192.168.200.81:6825/12162 >> 192.168.200.82:6840/8772 pipe(0x4dbea00 sd=52 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state 1
2012-07-31 15:20:25.310887 7fdc1c551700 0 -- 192.168.200.81:6825/12162 >> 192.168.200.82:6833/15570 pipe(0x4dbec80 sd=51 pgs=0 cs=0 l=0).accept connect_seq 0 vs existing 0 state 1
2012-07-31 15:21:46.861458 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.12 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861496 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.13 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861506 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.14 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861514 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.15 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861522 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.16 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861530 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.17 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861538 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.18 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861546 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.19 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861556 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.20 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861576 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.21 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861609 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.22 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)
2012-07-31 15:21:46.861618 7fdc14e9d700 -1 osd.0 1388 heartbeat_check: no reply from osd.23 since 2012-07-31 15:21:26.770108 (cutoff 2012-07-31 15:21:26.861458)

Log of osd.12 (on server2):

2012-07-31 15:20:31.475815 7f9eac5ba700 0 osd.12 1387 pg[2.16f( v 1356'10485 (465'9480,1356'10485] n=42 ec=1 les/c 1387/1387 1383/1383/1383) [12,0] r=0 lpr=1383 mlcod 0'0 active+clean] watch: oi.user_version=45
2012-07-31
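For readers wondering which settings govern this failure detection, these are the relevant ceph.conf options; the values shown are only the argonaut-era defaults as I understand them, not a recommendation:

    [osd]
        ; how often an osd pings its heartbeat peers
        osd heartbeat interval = 6
        ; seconds without a reply before a peer is reported down to the monitors
        osd heartbeat grace = 20

    [mon]
        ; seconds a "down" osd stays "in" before data starts re-replicating
        mon osd down out interval = 300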
How to integrate ceph with opendedup.
Hi all,

I want to integrate Ceph with OpenDedup (SDFS) using java-rados. Please help me with integrating Ceph and OpenDedup.

Thanks,
Ramu.
Re: How to integrate ceph with opendedup.
On 07/31/2012 11:18 AM, ramu wrote:
> Hi all,
> I want to integrate ceph with opendedup(sdfs) using java-rados.
> Please help me to integration of ceph with opendedup.
> Thanks,
> Ramu.

What is the exact use-case for this? I get the point of de-duplication, but having a filesystem running on top of RADOS and not using CephFS? That doesn't seem like a trivial integration.

Wido
Cannot start up one of the OSDs
Hi, all:

My environment: two servers, with 12 hard disks on each server.
Version: Ceph 0.48, Kernel: 3.2.0-27

We created a Ceph cluster with 24 OSDs and 3 monitors:
osd.0 ~ osd.11 are on server1
osd.12 ~ osd.23 are on server2
mon.0 is on server1
mon.1 is on server2
mon.2 is on server3, which has no OSDs

root@ubuntu:~$ ceph -s
   health HEALTH_WARN 227 pgs degraded; 93 pgs down; 93 pgs peering; 85 pgs recovering; 82 pgs stuck inactive; 255 pgs stuck unclean; recovery 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%); 1/24 in osds are down
   monmap e1: 3 mons at {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}, election epoch 564, quorum 0,1,2 006,008,009
   osdmap e1911: 24 osds: 23 up, 24 in
   pgmap v292031: 4608 pgs: 4251 active+clean, 85 active+recovering+degraded, 37 active+remapped, 58 down+peering, 142 active+degraded, 35 down+replay+peering; 257 GB data, 948 GB used, 19370 GB / 21390 GB avail; 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%)
   mdsmap e1: 0/0/1 up

I find that one of the OSDs cannot start up anymore. Before that, I was testing the HA of the Ceph cluster:

Step 1: shut down server1, wait 5 min
Step 2: boot up server1, wait 5 min until ceph reaches a healthy status
Step 3: shut down server2, wait 5 min
Step 4: boot up server2, wait 5 min until ceph reaches a healthy status

After repeating steps 1~4 several times, I hit this problem.

Log of ceph-osd.22.log:

2012-07-31 17:18:15.120678 7f9375300780 0 filestore(/srv/disk10/data) mount found snaps
2012-07-31 17:18:15.122081 7f9375300780 0 filestore(/srv/disk10/data) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-07-31 17:18:15.128544 7f9375300780 1 journal _open /srv/disk10/journal fd 23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.257302 7f9375300780 1 journal _open /srv/disk10/journal fd 23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.273163 7f9375300780 1 journal close /srv/disk10/journal
2012-07-31 17:18:15.274395 7f9375300780 -1 filestore(/srv/disk10/data) limited size xattrs -- filestore_xattr_use_omap enabled
2012-07-31 17:18:15.275169 7f9375300780 0 filestore(/srv/disk10/data) mount FIEMAP ioctl is supported and appears to work
2012-07-31 17:18:15.275180 7f9375300780 0 filestore(/srv/disk10/data) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2012-07-31 17:18:15.275312 7f9375300780 0 filestore(/srv/disk10/data) mount did NOT detect btrfs
2012-07-31 17:18:15.276060 7f9375300780 0 filestore(/srv/disk10/data) mount syncfs(2) syscall fully supported (by glibc and kernel)
2012-07-31 17:18:15.276154 7f9375300780 0 filestore(/srv/disk10/data) mount found snaps
2012-07-31 17:18:15.277031 7f9375300780 0 filestore(/srv/disk10/data) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-07-31 17:18:15.280906 7f9375300780 1 journal _open /srv/disk10/journal fd 32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.307761 7f9375300780 1 journal _open /srv/disk10/journal fd 32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:19.466921 7f9360a97700 0 -- 192.168.200.82:6830/18744 >> 192.168.200.83:0/3485583732 pipe(0x45bd000 sd=34 pgs=0 cs=0 l=0).accept peer addr is really 192.168.200.83:0/3485583732 (socket is 192.168.200.83:45653/0)
2012-07-31 17:18:19.671681 7f9363a9d700 -1 os/DBObjectMap.cc: In function 'virtual bool DBObjectMap::DBObjectMapIteratorImpl::valid()' thread 7f9363a9d700 time 2012-07-31 17:18:19.670082
os/DBObjectMap.cc: 396: FAILED assert(!valid || cur_iter->valid())
 ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
 1: /usr/bin/ceph-osd() [0x6a3123]
 2: (ReplicatedPG::send_push(int, ObjectRecoveryInfo, ObjectRecoveryProgress, ObjectRecoveryProgress*)+0x684) [0x53f314]
 3: (ReplicatedPG::push_start(ReplicatedPG::ObjectContext*, hobject_t const&, int, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&)+0x333) [0x54c873]
 4: (ReplicatedPG::push_to_replica(ReplicatedPG::ObjectContext*, hobject_t const&, int)+0x343) [0x54cdc3]
 5: (ReplicatedPG::recover_object_replicas(hobject_t const&, eversion_t)+0x35f) [0x5527bf]
 6: (ReplicatedPG::wait_for_degraded_object(hobject_t const&, std::tr1::shared_ptr<OpRequest>)+0x17b) [0x55406b]
 7: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x9de) [0x56305e]
 8: (PG::do_request(std::tr1::shared_ptr<OpRequest>)+0x199) [0x5fda89]
 9: (OSD::dequeue_op(PG*)+0x238) [0x5bf668]
 10: (ThreadPool::worker()+0x605) [0x796d55]
 11: (ThreadPool::WorkThread::entry()+0xd) [0x5d5d0d]
 12: (()+0x7e9a) [0x7f9374794e9a]
 13: (clone()+0x6d) [0x7f93734344bd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   -21> 2012-07-31
Re: Ceph Benchmark HowTo
Hi all,

I have updated the how-to here: http://ceph.com/wiki/Benchmark
And published the results of my latest tests: http://ceph.com/wiki/Benchmark#First_Example

All results are good; my benchmark is clearly limited by my network connection (~110 MB/s). The exception is the rest-api bench, whose value seems really low. I have configured radosgw following this: http://ceph.com/docs/master/radosgw/config/

I drop the disk caches on all servers before each bench, and run rest-bench for 900 seconds with default values. Is my rest-bench result normal? Have I missed something? Don't hesitate to ask if you need more information on my setup.

And then I have another question: how is the standard deviation calculated by rados bench and rest-bench? From the per-second values printed by the benchmark client? If so, when latency is too high the reported bandwidth is sometimes zero; does the calculated standard deviation for bandwidth still make sense then?

Cheers,
--
Mehdi Abaakouk for eNovance
mail: sil...@sileht.net
irc: sileht
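For context, the invocations in question look roughly like this (host name, keys, and bucket below are placeholders, and the rest-bench option spellings should be checked against your build):

    # drop page/dentry/inode caches on every node before each run
    sync; echo 3 > /proc/sys/vm/drop_caches

    # raw RADOS write bench: 900 seconds, 16 concurrent ops, pool "rbd"
    rados -p rbd bench 900 write -t 16

    # same style of workload through radosgw's S3 front end
    rest-bench --api-host=gateway.example.com --access-key=KEY --secret=SECRET \
               --bucket=bench --seconds=900 -t 16 write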
About teuthology
Hi,

I have taken a look at teuthology. The automation of all these tests is good, but is there any way to run it against an already-installed Ceph cluster?

Thanks in advance.

Cheers,
--
Mehdi Abaakouk for eNovance
mail: sil...@sileht.net
irc: sileht
Re: About teuthology
On 7/31/12 8:59 AM, Mehdi Abaakouk wrote:
> Hi,
>
> I have taken a look at teuthology. The automation of all these tests is good, but is there any way to run it against an already-installed Ceph cluster?

Hi Mehdi,

I think a number of the test-related tasks should run fine without strictly requiring the ceph task. You may have to change binary locations for things like rados, but those should be pretty minor. The best way to find out is to give it a try!

Mark
another performance-related thread
Hi,

I've finally managed to run an rbd-related test on relatively powerful machines, and here is what I got:

1) Reads on an almost fairly balanced cluster (eight nodes) did very well, utilizing almost all disk and network bandwidth. Dual-gbit 802.3ad NICs and SATA disks behind an LSI SAS 2108 with WT cache gave me ~1.6 GB/s on linear and sequential reads, which is close to the overall disk throughput.

2) Writes were much worse, both with rados bench and with fio when I ran fio simultaneously on 120 VMs. At best, overall performance was about 400 MB/s, using rados bench -t 12 on three host nodes.

fio config:

    rw=(randread|randwrite|seqread|seqwrite)
    size=256m
    direct=1
    directory=/test
    numjobs=1
    iodepth=12
    group_reporting
    name=random-read-direct
    bs=1M
    loops=12

For the 120-VM set, in MB/s:

    linear reads:  MEAN: 14156  STDEV: 612.596
    random reads:  MEAN: 14128  STDEV: 911.789
    linear writes: MEAN: 2956   STDEV: 283.165
    random writes: MEAN: 2986   STDEV: 361.311

Each node holds 15 VMs, and with a 64M rbd cache all three possible states - wb, wt, and no-cache - show almost the same numbers in the tests.

I wonder if it is possible to raise the write/read ratio somehow. It seems the OSDs underutilize themselves; e.g. I am not able to get a single-threaded rbd write above 35 MB/s. Adding a second OSD on the same disk only raises iowait time, not the benchmark results.
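For the single-threaded case, a quick way to take the VM stack out of the picture is to write to a mapped rbd device directly with O_DIRECT; a sketch (image and device names are assumptions):

    # map the image through the kernel rbd driver and time one sequential writer
    rbd map test-image
    dd if=/dev/zero of=/dev/rbd0 bs=1M count=1024 oflag=direct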
Re: [EXTERNAL] Re: avoiding false detection of down OSDs
On 07/30/2012 06:24 PM, Gregory Farnum wrote:
> On Mon, Jul 30, 2012 at 3:47 PM, Jim Schutt <jasc...@sandia.gov> wrote:
>> Above you mentioned that you are seeing these issues as you scaled out a storage cluster, but none of the solutions you mentioned address scaling. Let's assume your preferred solution handles this issue perfectly on the biggest cluster anyone has built today. What do you predict will happen when that cluster size is scaled up by a factor of 2, or 10, or 100?
>
> Sage should probably describe in more depth what we've seen since he's looked at it the most, but I can expand on it a little. In argonaut and earlier versions of Ceph, processing a new OSDMap for an OSD is very expensive. I don't remember the precise numbers we'd whittled it down to, but it required at least one disk sync as well as pausing all request processing for a while. If you combined this expense with a large number of large maps (if, perhaps, one quarter of your 800-OSD system had been down but not out for 6+ hours), you could cause memory thrashing on OSDs as they came up, which could force them to become very, very, veeery slow.
>
> In the next version of Ceph, map processing is much less expensive (no syncs or full-system pauses required), which will prevent request backup. And there are a huge number of ways to reduce the memory utilization of maps, some of which can be backported to argonaut and some of which can't. Now, if we can't prevent our internal processes from running an OSD out of memory, we'll have failed. But we don't think this is an intractable problem; in fact we have reason to hope we've cleared it up now that we've seen the problem — although we don't think it's something that we can absolutely prevent on argonaut (too much code churn). So we're looking for something that we can apply to argonaut as a band-aid, but that we can also keep around in case forces external to Ceph start causing similar cluster-scale resource shortages beyond our control (a runaway co-located process eats up all the memory on lots of boxes, a switch fails and bandwidth gets cut in half, etc). If something happens that means Ceph can only supply half as much throughput as it did previously, then Ceph should provide that much throughput; right now, if that kind of incident occurs, Ceph won't provide any throughput because it will all be eaten by spurious recovery work.

Ah, thanks for the extra context. I hadn't fully appreciated that the proposal was primarily a mitigation for argonaut, and otherwise a fail-safe mechanism.

>> As I mentioned above, I'm concerned this is addressing symptoms, rather than root causes. I'm concerned the root cause has something to do with how the map processing work scales with the number of OSDs/PGs, and that this will limit the maximum size of a Ceph storage cluster.
>
> I think I discussed this above enough already? :)

Yep, thanks.

>> But, if you really just want to not mark down an OSD that is laggy, I know this will sound simplistic, but I keep thinking that the OSD knows for itself if it's up, even when the heartbeat mechanism is backed up. Couldn't there be some way to ask an OSD suspected of being down whether it is or not, separate from the heartbeat mechanism? I mean, if you're considering having the monitor ignore OSD down reports for a while based on some estimate of past behavior, wouldn't it be better for the monitor to just ask such an OSD, "hey, are you still there?" If it gets an immediate "I'm busy, come back later", extend the grace period; otherwise, mark the OSD down.
>
> Hmm. The concern is that if an OSD is stuck on disk swapping then it's going to be just as stuck for the monitors as for the OSDs — they're all using the same network in the basic case, etc. We want to be able to make that guess before the OSD is able to answer such questions. But I'll think on whether we could try something else similar.

OK - thanks.

Also, FWIW I've been running my Ceph servers with no swap, and I've recently doubled the size of my storage cluster. Is it possible to have map processing do a little memory accounting and log it, or to provide some way to learn that map processing is chewing up significant amounts of memory? Or maybe there's already a way to learn this that I need to learn about? I sometimes run into something that shares some characteristics with what you describe, but is primarily triggered by high client write load. I'd like to be able to confirm or deny that it's the same basic issue you've described.

Thanks -- Jim

> -Greg
Re: another performance-related thread
Hi Andrey!

On 07/31/2012 10:03 AM, Andrey Korolyov wrote:
> 1) Reads on an almost fairly balanced cluster (eight nodes) did very well, utilizing almost all disk and network bandwidth (dual-gbit 802.3ad NICs and SATA disks behind an LSI SAS 2108 with WT cache gave me ~1.6 GB/s on linear and sequential reads, which is close to the overall disk throughput).

Does your 2108 have the RAID or JBOD firmware? I'm guessing the RAID firmware, given that you are able to change the caching behavior? How do you have the arrays set up for the OSDs?

> 2) Writes were much worse, both with rados bench and with fio when I ran fio simultaneously on 120 VMs - at best, overall performance was about 400 MB/s, using rados bench -t 12 on three host nodes.
> [fio config and per-workload numbers snipped]
> I wonder if it is possible to raise the write/read ratio somehow. It seems the OSDs underutilize themselves; e.g. I am not able to get a single-threaded rbd write above 35 MB/s. Adding a second OSD on the same disk only raises iowait time, not the benchmark results.

I've seen high IO wait times (especially with small writes) via rados bench as well. It's something we are actively investigating. Part of the issue with rados bench is that every single request gets written to a separate file, so especially at small IO sizes there is a lot of underlying filesystem metadata traffic. For us, this is happening on 9260 controllers with RAID firmware. I think we may see some improvement by switching to 2X08 cards with the JBOD (i.e. IT) firmware, but we haven't confirmed it yet. We actually just purchased a variety of alternative RAID and SAS controllers to test with, to see how universal this problem is.

Theoretically RBD shouldn't suffer from this as badly, since small writes to the same file should get buffered. The same is true for CephFS when doing buffered IO to a single file, due to the Linux buffer cache. Small writes to many files will likely suffer in the same way that rados bench does, though.

--
Mark Nelson
Performance Engineer
Inktank
Re: another performance-related thread
On 07/31/2012 08:03 AM, Andrey Korolyov wrote:
> 2) Writes were much worse, both with rados bench and with fio when I ran fio simultaneously on 120 VMs - at best, overall performance was about 400 MB/s, using rados bench -t 12 on three host nodes.

How are your osd journals configured? What's your ceph.conf for the osds?

> Each node holds 15 VMs, and with a 64M rbd cache all three possible states - wb, wt, and no-cache - show almost the same numbers in the tests.
> [...]

Are these write tests using direct I/O? That will bypass the cache for writes, which would explain the similar numbers with the different cache modes.
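Since the journal question comes up here, this is roughly the ceph.conf shape being asked about (paths and size are placeholders; $name expands to e.g. osd.0):

    [osd]
        ; place each osd's journal on the shared SSD
        osd journal = /srv/ssd/$name/journal
        ; journal size in MB (6 GB, matching the journal logs earlier in this digest)
        osd journal size = 6144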
Re: another performance-related thread
On 07/31/2012 07:17 PM, Mark Nelson wrote:
> Does your 2108 have the RAID or JBOD firmware? I'm guessing the RAID firmware, given that you are able to change the caching behavior? How do you have the arrays set up for the OSDs?

Exactly - I am able to change the cache behavior on the fly using the 'famous' megacli binary. Each node contains three disks, each of them configured as a single-disk raid0: two 7200 rpm server SATA drives, and an Intel 313 for the journal. On the SATA drives I am using xfs with default mount options, and on the SSD I've put ext4 with the journal disabled and, of course, with discard/noatime. This 2108 comes with SuperMicro firmware 2.120.243-1482 - I am guessing it is the RAID variant, and I haven't tried to reflash it yet. For the tests, I have forced the write-through cache on - this should be very good at aggregating small writes. Before using this config, I had configured two disks as RAID0 and got slightly worse results on the write bench. Thanks for suggesting the JBOD firmware - I'll run tests with it this week and post the results.

> I've seen high IO wait times (especially with small writes) via rados bench as well. It's something we are actively investigating. Part of the issue with rados bench is that every single request gets written to a separate file, so especially at small IO sizes there is a lot of underlying filesystem metadata traffic. [...]

For 24 HT cores I have seen 2 percent iowait at most (during writes), so almost surely there is no IO bottleneck at all (except when breaking the 'one osd per physical disk' rule, when iowait rises up to 50 percent on the entire system). rados bench is not a universal measurement tool, though - using VM IO requests instead of manipulating rados objects leads to an almost fair result, in my opinion.

> We actually just purchased a variety of alternative RAID and SAS controllers to test with, to see how universal this problem is. Theoretically RBD shouldn't suffer from this as badly, since small writes to the same file should get buffered. The same is true for CephFS when doing buffered IO to a single file, due to the Linux buffer cache. Small writes to many files will likely suffer in the same way that rados bench does, though.
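For anyone reproducing the cache toggling mentioned above, the MegaCLI invocations are roughly as follows (adapter numbering is assumed; check against your controller's documentation):

    # force write-through on all logical disks of adapter 0
    MegaCli -LDSetProp WT -LAll -a0

    # back to write-back, even with a missing/bad BBU
    MegaCli -LDSetProp WB -LAll -a0
    MegaCli -LDSetProp CachedBadBBU -LAll -a0

    # show the current cache policy
    MegaCli -LDGetProp -Cache -LAll -a0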
Re: About teuthology
On Tue, Jul 31, 2012 at 09:27:54AM -0500, Mark Nelson wrote:
> I think a number of the test-related tasks should run fine without strictly requiring the ceph task. You may have to change binary locations for things like rados, but those should be pretty minor. The best way to find out is to give it a try!

Thanks for your quick answer :)

I have already tried, but the code massively refers to files in /tmp/cephtest/; it seems to me that changing the path of the binaries isn't enough, as some of them are built by the ceph task. Perhaps a quicker (if a bit dirty) way is to create a new task, 'cephdist', that prepares the required files in /tmp/cephtest, i.e.:

- link the dist binaries into /tmp/cephtest/binary/usr/local/bin/...
- link /etc/ceph/ceph.conf to /tmp/cephtest/ceph.conf
- ship the cephtest tools in /tmp/cephtest (like the ceph task does)
- make a dummy script for coverage (because a distribution-installed ceph doesn't ship ceph-coverage)

What do you think about it? (A rough sketch of this idea follows below.)

Cheers
--
Mehdi Abaakouk for eNovance
mail: sil...@sileht.net
irc: sileht
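A shell sketch of what such a 'cephdist' preparation step might do on each node, mirroring the layout the ceph task normally builds (all paths are assumptions taken from the message above):

    # mimic the /tmp/cephtest layout teuthology's ceph task creates
    mkdir -p /tmp/cephtest/binary/usr/local/bin
    ln -s /usr/bin/rados /tmp/cephtest/binary/usr/local/bin/rados
    ln -s /usr/bin/ceph  /tmp/cephtest/binary/usr/local/bin/ceph
    ln -s /etc/ceph/ceph.conf /tmp/cephtest/ceph.conf

    # stub out ceph-coverage, which packaged builds don't ship;
    # it is invoked as "ceph-coverage <covdir> <cmd...>", so drop
    # the first argument and exec the rest
    printf '#!/bin/sh\nshift\nexec "$@"\n' \
        > /tmp/cephtest/binary/usr/local/bin/ceph-coverage
    chmod +x /tmp/cephtest/binary/usr/local/bin/ceph-coverage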
Re: [PATCH v3] rbd: fix the memory leak of bio_chain_clone
On Mon, Jul 30, 2012 at 02:54:44PM -0700, Yehuda Sadeh wrote:
> On Thu, Jul 26, 2012 at 11:20 PM, Guangliang Zhao <gz...@suse.com> wrote:
>> The bio_pair allocated in bio_chain_clone would not be freed; this will cause a memory leak. It could actually be freed only after 3 release calls, because the reference count of a bio_pair is initialized to 3 by bio_split, and bio_pair_release only drops the reference count.
>>
>> bio_pair_release must be called three times to release a bio_pair, and the callback functions of the bios on the requests will be called on the last release in bio_pair_release; however, these functions will also be called in rbd_req_cb. In other words, they will be called twice, which may cause serious consequences.
>>
>> This patch clones the bio chain from the original directly and doesn't use bio_split (and thus no bio_pair). The new bio chain can be released whenever we don't need it.
>>
>> Signed-off-by: Guangliang Zhao <gz...@suse.com>
>> ---
>>  drivers/block/rbd.c |   73 ++++++++++++++++++----------------------------
>>  1 files changed, 31 insertions(+), 42 deletions(-)
>>
>> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
>> index 013c7a5..356657d 100644
>> --- a/drivers/block/rbd.c
>> +++ b/drivers/block/rbd.c
>> @@ -712,51 +712,46 @@ static void zero_bio_chain(struct bio *chain, int start_ofs)
>>  	}
>>  }
>>
>> -/*
>> - * bio_chain_clone - clone a chain of bios up to a certain length.
>> - * might return a bio_pair that will need to be released.
>> +/**
>> + * bio_chain_clone - clone a chain of bios up to a certain length.
>> + * @old: bio to clone
>> + * @offset: start point for bio clone
>> + * @len: length of bio chain
>> + * @gfp_mask: allocation priority
>> + *
>> + * RETURNS:
>> + * Pointer to new bio chain on success, NULL on failure.
>>   */
>> -static struct bio *bio_chain_clone(struct bio **old, struct bio **next,
>> -				   struct bio_pair **bp,
>> +static struct bio *bio_chain_clone(struct bio **old, int *offset,
>>  				   int len, gfp_t gfpmask)
>>  {
>>  	struct bio *tmp, *old_chain = *old, *new_chain = NULL, *tail = NULL;
>>  	int total = 0;
>>
>> -	if (*bp) {
>> -		bio_pair_release(*bp);
>> -		*bp = NULL;
>> -	}
>> -
>>  	while (old_chain && (total < len)) {
>> +		int need = len - total;
>> +
>>  		tmp = bio_kmalloc(gfpmask, old_chain->bi_max_vecs);
>>  		if (!tmp)
>>  			goto err_out;
>>
>> -		if (total + old_chain->bi_size > len) {
>> -			struct bio_pair *bp;
>> -
>> -			/*
>> -			 * this split can only happen with a single paged bio,
>> -			 * split_bio will BUG_ON if this is not the case
>> -			 */
>> -			dout("bio_chain_clone split! total=%d remaining=%d"
>> -			     "bi_size=%d\n",
>> -			     (int)total, (int)len-total,
>> -			     (int)old_chain->bi_size);
>> -
>> -			/* split the bio. We'll release it either in the next
>> -			   call, or it will have to be released outside */
>> -			bp = bio_split(old_chain, (len - total) / SECTOR_SIZE);
>> -			if (!bp)
>> -				goto err_out;
>> -
>> -			__bio_clone(tmp, &bp->bio1);
>> -
>> -			*next = &bp->bio2;
>> +		__bio_clone(tmp, old_chain);
>> +		tmp->bi_sector += *offset >> SECTOR_SHIFT;
>> +		tmp->bi_io_vec->bv_offset += *offset >> SECTOR_SHIFT;
>> +		/*
>> +		 * The bios that span across multiple osd objects must be
>> +		 * single paged; rbd_merge_bvec would guarantee it.
>> +		 * So we needn't worry about other things.
>> +		 */
>> +		if (tmp->bi_size - *offset > need) {
>> +			tmp->bi_size = need;
>> +			tmp->bi_io_vec->bv_len = need;
>> +			*offset += need;
>>  		} else {
>> -			__bio_clone(tmp, old_chain);
>> -			*next = old_chain->bi_next;
>> +			old_chain = old_chain->bi_next;
>> +			tmp->bi_size -= *offset;
>> +			tmp->bi_io_vec->bv_len -= *offset;
>> +			*offset = 0;
>>  		}
>
> There's still some inherent issue here, in that it assumes tmp->bi_io_vec points to the only iovec for this bio. I don't think that is necessarily true; there may be multiple iovecs,

Yes, the bios on the requests may have one or more pages, but the ones that span multiple osd objects *must* be single-page bios because of rbd_merge_bvec. With rbd_merge_bvec, a new bvec will not be permitted to merge if it makes the bio cross the osd boundary, except the
Re: Cannot start up one of the OSDs
This crash happens on each startup?
-Sam

On Tue, Jul 31, 2012 at 2:32 AM, <eric_yh_c...@wiwynn.com> wrote:
> Hi, all:
>
> My environment: two servers, with 12 hard disks on each server.
> Version: Ceph 0.48, Kernel: 3.2.0-27
>
> We created a Ceph cluster with 24 OSDs and 3 monitors:
> osd.0 ~ osd.11 are on server1, osd.12 ~ osd.23 are on server2;
> mon.0 is on server1, mon.1 is on server2, mon.2 is on server3, which has no OSDs.
>
> [snip: same ceph -s output, HA test steps, and ceph-osd.22.log excerpt as in the original report above, ending in:]
>
> os/DBObjectMap.cc: 396: FAILED assert(!valid || cur_iter->valid())
> ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
> 1: /usr/bin/ceph-osd() [0x6a3123]
> [...]
> 12: (()+0x7e9a) [0x7f9374794e9a]
Re: About teuthology
On Tue, Jul 31, 2012 at 6:59 AM, Mehdi Abaakouk <sil...@sileht.net> wrote:
> I have taken a look at teuthology. The automation of all these tests is good, but is there any way to run it against an already-installed Ceph cluster?

Many of the actual tests being run are already independent functionality or stress tests or benchmarks; for example, ffsb will run against any filesystem. The things that are specifically written for teuthology, though, are currently quite tied to its internals.

There is a longer-term plan to rework teuthology to use package-based installation, and at that time I hope we will be able to modularize the tests out of the teuthology core and make them easier to run from just the command line. This work depends on a bunch of internal changes to our testing lab infrastructure -- package-based testing is not feasible until we have lab machine reinstallation 100% automated, and currently it still tends to need too much manual care.
Re: High-availability testing of ceph
On Tue, Jul 31, 2012 at 12:31 AM, <eric_yh_c...@wiwynn.com> wrote:
> If the performance of an rbd device is n MB/s under replica=2, then the total I/O throughput on the hard disks is over 3 * n MB/s, because I thought the total number of copies was 3. So it seems that was not correct; the total number of copies is only 2, and the total I/O throughput on disk should be 2 * n MB/s. Right?

Yes, each replica needs to independently write the data to disk. On top of that, there are journal writes, and filesystems have overhead too. If you create a 1 GB object in a pool replicated 3 times, you should expect about 3*1 GB of writes in total to your osd data disks, and at least 3*1 GB of writes in total to your osd journal disks.

In normal use, you have many servers, and use CRUSH rules to ensure the different replicas are not stored on the same server.
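For completeness, the CRUSH rules Tommi mentions can be inspected and edited by round-tripping the compiled map (file names below are arbitrary):

    # dump the compiled crush map from the cluster and decompile it
    ceph osd getcrushmap -o crushmap.bin
    crushtool -d crushmap.bin -o crushmap.txt

    # after editing crushmap.txt (e.g. "step chooseleaf firstn 0 type host"
    # keeps replicas on distinct hosts), recompile and inject it
    crushtool -c crushmap.txt -o crushmap.new
    ceph osd setcrushmap -i crushmap.new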
Re: [EXTERNAL] Re: avoiding false detection of down OSDs
On Tue, Jul 31, 2012 at 8:07 AM, Jim Schutt <jasc...@sandia.gov> wrote:
> Also, FWIW I've been running my Ceph servers with no swap, and I've recently doubled the size of my storage cluster.
>
> Is it possible to have map processing do a little memory accounting and log it, or to provide some way to learn that map processing is chewing up significant amounts of memory? Or maybe there's already a way to learn this that I need to learn about?
>
> I sometimes run into something that shares some characteristics with what you describe, but is primarily triggered by high client write load. I'd like to be able to confirm or deny that it's the same basic issue you've described.

I think that we've done all our diagnosis using profiling tools, but there's now a map cache and it probably wouldn't be too difficult to have it dump data via perfcounters if you poked around... does anything like this exist yet, Sage?
-Greg
[GIT PULL] Ceph changes for 3.6
Hi Linus,

Please pull the following Ceph changes for 3.6 from:

  git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

There are several trivial conflicts to resolve; sorry! Stephen is carrying fixes for them in linux-next as well.

Lots of stuff this time around:

* lots of cleanup and refactoring in the libceph messenger code, with many hard-to-hit races and bugs closed as a result
* lots of cleanup and refactoring in the rbd code from Alex Elder, mostly in preparation for the layering functionality that will be coming in 3.7
* some misc rbd cleanups from Josh Durgin that are finally going upstream
* support for CRUSH tunables (used by newer clusters to improve the data placement)
* some cleanup in our use of d_parent that Al brought up a while back
* a random collection of fixes across the tree

There is another patch coming that fixes up our ->atomic_open() behavior, but I'm going to hammer on it a bit more before sending it.

Thanks!
sage

Alan Cox (1):
      ceph: fix potential double free

Alex Elder (76):
      libceph: eliminate connection state DEAD
      libceph: kill bad_proto ceph connection op
      libceph: rename socket callbacks
      libceph: rename kvec_reset and kvec_add functions
      libceph: embed ceph messenger structure in ceph_client
      libceph: start separating connection flags from state
      libceph: start tracking connection socket state
      libceph: provide osd number when creating osd
      libceph: set CLOSED state bit in con_init
      libceph: osd_client: don't drop reply reference too early
      libceph: embed ceph connection structure in mon_client
      libceph: init monitor connection when opening
      libceph: fully initialize connection in con_init()
      libceph: tweak ceph_alloc_msg()
      libceph: have messages point to their connection
      libceph: have messages take a connection reference
      libceph: make ceph_con_revoke() a msg operation
      libceph: make ceph_con_revoke_message() a msg op
      libceph: encapsulate out message data setup
      libceph: encapsulate advancing msg page
      libceph: don't mark footer complete before it is
      libceph: move init_bio_*() functions up
      libceph: move init of bio_iter
      libceph: don't use bio_iter as a flag
      libceph: SOCK_CLOSED is a flag, not a state
      libceph: don't change socket state on sock event
      libceph: just set SOCK_CLOSED when state changes
      libceph: don't touch con state in con_close_socket()
      libceph: clear CONNECTING in ceph_con_close()
      libceph: clear NEGOTIATING when done
      libceph: define and use an explicit CONNECTED state
      libceph: separate banner and connect writes
      libceph: distinguish two phases of connect sequence
      libceph: small changes to messenger.c
      libceph: add some fine ASCII art
      libceph: drop declaration of ceph_con_get()
      libceph: fix off-by-one bug in ceph_encode_filepath()
      rbd: drop a useless local variable
      libceph: define ceph_extract_encoded_string()
      rbd: define dup_token()
      rbd: rename rbd_dev->block_name
      rbd: create pool_id device attribute
      rbd: dynamically allocate pool name
      rbd: dynamically allocate object prefix
      rbd: dynamically allocate image header name
      rbd: dynamically allocate image name
      rbd: dynamically allocate snapshot name
      rbd: use rbd_dev consistently
      rbd: rename some fields in struct rbd_dev
      rbd: more symbol renames
      rbd: option symbol renames
      rbd: kill num_reply parameters
      rbd: don't use snapc->seq that way
      rbd: preserve snapc->seq in rbd_header_set_snap()
      rbd: set snapc->seq only when refreshing header
      rbd: kill rbd_image_header->snap_seq
      rbd: drop extra header_rwsem init
      rbd: simplify __rbd_remove_all_snaps()
      rbd: clean up a few dout() calls
      ceph: define snap counts as u32 everywhere
      rbd: encapsulate header validity test
      rbd: rename rbd_device->id
      rbd: snapc is unused in rbd_req_sync_read()
      rbd: drop rbd_header_from_disk() gfp_flags parameter
      rbd: drop rbd_dev parameter in snap functions
      rbd: drop object_name from rbd_req_sync_watch()
      rbd: drop object_name from rbd_req_sync_notify()
      rbd: drop object_name from rbd_req_sync_notify_ack()
      rbd: drop object_name from rbd_req_sync_unwatch()
      rbd: have __rbd_add_snap_dev() return a pointer
      rbd: make rbd_create_rw_ops() return a pointer
      rbd: pass null version pointer in add_snap()
      rbd: always pass ops array to rbd_req_sync_op()
      rbd: fixes in rbd_header_from_disk()
      rbd: return obj version in __rbd_refresh_header()
      rbd: create rbd_refresh_helper()

Dan Carpenter (2):
      rbd: endian bug in rbd_req_cb()
      libceph: fix NULL dereference in reset_connection()

Guanjun He (1):
      libceph: prevent the
Re: How to integrate ceph with opendedup.
On Tue, Jul 31, 2012 at 2:18 AM, ramu <ramu.freesyst...@gmail.com> wrote:
> I want to integrate ceph with opendedup(sdfs) using java-rados.
> Please help me to integration of ceph with opendedup.

It sounds like you could run radosgw and just use the existing S3ChunkStore. If you really want to implement your own ChunkStore straight on top of RADOS, it sounds like FileBasedChunkStore should be an easy model: copy-paste it, and replace all file operations with RADOS object reads/writes/etc.

BTW, it sounds like SDFS is not really a distributed file system, or at least the architecture slides don't point at anything about multiple hosts mounting the same metadata store. It sounds like they made all operations for one file system go through the same metadata server.
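If the radosgw/S3ChunkStore route is taken, the gateway-side setup reduces to creating an S3-style user for SDFS to authenticate as (the uid and display name here are made up):

    # create a radosgw user for SDFS to authenticate as
    radosgw-admin user create --uid=sdfs --display-name="SDFS chunk store"
    # the command prints an access_key/secret_key pair to plug into
    # the S3ChunkStore configuration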
Re: [EXTERNAL] Re: avoiding false detection of down OSDs
On Tue, 31 Jul 2012, Gregory Farnum wrote:
> On Tue, Jul 31, 2012 at 8:07 AM, Jim Schutt <jasc...@sandia.gov> wrote:
>> Is it possible to have map processing do a little memory accounting and log it, or to provide some way to learn that map processing is chewing up significant amounts of memory? [...] I sometimes run into something that shares some characteristics with what you describe, but is primarily triggered by high client write load. I'd like to be able to confirm or deny that it's the same basic issue you've described.
>
> I think that we've done all our diagnosis using profiling tools, but there's now a map cache and it probably wouldn't be too difficult to have it dump data via perfcounters if you poked around... does anything like this exist yet, Sage?

Much of the bad behavior was triggered by #2860, fixes for which just went into the stable and master branches yesterday. It's difficult to fully observe the bad behavior, though (lots of time spent in generate_past_intervals, reading old maps off disk).

With the fix, we pretty much only process maps during handle_osd_map. Adding perfcounters in the methods that grab a map out of the cache or (more importantly) read it off disk will give you better visibility into that. It should be pretty easy to instrument (and I'll gladly take patches that implement it... :).

Without knowing more about what you're seeing, it's hard to say if it's related, though. This was triggered by long periods of unclean pgs and lots of data migration, not high load.

sage
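Until such map-cache perfcounters exist, the admin socket already exposes the counters that are there; something like this works on argonaut-era builds (the socket path may vary with your setup):

    # dump the current perf counters for osd.0 as JSON
    # (later releases rename this command to "perf dump")
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok perfcounters_dump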
Re: Puppet modules for Ceph
On Tue, Jul 24, 2012 at 6:15 AM, <loic.dach...@enovance.com> wrote:
> Note that if the puppet client was run on nodeB before it was run on nodeA, all three steps would have been run in sequence instead of being spread over two puppet client invocations.

Unfortunately, even that is not enough. The relevant keys (client.admin and client.bootstrap-osd, later bootstrap-mds, radosgw, etc. as well) can only be created once the mons have reached quorum. This is some time after they have started, even in the best case. Making the puppet/chef run wait for that sounds like a bad idea, especially since I use further chef-client runs to feed ceph-mon information about its peers, which may be necessary for it to ever reach quorum.

While I can find ways of making the key generation happen as soon as quorum is reached, communicating the keys to other nodes only happens at the mercy of the configuration management system; both puppet and chef seem to be in the mindset of a "run every N minutes" option. So even if we generate the keys, best case, 2 seconds after ceph-mon startup, it takes a full configuration manager run on the source node, and then a run on the destination node, before OSD bring-up etc. can succeed.

I have found no satisfying solution to this, so far.
Re: Puppet modules for Ceph
On Tue, 31 Jul 2012, Tommi Virtanen wrote:
> While I can find ways of making the key generation happen as soon as quorum is reached, communicating the keys to other nodes only happens at the mercy of the configuration management system; both puppet and chef seem to be in the mindset of a "run every N minutes" option. [...] I have found no satisfying solution to this, so far.

It is also possible to feed initial keys to the monitors during the 'mkfs' stage. If the keys can be agreed on somehow beforehand, then they will already be in place when the initial quorum is reached. Not sure if that helps in this situation or not...

sage
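A sketch of the "feed initial keys at mkfs time" approach Sage describes, approximating what mkcephfs does internally (paths, caps, and the mon id are examples):

    # pre-generate the mon. and client.admin keys on the CM master
    ceph-authtool --create-keyring /tmp/keyring --gen-key -n mon.
    ceph-authtool /tmp/keyring --gen-key -n client.admin \
        --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow'

    # distribute /tmp/keyring (and a monmap prepared with monmaptool),
    # then mkfs each monitor with it; the keys are in place as soon as
    # the initial quorum forms
    ceph-mon --mkfs -i a --monmap /tmp/monmap --keyring /tmp/keyring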
Re: Puppet modules for Ceph
On Tue, Jul 31, 2012 at 11:51 AM, Sage Weil <s...@inktank.com> wrote:
> It is also possible to feed initial keys to the monitors during the 'mkfs' stage. If the keys can be agreed on somehow beforehand, then they will already be in place when the initial quorum is reached. Not sure if that helps in this situation or not...

Yeah, we're going that way for the mon. key in the chef cookbooks (to get the mons talking to each other at all, that *has* to be done that way), but putting more and more stuff there is not very nice. Your typical CM framework does not let the recipe run arbitrary code at that sort of instantiation time, and pushing this work onto the admin makes it laborious and brittle; what happens when we need a new type of bootstrap-foo key? Get all admins to cram an extra entry into their environment JSON? http://ceph.com/docs/master/config-cluster/chef/#configure-your-ceph-environment That just does not seem like a good way.

Juju seems to provide a real-time notification mechanism between peers, using its name-relation-changed hook. Other CM frameworks may need to step up their game, or be subject to the "keep re-running chef-client until it works" limitation.

If the CM makes it safe to trigger a run manually (e.g. "sudo chef-client" whenever you feel like it), we can trigger that locally when we finally create the keys. This still doesn't help the receiving side notice any faster.

If the CM makes it safe for us to change node attributes outside of the full CM run, we can trigger that when we finally create the keys. Chef seems to have a "full overwrites only" semantic, so this is probably not safe with it. And as above, this does not help the receiving side notice that it has information to fetch.

What I want to do longer term is make the Chef cookbook for Ceph very thin, push everything except the cross-node communication into Ceph proper, and then write a "mkcephfs v2.0" that uses SSH connections as appropriate, from a central workstation host that can SSH anywhere, to trigger these actions ASAP. Then that becomes the goal for CM frameworks: provide me a communication mechanism between these nodes that can do *this*.
Re: another performance-related thread
On 07/31/2012 07:53 PM, Josh Durgin wrote:
> On 07/31/2012 08:03 AM, Andrey Korolyov wrote:
>> Hi,
>>
>> I've finally managed to run an rbd-related test on relatively
>> powerful machines, and here is what I got:
>>
>> 1) Reads on an almost fairly balanced cluster (eight nodes) did very
>> well, utilizing almost all disk and network bandwidth: dual gbit
>> 802.3ad nics and sata disks behind an LSI SAS 2108 with wt cache gave
>> me ~1.6Gbyte/s on linear and sequential reads, which is close to the
>> overall disk throughput.
>> 2) Writes did much worse, on both rados bench and the fio test when I
>> ran the same fio job across 120 vms - at best, overall performance
>> was about 400Mbyte/s, using rados bench -t 12 on three host nodes.
>
> How are your osd journals configured? What's your ceph.conf for the
> osds?
>
>> fio config:
>>   rw=(randread|randwrite|seqread|seqwrite)
>>   size=256m
>>   direct=1
>>   directory=/test
>>   numjobs=1
>>   iodepth=12
>>   group_reporting
>>   name=random-read-direct
>>   bs=1M
>>   loops=12
>>
>> For the 120 vm set (Mbyte/s):
>>   linear reads:  MEAN: 14156  STDEV: 612.596
>>   random reads:  MEAN: 14128  STDEV: 911.789
>>   linear writes: MEAN: 2956   STDEV: 283.165
>>   random writes: MEAN: 2986   STDEV: 361.311
>>
>> Each node holds 15 vms, and with a 64M rbd cache all three possible
>> states - wb, wt and no-cache - gave almost the same numbers in the
>> tests. I wonder if it is possible to raise the write/read ratio
>> somehow. It seems the osds underutilize themselves; e.g. I am not
>> able to get a single-threaded rbd write above 35Mb/s. Adding a second
>> osd on the same disk only raises iowait time, not the benchmark
>> results.
>
> Are these write tests using direct I/O? That will bypass the cache for
> writes, which would explain the similar numbers with different cache
> modes.

I had previously forgotten that the direct flag may affect rbd cache behaviour. Without it, with the wb cache, the read rate remained the same and writes increased by roughly 15%:

random writes: MEAN: 3370 STDEV: 939.99
linear writes: MEAN: 3561 STDEV: 824.954
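Andrey's reply does not include his ceph.conf, but the journal settings Josh is asking about live in the [osd] section. A representative snippet - illustrative only, loosely modelled on the /srv/diskN/journal layout and 6 GB journals that appear elsewhere in this digest, not Andrey's actual configuration:

    [osd]
        osd data = /srv/disk$id/data
        osd journal = /srv/disk$id/journal
        ; journal size in MB; putting the journal on a separate
        ; spindle or SSD partition usually helps write throughput
        osd journal size = 6144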
Re: Quick CentOS/RHEL question ...
Ceph can work well on CentOS 6.2, including file access and RBD, while radosgw has not yet been covered by our testing. To install ceph on CentOS 6, the main problem is the difference in package names between CentOS and Ubuntu; 'yum search' may help. Sometimes 'ldconfig' is also needed after 'make install'.

On 1 August 2012 06:17, Joe Landman land...@scalableinformatics.com wrote:
> Hi folks,
>
> I was struggling and failing to get Ceph properly built/installed for
> CentOS 6 (and 5) last week. Is this simply not a recommended platform?
> Please advise. Thanks!
>
> --
> Joseph Landman, Ph.D
> Founder and CEO, Scalable Informatics Inc.
> email: land...@scalableinformatics.com
> web  : http://scalableinformatics.com
>        http://scalableinformatics.com/sicluster
> phone: +1 734 786 8423 x121
> fax  : +1 866 888 3112
> cell : +1 734 612 4615

--
袁冬
Tel: 13573888215
Email: yuandong1...@gmail.com
QQ: 10200230
MSN: yuandong1...@hotmail.com
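For reference, building from source on CentOS 6 usually comes down to something like the sketch below; the package names are approximate (this is where 'yum search' earns its keep) and the list may be incomplete for a given release:

    # Build dependencies - names differ from their Debian/Ubuntu counterparts
    sudo yum install gcc-c++ make autoconf automake libtool boost-devel \
        libedit-devel openssl-devel libuuid-devel keyutils-libs-devel \
        nss-devel expat-devel

    # Standard autotools build from a source checkout or tarball
    ./autogen.sh
    ./configure
    make
    sudo make install
    sudo ldconfig    # refresh the linker cache after installing into /usr/local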
RE: cannot startup one of the osd
Hi, Samuel:

It happens on every startup; I cannot fix it for now.

-----Original Message-----
From: Samuel Just [mailto:sam.j...@inktank.com]
Sent: Wednesday, August 01, 2012 1:36 AM
To: Eric YH Chen/WYHQ/Wiwynn
Cc: ceph-devel@vger.kernel.org; Chris YT Huang/WYHQ/Wiwynn; Victor CY Chang/WYHQ/Wiwynn
Subject: Re: cannot startup one of the osd

This crash happens on each startup?
-Sam

On Tue, Jul 31, 2012 at 2:32 AM, eric_yh_c...@wiwynn.com wrote:

Hi, all:

My environment: two servers, with 12 hard disks on each server.
Version: Ceph 0.48, Kernel: 3.2.0-27

We created a ceph cluster with 24 osds and 3 monitors:
osd.0 ~ osd.11 are on server1
osd.12 ~ osd.23 are on server2
mon.0 is on server1
mon.1 is on server2
mon.2 is on server3, which has no osd

root@ubuntu:~$ ceph -s
   health HEALTH_WARN 227 pgs degraded; 93 pgs down; 93 pgs peering; 85 pgs recovering; 82 pgs stuck inactive; 255 pgs stuck unclean; recovery 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%); 1/24 in osds are down
   monmap e1: 3 mons at {006=192.168.200.84:6789/0,008=192.168.200.86:6789/0,009=192.168.200.87:6789/0}, election epoch 564, quorum 0,1,2 006,008,009
   osdmap e1911: 24 osds: 23 up, 24 in
   pgmap v292031: 4608 pgs: 4251 active+clean, 85 active+recovering+degraded, 37 active+remapped, 58 down+peering, 142 active+degraded, 35 down+replay+peering; 257 GB data, 948 GB used, 19370 GB / 21390 GB avail; 4808/138644 degraded (3.468%); 202/69322 unfound (0.291%)
   mdsmap e1: 0/0/1 up

I find that one of the osds cannot start up any more. Before that, I was testing HA of the Ceph cluster:

Step 1: shut down server1, wait 5 min
Step 2: boot up server1, wait 5 min until ceph returns to healthy status
Step 3: shut down server2, wait 5 min
Step 4: boot up server2, wait 5 min until ceph returns to healthy status

Repeat steps 1-4 several times; that is when I hit this problem.
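The shutdown/boot-up cycle above is easy to script; a rough sketch, with the hostnames and the power-on step as placeholders (powering a server back on normally needs IPMI or similar out-of-band control):

    #!/bin/sh
    # Repeat the HA cycle: take each server down, bring it back up,
    # and wait for the cluster to report HEALTH_OK before continuing.
    for i in 1 2 3 4 5; do
        for host in server1 server2; do
            ssh "$host" poweroff
            sleep 300                  # the 5-minute outage window
            # power $host back on here (IPMI/iLO/...), then wait:
            until ceph health | grep -q HEALTH_OK; do
                sleep 10
            done
        done
    done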
Log of ceph-osd.22.log:

2012-07-31 17:18:15.120678 7f9375300780  0 filestore(/srv/disk10/data) mount found snaps
2012-07-31 17:18:15.122081 7f9375300780  0 filestore(/srv/disk10/data) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-07-31 17:18:15.128544 7f9375300780  1 journal _open /srv/disk10/journal fd 23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.257302 7f9375300780  1 journal _open /srv/disk10/journal fd 23: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.273163 7f9375300780  1 journal close /srv/disk10/journal
2012-07-31 17:18:15.274395 7f9375300780 -1 filestore(/srv/disk10/data) limited size xattrs -- filestore_xattr_use_omap enabled
2012-07-31 17:18:15.275169 7f9375300780  0 filestore(/srv/disk10/data) mount FIEMAP ioctl is supported and appears to work
2012-07-31 17:18:15.275180 7f9375300780  0 filestore(/srv/disk10/data) mount FIEMAP ioctl is disabled via 'filestore fiemap' config option
2012-07-31 17:18:15.275312 7f9375300780  0 filestore(/srv/disk10/data) mount did NOT detect btrfs
2012-07-31 17:18:15.276060 7f9375300780  0 filestore(/srv/disk10/data) mount syncfs(2) syscall fully supported (by glibc and kernel)
2012-07-31 17:18:15.276154 7f9375300780  0 filestore(/srv/disk10/data) mount found snaps
2012-07-31 17:18:15.277031 7f9375300780  0 filestore(/srv/disk10/data) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2012-07-31 17:18:15.280906 7f9375300780  1 journal _open /srv/disk10/journal fd 32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:15.307761 7f9375300780  1 journal _open /srv/disk10/journal fd 32: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2012-07-31 17:18:19.466921 7f9360a97700  0 -- 192.168.200.82:6830/18744 >> 192.168.200.83:0/3485583732 pipe(0x45bd000 sd=34 pgs=0 cs=0 l=0).accept peer addr is really 192.168.200.83:0/3485583732 (socket is 192.168.200.83:45653/0)
2012-07-31 17:18:19.671681 7f9363a9d700 -1 os/DBObjectMap.cc: In function 'virtual bool DBObjectMap::DBObjectMapIteratorImpl::valid()' thread 7f9363a9d700 time 2012-07-31 17:18:19.670082
os/DBObjectMap.cc: 396: FAILED assert(!valid || cur_iter->valid())

 ceph version 0.48argonaut (commit:c2b20ca74249892c8e5e40c12aa14446a2bf2030)
 1: /usr/bin/ceph-osd() [0x6a3123]
 2: (ReplicatedPG::send_push(int, ObjectRecoveryInfo, ObjectRecoveryProgress, ObjectRecoveryProgress*)+0x684) [0x53f314]
 3: (ReplicatedPG::push_start(ReplicatedPG::ObjectContext*, hobject_t const&, int, eversion_t, interval_set<unsigned long>&, std::map<hobject_t, interval_set<unsigned long>, std::less<hobject_t>, std::allocator<std::pair<hobject_t const, interval_set<unsigned long> > > >&)+0x333) [0x54c873]
 4: (ReplicatedPG::push_to_replica(ReplicatedPG::ObjectContext*, hobject_t const&, int)+0x343) [0x54cdc3]
 5: (ReplicatedPG::recover_object_replicas(hobject_t const&, eversion_t)+0x35f) [0x5527bf]
 6: (ReplicatedPG::wait_for_degraded_object(hobject_t const&,