Re: Upgrading from 0.61.5 to 0.61.6 ended in disaster
On 2013-07-25 17:46, Sage Weil wrote:

On Thu, 25 Jul 2013, pe...@2force.nl wrote:

We did not upgrade from bobtail to cuttlefish and are still seeing this issue. I posted this on the ceph-users mailing list and I missed this thread (sorry!) so I didn't know.

That's interesting; a bobtail-upgraded cluster was the only way I was able to reproduce it, but I'm also working with relatively short-lived clusters in a test environment, so there may very well be a possibility I missed. Can you summarize the lineage of your cluster? (What version was it installed with, and when was it upgraded, and to what versions?)

Either way, I also have an osd crashing after upgrading to 0.61.6. As said on the other list, I'm more than happy to share log files etc. with you guys.

Will take a look. Thanks!
sage

Hi Sage,

Did you happen to find out what is causing the osd crash? I'm not sure what the best way is to recover from this.

Thanks,
Peter

This is fixed in the cuttlefish branch as of earlier this afternoon. I've spent most of the day expanding the automated test suite to include upgrade combinations to trigger this and *finally* figured out that this particular problem seems to surface on clusters that upgraded from bobtail -> cuttlefish, but not on clusters created on cuttlefish. If you've run into this issue, please use the cuttlefish branch build for now. We will have a release out in the next day or so that includes this and a few other pending fixes.

I'm sorry we missed this one! The upgrade test matrix I've been working on today should catch this type of issue in the future.

Thanks!
sage
Read ahead affects Ceph read performance significantly
We ran an iozone read test on a 32-node HPC cluster. Regarding the hardware of each node: the CPU is very powerful, as is the network, with a bandwidth of 1.5 GB/s, and there is 64 GB of memory. The IO is relatively slow; the throughput measured by 'dd' locally is around 70 MB/s. We configured a Ceph cluster with 24 OSDs on 24 nodes, one mds, and one to four clients, one client per node. The performance is as follows:

Iozone sequential read throughput (MB/s)

  Number of clients          1          2          4
  Default readahead size   180.0954   324.4836   591.5851
  Readahead size: 256 MB   645.3347   1022.998   1267.631

The complete iozone command line for one client is:

  iozone -t 1 -+m /tmp/iozone.nodelist.50305030 -s 64G -r 4M -i 0 -+n -w -c -e -b /tmp/iozone.nodelist.50305030.output

On each client node, only one thread is started. For two clients it is:

  iozone -t 2 -+m /tmp/iozone.nodelist.50305030 -s 32G -r 4M -i 0 -+n -w -c -e -b /tmp/iozone.nodelist.50305030.output

As the data shows, a larger read ahead window can result in a 300% speedup!

Besides, since the backend of Ceph is not a traditional hard disk, it is beneficial to capture stride-read prefetching. To prove this, we tested stride reads with the program below. As we know, the generic read ahead algorithm of the Linux kernel will not capture stride-read access patterns, so we use fadvise() to manually force prefetching. The record size is 4 MB. The result is even more surprising:

Stride read throughput (MB/s)

  Number of records prefetched     0       1       4       16      64      128
  Throughput                     42.82  100.74  217.41  497.73  854.48  950.18

As the data shows, with a read ahead size of 128 * 4 MB, the speedup over no read ahead can be up to 950/42, roughly 2000%! The core logic of the test program is below:

  stride = 17;
  block = 4 * 1024 * 1024;   /* record size: 4 MB */
  for (;;) {
          for (i = 0; i < count; ++i) {
                  long long start = pos + (i + 1) * stride * block;
                  printf("PRE READ %lld %lld\n", start, start + block);
                  posix_fadvise(fd, start, block, POSIX_FADV_WILLNEED);
          }
          len = read(fd, buf, block);
          total += len;
          printf("READ %lld %lld\n", pos, (pos + len));
          pos += len;
          lseek(fd, (stride - 1) * block, SEEK_CUR);
          pos += (stride - 1) * block;
  }

Given the above results and some more, we plan to submit a blueprint to discuss the prefetching optimization of Ceph.

Cheers,
Li Wang
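For reference, a self-contained version of the core logic above might look like the following; the file name, record size, stride and prefetch depth here are illustrative assumptions, not the exact harness used for the numbers above.

/* Standalone sketch of the stride-read test: read one record, hint the
 * next few strided records with posix_fadvise(), skip ahead, repeat. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLOCK   (4LL * 1024 * 1024)   /* record size: 4 MB */
#define STRIDE  17
#define COUNT   16                    /* records prefetched per iteration */

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";  /* example name */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    char *buf = malloc(BLOCK);
    long long pos = 0, total = 0;
    ssize_t len;
    int i;

    for (;;) {
        /* hint the next COUNT strided records to the kernel */
        for (i = 0; i < COUNT; ++i) {
            long long start = pos + (i + 1) * STRIDE * BLOCK;
            posix_fadvise(fd, start, BLOCK, POSIX_FADV_WILLNEED);
        }
        len = read(fd, buf, BLOCK);
        if (len <= 0)
            break;
        total += len;
        pos += len;
        /* skip ahead to the next strided record */
        lseek(fd, (STRIDE - 1) * BLOCK, SEEK_CUR);
        pos += (STRIDE - 1) * BLOCK;
    }
    printf("read %lld bytes\n", total);
    free(buf);
    close(fd);
    return 0;
}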
Re: ObjectContext PGRegistry API
Hi Sam,

Sorry to bother you with this again. Would you have time to quickly review this proposal? I'm sure you'll have comments that will require work on my part ;-)

Cheers

On 22/07/2013 22:33, Loic Dachary wrote:

Hi Sam,

Here is the proposed ObjectContext PGRegistry API:

https://github.com/dachary/ceph/blob/wip-5487/src/osd/PGRegistry.h

which is part of the following commit:

https://github.com/dachary/ceph/commit/60958095585a1f8392d8a967767f7620089d547d

It's a first draft and I assume your comments will require significant work on my part. I'd rather do it as soon as possible, while my short term memory is still fresh ;-)

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
Re: Read ahead affects Ceph read performance significantly
Wow, very glad to hear that. I tried with the regular FS tunable and there was almost no effect on the regular test, so I thought that reads cannot be improved at all in this direction.

On Mon, Jul 29, 2013 at 2:24 PM, Li Wang liw...@ubuntukylin.com wrote:
[snip]
Re: Read ahead affects Ceph read performance significantly
On 07/29/2013 05:24 AM, Li Wang wrote:
[snip]
> As the data shows, a larger read ahead window can result in a 300% speedup!

Very interesting! I've done some similar tests and saw somewhat different results (I actually in some cases saw improvement with lower readahead!). I suspect that this may be very hardware dependent.

Were you using RBD or CephFS? In either case, was it the kernel client or userland (i.e. QEMU/KVM or FUSE)? Also, where did you adjust readahead? Was this on the client volume or under the OSDs?

I've got to prepare for the talk later this week, but I will try to get my readahead test results out soon as well.

[snip]
> Given the above results and some more, we plan to submit a blueprint to discuss the prefetching optimization of Ceph.

Cool!
ceph branch status
-- All Branches -- Dan Mick dan.m...@inktank.com 2012-12-18 12:27:36 -0800 wip-rbd-striping 2013-07-16 23:00:06 -0700 wip-5634 2013-07-18 16:34:23 -0700 wip-daemon David Zafman david.zaf...@inktank.com 2013-01-28 20:26:34 -0800 wip-wireshark-zafman 2013-03-22 18:14:10 -0700 wip-snap-test-fix 2013-07-19 19:47:30 -0700 wip-5624 Gary Lowell gary.low...@inktank.com 2013-07-08 15:45:00 -0700 last Gary Lowell glow...@inktank.com 2013-01-28 22:49:45 -0800 wip-3930 2013-02-05 19:29:11 -0800 wip.cppchecker 2013-02-10 22:21:52 -0800 wip-3955 2013-02-26 19:28:48 -0800 wip-system-leveldb 2013-03-01 18:55:35 -0800 wip-da-spec-1 2013-03-19 11:28:15 -0700 wip-3921 2013-04-11 23:00:05 -0700 wip-init-radosgw 2013-04-17 23:30:11 -0700 wip-4725 2013-04-21 22:06:37 -0700 wip-4752 2013-04-22 14:11:37 -0700 wip-4632 2013-05-31 11:20:40 -0700 wip-doc-prereq 2013-06-06 22:31:54 -0700 wip-build-doc 2013-07-03 17:00:31 -0700 wip-5496 Greg Farnum g...@inktank.com 2013-02-13 14:46:38 -0800 wip-mds-snap-fix 2013-02-22 19:57:53 -0800 wip-4248-snapid-journaling 2013-05-01 17:06:27 -0700 wip-optracker-4354 2013-06-26 16:28:22 -0700 wip-rgw-geo-replica-log 2013-07-19 15:13:07 -0700 wip-rgw-versionchecks James Page james.p...@ubuntu.com 2013-02-27 22:50:38 + wip-debhelper-8 Joao Eduardo Luis joao.l...@inktank.com 2013-04-18 00:01:24 +0100 wip-4521-tool 2013-04-22 15:14:28 +0100 wip-4748 2013-04-24 16:42:11 +0100 wip-4521 2013-04-30 18:45:22 +0100 wip-mon-compact-dbg 2013-05-21 01:46:13 +0100 wip-monstoretool-foo 2013-05-31 16:26:02 +0100 wip-mon-cache-first-last-committed 2013-05-31 21:00:28 +0100 wip-mon-trim-b 2013-07-20 04:30:59 +0100 wip-mon-caps-test Joe Buck jbb...@gmail.com 2013-05-02 16:32:33 -0700 wip-buck-add-terasort 2013-07-01 12:33:57 -0700 wip-rgw-geo-buck John Wilkins john.wilk...@inktank.com 2012-12-21 15:14:37 -0800 wip-mon-docs Josh Durgin josh.dur...@inktank.com 2013-03-01 14:45:23 -0800 wip-rbd-workunit-debug 2013-04-29 14:32:00 -0700 wip-rbd-close-image Noah Watkins noahwatk...@gmail.com 2013-01-05 11:58:38 -0800 wip-localized-read-tests 2013-04-22 15:23:09 -0700 wip-cls-lua 2013-07-21 12:01:01 -0700 wip-osx-upstream 2013-07-21 22:05:32 -0700 fallocate-error-handling Roald van Loon roaldvanl...@gmail.com 2012-12-24 22:26:56 + wip-dout Sage Weil s...@inktank.com 2012-07-14 17:40:21 -0700 wip-osd-redirect 2012-11-30 13:47:27 -0800 wip-osd-readhole 2012-12-07 14:38:46 -0800 wip-osd-alloc 2013-01-29 13:46:02 -0800 wip-readdir 2013-02-11 07:05:15 -0800 wip-sim-journal-clone 2013-04-18 13:51:36 -0700 argonaut 2013-06-02 21:21:09 -0700 wip-fuse-bobtail 2013-06-04 22:43:04 -0700 wip-osd-push 2013-06-18 17:00:00 -0700 wip-mon-refs 2013-06-21 17:59:58 -0700 wip-rgw-vstart 2013-06-24 21:23:55 -0700 bobtail 2013-06-25 13:16:45 -0700 wip-5401 2013-06-28 12:54:08 -0700 wip-mds-snap 2013-06-30 20:41:55 -0700 wip-5453 2013-07-01 17:48:09 -0700 wip-5021 2013-07-06 09:22:29 -0700 wip-mds-lazyio-cuttlefish 2013-07-06 13:00:51 -0700 wip-mds-lazyio-cuttlefish-minimal 2013-07-10 11:03:55 -0700 wip-mon-sync 2013-07-12 08:50:24 -0700 wip-libcephfs 2013-07-18 16:59:03 -0700 wip-refs 2013-07-18 18:12:16 -0700 cuttlefish 2013-07-19 21:13:09 -0700 wip-5692 2013-07-19 22:32:23 -0700 wip-mon-caps 2013-07-20 08:49:48 -0700 wip-5624-b 2013-07-20 09:02:40 -0700 wip-5695 2013-07-21 08:59:51 -0700 wip-paxos 2013-07-21 17:16:10 -0700 wip-5672 2013-07-21 19:58:12 -0700 wip-before 2013-07-21 22:03:19 -0700 wip-cuttlefish-osdmap Sam Lang sam.l...@inktank.com 2012-11-27 15:01:58 -0600 wip-mtime-incr Samuel Just sam.j...@inktank.com 
2013-06-06 11:51:04 -0700 wip_bench_num 2013-06-06 13:08:51 -0700 wip_5238_cuttlefish 2013-06-17 14:50:53 -0700 wip-log-rewrite-sam 2013-06-19 14:54:13 -0700 wip_cuttlefish_compact_on_startup 2013-06-19 19:46:06 -0700 wip_observer 2013-07-19 14:51:43 -0700 wip-cuttlefish-next Yehuda Sadeh yeh...@inktank.com 2012-11-16 11:09:34 -0800 wip-mongoose 2012-12-07 13:40:12 -0800 wip-rgw-dr 2012-12-10 13:29:37 -0800 wip-multipart-size 2012-12-13 18:09:37 -0800 wip-2169 2013-02-12 09:40:12 -0800 wip-json-decode 2013-02-22 15:04:37 -0800 wip-4247 2013-02-22 16:19:37
[PATCH] mds: remove waiting lock before merging with neighbours
CephFS currently deadlocks under CTDB's ping_pong POSIX locking test when run concurrently on multiple nodes. The deadlock is caused by failed removal of a waiting_locks entry when the waiting lock is merged with an existing lock, e.g.:

Initial MDS state (two clients, same file):

held_locks --
  start: 0, length: 1, client: 4116, pid: 7899, type: 2
  start: 2, length: 1, client: 4110, pid: 40767, type: 2
waiting_locks --
  start: 1, length: 1, client: 4116, pid: 7899, type: 2

Waiting lock entry 4116@1:1 fires:

  handle_client_file_setlock: start: 1, length: 1, client: 4116, pid: 7899, type: 2

MDS state after the lock is obtained:

held_locks --
  start: 0, length: 2, client: 4116, pid: 7899, type: 2
  start: 2, length: 1, client: 4110, pid: 40767, type: 2
waiting_locks --
  start: 1, length: 1, client: 4116, pid: 7899, type: 2

Note that the waiting 4116@1:1 lock entry is merged with the existing 4116@0:1 held lock to become a 4116@0:2 held lock. However, the now-handled 4116@1:1 waiting_locks entry remains. When handling a lock request, the MDS calls adjust_locks() to merge the new lock with available neighbours. If the new lock is merged, then the waiting_locks entry is not located in the subsequent remove_waiting() call. This fix ensures that the waiting_locks entry is removed prior to modification during merge.

Signed-off-by: David Disseldorp dd...@suse.de
---
 src/mds/flock.cc | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/mds/flock.cc b/src/mds/flock.cc
index e83c5ee..5e329af 100644
--- a/src/mds/flock.cc
+++ b/src/mds/flock.cc
@@ -75,12 +75,14 @@ bool ceph_lock_state_t::add_lock(ceph_filelock& new_lock,
       } else {
         //yay, we can insert a shared lock
         dout(15) << "inserting shared lock" << dendl;
+        remove_waiting(new_lock);
         adjust_locks(self_overlapping_locks, new_lock, neighbor_locks);
         held_locks.insert(pair<uint64_t, ceph_filelock>(new_lock.start, new_lock));
         ret = true;
       }
     }
   } else { //no overlapping locks except our own
+    remove_waiting(new_lock);
     adjust_locks(self_overlapping_locks, new_lock, neighbor_locks);
     dout(15) << "no conflicts, inserting " << new_lock << dendl;
     held_locks.insert(pair<uint64_t, ceph_filelock>
@@ -89,7 +91,6 @@ bool ceph_lock_state_t::add_lock(ceph_filelock& new_lock,
   }
   if (ret) {
     ++client_held_lock_counts[(client_t)new_lock.client];
-    remove_waiting(new_lock);
   } else if (wait_on_fail && !replay)
     ++client_waiting_lock_counts[(client_t)new_lock.client];
@@ -306,7 +307,7 @@ void ceph_lock_state_t::adjust_locks(list<multimap<uint64_t, ceph_filelock>::ite
     old_lock = &(*iter)->second;
     old_lock_client = old_lock->client;
     dout(15) << "lock to coalesce: " << *old_lock << dendl;
-    /* because if it's a neibhoring lock there can't be any self-overlapping
+    /* because if it's a neighboring lock there can't be any self-overlapping
        locks that covered it */
     if (old_lock->type == new_lock.type) { //merge them if
       if (0 == new_lock.length) {
--
1.8.1.4
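For context, the kind of workload that reaches this path is plain POSIX byte-range locking from several clients on one shared file. A minimal sketch of that pattern is below; this is a hypothetical standalone reproducer, not the actual CTDB ping_pong source, and the file name and byte offsets are illustrative.

/* Take blocking one-byte write locks on adjacent offsets of a shared
 * file on a CephFS mount, so that newly granted locks get merged with
 * neighbouring held locks by the MDS. Run concurrently on several nodes. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

static int lock_byte(int fd, off_t off, short type)
{
    struct flock fl = { 0 };
    fl.l_type = type;           /* F_WRLCK or F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = off;
    fl.l_len = 1;               /* exactly one byte */
    return fcntl(fd, F_SETLKW, &fl);   /* blocking lock request */
}

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s <file on cephfs>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 100000; i++) {
        lock_byte(fd, i % 3, F_WRLCK);
        lock_byte(fd, (i + 1) % 3, F_WRLCK);
        lock_byte(fd, i % 3, F_UNLCK);
        lock_byte(fd, (i + 1) % 3, F_UNLCK);
    }
    close(fd);
    return 0;
}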
mds.0 crashed with 0.61.7
Hello,

my Ceph test cluster runs fine with 0.61.4. I have removed all data and have set up a new cluster with 0.61.7 using the same configuration (see ceph.conf). After

  mkcephfs -c /etc/ceph/ceph.conf -a
  /etc/init.d/ceph -a start

the mds.0 crashed:

    -1 2013-07-29 17:02:57.626886 7fba2a8cd700  1 -- 10.0.0.231:6800/806 <== osd.121 10.0.0.231:6834/5350 1 ==== osd_op_reply(4 mds_snaptable [read 0~0] ack = -2 (No such file or directory)) v4 ==== 112+0+0 (2505332647 0 0) 0x13b7a30 con 0x7fba20010200
     0 2013-07-29 17:02:57.627838 7fba2a8cd700 -1 mds/MDSTable.cc: In function 'void MDSTable::load_2(int, ceph::bufferlist&, Context*)' thread 7fba2a8cd700 time 2013-07-29 17:02:57.626907
  mds/MDSTable.cc: 150: FAILED assert(0)

  ceph version 0.61.7 (8f010aff684e820ecc837c25ac77c7a05d7191ff)
  1: (MDSTable::load_2(int, ceph::buffer::list&, Context*)+0x4cf) [0x6e398f]
  2: (Objecter::handle_osd_op_reply(MOSDOpReply*)+0xe1e) [0x73c16e]
  3: (MDS::handle_core_message(Message*)+0x93f) [0x4db2ff]
  4: (MDS::_dispatch(Message*)+0x2f) [0x4db3df]
  5: (MDS::ms_dispatch(Message*)+0x1a3) [0x4dd163]
  6: (DispatchQueue::entry()+0x399) [0x7ddd69]
  7: (DispatchQueue::DispatchThread::entry()+0xd) [0x7d343d]
  8: (()+0x77b6) [0x7fba2f51e7b6]
  9: (clone()+0x6d) [0x7fba2e15dd6d]
  ...

At this point I have no rbd, no cephfs, no ceph-fuse configured.

  /etc/init.d/ceph -a stop
  /etc/init.d/ceph -a start

doesn't help. Any help would be appreciated.

Andreas Friedrich
--
FUJITSU
Fujitsu Technology Solutions GmbH
Heinz-Nixdorf-Ring 1, 33106 Paderborn, Germany
Tel: +49 (5251) 525-1512
Fax: +49 (5251) 525-321512
Email: andreas.friedr...@ts.fujitsu.com
Web: ts.fujitsu.com
Company details: de.ts.fujitsu.com/imprint
--

[global]
        #debug ms = 20
        debug ITX = 0
        debug monc = 0
        debug rados = 0
        #
        # enable secure authentication
        # auth supported = cephx
        # keyring = /etc/ceph/keyring.client
        #
        # -- or -- disable secure authentication
        # auth supported = none
        # auth cluster required = cephx
        # auth service required = cephx
        # auth client required = cephx
        auth cluster required = none
        auth service required = none
        auth client required = none

        # allow ourselves to open a lot of files
        max open files = 131072

        # set log file
        # log file = /ceph-log/log/$name.log
        # log_to_syslog = true    # uncomment this line to log to syslog

        # set up pid files
        pid file = /var/run/ceph/$name.pid

        # If you want to run a IPv6 cluster, set this to true.
        # Dual-stack isn't possible
        #ms bind ipv6 = true

        public network = 10.0.0.0/24
        cluster network = 10.0.0.0/24

        # environment for startup with rosckets
        # environment = LD_PRELOAD=/usr/lib64/libsdp.so.1
        # environment = LD_PRELOAD=/usr/local/lib/rsocket/librspreload.so.1.0.0 LD_LIBRARY_PATH=/usr/local/lib/rsocket:\\\$LD_LIBRARY_PATH

### [client.radosgw.ceph]
###
###     host = ceph
###     # auto start = yes
###     log file = /var/log/ceph/$name.log
###     keyring = /etc/ceph/keyring.radosgw.ceph
###     rgw socket path = /var/run/radosgw.sock
###     # debug rgw = 20
###     # debug ms = 1

[mon]
        #mon data = /var/lib/ceph/mon/$cluster-$id
        mon data = /data/mon$id
        # debug ms = 0       ; see message traffic
        # debug mon = 5      ; monitor
        # debug paxos = 5    ; monitor replication
        # debug auth = 5     ; authentication code
        # keyring = /etc/ceph/keyring.$name
        debug optracker = 0
        mon debug dump transactions = false

[mon.0]
        host = cibst1
        mon addr = 10.0.0.231:6789
[mon.1]
        host = cibst2
        mon addr = 10.0.0.232:6789
[mon.3]
        host = cibst3
        mon addr = 10.0.0.233:6789

[mds]
        # debug mds = 1
        # keyring = /etc/ceph/keyring.$name
        debug optracker = 0
[mds.0]
        host = cibst1
[mds.1]
        host = cibst2

[osd]
        # journal dio = false
        # journal aio = true
        #osd data = /var/lib/ceph/osd/$cluster-$id
        osd data = /data/$name
        # osd journal = /journals/$name/journal
        # osd journal =
        osd journal size = 5120
        #osd journal size = 1024
        filestore max sync interval = 30
        filestore min sync interval = 29
        filestore flusher = false
        filestore queue max ops = 1
        debug optracker = 0
        # keyring = /etc/ceph/keyring.$name
        # debug osd = 20
        # debug osd = 0        ; waiters
        # debug ms = 10        ; message traffic
        # debug filestore = 20 ; local object storage
        debug journal = 0      ; local journaling
        # debug monc = 5       ; monitor interaction, startup
Re: mds.0 crashed with 0.61.7
Hi Sage,

as this crash has been around for a while already: do you know whether it happened in ceph version 0.61.4 as well?

Best Regards

Andreas Bluemle

On Mon, 29 Jul 2013 08:47:00 -0700 (PDT) Sage Weil s...@inktank.com wrote:

Hi Andreas,

Can you reproduce this (from mkcephfs onward) with debug mds = 20 and debug ms = 1? I've seen this crash several times but never been able to get to the bottom of it.

Thanks!
sage

On Mon, 29 Jul 2013, Andreas Friedrich wrote:
[snip]

--
Andreas Bluemle                    mailto:andreas.blue...@itxperts.de
Heinrich Boell Strasse 88          Phone: (+49) 89 4317582
D-81829 Muenchen (Germany)         Mobil: (+49) 177 522 0151
Re: mds.0 crashed with 0.61.7
On Mon, 29 Jul 2013, Andreas Bluemle wrote:

Hi Sage, as this crash had been around for a while already: do you know whether this had happened in ceph version 0.61.4 as well?

Pretty sure, yeah.

sage

[snip]
Re: Anyone in NYC next week?
Just signed up, looking forward to it.

On Thu, Jul 25, 2013 at 5:18 PM, Travis Rhoden trho...@gmail.com wrote:

I'm already signed up. Looking forward to it!

- Travis

On Thu, Jul 25, 2013 at 12:19 AM, Sage Weil s...@inktank.com wrote:

I'm going to be in NYC next week at our first Ceph Day of the summer. If you're in town and want to hear more about what we're doing, you should join us!

http://www.inktank.com/CEPHdays/

sage
blueprint: object redirects
I have a draft blueprint up for supporting object redirects, a basic building block that will be used for tiering in RADOS. The basic idea is that an object may have symlink-like semantics indicating that it is stored in another pool... maybe something slower, or erasure-coded, or whatever. There will be basic librados functions to get redirect metadata, safely/atomically demote objects to another pool (turn them into a redirect), and promote objects back to the main pool. Flags will let you control whether promotion happens automatically on write or possibly read.

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/osd:_object_redirects

I'm not particularly happy with the complexity surrounding the tombstone state. Hopefully we can come up with a simple way to make the client-side safely drive object deletion.

If you're interested in discussing this at CDS, please add your name to the blueprint so we can include you in the hangout!
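To make the discussion concrete, one possible shape for the client-side calls is sketched below. The names, types and signatures are purely hypothetical, nothing like them exists in librados today; they only restate the three operations mentioned above (get redirect metadata, demote, promote).

/* Hypothetical librados extensions for object redirects -- illustrative
 * only; none of these functions or types exist. */
#include <rados/librados.h>

struct rados_redirect_info {
    char pool[128];          /* pool the object was demoted to */
    char oid[1024];          /* object name in the target pool */
    int  promote_on_write;   /* promote back automatically on write? */
    int  promote_on_read;    /* promote back automatically on read? */
};

/* Fetch redirect metadata; could return -ENOENT if the object is not
 * a redirect. */
int rados_get_redirect(rados_ioctx_t io, const char *oid,
                       struct rados_redirect_info *out);

/* Atomically demote: copy the object to target_pool and replace the
 * original with a redirect (flags would control auto-promotion). */
int rados_demote_object(rados_ioctx_t io, const char *oid,
                        const char *target_pool, int flags);

/* Promote a redirected object back into the primary pool. */
int rados_promote_object(rados_ioctx_t io, const char *oid);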
krbd live resize
Hi,

This works:

  lvcreate --name tmp --size 10G all
    Logical volume tmp created
  mkfs.ext4 /dev/all/tmp
  mount /dev/all/tmp /mnt
  blockdev --getsize64 /dev/all/tmp
    10737418240
  lvextend -L+1G /dev/all/tmp
    Extending logical volume tmp to 11,00 GiB
    Logical volume tmp successfully resized
  blockdev --getsize64 /dev/all/tmp
    11811160064
  resize2fs /dev/all/tmp
    resize2fs 1.41.12 (17-May-2010)
    Filesystem at /dev/all/tmp is mounted on /mnt; on-line resizing required
    old desc_blocks = 1, new_desc_blocks = 1
    Performing an on-line resize of /dev/all/tmp to 2883584 (4k) blocks.
    The filesystem on /dev/all/tmp is now 2883584 blocks long.

This does not work:

  rbd create --size 10240 tmp
  rbd info tmp
    rbd image 'tmp':
        size 10240 MB in 2560 objects
        order 22 (4096 KB objects)
        block_name_prefix: rb.0.12dd.238e1f29
        format: 1
  rbd map tmp
  mkfs.ext4 /dev/rbd1
  mount /dev/rbd1 /mnt
  blockdev --getsize64 /dev/rbd1
    10737418240
  rbd resize --size 2 tmp
  blockdev --getsize64 /dev/rbd1
    10737418240
  resize2fs /dev/rbd1
    resize2fs 1.42 (29-Nov-2011)
    The filesystem is already 2621440 blocks long. Nothing to do!

It does work after umounting:

  umount /mnt
  blockdev --getsize64 /dev/rbd1
  fsck -f /dev/rbd1
  resize2fs /dev/rbd1
    resize2fs 1.42 (29-Nov-2011)
    Resizing the filesystem on /dev/rbd1 to 512 (4k) blocks.
    The filesystem on /dev/rbd1 is now 512 blocks long.

I assume there should be something in krbd to allow for the same behavior as with LVM, but I don't know enough about the kernel to be more specific. Maybe something similar to ioctl BLKRRPART?

Cheers

--
Loïc Dachary, Artisan Logiciel Libre
All that is necessary for the triumph of evil is that good people do nothing.
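For what it's worth, the size blockdev reports is just the BLKGETSIZE64 ioctl; a small sketch like the one below can be used to watch whether the kernel ever picks up the new size of a mapped rbd device while it is mounted. The device path is only an example.

/* Print the size the kernel currently reports for a block device,
 * i.e. what `blockdev --getsize64` shows. */
#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/fs.h>      /* BLKGETSIZE64 */
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/rbd1";   /* example path */
    int fd = open(dev, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    uint64_t size = 0;
    if (ioctl(fd, BLKGETSIZE64, &size) < 0) { perror("BLKGETSIZE64"); return 1; }
    printf("%s: %llu bytes\n", dev, (unsigned long long)size);
    close(fd);
    return 0;
}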
blueprint: rgw multi-region disaster recovery, second phase
I've created a blueprint for the second phase of the multiregion / DR project:

http://wiki.ceph.com/index.php?title=01Planning/02Blueprints/Emperor/RGW_Multi-region_%2F%2F_Disaster_Recovery_(phase_2)

While a huge amount of work was done for Dumpling, there's still some work that needs to be done (mainly in the area of the disaster recovery). If you're interested in discussing this at CDS, please add yourself as an interested party to the blueprint.

Yehuda
blueprint: RADOS Object Temperature Monitoring
I've created a blueprint for a RADOS level mechanism for discovering cold objects.

http://wiki.ceph.com/01Planning/02Blueprints/Dumpling/RADOS_Object_Temperature_Monitoring

Such a mechanism will be crucial to future tiering implementations. If you are interested in discussing this at CDS, please add yourself as an interested party to the blueprint!

-Sam
blueprint: rgw quota
I created a blueprint for rgw bucket quotas. The document itself is mainly a placeholder and a reference to the older bucket quota that we prepared for Dumpling. If you're interested in discussing this at CDS, please add yourself as an interested party to the blueprint.

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/RGW_Bucket_Level_Quota

Yehuda
Fwd: [ceph-users] Small fix for ceph.spec
-- Forwarded message --
From: Erik Logtenberg e...@logtenberg.eu
Date: Mon, Jul 29, 2013 at 7:07 PM
Subject: [ceph-users] Small fix for ceph.spec
To: ceph-us...@lists.ceph.com

Hi,

The spec file used for building rpm's misses a build time dependency on snappy-devel. Please see attached patch to fix.

Kind regards,

Erik.

--- ceph.spec-orig      2013-07-30 00:24:54.70500 +0200
+++ ceph.spec   2013-07-30 00:25:34.19900 +0200
@@ -42,6 +42,7 @@
 BuildRequires: libxml2-devel
 BuildRequires: libuuid-devel
 BuildRequires: leveldb-devel > 1.2
+BuildRequires: snappy-devel
 #
 # specific
blueprint: librgw
I created another blueprint for defining and creating a library for rgw. This is also just a placeholder and a pointer at an older blueprint.

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/librgw

If you wish to discuss this at CDS, please add yourself to the blueprint.

Yehuda
blueprint: rgw bucket scalability
I created a new blueprint that discusses rgw bucket scalability:

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/rgw:_bucket_index_scalability

As was brought up on the mailing list recently, the bucket index may serve as a contention point. There are a few suggestions for how to solve / mitigate the issue, and we'd like to discuss these at CDS. If you want to participate in the discussion, please add yourself to the blueprint as an interested party.

Yehuda
blueprint: rgw multitenancy
I created a new blueprint that discusses rgw multitenancy. Rgw multitenancy defines a level of hierarchy on top of users and their data, which provides the ability to separate users into different organizational entities.

http://wiki.ceph.com/01Planning/02Blueprints/Emperor/rgw:_multitenancy

As with the other blueprints, if you wish to participate please add yourself as an interested party to the blueprint.

Yehuda
Re: Re: question about striped_read
On Mon, Jul 29, 2013 at 11:00 AM, majianpeng majianp...@gmail.com wrote:
[snip]

I don't think the later was_short can handle the hole case. For the hole case, we should try reading the next strip object instead of returning. How about the patch below?

Hi Yan,
I used this demo to test the hole case:

  dd if=/dev/urandom bs=4096 count=2 of=file_with_holes
  dd if=/dev/urandom bs=4096 seek=7 count=2 of=file_with_holes
  dd if=file_with_holes of=/dev/null bs=16k count=1 iflag=direct

Using dynamic_debug in striped_read, the messages are:

  [ 8743.663499] ceph: file.c:350 : striped_read 0~16384 (read 0) got 16384
  [ 8743.663502] ceph: file.c:390 : striped_read returns 16384

From the messages, we can see it can't hit the short-read. For the ceph-file-hole, how does ceph handle it? Or am I missing something?

the default strip size is 4M, all data are written to the first object in your test case. could you try something like below.

  dd if=/dev/urandom bs=1M count=2 of=file_with_holes
  dd if=/dev/urandom bs=1M count=2 seek=4 of=file_with_holes conv=notrunc
  dd if=file_with_holes bs=8M > /dev/null

From the above test, I think your patch is right. Although the original code can work, it calls striped_read multiple times. As you said, for a stripe short-read it doesn't make sense to return rather than reading the next stripe. But can you add some comments for this? The short-read reasons are two: EOF or hitting a hole. But for hitting a hole there are some different cases. For that I don't know.

Thanks!
Jianpeng Ma

Regards
Yan, Zheng

---
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 271a346..6ca2921 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -350,16 +350,17 @@ more:
 	     ret, hit_stripe ? " HITSTRIPE" : "", was_short ? " SHORT" : "");

 	if (ret > 0) {
-		int didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
+		int didpages = (page_align + this_len) >> PAGE_CACHE_SHIFT;

-		if (read < pos - off) {
-			dout(" zero gap %llu to %llu\n", off + read, pos);
-			ceph_zero_page_vector_range(page_align + read,
-						    pos - off - read, pages);
+		if (was_short) {
+			dout(" zero gap %llu to %llu\n",
+			     pos + ret, pos + this_len);
+			ceph_zero_page_vector_range(page_align + ret,
+						    this_len - ret, page_pos);
 		}
-		pos += ret;
+		pos += this_len;
 		read = pos - off;
-		left -= ret;
+		left -= this_len;
 		page_pos += didpages;
 		pages_left -= didpages;
Re: Re: question about striped_read
On Tue, Jul 30, 2013 at 10:08 AM, majianpeng majianp...@gmail.com wrote:
[snip]

From the above test, I think your patch is right. Although the original code can work, it calls striped_read multiple times.

For the test case

  dd if=/dev/urandom bs=1M count=2 of=file_with_holes
  dd if=/dev/urandom bs=1M count=2 seek=4 of=file_with_holes conv=notrunc
  dd if=file_with_holes bs=8M iflag=direct > /dev/null

I got

  ceph: striped_read 0~8388608 (read 0) got 2097152 HITSTRIPE SHORT
  ceph: striped_read 2097152~6291456 (read 2097152) got 0 HITSTRIPE SHORT
  ceph: zero tail 4194304
  ceph: striped_read returns 6291456
  ceph: sync_read result 6291456
  ceph: aio_read 88000fb22f98 1193e8c.fffe dropping cap refs on Fcr = 6291456

The original code zeros data in the range 2M~6M, which is obviously incorrect.

As you said, for a stripe short-read it doesn't make sense to return rather than reading the next stripe. But can you add some comments for this? The short-read reasons are two: EOF or hitting a hole. But for hitting a hole there are some different cases. For that I don't know.

For hit-hole, there is only one case: the strip object's size is smaller than 4M. When reading a strip object, if the returned data is less than we expected, we need to check whether the following strip objects have data. I think the original code and my patch don't handle the case below properly:

  | object 0 | hole | hole | object 3 |

  dd if=testfile iflag=direct bs=16M > /dev/null

Could you write a patch, do some tests and submit it?

Regards
Yan, Zheng
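A userspace reproducer for the "| object 0 | hole | hole | object 3 |" layout could look like the sketch below. It assumes the default 4 MB object size and an arbitrary file name on a CephFS mount; it writes data to objects 0 and 3, leaves objects 1-2 as holes, then reads the whole range back with O_DIRECT and checks that the hole comes back as zeros.

/* Reproducer sketch: data in objects 0 and 3, holes in objects 1-2. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define OBJ (4LL * 1024 * 1024)   /* default CephFS object size */

int main(void)
{
    const char *path = "file_with_holes";   /* arbitrary name on a CephFS mount */
    char *buf;
    long long i;

    if (posix_memalign((void **)&buf, 4096, 4 * OBJ)) return 1;

    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }
    memset(buf, 0xab, OBJ);
    pwrite(fd, buf, OBJ, 0);         /* object 0 */
    pwrite(fd, buf, OBJ, 3 * OBJ);   /* object 3; objects 1-2 stay holes */
    close(fd);

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open O_DIRECT"); return 1; }
    ssize_t got = read(fd, buf, 4 * OBJ);
    printf("read %zd of %lld bytes\n", got, 4 * OBJ);

    /* the hole (objects 1 and 2) must read back as zeros */
    for (i = OBJ; i < 3 * OBJ && i < got; i++)
        if (buf[i] != 0) { printf("non-zero byte at %lld\n", i); return 1; }
    printf("hole is all zeros\n");
    close(fd);
    free(buf);
    return 0;
}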