Always creating PGs
Hi all,

I created a Ceph file system on 64-bit Debian 7 and found that PGs in the 'creating' state never finish:

# ceph pg stat
v17596: 1204 pgs: 8 creating, 1196 active+clean; 25521 MB data, 77209 MB used, 2223 GB / 2318 GB avail

Why are those eight PGs always creating?

# ceph pg dump | grep creating
1.1p6   0  0  0  0  0  0  0  creating  0.00  0'0  0'0  []  []  0'0  0.00
0.1p7   0  0  0  0  0  0  0  creating  0.00  0'0  0'0  []  []  0'0  0.00
1.1p7   0  0  0  0  0  0  0  creating  0.00  0'0  0'0  []  []  0'0  0.00
0.1p6   0  0  0  0  0  0  0  creating  0.00  0'0  0'0  []  []  0'0  0.00
1.1p8   0  0  0  0  0  0  0  creating  0.00  0'0  0'0  []  []  0'0  0.00
0.1p9   0  0  0  0  0  0  0  creating  0.00  0'0  0'0  []  []  0'0  0.00
1.1p9   0  0  0  0  0  0  0  creating  0.00  0'0  0'0  []  []  0'0  0.00
0.1p8   0  0  0  0  0  0  0  creating  0.00  0'0  0'0  []  []  0'0  0.00

Is the configuration wrong? How can I resolve this problem?

[global]
        auth supported = cephx
        max open files = 131072
        log_to_syslog = true
        pid file = /var/run/ceph/$name.pid
        keyring = /etc/ceph/keyring.bin

[mon]
        mon data = /ceph/$name
[mon.0]
        host = mon0
        mon addr = 192.168.233.81:6789
[mon.1]
        host = mon1
        mon addr = 192.168.233.82:6789
[mon.2]
        host = mon2
        mon addr = 192.168.233.83:6789

[mds]
        keyring = /etc/ceph/keyring.$name
[mds.0]
        host = mds0

[osd]
        osd journal = /var/journal
        osd journal size = 1000
        keyring = /etc/ceph/keyring.$name
[osd.0]
        host = osd0
        btrfs devs = /dev/sdb1
        osd data = /ceph0
        osd journal = /var/journal0
[osd.1]
        host = osd0
        btrfs devs = /dev/sdc1
        osd data = /ceph1
        osd journal = /var/journal1
[osd.2]
        host = osd1
        btrfs devs = /dev/sdb1
        osd data = /ceph0
        osd journal = /var/journal0
[osd.3]
        host = osd1
        btrfs devs = /dev/sdc1
        osd data = /ceph1
        osd journal = /var/journal1
[osd.4]
        host = osd2
        btrfs devs = /dev/sdb1
        osd data = /ceph0
        osd journal = /var/journal0
[osd.5]
        host = osd2
        btrfs devs = /dev/sdc1
        osd data = /ceph1
        osd journal = /var/journal1
[osd.6]
        host = osd3
        btrfs devs = /dev/sdb1
        osd data = /ceph0
        osd journal = /var/journal0
[osd.7]
        host = osd3
        btrfs devs = /dev/sdc1
        osd data = /ceph1
        osd journal = /var/journal1
[osd.8]
        host = osd4
        btrfs devs = /dev/sdb1
        osd data = /ceph0
        osd journal = /var/journal0
[osd.9]
        host = osd4
        btrfs devs = /dev/sdc1
        osd data = /ceph1
        osd journal = /var/journal1

Thanks.

--
Tomoki BENIYA
ben...@bit-isle.co.jp
slow performance even when using SSDs
Dear list,

I'm doing a test setup with Ceph v0.46 and want to know how fast Ceph can be.

My test setup: 3 servers, each with an Intel Xeon X3440, a 180GB Intel 520 Series SSD, 4GB RAM, and 2x 1Gbit/s LAN. All three run mon.a-c and osd.0-2. Two of them also run mds.2 and mds.3 (the latter has 8GB RAM instead of 4GB). All machines run Ceph v0.46 on a vanilla Linux kernel v3.0.30, and all of them use btrfs on the SSD, which serves /srv/{osd,mon}.X. All of them use eth0+eth1 as bond0 (mode 6).

This gives me:

rados -p rbd bench 60 write
...
Total time run:        61.465323
Total writes made:     776
Write size:            4194304
Bandwidth (MB/sec):    50.500
Average Latency:       1.2654
Max latency:           2.77124
Min latency:           0.170936

Shouldn't it be at least 100MB/s (1Gbit/s / 8)?

And rados -p rbd bench 60 write -b 4096 gives pretty bad results:

Total time run:        60.221130
Total writes made:     6401
Write size:            4096
Bandwidth (MB/sec):    0.415
Average Latency:       0.150525
Max latency:           1.12647
Min latency:           0.026599

All btrfs SSDs are also mounted with noatime.

Thanks for your help!

Greets,
Stefan
Re: slow performance even when using SSDs
OK, here are some retests. I had the SSDs connected to an old RAID controller, even though I used them as JBODs (oops). Here are two new tests (using kernel 3.4-rc6); it would be great if someone could tell me whether these numbers are fine or bad.

New tests with all 3 SSDs connected to the mainboard:

#~ rados -p rbd bench 60 write
Total time run:        60.342419
Total writes made:     2021
Write size:            4194304
Bandwidth (MB/sec):    133.969
Average Latency:       0.477476
Max latency:           0.942029
Min latency:           0.109467

#~ rados -p rbd bench 60 write -b 4096
Total time run:        60.726326
Total writes made:     59026
Write size:            4096
Bandwidth (MB/sec):    3.797
Average Latency:       0.016459
Max latency:           0.874841
Min latency:           0.002392

Another test with only the OSD data on the disk and the journal in memory / tmpfs:

#~ rados -p rbd bench 60 write
Total time run:        60.513240
Total writes made:     2555
Write size:            4194304
Bandwidth (MB/sec):    168.889
Average Latency:       0.378775
Max latency:           4.59233
Min latency:           0.055179

#~ rados -p rbd bench 60 write -b 4096
Total time run:        60.116260
Total writes made:     281903
Write size:            4096
Bandwidth (MB/sec):    18.318
Average Latency:       0.00341067
Max latency:           0.720486
Min latency:           0.000602

Another problem I have is that I'm always getting:

2012-05-10 15:05:22.140027 mon.0 192.168.0.100:6789/0 19 : [WRN] message from mon.2 was stamped 0.109244s in the future, clocks not synchronized

even though ntp is running fine on all systems.

Stefan

Am 10.05.2012 14:09, schrieb Stefan Priebe - Profihost AG:
> [original test setup and rados bench results quoted from the first message of this thread]
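A quick way to confirm whether the hosts really agree on the time is to compare their NTP peer offsets side by side. This is only a sketch (not from the thread): the host names are placeholders, and it assumes ntpq is installed everywhere. If the offsets are genuinely small, raising the monitors' allowed clock drift (the "mon clock drift allowed" option, if memory serves) can quiet the warning, though fixing the clocks is the better answer.

  for h in host1 host2 host3; do
      echo "== $h =="
      ssh "$h" 'ntpq -pn; date +%s.%N'   # peer offsets plus the local clock
  done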
Designing a cluster guide
Hi,

The "Designing a cluster" guide (http://wiki.ceph.com/wiki/Designing_a_cluster) is pretty good, but it still leaves some questions unanswered.

It mentions, for example, a "fast CPU" for the MDS system. What does "fast" mean? Just the speed of one core? Or is Ceph designed to use multiple cores? Which matters more, core count or clock speed?

The cluster design recommendations suggest separating all daemons onto dedicated machines. Is this also useful for the MONs? As they're so lightweight, why not run them on the OSDs?

Regarding the OSDs: is it fine to use an SSD RAID 1 for the journal and perhaps 22x SATA disks in a RAID 10 for the filesystem, or is this quite absurd and should you go for 22x SSDs in a RAID 6? Is it more useful to use a hardware RAID 6 controller or btrfs RAID? Should the OSDs use single-socket or dual-socket Xeons?

Thanks and greets,
Stefan
[PATCH 2/2] ceph: add tracepoints for message send queueing and completion, reply handling
Signed-off-by: Jim Schutt jasc...@sandia.gov --- include/trace/events/ceph.h | 67 +++ net/ceph/messenger.c|9 +- net/ceph/osd_client.c |1 + 3 files changed, 76 insertions(+), 1 deletions(-) diff --git a/include/trace/events/ceph.h b/include/trace/events/ceph.h index 182af2c..b390f78 100644 --- a/include/trace/events/ceph.h +++ b/include/trace/events/ceph.h @@ -4,6 +4,9 @@ #if !defined(_TRACE_CEPH_H) || defined(TRACE_HEADER_MULTI_READ) #define _TRACE_CEPH_H +#if defined(CEPH_TRACE_FS_FILE) \ + || defined(CEPH_TRACE_FS_ADDR) \ + || defined(CEPH_TRACE_NET_OSDC) #if !defined(TRACE_HEADER_MULTI_READ) @@ -68,8 +71,72 @@ DEFINE_CEPH_START_REQ_EVENT(ceph_async_readpages_req); #ifdef CEPH_TRACE_NET_OSDC DEFINE_CEPH_START_REQ_EVENT(ceph_osdc_writepages_req); DEFINE_CEPH_START_REQ_EVENT(ceph_osdc_readpages_req); + +TRACE_EVENT(ceph_handle_reply_msg, + TP_PROTO(struct ceph_connection *con, +struct ceph_msg *msg, +struct ceph_osd_reply_head *reply, +void *req), + TP_ARGS(con, msg, reply, req), + TP_STRUCT__entry( + __field(unsigned long long, tid) + __field(long long, peer_num) + __field(void*, req) + __field(unsigned, peer_type) + __field(int, result) + __field(int, flags) + ), + TP_fast_assign( + __entry-tid = le64_to_cpu(msg-hdr.tid); + __entry-peer_num = le64_to_cpu(con-peer_name.num); + __entry-peer_type = con-peer_name.type; + __entry-req = req; + __entry-result = le32_to_cpu(reply-result); + __entry-flags = le32_to_cpu(reply-flags); + ), + TP_printk(peer %s%lld tid %llu result %d flags 0x%08x (req %p), + ceph_entity_type_name(__entry-peer_type), __entry-peer_num, + __entry-tid, __entry-result, __entry-flags, __entry-req + ) +); #endif +#endif /* CEPH_TRACE_FS_FILE || CEPH_TRACE_FS_ADDR || CEPH_TRACE_NET_OSDC */ + +#if defined(CEPH_TRACE_NET_MESSENGER) + +DECLARE_EVENT_CLASS(ceph_write_msg_class, + TP_PROTO(struct ceph_connection *con, struct ceph_msg *msg), + TP_ARGS(con, msg), + TP_STRUCT__entry( + __field(unsigned long long, tid) + __field(unsigned long long, seq) + __field(long long, peer_num) + __field(unsigned, peer_type) + __field(int, sent) + ), + TP_fast_assign( + __entry-tid = le64_to_cpu(msg-hdr.tid); + __entry-seq = le64_to_cpu(msg-hdr.seq); + __entry-peer_num = le64_to_cpu(con-peer_name.num); + __entry-peer_type = con-peer_name.type; + __entry-sent = con-out_msg_pos.data_pos; + ), + TP_printk(peer %s%lld tid %llu seq %llu sent %d, + ceph_entity_type_name(__entry-peer_type), __entry-peer_num, + __entry-tid, __entry-seq, __entry-sent) +); + +#define CEPH_WRITE_MSG_EVENT(name) \ +DEFINE_EVENT(ceph_write_msg_class, name, \ + TP_PROTO(struct ceph_connection *con, struct ceph_msg *msg), \ + TP_ARGS(con, msg)) + +CEPH_WRITE_MSG_EVENT(ceph_prepare_write_msg); +CEPH_WRITE_MSG_EVENT(ceph_try_write_msg); +CEPH_WRITE_MSG_EVENT(ceph_try_write_msg_done); + +#endif /* CEPH_TRACE_NET_MESSENGER */ #endif /* _TRACE_CEPH_H */ diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index ad5b708..033c4ab 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -20,6 +20,11 @@ #include linux/ceph/pagelist.h #include linux/export.h +#include linux/tracepoint.h +#define CREATE_TRACE_POINTS +#define CEPH_TRACE_NET_MESSENGER +#include trace/events/ceph.h + /* * Ceph uses the messenger to exchange ceph_msg messages with other * hosts in the system. 
The messenger provides ordered and reliable @@ -555,7 +560,7 @@ static void prepare_write_message(struct ceph_connection *con) /* no, queue up footer too and be done */ prepare_write_message_footer(con, v); } - + trace_ceph_prepare_write_msg(con, m); set_bit(WRITE_PENDING, con-state); } @@ -1853,8 +1858,10 @@ more_kvec: /* msg pages? */ if (con-out_msg) { + trace_ceph_try_write_msg(con, con-out_msg); if (con-out_msg_done) { ceph_msg_put(con-out_msg); + trace_ceph_try_write_msg_done(con, con-out_msg); con-out_msg = NULL; /* we're done with this one */ goto do_next; } diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index f44e400..767e253 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -1200,6 +1200,7 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg, /* lookup */
[PATCH 1/2] ceph: add tracepoints for message submission on read/write requests
Trace callers of ceph_osdc_start_request, so that call locations are identified implicitly. Put the tracepoints after calls to ceph_osdc_start_request, since it fills in the request transaction ID and request OSD. Signed-off-by: Jim Schutt jasc...@sandia.gov --- fs/ceph/addr.c |8 fs/ceph/file.c |6 +++ include/trace/events/ceph.h | 77 +++ net/ceph/osd_client.c |7 4 files changed, 98 insertions(+), 0 deletions(-) create mode 100644 include/trace/events/ceph.h diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 173b1d2..f552579 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -13,6 +13,12 @@ #include mds_client.h #include linux/ceph/osd_client.h + +#include linux/tracepoint.h +#define CREATE_TRACE_POINTS +#define CEPH_TRACE_FS_ADDR +#include trace/events/ceph.h + /* * Ceph address space ops. * @@ -338,6 +344,7 @@ static int start_read(struct inode *inode, struct list_head *page_list, int max) dout(start_read %p starting %p %lld~%lld\n, inode, req, off, len); ret = ceph_osdc_start_request(osdc, req, false); + trace_ceph_async_readpages_req(req); if (ret 0) goto out_pages; ceph_osdc_put_request(req); @@ -902,6 +909,7 @@ get_more_pages: req-r_request-hdr.data_len = cpu_to_le32(len); rc = ceph_osdc_start_request(fsc-client-osdc, req, true); + trace_ceph_async_writepages_req(req); BUG_ON(rc); req = NULL; diff --git a/fs/ceph/file.c b/fs/ceph/file.c index ed72428..fc31def 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -10,6 +10,11 @@ #include super.h #include mds_client.h +#include linux/tracepoint.h +#define CREATE_TRACE_POINTS +#define CEPH_TRACE_FS_FILE +#include trace/events/ceph.h + /* * Ceph file operations * @@ -568,6 +573,7 @@ more: req-r_inode = inode; ret = ceph_osdc_start_request(fsc-client-osdc, req, false); + trace_ceph_sync_writepages_req(req); if (!ret) { if (req-r_safe_callback) { /* diff --git a/include/trace/events/ceph.h b/include/trace/events/ceph.h new file mode 100644 index 000..182af2c --- /dev/null +++ b/include/trace/events/ceph.h @@ -0,0 +1,77 @@ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM ceph + +#if !defined(_TRACE_CEPH_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_CEPH_H + + +#if !defined(TRACE_HEADER_MULTI_READ) + +static __always_inline int +__ceph_req_num_ops(struct ceph_osd_request *req) +{ + struct ceph_osd_request_head *reqhead = req-r_request-front.iov_base; + return le16_to_cpu(reqhead-num_ops); +} + +static __always_inline int +__ceph_req_op_opcode(struct ceph_osd_request *req, int op) +{ + struct ceph_osd_request_head *reqhead = req-r_request-front.iov_base; + if (op le16_to_cpu(reqhead-num_ops)) + return le16_to_cpu(reqhead-ops[op].op); + else + return 0; +} + +#endif + +DECLARE_EVENT_CLASS(ceph_start_req_class, + TP_PROTO(struct ceph_osd_request *req), + TP_ARGS(req), + TP_STRUCT__entry( + __field(unsigned long long, tid) + __field(int, osd) + __field(int, num_ops) + __array(unsigned, ops, 3) + __field(unsigned, pages) + ), + TP_fast_assign( + __entry-tid = req-r_tid; + __entry-osd = req-r_osd-o_osd; + __entry-num_ops = __ceph_req_num_ops(req); + __entry-ops[0] = __ceph_req_op_opcode(req, 0); + __entry-ops[1] = __ceph_req_op_opcode(req, 1); + __entry-ops[2] = __ceph_req_op_opcode(req, 2); + __entry-pages = req-r_num_pages; + ), + TP_printk(tid %llu osd%d ops %d 0x%04x/0x%04x/0x%04x pages %u, + __entry-tid, __entry-osd, __entry-num_ops, + __entry-ops[0], __entry-ops[1], __entry-ops[2], + __entry-pages + ) +); + +#define DEFINE_CEPH_START_REQ_EVENT(name) \ +DEFINE_EVENT(ceph_start_req_class, name, \ + TP_PROTO(struct ceph_osd_request *req), 
TP_ARGS(req)) + +#ifdef CEPH_TRACE_FS_FILE +DEFINE_CEPH_START_REQ_EVENT(ceph_sync_writepages_req); +#endif + +#ifdef CEPH_TRACE_FS_ADDR +DEFINE_CEPH_START_REQ_EVENT(ceph_async_writepages_req); +DEFINE_CEPH_START_REQ_EVENT(ceph_async_readpages_req); +#endif + +#ifdef CEPH_TRACE_NET_OSDC +DEFINE_CEPH_START_REQ_EVENT(ceph_osdc_writepages_req); +DEFINE_CEPH_START_REQ_EVENT(ceph_osdc_readpages_req); +#endif + + +#endif /* _TRACE_CEPH_H */ + +/* This part must be outside protection */ +#include trace/define_trace.h diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index 5e25405..f44e400 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -18,6 +18,11 @@ #include linux/ceph/auth.h #include linux/ceph/pagelist.h +#include linux/tracepoint.h +#define
[PATCH 0/2] Ceph tracepoints
Hi Alex,

I ran across tracker #2374 today - I've been carrying these two tracepoint patches for a while. Perhaps you'll find them useful.

Jim Schutt (2):
  ceph: add tracepoints for message submission on read/write requests
  ceph: add tracepoints for message send queueing and completion, reply handling

 fs/ceph/addr.c              |    8 +++
 fs/ceph/file.c              |    6 ++
 include/trace/events/ceph.h |  144 +++
 net/ceph/messenger.c        |    9 +++-
 net/ceph/osd_client.c       |    8 +++
 5 files changed, 174 insertions(+), 1 deletions(-)
 create mode 100644 include/trace/events/ceph.h

--
1.7.8.2
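Assuming the patches are applied and ftrace is available, the new events should appear under the standard tracing debugfs tree (TRACE_SYSTEM is "ceph" in the patch). A minimal sketch for enabling them and watching the output, using only stock ftrace plumbing:

  mount -t debugfs none /sys/kernel/debug 2>/dev/null || true
  echo 1 > /sys/kernel/debug/tracing/events/ceph/enable   # enable all ceph tracepoints
  cat /sys/kernel/debug/tracing/trace_pipe                # stream events as they fire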
Re: [PATCH 0/2] Ceph tracepoints
On 05/10/2012 09:35 AM, Jim Schutt wrote:
> Hi Alex,
>
> I ran across tracker #2374 today - I've been carrying these two
> tracepoint patches for a while. Perhaps you'll find them useful.

GREAT! I haven't looked at them but I will as soon as I get the chance. I don't expect there's any reason not to use this as the foundation I was looking for.

Thanks a lot.

-Alex

> Jim Schutt (2):
>   ceph: add tracepoints for message submission on read/write requests
>   ceph: add tracepoints for message send queueing and completion, reply handling
>
>  fs/ceph/addr.c              |    8 +++
>  fs/ceph/file.c              |    6 ++
>  include/trace/events/ceph.h |  144 +++
>  net/ceph/messenger.c        |    9 +++-
>  net/ceph/osd_client.c       |    8 +++
>  5 files changed, 174 insertions(+), 1 deletions(-)
>  create mode 100644 include/trace/events/ceph.h
Re: Ceph on btrfs 3.4rc
On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> Am 24. April 2012 18:26 schrieb Sage Weil <s...@newdream.net>:
>> On Tue, 24 Apr 2012, Josef Bacik wrote:
>>> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
>>>> After running ceph on XFS for some time, I decided to try btrfs again.
>>>> Performance with the current for-linux-min branch and big metadata is
>>>> much better. The only problem (?) I'm still seeing is a warning that
>>>> seems to occur from time to time:
>>
>> Actually, before you do that... we have a new tool,
>> test_filestore_workloadgen, that generates a ceph-osd-like workload on
>> the local file system. It's a subset of what a full OSD might do, but if
>> we're lucky it will be sufficient to reproduce this issue. Something like
>>
>>   test_filestore_workloadgen --osd-data /foo --osd-journal /bar
>>
>> will hopefully do the trick.
>>
>> Christian, maybe you can see if that is able to trigger this warning?
>> You'll need to pull it from the current master branch; it wasn't in the
>> last release.
>
> Trying to reproduce with test_filestore_workloadgen didn't work for me.
> So here are some instructions on how to reproduce with a minimal ceph
> setup. You will need a single system with two disks and a bit of memory.
>
> - Compile and install ceph (detailed instructions:
>   http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
>
> - For the test setup I've used two tmpfs files as journal devices. To
>   create these, do the following:
>
>   # mkdir -p /ceph/temp
>   # mount -t tmpfs tmpfs /ceph/temp
>   # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
>   # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
>
> - Now you should create and mount btrfs. Here is what I did:
>
>   # mkfs.btrfs -l 64k -n 64k /dev/sda
>   # mkfs.btrfs -l 64k -n 64k /dev/sdb
>   # mkdir /ceph/osd.000
>   # mkdir /ceph/osd.001
>   # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
>   # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
>
> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You will
>   probably have to change the btrfs devices and the hostname (os39).
>
> - Create the ceph filesystems:
>
>   # mkdir /ceph/mon
>   # mkcephfs -a -c /etc/ceph/ceph.conf
>
> - Start ceph (e.g. "service ceph start")
>
> - Now you should be able to use ceph - "ceph -s" will tell you about the
>   state of the ceph cluster.
>
> - "rbd create -size 100 testimg" will create an rbd image on the ceph
>   cluster.
>
> - Compile my test with "gcc -o rbdtest rbdtest.c -lrbd" and run it with
>   "./rbdtest testimg".
>
> I can see the first btrfs_orphan_commit_root warning after an hour or
> so... I hope that I've described all necessary steps. If there is a
> problem just send me a note.

Well it's only taken me 2 weeks but I've finally got it all up and running; hopefully I'll reproduce.

Thanks,
Josef
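Since the warning only shows up after an hour or so, it may help to leave a watcher running that records when it first appears. A small sketch, assuming the warning lands in the kernel log as a WARN_ON trace mentioning btrfs_orphan_commit_root:

  while true; do
      if dmesg | grep -q btrfs_orphan_commit_root; then
          date                                           # timestamp of first hit
          dmesg | grep -m1 -A5 btrfs_orphan_commit_root  # first occurrence plus context
          break
      fi
      sleep 60
  done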
Re: Ceph kernel client - kernel crashes
Sorry for my late response. I reproduced the above bug with the Linux kernel 3.3.4 and without using XEN: uname -a Linux node33 3.3.4 #1 SMP Wed May 9 13:00:07 EEST 2012 x86_64 GNU/Linux The trace is shown below: [ 763.984023] kernel tried to execute NX-protected page - exploit attempt? (uid: 0) [ 763.984177] BUG: unable to handle kernel paging request at 880037bd0800 [ 763.984402] IP: [880037bd0800] 0x880037bd07ff [ 763.984568] PGD 1806063 PUD 180a063 PMD 800037a001e3 [ 763.984845] Oops: 0011 [#1] SMP [ 763.985058] CPU 3 [ 763.985124] Modules linked in: cbc netconsole loop snd_pcm snd_timer snd soundcore snd_page_alloc processor tpm_tis i5400_edac tpm edac_core tpm_bios evdev pcspkr i5k_amb rng_core thermal_sys button shpchp pci_hotplug sd_mod crc_t10dif usbhid hid ide_cd_mod cdrom ata_generic uhci_hcd ehci_hcd ata_piix libata piix ide_core usbcore usb_common tg3 libphy mptsas mptscsih mptbase scsi_transport_sas scsi_mod [last unloaded: scsi_wait_scan] [ 763.988002] [ 763.988002] Pid: 0, comm: swapper/3 Not tainted 3.3.4 #1 HP ProLiant DL160 G5 [ 763.988002] RIP: 0010:[880037bd0800] [880037bd0800] 0x880037bd07ff [ 763.988002] RSP: 0018:8800bfcc3e78 EFLAGS: 00010292 [ 763.988002] RAX: 8800b97745b0 RBX: 8800bfcce770 RCX: 880037bd0800 [ 763.988002] RDX: 880037bd1600 RSI: b9b6a040 RDI: 880037bd1600 [ 763.988002] RBP: 81820080 R08: 8800b9dd0b00 R09: 00018020001c [ 763.988002] R10: 8020001c R11: 816075c0 R12: 8800bfcce7a0 [ 763.988002] R13: 8800b97745b0 R14: 0003 R15: 000a [ 763.988002] FS: () GS:8800bfcc() knlGS: [ 763.988002] CS: 0010 DS: ES: CR0: 8005003b [ 763.988002] CR2: 880037bd0800 CR3: b895b000 CR4: 06e0 [ 763.988002] DR0: DR1: DR2: [ 763.988002] DR3: DR6: 0ff0 DR7: 0400 [ 763.988002] Process swapper/3 (pid: 0, threadinfo 8800bbae, task 8800bbad8000) [ 763.988002] Stack: [ 763.988002] 8109b44d 8800bbacd820 8800b97745b0 8800bbae0010 [ 763.988002] 8800bbad8000 8800bfcc3ea0 0048 8800bbae1fd8 [ 763.988002] 0100 0001 0009 8800bbae1fd8 [ 763.988002] Call Trace: [ 763.988002] IRQ [ 763.988002] [8109b44d] ? __rcu_process_callbacks+0x1e9/0x335 [ 763.988002] [8109b8fb] ? rcu_process_callbacks+0x2c/0x56 [ 763.988002] [8103e3b1] ? __do_softirq+0xc4/0x1a0 [ 763.988002] [8102515b] ? lapic_next_event+0x18/0x1d [ 763.988002] [815d3b1c] ? call_softirq+0x1c/0x30 [ 763.988002] [8100fba3] ? do_softirq+0x3f/0x79 [ 763.988002] [8103e186] ? irq_exit+0x44/0xb1 [ 763.988002] [81025c61] ? smp_apic_timer_interrupt+0x85/0x93 [ 763.988002] [815d311e] ? apic_timer_interrupt+0x6e/0x80 [ 763.988002] EOI [ 763.988002] [810145e1] ? native_sched_clock+0x28/0x33 [ 763.988002] [810152f6] ? mwait_idle+0x8c/0xbc [ 763.988002] [810152ae] ? mwait_idle+0x44/0xbc [ 763.988002] [8100de94] ? cpu_idle+0xb9/0xf7 [ 763.988002] [815c43c6] ? 
start_secondary+0x270/0x275 [ 763.988002] Code: 00 00 00 00 04 8a b8 00 88 ff ff 00 04 8a b8 00 88 ff ff 00 03 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 16 bd 37 00 88 ff ff 40 ab cd bf 00 88 ff ff 20 15 42 b9 00 [ 763.988002] RIP [880037bd0800] 0x880037bd07ff [ 763.988002] RSP 8800bfcc3e78 [ 763.988002] CR2: 880037bd0800 [ 763.988002] ---[ end trace 614049dc850267ac ]--- [ 763.988002] Kernel panic - not syncing: Fatal exception in interrupt [ 763.997833] [ cut here ] [ 763.997936] WARNING: at arch/x86/kernel/smp.c:120 update_process_times+0x57/0x63() [ 763.998072] Hardware name: ProLiant DL160 G5 [ 763.998171] Modules linked in: cbc netconsole loop snd_pcm snd_timer snd soundcore snd_page_alloc processor tpm_tis i5400_edac tpm edac_core tpm_bios evdev pcspkr i5k_amb rng_core thermal_sys button shpchp pci_hotplug sd_mod crc_t10dif usbhid hid ide_cd_mod cdrom ata_generic uhci_hcd ehci_hcd ata_piix libata piix ide_core usbcore usb_common tg3 libphy mptsas mptscsih mptbase scsi_transport_sas scsi_mod [last unloaded: scsi_wait_scan] [ 764.001205] Pid: 0, comm: swapper/3 Tainted: G D 3.3.4 #1 [ 764.001311] Call Trace: [ 764.001404] IRQ [81038bb0] ? warn_slowpath_common+0x78/0x8c [ 764.001573] [81044937] ? update_process_times+0x57/0x63 [ 764.001681] [81075dbe] ? tick_sched_timer+0x65/0x8b [ 764.001788] [810561bd] ? __run_hrtimer+0xb2/0x13d [ 764.001832] [81013ca9] ? read_tsc+0x5/0x16 [ 764.001832] [81056482] ? hrtimer_interrupt+0xd8/0x1a7 [
Re: slow performance even when using SSDs
I was getting roughly the same results as your tmpfs test using spinning disks for the OSDs with a 160GB Intel 320 SSD for the journal. Theoretically the 520 SSD should give better performance than my 320s.

Keep in mind that even with balance-alb, multiple GigE connections will only be used if there are multiple TCP sessions being used by Ceph.

You don't mention it in your email, but if you're using kernel 3.4+ you'll want to make sure you create your btrfs filesystem with the large node/leaf size ("big metadata" - I've heard recommendations of 32k instead of the default 4k) so your performance doesn't degrade over time.

I'm curious what speed you're getting from dd in a streaming write. You might try running

  dd if=/dev/zero of=<intel ssd partition> bs=128k count=<something>

to see what the SSD will spit out without Ceph in the picture.

Calvin

On Thu, May 10, 2012 at 7:09 AM, Stefan Priebe - Profihost AG <s.pri...@profihost.ag> wrote:
> OK, here are some retests. I had the SSDs connected to an old RAID
> controller, even though I used them as JBODs (oops). Here are two new
> tests (using kernel 3.4-rc6); it would be great if someone could tell me
> whether they're fine or bad.
> [benchmark results and clock-skew warning quoted from the previous message of this thread]

Greets,
Stefan
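To make the comparison Calvin suggests concrete, a raw streaming write to the SSD followed by the same rados bench run might look like the sketch below. The partition name is a placeholder, oflag=direct is added so the page cache doesn't mask the SSD's real speed, and writing to the raw partition destroys whatever is on it.

  # raw SSD streaming write (about 1 GB), for comparison with rados bench
  dd if=/dev/zero of=/dev/sdX1 bs=128k count=8192 oflag=direct
  # then the cluster-level number from the same node
  rados -p rbd bench 60 write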
Re: Ceph on btrfs 3.4rc
On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote: Am 24. April 2012 18:26 schrieb Sage Weil s...@newdream.net: On Tue, 24 Apr 2012, Josef Bacik wrote: On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote: After running ceph on XFS for some time, I decided to try btrfs again. Performance with the current for-linux-min branch and big metadata is much better. The only problem (?) I'm still seeing is a warning that seems to occur from time to time: Actually, before you do that... we have a new tool, test_filestore_workloadgen, that generates a ceph-osd-like workload on the local file system. It's a subset of what a full OSD might do, but if we're lucky it will be sufficient to reproduce this issue. Something like test_filestore_workloadgen --osd-data /foo --osd-journal /bar will hopefully do the trick. Christian, maybe you can see if that is able to trigger this warning? You'll need to pull it from the current master branch; it wasn't in the last release. Trying to reproduce with test_filestore_workloadgen didn't work for me. So here are some instructions on how to reproduce with a minimal ceph setup. You will need a single system with two disks and a bit of memory. - Compile and install ceph (detailed instructions: http://ceph.newdream.net/docs/master/ops/install/mkcephfs/) - For the test setup I've used two tmpfs files as journal devices. To create these, do the following: # mkdir -p /ceph/temp # mount -t tmpfs tmpfs /ceph/temp # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k - Now you should create and mount btrfs. Here is what I did: # mkfs.btrfs -l 64k -n 64k /dev/sda # mkfs.btrfs -l 64k -n 64k /dev/sdb # mkdir /ceph/osd.000 # mkdir /ceph/osd.001 # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000 # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001 - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You will probably have to change the btrfs devices and the hostname (os39). - Create the ceph filesystems: # mkdir /ceph/mon # mkcephfs -a -c /etc/ceph/ceph.conf - Start ceph (e.g. service ceph start) - Now you should be able to use ceph - ceph -s will tell you about the state of the ceph cluster. - rbd create -size 100 testimg will create an rbd image on the ceph cluster. - Compile my test with gcc -o rbdtest rbdtest.c -lrbd and run it with ./rbdtest testimg. I can see the first btrfs_orphan_commit_root warning after an hour or so... I hope that I've described all necessary steps. If there is a problem just send me a note. Well I feel like an idiot, I finally get it to reproduce, go look at where I want to put my printks and theres the problem staring me right in the face. I've looked seriously at this problem 2 or 3 times and have missed this every single freaking time. Here is the patch I'm trying, please try it on yours to make sure it fixes the problem. It takes like 2 hours for it to reproduce for me so I won't be able to fully test it until tomorrow, but so far it hasn't broken anything so it should be good. Thanks, Josef diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index eefe573..4ad628d 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -57,9 +57,6 @@ struct btrfs_inode { /* used to order data wrt metadata */ struct btrfs_ordered_inode_tree ordered_tree; - /* for keeping track of orphaned inodes */ - struct list_head i_orphan; - /* list of all the delalloc inodes in the FS. 
There are times we need * to write all the delalloc pages to disk, and this list is used * to walk them all. @@ -164,6 +161,7 @@ struct btrfs_inode { unsigned dummy_inode:1; unsigned in_defrag:1; unsigned delalloc_meta_reserved:1; + unsigned has_orphan_item:1; /* * always compress this one file diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 8a89888..6dd20f3 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1375,7 +1375,7 @@ struct btrfs_root { struct list_head root_list; spinlock_t orphan_lock; - struct list_head orphan_list; + atomic_t orphan_inodes; struct btrfs_block_rsv *orphan_block_rsv; int orphan_item_inserted; int orphan_cleanup_state; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 7f849b3..8bbe8c4 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1148,7 +1148,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize, root-orphan_block_rsv = NULL; INIT_LIST_HEAD(root-dirty_list); - INIT_LIST_HEAD(root-orphan_list); INIT_LIST_HEAD(root-root_list); spin_lock_init(root-orphan_lock); spin_lock_init(root-inode_lock); @@ -1161,6 +1160,7 @@ static void
Re: Can I use btrfs-restore to restore ceph osds?
On Wed, May 9, 2012 at 10:15 AM, Guido Winkelmann <guido-c...@thisisnotatest.de> wrote:
> I'm currently trying to re-enable my experimental ceph cluster that has
> been offline for a few months. Unfortunately, it appears that, out of the
> six btrfs volumes involved, only one can still be mounted; the other five
> are broken somehow. (If I ever use Ceph in production, it's probably not
> going to be on btrfs after this... I cannot recall whether or not the
> servers were properly shut down the last time, but even if not, this is a
> bit ridiculous.) I cannot seem to repair the broken filesystems with
> btrfsck, but I can extract data from them with btrfs-restore.

The OSD uses btrfs snapshots internally. Any restore operation would have to bring the snapshots back exactly as they were, too. It seems there's a -s option for that, but whether things will work out is hard to predict. Since it was a test cluster, you're probably better off scrapping the data and setting up a new cluster.
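For reference, the snapshot-aware restore mentioned above might be invoked roughly as below. This is only a sketch: the device and target directory are placeholders, -s is the snapshot option referred to, and -i (ignore errors, if the installed btrfs-progs supports it) lets the tool keep going past unrecoverable files. Whether the recovered tree is consistent enough for an OSD to start from is a separate question.

  mkdir -p /mnt/rescue
  btrfs-restore -s -i /dev/sdb1 /mnt/rescue   # newer btrfs-progs expose this as "btrfs restore"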
Re: Compile error in rgw/rgw_xml.h in 0.46
Oops, missed posting it to the list (seeing Tommi's comment).

On Wed, May 9, 2012 at 11:04 AM, Yehuda Sadeh <yeh...@inktank.com> wrote:
> On Wed, May 9, 2012 at 6:46 AM, Guido Winkelmann
> <guido-c...@thisisnotatest.de> wrote:
>> Compiling Ceph 0.46 fails at rgw/rgw_dencoder.cc with the following errors:
>>
>>   In file included from rgw/rgw_dencoder.cc:7:
>>   rgw/rgw_acl_s3.h:9:19: error: expat.h: No such file or directory
>>   In file included from rgw/rgw_acl_s3.h:11,
>>                    from rgw/rgw_dencoder.cc:7:
>>   rgw/rgw_xml.h:59: error: 'XML_Parser' does not name a type
>>   make[3]: *** [ceph_dencoder-rgw_dencoder.o] Error 1
>>   make[3]: Leaving directory `/root/ceph-0.46/src'
>>   make[2]: *** [all-recursive] Error 1
>>   make[2]: Leaving directory `/root/ceph-0.46/src'
>>   make[1]: *** [all] Error 2
>>   make[1]: Leaving directory `/root/ceph-0.46/src'
>>   make: *** [all-recursive] Error 1
>>
>> This is on CentOS 6.2. I managed to get it to compile by installing
>> expat-devel first. Maybe the configure script should check for the
>> existence of the expat header files?
>
> Actually, the configure script checks for that, but only if rgw is
> enabled. We need to figure out what the best solution would be. I don't
> think expat should be a dependency when rgw is not enabled, so we need to
> figure out how to remove this dependency. It might be that the easiest
> solution would be to not compile all the rgw stuff into the dencoder when
> rgw is not enabled (which makes sense anyway).
>
> I opened bug #2390 to track this issue.
>
> Thanks,
> Yehuda
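Until the build is fixed, the workaround Guido describes amounts to installing the expat headers before configuring. A sketch of what that looks like on CentOS 6.2, assuming a build from the unpacked 0.46 source tree:

  yum install -y expat-devel
  cd /root/ceph-0.46
  ./configure        # re-run configure so it picks up expat.h
  make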
Re: Always creating PGs
On Thu, May 10, 2012 at 2:17 AM, Tomoki BENIYA <ben...@bit-isle.co.jp> wrote:
> I found that PGs of 'creating' status are never finished.
>
> # ceph pg stat
>    v17596: 1204 pgs: 8 creating, 1196 active+clean; 25521 MB data, 77209 MB used, 2223 GB / 2318 GB avail
>
> # ceph pg dump|grep creating
> 1.1p6   0  0  0  0  0  0  0  creating  0.00  0'0  0'0  []  []  0'0  0.00

Sounds like another report we've had recently, under the subject "PGs stuck in creating state":

  http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/6078

Please read that thread. That wart should go away in 0.47.
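For anyone wanting to confirm they are hitting the same wart before upgrading, a crude check is to watch whether the number of creating PGs ever changes. A sketch using only the commands already shown in this thread; if it is the referenced bug, the count stays flat indefinitely:

  while true; do
      date
      ceph pg dump | grep -c creating   # number of PGs still stuck in "creating"
      sleep 300
  done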
[RFC PATCH 0/2] Distribute re-replicated objects evenly after OSD failure
Hi Sage,

I've been trying to solve the issue mentioned in tracker #2047, which I think is the same as the one I described in http://www.spinics.net/lists/ceph-devel/msg05824.html. The attached patches seem to fix it for me. I also attempted to address the local search issue you mentioned in #2047.

I'm testing this using a cluster with 3 rows, 2 racks/row, 2 hosts/rack, and 4 osds/host. I tested against a CRUSH map with the rule:

  step take root
  step chooseleaf firstn 0 type rack
  step emit

I'm in the process of testing this as follows: I wrote some data to the cluster, then started shutting down OSDs using "init-ceph stop osd.n". For the first rack's worth, I shut OSDs down sequentially, waiting for recovery to complete each time before stopping the next OSD. For the next rack I shut down the first 3 OSDs on a host at the same time, waited for recovery to complete, then shut down the last OSD on that host. For the remaining racks, I shut down all the OSDs on the hosts in the rack at the same time. Right now I'm waiting for recovery to complete after shutting down the third rack.

Once recovery completed after each phase so far, there were no degraded objects. So this is looking fairly solid to me so far. What do you think?

Thanks -- Jim

Jim Schutt (2):
  ceph: retry CRUSH map descent before retrying bucket
  ceph: retry CRUSH map descent from root if leaf is failed

 src/crush/mapper.c |   30 ++
 1 files changed, 22 insertions(+), 8 deletions(-)

--
1.7.8.2
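A compressed sketch of the failure/recovery loop described above, assuming the init-ceph script and osd numbering from that cluster (both placeholders here). The wait condition simply polls "ceph -s" until nothing is reported degraded, which is a heuristic rather than an official interface:

  wait_clean() {
      while ceph -s | grep -q degraded; do
          sleep 30
      done
  }
  for osd in 0 1 2 3; do               # first rack: one OSD at a time
      ./init-ceph stop osd.$osd        # path to init-ceph depends on the install
      wait_clean
  done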
[RFC PATCH 1/2] ceph: retry CRUSH map descent before retrying bucket
For the first few rejections or collisions, we'll retry the descent to keep objects spread across the cluster. After that, we'll fall back to exhaustive search of the bucket to avoid trying forever in the event a bucket has only a few in items and the hash doesn't do a good job of finding them. Signed-off-by: Jim Schutt jasc...@sandia.gov --- src/crush/mapper.c | 20 ++-- 1 files changed, 14 insertions(+), 6 deletions(-) diff --git a/src/crush/mapper.c b/src/crush/mapper.c index 8857577..e5dc950 100644 --- a/src/crush/mapper.c +++ b/src/crush/mapper.c @@ -350,8 +350,7 @@ static int crush_choose(const struct crush_map *map, reject = 1; goto reject; } - if (flocal = (in-size1) - flocal orig_tries) + if (ftotal = orig_tries)/* exhaustive bucket search */ item = bucket_perm_choose(in, x, r); else item = crush_bucket_choose(in, x, r); @@ -420,10 +419,19 @@ reject: if (reject || collide) { ftotal++; flocal++; - - if (collide flocal 3) - /* retry locally a few times */ - retry_bucket = 1; + /* +* For the first couple rejections or collisions, +* we'll retry the descent to keep objects spread +* across the cluster. After that, we'll fall back +* to exhaustive search of buckets to avoid trying +* forever in the event a bucket has only a few +* in items and the hash doesn't do a good job +* of finding them. Note that we need to retry +* descent during that phase so that multiple +* buckets can be exhaustively searched. +*/ + if (ftotal = orig_tries) + retry_descent = 1; else if (flocal = in-size + orig_tries) /* exhaustive bucket search */ retry_bucket = 1; -- 1.7.8.2 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 2/2] ceph: retry CRUSH map descent from root if leaf is failed
When an object is re-replicated after a leaf failure, the remapped replica ends up under the bucket that held the failed leaf. This causes uneven data distribution across the storage cluster, to the point that when all the leaves of a bucket but one fail, that remaining leaf holds all the data from its failed peers. For example, consider the crush rule step chooseleaf firstn 0 type node_type This rule means that n replicas will be chosen in such a manner that each chosen leaf's branch will contain a unique instance of node_type. For such ops, the tree descent has two steps: call them the inner and outer descent. If the tree descent down to node_type is the outer descent, and the descent from node_type down to a leaf is the inner descent, the issue is that a down leaf is detected on the inner descent, but we want to retry the outer descent. This ensures that re-replication after a leaf failure disperses the re-replicated objects as widely as possible across the storage cluster. Fix this by causing the inner descent to return immediately on choosing a failed leaf, unless we've fallen back to exhaustive search. Note that after this change, for a chooseleaf rule, if the primary OSD in a placement group has failed, choosing a replacement may result in one of the other OSDs in the PG colliding with the new primary. This requires that OSD's data for that PG to need moving as well. This seems unavoidable but should be relatively rare. Signed-off-by: Jim Schutt jasc...@sandia.gov --- src/crush/mapper.c | 12 +--- 1 files changed, 9 insertions(+), 3 deletions(-) diff --git a/src/crush/mapper.c b/src/crush/mapper.c index e5dc950..698da55 100644 --- a/src/crush/mapper.c +++ b/src/crush/mapper.c @@ -286,6 +286,7 @@ static int is_out(const struct crush_map *map, const __u32 *weight, int item, in * @param outpos our position in that vector * @param firstn true if choosing first n items, false if choosing indep * @param recurseto_leaf: true if we want one device under each item of given type + * @param descend_once true if we should only try one descent before giving up * @param out2 second output vector for leaf items (if @a recurse_to_leaf) */ static int crush_choose(const struct crush_map *map, @@ -293,7 +294,7 @@ static int crush_choose(const struct crush_map *map, const __u32 *weight, int x, int numrep, int type, int *out, int outpos, - int firstn, int recurse_to_leaf, + int firstn, int recurse_to_leaf, int descend_once, int *out2) { int rep; @@ -397,6 +398,7 @@ static int crush_choose(const struct crush_map *map, x, outpos+1, 0, out2, outpos, firstn, 0, +ftotal orig_tries, NULL) = outpos) /* didn't get leaf */ reject = 1; @@ -430,7 +432,10 @@ reject: * descent during that phase so that multiple * buckets can be exhaustively searched. 
*/ - if (ftotal = orig_tries) + if (reject descend_once) + /* let outer call try again */ + skip_rep = 1; + else if (ftotal = orig_tries) retry_descent = 1; else if (flocal = in-size + orig_tries) /* exhaustive bucket search */ @@ -491,6 +496,7 @@ int crush_do_rule(const struct crush_map *map, int i, j; int numrep; int firstn; + const int descend_once = 0; if ((__u32)ruleno = map-max_rules) { dprintk( bad ruleno %d\n, ruleno); @@ -550,7 +556,7 @@ int crush_do_rule(const struct crush_map *map, curstep-arg2, o+osize, j, firstn, - recurse_to_leaf, c+osize); + recurse_to_leaf, descend_once, c+osize); } if (recurse_to_leaf) -- 1.7.8.2 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at
Re: converting btrfs osds to xfs?
After I run the 'ceph osd out 123' command, is there a specific ceph command I can poll so I know when it's OK to kill the OSD daemon and begin the reformat process?

On Tue, May 8, 2012 at 12:38 PM, Sage Weil <s...@newdream.net> wrote:
> On Tue, 8 May 2012, Tommi Virtanen wrote:
>> On Tue, May 8, 2012 at 8:39 AM, Nick Bartos <n...@pistoncloud.com> wrote:
>>> I am considering converting some OSDs to xfs (currently running btrfs)
>>> for stability reasons. I have a couple of ideas for doing this, and was
>>> hoping to get some comments:
>>>
>>> Method #1:
>>> 1. Check cluster health and make sure data on a specific OSD is
>>>    replicated elsewhere.
>>> 2. Bring down the OSD
>>> 3. Reformat it to xfs
>>> 4. Restart OSD
>>> 5. Repeat 1-4 until all btrfs OSDs have been converted.
>>> ...
>>> Obviously #1 seems much more appetizing, but unfortunately I can't seem
>>> to find out how to verify that data on a specific OSD is replicated
>>> elsewhere. I could go off general cluster health, but that seems more
>>> error prone.
>>
>> You can set the osd weight in crush to 0 and wait for the files inside
>> the osd data dir to disappear. If you want to control how much
>
> You can also just mark the osd 'out' without touching the CRUSH map;
> that'll be easier and a bit more efficient wrt data movement:
>
>   ceph osd out 123
>
> When the one comes back, you'll need to
>
>   ceph osd in 123
>
> sage
[PATCH 0/3] ceph: messenger: read_partial() cleanups
This short series adds the use of read_partial() in a few places where it is not already used. It also gets rid of the in/out "to" argument (which continues to cause confusion every time I see it), using an in-only "end" argument in its place.

					-Alex
Re: converting btrfs osds to xfs?
On Thu, May 10, 2012 at 3:44 PM, Nick Bartos <n...@pistoncloud.com> wrote:
> After I run the 'ceph osd out 123' command, is there a specific ceph
> command I can poll so I know when it's OK to kill the OSD daemon and
> begin the reformat process?

Good question! "ceph -s" will show you that. This is from a run where I ran "ceph osd out 1" on a cluster of 3 osds. See the active+clean counts going up, the active+recovering counts going down, and the degraded percentage dropping. The last line is an example of an "all done" situation.

2012-05-10 17:19:47.376864    pg v144: 24 pgs: 14 active+clean, 10 active+recovering; 180 MB data, 100217 MB used, 173 GB / 285 GB avail; 88/132 degraded (66.667%)
2012-05-10 17:19:59.220607    pg v146: 24 pgs: 19 active+clean, 5 active+recovering; 180 MB data, 100227 MB used, 173 GB / 285 GB avail; 24/132 degraded (18.182%)
2012-05-10 17:20:16.522978    pg v148: 24 pgs: 24 active+clean; 180 MB data, 100146 MB used, 173 GB / 285 GB avail

If you want a lower-level double-check, you can peek inside the osd data directory and see that the "current" subdirectory has no *_head entries, du is low, etc.
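The lower-level check mentioned above might look roughly like this sketch; the data directory path is whatever "osd data" is set to in ceph.conf, and the value below is just a placeholder:

  OSD_DATA=/ceph0        # placeholder - use the "osd data" path from ceph.conf
  du -sh "$OSD_DATA/current"
  find "$OSD_DATA/current" -maxdepth 1 -name '*_head*' | wc -l   # expect 0 once drained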
Re: converting btrfs osds to xfs?
On Thu, May 10, 2012 at 5:23 PM, Tommi Virtanen <t...@inktank.com> wrote:
> Good question! "ceph -s" will show you that. This is from a run where I
> ran "ceph osd out 1" on a cluster of 3 osds. See the active+clean counts
> going up, the active+recovering counts going down, and the degraded
> percentage dropping. The last line is an example of an "all done"
> situation.

Oh, and if your cluster is busy enough that there's always some rebalancing going on, you might never get to 100% active+clean. In that case, I do believe that "ceph pg dump" probably contains all the information needed, and --format=json makes it parseable, but it's just not currently documented. We really should provide a good way of accessing that information. I filed http://tracker.newdream.net/issues/2394 to keep track of this task.
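Putting the two replies together, a polling loop that waits for the cluster to settle after "ceph osd out 123" might look like the sketch below. It only greps the "ceph -s" output shown earlier, so it is a heuristic rather than an official interface:

  while true; do
      status=$(ceph -s)
      if ! echo "$status" | grep -Eq 'recovering|degraded'; then
          echo "all PGs active+clean - safe to stop the osd daemon"
          break
      fi
      sleep 10
  done
  # Machine-readable alternative mentioned above (undocumented at the time):
  #   ceph pg dump --format=json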
[PATCH 1/3] ceph: messenger: use read_partial() in read_partial_message()
There are two blocks of code in read_partial_message()--those that read the header and footer of the message--that can be replaced by a call to read_partial(). Do that. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c | 30 ++ 1 file changed, 10 insertions(+), 20 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index f0993af..673133e 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -1628,7 +1628,7 @@ static int read_partial_message(struct ceph_connection *con) { struct ceph_msg *m = con-in_msg; int ret; - int to, left; + int to; unsigned front_len, middle_len, data_len; bool do_datacrc = !con-msgr-nocrc; int skip; @@ -1638,15 +1638,10 @@ static int read_partial_message(struct ceph_connection *con) dout(read_partial_message con %p msg %p\n, con, m); /* header */ - while (con-in_base_pos sizeof(con-in_hdr)) { - left = sizeof(con-in_hdr) - con-in_base_pos; - ret = ceph_tcp_recvmsg(con-sock, - (char *)con-in_hdr + con-in_base_pos, - left); - if (ret = 0) - return ret; - con-in_base_pos += ret; - } + to = 0; + ret = read_partial(con, to, sizeof (con-in_hdr), con-in_hdr); + if (ret = 0) + return ret; crc = crc32c(0, con-in_hdr, offsetof(struct ceph_msg_header, crc)); if (cpu_to_le32(crc) != con-in_hdr.crc) { @@ -1759,16 +1754,11 @@ static int read_partial_message(struct ceph_connection *con) } /* footer */ - to = sizeof(m-hdr) + sizeof(m-footer); - while (con-in_base_pos to) { - left = to - con-in_base_pos; - ret = ceph_tcp_recvmsg(con-sock, (char *)m-footer + - (con-in_base_pos - sizeof(m-hdr)), - left); - if (ret = 0) - return ret; - con-in_base_pos += ret; - } + to = sizeof (m-hdr); + ret = read_partial(con, to, sizeof (m-footer), m-footer); + if (ret = 0) + return ret; + dout(read_partial_message got msg %p %d (%u) + %d (%u) + %d (%u)\n, m, front_len, m-footer.front_crc, middle_len, m-footer.middle_crc, data_len, m-footer.data_crc); -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/3] ceph: messenger: update to in read_partial() caller
read_partial() always increases whatever to value is supplied by adding the requested size to it. That's the only thing it does with that pointed-to value. Do that pointer advance in the caller (and then only when the updated value will be subsequently used), and change the to parameter to be an in-only and non-pointer value. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c | 31 --- 1 file changed, 16 insertions(+), 15 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 673133e..37fd2ae 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -992,11 +992,12 @@ static int prepare_read_message(struct ceph_connection *con) static int read_partial(struct ceph_connection *con, - int *to, int size, void *object) + int to, int size, void *object) { - *to += size; - while (con-in_base_pos *to) { - int left = *to - con-in_base_pos; + int end = to + size; + + while (con-in_base_pos end) { + int left = end - con-in_base_pos; int have = size - left; int ret = ceph_tcp_recvmsg(con-sock, object + have, left); if (ret = 0) @@ -1017,14 +1018,16 @@ static int read_partial_banner(struct ceph_connection *con) dout(read_partial_banner %p at %d\n, con, con-in_base_pos); /* peer's banner */ - ret = read_partial(con, to, strlen(CEPH_BANNER), con-in_banner); + ret = read_partial(con, to, strlen(CEPH_BANNER), con-in_banner); if (ret = 0) goto out; - ret = read_partial(con, to, sizeof(con-actual_peer_addr), + to += strlen(CEPH_BANNER); + ret = read_partial(con, to, sizeof(con-actual_peer_addr), con-actual_peer_addr); if (ret = 0) goto out; - ret = read_partial(con, to, sizeof(con-peer_addr_for_me), + to += sizeof(con-actual_peer_addr); + ret = read_partial(con, to, sizeof(con-peer_addr_for_me), con-peer_addr_for_me); if (ret = 0) goto out; @@ -1038,10 +1041,11 @@ static int read_partial_connect(struct ceph_connection *con) dout(read_partial_connect %p at %d\n, con, con-in_base_pos); - ret = read_partial(con, to, sizeof(con-in_reply), con-in_reply); + ret = read_partial(con, to, sizeof(con-in_reply), con-in_reply); if (ret = 0) goto out; - ret = read_partial(con, to, le32_to_cpu(con-in_reply.authorizer_len), + to += sizeof(con-in_reply); + ret = read_partial(con, to, le32_to_cpu(con-in_reply.authorizer_len), con-auth_reply_buf); if (ret = 0) goto out; @@ -1491,9 +1495,7 @@ static int process_connect(struct ceph_connection *con) */ static int read_partial_ack(struct ceph_connection *con) { - int to = 0; - - return read_partial(con, to, sizeof(con-in_temp_ack), + return read_partial(con, 0, sizeof(con-in_temp_ack), con-in_temp_ack); } @@ -1638,8 +1640,7 @@ static int read_partial_message(struct ceph_connection *con) dout(read_partial_message con %p msg %p\n, con, m); /* header */ - to = 0; - ret = read_partial(con, to, sizeof (con-in_hdr), con-in_hdr); + ret = read_partial(con, 0, sizeof (con-in_hdr), con-in_hdr); if (ret = 0) return ret; @@ -1755,7 +1756,7 @@ static int read_partial_message(struct ceph_connection *con) /* footer */ to = sizeof (m-hdr); - ret = read_partial(con, to, sizeof (m-footer), m-footer); + ret = read_partial(con, to, sizeof (m-footer), m-footer); if (ret = 0) return ret; -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/3] ceph: messenger: change read_partial() to take end arg
Make the second argument to read_partial() be the ending input byte position rather than the beginning offset it now represents. This amounts to moving the addition to + size into the caller. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c | 59 -- 1 file changed, 38 insertions(+), 21 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 37fd2ae..364c902 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -992,10 +992,8 @@ static int prepare_read_message(struct ceph_connection *con) static int read_partial(struct ceph_connection *con, - int to, int size, void *object) + int end, int size, void *object) { - int end = to + size; - while (con-in_base_pos end) { int left = end - con-in_base_pos; int have = size - left; @@ -1013,40 +1011,52 @@ static int read_partial(struct ceph_connection *con, */ static int read_partial_banner(struct ceph_connection *con) { - int ret, to = 0; + int size; + int end; + int ret; dout(read_partial_banner %p at %d\n, con, con-in_base_pos); /* peer's banner */ - ret = read_partial(con, to, strlen(CEPH_BANNER), con-in_banner); + size = strlen(CEPH_BANNER); + end = size; + ret = read_partial(con, end, size, con-in_banner); if (ret = 0) goto out; - to += strlen(CEPH_BANNER); - ret = read_partial(con, to, sizeof(con-actual_peer_addr), - con-actual_peer_addr); + + size = sizeof (con-actual_peer_addr); + end += size; + ret = read_partial(con, end, size, con-actual_peer_addr); if (ret = 0) goto out; - to += sizeof(con-actual_peer_addr); - ret = read_partial(con, to, sizeof(con-peer_addr_for_me), - con-peer_addr_for_me); + + size = sizeof (con-peer_addr_for_me); + end += size; + ret = read_partial(con, end, size, con-peer_addr_for_me); if (ret = 0) goto out; + out: return ret; } static int read_partial_connect(struct ceph_connection *con) { - int ret, to = 0; + int size; + int end; + int ret; dout(read_partial_connect %p at %d\n, con, con-in_base_pos); - ret = read_partial(con, to, sizeof(con-in_reply), con-in_reply); + size = sizeof (con-in_reply); + end = size; + ret = read_partial(con, end, size, con-in_reply); if (ret = 0) goto out; - to += sizeof(con-in_reply); - ret = read_partial(con, to, le32_to_cpu(con-in_reply.authorizer_len), - con-auth_reply_buf); + + size = le32_to_cpu(con-in_reply.authorizer_len); + end += size; + ret = read_partial(con, end, size, con-auth_reply_buf); if (ret = 0) goto out; @@ -1495,8 +1505,10 @@ static int process_connect(struct ceph_connection *con) */ static int read_partial_ack(struct ceph_connection *con) { - return read_partial(con, 0, sizeof(con-in_temp_ack), - con-in_temp_ack); + int size = sizeof (con-in_temp_ack); + int end = size; + + return read_partial(con, end, size, con-in_temp_ack); } @@ -1629,6 +1641,8 @@ static int read_partial_message_bio(struct ceph_connection *con, static int read_partial_message(struct ceph_connection *con) { struct ceph_msg *m = con-in_msg; + int size; + int end; int ret; int to; unsigned front_len, middle_len, data_len; @@ -1640,7 +1654,9 @@ static int read_partial_message(struct ceph_connection *con) dout(read_partial_message con %p msg %p\n, con, m); /* header */ - ret = read_partial(con, 0, sizeof (con-in_hdr), con-in_hdr); + size = sizeof (con-in_hdr); + end = size; + ret = read_partial(con, end, size, con-in_hdr); if (ret = 0) return ret; @@ -1755,8 +1771,9 @@ static int read_partial_message(struct ceph_connection *con) } /* footer */ - to = sizeof (m-hdr); - ret = read_partial(con, to, sizeof (m-footer), m-footer); + size = sizeof (m-footer); 
+ end += size; + ret = read_partial(con, end, size, m-footer); if (ret = 0) return ret; -- 1.7.9.5 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html