Always creating PGs

2012-05-10 Thread Tomoki BENIYA
Hi, all

I created a Ceph file system using 64-bit Debian 7.

I found that PGs in the 'creating' state never finish.

# ceph pg stat
v17596: 1204 pgs: 8 creating, 1196 active+clean; 25521 MB data, 77209 MB used, 2223 GB / 2318 GB avail
                  ~~~~~~~~~~
                  always creating, why?

# ceph pg dump|grep creating
1.1p6   0   0   0   0   0   0   0   creating    0.00    0'0 0'0 []  []  0'0 0.00
0.1p7   0   0   0   0   0   0   0   creating    0.00    0'0 0'0 []  []  0'0 0.00
1.1p7   0   0   0   0   0   0   0   creating    0.00    0'0 0'0 []  []  0'0 0.00
0.1p6   0   0   0   0   0   0   0   creating    0.00    0'0 0'0 []  []  0'0 0.00
1.1p8   0   0   0   0   0   0   0   creating    0.00    0'0 0'0 []  []  0'0 0.00
0.1p9   0   0   0   0   0   0   0   creating    0.00    0'0 0'0 []  []  0'0 0.00
1.1p9   0   0   0   0   0   0   0   creating    0.00    0'0 0'0 []  []  0'0 0.00
0.1p8   0   0   0   0   0   0   0   creating    0.00    0'0 0'0 []  []  0'0 0.00

Is my configuration wrong?
How can I resolve this problem?


[global]
auth supported = cephx
max open files = 131072
log_to_syslog = true
pid file = /var/run/ceph/$name.pid
keyring = /etc/ceph/keyring.bin

[mon]
mon data = /ceph/$name

[mon.0]
host = mon0
mon addr = 192.168.233.81:6789

[mon.1]
host = mon1
mon addr = 192.168.233.82:6789

[mon.2]
host = mon2
mon addr = 192.168.233.83:6789

[mds]
keyring = /etc/ceph/keyring.$name

[mds.0]
host = mds0

[osd]
osd journal = /var/journal
osd journal size = 1000
keyring = /etc/ceph/keyring.$name

[osd.0]
host = osd0
btrfs devs = /dev/sdb1
osd data = /ceph0
osd journal = /var/journal0

[osd.1]
host = osd0
btrfs devs = /dev/sdc1
osd data = /ceph1
osd journal = /var/journal1

[osd.2]
host = osd1
btrfs devs = /dev/sdb1
osd data = /ceph0
osd journal = /var/journal0

[osd.3]
host = osd1
btrfs devs = /dev/sdc1
osd data = /ceph1
osd journal = /var/journal1

[osd.4]
host = osd2
btrfs devs = /dev/sdb1
osd data = /ceph0
osd journal = /var/journal0

[osd.5]
host = osd2
btrfs devs = /dev/sdc1
osd data = /ceph1
osd journal = /var/journal1

[osd.6]
host = osd3
btrfs devs = /dev/sdb1
osd data = /ceph0
osd journal = /var/journal0

[osd.7]
host = osd3
btrfs devs = /dev/sdc1
osd data = /ceph1
osd journal = /var/journal1

[osd.8]
host = osd4
btrfs devs = /dev/sdb1
osd data = /ceph0
osd journal = /var/journal0

[osd.9]
host = osd4
btrfs devs = /dev/sdc1
osd data = /ceph1
osd journal = /var/journal1


Thanks.

-- 
Tomoki BENIYA ben...@bit-isle.co.jp



slow performance even when using SSDs

2012-05-10 Thread Stefan Priebe - Profihost AG
Dear List,

I'm doing a test setup with ceph v0.46 and wanted to know how fast Ceph is.

my testsetup:
3 servers with Intel Xeon X3440, 180GB SSD Intel 520 Series, 4GB RAM, 2x
1Gbit/s LAN each

All 3 are running as mon a-c and osd 0-2. Two of them are also running
as mds.2 and mds.3 (has 8GB RAM instead of 4GB).

All machines run ceph v0.46 and vanilla Linux Kernel v3.0.30 and all of
them use btrfs on the ssd which serves /srv/{osd,mon}.X. All of them use
eth0+eth1 as bond0 (mode 6).

This gives me:
rados -p rbd bench 60 write

...
Total time run:61.465323
Total writes made: 776
Write size:4194304
Bandwidth (MB/sec):50.500

Average Latency:   1.2654
Max latency:   2.77124
Min latency:   0.170936

Shouldn't it be at least 100MB/s? (1Gbit/s / 8)
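(A rough sanity check, under the assumption that the pools still use the
default 2x replication and that each journal lives on the same SSD as its
osd data: a single 1 Gbit/s link is ~125 MB/s raw, every client write is
forwarded to a second OSD over the same bonded links, and each OSD writes
every byte twice (journal + data). With that write amplification, ~50 MB/s
of client throughput for 4 MB writes is not obviously broken; the 4k case
is latency-bound and a separate story.)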

And rados -p rbd bench 60 write -b 4096 gives pretty bad results:
Total time run:60.221130
Total writes made: 6401
Write size:4096
Bandwidth (MB/sec):0.415

Average Latency:   0.150525
Max latency:   1.12647
Min latency:   0.026599

All btrfs ssds are also mounted with noatime.

Thanks for your help!

Greets Stefan


Re: slow performance even when using SSDs

2012-05-10 Thread Stefan Priebe - Profihost AG
OK, here are some retests. I had the SSDs connected to an old RAID
controller, even though I was using them as JBODs (oops).

Here are two new tests (using kernel 3.4-rc6); it would be great if
someone could tell me whether they look fine or bad.

New tests with all 3 SSDs connected to the mainboard.

#~ rados -p rbd bench 60 write
Total time run:60.342419
Total writes made: 2021
Write size:4194304
Bandwidth (MB/sec):133.969

Average Latency:   0.477476
Max latency:   0.942029
Min latency:   0.109467

#~ rados -p rbd bench 60 write -b 4096
Total time run:60.726326
Total writes made: 59026
Write size:4096
Bandwidth (MB/sec):3.797

Average Latency:   0.016459
Max latency:   0.874841
Min latency:   0.002392

Another test with only osd on the disk and the journal in memory / tmpfs:
#~ rados -p rbd bench 60 write
Total time run:60.513240
Total writes made: 2555
Write size:4194304
Bandwidth (MB/sec):168.889

Average Latency:   0.378775
Max latency:   4.59233
Min latency:   0.055179

#~ rados -p rbd bench 60 write -b 4096
Total time run:60.116260
Total writes made: 281903
Write size:4096
Bandwidth (MB/sec):18.318

Average Latency:   0.00341067
Max latency:   0.720486
Min latency:   0.000602

Another problem I have is that I'm always getting:
2012-05-10 15:05:22.140027 mon.0 192.168.0.100:6789/0 19 : [WRN]
message from mon.2 was stamped 0.109244s in the future, clocks not
synchronized

even though ntp is running fine on all systems.
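(A quick way to double-check, since ntpq ships with the ntp daemon: run the
following on each mon host and compare the offset column, which is in
milliseconds; if I remember right the monitors start warning once the skew
exceeds roughly 0.05s.)

# ntpq -p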

Stefan

Am 10.05.2012 14:09, schrieb Stefan Priebe - Profihost AG:
 Dear List,
 
 i'm doing a testsetup with ceph v0.46 and wanted to know how fast ceph is.
 
 my testsetup:
 3 servers with Intel Xeon X3440, 180GB SSD Intel 520 Series, 4GB RAM, 2x
 1Gbit/s LAN each
 
 All 3 are running as mon a-c and osd 0-2. Two of them are also running
 as mds.2 and mds.3 (has 8GB RAM instead of 4GB).
 
 All machines run ceph v0.46 and vanilla Linux Kernel v3.0.30 and all of
 them use btrfs on the ssd which serves /srv/{osd,mon}.X. All of them use
 eth0+eth1 as bond0 (mode 6).
 
 This gives me:
 rados -p rbd bench 60 write
 
 ...
 Total time run:61.465323
 Total writes made: 776
 Write size:4194304
 Bandwidth (MB/sec):50.500
 
 Average Latency:   1.2654
 Max latency:   2.77124
 Min latency:   0.170936
 
 Shouldn't it be at least 100MB/s? (1Gbit/s / 8)
 
 And rados -p rbd bench 60 write -b 4096 gives pretty bad results:
 Total time run:60.221130
 Total writes made: 6401
 Write size:4096
 Bandwidth (MB/sec):0.415
 
 Average Latency:   0.150525
 Max latency:   1.12647
 Min latency:   0.026599
 
 All btrfs ssds are also mounted with noatime.
 
 Thanks for your help!
 
 Greets Stefan


Designing a cluster guide

2012-05-10 Thread Stefan Priebe - Profihost AG
Hi,

the Designing a cluster guide
http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
still leaves some questions unanswered.

It mentions, for example, a "Fast CPU" for the MDS system. What does fast
mean? Just the speed of one core, or is Ceph designed to use multiple cores?
Which is more important, more cores or higher clock speed?

The Cluster Design Recommendations mention separating all daemons onto
dedicated machines. Is this also useful for the MONs? As they're so
lightweight, why not run them on the OSD nodes?

Regarding the OSDs: is it fine to use an SSD RAID 1 for the journal and
perhaps 22x SATA disks in a RAID 10 for the FS, or is that quite absurd
and you should go for 22x SSDs in a RAID 6 instead? And is it more useful
to use a hardware RAID 6 controller or btrfs RAID?

Should the OSDs use a single-socket Xeon or dual socket?

Thanks and greets
Stefan


[PATCH 2/2] ceph: add tracepoints for message send queueing and completion, reply handling

2012-05-10 Thread Jim Schutt

Signed-off-by: Jim Schutt jasc...@sandia.gov
---
 include/trace/events/ceph.h |   67 +++
 net/ceph/messenger.c|9 +-
 net/ceph/osd_client.c   |1 +
 3 files changed, 76 insertions(+), 1 deletions(-)

diff --git a/include/trace/events/ceph.h b/include/trace/events/ceph.h
index 182af2c..b390f78 100644
--- a/include/trace/events/ceph.h
+++ b/include/trace/events/ceph.h
@@ -4,6 +4,9 @@
 #if !defined(_TRACE_CEPH_H) || defined(TRACE_HEADER_MULTI_READ)
 #define _TRACE_CEPH_H
 
+#if defined(CEPH_TRACE_FS_FILE) \
+ || defined(CEPH_TRACE_FS_ADDR) \
+ || defined(CEPH_TRACE_NET_OSDC)
 
 #if !defined(TRACE_HEADER_MULTI_READ)
 
@@ -68,8 +71,72 @@ DEFINE_CEPH_START_REQ_EVENT(ceph_async_readpages_req);
 #ifdef CEPH_TRACE_NET_OSDC
 DEFINE_CEPH_START_REQ_EVENT(ceph_osdc_writepages_req);
 DEFINE_CEPH_START_REQ_EVENT(ceph_osdc_readpages_req);
+
+TRACE_EVENT(ceph_handle_reply_msg,
+   TP_PROTO(struct ceph_connection *con,
+struct ceph_msg *msg,
+struct ceph_osd_reply_head *reply,
+void *req),
+   TP_ARGS(con, msg, reply, req),
+   TP_STRUCT__entry(
+   __field(unsigned long long, tid)
+   __field(long long, peer_num)
+   __field(void*, req)
+   __field(unsigned, peer_type)
+   __field(int, result)
+   __field(int, flags)
+   ),
+   TP_fast_assign(
+   __entry->tid = le64_to_cpu(msg->hdr.tid);
+   __entry->peer_num = le64_to_cpu(con->peer_name.num);
+   __entry->peer_type = con->peer_name.type;
+   __entry->req = req;
+   __entry->result = le32_to_cpu(reply->result);
+   __entry->flags = le32_to_cpu(reply->flags);
+   ),
+   TP_printk("peer %s%lld tid %llu result %d flags 0x%08x (req %p)",
+ ceph_entity_type_name(__entry->peer_type), __entry->peer_num,
+ __entry->tid, __entry->result, __entry->flags,  __entry->req
+   )
+);
 #endif
 
+#endif /* CEPH_TRACE_FS_FILE || CEPH_TRACE_FS_ADDR || CEPH_TRACE_NET_OSDC */
+
+#if defined(CEPH_TRACE_NET_MESSENGER)
+
+DECLARE_EVENT_CLASS(ceph_write_msg_class,
+   TP_PROTO(struct ceph_connection *con, struct ceph_msg *msg),
+   TP_ARGS(con, msg),
+   TP_STRUCT__entry(
+   __field(unsigned long long, tid)
+   __field(unsigned long long, seq)
+   __field(long long, peer_num)
+   __field(unsigned, peer_type)
+   __field(int, sent)
+   ),
+   TP_fast_assign(
+   __entry->tid = le64_to_cpu(msg->hdr.tid);
+   __entry->seq = le64_to_cpu(msg->hdr.seq);
+   __entry->peer_num = le64_to_cpu(con->peer_name.num);
+   __entry->peer_type = con->peer_name.type;
+   __entry->sent = con->out_msg_pos.data_pos;
+   ),
+   TP_printk("peer %s%lld tid %llu seq %llu sent %d",
+ ceph_entity_type_name(__entry->peer_type), __entry->peer_num,
+ __entry->tid, __entry->seq, __entry->sent)
+);
+
+#define CEPH_WRITE_MSG_EVENT(name) \
+DEFINE_EVENT(ceph_write_msg_class, name, \
+   TP_PROTO(struct ceph_connection *con, struct ceph_msg *msg), \
+   TP_ARGS(con, msg))
+
+CEPH_WRITE_MSG_EVENT(ceph_prepare_write_msg);
+CEPH_WRITE_MSG_EVENT(ceph_try_write_msg);
+CEPH_WRITE_MSG_EVENT(ceph_try_write_msg_done);
+
+#endif /* CEPH_TRACE_NET_MESSENGER */
 
 #endif /* _TRACE_CEPH_H */
 
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index ad5b708..033c4ab 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -20,6 +20,11 @@
 #include <linux/ceph/pagelist.h>
 #include <linux/export.h>
 
+#include <linux/tracepoint.h>
+#define CREATE_TRACE_POINTS
+#define CEPH_TRACE_NET_MESSENGER
+#include <trace/events/ceph.h>
+
 /*
  * Ceph uses the messenger to exchange ceph_msg messages with other
  * hosts in the system.  The messenger provides ordered and reliable
@@ -555,7 +560,7 @@ static void prepare_write_message(struct ceph_connection *con)
/* no, queue up footer too and be done */
prepare_write_message_footer(con, v);
}
-
+   trace_ceph_prepare_write_msg(con, m);
 set_bit(WRITE_PENDING, &con->state);
 }
 
@@ -1853,8 +1858,10 @@ more_kvec:
 
/* msg pages? */
 if (con->out_msg) {
+   trace_ceph_try_write_msg(con, con->out_msg);
 if (con->out_msg_done) {
 ceph_msg_put(con->out_msg);
+   trace_ceph_try_write_msg_done(con, con->out_msg);
 con->out_msg = NULL;   /* we're done with this one */
goto do_next;
}
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index f44e400..767e253 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -1200,6 +1200,7 @@ static void handle_reply(struct ceph_osd_client *osdc, struct ceph_msg *msg,
/* lookup */

[PATCH 1/2] ceph: add tracepoints for message submission on read/write requests

2012-05-10 Thread Jim Schutt
Trace callers of ceph_osdc_start_request, so that call locations
are identified implicitly.

Put the tracepoints after calls to ceph_osdc_start_request,
since it fills in the request transaction ID and request OSD.

Signed-off-by: Jim Schutt jasc...@sandia.gov
---
 fs/ceph/addr.c  |8 
 fs/ceph/file.c  |6 +++
 include/trace/events/ceph.h |   77 +++
 net/ceph/osd_client.c   |7 
 4 files changed, 98 insertions(+), 0 deletions(-)
 create mode 100644 include/trace/events/ceph.h

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 173b1d2..f552579 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -13,6 +13,12 @@
 #include "mds_client.h"
 #include <linux/ceph/osd_client.h>
 
+
+#include <linux/tracepoint.h>
+#define CREATE_TRACE_POINTS
+#define CEPH_TRACE_FS_ADDR
+#include <trace/events/ceph.h>
+
 /*
  * Ceph address space ops.
  *
@@ -338,6 +344,7 @@ static int start_read(struct inode *inode, struct list_head *page_list, int max)
 
 dout("start_read %p starting %p %lld~%lld\n", inode, req, off, len);
 ret = ceph_osdc_start_request(osdc, req, false);
+   trace_ceph_async_readpages_req(req);
 if (ret < 0)
goto out_pages;
ceph_osdc_put_request(req);
@@ -902,6 +909,7 @@ get_more_pages:
 req->r_request->hdr.data_len = cpu_to_le32(len);
 
 rc = ceph_osdc_start_request(&fsc->client->osdc, req, true);
+   trace_ceph_async_writepages_req(req);
BUG_ON(rc);
req = NULL;
 
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index ed72428..fc31def 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -10,6 +10,11 @@
 #include "super.h"
 #include "mds_client.h"
 
+#include <linux/tracepoint.h>
+#define CREATE_TRACE_POINTS
+#define CEPH_TRACE_FS_FILE
+#include <trace/events/ceph.h>
+
 /*
  * Ceph file operations
  *
@@ -568,6 +573,7 @@ more:
 req->r_inode = inode;
 
 ret = ceph_osdc_start_request(&fsc->client->osdc, req, false);
+   trace_ceph_sync_writepages_req(req);
 if (!ret) {
 if (req->r_safe_callback) {
/*
diff --git a/include/trace/events/ceph.h b/include/trace/events/ceph.h
new file mode 100644
index 000..182af2c
--- /dev/null
+++ b/include/trace/events/ceph.h
@@ -0,0 +1,77 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM ceph
+
+#if !defined(_TRACE_CEPH_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_CEPH_H
+
+
+#if !defined(TRACE_HEADER_MULTI_READ)
+
+static __always_inline int
+__ceph_req_num_ops(struct ceph_osd_request *req)
+{
+   struct ceph_osd_request_head *reqhead = req->r_request->front.iov_base;
+   return le16_to_cpu(reqhead->num_ops);
+}
+
+static __always_inline int
+__ceph_req_op_opcode(struct ceph_osd_request *req, int op)
+{
+   struct ceph_osd_request_head *reqhead = req->r_request->front.iov_base;
+   if (op < le16_to_cpu(reqhead->num_ops))
+   return le16_to_cpu(reqhead->ops[op].op);
+   else
+   return 0;
+}
+
+#endif
+
+DECLARE_EVENT_CLASS(ceph_start_req_class,
+   TP_PROTO(struct ceph_osd_request *req),
+   TP_ARGS(req),
+   TP_STRUCT__entry(
+   __field(unsigned long long, tid)
+   __field(int, osd)
+   __field(int, num_ops)
+   __array(unsigned, ops, 3)
+   __field(unsigned, pages)
+   ),
+   TP_fast_assign(
+   __entry->tid = req->r_tid;
+   __entry->osd = req->r_osd->o_osd;
+   __entry->num_ops = __ceph_req_num_ops(req);
+   __entry->ops[0] = __ceph_req_op_opcode(req, 0);
+   __entry->ops[1] = __ceph_req_op_opcode(req, 1);
+   __entry->ops[2] = __ceph_req_op_opcode(req, 2);
+   __entry->pages = req->r_num_pages;
+   ),
+   TP_printk("tid %llu osd%d ops %d 0x%04x/0x%04x/0x%04x pages %u",
+ __entry->tid, __entry->osd, __entry->num_ops,
+ __entry->ops[0], __entry->ops[1], __entry->ops[2],
+ __entry->pages
+   )
+);
+
+#define DEFINE_CEPH_START_REQ_EVENT(name) \
+DEFINE_EVENT(ceph_start_req_class, name, \
+   TP_PROTO(struct ceph_osd_request *req), TP_ARGS(req))
+
+#ifdef CEPH_TRACE_FS_FILE
+DEFINE_CEPH_START_REQ_EVENT(ceph_sync_writepages_req);
+#endif
+
+#ifdef CEPH_TRACE_FS_ADDR
+DEFINE_CEPH_START_REQ_EVENT(ceph_async_writepages_req);
+DEFINE_CEPH_START_REQ_EVENT(ceph_async_readpages_req);
+#endif
+
+#ifdef CEPH_TRACE_NET_OSDC
+DEFINE_CEPH_START_REQ_EVENT(ceph_osdc_writepages_req);
+DEFINE_CEPH_START_REQ_EVENT(ceph_osdc_readpages_req);
+#endif
+
+
+#endif /* _TRACE_CEPH_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 5e25405..f44e400 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -18,6 +18,11 @@
 #include <linux/ceph/auth.h>
 #include <linux/ceph/pagelist.h>
 
+#include <linux/tracepoint.h>
+#define 

[PATCH 0/2] Ceph tracepoints

2012-05-10 Thread Jim Schutt
Hi Alex,

I ran across tracker #2374 today - I've been carrying these two
tracepoint patches for a while.  Perhaps you'll find them useful.

Jim Schutt (2):
  ceph: add tracepoints for message submission on read/write requests
  ceph: add tracepoints for message send queueing and completion, reply
handling

 fs/ceph/addr.c  |8 +++
 fs/ceph/file.c  |6 ++
 include/trace/events/ceph.h |  144 +++
 net/ceph/messenger.c|9 +++-
 net/ceph/osd_client.c   |8 +++
 5 files changed, 174 insertions(+), 1 deletions(-)
 create mode 100644 include/trace/events/ceph.h

-- 
1.7.8.2




Re: [PATCH 0/2] Ceph tracepoints

2012-05-10 Thread Alex Elder

On 05/10/2012 09:35 AM, Jim Schutt wrote:

Hi Alex,

I ran across tracker #2374 today - I've been carrying these two
tracepoint patches for a while.  Perhaps you'll find them useful.


GREAT!

I haven't looked at them but I will as soon as I get the chance.
I don't expect there's any reason not to use this as the foundation
I was looking for.

Thanks a lot.

-Alex


Jim Schutt (2):
   ceph: add tracepoints for message submission on read/write requests
   ceph: add tracepoints for message send queueing and completion, reply
 handling

  fs/ceph/addr.c  |8 +++
  fs/ceph/file.c  |6 ++
  include/trace/events/ceph.h |  144 +++
  net/ceph/messenger.c|9 +++-
  net/ceph/osd_client.c   |8 +++
  5 files changed, 174 insertions(+), 1 deletions(-)
  create mode 100644 include/trace/events/ceph.h





Re: Ceph on btrfs 3.4rc

2012-05-10 Thread Josef Bacik
On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
 Am 24. April 2012 18:26 schrieb Sage Weil s...@newdream.net:
  On Tue, 24 Apr 2012, Josef Bacik wrote:
  On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
   After running ceph on XFS for some time, I decided to try btrfs again.
   Performance with the current for-linux-min branch and big metadata
   is much better. The only problem (?) I'm still seeing is a warning
   that seems to occur from time to time:
 
  Actually, before you do that... we have a new tool,
  test_filestore_workloadgen, that generates a ceph-osd-like workload on the
  local file system.  It's a subset of what a full OSD might do, but if
  we're lucky it will be sufficient to reproduce this issue.  Something like
 
   test_filestore_workloadgen --osd-data /foo --osd-journal /bar
 
  will hopefully do the trick.
 
  Christian, maybe you can see if that is able to trigger this warning?
  You'll need to pull it from the current master branch; it wasn't in the
  last release.
 
 Trying to reproduce with test_filestore_workloadgen didn't work for
 me. So here are some instructions on how to reproduce with a minimal
 ceph setup.
 
 You will need a single system with two disks and a bit of memory.
 
 - Compile and install ceph (detailed instructions:
 http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
 
 - For the test setup I've used two tmpfs files as journal devices. To
 create these, do the following:
 
 # mkdir -p /ceph/temp
 # mount -t tmpfs tmpfs /ceph/temp
 # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
 # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
 
 - Now you should create and mount btrfs. Here is what I did:
 
 # mkfs.btrfs -l 64k -n 64k /dev/sda
 # mkfs.btrfs -l 64k -n 64k /dev/sdb
 # mkdir /ceph/osd.000
 # mkdir /ceph/osd.001
 # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
 # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
 
 - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
 will probably have to change the btrfs devices and the hostname
 (os39).
 
 - Create the ceph filesystems:
 
 # mkdir /ceph/mon
 # mkcephfs -a -c /etc/ceph/ceph.conf
 
 - Start ceph (e.g. service ceph start)
 
 - Now you should be able to use ceph - ceph -s will tell you about
 the state of the ceph cluster.
 
 - rbd create -size 100 testimg will create an rbd image on the ceph cluster.
 
 - Compile my test with gcc -o rbdtest rbdtest.c -lrbd and run it
 with ./rbdtest testimg.
 
 I can see the first btrfs_orphan_commit_root warning after an hour or
 so... I hope that I've described all necessary steps. If there is a
 problem just send me a note.
 

Well, it's only taken me 2 weeks, but I've finally got it all up and running;
hopefully I'll reproduce.  Thanks,

Josef


Re: Ceph kernel client - kernel craches

2012-05-10 Thread Giorgos Kappes
Sorry for my late response. I reproduced the above bug with the Linux
kernel 3.3.4 and without using XEN:

uname -a
Linux node33 3.3.4 #1 SMP Wed May 9 13:00:07 EEST 2012 x86_64 GNU/Linux

The trace is shown below:


[  763.984023] kernel tried to execute NX-protected page - exploit
attempt? (uid: 0)
[  763.984177] BUG: unable to handle kernel paging request at 880037bd0800
[  763.984402] IP: [880037bd0800] 0x880037bd07ff
[  763.984568] PGD 1806063 PUD 180a063 PMD 800037a001e3
[  763.984845] Oops: 0011 [#1] SMP
[  763.985058] CPU 3
[  763.985124] Modules linked in: cbc netconsole loop snd_pcm
snd_timer snd soundcore snd_page_alloc processor tpm_tis i5400_edac
tpm edac_core tpm_bios evdev pcspkr i5k_amb rng_core thermal_sys
button shpchp pci_hotplug sd_mod crc_t10dif usbhid hid ide_cd_mod
cdrom ata_generic uhci_hcd ehci_hcd ata_piix libata piix ide_core
usbcore usb_common tg3 libphy mptsas mptscsih mptbase
scsi_transport_sas scsi_mod [last unloaded: scsi_wait_scan]
[  763.988002]
[  763.988002] Pid: 0, comm: swapper/3 Not tainted 3.3.4 #1 HP ProLiant DL160 G5
[  763.988002] RIP: 0010:[880037bd0800]  [880037bd0800]
0x880037bd07ff
[  763.988002] RSP: 0018:8800bfcc3e78  EFLAGS: 00010292
[  763.988002] RAX: 8800b97745b0 RBX: 8800bfcce770 RCX: 880037bd0800
[  763.988002] RDX: 880037bd1600 RSI: b9b6a040 RDI: 880037bd1600
[  763.988002] RBP: 81820080 R08: 8800b9dd0b00 R09: 00018020001c
[  763.988002] R10: 8020001c R11: 816075c0 R12: 8800bfcce7a0
[  763.988002] R13: 8800b97745b0 R14: 0003 R15: 000a
[  763.988002] FS:  () GS:8800bfcc()
knlGS:
[  763.988002] CS:  0010 DS:  ES:  CR0: 8005003b
[  763.988002] CR2: 880037bd0800 CR3: b895b000 CR4: 06e0
[  763.988002] DR0:  DR1:  DR2: 
[  763.988002] DR3:  DR6: 0ff0 DR7: 0400
[  763.988002] Process swapper/3 (pid: 0, threadinfo 8800bbae,
task 8800bbad8000)
[  763.988002] Stack:
[  763.988002]  8109b44d 8800bbacd820 8800b97745b0
8800bbae0010
[  763.988002]  8800bbad8000 8800bfcc3ea0 0048
8800bbae1fd8
[  763.988002]  0100 0001 0009
8800bbae1fd8
[  763.988002] Call Trace:
[  763.988002]  IRQ
[  763.988002]  [8109b44d] ? __rcu_process_callbacks+0x1e9/0x335
[  763.988002]  [8109b8fb] ? rcu_process_callbacks+0x2c/0x56
[  763.988002]  [8103e3b1] ? __do_softirq+0xc4/0x1a0
[  763.988002]  [8102515b] ? lapic_next_event+0x18/0x1d
[  763.988002]  [815d3b1c] ? call_softirq+0x1c/0x30
[  763.988002]  [8100fba3] ? do_softirq+0x3f/0x79
[  763.988002]  [8103e186] ? irq_exit+0x44/0xb1
[  763.988002]  [81025c61] ? smp_apic_timer_interrupt+0x85/0x93
[  763.988002]  [815d311e] ? apic_timer_interrupt+0x6e/0x80
[  763.988002]  EOI
[  763.988002]  [810145e1] ? native_sched_clock+0x28/0x33
[  763.988002]  [810152f6] ? mwait_idle+0x8c/0xbc
[  763.988002]  [810152ae] ? mwait_idle+0x44/0xbc
[  763.988002]  [8100de94] ? cpu_idle+0xb9/0xf7
[  763.988002]  [815c43c6] ? start_secondary+0x270/0x275
[  763.988002] Code: 00 00 00 00 04 8a b8 00 88 ff ff 00 04 8a b8 00
88 ff ff 00 03 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 16 bd 37 00 88 ff ff 40 ab cd bf 00 88 ff ff 20 15 42
b9 00
[  763.988002] RIP  [880037bd0800] 0x880037bd07ff
[  763.988002]  RSP 8800bfcc3e78
[  763.988002] CR2: 880037bd0800
[  763.988002] ---[ end trace 614049dc850267ac ]---
[  763.988002] Kernel panic - not syncing: Fatal exception in interrupt
[  763.997833] [ cut here ]
[  763.997936] WARNING: at arch/x86/kernel/smp.c:120
update_process_times+0x57/0x63()
[  763.998072] Hardware name: ProLiant DL160 G5
[  763.998171] Modules linked in: cbc netconsole loop snd_pcm
snd_timer snd soundcore snd_page_alloc processor tpm_tis i5400_edac
tpm edac_core tpm_bios evdev pcspkr i5k_amb rng_core thermal_sys
button shpchp pci_hotplug sd_mod crc_t10dif usbhid hid ide_cd_mod
cdrom ata_generic uhci_hcd ehci_hcd ata_piix libata piix ide_core
usbcore usb_common tg3 libphy mptsas mptscsih mptbase
scsi_transport_sas scsi_mod [last unloaded: scsi_wait_scan]
[  764.001205] Pid: 0, comm: swapper/3 Tainted: G      D      3.3.4 #1
[  764.001311] Call Trace:
[  764.001404]  IRQ  [81038bb0] ? warn_slowpath_common+0x78/0x8c
[  764.001573]  [81044937] ? update_process_times+0x57/0x63
[  764.001681]  [81075dbe] ? tick_sched_timer+0x65/0x8b
[  764.001788]  [810561bd] ? __run_hrtimer+0xb2/0x13d
[  764.001832]  [81013ca9] ? read_tsc+0x5/0x16
[  764.001832]  [81056482] ? hrtimer_interrupt+0xd8/0x1a7
[  

Re: slow performance even when using SSDs

2012-05-10 Thread Calvin Morrow
I was getting roughly the same results as your tmpfs test using
spinning disks for OSDs with a 160GB Intel 320 SSD being used for the
journal.  Theoretically the 520 SSD should give better performance
than my 320s.

Keep in mind that even with balance-alb, multiple GigE connections
will only be used if there are multiple TCP sessions being used by
Ceph.

You don't mention it in your email, but if you're using kernel 3.4+
you'll want to make sure you create your btrfs filesystem using a
large node & leaf size (Big Metadata - I've heard recommendations of
32k instead of the default 4k) so your performance doesn't degrade over
time.
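For example, something along these lines (the device is a placeholder, and
32k is just the value I've seen suggested):

# mkfs.btrfs -l 32k -n 32k /dev/sdX1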

I'm curious what speed you're getting from dd in a streaming write.
You might try running a dd if=/dev/zero of=<intel ssd partition>
bs=128k count=<something> to see what the SSD will spit out without
Ceph in the picture.
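Something like the following, where /dev/sdX1 is a placeholder for a spare
partition on the SSD (it will overwrite whatever is there); oflag=direct
bypasses the page cache so you measure the device rather than RAM, and
8192 x 128k writes about 1 GB:

# dd if=/dev/zero of=/dev/sdX1 bs=128k count=8192 oflag=direct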

Calvin

On Thu, May 10, 2012 at 7:09 AM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
 OK, here some retests. I had the SDDs conected to an old Raid controller
 even i did used them as JBODs (oops).

 Here are two new Tests (using kernel 3.4-rc6) it would be great if
 someone could tell me if they're fine or bad.

 New tests with all 3 SSDs connected to the mainboard.

 #~ rados -p rbd bench 60 write
 Total time run:        60.342419
 Total writes made:     2021
 Write size:            4194304
 Bandwidth (MB/sec):    133.969

 Average Latency:       0.477476
 Max latency:           0.942029
 Min latency:           0.109467

 #~ rados -p rbd bench 60 write -b 4096
 Total time run:        60.726326
 Total writes made:     59026
 Write size:            4096
 Bandwidth (MB/sec):    3.797

 Average Latency:       0.016459
 Max latency:           0.874841
 Min latency:           0.002392

 Another test with only osd on the disk and the journal in memory / tmpfs:
 #~ rados -p rbd bench 60 write
 Total time run:        60.513240
 Total writes made:     2555
 Write size:            4194304
 Bandwidth (MB/sec):    168.889

 Average Latency:       0.378775
 Max latency:           4.59233
 Min latency:           0.055179

 #~ rados -p rbd bench 60 write -b 4096
 Total time run:        60.116260
 Total writes made:     281903
 Write size:            4096
 Bandwidth (MB/sec):    18.318

 Average Latency:       0.00341067
 Max latency:           0.720486
 Min latency:           0.000602

 Another problem i have is i'm always getting:
 2012-05-10 15:05:22.140027 mon.0 192.168.0.100:6789/0 19 : [WRN]
 message from mon.2 was stamped 0.109244s in the future, clocks not
 synchronized

 even on all systems ntp is running fine.

 Stefan

 Am 10.05.2012 14:09, schrieb Stefan Priebe - Profihost AG:
 Dear List,

 i'm doing a testsetup with ceph v0.46 and wanted to know how fast ceph is.

 my testsetup:
 3 servers with Intel Xeon X3440, 180GB SSD Intel 520 Series, 4GB RAM, 2x
 1Gbit/s LAN each

 All 3 are running as mon a-c and osd 0-2. Two of them are also running
 as mds.2 and mds.3 (has 8GB RAM instead of 4GB).

 All machines run ceph v0.46 and vanilla Linux Kernel v3.0.30 and all of
 them use btrfs on the ssd which serves /srv/{osd,mon}.X. All of them use
 eth0+eth1 as bond0 (mode 6).

 This gives me:
 rados -p rbd bench 60 write

 ...
 Total time run:        61.465323
 Total writes made:     776
 Write size:            4194304
 Bandwidth (MB/sec):    50.500

 Average Latency:       1.2654
 Max latency:           2.77124
 Min latency:           0.170936

 Shouldn't it be at least 100MB/s? (1Gbit/s / 8)

 And rados -p rbd bench 60 write -b 4096 gives pretty bad results:
 Total time run:        60.221130
 Total writes made:     6401
 Write size:            4096
 Bandwidth (MB/sec):    0.415

 Average Latency:       0.150525
 Max latency:           1.12647
 Min latency:           0.026599

 All btrfs ssds are also mounted with noatime.

 Thanks for your help!

 Greets Stefan


Re: Ceph on btrfs 3.4rc

2012-05-10 Thread Josef Bacik
On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
 Am 24. April 2012 18:26 schrieb Sage Weil s...@newdream.net:
  On Tue, 24 Apr 2012, Josef Bacik wrote:
  On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
   After running ceph on XFS for some time, I decided to try btrfs again.
   Performance with the current for-linux-min branch and big metadata
   is much better. The only problem (?) I'm still seeing is a warning
   that seems to occur from time to time:
 
  Actually, before you do that... we have a new tool,
  test_filestore_workloadgen, that generates a ceph-osd-like workload on the
  local file system.  It's a subset of what a full OSD might do, but if
  we're lucky it will be sufficient to reproduce this issue.  Something like
 
   test_filestore_workloadgen --osd-data /foo --osd-journal /bar
 
  will hopefully do the trick.
 
  Christian, maybe you can see if that is able to trigger this warning?
  You'll need to pull it from the current master branch; it wasn't in the
  last release.
 
 Trying to reproduce with test_filestore_workloadgen didn't work for
 me. So here are some instructions on how to reproduce with a minimal
 ceph setup.
 
 You will need a single system with two disks and a bit of memory.
 
 - Compile and install ceph (detailed instructions:
 http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
 
 - For the test setup I've used two tmpfs files as journal devices. To
 create these, do the following:
 
 # mkdir -p /ceph/temp
 # mount -t tmpfs tmpfs /ceph/temp
 # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
 # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
 
 - Now you should create and mount btrfs. Here is what I did:
 
 # mkfs.btrfs -l 64k -n 64k /dev/sda
 # mkfs.btrfs -l 64k -n 64k /dev/sdb
 # mkdir /ceph/osd.000
 # mkdir /ceph/osd.001
 # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
 # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
 
 - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
 will probably have to change the btrfs devices and the hostname
 (os39).
 
 - Create the ceph filesystems:
 
 # mkdir /ceph/mon
 # mkcephfs -a -c /etc/ceph/ceph.conf
 
 - Start ceph (e.g. service ceph start)
 
 - Now you should be able to use ceph - ceph -s will tell you about
 the state of the ceph cluster.
 
 - rbd create -size 100 testimg will create an rbd image on the ceph cluster.
 
 - Compile my test with gcc -o rbdtest rbdtest.c -lrbd and run it
 with ./rbdtest testimg.
 
 I can see the first btrfs_orphan_commit_root warning after an hour or
 so... I hope that I've described all necessary steps. If there is a
 problem just send me a note.
 

Well, I feel like an idiot: I finally got it to reproduce, went to look at where I
wanted to put my printks, and there's the problem staring me right in the face.
I've looked seriously at this problem 2 or 3 times and have missed this every
single freaking time.  Here is the patch I'm trying, please try it on yours to
make sure it fixes the problem.  It takes like 2 hours for it to reproduce for
me so I won't be able to fully test it until tomorrow, but so far it hasn't
broken anything so it should be good.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index eefe573..4ad628d 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
/* used to order data wrt metadata */
struct btrfs_ordered_inode_tree ordered_tree;
 
-   /* for keeping track of orphaned inodes */
-   struct list_head i_orphan;
-
/* list of all the delalloc inodes in the FS.  There are times we need
 * to write all the delalloc pages to disk, and this list is used
 * to walk them all.
@@ -164,6 +161,7 @@ struct btrfs_inode {
unsigned dummy_inode:1;
unsigned in_defrag:1;
unsigned delalloc_meta_reserved:1;
+   unsigned has_orphan_item:1;
 
/*
 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8a89888..6dd20f3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
struct list_head root_list;
 
spinlock_t orphan_lock;
-   struct list_head orphan_list;
+   atomic_t orphan_inodes;
struct btrfs_block_rsv *orphan_block_rsv;
int orphan_item_inserted;
int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7f849b3..8bbe8c4 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1148,7 +1148,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 root->orphan_block_rsv = NULL;
 
 INIT_LIST_HEAD(&root->dirty_list);
-   INIT_LIST_HEAD(&root->orphan_list);
 INIT_LIST_HEAD(&root->root_list);
 spin_lock_init(&root->orphan_lock);
 spin_lock_init(&root->inode_lock);
@@ -1161,6 +1160,7 @@ static void 

Re: Can I use btrfs-restore to restore ceph osds?

2012-05-10 Thread Tommi Virtanen
On Wed, May 9, 2012 at 10:15 AM, Guido Winkelmann
guido-c...@thisisnotatest.de wrote:
 I'm currently trying to re-enable my experimental ceph cluster that has been
 offline for a few months. Unfortunately, it appears that, out of the six btrfs
 volumes involved, only one can still be mounted, the other five are broken
 somehow. (If I ever use Ceph in production, it's probably not going to be on
 btrfs after this... I cannot recall whether or not the servers were properly
 shut down the last time, but even if not, this is a bit ridiculous.)

 I cannot seem to repair the broken filesystem with btrfsck, but I can extract
 data from them with btrfs-restore.

OSD uses btrfs snapshots internally. Any restore operation would have
to bring the snapshots back exactly as they were, too. It seems
there's a -s option for that, but whether things will work out is hard
to predict.. since it was a test cluster, perhaps you're better off
scrapping the data and setting up a new cluster.
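(If you do try it, the invocation would be roughly the following, with the
device and target directory as placeholders; -s is the snapshot option
mentioned above, and I haven't verified how faithfully it brings the OSD's
snapshots back:)

# btrfs-restore -s /dev/sdX /mnt/osd-recovery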


Re: Compile error in rgw/rgw_xml.h in 0.46

2012-05-10 Thread Yehuda Sadeh
Oops, missed posting it to the list (seeing Tommi's comment).

On Wed, May 9, 2012 at 11:04 AM, Yehuda Sadeh yeh...@inktank.com wrote:

 On Wed, May 9, 2012 at 6:46 AM, Guido Winkelmann
 guido-c...@thisisnotatest.de wrote:
  Compiling Ceph 0.46 fails at rgw/rgw_dencoder.cc with the following
  errors:
 
  In file included from rgw/rgw_dencoder.cc:7:
  rgw/rgw_acl_s3.h:9:19: error: expat.h: No such file or directory
  In file included from rgw/rgw_acl_s3.h:11,
                  from rgw/rgw_dencoder.cc:7:
  rgw/rgw_xml.h:59: error: 'XML_Parser' does not name a type
  make[3]: *** [ceph_dencoder-rgw_dencoder.o] Error 1
  make[3]: Leaving directory `/root/ceph-0.46/src'
  make[2]: *** [all-recursive] Error 1
  make[2]: Leaving directory `/root/ceph-0.46/src'
  make[1]: *** [all] Error 2
  make[1]: Leaving directory `/root/ceph-0.46/src'
  make: *** [all-recursive] Error 1
 
  This is on CentOS 6.2.
 
  I managed to get it to compile by installing expat-devel first. Maybe
  the
  configure script should check for the existence of the expat header
  files?
 
 Actually, the config script checks for that, but only if rgw is
 enabled. We need to figure out what would be the best solution. I
 don't think expat should be a dependency when rgw is not enabled, so
 we need to figure out how to remove this dependency. It might be
 that the easiest solution would be to not compile all the rgw stuff
 into the dencoder when rgw is not enabled (which makes sense anyway).
 I opened bug #2390 to track this issue.

 Thanks,
 Yehuda


Re: Always creating PGs

2012-05-10 Thread Tommi Virtanen
On Thu, May 10, 2012 at 2:17 AM, Tomoki BENIYA ben...@bit-isle.co.jp wrote:
 I found that PGs of 'creating' status are never finished.

 # ceph pg stat
 v17596: 1204 pgs: 8 creating, 1196 active+clean; 25521 MB data, 77209 MB 
 used, 2223 GB / 2318 GB avail
                  ~~
                  always creating, why?

 # ceph pg dump|grep creating
 1.1p6   0       0       0       0       0       0       0       creating        0.00        0'0     0'0     []      []      0'0     0.00

Sounds like another report we've had recently, under the subject "PGs
stuck in creating state":
http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/6078

Please read that thread. That wart should go away at 0.47.


[RFC PATCH 0/2] Distribute re-replicated objects evenly after OSD failure

2012-05-10 Thread Jim Schutt
Hi Sage,

I've been trying to solve the issue mentioned in tracker #2047, which I 
think is the same as I described in
  http://www.spinics.net/lists/ceph-devel/msg05824.html

The attached patches seem to fix it for me.  I also attempted to 
address the local search issue you mentioned in #2047.

I'm testing this using a cluster with 3 rows, 2 racks/row, 2 hosts/rack,
4 osds/host. I tested against a CRUSH map with the rules:
step take root
step chooseleaf firstn 0 type rack
step emit

I'm in the processes of testing this as follows:

I wrote some data to the cluster, then started shutting down OSDs using
init-ceph stop osd.n. For the first rack's worth, I shut OSDs down
sequentially.  I waited for recovery to complete each time before
stopping the next OSD.  For the next rack I shut down the first 3 OSDs
on a host at the same time, waited for recovery to complete, then shut
down the last OSD on that host.  For the next racks, I shut down all
the OSDs on the hosts in the rack at the same time.

Right now I'm waiting for recovery to complete after shutting down
the third rack.  Once recovery completed after each phase so far,
there were no degraded objects.

So, this is looking fairly solid to me so far.  What do you think?

Thanks -- Jim

Jim Schutt (2):
  ceph: retry CRUSH map descent before retrying bucket
  ceph: retry CRUSH map descent from root if leaf is failed

 src/crush/mapper.c |   30 ++
 1 files changed, 22 insertions(+), 8 deletions(-)

-- 
1.7.8.2




[RFC PATCH 1/2] ceph: retry CRUSH map descent before retrying bucket

2012-05-10 Thread Jim Schutt
For the first few rejections or collisions, we'll retry the descent to
keep objects spread across the cluster.  After that, we'll fall back
to exhaustive search of the bucket to avoid trying forever in the event
a bucket has only a few in items and the hash doesn't do a good job of
finding them.

Signed-off-by: Jim Schutt jasc...@sandia.gov
---
 src/crush/mapper.c |   20 ++--
 1 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index 8857577..e5dc950 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -350,8 +350,7 @@ static int crush_choose(const struct crush_map *map,
reject = 1;
goto reject;
}
-   if (flocal >= (in->size>>1) &&
-   flocal > orig_tries)
+   if (ftotal >= orig_tries)  /* exhaustive bucket search */
item = bucket_perm_choose(in, x, r);
else
item = crush_bucket_choose(in, x, r);
@@ -420,10 +419,19 @@ reject:
if (reject || collide) {
ftotal++;
flocal++;
-
-   if (collide && flocal < 3)
-   /* retry locally a few times */
-   retry_bucket = 1;
+   /*
+* For the first couple rejections or collisions,
+* we'll retry the descent to keep objects spread
+* across the cluster.  After that, we'll fall back
+* to exhaustive search of buckets to avoid trying
+* forever in the event a bucket has only a few
+* in items and the hash doesn't do a good job
+* of finding them.  Note that we need to retry
+* descent during that phase so that multiple
+* buckets can be exhaustively searched.
+*/
+   if (ftotal <= orig_tries)
+   retry_descent = 1;
 else if (flocal <= in->size + orig_tries)
 /* exhaustive bucket search */
 retry_bucket = 1;
-- 
1.7.8.2




[RFC PATCH 2/2] ceph: retry CRUSH map descent from root if leaf is failed

2012-05-10 Thread Jim Schutt
When an object is re-replicated after a leaf failure, the remapped replica
ends up under the bucket that held the failed leaf.  This causes uneven
data distribution across the storage cluster, to the point that when all
the leaves of a bucket but one fail, that remaining leaf holds all the
data from its failed peers.

For example, consider the crush rule
  step chooseleaf firstn 0 type node_type

This rule means that n replicas will be chosen in such a manner that
each chosen leaf's branch will contain a unique instance of node_type.

For such ops, the tree descent has two steps: call them the inner and
outer descent.

If the tree descent down to node_type is the outer descent, and the
descent from node_type down to a leaf is the inner descent, the issue
is that a down leaf is detected on the inner descent, but we want to
retry the outer descent.  This ensures that re-replication after a leaf
failure disperses the re-replicated objects as widely as possible across
the storage cluster.

Fix this by causing the inner descent to return immediately on choosing
a failed leaf, unless we've fallen back to exhaustive search.

Note that after this change, for a chooseleaf rule, if the primary OSD
in a placement group has failed, choosing a replacement may result in
one of the other OSDs in the PG colliding with the new primary.  In
that case that OSD's data for the PG needs to move as well.  This
seems unavoidable but should be relatively rare.

Signed-off-by: Jim Schutt jasc...@sandia.gov
---
 src/crush/mapper.c |   12 +---
 1 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/src/crush/mapper.c b/src/crush/mapper.c
index e5dc950..698da55 100644
--- a/src/crush/mapper.c
+++ b/src/crush/mapper.c
@@ -286,6 +286,7 @@ static int is_out(const struct crush_map *map, const __u32 *weight, int item, int x)
  * @param outpos our position in that vector
  * @param firstn true if choosing first n items, false if choosing indep
  * @param recurse_to_leaf: true if we want one device under each item of given type
+ * @param descend_once true if we should only try one descent before giving up
  * @param out2 second output vector for leaf items (if @a recurse_to_leaf)
  */
 static int crush_choose(const struct crush_map *map,
@@ -293,7 +294,7 @@ static int crush_choose(const struct crush_map *map,
const __u32 *weight,
int x, int numrep, int type,
int *out, int outpos,
-   int firstn, int recurse_to_leaf,
+   int firstn, int recurse_to_leaf, int descend_once,
int *out2)
 {
int rep;
@@ -397,6 +398,7 @@ static int crush_choose(const struct crush_map *map,
 x, outpos+1, 0,
 out2, outpos,
 firstn, 0,
+ftotal < orig_tries,
 NULL) <= outpos)
/* didn't get leaf */
reject = 1;
@@ -430,7 +432,10 @@ reject:
 * descent during that phase so that multiple
 * buckets can be exhaustively searched.
 */
-   if (ftotal <= orig_tries)
+   if (reject && descend_once)
+   /* let outer call try again */
+   skip_rep = 1;
+   else if (ftotal <= orig_tries)
 retry_descent = 1;
 else if (flocal <= in->size + orig_tries)
/* exhaustive bucket search */
@@ -491,6 +496,7 @@ int crush_do_rule(const struct crush_map *map,
int i, j;
int numrep;
int firstn;
+   const int descend_once = 0;
 
 if ((__u32)ruleno >= map->max_rules) {
 dprintk(" bad ruleno %d\n", ruleno);
@@ -550,7 +556,7 @@ int crush_do_rule(const struct crush_map *map,
   curstep->arg2,
  o+osize, j,
  firstn,
- recurse_to_leaf, c+osize);
+ recurse_to_leaf, descend_once, c+osize);
}
 
if (recurse_to_leaf)
-- 
1.7.8.2



Re: converting btrfs osds to xfs?

2012-05-10 Thread Nick Bartos
After I run the 'ceph osd out 123' command, is there a specific ceph
command I can poll so I know when it's OK to kill the OSD daemon and
begin the reformat process?

On Tue, May 8, 2012 at 12:38 PM, Sage Weil s...@newdream.net wrote:
 On Tue, 8 May 2012, Tommi Virtanen wrote:
 On Tue, May 8, 2012 at 8:39 AM, Nick Bartos n...@pistoncloud.com wrote:
  I am considering converting some OSDs to xfs (currently running btrfs)
  for stability reasons.  I have a couple of ideas for doing this, and
  was hoping to get some comments:
 
  Method #1:
  1.  Check cluster health and make sure data on a specific OSD is
  replicated elsewhere.
  2.  Bring down the OSD
  3.  Reformat it to xfs
  4.  Restart OSD
  5.  Repeat 1-4 until all btrfs OSDs have been converted.
 ...
  Obviously #1 seems much more appetizing, but unfortunately I can't
  seem to find out how to verify that data on a specific OSD is
  replicated elsewhere.  I could go off general cluster health, but that
  seems more error prone.

 You can set the osd weight in crush to 0 and wait for the files inside
 the osd data dir to disappear. If you want to control how much

 You can also just mark the osd 'out' without touching the CRUSH map;
 that'll be easier and a bit more efficient wrt data movement:

        ceph osd out 123

 When the one comes back, you'll need to

        ceph osd in 123

 sage


[PATCH 0/3] ceph: messenger: read_partial() cleanups

2012-05-10 Thread Alex Elder

This short series adds the use of read_partial() in a few places
where it is not already used.  It also gets rid of the in/out "to"
argument (which continues to cause confusion every time I see it),
using an in-only "end" argument in its place.

-Alex


Re: converting btrfs osds to xfs?

2012-05-10 Thread Tommi Virtanen
On Thu, May 10, 2012 at 3:44 PM, Nick Bartos n...@pistoncloud.com wrote:
 After I run the 'ceph osd out 123' command, is there a specific ceph
 command I can poll so I know when it's OK to kill the OSD daemon and
 begin the reformat process?

Good question! ceph -s will show you that. This is from a run where
I ran ceph osd out 1 on a cluster of 3 osds. See the active+clean
counts going up and active+recovering counts going down, and the
degraded percentage dropping. The last line is an example of an all
done situation.

2012-05-10 17:19:47.376864    pg v144: 24 pgs: 14 active+clean, 10
active+recovering; 180 MB data, 100217 MB used, 173 GB / 285 GB avail;
88/132 degraded (66.667%)

2012-05-10 17:19:59.220607    pg v146: 24 pgs: 19 active+clean, 5
active+recovering; 180 MB data, 100227 MB used, 173 GB / 285 GB avail;
24/132 degraded (18.182%)

2012-05-10 17:20:16.522978    pg v148: 24 pgs: 24 active+clean; 180 MB
data, 100146 MB used, 173 GB / 285 GB avail

If you want a lower-level double-check, you can peek inside the osd
data directory and see that the current subdirectory has no *_head
entries, du is low, etc.
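Concretely, assuming the osd data dir for the osd you marked out is, say,
/ceph0 (adjust to your layout):

# find /ceph0/current -maxdepth 1 -name '*_head' | wc -l   # should drop to 0
# du -sh /ceph0/current                                    # should shrink to almost nothing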


Re: converting btrfs osds to xfs?

2012-05-10 Thread Tommi Virtanen
On Thu, May 10, 2012 at 5:23 PM, Tommi Virtanen t...@inktank.com wrote:
 Good question! ceph -s will show you that. This is from a run where
 I ran ceph osd out 1 on a cluster of 3 osds. See the active+clean
 counts going up and active+recovering counts going down, and the
 degraded percentage dropping. The last line is an example of an all
 done situation.

Oh, and if your cluster is busy enough that there's always some
rebalancing going on, you might never get to 100% active+clean. In
that case, I do believe that ceph pg dump probably contains all the
information needed, and --format=json makes it parseable, but it's
just not currently documented. We really should provide a good way of
accessing that information.
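(Until then, a crude interim check in the spirit of the above; it just greps
the plain-text dump, so treat it as a heuristic rather than an API:)

# ceph pg dump | grep -Ec 'recovering|degraded'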

I filed http://tracker.newdream.net/issues/2394 to keep track of this task.


[PATCH 1/3] ceph: messenger: use read_partial() in read_partial_message()

2012-05-10 Thread Alex Elder

There are two blocks of code in read_partial_message()--those that
read the header and footer of the message--that can be replaced by a
call to read_partial().  Do that.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |   30 ++
 1 file changed, 10 insertions(+), 20 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index f0993af..673133e 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1628,7 +1628,7 @@ static int read_partial_message(struct ceph_connection *con)
 {
 struct ceph_msg *m = con->in_msg;
int ret;
-   int to, left;
+   int to;
unsigned front_len, middle_len, data_len;
 bool do_datacrc = !con->msgr->nocrc;
int skip;
@@ -1638,15 +1638,10 @@ static int read_partial_message(struct ceph_connection *con)
 
 dout("read_partial_message con %p msg %p\n", con, m);

/* header */
-   while (con->in_base_pos < sizeof(con->in_hdr)) {
-   left = sizeof(con->in_hdr) - con->in_base_pos;
-   ret = ceph_tcp_recvmsg(con->sock,
-  (char *)&con->in_hdr + con->in_base_pos,
-  left);
-   if (ret <= 0)
-   return ret;
-   con->in_base_pos += ret;
-   }
+   to = 0;
+   ret = read_partial(con, &to, sizeof (con->in_hdr), &con->in_hdr);
+   if (ret <= 0)
+   return ret;
 
 crc = crc32c(0, &con->in_hdr, offsetof(struct ceph_msg_header, crc));
 if (cpu_to_le32(crc) != con->in_hdr.crc) {
@@ -1759,16 +1754,11 @@ static int read_partial_message(struct ceph_connection *con)
}

/* footer */
-   to = sizeof(m->hdr) + sizeof(m->footer);
-   while (con->in_base_pos < to) {
-   left = to - con->in_base_pos;
-   ret = ceph_tcp_recvmsg(con->sock, (char *)&m->footer +
-  (con->in_base_pos - sizeof(m->hdr)),
-  left);
-   if (ret <= 0)
-   return ret;
-   con->in_base_pos += ret;
-   }
+   to = sizeof (m->hdr);
+   ret = read_partial(con, &to, sizeof (m->footer), &m->footer);
+   if (ret <= 0)
+   return ret;
+
 dout("read_partial_message got msg %p %d (%u) + %d (%u) + %d (%u)\n",
  m, front_len, m->footer.front_crc, middle_len,
  m->footer.middle_crc, data_len, m->footer.data_crc);
--
1.7.9.5



[PATCH 2/3] ceph: messenger: update to in read_partial() caller

2012-05-10 Thread Alex Elder

read_partial() always increases whatever "to" value is supplied by
adding the requested size to it.  That's the only thing it does with
that pointed-to value.

Do that pointer advance in the caller (and then only when the
updated value will be subsequently used), and change the "to"
parameter to be an in-only and non-pointer value.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |   31 ---
 1 file changed, 16 insertions(+), 15 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 673133e..37fd2ae 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -992,11 +992,12 @@ static int prepare_read_message(struct ceph_connection *con)


 static int read_partial(struct ceph_connection *con,
-   int *to, int size, void *object)
+   int to, int size, void *object)
 {
-   *to += size;
-   while (con->in_base_pos < *to) {
-   int left = *to - con->in_base_pos;
+   int end = to + size;
+
+   while (con->in_base_pos < end) {
+   int left = end - con->in_base_pos;
 int have = size - left;
 int ret = ceph_tcp_recvmsg(con->sock, object + have, left);
 if (ret <= 0)
@@ -1017,14 +1018,16 @@ static int read_partial_banner(struct ceph_connection *con)
 dout("read_partial_banner %p at %d\n", con, con->in_base_pos);

/* peer's banner */
-   ret = read_partial(con, &to, strlen(CEPH_BANNER), con->in_banner);
+   ret = read_partial(con, to, strlen(CEPH_BANNER), con->in_banner);
 if (ret <= 0)
 goto out;
-   ret = read_partial(con, &to, sizeof(con->actual_peer_addr),
+   to += strlen(CEPH_BANNER);
+   ret = read_partial(con, to, sizeof(con->actual_peer_addr),
    &con->actual_peer_addr);
 if (ret <= 0)
 goto out;
-   ret = read_partial(con, &to, sizeof(con->peer_addr_for_me),
+   to += sizeof(con->actual_peer_addr);
+   ret = read_partial(con, to, sizeof(con->peer_addr_for_me),
    &con->peer_addr_for_me);
 if (ret <= 0)
goto out;
@@ -1038,10 +1041,11 @@ static int read_partial_connect(struct ceph_connection *con)
 
 dout("read_partial_connect %p at %d\n", con, con->in_base_pos);

-   ret = read_partial(con, &to, sizeof(con->in_reply), &con->in_reply);
+   ret = read_partial(con, to, sizeof(con->in_reply), &con->in_reply);
 if (ret <= 0)
 goto out;
-   ret = read_partial(con, &to, le32_to_cpu(con->in_reply.authorizer_len),
+   to += sizeof(con->in_reply);
+   ret = read_partial(con, to, le32_to_cpu(con->in_reply.authorizer_len),
    con->auth_reply_buf);
 if (ret <= 0)
goto out;
@@ -1491,9 +1495,7 @@ static int process_connect(struct ceph_connection *con)
  */
 static int read_partial_ack(struct ceph_connection *con)
 {
-   int to = 0;
-
-   return read_partial(con, &to, sizeof(con->in_temp_ack),
+   return read_partial(con, 0, sizeof(con->in_temp_ack),
 &con->in_temp_ack);
 }

@@ -1638,8 +1640,7 @@ static int read_partial_message(struct ceph_connection *con)
 
 dout("read_partial_message con %p msg %p\n", con, m);

/* header */
-   to = 0;
-   ret = read_partial(con, &to, sizeof (con->in_hdr), &con->in_hdr);
+   ret = read_partial(con, 0, sizeof (con->in_hdr), &con->in_hdr);
 if (ret <= 0)
return ret;

@@ -1755,7 +1756,7 @@ static int read_partial_message(struct ceph_connection *con)
 
 /* footer */
 to = sizeof (m->hdr);
-   ret = read_partial(con, &to, sizeof (m->footer), &m->footer);
+   ret = read_partial(con, to, sizeof (m->footer), &m->footer);
 if (ret <= 0)
return ret;

--
1.7.9.5



[PATCH 3/3] ceph: messenger: change read_partial() to take end arg

2012-05-10 Thread Alex Elder

Make the second argument to read_partial() be the ending input byte
position rather than the beginning offset it now represents.  This
amounts to moving the addition "to + size" into the caller.

Signed-off-by: Alex Elder el...@inktank.com
---
 net/ceph/messenger.c |   59 
--

 1 file changed, 38 insertions(+), 21 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 37fd2ae..364c902 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -992,10 +992,8 @@ static int prepare_read_message(struct ceph_connection *con)


 static int read_partial(struct ceph_connection *con,
-   int to, int size, void *object)
+   int end, int size, void *object)
 {
-   int end = to + size;
-
 while (con->in_base_pos < end) {
 int left = end - con->in_base_pos;
int have = size - left;
@@ -1013,40 +1011,52 @@ static int read_partial(struct ceph_connection *con,
  */
 static int read_partial_banner(struct ceph_connection *con)
 {
-   int ret, to = 0;
+   int size;
+   int end;
+   int ret;

 dout("read_partial_banner %p at %d\n", con, con->in_base_pos);
 
 /* peer's banner */
-   ret = read_partial(con, to, strlen(CEPH_BANNER), con->in_banner);
+   size = strlen(CEPH_BANNER);
+   end = size;
+   ret = read_partial(con, end, size, con->in_banner);
 if (ret <= 0)
 goto out;
-   to += strlen(CEPH_BANNER);
-   ret = read_partial(con, to, sizeof(con->actual_peer_addr),
-  &con->actual_peer_addr);
+
+   size = sizeof (con->actual_peer_addr);
+   end += size;
+   ret = read_partial(con, end, size, &con->actual_peer_addr);
 if (ret <= 0)
 goto out;
-   to += sizeof(con->actual_peer_addr);
-   ret = read_partial(con, to, sizeof(con->peer_addr_for_me),
-  &con->peer_addr_for_me);
+
+   size = sizeof (con->peer_addr_for_me);
+   end += size;
+   ret = read_partial(con, end, size, &con->peer_addr_for_me);
 if (ret <= 0)
goto out;
+
 out:
return ret;
 }

 static int read_partial_connect(struct ceph_connection *con)
 {
-   int ret, to = 0;
+   int size;
+   int end;
+   int ret;

 dout("read_partial_connect %p at %d\n", con, con->in_base_pos);
 
-   ret = read_partial(con, to, sizeof(con->in_reply), &con->in_reply);
+   size = sizeof (con->in_reply);
+   end = size;
+   ret = read_partial(con, end, size, &con->in_reply);
 if (ret <= 0)
 goto out;
-   to += sizeof(con->in_reply);
-   ret = read_partial(con, to, le32_to_cpu(con->in_reply.authorizer_len),
-  con->auth_reply_buf);
+
+   size = le32_to_cpu(con->in_reply.authorizer_len);
+   end += size;
+   ret = read_partial(con, end, size, con->auth_reply_buf);
 if (ret <= 0)
goto out;

@@ -1495,8 +1505,10 @@ static int process_connect(struct ceph_connection *con)
  */
 static int read_partial_ack(struct ceph_connection *con)
 {
-   return read_partial(con, 0, sizeof(con->in_temp_ack),
-   &con->in_temp_ack);
+   int size = sizeof (con->in_temp_ack);
+   int end = size;
+
+   return read_partial(con, end, size, &con->in_temp_ack);
 }


@@ -1629,6 +1641,8 @@ static int read_partial_message_bio(struct ceph_connection *con,
 static int read_partial_message(struct ceph_connection *con)
 {
 struct ceph_msg *m = con->in_msg;
+   int size;
+   int end;
int ret;
int to;
unsigned front_len, middle_len, data_len;
@@ -1640,7 +1654,9 @@ static int read_partial_message(struct ceph_connection *con)
 
 dout("read_partial_message con %p msg %p\n", con, m);

/* header */
-   ret = read_partial(con, 0, sizeof (con->in_hdr), &con->in_hdr);
+   size = sizeof (con->in_hdr);
+   end = size;
+   ret = read_partial(con, end, size, &con->in_hdr);
 if (ret <= 0)
return ret;

@@ -1755,8 +1771,9 @@ static int read_partial_message(struct ceph_connection *con)
}

/* footer */
-   to = sizeof (m->hdr);
-   ret = read_partial(con, to, sizeof (m->footer), &m->footer);
+   size = sizeof (m->footer);
+   end += size;
+   ret = read_partial(con, end, size, &m->footer);
 if (ret <= 0)
return ret;

--
1.7.9.5
