date:20120517

Re: Ceph on btrfs 3.4rc

2012-05-17 Thread Martin Mailand


Hi Josef,

Somehow I still get the kernel BUG messages. I used your patch from the
16th against rc7.


-martin

On 16.05.2012 21:20, Josef Bacik wrote:

Hrm ok so I finally got some time to try and debug it and let the test run a
good long while (5 hours almost) and I couldn't hit either the original bug or
the one you guys were hitting.  So either my extra little bit of locking did the
trick or I get to keep my Worst reproducer ever award.  Can you guys give this
one a whirl and if it panics send the entire dmesg since it should spit out a
WARN_ON() to let me know whether what I thought was the problem really was it.  Thanks,


[ 2868.813236] ------------[ cut here ]------------
[ 2868.813297] kernel BUG at fs/btrfs/inode.c:2220!
[ 2868.813355] invalid opcode:  [#2] SMP
[ 2868.813479] CPU 2
[ 2868.813516] Modules linked in: btrfs zlib_deflate libcrc32c ext2 
bonding coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 
microcode psmouse serio_raw sb_edac edac_core joydev mei(C) ses ioatdma 
enclosure mac_hid lp parport isci libsas scsi_transport_sas usbhid hid 
ixgbe igb megaraid_sas dca mdio

[ 2868.814871]
[ 2868.814925] Pid: 5325, comm: ceph-osd Tainted: G  D  C 
3.4.0-rc7+ #10 Supermicro X9SRi/X9SRi
[ 2868.815108] RIP: 0010:[a02212f2]  [a02212f2] 
btrfs_orphan_del+0xe2/0xf0 [btrfs]

[ 2868.815236] RSP: 0018:880296e89d18  EFLAGS: 00010282
[ 2868.815294] RAX: fffe RBX: 88101ef3c390 RCX: 
00562497
[ 2868.815355] RDX: 00562496 RSI: 88101ef1 RDI: 
ea00407bc400
[ 2868.815416] RBP: 880296e89d58 R08: 60ef8fd0 R09: 
a01f8c6a
[ 2868.815476] R10:  R11: 011d R12: 
880fdf602790
[ 2868.815537] R13: 880fdf602400 R14: 0001 R15: 
0001
[ 2868.815598] FS:  7f07d5512700() GS:88107fc4() 
knlGS:

[ 2868.815675] CS:  0010 DS:  ES:  CR0: 80050033
[ 2868.815734] CR2: 0ab16000 CR3: 00082a6b2000 CR4: 
000407e0
[ 2868.815796] DR0:  DR1:  DR2: 

[ 2868.815858] DR3:  DR6: 0ff0 DR7: 
0400
[ 2868.815920] Process ceph-osd (pid: 5325, threadinfo 880296e88000, 
task 8810170616e0)

[ 2868.815997] Stack:
[ 2868.816049]  0c07 88101ef12960 880296e89d38 
88101ef12960
[ 2868.816262]   880fdf602400 88101ef3c390 
880b4ce2f260
[ 2868.816485]  880296e89e08 a0225628 88101ef3c390 


[ 2868.816694] Call Trace:
[ 2868.816755]  [a0225628] btrfs_truncate+0x4d8/0x650 [btrfs]
[ 2868.816817]  [81188afd] ? path_lookupat+0x6d/0x750
[ 2868.816880]  [a0227021] btrfs_setattr+0xc1/0x1b0 [btrfs]
[ 2868.816940]  [811955c3] notify_change+0x183/0x320
[ 2868.816998]  [8117889e] do_truncate+0x5e/0xa0
[ 2868.817056]  [81178a24] sys_truncate+0x144/0x1b0
[ 2868.817115]  [8165fd29] system_call_fastpath+0x16/0x1b
[ 2868.817173] Code: e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 0f 1f 44 00 00 
80 bb 60 fe ff ff 84 75 b4 eb ae 0f 1f 44 00 00 48 89 df e8 50 73 fe ff 
eb b8 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec

[ 2868.819501] RIP  [a02212f2] btrfs_orphan_del+0xe2/0xf0 [btrfs]
[ 2868.819602]  RSP 880296e89d18
[ 2868.819703] ---[ end trace 94d17b770b376c84 ]---
[ 3249.857453] ------------[ cut here ]------------
[ 3249.857481] kernel BUG at fs/btrfs/inode.c:2220!
[ 3249.857506] invalid opcode:  [#3] SMP
[ 3249.857534] CPU 0
[ 3249.857538] Modules linked in: btrfs zlib_deflate libcrc32c ext2 
bonding coretemp ghash_clmulni_intel aesni_intel cryptd aes_x86_64 
microcode psmouse serio_raw sb_edac edac_core joydev mei(C) ses ioatdma 
enclosure mac_hid lp parport isci libsas scsi_transport_sas usbhid hid 
ixgbe igb megaraid_sas dca mdio

[ 3249.857721]
[ 3249.857740] Pid: 5384, comm: ceph-osd Tainted: G  D  C 
3.4.0-rc7+ #10 Supermicro X9SRi/X9SRi
[ 3249.857791] RIP: 0010:[a02212f2]  [a02212f2] 
btrfs_orphan_del+0xe2/0xf0 [btrfs]

[ 3249.857847] RSP: 0018:880abe8b5d18  EFLAGS: 00010282
[ 3249.857873] RAX: fffe RBX: 8807eb8b6670 RCX: 
0077a084
[ 3249.857902] RDX: 0077a083 RSI: 88101ee497e0 RDI: 
ea00407b9240
[ 3249.857931] RBP: 880abe8b5d58 R08: 60ef8fd0 R09: 
a01f8c6a
[ 3249.857959] R10:  R11: 0153 R12: 
880d56825390
[ 3249.857988] R13: 880d56825000 R14: 0001 R15: 
0001
[ 3249.858017] FS:  7f06bd13b700() GS:88107fc0() 
knlGS:

[ 3249.858062] CS:  0010 DS:  ES:  CR0: 80050033
[ 3249.858088] CR2: 043d2000 CR3: 000e7ebe5000 CR4: 
000407f0
[ 3249.858117] DR0:  DR1:  DR2: 

[ 3249.858146] DR3:  DR6: 0ff0 DR7:

Journal too small

2012-05-17 Thread Karol Jurak

Hi,

During an ongoing recovery in one of my clusters a couple of OSDs 
complained that their journals were too small. For instance:

2012-05-12 13:31:04.034144 7f491061d700  1 journal check_for_full at 
863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000 
start 863363072)
2012-05-12 13:31:04.034680 7f491061d700  0 journal JOURNAL TOO SMALL: item 
1693745152 > journal 1048571904 (usable)

I was under the impression that the OSDs stopped participating in recovery 
after this event. (ceph -w showed that the number of PGs in state 
active+clean no longer increased.) They resumed recovery after I enlarged 
their journals (stop osd, --flush-journal, --mkjournal, start osd).

How serious is such a situation? Do the OSDs know how to handle it 
correctly? Or could this result in some data loss or corruption? After the 
recovery finished (ceph -w showed that all PGs are in active+clean state) 
I noticed that a few rbd images were corrupted.

The cluster runs v0.46. The OSDs use ext4. I'm pretty sure that during the 
recovery no clients were accessing the cluster.

Best regards,
Karol
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
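The numbers in the log excerpt above can be checked with a few lines of C. This is only a sketch of the arithmetic, not Ceph's actual journal code: item_fits_journal() and journal_shortfall() are hypothetical helper names, and the sizes are the ones quoted in the log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when a single item could fit in the journal. */
static bool item_fits_journal(uint64_t item_bytes, uint64_t usable_bytes)
{
	return item_bytes <= usable_bytes;
}

/* Hypothetical helper: bytes by which the item exceeds the journal, or 0. */
static uint64_t journal_shortfall(uint64_t item_bytes, uint64_t usable_bytes)
{
	return item_fits_journal(item_bytes, usable_bytes)
		? 0 : item_bytes - usable_bytes;
}
```

With the logged values, the 1693745152-byte item exceeds the 1048571904 usable bytes by 645173248 bytes (about 615 MiB), so no amount of flushing could make it fit. The logged usable size is also exactly 4096 bytes less than max_size (1048576000).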

[PATCH] libceph: avoid unregistering osd request when not registered

2012-05-17 Thread Alex Elder


(Sending this on behalf of Sage, because I'm about to commit it
for testing and I wanted the list to have better visibility.)

There is a race between two __unregister_request() callers: the
reply path and the ceph_osdc_wait_request().  If we get a reply
*and* the timeout expires at roughly the same time, both callers
will try to unregister the request, and the second one will do bad
things.

Simply check whether the request has already been unregistered; if so,
return immediately and do nothing.

Fixes http://tracker.newdream.net/issues/2420

Signed-off-by: Sage Weil s...@inktank.com
Reviewed-by: Alex Elder el...@inktank.com
---
 net/ceph/osd_client.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: b/net/ceph/osd_client.c
===================================================================
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -841,6 +841,12 @@ static void register_request(struct ceph
 static void __unregister_request(struct ceph_osd_client *osdc,
 				 struct ceph_osd_request *req)
 {
+	if (RB_EMPTY_NODE(&req->r_node)) {
+		dout("__unregister_request %p tid %lld not registered\n",
+			req, req->r_tid);
+		return;
+	}
+
 	dout("__unregister_request %p tid %lld\n", req, req->r_tid);
 	rb_erase(&req->r_node, &osdc->requests);
 	osdc->num_requests--;
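The idea behind the guard can be sketched in userspace C. This is a simplified model under stated assumptions, not the libceph code: a boolean flag stands in for the RB_EMPTY_NODE() test on the rbtree node, there is no locking, and all model_* names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the OSD request and client structures. */
struct model_request {
	bool registered;	/* plays the role of !RB_EMPTY_NODE(&req->r_node) */
};

struct model_client {
	int num_requests;
};

static void model_register_request(struct model_client *c,
				   struct model_request *r)
{
	r->registered = true;
	c->num_requests++;
}

/* Safe to call twice: the second caller finds nothing to do. */
static void model_unregister_request(struct model_client *c,
				     struct model_request *r)
{
	if (!r->registered)
		return;		/* already unregistered by the other path */

	r->registered = false;
	c->num_requests--;
}
```

Without the early return, a second call would decrement num_requests again and, in the real code, erase a node that is no longer in the tree.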

[PATCH 00/16] ceph: messenger cleanups and fixes

2012-05-17 Thread Alex Elder


This series culminates in fixing a bug in which a ceph connection
header might get sent over the wire before its associated authorizer
structure has been fully prepared.  More info is found here:
http://tracker.newdream.net/issues/2424

Sage preemptively reviewed them before I got them posted to the
list, and the patches that follow indicate that.  These patches are
being tested now, and unless I get actionable feedback (and if they
successfully pass testing) I'll be committing them as-is.

Along the way a few other small bugs are fixed (or avoided) and some
data structures and interfaces are modified a bit to simplify
things.  Here is a bit of summary information about them.

These first four rearrange some places where a connection's out_kvec
fields get reset, and where/when a banner gets put out prior to a
connection header.  A couple of functions can then eliminate a
parameter as a result.
    libceph: don't reset kvec in prepare_write_banner()
    ceph: messenger: reset connection kvec caller
    ceph: messenger: send banner in process_connect()
    ceph: drop msgr argument from prepare_write_connect()

This defers setting WRITE_PENDING when a connection header has been
queued to write but not its associated authorizer buffer.
    ceph: don't set WRITE_PENDING too early

These add error checking in two spots, and rearrange a function so
a simple case can be handled without dropping a connection's mutex.
    ceph: messenger: check prepare_write_connect() result
    ceph: messenger: rework prepare_connect_authorizer()
    ceph: messenger: check return from get_authorizer

These define a type to group some authorizer-related fields, then
use it to simplify some function interfaces.  They also add some
additional checking before using method function pointers.
    ceph: define ceph_auth_handshake type
    ceph: messenger: reduce args to create_authorizer
    ceph: ensure auth ops are defined before use
    ceph: have get_authorizer methods return pointers
    ceph: use info returned by get_authorizer
    ceph: return pointer from prepare_connect_authorizer()

These final two implement the final fix for bug 2424 mentioned
above.  The fix doesn't place the connection header into out_kvec
until it is fully initialized, and then ensures the associated
authorizer buffer is also added before setting the WRITE_PENDING flag
on the connection (and also before dropping the connection mutex).
    ceph: rename prepare_connect_authorizer()
    ceph: add auth buf in prepare_write_connect()

-Alex

[PATCH 01/16] libceph: don't reset kvec in prepare_write_banner()

2012-05-17 Thread Alex Elder


Move the kvec reset for a connection out of prepare_write_banner and
into its only caller.

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 net/ceph/messenger.c |    4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index a659b4d..bcbd409 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -686,7 +686,6 @@ static int prepare_connect_authorizer(struct ceph_connection *con)
 static void prepare_write_banner(struct ceph_messenger *msgr,
 				 struct ceph_connection *con)
 {
-	ceph_con_out_kvec_reset(con);
 	ceph_con_out_kvec_add(con, strlen(CEPH_BANNER), CEPH_BANNER);
 	ceph_con_out_kvec_add(con, sizeof (msgr->my_enc_addr),
 					&msgr->my_enc_addr);
@@ -726,10 +725,9 @@ static int prepare_write_connect(struct ceph_messenger *msgr,
 	con->out_connect.protocol_version = cpu_to_le32(proto);
 	con->out_connect.flags = 0;

+	ceph_con_out_kvec_reset(con);
 	if (include_banner)
 		prepare_write_banner(msgr, con);
-	else
-		ceph_con_out_kvec_reset(con);
 	ceph_con_out_kvec_add(con, sizeof (con->out_connect), &con->out_connect);

 	con->out_more = 0;
--
1.7.5.4


[PATCH 02/16] ceph: messenger: reset connection kvec caller

2012-05-17 Thread Alex Elder


Reset a connection's kvec fields in the caller rather than in
prepare_write_connect().   This ends up repeating a few lines of
code but it's improving the separation between distinct operations
on the connection, which we can take advantage of later.

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 net/ceph/messenger.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index bcbd409..cca3cf3 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -725,7 +725,6 @@ static int prepare_write_connect(struct ceph_messenger *msgr,
 	con->out_connect.protocol_version = cpu_to_le32(proto);
 	con->out_connect.flags = 0;

-	ceph_con_out_kvec_reset(con);
 	if (include_banner)
 		prepare_write_banner(msgr, con);
 	ceph_con_out_kvec_add(con, sizeof (con->out_connect), &con->out_connect);
@@ -1389,6 +1388,7 @@ static int process_connect(struct ceph_connection *con)
 			return -1;
 		}
 		con->auth_retry = 1;
+		ceph_con_out_kvec_reset(con);
 		ret = prepare_write_connect(con->msgr, con, 0);
 		if (ret < 0)
 			return ret;
@@ -1409,6 +1409,7 @@ static int process_connect(struct ceph_connection *con)
 		       ENTITY_NAME(con->peer_name),
 		       ceph_pr_addr(&con->peer_addr.in_addr));
 		reset_connection(con);
+		ceph_con_out_kvec_reset(con);
 		prepare_write_connect(con->msgr, con, 0);
 		prepare_read_connect(con);

@@ -1432,6 +1433,7 @@ static int process_connect(struct ceph_connection *con)
 		     le32_to_cpu(con->out_connect.connect_seq),
 		     le32_to_cpu(con->in_connect.connect_seq));
 		con->connect_seq = le32_to_cpu(con->in_connect.connect_seq);
+		ceph_con_out_kvec_reset(con);
 		prepare_write_connect(con->msgr, con, 0);
 		prepare_read_connect(con);
 		break;
@@ -1446,6 +1448,7 @@ static int process_connect(struct ceph_connection *con)
 		     le32_to_cpu(con->in_connect.global_seq));
 		get_global_seq(con->msgr,
 			       le32_to_cpu(con->in_connect.global_seq));
+		ceph_con_out_kvec_reset(con);
 		prepare_write_connect(con->msgr, con, 0);
 		prepare_read_connect(con);
 		break;
@@ -1851,6 +1854,7 @@ more:

 	/* open the socket first? */
 	if (con->sock == NULL) {
+		ceph_con_out_kvec_reset(con);
 		prepare_write_connect(msgr, con, 1);
 		prepare_read_banner(con);
 		set_bit(CONNECTING, &con->state);
--
1.7.5.4


[PATCH 03/16] ceph: messenger: send banner in process_connect()

2012-05-17 Thread Alex Elder


prepare_write_connect() has an argument indicating whether a banner
should be sent out before sending out a connection message.  It's
only ever set in one of its callers, so move the code that arranges
to send the banner into that caller and drop the include_banner
argument from prepare_write_connect().

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 net/ceph/messenger.c |   16 +++++++---------
 1 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index cca3cf3..6b38b6f 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -695,8 +695,7 @@ static void prepare_write_banner(struct ceph_messenger *msgr,
 }

 static int prepare_write_connect(struct ceph_messenger *msgr,
-				 struct ceph_connection *con,
-				 int include_banner)
+				 struct ceph_connection *con)
 {
 	unsigned global_seq = get_global_seq(con->msgr, 0);
 	int proto;
@@ -725,8 +724,6 @@ static int prepare_write_connect(struct ceph_messenger *msgr,
 	con->out_connect.protocol_version = cpu_to_le32(proto);
 	con->out_connect.flags = 0;

-	if (include_banner)
-		prepare_write_banner(msgr, con);
 	ceph_con_out_kvec_add(con, sizeof (con->out_connect), &con->out_connect);

 	con->out_more = 0;
@@ -1389,7 +1386,7 @@ static int process_connect(struct ceph_connection *con)
 		}
 		con->auth_retry = 1;
 		ceph_con_out_kvec_reset(con);
-		ret = prepare_write_connect(con->msgr, con, 0);
+		ret = prepare_write_connect(con->msgr, con);
 		if (ret < 0)
 			return ret;
 		prepare_read_connect(con);
@@ -1410,7 +1407,7 @@ static int process_connect(struct ceph_connection *con)
 		       ceph_pr_addr(&con->peer_addr.in_addr));
 		reset_connection(con);
 		ceph_con_out_kvec_reset(con);
-		prepare_write_connect(con->msgr, con, 0);
+		prepare_write_connect(con->msgr, con);
 		prepare_read_connect(con);

 		/* Tell ceph about it. */
@@ -1434,7 +1431,7 @@ static int process_connect(struct ceph_connection *con)
 		     le32_to_cpu(con->in_connect.connect_seq));
 		con->connect_seq = le32_to_cpu(con->in_connect.connect_seq);
 		ceph_con_out_kvec_reset(con);
-		prepare_write_connect(con->msgr, con, 0);
+		prepare_write_connect(con->msgr, con);
 		prepare_read_connect(con);
 		break;

@@ -1449,7 +1446,7 @@ static int process_connect(struct ceph_connection *con)
 		get_global_seq(con->msgr,
 			       le32_to_cpu(con->in_connect.global_seq));
 		ceph_con_out_kvec_reset(con);
-		prepare_write_connect(con->msgr, con, 0);
+		prepare_write_connect(con->msgr, con);
 		prepare_read_connect(con);
 		break;

@@ -1855,7 +1852,8 @@ more:
 	/* open the socket first? */
 	if (con->sock == NULL) {
 		ceph_con_out_kvec_reset(con);
-		prepare_write_connect(msgr, con, 1);
+		prepare_write_banner(msgr, con);
+		prepare_write_connect(msgr, con);
 		prepare_read_banner(con);
 		set_bit(CONNECTING, &con->state);
 		clear_bit(NEGOTIATING, &con->state);
--
1.7.5.4
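The caller-side composition that patches 01 through 03 arrive at can be modelled in userspace roughly as follows. This is a sketch under stated assumptions, not the messenger code: the structure layout, the fixed array size and every model_* name are hypothetical, and the real functions also touch connection state that is left out here.

```c
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_KVEC 8	/* arbitrary size chosen for the sketch */

struct model_kvec {
	const void *base;
	size_t len;
};

struct model_con {
	struct model_kvec out_kvec[MODEL_MAX_KVEC];
	int out_kvec_left;	/* number of queued entries */
	size_t out_kvec_bytes;	/* total queued bytes */
};

static void model_kvec_reset(struct model_con *con)
{
	con->out_kvec_left = 0;
	con->out_kvec_bytes = 0;
}

static int model_kvec_add(struct model_con *con, size_t len, const void *data)
{
	if (con->out_kvec_left >= MODEL_MAX_KVEC)
		return -1;	/* no room left in the sketch's fixed array */
	con->out_kvec[con->out_kvec_left].base = data;
	con->out_kvec[con->out_kvec_left].len = len;
	con->out_kvec_left++;
	con->out_kvec_bytes += len;
	return 0;
}

/* Queues banner and address only; it no longer resets anything itself. */
static void model_queue_banner(struct model_con *con, const char *banner,
			       const void *addr, size_t addr_len)
{
	model_kvec_add(con, strlen(banner), banner);
	model_kvec_add(con, addr_len, addr);
}

/* Queues the connect header only; banner handling lives in the caller. */
static void model_queue_connect(struct model_con *con,
				const void *hdr, size_t hdr_len)
{
	model_kvec_add(con, hdr_len, hdr);
}

/* First contact: the caller resets, then sends banner plus connect. */
static void model_open_socket_path(struct model_con *con, const char *banner,
				   const void *addr, size_t addr_len,
				   const void *hdr, size_t hdr_len)
{
	model_kvec_reset(con);
	model_queue_banner(con, banner, addr, addr_len);
	model_queue_connect(con, hdr, hdr_len);
}

/* Renegotiation: the caller resets, then sends the connect header alone. */
static void model_retry_path(struct model_con *con,
			     const void *hdr, size_t hdr_len)
{
	model_kvec_reset(con);
	model_queue_connect(con, hdr, hdr_len);
}
```

Keeping the reset in the caller means each helper does exactly one thing, which is what lets the later patches in the series add the authorizer buffer at a well-defined point.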


[PATCH 04/16] ceph: drop msgr argument from prepare_write_connect()

2012-05-17 Thread Alex Elder


In all cases, the value passed as the msgr argument to
prepare_write_connect() is just con-msgr.  Just get the msgr
value from the ceph connection and drop the unneeded argument.

The only msgr passed to prepare_write_banner() is also therefore
just the one from con-msgr, so change that function to drop the
msgr argument as well.

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 net/ceph/messenger.c |   25 +++++++++++--------------
 1 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 6b38b6f..47499dc 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -683,19 +683,17 @@ static int prepare_connect_authorizer(struct ceph_connection *con)
 /*
  * We connected to a peer and are saying hello.
  */
-static void prepare_write_banner(struct ceph_messenger *msgr,
-				 struct ceph_connection *con)
+static void prepare_write_banner(struct ceph_connection *con)
 {
 	ceph_con_out_kvec_add(con, strlen(CEPH_BANNER), CEPH_BANNER);
-	ceph_con_out_kvec_add(con, sizeof (msgr->my_enc_addr),
-					&msgr->my_enc_addr);
+	ceph_con_out_kvec_add(con, sizeof (con->msgr->my_enc_addr),
+					&con->msgr->my_enc_addr);

 	con->out_more = 0;
 	set_bit(WRITE_PENDING, &con->state);
 }

-static int prepare_write_connect(struct ceph_messenger *msgr,
-				 struct ceph_connection *con)
+static int prepare_write_connect(struct ceph_connection *con)
 {
 	unsigned global_seq = get_global_seq(con->msgr, 0);
 	int proto;
@@ -717,7 +715,7 @@ static int prepare_write_connect(struct ceph_messenger *msgr,
 	dout("prepare_write_connect %p cseq=%d gseq=%d proto=%d\n", con,
 	     con->connect_seq, global_seq, proto);

-	con->out_connect.features = cpu_to_le64(msgr->supported_features);
+	con->out_connect.features = cpu_to_le64(con->msgr->supported_features);
 	con->out_connect.host_type = cpu_to_le32(CEPH_ENTITY_TYPE_CLIENT);
 	con->out_connect.connect_seq = cpu_to_le32(con->connect_seq);
 	con->out_connect.global_seq = cpu_to_le32(global_seq);
@@ -1386,7 +1384,7 @@ static int process_connect(struct ceph_connection *con)
 		}
 		con->auth_retry = 1;
 		ceph_con_out_kvec_reset(con);
-		ret = prepare_write_connect(con->msgr, con);
+		ret = prepare_write_connect(con);
 		if (ret < 0)
 			return ret;
 		prepare_read_connect(con);
@@ -1407,7 +1405,7 @@ static int process_connect(struct ceph_connection *con)
 		       ceph_pr_addr(&con->peer_addr.in_addr));
 		reset_connection(con);
 		ceph_con_out_kvec_reset(con);
-		prepare_write_connect(con->msgr, con);
+		prepare_write_connect(con);
 		prepare_read_connect(con);

 		/* Tell ceph about it. */
@@ -1431,7 +1429,7 @@ static int process_connect(struct ceph_connection *con)
 		     le32_to_cpu(con->in_connect.connect_seq));
 		con->connect_seq = le32_to_cpu(con->in_connect.connect_seq);
 		ceph_con_out_kvec_reset(con);
-		prepare_write_connect(con->msgr, con);
+		prepare_write_connect(con);
 		prepare_read_connect(con);
 		break;

@@ -1446,7 +1444,7 @@ static int process_connect(struct ceph_connection *con)
 		get_global_seq(con->msgr,
 			       le32_to_cpu(con->in_connect.global_seq));
 		ceph_con_out_kvec_reset(con);
-		prepare_write_connect(con->msgr, con);
+		prepare_write_connect(con);
 		prepare_read_connect(con);
 		break;

@@ -1840,7 +1838,6 @@ static void process_message(struct ceph_connection *con)
  */
 static int try_write(struct ceph_connection *con)
 {
-	struct ceph_messenger *msgr = con->msgr;
 	int ret = 1;

 	dout("try_write start %p state %lu nref %d\n", con, con->state,
@@ -1852,8 +1849,8 @@ more:
 	/* open the socket first? */
 	if (con->sock == NULL) {
 		ceph_con_out_kvec_reset(con);
-		prepare_write_banner(msgr, con);
-		prepare_write_connect(msgr, con);
+		prepare_write_banner(con);
+		prepare_write_connect(con);
 		prepare_read_banner(con);
 		set_bit(CONNECTING, &con->state);
 		clear_bit(NEGOTIATING, &con->state);
--
1.7.5.4


[PATCH 05/16] ceph: don't set WRITE_PENDING too early

2012-05-17 Thread Alex Elder


prepare_write_connect() prepares a connect message, then sets
WRITE_PENDING on the connection.  Then *after* this, it calls
prepare_connect_authorizer(), which updates the content of the
connection buffer already queued for sending.  It's also possible it
will result in prepare_write_connect() returning -EAGAIN despite the
WRITE_PENDING bit getting set.

Fix this by preparing the connect authorizer first, setting the
WRITE_PENDING bit only after that is done.

Partially addresses http://tracker.newdream.net/issues/2424

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 net/ceph/messenger.c |    6 +++++-
 1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 47499dc..cf29293 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -697,6 +697,7 @@ static int prepare_write_connect(struct ceph_connection *con)
 {
 	unsigned global_seq = get_global_seq(con->msgr, 0);
 	int proto;
+	int ret;

 	switch (con->peer_name.type) {
 	case CEPH_ENTITY_TYPE_MON:
@@ -723,11 +724,14 @@ static int prepare_write_connect(struct ceph_connection *con)
 	con->out_connect.flags = 0;

 	ceph_con_out_kvec_add(con, sizeof (con->out_connect), &con->out_connect);
+	ret = prepare_connect_authorizer(con);
+	if (ret)
+		return ret;

 	con->out_more = 0;
 	set_bit(WRITE_PENDING, &con->state);

-	return prepare_connect_authorizer(con);
+	return 0;
 }

 /*

 /*
--
1.7.5.4
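The ordering change can be illustrated with a small userspace model. It is only a sketch of the pattern "finish preparation, then publish the flag", not the messenger code: the boolean fields stand in for connection state bits, the callback stands in for prepare_connect_authorizer(), and all model_* names are hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

struct model_con {
	bool header_queued;	/* connect header placed in the out buffer */
	bool auth_ready;	/* authorizer details filled in */
	bool write_pending;	/* stands in for the WRITE_PENDING bit */
};

/* Stand-in for the authorizer step, which may fail (e.g. -EAGAIN). */
typedef int (*model_auth_fn)(struct model_con *con);

static int model_prepare_write_connect(struct model_con *con,
				       model_auth_fn prepare_auth)
{
	int ret;

	con->header_queued = true;

	/* Prepare the authorizer first ... */
	ret = prepare_auth(con);
	if (ret)
		return ret;	/* write_pending is still clear here */

	/* ... and only then announce that there is something to write. */
	con->write_pending = true;
	return 0;
}

static int model_auth_ok(struct model_con *con)
{
	con->auth_ready = true;
	return 0;
}

static int model_auth_again(struct model_con *con)
{
	(void)con;
	return -EAGAIN;
}
```

In the failing case the function returns an error with write_pending still false, so nothing observes a half-prepared header as ready to send.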


[PATCH 08/16] ceph: messenger: check return from get_authorizer

2012-05-17 Thread Alex Elder


In prepare_connect_authorizer(), a connection's get_authorizer
method is called but its return value is ignored.  This method can
return an error, so check for it and return it if that ever occurs.

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 net/ceph/messenger.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 09409a3..e0532d5 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -658,6 +658,7 @@ static int prepare_connect_authorizer(struct ceph_connection *con)
 	void *auth_buf;
 	int auth_len;
 	int auth_protocol;
+	int ret;

 	if (!con->ops->get_authorizer) {
 		con->out_connect.authorizer_protocol = CEPH_AUTH_UNKNOWN;
@@ -673,11 +674,14 @@ static int prepare_connect_authorizer(struct ceph_connection *con)
 	auth_buf = NULL;
 	auth_len = 0;
 	auth_protocol = CEPH_AUTH_UNKNOWN;
-	con->ops->get_authorizer(con, &auth_buf, &auth_len, &auth_protocol,
-				 &con->auth_reply_buf, &con->auth_reply_buf_len,
-				 con->auth_retry);
+	ret = con->ops->get_authorizer(con, &auth_buf, &auth_len,
+				&auth_protocol, &con->auth_reply_buf,
+				&con->auth_reply_buf_len, con->auth_retry);
 	mutex_lock(&con->mutex);

+	if (ret)
+		return ret;
+
 	if (test_bit(CLOSED, &con->state) || test_bit(OPENING, &con->state))
 		return -EAGAIN;

--
1.7.5.4


[PATCH 09/16] ceph: define ceph_auth_handshake type

2012-05-17 Thread Alex Elder


The definitions for the ceph_mds_session and ceph_osd both contain
five fields related only to authorizers.  Encapsulate those fields
into their own struct type, allowing for better isolation in some
upcoming patches.

Fix the #includes in linux/ceph/osd_client.h to lay out their more
complete canonical path.

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 fs/ceph/mds_client.c|   32 
 fs/ceph/mds_client.h|5 ++---
 include/linux/ceph/auth.h   |8 
 include/linux/ceph/osd_client.h |   11 +--
 net/ceph/osd_client.c   |   32 
 5 files changed, 47 insertions(+), 41 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 89971e1..42013c6 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -334,10 +334,10 @@ void ceph_put_mds_session(struct ceph_mds_session *s)
 	dout("mdsc put_session %p %d -> %d\n", s,
 	     atomic_read(&s->s_ref), atomic_read(&s->s_ref)-1);
 	if (atomic_dec_and_test(&s->s_ref)) {
-		if (s->s_authorizer)
+		if (s->s_auth.authorizer)
 		     s->s_mdsc->fsc->client->monc.auth->ops->destroy_authorizer(
 			     s->s_mdsc->fsc->client->monc.auth,
-			     s->s_authorizer);
+			     s->s_auth.authorizer);
 		kfree(s);
 	}
 }
@@ -3404,29 +3404,29 @@ static int get_authorizer(struct ceph_connection *con,
 	struct ceph_auth_client *ac = mdsc->fsc->client->monc.auth;
 	int ret = 0;

-	if (force_new && s->s_authorizer) {
-		ac->ops->destroy_authorizer(ac, s->s_authorizer);
-		s->s_authorizer = NULL;
+	if (force_new && s->s_auth.authorizer) {
+		ac->ops->destroy_authorizer(ac, s->s_auth.authorizer);
+		s->s_auth.authorizer = NULL;
 	}
-	if (s->s_authorizer == NULL) {
+	if (s->s_auth.authorizer == NULL) {
 		if (ac->ops->create_authorizer) {
 			ret = ac->ops->create_authorizer(
 				ac, CEPH_ENTITY_TYPE_MDS,
-				&s->s_authorizer,
-				&s->s_authorizer_buf,
-				&s->s_authorizer_buf_len,
-				&s->s_authorizer_reply_buf,
-				&s->s_authorizer_reply_buf_len);
+				&s->s_auth.authorizer,
+				&s->s_auth.authorizer_buf,
+				&s->s_auth.authorizer_buf_len,
+				&s->s_auth.authorizer_reply_buf,
+				&s->s_auth.authorizer_reply_buf_len);
 			if (ret)
 				return ret;
 		}
 	}

 	*proto = ac->protocol;
-	*buf = s->s_authorizer_buf;
-	*len = s->s_authorizer_buf_len;
-	*reply_buf = s->s_authorizer_reply_buf;
-	*reply_len = s->s_authorizer_reply_buf_len;
+	*buf = s->s_auth.authorizer_buf;
+	*len = s->s_auth.authorizer_buf_len;
+	*reply_buf = s->s_auth.authorizer_reply_buf;
+	*reply_len = s->s_auth.authorizer_reply_buf_len;
 	return 0;
 }

@@ -3437,7 +3437,7 @@ static int verify_authorizer_reply(struct ceph_connection *con, int len)
 	struct ceph_mds_client *mdsc = s->s_mdsc;
 	struct ceph_auth_client *ac = mdsc->fsc->client->monc.auth;

-	return ac->ops->verify_authorizer_reply(ac, s->s_authorizer, len);
+	return ac->ops->verify_authorizer_reply(ac, s->s_auth.authorizer, len);
 }

 static int invalidate_authorizer(struct ceph_connection *con)
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 8c7c04e..dd26846 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -11,6 +11,7 @@
 #include <linux/ceph/types.h>
 #include <linux/ceph/messenger.h>
 #include <linux/ceph/mdsmap.h>
+#include <linux/ceph/auth.h>

 /*
  * Some lock dependencies:
@@ -113,9 +114,7 @@ struct ceph_mds_session {

 	struct ceph_connection s_con;

-	struct ceph_authorizer *s_authorizer;
-	void *s_authorizer_buf, *s_authorizer_reply_buf;
-	size_t s_authorizer_buf_len, s_authorizer_reply_buf_len;
+	struct ceph_auth_handshake s_auth;

 	/* protected by s_gen_ttl_lock */
 	spinlock_t s_gen_ttl_lock;
diff --git a/include/linux/ceph/auth.h b/include/linux/ceph/auth.h
index aa13392..5b774d1 100644
--- a/include/linux/ceph/auth.h
+++ b/include/linux/ceph/auth.h
@@ -14,6 +14,14 @@
 struct ceph_auth_client;
 struct ceph_authorizer;

+struct ceph_auth_handshake {
+	struct ceph_authorizer *authorizer;
+	void *authorizer_buf;
+	size_t authorizer_buf_len;
+	void *authorizer_reply_buf;
+	size_t authorizer_reply_buf_len;
+};
+
 struct ceph_auth_client_ops {
 	const char *name;

diff --git a/include/linux/ceph/osd_client.h

[PATCH 11/16] ceph: ensure auth ops are defined before use

2012-05-17 Thread Alex Elder


In the create_authorizer method for both the mds and osd clients,
the auth_client-ops pointer is blindly dereferenced.  There is no
obvious guarantee that this pointer has been assigned.  And
furthermore, even if the ops pointer is non-null there is definitely
no guarantee that the create_authorizer or destroy_authorizer
methods are defined.

Add checks in both routines to make sure they are defined (non-null)
before use.  Add similar checks in a few other spots in these files
while we're at it.

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 fs/ceph/mds_client.c  |   14 ++++++--------
 net/ceph/osd_client.c |   15 ++++++++++-----
 2 files changed, 16 insertions(+), 13 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b71ffd2..4622817 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3406,16 +3406,14 @@ static int get_authorizer(struct ceph_connection *con,
 	int ret = 0;

 	if (force_new && auth->authorizer) {
-		ac->ops->destroy_authorizer(ac, auth->authorizer);
+		if (ac->ops && ac->ops->destroy_authorizer)
+			ac->ops->destroy_authorizer(ac, auth->authorizer);
 		auth->authorizer = NULL;
 	}
-	if (auth->authorizer == NULL) {
-		if (ac->ops->create_authorizer) {
-			ret = ac->ops->create_authorizer(ac,
-					CEPH_ENTITY_TYPE_MDS, auth);
-			if (ret)
-				return ret;
-		}
+	if (!auth->authorizer && ac->ops && ac->ops->create_authorizer) {
+		ret = ac->ops->create_authorizer(ac, CEPH_ENTITY_TYPE_MDS, auth);
+		if (ret)
+			return ret;
 	}

 	*proto = ac->protocol;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 2da4b9e..f640bdf 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -664,10 +664,10 @@ static void put_osd(struct ceph_osd *osd)
 {
 	dout("put_osd %p %d -> %d\n", osd, atomic_read(&osd->o_ref),
 	     atomic_read(&osd->o_ref) - 1);
-	if (atomic_dec_and_test(&osd->o_ref)) {
+	if (atomic_dec_and_test(&osd->o_ref) && osd->o_auth.authorizer) {
 		struct ceph_auth_client *ac = osd->o_osdc->client->monc.auth;

-		if (osd->o_auth.authorizer)
+		if (ac->ops && ac->ops->destroy_authorizer)
 			ac->ops->destroy_authorizer(ac, osd->o_auth.authorizer);
 		kfree(osd);
 	}
@@ -2119,10 +2119,11 @@ static int get_authorizer(struct ceph_connection *con,
 	int ret = 0;

 	if (force_new && auth->authorizer) {
-		ac->ops->destroy_authorizer(ac, auth->authorizer);
+		if (ac->ops && ac->ops->destroy_authorizer)
+			ac->ops->destroy_authorizer(ac, auth->authorizer);
 		auth->authorizer = NULL;
 	}
-	if (auth->authorizer == NULL) {
+	if (!auth->authorizer && ac->ops && ac->ops->create_authorizer) {
 		ret = ac->ops->create_authorizer(ac, CEPH_ENTITY_TYPE_OSD, auth);
 		if (ret)
 			return ret;
@@ -2144,6 +2145,10 @@ static int verify_authorizer_reply(struct ceph_connection *con, int len)
 	struct ceph_osd_client *osdc = o->o_osdc;
 	struct ceph_auth_client *ac = osdc->client->monc.auth;

+	/*
+	 * XXX If ac->ops or ac->ops->verify_authorizer_reply is null,
+	 * XXX which do we do:  succeed or fail?
+	 */
 	return ac->ops->verify_authorizer_reply(ac, o->o_auth.authorizer, len);
 }

@@ -2153,7 +2158,7 @@ static int invalidate_authorizer(struct ceph_connection *con)
 	struct ceph_osd_client *osdc = o->o_osdc;
 	struct ceph_auth_client *ac = osdc->client->monc.auth;

-	if (ac->ops && ac->ops->invalidate_authorizer)
+	if (ac->ops && ac->ops->invalidate_authorizer)
 		ac->ops->invalidate_authorizer(ac, CEPH_ENTITY_TYPE_OSD);

 	return ceph_monc_validate_auth(&osdc->client->monc);
--
1.7.5.4
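The guard pattern used throughout this patch can be shown in isolation. The following is a userspace sketch, not the ceph auth code: the ops table is reduced to two methods, the counter exists only so the effect is observable, and every model_* name is hypothetical. Treating a missing create method as success is an assumption made for the sketch.

```c
#include <stddef.h>

struct model_auth_client;

/* Method table in which any entry, or the table itself, may be NULL. */
struct model_auth_ops {
	int (*create_authorizer)(struct model_auth_client *ac);
	void (*destroy_authorizer)(struct model_auth_client *ac);
};

struct model_auth_client {
	const struct model_auth_ops *ops;
	int live_authorizers;	/* bookkeeping for the sketch only */
};

/* Calls the method only if both the table and the entry are defined. */
static int model_create(struct model_auth_client *ac)
{
	if (ac->ops && ac->ops->create_authorizer)
		return ac->ops->create_authorizer(ac);
	return 0;	/* assumption: nothing to create counts as success */
}

static void model_destroy(struct model_auth_client *ac)
{
	if (ac->ops && ac->ops->destroy_authorizer)
		ac->ops->destroy_authorizer(ac);
}

/* Example method implementations. */
static int model_count_create(struct model_auth_client *ac)
{
	ac->live_authorizers++;
	return 0;
}

static void model_count_destroy(struct model_auth_client *ac)
{
	ac->live_authorizers--;
}
```

A client with ops set to NULL, or with only one of the two methods filled in, can then be passed to both wrappers without a null dereference.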


[PATCH 12/16] ceph: have get_authorizer methods return pointers

2012-05-17 Thread Alex Elder


Have the get_authorizer auth_client method return a ceph_auth
pointer rather than an integer, pointer-encoding any returned
error value.  This is to pave the way for making use of the
returned value in an upcoming patch.
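
For readers unfamiliar with the pointer-encoding convention mentioned above, here is a minimal userspace sketch; it is not code from this patch. The kernel provides ERR_PTR(), IS_ERR() and PTR_ERR() in <linux/err.h>. The lowercase helpers, struct handshake, get_handshake() and use_handshake() below are hypothetical stand-ins made up for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins (assumed, simplified) for ERR_PTR()/IS_ERR()/PTR_ERR(). */
#define MAX_ERRNO 4095

static inline void *err_ptr(long error)
{
	/* A small negative errno becomes an address in the top 4095 bytes. */
	return (void *)(intptr_t)error;
}

static inline int is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static inline long ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* Hypothetical handshake structure, reduced to a single field. */
struct handshake {
	int buf_len;
};

static struct handshake the_handshake = { .buf_len = 16 };

/* Hypothetical getter: returns the handshake or a pointer-encoded errno. */
static struct handshake *get_handshake(int fail)
{
	if (fail)
		return err_ptr(-ENOMEM);
	return &the_handshake;
}

/* Caller that still returns an int: 0 on success, negative errno on error. */
static int use_handshake(int fail)
{
	struct handshake *h = get_handshake(fail);

	if (is_err(h))
		return (int)ptr_err(h);
	return 0;
}
```

The point of the convention is that one return value carries either a usable pointer or an error code, so the caller needs no separate status argument.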

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 fs/ceph/mds_client.c   |   20 +---
 include/linux/ceph/messenger.h |8 +---
 net/ceph/messenger.c   |8 
 net/ceph/osd_client.c  |   19 ---
 4 files changed, 34 insertions(+), 21 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 4622817..67938a9 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3395,15 +3395,20 @@ out:
 /*
  * authentication
  */
-static int get_authorizer(struct ceph_connection *con,
-			  void **buf, int *len, int *proto,
-			  void **reply_buf, int *reply_len, int force_new)
+
+/*
+ * Note: returned pointer is the address of a structure that's
+ * managed separately.  Caller must *not* attempt to free it.
+ */
+static struct ceph_auth_handshake *get_authorizer(struct ceph_connection *con,
+					void **buf, int *len, int *proto,
+					void **reply_buf, int *reply_len,
+					int force_new)
 {
 	struct ceph_mds_session *s = con->private;
 	struct ceph_mds_client *mdsc = s->s_mdsc;
 	struct ceph_auth_client *ac = mdsc->fsc->client->monc.auth;
 	struct ceph_auth_handshake *auth = &s->s_auth;
-	int ret = 0;

 	if (force_new && auth->authorizer) {
 		if (ac->ops && ac->ops->destroy_authorizer)
@@ -3411,9 +3416,10 @@ static int get_authorizer(struct ceph_connection *con,
 		auth->authorizer = NULL;
 	}
 	if (!auth->authorizer && ac->ops && ac->ops->create_authorizer) {
-		ret = ac->ops->create_authorizer(ac, CEPH_ENTITY_TYPE_MDS, auth);
+		int ret = ac->ops->create_authorizer(ac, CEPH_ENTITY_TYPE_MDS,
+							auth);
 		if (ret)
-			return ret;
+			return ERR_PTR(ret);
 	}

 	*proto = ac->protocol;
@@ -3422,7 +3428,7 @@ static int get_authorizer(struct ceph_connection *con,
 	*reply_buf = auth->authorizer_reply_buf;
 	*reply_len = auth->authorizer_reply_buf_len;

-	return 0;
+	return auth;
 }


diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 3bff047..b10b55f 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -25,9 +25,11 @@ struct ceph_connection_operations {
void (*dispatch) (struct ceph_connection *con, struct ceph_msg *m);

/* authorize an outgoing connection */
-   int (*get_authorizer) (struct ceph_connection *con,
-  void **buf, int *len, int *proto,
-  void **reply_buf, int *reply_len, int force_new);
+   struct ceph_auth_handshake *(*get_authorizer) (
+   struct ceph_connection *con,
+   void **buf, int *len, int *proto,
+   void **reply_buf, int *reply_len,
+   int force_new);
int (*verify_authorizer_reply) (struct ceph_connection *con, int len);
int (*invalidate_authorizer)(struct ceph_connection *con);

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index e0532d5..ac27a2c 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -658,7 +658,7 @@ static int prepare_connect_authorizer(struct ceph_connection *con)
 	void *auth_buf;
 	int auth_len;
 	int auth_protocol;
-	int ret;
+	struct ceph_auth_handshake *auth;

 	if (!con->ops->get_authorizer) {
 		con->out_connect.authorizer_protocol = CEPH_AUTH_UNKNOWN;
@@ -674,13 +674,13 @@ static int prepare_connect_authorizer(struct ceph_connection *con)
 	auth_buf = NULL;
 	auth_len = 0;
 	auth_protocol = CEPH_AUTH_UNKNOWN;
-	ret = con->ops->get_authorizer(con, &auth_buf, &auth_len,
+	auth = con->ops->get_authorizer(con, &auth_buf, &auth_len,
 				&auth_protocol, &con->auth_reply_buf,
 				&con->auth_reply_buf_len, con->auth_retry);
 	mutex_lock(&con->mutex);

-	if (ret)
-		return ret;
+	if (IS_ERR(auth))
+		return PTR_ERR(auth);

 	if (test_bit(CLOSED, &con->state) || test_bit(OPENING, &con->state))
 		return -EAGAIN;
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index f640bdf..fa74ae0 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2108,15 +2108,19 @@ static void put_osd_con(struct ceph_connection *con)
 /*
  * authentication
  */
-static int get_authorizer(struct ceph_connection

[PATCH 13/16] ceph: use info returned by get_authorizer

2012-05-17 Thread Alex Elder


Rather than passing a bunch of arguments to be filled in with the
content of the ceph_auth_handshake buffer now returned by the
get_authorizer method, just use the returned information in the
caller, and drop the unnecessary arguments.
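
The shape of this change can be sketched in plain C as follows. This is an illustrative reduction under stated assumptions, not the patch itself: struct handshake, struct conn, the static buffers, the protocol number 2 and the function names are all hypothetical. The getter keeps only the protocol as an out-parameter, and the caller copies what it needs from the returned structure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down analogue of struct ceph_auth_handshake. */
struct handshake {
	void *buf;
	size_t buf_len;
	void *reply_buf;
	size_t reply_buf_len;
};

/* Hypothetical analogue of the connection fields the caller fills in. */
struct conn {
	void *auth_reply_buf;
	size_t auth_reply_buf_len;
	size_t out_authorizer_len;
};

static char request[8];
static char reply[32];
static struct handshake session_auth = {
	request, sizeof(request), reply, sizeof(reply)
};

/* After the change: only the protocol is still passed back by address. */
static struct handshake *get_authorizer_sketch(int *proto)
{
	*proto = 2;	/* arbitrary protocol number, chosen for the sketch */
	return &session_auth;
}

/* The caller reads the buffers straight out of the returned structure. */
static void prepare_sketch(struct conn *con, int *proto)
{
	struct handshake *auth = get_authorizer_sketch(proto);

	con->auth_reply_buf = auth->reply_buf;
	con->auth_reply_buf_len = auth->reply_buf_len;
	con->out_authorizer_len = auth->buf_len;
}
```

With the structure returned directly, adding a field later does not require touching every get_authorizer implementation's argument list.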

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 fs/ceph/mds_client.c   |9 +
 include/linux/ceph/messenger.h |4 +---
 net/ceph/messenger.c   |   13 +++--
 net/ceph/osd_client.c  |9 +
 4 files changed, 10 insertions(+), 25 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 67938a9..200bc87 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3401,9 +3401,7 @@ out:
  * managed separately.  Caller must *not* attempt to free it.
  */
 static struct ceph_auth_handshake *get_authorizer(struct ceph_connection *con,
-					void **buf, int *len, int *proto,
-					void **reply_buf, int *reply_len,
-					int force_new)
+					int *proto, int force_new)
 {
 	struct ceph_mds_session *s = con->private;
 	struct ceph_mds_client *mdsc = s->s_mdsc;
@@ -3421,12 +3419,7 @@ static struct ceph_auth_handshake *get_authorizer(struct ceph_connection *con,
 		if (ret)
 			return ERR_PTR(ret);
 	}
-
 	*proto = ac->protocol;
-	*buf = auth->authorizer_buf;
-	*len = auth->authorizer_buf_len;
-	*reply_buf = auth->authorizer_reply_buf;
-	*reply_len = auth->authorizer_reply_buf_len;

 	return auth;
 }
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index b10b55f..2521a95 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -27,9 +27,7 @@ struct ceph_connection_operations {
/* authorize an outgoing connection */
struct ceph_auth_handshake *(*get_authorizer) (
struct ceph_connection *con,
-   void **buf, int *len, int *proto,
-   void **reply_buf, int *reply_len,
-   int force_new);
+  int *proto, int force_new);
int (*verify_authorizer_reply) (struct ceph_connection *con, int len);
int (*invalidate_authorizer)(struct ceph_connection *con);

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index ac27a2c..6d82c1a 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -671,20 +671,21 @@ static int prepare_connect_authorizer(struct ceph_connection *con)

 	mutex_unlock(&con->mutex);

-	auth_buf = NULL;
-	auth_len = 0;
 	auth_protocol = CEPH_AUTH_UNKNOWN;
-	auth = con->ops->get_authorizer(con, &auth_buf, &auth_len,
-				&auth_protocol, &con->auth_reply_buf,
-				&con->auth_reply_buf_len, con->auth_retry);
+	auth = con->ops->get_authorizer(con, &auth_protocol, con->auth_retry);
+
 	mutex_lock(&con->mutex);

 	if (IS_ERR(auth))
 		return PTR_ERR(auth);
-
 	if (test_bit(CLOSED, &con->state) || test_bit(OPENING, &con->state))
 		return -EAGAIN;

+	auth_buf = auth->authorizer_buf;
+	auth_len = auth->authorizer_buf_len;
+	con->auth_reply_buf = auth->authorizer_reply_buf;
+	con->auth_reply_buf_len = auth->authorizer_reply_buf_len;
+
 	con->out_connect.authorizer_protocol = cpu_to_le32(auth_protocol);
 	con->out_connect.authorizer_len = cpu_to_le32(auth_len);

diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index fa74ae0..b7d633c 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -2113,9 +2113,7 @@ static void put_osd_con(struct ceph_connection *con)
  * managed separately.  Caller must *not* attempt to free it.
  */
 static struct ceph_auth_handshake *get_authorizer(struct ceph_connection *con,
-					void **buf, int *len, int *proto,
-					void **reply_buf, int *reply_len,
-					int force_new)
+					int *proto, int force_new)
 {
 	struct ceph_osd *o = con->private;
 	struct ceph_osd_client *osdc = o->o_osdc;
@@ -2133,12 +2131,7 @@ static struct ceph_auth_handshake *get_authorizer(struct ceph_connection *con,
 		if (ret)
 			return ERR_PTR(ret);
 	}
-
 	*proto = ac->protocol;
-	*buf = auth->authorizer_buf;
-	*len = auth->authorizer_buf_len;
-	*reply_buf = auth->authorizer_reply_buf;
-	*reply_len = auth->authorizer_reply_buf_len;

 	return auth;
 }
--
1.7.5.4


[PATCH 15/16] ceph: rename prepare_connect_authorizer()

2012-05-17 Thread Alex Elder


Change the name of prepare_connect_authorizer().  The next
patch is going to make this function no longer add anything to the
connection's out_kvec, so it will no longer fit the pattern of
the rest of the prepare_connect_*() functions.

In addition, pass the address of a variable that will hold the
authorization protocol to use.  Move the assignment of that to the
connection's out_connect structure into prepare_write_connect().

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 net/ceph/messenger.c |   13 +++--
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index f92d564..bfddd87 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -653,11 +653,11 @@ static void prepare_write_keepalive(struct ceph_connection *con)
  * Connection negotiation.
  */

-static struct ceph_auth_handshake *prepare_connect_authorizer(struct ceph_connection *con)
+static struct ceph_auth_handshake *get_connect_authorizer(struct ceph_connection *con,
+				int *auth_proto)
 {
 	void *auth_buf;
 	int auth_len;
-	int auth_protocol;
 	struct ceph_auth_handshake *auth;

 	if (!con->ops->get_authorizer) {
@@ -671,8 +671,7 @@ static struct ceph_auth_handshake *prepare_connect_authorizer(struct ceph_connec

 	mutex_unlock(&con->mutex);

-	auth_protocol = CEPH_AUTH_UNKNOWN;
-	auth = con->ops->get_authorizer(con, &auth_protocol, con->auth_retry);
+	auth = con->ops->get_authorizer(con, auth_proto, con->auth_retry);

 	mutex_lock(&con->mutex);

@@ -686,7 +685,6 @@ static struct ceph_auth_handshake *prepare_connect_authorizer(struct ceph_connec
 	con->auth_reply_buf = auth->authorizer_reply_buf;
 	con->auth_reply_buf_len = auth->authorizer_reply_buf_len;

-	con->out_connect.authorizer_protocol = cpu_to_le32(auth_protocol);
 	con->out_connect.authorizer_len = cpu_to_le32(auth_len);

 	if (auth_len)
@@ -712,6 +710,7 @@ static int prepare_write_connect(struct ceph_connection *con)
 {
 	unsigned global_seq = get_global_seq(con->msgr, 0);
 	int proto;
+	int auth_proto;
 	struct ceph_auth_handshake *auth;

 	switch (con->peer_name.type) {
@@ -739,9 +738,11 @@ static int prepare_write_connect(struct ceph_connection *con)
 	con->out_connect.flags = 0;

 	ceph_con_out_kvec_add(con, sizeof (con->out_connect), &con->out_connect);
-	auth = prepare_connect_authorizer(con);
+	auth_proto = CEPH_AUTH_UNKNOWN;
+	auth = get_connect_authorizer(con, &auth_proto);
 	if (IS_ERR(auth))
 		return PTR_ERR(auth);
+	con->out_connect.authorizer_protocol = cpu_to_le32(auth_proto);

 	con->out_more = 0;
 	set_bit(WRITE_PENDING, &con->state);
--
1.7.5.4


[PATCH 16/16] ceph: add auth buf in prepare_write_connect()

2012-05-17 Thread Alex Elder


Move the addition of the authorizer buffer to a connection's
out_kvec out of get_connect_authorizer() and into its caller.  This
way, the caller--prepare_write_connect()--can avoid adding the
connect header to out_kvec before it has been fully initialized.

Prior to this patch, it was possible for a connect header to be
sent over the wire before the authorizer protocol or buffer length
fields were initialized.  An authorizer buffer associated with that
header could also be queued to send only after the connection header
that describes it was on the wire.
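
The ordering rule described above can be sketched in userspace C as follows. This is only an illustration under assumptions: struct connect_hdr, struct kvec_sketch, struct conn, out_kvec_add() and queue_connect() are hypothetical reductions, and the real header has more fields and stores them little-endian via cpu_to_le32(). The idea is to finish every header field first, then queue the header, then queue the authorizer buffer it describes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical connect header, reduced to the two fields discussed. */
struct connect_hdr {
	uint32_t authorizer_protocol;
	uint32_t authorizer_len;
};

/* Hypothetical stand-in for one queued output segment. */
struct kvec_sketch {
	const void *base;
	size_t len;
};

struct conn {
	struct connect_hdr out_connect;
	struct kvec_sketch out_kvec[4];
	int out_kvec_left;
};

/* Append one segment to the connection's output queue. */
static void out_kvec_add(struct conn *con, size_t len, const void *data)
{
	assert(con->out_kvec_left < 4);
	con->out_kvec[con->out_kvec_left].base = data;
	con->out_kvec[con->out_kvec_left].len = len;
	con->out_kvec_left++;
}

/* Initialize the whole header, then queue header and payload in order. */
static void queue_connect(struct conn *con, uint32_t proto,
			  const void *auth_buf, uint32_t auth_len)
{
	con->out_connect.authorizer_protocol = proto;
	con->out_connect.authorizer_len = auth_len;

	out_kvec_add(con, sizeof(con->out_connect), &con->out_connect);
	if (auth_len)
		out_kvec_add(con, auth_len, auth_buf);
}
```

Because nothing is queued until the header is complete, a writer draining the queue can never observe a half-initialized header.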

Fixes http://tracker.newdream.net/issues/2424

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 net/ceph/messenger.c |   17 -
 1 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index bfddd87..c0b18dc 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -656,8 +656,6 @@ static void prepare_write_keepalive(struct ceph_connection *con)
 static struct ceph_auth_handshake *get_connect_authorizer(struct ceph_connection *con,
 				int *auth_proto)
 {
-	void *auth_buf;
-	int auth_len;
 	struct ceph_auth_handshake *auth;

 	if (!con->ops->get_authorizer) {
@@ -680,15 +678,9 @@ static struct ceph_auth_handshake *get_connect_authorizer(struct ceph_connection
 	if (test_bit(CLOSED, &con->state) || test_bit(OPENING, &con->state))
 		return ERR_PTR(-EAGAIN);

-	auth_buf = auth->authorizer_buf;
-	auth_len = auth->authorizer_buf_len;
 	con->auth_reply_buf = auth->authorizer_reply_buf;
 	con->auth_reply_buf_len = auth->authorizer_reply_buf_len;

-	con->out_connect.authorizer_len = cpu_to_le32(auth_len);
-
-	if (auth_len)
-		ceph_con_out_kvec_add(con, auth_len, auth_buf);

 	return auth;
 }
@@ -737,12 +729,19 @@ static int prepare_write_connect(struct ceph_connection *con)
 	con->out_connect.protocol_version = cpu_to_le32(proto);
 	con->out_connect.flags = 0;

-	ceph_con_out_kvec_add(con, sizeof (con->out_connect), &con->out_connect);
 	auth_proto = CEPH_AUTH_UNKNOWN;
 	auth = get_connect_authorizer(con, &auth_proto);
 	if (IS_ERR(auth))
 		return PTR_ERR(auth);
+
 	con->out_connect.authorizer_protocol = cpu_to_le32(auth_proto);
+	con->out_connect.authorizer_len = cpu_to_le32(auth->authorizer_buf_len);
+
+	ceph_con_out_kvec_add(con, sizeof (con->out_connect),
+					&con->out_connect);
+	if (auth->authorizer_buf_len)
+		ceph_con_out_kvec_add(con, auth->authorizer_buf_len,
+					auth->authorizer_buf);

 	con->out_more = 0;
 	set_bit(WRITE_PENDING, &con->state);
--
1.7.5.4


[PATCH 14/16] ceph: return pointer from prepare_connect_authorizer()

2012-05-17 Thread Alex Elder


Change prepare_connect_authorizer() so it returns a pointer (or
pointer-coded error).

Signed-off-by: Alex Elder el...@inktank.com
Reviewed-by: Sage Weil s...@inktank.com
---
 net/ceph/messenger.c |   18 +-
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 6d82c1a..f92d564 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -653,7 +653,7 @@ static void prepare_write_keepalive(struct ceph_connection *con)
  * Connection negotiation.
  */

-static int prepare_connect_authorizer(struct ceph_connection *con)
+static struct ceph_auth_handshake *prepare_connect_authorizer(struct ceph_connection *con)
 {
 	void *auth_buf;
 	int auth_len;
@@ -664,7 +664,7 @@ static int prepare_connect_authorizer(struct ceph_connection *con)
 		con->out_connect.authorizer_protocol = CEPH_AUTH_UNKNOWN;
 		con->out_connect.authorizer_len = 0;

-		return 0;
+		return NULL;
 	}

 	/* Can't hold the mutex while getting authorizer */
@@ -677,9 +677,9 @@ static int prepare_connect_authorizer(struct ceph_connection *con)
 	mutex_lock(&con->mutex);

 	if (IS_ERR(auth))
-		return PTR_ERR(auth);
+		return auth;
 	if (test_bit(CLOSED, &con->state) || test_bit(OPENING, &con->state))
-		return -EAGAIN;
+		return ERR_PTR(-EAGAIN);

 	auth_buf = auth->authorizer_buf;
 	auth_len = auth->authorizer_buf_len;
@@ -692,7 +692,7 @@ static int prepare_connect_authorizer(struct ceph_connection *con)
 	if (auth_len)
 		ceph_con_out_kvec_add(con, auth_len, auth_buf);

-	return 0;
+	return auth;
 }

 /*
@@ -712,7 +712,7 @@ static int prepare_write_connect(struct ceph_connection *con)
 {
 	unsigned global_seq = get_global_seq(con->msgr, 0);
 	int proto;
-	int ret;
+	struct ceph_auth_handshake *auth;

 	switch (con->peer_name.type) {
 	case CEPH_ENTITY_TYPE_MON:
@@ -739,9 +739,9 @@ static int prepare_write_connect(struct ceph_connection *con)
 	con->out_connect.flags = 0;

 	ceph_con_out_kvec_add(con, sizeof (con->out_connect), &con->out_connect);
-	ret = prepare_connect_authorizer(con);
-	if (ret)
-		return ret;
+	auth = prepare_connect_authorizer(con);
+	if (IS_ERR(auth))
+		return PTR_ERR(auth);

 	con->out_more = 0;
 	set_bit(WRITE_PENDING, &con->state);
--
1.7.5.4


Re: Ceph on btrfs 3.4rc

2012-05-17 Thread Martin Mailand


Hi Josef,
no, there was nothing above. Here is another dmesg output.


> Was there anything above those messages?  There should have been a WARN_ON() or
> something.  If not that's fine, I just need to know one way or the other so I can
> figure out what to do next.  Thanks,
>
> Josef


-martin

[   63.027277] Btrfs loaded
[   63.027485] device fsid 266726e1-439f-4d89-a374-7ef92d355daf devid 1 
transid 4 /dev/sdc

[   63.027750] btrfs: setting nodatacow
[   63.027752] btrfs: enabling auto defrag
[   63.027753] btrfs: disk space caching is enabled
[   63.027754] btrfs flagging fs with big metadata feature
[   63.036347] device fsid 070e2c6c-2ea5-478d-bc07-7ce3a954e2e4 devid 1 
transid 4 /dev/sdd

[   63.036624] btrfs: setting nodatacow
[   63.036626] btrfs: enabling auto defrag
[   63.036627] btrfs: disk space caching is enabled
[   63.036628] btrfs flagging fs with big metadata feature
[   63.045628] device fsid 6f7b82a9-a1b7-40c6-8b00-2c2a44481066 devid 1 
transid 4 /dev/sde

[   63.045910] btrfs: setting nodatacow
[   63.045912] btrfs: enabling auto defrag
[   63.045913] btrfs: disk space caching is enabled
[   63.045914] btrfs flagging fs with big metadata feature
[   63.831278] device fsid 46890b76-45c2-4ea2-96ee-2ea88e29628b devid 1 
transid 4 /dev/sdf

[   63.831577] btrfs: setting nodatacow
[   63.831579] btrfs: enabling auto defrag
[   63.831579] btrfs: disk space caching is enabled
[   63.831580] btrfs flagging fs with big metadata feature
[ 1521.820412] ------------[ cut here ]------------
[ 1521.820424] kernel BUG at fs/btrfs/inode.c:2220!
[ 1521.820433] invalid opcode:  [#1] SMP
[ 1521.820448] CPU 4
[ 1521.820452] Modules linked in: btrfs zlib_deflate libcrc32c ext2 ses 
enclosure bonding coretemp ghash_clmulni_intel aesni_intel cryptd 
aes_x86_64 psmouse microcode serio_raw sb_edac edac_core mei(C) joydev 
ioatdma mac_hid lp parport isci libsas scsi_transport_sas usbhid hid 
ixgbe igb dca megaraid_sas mdio

[ 1521.820562]
[ 1521.820567] Pid: 3095, comm: ceph-osd Tainted: G C 
3.4.0-rc7+ #10 Supermicro X9SRi/X9SRi
[ 1521.820591] RIP: 0010:[a02532f2]  [a02532f2] 
btrfs_orphan_del+0xe2/0xf0 [btrfs]

[ 1521.820616] RSP: 0018:881013da9d18  EFLAGS: 00010282
[ 1521.820626] RAX: fffe RBX: 881013a3b7f0 RCX: 
00395dcf
[ 1521.820640] RDX: 00395dce RSI: 88101df77480 RDI: 
ea004077ddc0
[ 1521.820654] RBP: 881013da9d58 R08: 60ef800010d0 R09: 
a022ac6a
[ 1521.820667] R10:  R11: 010a R12: 
88101e378790
[ 1521.820681] R13: 88101e378400 R14: 0001 R15: 
0001
[ 1521.820695] FS:  7faa45d30700() GS:88107fc8() 
knlGS:

[ 1521.820710] CS:  0010 DS:  ES:  CR0: 80050033
[ 1521.820738] CR2: 7fe0efba6010 CR3: 001016fec000 CR4: 
000407e0
[ 1521.820767] DR0:  DR1:  DR2: 

[ 1521.820796] DR3:  DR6: 0ff0 DR7: 
0400
[ 1521.820825] Process ceph-osd (pid: 3095, threadinfo 881013da8000, 
task 881013da44a0)

[ 1521.820870] Stack:
[ 1521.820889]  0c05 88101df9c230 881013da9d38 
88101df9c230
[ 1521.820939]   88101e378400 881013a3b7f0 
880c6880f840
[ 1521.820988]  881013da9e08 a0257628 881013a3b7f0 


[ 1521.821038] Call Trace:
[ 1521.821066]  [a0257628] btrfs_truncate+0x4d8/0x650 [btrfs]
[ 1521.821096]  [81188afd] ? path_lookupat+0x6d/0x750
[ 1521.821128]  [a0259021] btrfs_setattr+0xc1/0x1b0 [btrfs]
[ 1521.821156]  [811955c3] notify_change+0x183/0x320
[ 1521.821183]  [8117889e] do_truncate+0x5e/0xa0
[ 1521.821209]  [81178a24] sys_truncate+0x144/0x1b0
[ 1521.821237]  [8165fd29] system_call_fastpath+0x16/0x1b
[ 1521.821265] Code: e8 4c 8b 75 f0 4c 8b 7d f8 c9 c3 66 0f 1f 44 00 00 
80 bb 60 fe ff ff 84 75 b4 eb ae 0f 1f 44 00 00 48 89 df e8 50 73 fe ff 
eb b8 0f 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec

[ 1521.821458] RIP  [a02532f2] btrfs_orphan_del+0xe2/0xf0 [btrfs]
[ 1521.821492]  RSP 881013da9d18
[ 1521.821758] ---[ end trace aee4c5fe92ee2a67 ]---
[ 6888.637508] btrfs: truncated 1 orphans
[ 7641.701736] ------------[ cut here ]------------
[ 7641.701764] kernel BUG at fs/btrfs/inode.c:2220!
[ 7641.701789] invalid opcode:  [#2] SMP
[ 7641.701816] CPU 3
[ 7641.701819] Modules linked in: btrfs zlib_deflate libcrc32c ext2 ses 
enclosure bonding coretemp ghash_clmulni_intel aesni_intel cryptd 
aes_x86_64 psmouse microcode serio_raw sb_edac edac_core mei(C) joydev 
ioatdma mac_hid lp parport isci libsas scsi_transport_sas usbhid hid 
ixgbe igb dca megaraid_sas mdio

[ 7641.702000]
[ 7641.702030] Pid: 3064, comm: ceph-osd Tainted: G  D  C 
3.4.0-rc7+ #10 Supermicro X9SRi/X9SRi
[ 7641.702081] RIP: 0010:[a02532f2]  [a02532f2]

Re: Journal too small

2012-05-17 Thread Sage Weil

On Thu, 17 May 2012, Karol Jurak wrote:
> Hi,
>
> During an ongoing recovery in one of my clusters a couple of OSDs
> complained about a journal that was too small. For instance:
>
> 2012-05-12 13:31:04.034144 7f491061d700  1 journal check_for_full at
> 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
> start 863363072)
> 2012-05-12 13:31:04.034680 7f491061d700  0 journal JOURNAL TOO SMALL: item
> 1693745152 > journal 1048571904 (usable)
>
> I was under the impression that the OSDs stopped participating in recovery
> after this event. (ceph -w showed that the number of PGs in state
> active+clean no longer increased.) They resumed recovery after I enlarged
> their journals (stop osd, --flush-journal, --mkjournal, start osd).
>
> How serious is such a situation? Do the OSDs know how to handle it
> correctly? Or could this result in some data loss or corruption? After the
> recovery finished (ceph -w showed that all PGs are in active+clean state)
> I noticed that a few rbd images were corrupted.

The osds tolerate the full journal.  There will be a big latency spike, 
but they'll recover without risking data.  You should definitely increase 
the journal size if this happens regularly, though.

sage
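
As a rough illustration of the "JOURNAL TOO SMALL" line quoted above, here is a small arithmetic sketch in C. It is not Ceph code: item_fits_journal() and min_journal_mib() are hypothetical helpers, the comparison is assumed from the "item ... > journal ... (usable)" wording of the log, and the size estimate ignores the journal header and any alignment overhead, so treat it as a lower bound only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical check: can a single item of item_bytes ever fit? */
static int item_fits_journal(uint64_t item_bytes, uint64_t usable_bytes)
{
	return item_bytes <= usable_bytes;
}

/* Hypothetical estimate: smallest whole number of MiB covering item_bytes. */
static uint64_t min_journal_mib(uint64_t item_bytes)
{
	const uint64_t mib = 1024 * 1024;

	return (item_bytes + mib - 1) / mib;
}
```

With the values from the log, the 1693745152-byte item cannot fit in the 1048571904 usable bytes, and a journal would need to exceed roughly 1616 MiB before such an item could fit at all.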

 
> The cluster runs v0.46. The OSDs use ext4. I'm pretty sure that during the
> recovery no clients were accessing the cluster.
>
> Best regards,
> Karol

Re: global_init_daemonize: BUG: there are 1 child threads already started that will now die!

2012-05-17 Thread Sage Weil

On Thu, 17 May 2012, Székelyi Szabolcs wrote:
> Hello,
>
> I get the $subject message when starting Ceph with the init script. I have to
> try it 15-20 times until the start succeeds. I've seen this message emitted by
> the monitor and MDS daemons, but never by OSDs. Is this really a bug as the
> message says? I have two theoretically identical clusters, and only one of
> them produces this error, so something might be wrong on my side, but what?

This is just a broken/racy check.  There's a patch in the stable branch 
that removes it.  It'll also be fixed in 0.47, out tomorrow most likely.

sage

Re: pushed it to a crash

2012-05-17 Thread Tommi Virtanen

[Added ceph-devel to Cc:]

On Thu, May 17, 2012 at 1:56 AM, Clint Byrum cl...@fewbar.com wrote:
 Hi Tommi.

 I got home and just had to finish the refactor. Now I get a crash, which I've 
 pasted below.

Well, I've never seen that one before. Anyone else have a clue? Sage, Greg?

For others: that's using a setup where one of the monitors
(mon.cmon-0) gets a monmap with itself as the only member, and others
(mon.cmon-1) get a monmap that has the IP of mon.cmon-0.

 It's reproducible with the charm I have, which I've pushed to this bzr branch:

 lp:~clint-fewbar/charms/precise/ceph-mon/rewrite

 If you are brave enough to try juju:

 juju bootstrap
 juju deploy --repository ~/charms local:ceph-mon -n 3
 ... wait
 juju set ceph-mon initializing-unit=ceph-mon/0

 The crash is on the leader (renamed to the initializing-unit to remove 
 confusion).

 I've left this running, if you'd like to do anything as far as debugging.

 Anyway, the log from the non-leaders looks like this:

 2012-05-17 08:48:12.077550 7fd3e11d1780  0 store(/mnt/mon.cmon-1) created 
 monfs at /mnt/mon.cmon-1 for cmon-1
 2012-05-17 08:48:13.714240 7fd5b13ed780  1 mon.cmon-1@-1(probing) e0 init 
 fsid f56c6f22-9ffc-11e1-bae9-22000afc46fb
 2012-05-17 08:48:13.789339 7fd5ac46a700  0 log [INF] : mon.cmon-1 calling new 
 monitor election
 2012-05-17 08:48:14.416376 7fd5ab468700  0 -- 10.252.87.112:6800/0 >> 
 10.252.70.251:6789/0 pipe(0x2346500 sd=17 pgs=2 cs=1 l=0).fault initiating 
 reconnect

 And here is the crash log:

 root@ip-10-252-70-251:~# cat /var/log/ceph/ceph-mon.cmon-0.log
 2012-05-17 08:47:42.647248 7fc7c734d780  0 store(/mnt/mon.cmon-0) created 
 monfs at /mnt/mon.cmon-0 for cmon-0
 2012-05-17 08:47:48.263011 7f1657506780  1 mon.cmon-0@0(probing) e0 init fsid 
 f56c6f22-9ffc-11e1-bae9-22000afc46fb
 2012-05-17 08:47:48.263654 7f1657506780  1 mon.cmon-0@0(probing) e0 
 win_standalone_election
 2012-05-17 08:47:48.263748 7f1657506780  0 log [INF] : mon.cmon-0@0 won 
 leader election with quorum 0
 2012-05-17 08:47:48.297184 7f1657506780  1 mon.cmon-0@0(leader).osd e1 e1: 0 
 osds: 0 up, 0 in
 2012-05-17 08:47:48.528721 7f1657506780  1 mon.cmon-0@0(probing) e1 
 win_standalone_election
 2012-05-17 08:47:48.528829 7f1657506780  0 log [INF] : mon.cmon-0@0 won 
 leader election with quorum 0
 2012-05-17 08:48:13.749820 7f1652583700  0 mon.cmon-0@0(leader).monmap v1 
 adding cmon-2 at 10.252.33.186:6800/0 to monitor cluster
 2012-05-17 08:48:13.752916 7f1652583700  0 log [INF] : mon.cmon-0 calling new 
 monitor election
 2012-05-17 08:48:13.881387 7f1652583700 -1 mon/MonMap.h: In function 
 'entity_inst_t MonMap::get_inst(unsigned int) const' thread 7f1652583700 time 
 2012-05-17 08:48:13.880187
 mon/MonMap.h: 167: FAILED assert(m < rank_addr.size())

  ceph version 0.46-219-g54bc094 
 (commit:54bc09417917d8d0ca99d8ed8285498b7d5aa369)
  1: (Elector::defer(int)+0x6da) [0x5162aa]
  2: (Elector::handle_propose(MMonElection*)+0x573) [0x516903]
  3: (Elector::dispatch(Message*)+0xc73) [0x519863]
  4: (Monitor::_ms_dispatch(Message*)+0x3f3) [0x4865a3]
  5: (Monitor::ms_dispatch(Message*)+0x32) [0x493752]
  6: (SimpleMessenger::dispatch_entry()+0x863) [0x5bf653]
  7: (SimpleMessenger::DispatchThread::entry()+0xd) [0x589e3d]
  8: (()+0x7e9a) [0x7f16570dce9a]
  9: (clone()+0x6d) [0x7f16558954bd]
  NOTE: a copy of the executable, or `objdump -rdS executable` is needed to 
 interpret this.

 --- begin dump of recent events ---
   -10 2012-05-17 08:47:48.253421 7f1657506780  1 store(/mnt/mon.cmon-0) mount
    -9 2012-05-17 08:47:48.257204 7f1657506780  0 ceph version 
 0.46-219-g54bc094 (commit:54bc09417917d8d0ca99d8ed8285498b7d5aa369), process 
 ceph-mon, pid 7640
    -8 2012-05-17 08:47:48.263011 7f1657506780  1 mon.cmon-0@0(probing) e0 
 init fsid f56c6f22-9ffc-11e1-bae9-22000afc46fb
    -7 2012-05-17 08:47:48.263654 7f1657506780  1 mon.cmon-0@0(probing) e0 
 win_standalone_election
    -6 2012-05-17 08:47:48.263748 7f1657506780  0 log [INF] : mon.cmon-0@0 
 won leader election with quorum 0
    -5 2012-05-17 08:47:48.297184 7f1657506780  1 mon.cmon-0@0(leader).osd e1 
 e1: 0 osds: 0 up, 0 in
    -4 2012-05-17 08:47:48.528721 7f1657506780  1 mon.cmon-0@0(probing) e1 
 win_standalone_election
    -3 2012-05-17 08:47:48.528829 7f1657506780  0 log [INF] : mon.cmon-0@0 
 won leader election with quorum 0
    -2 2012-05-17 08:48:13.749820 7f1652583700  0 mon.cmon-0@0(leader).monmap 
 v1 adding cmon-2 at 10.252.33.186:6800/0 to monitor cluster
    -1 2012-05-17 08:48:13.752916 7f1652583700  0 log [INF] : mon.cmon-0 
 calling new monitor election
     0 2012-05-17 08:48:13.881387 7f1652583700 -1 mon/MonMap.h: In function 
 'entity_inst_t MonMap::get_inst(unsigned int) const' thread 7f1652583700 time 
 2012-05-17 08:48:13.880187
 mon/MonMap.h: 167: FAILED assert(m < rank_addr.size())

  ceph version 0.46-219-g54bc094 
 (commit:54bc09417917d8d0ca99d8ed8285498b7d5aa369)
  1: (Elector::defer(int)+0x6da) [0x5162aa]
  2:

Re: Journal too small

2012-05-17 Thread Tommi Virtanen

On Thu, May 17, 2012 at 9:01 AM, Sage Weil s...@inktank.com wrote:
> > 2012-05-12 13:31:04.034144 7f491061d700  1 journal check_for_full at
> > 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
> > start 863363072)
> > 2012-05-12 13:31:04.034680 7f491061d700  0 journal JOURNAL TOO SMALL: item
> > 1693745152 > journal 1048571904 (usable)
>
> The osds tolerate the full journal.  There will be a big latency spike,
> but they'll recover without risking data.  You should definitely increase
> the journal size if this happens regularly, though.

I propose for your merging pleasure:
https://github.com/ceph/ceph/commits/journal-too-small
https://github.com/ceph/ceph/commit/62db60bede8b187e25acb715f6616d2ce7cfc97f

Re: global_init_daemonize: BUG: there are 1 child threads already started that will now die!

2012-05-17 Thread Székelyi Szabolcs

On 2012. May 17. 09:02:45 Sage Weil wrote:
> On Thu, 17 May 2012, Székelyi Szabolcs wrote:
> > I get the $subject message when starting Ceph with the init script. I have
> > to try it 15-20 times until the start succeeds. I've seen this message
> > emitted by the monitor and MDS daemons, but never by OSDs. Is this really
> > a bug as the message says? I have two theoretically identical clusters,
> > and only one of them produces this error, so something might be wrong on
> > my side, but what?
>
> This is just a broken/racy check.  There's a patch in the stable branch
> that removes it.  It'll also be fixed in 0.47, out tomorrow most likely.

Great, thank you both!

-- 
cc



Re: Journal too small

2012-05-17 Thread Sage Weil

On Thu, 17 May 2012, Tommi Virtanen wrote:
> On Thu, May 17, 2012 at 9:01 AM, Sage Weil s...@inktank.com wrote:
> > 2012-05-12 13:31:04.034144 7f491061d700  1 journal check_for_full at
> > 863363072 : JOURNAL FULL 863363072 >= 1048571903 (max_size 1048576000
> > start 863363072)
> > 2012-05-12 13:31:04.034680 7f491061d700  0 journal JOURNAL TOO SMALL: item
> > 1693745152 > journal 1048571904 (usable)
> >
> > The osds tolerate the full journal.  There will be a big latency spike,
> > but they'll recover without risking data.  You should definitely increase
> > the journal size if this happens regularly, though.
>
> I propose for your merging pleasure:
> https://github.com/ceph/ceph/commits/journal-too-small
> https://github.com/ceph/ceph/commit/62db60bede8b187e25acb715f6616d2ce7cfc97f

Perfect, merged.

Re: Journal too small

2012-05-17 Thread Josh Durgin


On 05/17/2012 03:59 AM, Karol Jurak wrote:

> How serious is such a situation? Do the OSDs know how to handle it
> correctly? Or could this result in some data loss or corruption? After the
> recovery finished (ceph -w showed that all PGs are in active+clean state)
> I noticed that a few rbd images were corrupted.


As Sage mentioned, the OSDs know how to handle full journals correctly.

I'd like to figure out how your rbd images got corrupted, if possible.

How did you notice the corruption?

Has your cluster always run 0.46, or did you upgrade from earlier
versions?

What happened to the cluster between your last check for corruption and
now? Did your use of it or any ceph client or server configuration
change?

Re: Ceph on btrfs 3.4rc

2012-05-17 Thread Josef Bacik

On Thu, May 17, 2012 at 05:12:55PM +0200, Martin Mailand wrote:
 Hi Josef,
 no there was nothing above. Here is another dmesg output.
 

Hrm ok give this a try and hopefully this is it, still couldn't reproduce.
Thanks,

Josef

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3771b85..559e716 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
 	/* used to order data wrt metadata */
 	struct btrfs_ordered_inode_tree ordered_tree;
 
-	/* for keeping track of orphaned inodes */
-	struct list_head i_orphan;
-
 	/* list of all the delalloc inodes in the FS.  There are times we need
 	 * to write all the delalloc pages to disk, and this list is used
 	 * to walk them all.
@@ -153,6 +150,7 @@ struct btrfs_inode {
 	unsigned dummy_inode:1;
 	unsigned in_defrag:1;
 	unsigned delalloc_meta_reserved:1;
+	unsigned has_orphan_item:1;
 
 	/*
 	 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ba8743b..72cdf98 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
 	struct list_head root_list;
 
 	spinlock_t orphan_lock;
-	struct list_head orphan_list;
+	atomic_t orphan_inodes;
 	struct btrfs_block_rsv *orphan_block_rsv;
 	int orphan_item_inserted;
 	int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 19f5b45..25dba7a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	root->orphan_block_rsv = NULL;
 
 	INIT_LIST_HEAD(&root->dirty_list);
-	INIT_LIST_HEAD(&root->orphan_list);
 	INIT_LIST_HEAD(&root->root_list);
 	spin_lock_init(&root->orphan_lock);
 	spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
 	atomic_set(&root->log_commit[0], 0);
 	atomic_set(&root->log_commit[1], 0);
 	atomic_set(&root->log_writers, 0);
+	atomic_set(&root->orphan_inodes, 0);
 	root->log_batch = 0;
 	root->log_transid = 0;
 	root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 54ae3df..7cc1c96 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2104,12 +2104,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
 	struct btrfs_block_rsv *block_rsv;
 	int ret;
 
-	if (!list_empty(&root->orphan_list) ||
+	if (atomic_read(&root->orphan_inodes) ||
 	    root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
 		return;
 
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&root->orphan_list)) {
+	if (atomic_read(&root->orphan_inodes)) {
 		spin_unlock(&root->orphan_lock);
 		return;
 	}
@@ -2166,8 +2166,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 		block_rsv = NULL;
 	}
 
-	if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+	if (!BTRFS_I(inode)->has_orphan_item) {
+		BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
 		/*
 		 * For proper ENOSPC handling, we should do orphan
@@ -2180,6 +2180,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 			insert = 1;
 #endif
 		insert = 1;
+		atomic_inc(&root->orphan_inodes);
 	}
 
 	if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2198,6 +2199,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
 	if (insert >= 1) {
 		ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
 		if (ret && ret != -EEXIST) {
+			spin_lock(&root->orphan_lock);
+			BTRFS_I(inode)->has_orphan_item = 0;
+			spin_unlock(&root->orphan_lock);
 			btrfs_abort_transaction(trans, root, ret);
 			return ret;
 		}
@@ -2227,13 +2231,21 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
 	int release_rsv = 0;
 	int ret = 0;
 
+	/*
+	 * evict_inode gets called without holding the i_mutex so we need to
+	 * take the orphan lock to make sure we are safe in messing with these.
+	 */
 	spin_lock(&root->orphan_lock);
-	if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-		list_del_init(&BTRFS_I(inode)->i_orphan);
-		delete_item = 1;
+	if (BTRFS_I(inode)->has_orphan_item) {
+		if (trans) {
+			BTRFS_I(inode)->has_orphan_item = 0;
+			delete_item = 1;
+		} else {
+			WARN_ON(1);
+		}
 	}
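
The bookkeeping change in the patch above (a per-inode flag plus a per-root
counter in place of a linked list) can be modelled in a few lines. This is
only a toy sketch in Python, not kernel code: the class and function names
are hypothetical, a threading.Lock stands in for orphan_lock, and the
counter decrement on delete is an assumption, since the quoted diff is cut
off before that part of btrfs_orphan_del.

```python
import threading


class Root:
    """Stand-in for struct btrfs_root: a lock and an orphan counter."""

    def __init__(self):
        self.orphan_lock = threading.Lock()
        self.orphan_inodes = 0


class Inode:
    """Stand-in for struct btrfs_inode: one has_orphan_item flag."""

    def __init__(self):
        self.has_orphan_item = False


def orphan_add(root, inode):
    """Mark the inode as orphaned once; return True if it was newly added."""
    with root.orphan_lock:
        if inode.has_orphan_item:
            return False
        inode.has_orphan_item = True
        root.orphan_inodes += 1
        return True


def orphan_del(root, inode, have_trans=True):
    """Clear the flag under the lock; return True if an item must be deleted.

    Without a transaction the patch only warns (WARN_ON(1)) and leaves the
    flag set, which this model mirrors by returning False.
    """
    with root.orphan_lock:
        if not inode.has_orphan_item or not have_trans:
            return False
        inode.has_orphan_item = False
        root.orphan_inodes -= 1  # assumed: not visible in the truncated diff
        return True
```

The point of the model is the invariant: the counter always equals the number
of inodes whose flag is set, and both are only changed together under the
lock.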

Re: Ceph on btrfs 3.4rc

2012-05-17 Thread Christian Brunner

2012/5/17 Josef Bacik jo...@redhat.com:
 On Thu, May 17, 2012 at 05:12:55PM +0200, Martin Mailand wrote:
 Hi Josef,
 no there was nothing above. Here is another dmesg output.


 Hrm ok give this a try and hopefully this is it, still couldn't reproduce.
 Thanks,

 Josef

Well, I hate to say it, but the new patch doesn't seem to change much...

Regards,
Christian

[  123.507444] Btrfs loaded
[  202.683630] device fsid 2aa7531c-0e3c-4955-8542-6aed7ab8c1a2 devid
1 transid 4 /dev/sda
[  202.693704] btrfs: use lzo compression
[  202.697999] btrfs: enabling inode map caching
[  202.702989] btrfs: enabling auto defrag
[  202.707190] btrfs: disk space caching is enabled
[  202.712721] btrfs flagging fs with big metadata feature
[  207.839761] device fsid f81ff6a1-c333-4daf-989f-a28139f15f08 devid
1 transid 4 /dev/sdb
[  207.849681] btrfs: use lzo compression
[  207.853987] btrfs: enabling inode map caching
[  207.858970] btrfs: enabling auto defrag
[  207.863173] btrfs: disk space caching is enabled
[  207.868635] btrfs flagging fs with big metadata feature
[  210.857328] device fsid 9b905faa-f4fa-4626-9cae-2cd0287b30f7 devid
1 transid 4 /dev/sdc
[  210.867265] btrfs: use lzo compression
[  210.871560] btrfs: enabling inode map caching
[  210.876550] btrfs: enabling auto defrag
[  210.880757] btrfs: disk space caching is enabled
[  210.886228] btrfs flagging fs with big metadata feature
[  214.296287] device fsid f7990e4c-90b0-4691-9502-92b60538574a devid
1 transid 4 /dev/sdd
[  214.306510] btrfs: use lzo compression
[  214.310855] btrfs: enabling inode map caching
[  214.315905] btrfs: enabling auto defrag
[  214.320174] btrfs: disk space caching is enabled
[  214.325706] btrfs flagging fs with big metadata feature
[ 1337.937379] [ cut here ]
[ 1337.942526] kernel BUG at fs/btrfs/inode.c:2224!
[ 1337.947671] invalid opcode:  [#1] SMP
[ 1337.952255] CPU 5
[ 1337.954300] Modules linked in: btrfs zlib_deflate libcrc32c xfs
exportfs sunrpc bonding ipv6 sg pcspkr serio_raw iTCO_wdt
iTCO_vendor_support iomemory_vsl(PO) ixgbe dca mdio i7core_edac
edac_core hpsa squashfs [last unloaded: scsi_wait_scan]
[ 1337.978570]
[ 1337.980230] Pid: 6812, comm: ceph-osd Tainted: P   O
3.3.5-1.fits.1.el6.x86_64 #1 HP ProLiant DL180 G6
[ 1337.991592] RIP: 0010:[a035675c]  [a035675c]
btrfs_orphan_del+0x14c/0x150 [btrfs]
[ 1338.001897] RSP: 0018:8805e1171d38  EFLAGS: 00010282
[ 1338.007815] RAX: fffe RBX: 88061c3c8400 RCX: 00b37f48
[ 1338.015768] RDX: 00b37f47 RSI: 8805ec2a1cf0 RDI: ea0017b0a840
[ 1338.023724] RBP: 8805e1171d68 R08: 60f9d88028a0 R09: a033016a
[ 1338.031675] R10:  R11: 0004 R12: 8805de7f57a0
[ 1338.039629] R13: 0001 R14: 0001 R15: 8805ec2a5280
[ 1338.047584] FS:  7f4bffc6e700() GS:8806272a()
knlGS:
[ 1338.056600] CS:  0010 DS:  ES:  CR0: 80050033
[ 1338.063003] CR2: ff600400 CR3: 0005e34c3000 CR4: 06e0
[ 1338.070954] DR0:  DR1:  DR2: 
[ 1338.078909] DR3:  DR6: 0ff0 DR7: 0400
[ 1338.086865] Process ceph-osd (pid: 6812, threadinfo
8805e117, task 88060fa81940)
[ 1338.096268] Stack:
[ 1338.098509]  8805e1171d68 8805ec2a5280 88051235b920

[ 1338.106795]  88051235b920 0008 8805e1171e08
a036043c
[ 1338.115082]    
00011000
[ 1338.123367] Call Trace:
[ 1338.126111]  [a036043c] btrfs_truncate+0x5bc/0x640 [btrfs]
[ 1338.133213]  [a03605b6] btrfs_setattr+0xf6/0x1a0 [btrfs]
[ 1338.140105]  [811816fb] notify_change+0x18b/0x2b0
[ 1338.146320]  [81276541] ? selinux_inode_permission+0xd1/0x130
[ 1338.153699]  [81165f44] do_truncate+0x64/0xa0
[ 1338.159527]  [81172669] ? inode_permission+0x49/0x100
[ 1338.166128]  [81166197] sys_truncate+0x137/0x150
[ 1338.172244]  [8158b1e9] system_call_fastpath+0x16/0x1b
[ 1338.178936] Code: 89 e7 e8 88 7d fe ff eb 89 66 0f 1f 44 00 00 be
a4 08 00 00 48 c7 c7 59 49 3b a0 45 31 ed e8 5c 78 cf e0 45 31 f6 e9
30 ff ff ff 0f 0b eb fe 55 48 89 e5 48 83 ec 40 48 89 5d d8 4c 89 65
e0 4c
[ 1338.200623] RIP  [a035675c] btrfs_orphan_del+0x14c/0x150 [btrfs]
[ 1338.208317]  RSP 8805e1171d38
[ 1338.212681] ---[ end trace 86be14f0f863ea79 ]---

Re:

2012-05-17 Thread Josh Durgin


On 05/07/2012 05:54 PM, Tim Flavin wrote:

The new site is great!  I like the Ceph documentation, however I found
a couple of typos.  Is this the best place to address them?  (Some of the
apparent typos may be my not understanding what is going on.)



http://ceph.com/docs/master/config-cluster/ceph-conf/

The "Hardware Recommendations" link near the bottom of the page gives
a 404.  Did you want to point to
http://ceph.com/docs/master/install/hardware-recommendations/ ?


http://ceph.com/docs/master/config-ref/osd-config

For "osd client message size cap", the default value is 500 MB but
the description lists it as 200 MB.


http://ceph.com/docs/master/api/librbdpy/

The line of code "size = 4 * 1024 * 1024  # 4 GiB" appears to be
missing a "* 1024", and the next line is "rbd_inst.create('myimage', 4)"
when it probably should be "rbd_inst.create('myimage', size)".  This is
repeated several times.


Thanks for the notes - I've fixed these in the master branch.

All the docs are in git under the doc directory - if you find other
problems, feel free to send a patch or a github pull request. You can
even edit it in a browser on github if you like.
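
For reference, the size fix discussed above comes down to one more factor of
1024. The sketch below shows the corrected arithmetic; the create_image
helper is hypothetical, and it assumes the Python rbd binding's
RBD.create(ioctx, name, size) call, which takes an I/O context as its first
argument (the quoted report abbreviates the call).

```python
GIB = 1024 ** 3

# 4 * 1024 * 1024 is only 4 MiB; a third factor of 1024 is needed for 4 GiB.
size = 4 * GIB


def create_image(rbd_inst, ioctx, name="myimage", size_bytes=size):
    """Create an image of size_bytes bytes via an rbd.RBD()-like object.

    rbd_inst and ioctx are supplied by the caller (for example rbd.RBD()
    and an I/O context opened on a pool); nothing is imported here so the
    arithmetic can be checked without a cluster.
    """
    rbd_inst.create(ioctx, name, size_bytes)
    return size_bytes
```

Passing the computed size variable, rather than a literal 4, is what keeps
the comment and the created image in agreement.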

Ceph support for Direct I/O

2012-05-17 Thread Chang

Hi,

I was trying to test write performance of Ceph, using direct I/O.

$ dd if=/dev/zero of=/mnt/ceph/my.dat bs=1M count=1 oflag=direct

But dd is just stuck without doing any write.

Without oflag, write is done fine.

In the above example, I used Ceph kernel module 0.20, built on kernel 2.6.32.

I also tried Ceph fuse client to mount.
In this case, the above dd command gives me:

dd: opening `/mnt/ceph/my.dat': Invalid argument

Is it because Ceph does not support direct I/O?

I'm using ceph 0.46, installed using Debian release package.
There are three OSDs, 1 Mon and 1 MDS in the cluster.

Any help or clarification is appreciated.

Chang



Re: Designing a cluster guide

2012-05-17 Thread Gregory Farnum

Sorry this got left for so long...

On Thu, May 10, 2012 at 6:23 AM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
 Hi,

 the "Designing a cluster" guide
 http://wiki.ceph.com/wiki/Designing_a_cluster is pretty good but it
 still leaves some questions unanswered.

 It mentions, for example, "Fast CPU" for the mds system. What does fast
 mean? Just the speed of one core? Or is ceph designed to use multiple cores?
 Is multi core or more speed important?
Right now, it's primarily the speed of a single core. The MDS is
highly threaded, but doing most things requires grabbing a big lock.
"How fast" is a qualitative rather than quantitative assessment at this
point, though.

 The "Cluster Design Recommendations" section suggests separating all daemons
 onto dedicated machines. Is this also useful for the MON? As they're so
 lightweight, why not run them on the OSDs?
It depends on what your nodes look like, and what sort of cluster
you're running. The monitors are pretty lightweight, but they will add
*some* load. More important is their disk access patterns: they have
to do a lot of syncs. So if they're sharing a machine with some other
daemon you want them to have an independent disk and to be running a
new kernel and glibc so that they can use syncfs rather than sync. (The
only distribution I know for sure does this is Ubuntu 12.04.)

 Regarding the OSDs is it fine to use an SSD Raid 1 for the journal and
 perhaps 22x SATA Disks in a Raid 10 for the FS or is this quite absurd
 and you should go for 22x SSD Disks in a Raid 6?
You'll need to do your own failure calculations on this one, I'm
afraid. Just take note that you'll presumably be limited to the speed
of your journaling device here.
Given that Ceph is going to be doing its own replication, though, I
wouldn't want to add in another whole layer of replication with raid10;
do you really want to multiply your storage requirements by another
factor of two?
 Is it more useful to use a Raid 6 HW Controller or the btrfs raid?
I would use the hardware controller over btrfs raid for now; it allows
more flexibility in, e.g., switching to xfs. :)

 Use single socket Xeon for the OSDs or Dual Socket?
Dual socket servers will be overkill given the setup you're
describing. Our WAG rule of thumb is 1GHz of modern CPU per OSD
daemon. You might consider it if you decided you wanted to do an OSD
per disk instead (that's a more common configuration, but it requires
more CPU and RAM per disk and we don't know yet which is the better
choice).
-Greg
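
The two sizing points in this reply (roughly 1 GHz of CPU per OSD daemon, and
RAID mirroring multiplying storage on top of Ceph's own replication) can be
put into a small calculator. This is a rough sketch only: the function names
are made up for illustration, and treating "cores times clock speed" as the
available CPU budget is a simplifying assumption, not something stated in the
thread.

```python
GHZ_PER_OSD = 1.0  # rule of thumb quoted above: 1 GHz of modern CPU per OSD


def cpu_ghz_needed(osd_daemons):
    """CPU budget suggested by the rule of thumb for a number of OSD daemons."""
    return osd_daemons * GHZ_PER_OSD


def cpu_ghz_available(cores, clock_ghz):
    """Assumed aggregate CPU budget: cores multiplied by clock speed."""
    return cores * clock_ghz


def raw_capacity_needed(usable, ceph_replicas, raid_factor=1.0):
    """Raw disk capacity for a given usable capacity.

    raid_factor is the overhead of the RAID layer under each OSD
    (2.0 for a mirror-based layout such as RAID 10, 1.0 for no RAID).
    """
    return usable * ceph_replicas * raid_factor


if __name__ == "__main__":
    # One OSD on top of a single RAID volume versus one OSD per disk (22 disks).
    print(cpu_ghz_needed(1), cpu_ghz_needed(22))
    # 10 TB usable, 2 Ceph replicas, with and without RAID 10 underneath.
    print(raw_capacity_needed(10, 2, 2.0), raw_capacity_needed(10, 2))
```

With one OSD per disk the CPU requirement scales with the disk count, which
is why that layout is where a second socket might start to matter.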

Re: Ceph support for Direct I/O

2012-05-17 Thread Sage Weil

On Thu, 17 May 2012, Chang wrote:
 Hi,
 
 I was trying to test write performance of Ceph, using direct I/O.
 
 $ dd if=/dev/zero of=/mnt/ceph/my.dat bs=1M count=1 oflag=direct
 
 But dd is just stuck without doing any write.
 
 Without oflag, write is done fine.
 
 In the above example, I used Ceph kernel module 0.20, built on kernel 2.6.32.

That's an extremely old version of ceph.  I don't remember when direct i/o 
became well tested, but it doesn't surprise me that that version doesn't 
work.

I would suggest anything after 3.0...

sage

 
 I also tried Ceph fuse client to mount.
 In this case, the above dd command gives me:
 
 dd: opening `/mnt/ceph/my.dat': Invalid argument
 
 Is it because Ceph does not support direct I/O?
 
 I'm using ceph 0.46, installed using Debian release package.
 There are three OSDs, 1 Mon and 1 MDS in the cluster.
 
 Any help or clarification is appreciated.
 
 Chang
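
For anyone wanting to reproduce the dd test programmatically, the following
is a minimal sketch of a direct I/O write on Linux. It assumes a 4096-byte
alignment requirement (the real requirement depends on the filesystem and
device) and uses an anonymous mmap to get a page-aligned buffer; the function
name is hypothetical. On a filesystem or client that rejects O_DIRECT, the
open or write fails with an error such as EINVAL, which matches the
"Invalid argument" message from dd quoted above.

```python
import errno
import mmap
import os

ALIGN = 4096  # assumed alignment for buffer address and length


def try_direct_write(path, nbytes=1024 * 1024):
    """Write nbytes of zeros with O_DIRECT.

    Returns (bytes_written, None) on success or (0, errno_value) on failure.
    """
    if nbytes <= 0 or nbytes % ALIGN:
        raise ValueError("nbytes must be a positive multiple of ALIGN")
    o_direct = getattr(os, "O_DIRECT", None)
    if o_direct is None:  # platform without O_DIRECT
        return 0, errno.ENOTSUP
    buf = mmap.mmap(-1, nbytes)  # anonymous mapping: page-aligned, zero-filled
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | o_direct, 0o644)
        try:
            return os.write(fd, buf), None
        finally:
            os.close(fd)
    except OSError as exc:
        return 0, exc.errno
    finally:
        buf.close()
```

Calling try_direct_write("/mnt/ceph/my.dat") corresponds to the dd command
with bs=1M count=1 oflag=direct; a returned errno of EINVAL suggests the
mount does not accept direct I/O.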
 
 

Re: MDS crash, wont startup again

2012-05-17 Thread Josh Durgin


On 05/16/2012 01:11 AM, Felix Feinhals wrote:

Hi again,

anything on this Problem? Seems that the only choice for me is to
reinitialize the whole cephfs (mkcephfs...)
:(


Hi Felix, it looks like your first mail never reached the list.


2012/5/10 Felix Feinhals f...@turtle-entertainment.de:

Hi List,

we installed a ceph cluster with ceph version 0.46.
3 OSDs, 3 MONs and 3 MDSs.

After copying a bunch of files to a ceph-fuse mount, all MDS daemons
crashed and now I can't bring them back online.
I already tried to restart the daemons in a different order and also
removed one OSD; nothing really happened, only now we have pgs with
active+remapped, which I think is normal.
Any hints?


Are all three MDS active? At this point, more than one active MDS is
likely to crash. You can have one active and others standby.

If you've got only one active, what was the backtrace of the crash?
It'll be at the end of the MDS log (by default in /var/log/ceph).

Re: OSD Journal Failure Behavior

2012-05-17 Thread Sam Just

Loss of the journal will kill any osds using that journal.  With
btrfs, we theoretically could run without a journal, but it would be
impractically slow.  In the case of parallel journaling, loss of the
journal can result in the loss of updates which have been reported to
the client as committed, but replacing the journal device will be
enough to allow recovery from the replicas to occur.  In the case of
write ahead, loss of the journal can additionally result in
inconsistent data on the osd.  Thus, with write ahead, you should
recreate the osd and allow recovery to occur.

The --mkjournal option is used to recreate the journal on a fresh device.

This information does need to find its way into our documentation,
thanks for the heads up!

-Sam
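
The recovery paths described above can be summarised as a small lookup. This
is a sketch for illustration only: the mode names and the wording of each
step paraphrase the explanation in this message, and the function name is
hypothetical rather than part of any Ceph tool.

```python
def journal_loss_recovery(journal_mode):
    """Return the recovery steps after losing an OSD journal, per the text above."""
    steps = {
        # Parallel journaling (btrfs): committed updates may be lost, but
        # the on-disk data stays consistent.
        "parallel": [
            "replace the journal device",
            "recreate the journal with the --mkjournal option",
            "restart the osd and let recovery from the replicas run",
        ],
        # Write-ahead journaling: the osd's data may also be inconsistent.
        "writeahead": [
            "recreate the osd",
            "let recovery from the replicas repopulate it",
        ],
    }
    try:
        return steps[journal_mode]
    except KeyError:
        raise ValueError("unknown journal mode: %r" % (journal_mode,))
```

Either way the osds using the lost journal go down first; the difference is
only in how much has to be rebuilt afterwards.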

On Fri, May 11, 2012 at 11:31 AM, Calvin Morrow calvin.mor...@gmail.com wrote:
 The Ceph Wiki (http://ceph.com/wiki/OSD_journal) does a pretty good
 job explaining the purpose of the journal and various modes available.
  What isn't clear is what happens during the failure of a journal.

 With the use of btrfs enabling parallel journaling, it sounds like
 failure of a journal device would still enable OSD writes to occur
 albeit at a slower pace.  In writeahead mode, failure of the journal
 seems like it would take all OSDs using that device for journaling
 offline.

 Is this a correct assessment?  If the journal is a single point of
 failure to a system with potentially multiple OSDs ... it might be
 worth mentioning in one of the building a cluster or hardware pages
 about journal redundancy.

 Also, it would be nice to know what steps are required to replace a
 journal.  (I think this is just an update of the cluster.conf with the
 new journal device and a restart of the OSD process, correct?)

 Best regards,
 Calvin

Re: reproductible kernel oops with kernel 3.2 inside kvm

2012-05-17 Thread Josh Durgin


Hi Yann,

Sorry for the late response.

On 05/03/2012 07:05 AM, Yann Dupont wrote:

Hello. I've been stress testing ceph for some time now, with quite good
results. I really like ceph and will probably use it in some
pre-production services.

Anyway I've seen some bugs.

One of them is instability if the kernel is running inside KVM, leading
to a very fast (and reproducible) kernel oops. On bare metal this
particular oops doesn't happen.

The kernel oops itself involves ceph, but it could be a real bug in kvm too.

The host machine is running 3.2.2
kvm is quite old (0.14)
guest OS is ubuntu 12.04 with its standard kernel. Retried with a custom
3.2 kernel with the same problem.


I'm not sure how many people are using the kernel client within kvm,
but I haven't seen this problem before. Since it's in d_prune, it's
probably Ceph related, but perhaps kvm makes a race condition trigger
more often in your environment.

I filed http://tracker.newdream.net/issues/2444 to track this.

Re: Ceph kernel client - kernel craches

2012-05-17 Thread Josh Durgin


Sorry your mail fell through the cracks before. I filed
http://tracker.newdream.net/issues/2445 to track the ceph-related
crashes. Alex, do you think the first crash is related to ceph at all?

Josh

On 05/10/2012 11:00 AM, Giorgos Kappes wrote:

Sorry for my late response. I reproduced the above bug with the Linux
kernel 3.3.4 and without using XEN:

uname -a
Linux node33 3.3.4 #1 SMP Wed May 9 13:00:07 EEST 2012 x86_64 GNU/Linux

The trace is shown below:


[  763.984023] kernel tried to execute NX-protected page - exploit
attempt? (uid: 0)
[  763.984177] BUG: unable to handle kernel paging request at 880037bd0800
[  763.984402] IP: [880037bd0800] 0x880037bd07ff
[  763.984568] PGD 1806063 PUD 180a063 PMD 800037a001e3
[  763.984845] Oops: 0011 [#1] SMP
[  763.985058] CPU 3
[  763.985124] Modules linked in: cbc netconsole loop snd_pcm
snd_timer snd soundcore snd_page_alloc processor tpm_tis i5400_edac
tpm edac_core tpm_bios evdev pcspkr i5k_amb rng_core thermal_sys
button shpchp pci_hotplug sd_mod crc_t10dif usbhid hid ide_cd_mod
cdrom ata_generic uhci_hcd ehci_hcd ata_piix libata piix ide_core
usbcore usb_common tg3 libphy mptsas mptscsih mptbase
scsi_transport_sas scsi_mod [last unloaded: scsi_wait_scan]
[  763.988002]
[  763.988002] Pid: 0, comm: swapper/3 Not tainted 3.3.4 #1 HP ProLiant DL160 G5
[  763.988002] RIP: 0010:[880037bd0800]  [880037bd0800]
0x880037bd07ff
[  763.988002] RSP: 0018:8800bfcc3e78  EFLAGS: 00010292
[  763.988002] RAX: 8800b97745b0 RBX: 8800bfcce770 RCX: 880037bd0800
[  763.988002] RDX: 880037bd1600 RSI: b9b6a040 RDI: 880037bd1600
[  763.988002] RBP: 81820080 R08: 8800b9dd0b00 R09: 00018020001c
[  763.988002] R10: 8020001c R11: 816075c0 R12: 8800bfcce7a0
[  763.988002] R13: 8800b97745b0 R14: 0003 R15: 000a
[  763.988002] FS:  () GS:8800bfcc()
knlGS:
[  763.988002] CS:  0010 DS:  ES:  CR0: 8005003b
[  763.988002] CR2: 880037bd0800 CR3: b895b000 CR4: 06e0
[  763.988002] DR0:  DR1:  DR2: 
[  763.988002] DR3:  DR6: 0ff0 DR7: 0400
[  763.988002] Process swapper/3 (pid: 0, threadinfo 8800bbae,
task 8800bbad8000)
[  763.988002] Stack:
[  763.988002]  8109b44d 8800bbacd820 8800b97745b0
8800bbae0010
[  763.988002]  8800bbad8000 8800bfcc3ea0 0048
8800bbae1fd8
[  763.988002]  0100 0001 0009
8800bbae1fd8
[  763.988002] Call Trace:
[  763.988002]IRQ
[  763.988002]  [8109b44d] ? __rcu_process_callbacks+0x1e9/0x335
[  763.988002]  [8109b8fb] ? rcu_process_callbacks+0x2c/0x56
[  763.988002]  [8103e3b1] ? __do_softirq+0xc4/0x1a0
[  763.988002]  [8102515b] ? lapic_next_event+0x18/0x1d
[  763.988002]  [815d3b1c] ? call_softirq+0x1c/0x30
[  763.988002]  [8100fba3] ? do_softirq+0x3f/0x79
[  763.988002]  [8103e186] ? irq_exit+0x44/0xb1
[  763.988002]  [81025c61] ? smp_apic_timer_interrupt+0x85/0x93
[  763.988002]  [815d311e] ? apic_timer_interrupt+0x6e/0x80
[  763.988002]EOI
[  763.988002]  [810145e1] ? native_sched_clock+0x28/0x33
[  763.988002]  [810152f6] ? mwait_idle+0x8c/0xbc
[  763.988002]  [810152ae] ? mwait_idle+0x44/0xbc
[  763.988002]  [8100de94] ? cpu_idle+0xb9/0xf7
[  763.988002]  [815c43c6] ? start_secondary+0x270/0x275
[  763.988002] Code: 00 00 00 00 04 8a b8 00 88 ff ff 00 04 8a b8 00
88 ff ff 00 03 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00   16 bd 37 00 88 ff ff 40 ab cd bf 00 88 ff ff 20 15 42
b9 00
[  763.988002] RIP  [880037bd0800] 0x880037bd07ff
[  763.988002]  RSP8800bfcc3e78
[  763.988002] CR2: 880037bd0800
[  763.988002] ---[ end trace 614049dc850267ac ]---
[  763.988002] Kernel panic - not syncing: Fatal exception in interrupt
[  763.997833] [ cut here ]
[  763.997936] WARNING: at arch/x86/kernel/smp.c:120
update_process_times+0x57/0x63()
[  763.998072] Hardware name: ProLiant DL160 G5
[  763.998171] Modules linked in: cbc netconsole loop snd_pcm
snd_timer snd soundcore snd_page_alloc processor tpm_tis i5400_edac
tpm edac_core tpm_bios evdev pcspkr i5k_amb rng_core thermal_sys
button shpchp pci_hotplug sd_mod crc_t10dif usbhid hid ide_cd_mod
cdrom ata_generic uhci_hcd ehci_hcd ata_piix libata piix ide_core
usbcore usb_common tg3 libphy mptsas mptscsih mptbase
scsi_transport_sas scsi_mod [last unloaded: scsi_wait_scan]
[  764.001205] Pid: 0, comm: swapper/3 Tainted: G  D  3.3.4 #1
[  764.001311] Call Trace:
[  764.001404]IRQ[81038bb0] ? warn_slowpath_common+0x78/0x8c
[  764.001573]  [81044937] ? update_process_times+0x57/0x63
[

