Re: how can I achieve HA with ceph?

2012-01-05 Thread Karoly Horvath
Hi,

back from holiday.

I did a successful power-unplug test now, but the FS was unavailable
for 16 minutes, which is clearly wrong...

I have the log files, but the MDS log is 1.2 gigabytes; if you let me
know which lines to filter / filter out, I will upload it somewhere...

-- 
Karoly Horvath


On Fri, Dec 23, 2011 at 12:00 AM, Gregory Farnum
gregory.far...@dreamhost.com wrote:
 On Wed, Dec 21, 2011 at 8:43 AM, Karoly Horvath rhsw...@gmail.com wrote:
 On Wed, Dec 21, 2011 at 4:13 PM, Gregory Farnum
 By client I assume you mean the kernel driver... the FS is frozen, so
 I cannot unmount (cannot even `shutdown`)... how can I force the client
 to reconnect?

 Try a lazy force unmount:
 umount -lf ceph_mnt_point/
 And then mount again.

 wow, never heard about this, thanks. :)
 will report with the next mail

 In the meantime I did one test, killing mds+osd+mon on beta,
 it's jammed in '{0=alpha=up:replay}', after 45 minutes I shut it down...
 I attached the logs.

 Oh, this is very odd! The MDS goes to sleep while it waits for an
 up-to-date OSDMap, but it never seems to get woken up even though I
 see the message sending in the OSDMap.

 So let's try this one more time, but this time also add "debug
 objecter = 20" to the MDS config... Those logs will include everything
 I need, or nothing will, promise! :)
 -Greg
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: questions about radosgw

2012-01-05 Thread Wido den Hollander

Hi,

On 01/05/2012 12:48 PM, huang jun wrote:

hi, all
I'm using s3+radosgw, and there are some questions that confuse me very much:
first) An object's size is 400MB; the whole object will be stored
in OSDs as one big single object, not striped into 4MB objects. So
how can we get workload balance if we want to store both big objects
and small ones?


That is correct; neither RADOS nor the RADOS gateway stripes objects.

You will get workload balance if you have an (almost) even spread of
workload over your different objects.


The devs might shed some more light on this.
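One client-side workaround is to stripe large objects yourself before uploading, so a 400MB upload becomes many small objects that CRUSH can spread across OSDs. This is only a sketch of the idea (the part naming and reassembly scheme are assumptions, not radosgw behavior):

```python
def chunk_ranges(total_size, chunk_size=4 * 1024 * 1024):
    """Yield (offset, length) pairs covering an object of total_size bytes."""
    offset = 0
    while offset < total_size:
        length = min(chunk_size, total_size - offset)
        yield offset, length
        offset += length

# A 400MB object becomes 100 parts of 4MB each; uploading each part
# under its own key (e.g. "myobj.part-0042", a hypothetical naming
# scheme) spreads the load, instead of one big object on a single PG.
parts = list(chunk_ranges(400 * 1024 * 1024))
print(len(parts))   # 100
print(parts[0])     # (0, 4194304)
```

The client then has to reassemble the parts on read, which is the price of doing the striping above the gateway.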


second) Can we change the number of pgs in a pool that was created by S3
clients, or only by using the rados commands?


As far as I know you can't do this through the S3 clients, but only with
the rados commands. I would also find it rather weird if you could do
this as a client.
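Since changing pg counts after the fact is awkward, it helps to pick a sensible pg_num at pool-creation time. A commonly cited rough heuristic is on the order of 100 PGs per OSD divided by the replica count, rounded up to a power of two; treat the constant here as a tunable assumption, not gospel:

```python
def suggested_pg_num(num_osds, replicas=2, pgs_per_osd=100):
    """Rough pg_num suggestion: ~pgs_per_osd PGs per OSD, shared
    across replicas, rounded up to the next power of two."""
    target = max(1, num_osds * pgs_per_osd // replicas)
    pg_num = 1
    while pg_num < target:
        pg_num *= 2
    return pg_num

print(suggested_pg_num(10))  # 10 OSDs, 2x replication -> 512
```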


Wido



thanks





ceph: ensure prealloc_blob is in place when removing xattr

2012-01-05 Thread Alex Elder
In __ceph_build_xattrs_blob(), if a ceph inode's extended attributes
are marked dirty, all attributes recorded in its rb_tree index are
formatted into a blob buffer.  The target buffer is recorded in
ceph_inode->i_xattrs.prealloc_blob, and it is expected to exist and
be of sufficient size to hold the attributes.

The extended attributes are marked dirty in two cases: when a new
attribute is added to the inode; or when one is removed.  In the
former case work is done to ensure the prealloc_blob buffer is
properly set up, but in the latter it is not.

Change the logic in ceph_removexattr() so it matches what is
done in ceph_setxattr().  Note that this is done in a way that
keeps the two blocks of code nearly identical, in anticipation
of a subsequent patch that encapsulates some of this logic into
one or more helper routines.

Signed-off-by: Alex Elder el...@dreamhost.com

---
 fs/ceph/xattr.c |   22 ++
 1 file changed, 22 insertions(+)

Index: b/fs/ceph/xattr.c
===================================================================
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -818,6 +818,7 @@ int ceph_removexattr(struct dentry *dent
struct ceph_vxattr_cb *vxattrs = ceph_inode_vxattrs(inode);
int issued;
int err;
+   int required_blob_size;
int dirty;
 
if (ceph_snap(inode) != CEPH_NOSNAP)
@@ -833,14 +834,34 @@ int ceph_removexattr(struct dentry *dent
return -EOPNOTSUPP;
}
 
+   err = -ENOMEM;
	spin_lock(&ci->i_ceph_lock);
	__build_xattrs(inode);
+retry:
	issued = __ceph_caps_issued(ci, NULL);
	dout("removexattr %p issued %s\n", inode, ceph_cap_string(issued));

	if (!(issued & CEPH_CAP_XATTR_EXCL))
		goto do_sync;

+	required_blob_size = __get_required_blob_size(ci, 0, 0);
+
+	if (!ci->i_xattrs.prealloc_blob ||
+	    required_blob_size > ci->i_xattrs.prealloc_blob->alloc_len) {
+		struct ceph_buffer *blob;
+
+		spin_unlock(&ci->i_ceph_lock);
+		dout(" preaallocating new blob size=%d\n", required_blob_size);
+		blob = ceph_buffer_new(required_blob_size, GFP_NOFS);
+		if (!blob)
+			goto out;
+		spin_lock(&ci->i_ceph_lock);
+		if (ci->i_xattrs.prealloc_blob)
+			ceph_buffer_put(ci->i_xattrs.prealloc_blob);
+		ci->i_xattrs.prealloc_blob = blob;
+		goto retry;
+	}
+
err = __remove_xattr_by_name(ceph_inode(inode), name);
dirty = __ceph_mark_dirty_caps(ci, CEPH_CAP_XATTR_EXCL);
ci-i_xattrs.dirty = true;
@@ -853,6 +874,7 @@ int ceph_removexattr(struct dentry *dent
 do_sync:
	spin_unlock(&ci->i_ceph_lock);
err = ceph_send_removexattr(dentry, name);
+out:
return err;
 }
 




[PATCH 5/6] net: add paged frag destructor support to kernel_sendpage.

2012-01-05 Thread Ian Campbell
This requires adding a new argument to various sendpage hooks up and down the
stack. At the moment this parameter is always NULL.

Signed-off-by: Ian Campbell ian.campb...@citrix.com
Cc: David S. Miller da...@davemloft.net
Cc: Alexey Kuznetsov kuz...@ms2.inr.ac.ru
Cc: Pekka Savola (ipv6) pek...@netcore.fi
Cc: James Morris jmor...@namei.org
Cc: Hideaki YOSHIFUJI yoshf...@linux-ipv6.org
Cc: Patrick McHardy ka...@trash.net
Cc: Trond Myklebust trond.mykleb...@netapp.com
Cc: Greg Kroah-Hartman gre...@suse.de
Cc: drbd-u...@lists.linbit.com
Cc: de...@driverdev.osuosl.org
Cc: cluster-de...@redhat.com
Cc: ocfs2-de...@oss.oracle.com
Cc: net...@vger.kernel.org
Cc: ceph-devel@vger.kernel.org
Cc: rds-de...@oss.oracle.com
Cc: linux-...@vger.kernel.org
---
 drivers/block/drbd/drbd_main.c   |1 +
 drivers/scsi/iscsi_tcp.c |4 ++--
 drivers/scsi/iscsi_tcp.h |3 ++-
 drivers/staging/pohmelfs/trans.c |3 ++-
 drivers/target/iscsi/iscsi_target_util.c |3 ++-
 fs/dlm/lowcomms.c|4 ++--
 fs/ocfs2/cluster/tcp.c   |1 +
 include/linux/net.h  |6 +-
 include/net/inet_common.h|4 +++-
 include/net/ip.h |4 +++-
 include/net/sock.h   |8 +---
 include/net/tcp.h|4 +++-
 net/ceph/messenger.c |2 +-
 net/core/sock.c  |6 +-
 net/ipv4/af_inet.c   |9 ++---
 net/ipv4/ip_output.c |6 --
 net/ipv4/tcp.c   |   24 
 net/ipv4/udp.c   |   11 ++-
 net/ipv4/udp_impl.h  |5 +++--
 net/rds/tcp_send.c   |1 +
 net/socket.c |   11 +++
 net/sunrpc/svcsock.c |6 +++---
 net/sunrpc/xprtsock.c|2 +-
 23 files changed, 84 insertions(+), 44 deletions(-)

diff --git a/drivers/block/drbd/drbd_main.c b/drivers/block/drbd/drbd_main.c
index 0358e55..49c7346 100644
--- a/drivers/block/drbd/drbd_main.c
+++ b/drivers/block/drbd/drbd_main.c
@@ -2584,6 +2584,7 @@ static int _drbd_send_page(struct drbd_conf *mdev, struct page *page,
set_fs(KERNEL_DS);
do {
		sent = mdev->data.socket->ops->sendpage(mdev->data.socket, page,
+   NULL,
offset, len,
msg_flags);
if (sent == -EAGAIN) {
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 7c34d8e..3884ae1 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -284,8 +284,8 @@ static int iscsi_sw_tcp_xmit_segment(struct iscsi_tcp_conn *tcp_conn,
	if (!segment->data) {
		sg = segment->sg;
		offset += segment->sg_offset + sg->offset;
-		r = tcp_sw_conn->sendpage(sk, sg_page(sg), offset,
-					  copy, flags);
+		r = tcp_sw_conn->sendpage(sk, sg_page(sg), NULL,
+					  offset, copy, flags);
} else {
struct msghdr msg = { .msg_flags = flags };
struct kvec iov = {
diff --git a/drivers/scsi/iscsi_tcp.h b/drivers/scsi/iscsi_tcp.h
index 666fe09..1e23265 100644
--- a/drivers/scsi/iscsi_tcp.h
+++ b/drivers/scsi/iscsi_tcp.h
@@ -52,7 +52,8 @@ struct iscsi_sw_tcp_conn {
	uint32_t		sendpage_failures_cnt;
	uint32_t		discontiguous_hdr_cnt;
 
-	ssize_t (*sendpage)(struct socket *, struct page *, int, size_t, int);
+	ssize_t (*sendpage)(struct socket *, struct page *,
+			    struct skb_frag_destructor *, int, size_t, int);
 };
 
 struct iscsi_sw_tcp_host {
diff --git a/drivers/staging/pohmelfs/trans.c b/drivers/staging/pohmelfs/trans.c
index 06c1a74..96a7921 100644
--- a/drivers/staging/pohmelfs/trans.c
+++ b/drivers/staging/pohmelfs/trans.c
@@ -104,7 +104,8 @@ static int netfs_trans_send_pages(struct netfs_trans *t, struct netfs_state *st)
msg.msg_flags = MSG_WAITALL | (attached_pages == 1 ? 0 :
MSG_MORE);
 
-	err = kernel_sendpage(st->socket, page, 0, size, msg.msg_flags);
+	err = kernel_sendpage(st->socket, page, NULL,
+			      0, size, msg.msg_flags);
	if (err <= 0) {
		printk("%s: %d/%d failed to send transaction page: t: "
			"%p, gen: %u, size: %u, err: %d.\n",
			__func__, i, t->page_num, t, t->gen,
			size, err);
diff --git a/drivers/target/iscsi/iscsi_target_util.c 

Re: how can I achieve HA with ceph?

2012-01-05 Thread Gregory Farnum
On Thu, Jan 5, 2012 at 5:24 AM, Karoly Horvath rhsw...@gmail.com wrote:
 Hi,

 back from holiday.

 I did a successful power-unplug test now, but the FS was unavailable
 for 16 minutes, which is clearly wrong...

 I have the log files, but the MDS log is 1.2 gigabytes; if you let me
 know which lines to filter / filter out, I will upload it somewhere...

 --
 Karoly Horvath

Assuming it's the same error as last time, the log will have a line
that contains "waiting for osdmap n (which blacklists prior
instance)", where n is an epoch number.

Then at some later point there will be a line that looks something
like the following:
2011-12-21 13:45:17.594746 7f4885307700 -- xxx.xxx.xxx.31:6800/4438
<== mon.2 xxx.xxx.xxx.35:6789/0 9 ==== osd_map(y..z src has 1..495) v2
==== 748+0+0 (656995691 0 0) 0x1637400 con 0x163c000
Where y and z are an interval which contains n. (In the previous log,
and probably here too, y=z=n.) I'm going to be interested in those two
lines and the stuff following when the osdmap arrives. Probably I will
only care about objecter lines, but it might be all of them...try
trimming off the minute following that osdmap line; it'll probably
contain more than I care about. :)
-Greg
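The filtering Greg describes can be scripted with standard tools. A self-contained sketch against a synthetic log follows; the file names, the sample log lines, and the 200-line cap are all arbitrary assumptions standing in for "the minute following that osdmap line":

```shell
# Build a tiny synthetic MDS log so the sketch runs anywhere.
cat > mds.log <<'EOF'
2011-12-21 13:44:01 mds noise before the interesting part
2011-12-21 13:45:01 mds0 waiting for osdmap 495 (which blacklists prior instance)
2011-12-21 13:45:17 -- x.x.x.31:6800/4438 <== mon.2 x.x.x.35:6789/0 9 ==== osd_map(495..495 src has 1..495) v2
2011-12-21 13:45:18 objecter handle_osd_map got epoch 495
2011-12-21 13:45:19 more objecter detail
EOF

# 1) Find the epoch the MDS is stuck waiting for.
grep 'waiting for osdmap' mds.log

# 2) Keep everything from the osd_map arrival onward, capped at a
#    fixed window, so only the relevant tail gets uploaded.
awk '/osd_map\(/ { found = 1 } found' mds.log | head -n 200 > mds-trimmed.log
```

On a real 1.2GB log the same grep/awk pipeline applies unchanged; only the window size would need tuning.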


 On Fri, Dec 23, 2011 at 12:00 AM, Gregory Farnum
 gregory.far...@dreamhost.com wrote:
 On Wed, Dec 21, 2011 at 8:43 AM, Karoly Horvath rhsw...@gmail.com wrote:
 On Wed, Dec 21, 2011 at 4:13 PM, Gregory Farnum
 By client I assume you mean the kernel driver... the FS is frozen, so
 I cannot unmount (cannot even `shutdown`)... how can I force the client
 to reconnect?

 Try a lazy force unmount:
 umount -lf ceph_mnt_point/
 And then mount again.

 wow, never heard about this, thanks. :)
 will report with the next mail

 In the meantime I did one test, killing mds+osd+mon on beta,
 it's jammed in '{0=alpha=up:replay}', after 45 minutes I shut it down...
 I attached the logs.

 Oh, this is very odd! The MDS goes to sleep while it waits for an
 up-to-date OSDMap, but it never seems to get woken up even though I
 see the message sending in the OSDMap.

 So let's try this one more time, but this time also add "debug
 objecter = 20" to the MDS config... Those logs will include everything
 I need, or nothing will, promise! :)
 -Greg


Re: [PATCH 5/6] net: add paged frag destructor support to kernel_sendpage.

2012-01-05 Thread David Miller
From: Ian Campbell ian.campb...@citrix.com
Date: Thu, 5 Jan 2012 17:13:43 +

 -static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
 +static ssize_t do_tcp_sendpages(struct sock *sk,
 +				struct page **pages,
 +				struct skb_frag_destructor **destructors,
 +				int poffset,
 				size_t psize, int flags)
  {
 	struct tcp_sock *tp = tcp_sk(sk);

An array of destructors is madness, and the one call site that specifies this
passes an address of a single entry.

This also would never even have to occur if you put the destructor inside of
struct page instead.

Finally, except for the skb_shared_info() layout optimization in patch #1,
which I already applied, this stuff is not baked enough for the 3.3 merge window.


Re: [PATCH 5/6] net: add paged frag destructor support to kernel_sendpage.

2012-01-05 Thread Ian Campbell
On Thu, 2012-01-05 at 19:15 +, David Miller wrote:
 From: Ian Campbell ian.campb...@citrix.com
 Date: Thu, 5 Jan 2012 17:13:43 +
 
  -static ssize_t do_tcp_sendpages(struct sock *sk, struct page **pages, int poffset,
  +static ssize_t do_tcp_sendpages(struct sock *sk,
  +				struct page **pages,
  +				struct skb_frag_destructor **destructors,
  +				int poffset,
  				size_t psize, int flags)
   {
  	struct tcp_sock *tp = tcp_sk(sk);
 
 An array of destructors is madness, and the one call site that specifies this
 passes an address of a single entry.

I figured it was easy enough to accommodate the multiple-destructor case,
but you are right that it is overkill given the current (and, realistically,
expected) usage; I'll change that for the next round.

(That's assuming we don't end up with some scheme where the struct page
* is in the destructor struct, like I was investigating previously to
alleviate the frag size overhead. I guess this illustrates nicely why
that approach got ugly: these arrays propagate all the way up the call
chain if you do that.)

 This also would never even have to occur if you put the destructor inside of
 struct page instead.
 
 Finally, except for the skb_shared_info() layout optimization in patch #1,
 which I already applied, this stuff is not baked enough for the 3.3 merge
 window.

Sure thing, I should have made it clear in my intro mail that I was
aiming for 3.4.

Thanks,
Ian



[PATCH 0/2] Add resource agents to debian build, trivial CP error

2012-01-05 Thread Florian Haas
Hi,

please consider two follow-up patches to the OCF resource agents: the
first adds them to the Debian build as a separate package,
ceph-resource-agents, which depends on resource-agents; the second
fixes a trivial (and embarrassing, though harmless) cut-and-paste
error. Thanks!

Cheers,
Florian



[PATCH 1/2] debian: build ceph-resource-agents

2012-01-05 Thread Florian Haas
---
 debian/ceph-resource-agents.install |1 +
 debian/control  |   13 +
 debian/rules|2 ++
 3 files changed, 16 insertions(+), 0 deletions(-)
 create mode 100644 debian/ceph-resource-agents.install

diff --git a/debian/ceph-resource-agents.install 
b/debian/ceph-resource-agents.install
new file mode 100644
index 0000000..30843f6
--- /dev/null
+++ b/debian/ceph-resource-agents.install
@@ -0,0 +1 @@
+usr/lib/ocf/resource.d/ceph/*
diff --git a/debian/control b/debian/control
index e8c4d30..0f57ad3 100644
--- a/debian/control
+++ b/debian/control
@@ -112,6 +112,19 @@ Description: debugging symbols for ceph-common
  .
  This package contains the debugging symbols for ceph-common.
 
+Package: ceph-resource-agents
+Architecture: linux-any
+Recommends: pacemaker
+Priority: extra
+Depends: ceph (= ${binary:Version}), ${misc:Depends}, resource-agents
+Description: OCF-compliant resource agents for Ceph
+ Ceph is a distributed storage and network file system designed to provide
+ excellent performance, reliability, and scalability.
+ .
+ This package contains the resource agents (RAs) which integrate
+ Ceph with OCF-compliant cluster resource managers,
+ such as Pacemaker.
+
 Package: librados2
 Conflicts: librados, librados1
 Replaces: librados, librados1
diff --git a/debian/rules b/debian/rules
index 4f3fe62..0bc594a 100755
--- a/debian/rules
+++ b/debian/rules
@@ -20,6 +20,8 @@ endif
 
 export DEB_HOST_ARCH  ?= $(shell dpkg-architecture -qDEB_HOST_ARCH)
 
+extraopts += --with-ocf
+
 ifeq ($(DEB_HOST_ARCH), armel)
   # armel supports ARMv4t or above instructions sets.
   # libatomic-ops is only usable with Ceph for ARMv6 or above.
-- 
1.7.5.4
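For context on how the packaged agents would be consumed: a usage sketch only, since the resource-agent names below are assumptions inferred from the install path usr/lib/ocf/resource.d/ceph/* in this patch (check the shipped agents for the real names), a Pacemaker cluster configured through the crm shell might clone the Ceph daemons like this:

```
primitive p_ceph_mon ocf:ceph:mon op monitor interval=30s
primitive p_ceph_osd ocf:ceph:osd op monitor interval=30s
primitive p_ceph_mds ocf:ceph:mds op monitor interval=30s
clone cl_ceph_mon p_ceph_mon
clone cl_ceph_osd p_ceph_osd
clone cl_ceph_mds p_ceph_mds
```

The clone resources let Pacemaker start one instance of each daemon per node, which is the usual pattern for daemons that run cluster-wide.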
