Re: debian updates

2012-06-11 Thread Stefan Priebe - Profihost AG

Hi Sage,

I'm using Debian, but with a custom build. Should I use the debian branch 
to build, or the stable branch?


Thanks,
Stefan

On 12.06.2012 04:41, Sage Weil wrote:

Hi Laszlo,

Can you take a look at the last 4 commits of

https://github.com/ceph/ceph/commits/debian

and let me know if they address the issues you mentioned?

Thanks-
sage




Re: [PATCH 09/13] libceph: start tracking connection socket state

2012-06-11 Thread Yan, Zheng
On 06/12/2012 01:00 PM, Sage Weil wrote:
> Yep.  This was just fixed yesterday, in the testing-next branch, by 
> 'libceph: transition socket state prior to actual connect'.
> 
> Are you still hitting the bio null deref?
> 
No,

Cheers
Yan, Zheng




Re: [PATCH 09/13] libceph: start tracking connection socket state

2012-06-11 Thread Sage Weil
On Tue, 12 Jun 2012, Yan, Zheng wrote:
> On Thu, May 31, 2012 at 3:35 AM, Alex Elder  wrote:
> > Start explicitly keeping track of the state of a ceph connection's
> > socket, separate from the state of the connection itself.  Create
> > placeholder functions to encapsulate the state transitions.
> >
> >    --------
> >    | NEW* |  transient initial state
> >    --------
> >        | con_sock_state_init()
> >        v
> >    ----------
> >    | CLOSED |  initialized, but no socket (and no
> >    ----------  TCP connection)
> >     ^      \
> >     |       \ con_sock_state_connecting()
> >     |        ----------------------
> >     |                              \
> >     + con_sock_state_closed()       \
> >     |\                               \
> >     | \                               \
> >     |  -----------                     \
> >     |  | CLOSING |  socket event;       \
> >     |  -----------  await close          \
> >     |       ^                            |
> >     |       |                            |
> >     |       + con_sock_state_closing()   |
> >     |      / \                           |
> >     |     /   ---------------            |
> >     |    /                   \           v
> >     |   /                    --------------
> >     |  /    -----------------| CONNECTING |  socket created, TCP
> >     |  |   /                 --------------  connect initiated
> >     |  |   | con_sock_state_connected()
> >     |  |   v
> >    -------------
> >    | CONNECTED |  TCP connection established
> >    -------------
> >
> > Make the socket state an atomic variable, reinforcing that it's a
> > distinct transition with no possible "intermediate/both" states.
> > This is almost certainly overkill at this point, though the
> > transitions into CONNECTED and CLOSING state do get called via
> > socket callback (the rest of the transitions occur with the
> > connection mutex held).  We can back out the atomicity later.
> >
> > Signed-off-by: Alex Elder 
> > ---
> >  include/linux/ceph/messenger.h |    8 -
> >  net/ceph/messenger.c           |   63
> > 
> >  2 files changed, 69 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> > index 920235e..5e852f4 100644
> > --- a/include/linux/ceph/messenger.h
> > +++ b/include/linux/ceph/messenger.h
> > @@ -137,14 +137,18 @@ struct ceph_connection {
> >        const struct ceph_connection_operations *ops;
> >
> >        struct ceph_messenger *msgr;
> > +
> > +       atomic_t sock_state;
> >        struct socket *sock;
> > +       struct ceph_entity_addr peer_addr; /* peer address */
> > +       struct ceph_entity_addr peer_addr_for_me;
> > +
> >        unsigned long flags;
> >        unsigned long state;
> >        const char *error_msg;  /* error message, if any */
> >
> > -       struct ceph_entity_addr peer_addr; /* peer address */
> >        struct ceph_entity_name peer_name; /* peer name */
> > -       struct ceph_entity_addr peer_addr_for_me;
> > +
> >        unsigned peer_features;
> >        u32 connect_seq;      /* identify the most recent connection
> >                                 attempt for this connection, client */
> > diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> > index 29055df..7e11b07 100644
> > --- a/net/ceph/messenger.c
> > +++ b/net/ceph/messenger.c
> > @@ -29,6 +29,14 @@
> >  * the sender.
> >  */
> >
> > +/* State values for ceph_connection->sock_state; NEW is assumed to be 0 */
> > +
> > +#define CON_SOCK_STATE_NEW             0       /* -> CLOSED */
> > +#define CON_SOCK_STATE_CLOSED          1       /* -> CONNECTING */
> > +#define CON_SOCK_STATE_CONNECTING      2       /* -> CONNECTED or -> CLOSING */
> > +#define CON_SOCK_STATE_CONNECTED       3       /* -> CLOSING or -> CLOSED */
> > +#define CON_SOCK_STATE_CLOSING         4       /* -> CLOSED */
> > +
> >  /* static tag bytes (protocol control messages) */
> >  static char tag_msg = CEPH_MSGR_TAG_MSG;
> >  static char tag_ack = CEPH_MSGR_TAG_ACK;
> > @@ -147,6 +155,54 @@ void ceph_msgr_flush(void)
> >  }
> >  EXPORT_SYMBOL(ceph_msgr_flush);
> >
> > +/* Connection socket state transition functions */
> > +
> > +static void con_sock_state_init(struct ceph_connection *con)
> > +{
> > +       int old_state;
> > +
> > +       old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CLOSED);
> > +       if (WARN_ON(old_state != CON_SOCK_STATE_NEW))
> > +               printk("%s: unexpected old state %d\n", __func__, old_state);
> > +}
> > +
> > +static void con_sock_state_connecting(struct ceph_connection *con)
> > +{
> > +       int old_state;
> > +
> > +       old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CONNECTING);
> > +       if (WARN_ON(old_state != CON_SOCK_STATE_CLOSED))
> > +               printk("%s: unexpected old state %d\n", __func__, old_state);
> > +}
> > +
> > +static void con_sock_state_connected(struct ceph_co

Re: [PATCH 09/13] libceph: start tracking connection socket state

2012-06-11 Thread Yan, Zheng
On Thu, May 31, 2012 at 3:35 AM, Alex Elder  wrote:
> Start explicitly keeping track of the state of a ceph connection's
> socket, separate from the state of the connection itself.  Create
> placeholder functions to encapsulate the state transitions.
>
>    --------
>    | NEW* |  transient initial state
>    --------
>        | con_sock_state_init()
>        v
>    ----------
>    | CLOSED |  initialized, but no socket (and no
>    ----------  TCP connection)
>     ^      \
>     |       \ con_sock_state_connecting()
>     |        ----------------------
>     |                              \
>     + con_sock_state_closed()       \
>     |\                               \
>     | \                               \
>     |  -----------                     \
>     |  | CLOSING |  socket event;       \
>     |  -----------  await close          \
>     |       ^                            |
>     |       |                            |
>     |       + con_sock_state_closing()   |
>     |      / \                           |
>     |     /   ---------------            |
>     |    /                   \           v
>     |   /                    --------------
>     |  /    -----------------| CONNECTING |  socket created, TCP
>     |  |   /                 --------------  connect initiated
>     |  |   | con_sock_state_connected()
>     |  |   v
>    -------------
>    | CONNECTED |  TCP connection established
>    -------------
>
> Make the socket state an atomic variable, reinforcing that it's a
> distinct transition with no possible "intermediate/both" states.
> This is almost certainly overkill at this point, though the
> transitions into CONNECTED and CLOSING state do get called via
> socket callback (the rest of the transitions occur with the
> connection mutex held).  We can back out the atomicity later.
>
> Signed-off-by: Alex Elder 
> ---
>  include/linux/ceph/messenger.h |    8 -
>  net/ceph/messenger.c           |   63
> 
>  2 files changed, 69 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> index 920235e..5e852f4 100644
> --- a/include/linux/ceph/messenger.h
> +++ b/include/linux/ceph/messenger.h
> @@ -137,14 +137,18 @@ struct ceph_connection {
>        const struct ceph_connection_operations *ops;
>
>        struct ceph_messenger *msgr;
> +
> +       atomic_t sock_state;
>        struct socket *sock;
> +       struct ceph_entity_addr peer_addr; /* peer address */
> +       struct ceph_entity_addr peer_addr_for_me;
> +
>        unsigned long flags;
>        unsigned long state;
>        const char *error_msg;  /* error message, if any */
>
> -       struct ceph_entity_addr peer_addr; /* peer address */
>        struct ceph_entity_name peer_name; /* peer name */
> -       struct ceph_entity_addr peer_addr_for_me;
> +
>        unsigned peer_features;
>        u32 connect_seq;      /* identify the most recent connection
>                                 attempt for this connection, client */
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 29055df..7e11b07 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -29,6 +29,14 @@
>  * the sender.
>  */
>
> +/* State values for ceph_connection->sock_state; NEW is assumed to be 0 */
> +
> +#define CON_SOCK_STATE_NEW             0       /* -> CLOSED */
> +#define CON_SOCK_STATE_CLOSED          1       /* -> CONNECTING */
> +#define CON_SOCK_STATE_CONNECTING      2       /* -> CONNECTED or -> CLOSING */
> +#define CON_SOCK_STATE_CONNECTED       3       /* -> CLOSING or -> CLOSED */
> +#define CON_SOCK_STATE_CLOSING         4       /* -> CLOSED */
> +
>  /* static tag bytes (protocol control messages) */
>  static char tag_msg = CEPH_MSGR_TAG_MSG;
>  static char tag_ack = CEPH_MSGR_TAG_ACK;
> @@ -147,6 +155,54 @@ void ceph_msgr_flush(void)
>  }
>  EXPORT_SYMBOL(ceph_msgr_flush);
>
> +/* Connection socket state transition functions */
> +
> +static void con_sock_state_init(struct ceph_connection *con)
> +{
> +       int old_state;
> +
> +       old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CLOSED);
> +       if (WARN_ON(old_state != CON_SOCK_STATE_NEW))
> +               printk("%s: unexpected old state %d\n", __func__, old_state);
> +}
> +
> +static void con_sock_state_connecting(struct ceph_connection *con)
> +{
> +       int old_state;
> +
> +       old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CONNECTING);
> +       if (WARN_ON(old_state != CON_SOCK_STATE_CLOSED))
> +               printk("%s: unexpected old state %d\n", __func__, old_state);
> +}
> +
> +static void con_sock_state_connected(struct ceph_connection *con)
> +{
> +       int old_state;
> +
> +       old_state = atomic_xchg(&con->sock_state, CON_SOCK_STATE_CONNECTED);
> +       if (WARN_ON(old_state != CON_SOCK_STATE_CONNECTING))
> +               printk("%s: unexpected old state %d\n", __func__, old_state);
> +}
> +
> +static

Re: debian updates

2012-06-11 Thread Laszlo Boszormenyi (GCS)
Hi Sage,

On Mon, 2012-06-11 at 19:41 -0700, Sage Weil wrote:
> Can you take a look at the last 4 commits of
>   https://github.com/ceph/ceph/commits/debian
> and let me know if they address the issues you mentioned?
 Yes, they fix the issues I've mentioned. However, you could keep
ceph-kdump-copy for Ubuntu. It's not installable on Debian though, as
its dependency linux-crashdump is not available. I can package it if
you want. Now my fingers are crossed that libs3 gets accepted soon, as the
freeze for Wheezy is coming soon[1]. Ceph 0.47.2 is waiting for that to be in Wheezy.

Laszlo/GCS
[1] http://lists.debian.org/debian-devel-announce/2012/05/msg4.html



debian updates

2012-06-11 Thread Sage Weil
Hi Laszlo,

Can you take a look at the last 4 commits of

https://github.com/ceph/ceph/commits/debian

and let me know if they address the issues you mentioned?

Thanks-
sage




Re: Ceph packaging in Debian GNU/Linux

2012-06-11 Thread Sage Weil
Hi Laszlo,

On Mon, 11 Jun 2012, Laszlo Boszormenyi (GCS) wrote:
> Hi Loic, Sage,
> 
> On Sun, 2012-06-10 at 14:21 +0200, Loic Dachary wrote:
> > On 06/09/2012 05:01 PM, Laszlo Boszormenyi (GCS) wrote:
> > > On Sat, 2012-06-09 at 15:39 +0200, Loic Dachary wrote:
> > >> Amazingly quick answer ;-) Did you build from sources or using the 
> > >> packages provided by ceph themselves ?
> > >  It's not that short to answer. I'm a DD and thus I do my own packages. I try
> > > not to diverge much from their packages, so sometimes I send patches to
> > > them. But you have to note that we have different objectives. They build for
> > > stable releases; I do it for Sid, targeting Wheezy.
> > Maybe they changed their policy because I used the packages they built for 
> > wheezy successfully. Are there many differences between yours and theirs ?
>  Well, it's a bit strange that you could build their stable packages on
> Wheezy, but not impossible. Please note that they favor Ubuntu; at
> least the ceph-kdump-copy package is nonsense on Debian.

I can separate that out... we just needed it for our qa infrastructure.  

> About the changes, see the attached patches for example. Upstream lists the
> python build-dependency twice (the first can be deleted) and lists an
> empty Recommends line. It's better to run configure in its own target,
> as seen in the second patch.

Fixed/applied both, thanks!

> One of its build dependencies, leveldb, is not buildable on several
> platforms. Thus a
> sed -i "s/Architecture: linux-any/Architecture: amd64 armel armhf i386 ia64 mipsel/" debian/control
> is needed in their tree.

Fixed.

>  You asked if I need help with packaging. Only testing that it's buildable.
> The first is libs3 [1], which I could test on amd64 only. The second is ceph
> itself [2]. I could test it on amd64; I'm interested in builds on armel,
> armhf, i386, ia64 or mipsel. We're losing users on mips, powerpc, s390, s390x
> and sparc, due to leveldb not being buildable on them.

>  Sage, please be honest. How do you choose other projects to depend on?
> It's clear that leveldb doesn't have any stable release, only a git
> tree. Checking the Debian build logs, it's also known that it doesn't
> build on several arches. Its maintainer is trying to port it to those, and
> the package contains his work to make it buildable on others.

leveldb doesn't have official releases or tidy release management, but the 
library is better than the alternatives, so it's worth the overhead for 
us to make it consumable.  I think that's reflected by the fact that it is 
now packaged, and is being actively used by several other projects 
(notably riak).

> Checking the other external dependency, libs3. It seems to be abandoned
> since 2008, according to Amazon itself[3]. As I can't access its git
> tree[4], I don't know otherwise.

libs3 was a difficult choice.  It is abandoned, but it also appears to be 
the best option for a native C S3 client library.  If someone knows of 
another, we're all ears, but in the meantime libs3 works. 

Also, notably, it is only used for rest-bench, a simple benchmarking tool.  
For the Debian packages, you are free to configure --without-rest-bench.  
It shares a bunch of code with the 'rados bench' command, so it's nice to 
have it together in the ceph source package.

Thanks!
sage


Re: radosgw-admin: mildly confusing man page and usage message

2012-06-11 Thread Florian Haas
On 06/11/12 23:39, Yehuda Sadeh wrote:
>> If one of the Ceph guys could provide a quick comment on this, I can
>> send a patch to the man page RST. Thanks.
>>
> 
> Minimum required to create a user:
> 
> radosgw-admin user create --uid=<user-id> --display-name=<display-name>
> 
> The user id is actually a user 'account' name, not necessarily a
> numeric value. The email param is optional.

Thanks. https://github.com/ceph/ceph/pull/13

Cheers,
Florian


Re: radosgw-admin: mildly confusing man page and usage message

2012-06-11 Thread Yehuda Sadeh
On Mon, Jun 11, 2012 at 2:35 PM, Florian Haas  wrote:
> Hi,
>
> just noticed that radosgw-admin comes with a bit of confusing content in
> its man page and usage message:
>
> EXAMPLES
>       Generate a new user:
>
>       $ radosgw-admin user gen --display-name="johnny rotten"
> --email=joh...@rotten.com
>
> As far as I remember "user gen" is gone, and it's now "user create".
> However:
>
> radosgw-admin user create --display-name="test" --email=test@demo
> user_id was not specified, aborting
>
> ... is followed by a usage message that doesn't mention user_id anywhere
> (the option string is --uid). So conceivably the example could also use
> a mention of --uid.
>
> Also, is there a way to retrieve the "next available" user_id or just
> tell radosgw-admin to use max(user_id)+1?
>
> If one of the Ceph guys could provide a quick comment on this, I can
> send a patch to the man page RST. Thanks.
>

Minimum required to create a user:

radosgw-admin user create --uid=<user-id> --display-name=<display-name>

The user id is actually a user 'account' name, not necessarily a
numeric value. The email param is optional.
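
For example, with purely hypothetical values (only --uid and --display-name
are required, as noted above; --email can be added but is optional):

  radosgw-admin user create --uid=testuser --display-name="Test User"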

Yehuda


radosgw-admin: mildly confusing man page and usage message

2012-06-11 Thread Florian Haas
Hi,

just noticed that radosgw-admin comes with a bit of confusing content in
its man page and usage message:

EXAMPLES
   Generate a new user:

   $ radosgw-admin user gen --display-name="johnny rotten"
--email=joh...@rotten.com

As far as I remember "user gen" is gone, and it's now "user create".
However:

radosgw-admin user create --display-name="test" --email=test@demo
user_id was not specified, aborting

... is followed by a usage message that doesn't mention user_id anywhere
(the option string is --uid). So conceivably the example could also use
a mention of --uid.

Also, is there a way to retrieve the "next available" user_id or just
tell radosgw-admin to use max(user_id)+1?

If one of the Ceph guys could provide a quick comment on this, I can
send a patch to the man page RST. Thanks.

Cheers,
Florian


RBD stale on VM, and RBD cache enable problem

2012-06-11 Thread Sławomir Skowron
I have two questions about my newly created cluster: xfs on all OSDs,
Ubuntu Precise, kernel 3.2.0-23-generic, Ceph 0.47.2-1precise.

pool 0 'data' rep size 3 crush_ruleset 0 object_hash rjenkins pg_num
64 pgp_num 64 last_change 1228 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 3 crush_ruleset 1 object_hash rjenkins
pg_num 64 pgp_num 64 last_change 1226 owner 0
pool 2 'rbd' rep size 3 crush_ruleset 2 object_hash rjenkins pg_num 64
pgp_num 64 last_change 1232 owner 0
pool 3 '.rgw' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 8
pgp_num 8 last_change 3878 owner 18446744073709551615

1. After I stop all daemons on one machine in my 3-node cluster with 3
replicas, rbd image operations in the VM stall. A dd on this device in the VM
freezes, and after Ceph is started again on that machine everything comes
back online. Is there a problem with my config? In this situation Ceph should
serve reads from the other copies and send writes to another OSD in the
replica chain, yes?

In another test, iozone on the device stops after the daemons are stopped on
one machine, and once the OSDs are back up iozone continues. How can I tune
this to work without freezing?

2012-06-11 21:38:49.583133pg v88173: 200 pgs: 60 active+clean, 1
stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used,
18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
2012-06-11 21:38:50.582257pg v88174: 200 pgs: 60 active+clean, 1
stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used,
18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
.
2012-06-11 21:39:49.991893pg v88197: 200 pgs: 60 active+clean, 1
stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used,
18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
2012-06-11 21:39:50.992755pg v88198: 200 pgs: 60 active+clean, 1
stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used,
18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
2012-06-11 21:39:51.993533pg v88199: 200 pgs: 60 active+clean, 1
stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used,
18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
2012-06-11 21:39:52.994397pg v88200: 200 pgs: 60 active+clean, 1
stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used,
18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)

After booting all OSDs on the stopped machine:

2012-06-11 21:40:37.826619   osd e4162: 72 osds: 53 up, 72 in
2012-06-11 21:40:37.825706 mon.0 10.177.66.4:6790/0 348 : [INF] osd.24
10.177.66.6:6800/21597 boot
2012-06-11 21:40:38.825297pg v88202: 200 pgs: 54 active+clean, 7
stale+active+clean, 139 active+degraded; 783 GB data, 1928 GB used,
18111 GB / 20040 GB avail; 78169/254952 degraded (30.660%)
2012-06-11 21:40:38.826517   osd e4163: 72 osds: 54 up, 72 in
2012-06-11 21:40:38.825250 mon.0 10.177.66.4:6790/0 349 : [INF] osd.25
10.177.66.6:6803/21712 boot
2012-06-11 21:40:38.825655 mon.0 10.177.66.4:6790/0 350 : [INF] osd.28
10.177.66.6:6812/26210 boot
2012-06-11 21:40:38.825907 mon.0 10.177.66.4:6790/0 351 : [INF] osd.29
10.177.66.6:6815/26327 boot
2012-06-11 21:40:39.826738pg v88203: 200 pgs: 56 active+clean, 4
stale+active+clean, 3 peering, 137 active+degraded; 783 GB data, 1928
GB used, 18111 GB / 20040 GB avail; 76921/254952 degraded (30.171%)
2012-06-11 21:40:39.830098   osd e4164: 72 osds: 59 up, 72 in
2012-06-11 21:40:39.826570 mon.0 10.177.66.4:6790/0 352 : [INF] osd.26
10.177.66.6:6806/21835 boot
2012-06-11 21:40:39.826961 mon.0 10.177.66.4:6790/0 353 : [INF] osd.27
10.177.66.6:6809/21953 boot
2012-06-11 21:40:39.828147 mon.0 10.177.66.4:6790/0 354 : [INF] osd.30
10.177.66.6:6818/26511 boot
2012-06-11 21:40:39.828418 mon.0 10.177.66.4:6790/0 355 : [INF] osd.31
10.177.66.6:6821/26583 boot
2012-06-11 21:40:39.828935 mon.0 10.177.66.4:6790/0 356 : [INF] osd.33
10.177.66.6:6827/26859 boot
2012-06-11 21:40:39.829274 mon.0 10.177.66.4:6790/0 357 : [INF] osd.34
10.177.66.6:6830/26979 boot
2012-06-11 21:40:40.827935pg v88204: 200 pgs: 56 active+clean, 4
stale+active+clean, 3 peering, 137 active+degraded; 783 GB data, 1928
GB used, 18111 GB / 20040 GB avail; 76921/254952 degraded (30.171%)
2012-06-11 21:40:40.830059   osd e4165: 72 osds: 62 up, 72 in
2012-06-11 21:40:40.827798 mon.0 10.177.66.4:6790/0 358 : [INF] osd.32
10.177.66.6:6824/26701 boot
2012-06-11 21:40:40.829043 mon.0 10.177.66.4:6790/0 359 : [INF] osd.35
10.177.66.6:6833/27165 boot
2012-06-11 21:40:40.829316 mon.0 10.177.66.4:6790/0 360 : [INF] osd.36
10.177.66.6:6836/27280 boot
2012-06-11 21:40:40.829602 mon.0 10.177.66.4:6790/0 361 : [INF] osd.37
10.177.66.6:6839/27397 boot
2012-06-11 21:40:41.828776pg v88205: 200 pgs: 56 active+clean, 4
stale+active+clean, 3 peering, 137 active+degraded; 783 GB data, 1928
GB used, 18111 GB / 20040 GB avail; 76921/254952 degraded (30.171%)
2012-06-11 21:40:41.831823   osd e4166: 72 osds: 68 up, 72 in
2012-06-11 21:40:41.828713 mon.0 10.177.66.4:6790/0 362 : [INF] osd.38
10.177.66.6:6842/27513 boot
2012-06-11 21:40:41.82944

Re: Random data corruption in VM, possibly caused by rbd

2012-06-11 Thread Josh Durgin

On 06/11/2012 10:07 AM, Guido Winkelmann wrote:

On Monday, 11 June 2012, 09:30:42, Sage Weil wrote:

On Mon, 11 Jun 2012, Guido Winkelmann wrote:

On Friday, 8 June 2012, 06:55:19, Sage Weil wrote:

On Fri, 8 Jun 2012, Oliver Francke wrote:



Are you guys able to reproduce the corruption with 'debug osd = 20' and

'debug ms = 1'?  Ideally we'd like to:
  - reproduce from a fresh vm, with osd logs
  - identify the bad file
  - map that file to a block offset (see

http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)

  - use that to identify the badness in the log

I suspect the cache is just masking the problem because it submits fewer
IOs...


Okay, I added 'debug osd = 20' and 'debug ms = 1' under [global] and
'filestore fiemap = false' under [osd] and started a new VM. That worked
nicely, and the iotester found no corruptions. Then I removed 'filestore
fiemap = false' from the config, restarted all osds and ran the iotester
again. Output is as follows:

testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date
Mon Jun 11 17:34:44 CEST 2012
Wrote 100 MiB of data in 1943 milliseconds
Wrote 100 MiB of data in 1858 milliseconds
Wrote 100 MiB of data in 2213 milliseconds
Wrote 100 MiB of data in 3441 milliseconds
Wrote 100 MiB of data in 2705 milliseconds
Wrote 100 MiB of data in 1778 milliseconds
Wrote 100 MiB of data in 1974 milliseconds
Wrote 100 MiB of data in 2780 milliseconds
Wrote 100 MiB of data in 1961 milliseconds
Wrote 100 MiB of data in 2366 milliseconds
Wrote 100 MiB of data in 1886 milliseconds
Wrote 100 MiB of data in 3589 milliseconds
Wrote 100 MiB of data in 1973 milliseconds
Wrote 100 MiB of data in 2506 milliseconds
Wrote 100 MiB of data in 1937 milliseconds
Wrote 100 MiB of data in 3404 milliseconds
Wrote 100 MiB of data in 1990 milliseconds
Wrote 100 MiB of data in 3713 milliseconds
Read 100 MiB of data in 4856 milliseconds
Digest wrong for file
"/var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa" Mon Jun 11
17:35:34 CEST 2012
testserver-rbd11 iotester # ~/fiemap
/var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa
File /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa has 1 extents:
#   Logical  Physical Length   Flags
0:   a820 0010 000

I looked into the file in question, and it started with zero-bytes from
the
start until position 0xbff, even though it was supposed to all random
data.

I have included timestamps in the hopes they might make it easier to find
the related entries in the logs.

So what do I do now? The logs are very large and complex, and I don't
understand most of what's in there. I don't even know which OSD served
that
particular block/object.


If you can reproduce it with 'debug filestore = 20' too, that will be
better, as it will tell us what the FIEMAP ioctl is returning.  Also, if
you can attach/post the contents of the object itself (rados -p rbd get
rb.0.1.02a0 /tmp/foo) we can make sure the object has the right
data (and the sparse-read operation that librbd is doing is the culprit).


Um. Maybe... That's the problem with using random data, I can't just look at
it and recognize it. I guess tomorrow I'll slap something together to see if I
can find any 1 Meg-range of data in there that matches the expect checksum.



As for the log:

First, map the offset to an rbd block.  For example, taking the 'Physical'
value of a820 from above:

$ printf "%012x\n" $((0xa820 / (4096*1024) ))
02a0

Then figure out what the object name prefix is:

$ rbd info  | grep prefix
 block_name_prefix: rb.0.1

Then add the block number, 02a0 to that, e.g. rb.0.1.02a0.

Then map that back to an osd with

$ ceph osd map rbd rb.0.1.02a0
osdmap e19 pool 'rbd' (2) object 'rb.0.1.02a0' ->  pg 2.a2e06f65
(2.5) ->  up [0,2] acting [0,2]

You'll see the osd ids listed in brackets after 'active'.  We want the
first one, 0 in my example.  The log from that OSD is what we need.


I'm getting

osdmap e89 pool 'rbd' (2) object 'rb.0.13.02a0' ->  pg 2.aca5eccb
(2.4b) ->  up [1,2] acting [1,2]

from that command, so I guess it's osd.1 then.
Do you have somewhere I can upload the log? It is 1.1 GiB in size. Bzip2 gets
it down to 53 MiB, but that's still too large to be sent to a mailing list...


You can attach it to the tracker: http://tracker.newdream.net/issues/2535


Re: Random data corruption in VM, possibly caused by rbd

2012-06-11 Thread Sage Weil
On Mon, 11 Jun 2012, Guido Winkelmann wrote:
> On Monday, 11 June 2012, 09:30:42, Sage Weil wrote:
> > On Mon, 11 Jun 2012, Guido Winkelmann wrote:
> > > On Friday, 8 June 2012, 06:55:19, Sage Weil wrote:
> > > > On Fri, 8 Jun 2012, Oliver Francke wrote:
> 
> > > > Are you guys able to reproduce the corruption with 'debug osd = 20' and
> > > > 
> > > > 'debug ms = 1'?  Ideally we'd like to:
> > > >  - reproduce from a fresh vm, with osd logs
> > > >  - identify the bad file
> > > >  - map that file to a block offset (see
> > > >  
> > > >http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
> > > >  
> > > >  - use that to identify the badness in the log
> > > > 
> > > > I suspect the cache is just masking the problem because it submits fewer
> > > > IOs...
> > > 
> > > Okay, I added 'debug osd = 20' and 'debug ms = 1' under [global] and
> > > 'filestore fiemap = false' under [osd] and started a new VM. That worked
> > > nicely, and the iotester found no corruptions. Then I removed 'filestore
> > > fiemap = false' from the config, restarted all osds and ran the iotester
> > > again. Output is as follows:
> > > 
> > > testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date
> > > Mon Jun 11 17:34:44 CEST 2012
> > > Wrote 100 MiB of data in 1943 milliseconds
> > > Wrote 100 MiB of data in 1858 milliseconds
> > > Wrote 100 MiB of data in 2213 milliseconds
> > > Wrote 100 MiB of data in 3441 milliseconds
> > > Wrote 100 MiB of data in 2705 milliseconds
> > > Wrote 100 MiB of data in 1778 milliseconds
> > > Wrote 100 MiB of data in 1974 milliseconds
> > > Wrote 100 MiB of data in 2780 milliseconds
> > > Wrote 100 MiB of data in 1961 milliseconds
> > > Wrote 100 MiB of data in 2366 milliseconds
> > > Wrote 100 MiB of data in 1886 milliseconds
> > > Wrote 100 MiB of data in 3589 milliseconds
> > > Wrote 100 MiB of data in 1973 milliseconds
> > > Wrote 100 MiB of data in 2506 milliseconds
> > > Wrote 100 MiB of data in 1937 milliseconds
> > > Wrote 100 MiB of data in 3404 milliseconds
> > > Wrote 100 MiB of data in 1990 milliseconds
> > > Wrote 100 MiB of data in 3713 milliseconds
> > > Read 100 MiB of data in 4856 milliseconds
> > > Digest wrong for file
> > > "/var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa" Mon Jun 11
> > > 17:35:34 CEST 2012
> > > testserver-rbd11 iotester # ~/fiemap
> > > /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa
> > > File /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa has 1 extents:
> > > #   Logical  Physical Length   Flags
> > > 0:   a820 0010 000
> > > 
> > > I looked into the file in question, and it started with zero-bytes from
> > > the
> > > start until position 0xbff, even though it was supposed to all random
> > > data.
> > > 
> > > I have included timestamps in the hopes they might make it easier to find
> > > the related entries in the logs.
> > > 
> > > So what do I do now? The logs are very large and complex, and I don't
> > > understand most of what's in there. I don't even know which OSD served
> > > that
> > > particular block/object.
> > 
> > If you can reproduce it with 'debug filestore = 20' too, that will be
> > better, as it will tell us what the FIEMAP ioctl is returning.  Also, if
> > you can attach/post the contents of the object itself (rados -p rbd get
> > rb.0.1.02a0 /tmp/foo) we can make sure the object has the right
> > data (and the sparse-read operation that librbd is doing is the culprit).
> 
> Um. Maybe... That's the problem with using random data, I can't just look at 
> it and recognize it. I guess tomorrow I'll slap something together to see if 
> I 
> can find any 1 Meg-range of data in there that matches the expect checksum.

The process below will identify the object in question..

> > 
> > As for the log:
> > 
> > First, map the offset to an rbd block.  For example, taking the 'Physical'
> > value of a820 from above:
> > 
> > $ printf "%012x\n" $((0xa820 / (4096*1024) ))
> > 02a0
> > 
> > Then figure out what the object name prefix is:
> > 
> > $ rbd info  | grep prefix
> > block_name_prefix: rb.0.1
> > 
> > Then add the block number, 02a0 to that, e.g. rb.0.1.02a0.
> > 
> > Then map that back to an osd with
> > 
> > $ ceph osd map rbd rb.0.1.02a0
> > osdmap e19 pool 'rbd' (2) object 'rb.0.1.02a0' -> pg 2.a2e06f65
> > (2.5) -> up [0,2] acting [0,2]
> > 
> > You'll see the osd ids listed in brackets after 'active'.  We want the
> > first one, 0 in my example.  The log from that OSD is what we need.
> 
> I'm getting
> 
> osdmap e89 pool 'rbd' (2) object 'rb.0.13.02a0' -> pg 2.aca5eccb 
> (2.4b) -> up [1,2] acting [1,2]
> 
> from that command, so I guess it's osd.1 then.
> Do you have somewhere I can upload the log? It is 1.1 GiB in size. Bzip2 
> gets it down to 53 MiB, but that's still too large to be sent to a 
> mailing list...

Yeah, but it'll be more useful i

Re: setting up virtual machines with ceph

2012-06-11 Thread Tommi Virtanen
On Sun, Jun 10, 2012 at 3:51 AM, udit agarwal  wrote:
>  Thanks for your reply. I just want virtualization on my ceph system. I will
> explain the exact setup that I want: I want to run 10 virtual machine
> instances on my ceph client while utilizing the storage on the other two
> systems. I tried to study libvirt but couldn't work out how to use it to
> solve this problem. Please help me out with this.

Hi. I'd like to help but I'm not sure what trouble you are really
running into. Can you use libvirt successfully with local storage? If
not, then you probably need to find a good linux admin to help you
out. If you can get libvirt to work locally, but not with rbd, can you
ask a specific question?


Re: Random data corruption in VM, possibly caused by rbd

2012-06-11 Thread Guido Winkelmann
On Monday, 11 June 2012, 09:30:42, Sage Weil wrote:
> On Mon, 11 Jun 2012, Guido Winkelmann wrote:
> > On Friday, 8 June 2012, 06:55:19, Sage Weil wrote:
> > > On Fri, 8 Jun 2012, Oliver Francke wrote:

> > > Are you guys able to reproduce the corruption with 'debug osd = 20' and
> > > 
> > > 'debug ms = 1'?  Ideally we'd like to:
> > >  - reproduce from a fresh vm, with osd logs
> > >  - identify the bad file
> > >  - map that file to a block offset (see
> > >  
> > >http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
> > >  
> > >  - use that to identify the badness in the log
> > > 
> > > I suspect the cache is just masking the problem because it submits fewer
> > > IOs...
> > 
> > Okay, I added 'debug osd = 20' and 'debug ms = 1' under [global] and
> > 'filestore fiemap = false' under [osd] and started a new VM. That worked
> > nicely, and the iotester found no corruptions. Then I removed 'filestore
> > fiemap = false' from the config, restarted all osds and ran the iotester
> > again. Output is as follows:
> > 
> > testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date
> > Mon Jun 11 17:34:44 CEST 2012
> > Wrote 100 MiB of data in 1943 milliseconds
> > Wrote 100 MiB of data in 1858 milliseconds
> > Wrote 100 MiB of data in 2213 milliseconds
> > Wrote 100 MiB of data in 3441 milliseconds
> > Wrote 100 MiB of data in 2705 milliseconds
> > Wrote 100 MiB of data in 1778 milliseconds
> > Wrote 100 MiB of data in 1974 milliseconds
> > Wrote 100 MiB of data in 2780 milliseconds
> > Wrote 100 MiB of data in 1961 milliseconds
> > Wrote 100 MiB of data in 2366 milliseconds
> > Wrote 100 MiB of data in 1886 milliseconds
> > Wrote 100 MiB of data in 3589 milliseconds
> > Wrote 100 MiB of data in 1973 milliseconds
> > Wrote 100 MiB of data in 2506 milliseconds
> > Wrote 100 MiB of data in 1937 milliseconds
> > Wrote 100 MiB of data in 3404 milliseconds
> > Wrote 100 MiB of data in 1990 milliseconds
> > Wrote 100 MiB of data in 3713 milliseconds
> > Read 100 MiB of data in 4856 milliseconds
> > Digest wrong for file
> > "/var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa" Mon Jun 11
> > 17:35:34 CEST 2012
> > testserver-rbd11 iotester # ~/fiemap
> > /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa
> > File /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa has 1 extents:
> > #   Logical  Physical Length   Flags
> > 0:   a820 0010 000
> > 
> > I looked into the file in question, and it started with zero-bytes from
> > the
> > start until position 0xbff, even though it was supposed to all random
> > data.
> > 
> > I have included timestamps in the hopes they might make it easier to find
> > the related entries in the logs.
> > 
> > So what do I do now? The logs are very large and complex, and I don't
> > understand most of what's in there. I don't even know which OSD served
> > that
> > particular block/object.
> 
> If you can reproduce it with 'debug filestore = 20' too, that will be
> better, as it will tell us what the FIEMAP ioctl is returning.  Also, if
> you can attach/post the contents of the object itself (rados -p rbd get
> rb.0.1.02a0 /tmp/foo) we can make sure the object has the right
> data (and the sparse-read operation that librbd is doing is the culprit).

Um. Maybe... That's the problem with using random data, I can't just look at 
it and recognize it. I guess tomorrow I'll slap something together to see if I 
can find any 1 Meg range of data in there that matches the expected checksum.

> 
> As for the log:
> 
> First, map the offset to an rbd block.  For example, taking the 'Physical'
> value of a820 from above:
> 
> $ printf "%012x\n" $((0xa820 / (4096*1024) ))
> 02a0
> 
> Then figure out what the object name prefix is:
> 
> $ rbd info  | grep prefix
> block_name_prefix: rb.0.1
> 
> Then add the block number, 02a0 to that, e.g. rb.0.1.02a0.
> 
> Then map that back to an osd with
> 
> $ ceph osd map rbd rb.0.1.02a0
> osdmap e19 pool 'rbd' (2) object 'rb.0.1.02a0' -> pg 2.a2e06f65
> (2.5) -> up [0,2] acting [0,2]
> 
> You'll see the osd ids listed in brackets after 'active'.  We want the
> first one, 0 in my example.  The log from that OSD is what we need.

I'm getting

osdmap e89 pool 'rbd' (2) object 'rb.0.13.02a0' -> pg 2.aca5eccb 
(2.4b) -> up [1,2] acting [1,2]

from that command, so I guess it's osd.1 then.
Do you have somewhere I can upload the log? It is 1.1 GiB in size. Bzip2 gets 
it down to 53 MiB, but that's still too large to be sent to a mailing list...

Guido


Re: class exec test

2012-06-11 Thread Sage Weil
On Sun, 10 Jun 2012, Josh Durgin wrote:
> On 06/10/2012 10:03 PM, Sage Weil wrote:
> > Hey-
> > 
> > The librados api tests were calling a dummy "test_exec" method in cls_rbd
> > that apparently got removed.  We probably want to replace the test with
> > *something*, though...  maybe a "version" or similar command that just
> > returns the version of the class?  Or an OSD built-in dummy class with
> > no-op methods?
> 
> I replaced this with a get_all_features method in the wip-rados-test
> branch. With this, the rbd tool can report the features that can be used
> to create a new image.

Looks good, pushed to master.

s


Re: [PATCH] mon: fix pg state logging

2012-06-11 Thread Sage Weil
On Mon, 11 Jun 2012, Yan, Zheng wrote:
> From: "Yan, Zheng" 
> 
> PGMap->num_pg_by_state maps a PG state to the number of PGs in that state.
> PGMonitor::update_logger wrongly interprets the mapping.

Thanks, applied!
sage


> 
> Signed-off-by: Yan, Zheng 
> ---
>  src/mon/PGMonitor.cc |   12 ++--
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/src/mon/PGMonitor.cc b/src/mon/PGMonitor.cc
> index 97fbb1b..1b0a210 100644
> --- a/src/mon/PGMonitor.cc
> +++ b/src/mon/PGMonitor.cc
> @@ -96,13 +96,13 @@ void PGMonitor::update_logger()
>    for (hash_map<int,int>::iterator p = pg_map.num_pg_by_state.begin();
> p != pg_map.num_pg_by_state.end();
> ++p) {
> -if (p->second & PG_STATE_ACTIVE) {
> -  active++;
> -  if (p->second & PG_STATE_CLEAN)
> - active_clean++;
> +if (p->first & PG_STATE_ACTIVE) {
> +  active += p->second;
> +  if (p->first & PG_STATE_CLEAN)
> + active_clean += p->second;
>  }
> -if (p->second & PG_STATE_PEERING)
> -  peering++;
> +if (p->first & PG_STATE_PEERING)
> +  peering += p->second;
>}
>mon->cluster_logger->set(l_cluster_num_pg_active_clean, active_clean);
>mon->cluster_logger->set(l_cluster_num_pg_active, active);
> -- 
> 1.7.10.2
> 


Re: Random data corruption in VM, possibly caused by rbd

2012-06-11 Thread Sage Weil
On Mon, 11 Jun 2012, Guido Winkelmann wrote:
> On Friday, 8 June 2012, 06:55:19, Sage Weil wrote:
> > On Fri, 8 Jun 2012, Oliver Francke wrote:
> > > Hi Guido,
> > > 
> > > yeah, there is something weird going on. I just started to establish some
> > > test-VM's. Freshly imported from running *.qcow2 images.
> > > Kernel panic with INIT, seg-faults and other "funny" stuff.
> > > 
> > > Just added the rbd_cache=true in my config, voila. All is
> > > fast-n-up-n-running...
> > > All my testing was done with cache enabled... Since our errors all came
> > > from rbd_writeback from former ceph-versions...
> > 
> > Are you guys able to reproduce the corruption with 'debug osd = 20' and
> > 'debug ms = 1'?  Ideally we'd like to:
> > 
> >  - reproduce from a fresh vm, with osd logs
> >  - identify the bad file
> >  - map that file to a block offset (see
> >http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
> >  - use that to identify the badness in the log
> > 
> > I suspect the cache is just masking the problem because it submits fewer
> > IOs...
> 
> Okay, I added 'debug osd = 20' and 'debug ms = 1' under [global] and 
> 'filestore fiemap = false' under [osd] and started a new VM. That worked 
> nicely, and the iotester found no corruptions. Then I removed 'filestore 
> fiemap = false' from the config, restarted all osds and ran the iotester 
> again. Output is as follows:
> 
> testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date
> Mon Jun 11 17:34:44 CEST 2012
> Wrote 100 MiB of data in 1943 milliseconds
> Wrote 100 MiB of data in 1858 milliseconds
> Wrote 100 MiB of data in 2213 milliseconds
> Wrote 100 MiB of data in 3441 milliseconds
> Wrote 100 MiB of data in 2705 milliseconds
> Wrote 100 MiB of data in 1778 milliseconds
> Wrote 100 MiB of data in 1974 milliseconds
> Wrote 100 MiB of data in 2780 milliseconds
> Wrote 100 MiB of data in 1961 milliseconds
> Wrote 100 MiB of data in 2366 milliseconds
> Wrote 100 MiB of data in 1886 milliseconds
> Wrote 100 MiB of data in 3589 milliseconds
> Wrote 100 MiB of data in 1973 milliseconds
> Wrote 100 MiB of data in 2506 milliseconds
> Wrote 100 MiB of data in 1937 milliseconds
> Wrote 100 MiB of data in 3404 milliseconds
> Wrote 100 MiB of data in 1990 milliseconds
> Wrote 100 MiB of data in 3713 milliseconds
> Read 100 MiB of data in 4856 milliseconds
> Digest wrong for file "/var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa"
> Mon Jun 11 17:35:34 CEST 2012
> testserver-rbd11 iotester # ~/fiemap 
> /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa
> File /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa has 1 extents:
> #   Logical  Physical Length   Flags
> 0:   a820 0010 000
> 
> I looked into the file in question, and it started with zero-bytes from the 
> start until position 0xbff, even though it was supposed to all random data.
> 
> I have included timestamps in the hopes they might make it easier to find the 
> related entries in the logs.
> 
> So what do I do now? The logs are very large and complex, and I don't 
> understand most of what's in there. I don't even know which OSD served that 
> particular block/object.

If you can reproduce it with 'debug filestore = 20' too, that will be 
better, as it will tell us what the FIEMAP ioctl is returning.  Also, if 
you can attach/post the contents of the object itself (rados -p rbd get 
rb.0.1.02a0 /tmp/foo) we can make sure the object has the right 
data (and the sparse-read operation that librbd is doing is the culprit).

As for the log:

First, map the offset to an rbd block.  For example, taking the 'Physical' 
value of a820 from above:

$ printf "%012x\n" $((0xa820 / (4096*1024) ))
02a0

Then figure out what the object name prefix is:

$ rbd info <image> | grep prefix
block_name_prefix: rb.0.1

Then add the block number, 02a0 to that, e.g. rb.0.1.02a0.

Then map that back to an osd with

$ ceph osd map rbd rb.0.1.02a0
osdmap e19 pool 'rbd' (2) object 'rb.0.1.02a0' -> pg 2.a2e06f65 
(2.5) -> up [0,2] acting [0,2]

You'll see the osd ids listed in brackets after 'active'.  We want the 
first one, 0 in my example.  The log from that OSD is what we need.
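
Putting those two steps together, here is a quick Python sketch of the same
arithmetic (assuming the default 4 MiB rbd object size; the prefix and the
example values are hypothetical and should be taken from 'rbd info' and the
fiemap output):

  # Map a FIEMAP 'Physical' byte offset to the rados object that backs it,
  # assuming the default 4 MiB rbd object size.
  def rbd_object_for_offset(physical_offset, block_name_prefix, obj_size=4096 * 1024):
      block = physical_offset // obj_size   # same arithmetic as the printf above
      return "%s.%012x" % (block_name_prefix, block)

  # Hypothetical values; read the real prefix from 'rbd info <image> | grep prefix'.
  print(rbd_object_for_offset(0x2a0 * 4096 * 1024, "rb.0.1"))
  # -> rb.0.1.0000000002a0

The resulting object name can then be fed to 'ceph osd map rbd <object>' as
shown above to find the primary OSD whose log we need.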

Thanks!
sage


Re: Random data corruption in VM, possibly caused by rbd

2012-06-11 Thread Guido Winkelmann
On Friday, 8 June 2012, 06:55:19, Sage Weil wrote:
> On Fri, 8 Jun 2012, Oliver Francke wrote:
> > Hi Guido,
> > 
> > yeah, there is something weird going on. I just started to establish some
> > test-VM's. Freshly imported from running *.qcow2 images.
> > Kernel panic with INIT, seg-faults and other "funny" stuff.
> > 
> > Just added the rbd_cache=true in my config, voila. All is
> > fast-n-up-n-running...
> > All my testing was done with cache enabled... Since our errors all came
> > from rbd_writeback from former ceph-versions...
> 
> Are you guys able to reproduce the corruption with 'debug osd = 20' and
> 'debug ms = 1'?  Ideally we'd like to:
> 
>  - reproduce from a fresh vm, with osd logs
>  - identify the bad file
>  - map that file to a block offset (see
>http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
>  - use that to identify the badness in the log
> 
> I suspect the cache is just masking the problem because it submits fewer
> IOs...

Okay, I added 'debug osd = 20' and 'debug ms = 1' under [global] and 
'filestore fiemap = false' under [osd] and started a new VM. That worked 
nicely, and the iotester found no corruptions. Then I removed 'filestore 
fiemap = false' from the config, restarted all osds and ran the iotester 
again. Output is as follows:

testserver-rbd11 iotester # date ; ./iotester /var/iotest ; date
Mon Jun 11 17:34:44 CEST 2012
Wrote 100 MiB of data in 1943 milliseconds
Wrote 100 MiB of data in 1858 milliseconds
Wrote 100 MiB of data in 2213 milliseconds
Wrote 100 MiB of data in 3441 milliseconds
Wrote 100 MiB of data in 2705 milliseconds
Wrote 100 MiB of data in 1778 milliseconds
Wrote 100 MiB of data in 1974 milliseconds
Wrote 100 MiB of data in 2780 milliseconds
Wrote 100 MiB of data in 1961 milliseconds
Wrote 100 MiB of data in 2366 milliseconds
Wrote 100 MiB of data in 1886 milliseconds
Wrote 100 MiB of data in 3589 milliseconds
Wrote 100 MiB of data in 1973 milliseconds
Wrote 100 MiB of data in 2506 milliseconds
Wrote 100 MiB of data in 1937 milliseconds
Wrote 100 MiB of data in 3404 milliseconds
Wrote 100 MiB of data in 1990 milliseconds
Wrote 100 MiB of data in 3713 milliseconds
Read 100 MiB of data in 4856 milliseconds
Digest wrong for file "/var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa"
Mon Jun 11 17:35:34 CEST 2012
testserver-rbd11 iotester # ~/fiemap 
/var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa
File /var/iotest/b45e1a59f26830fb98af1ba9ed1360106f5580aa has 1 extents:
#   Logical  Physical Length   Flags
0:   a820 0010 000

I looked into the file in question, and it started with zero bytes from the 
start until position 0xbff, even though it was supposed to be all random data.

I have included timestamps in the hopes they might make it easier to find the 
related entries in the logs.

So what do I do now? The logs are very large and complex, and I don't 
understand most of what's in there. I don't even know which OSD served that 
particular block/object.

Regards,

Guido



Re: [PATCH] rbd: Clear ceph_msg->bio_iter for retransmitted message

2012-06-11 Thread Alex Elder
On 06/08/2012 01:17 AM, Hannes Reinecke wrote:
> On 06/06/2012 04:10 PM, Alex Elder wrote:
>> On 06/06/2012 03:03 AM, Yan, Zheng wrote:
>>> From: "Yan, Zheng" 
>>>
>>> The bug can cause NULL pointer dereference in write_partial_msg_pages
>>
>> Although this looks simple enough, I want to study it a little more
>> before committing it.  I've been wanting to walk through this bit
>> of code anyway so I'll do that today.
>>
>> One quick observation though:  m->bio_iter really ought to be
>> initialized only within #ifdef CONFIG_BLOCK (although I see it's
>> defined without it in the structure definition).  At some point
>> I'll put together a cleanup patch to do that everywhere; feel free
>> to do that yourself if you are so inclined.
>>
>>  -Alex
>>
>>> Signed-off-by: Zheng Yan 
>>> ---
>>>  net/ceph/messenger.c |1 +
>>>  1 files changed, 1 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
>>> index 1a80907..785b953 100644
>>> --- a/net/ceph/messenger.c
>>> +++ b/net/ceph/messenger.c
>>> @@ -598,6 +598,7 @@ static void prepare_write_message(struct 
>>> ceph_connection *con)
>>>  le32_to_cpu(con->out_msg->footer.front_crc),
>>>  le32_to_cpu(con->out_msg->footer.middle_crc));
>>>  
>>> +   m->bio_iter = NULL;
>>> /* is there a data payload? */
>>> if (le32_to_cpu(m->hdr.data_len) > 0) {
>>> /* initialize page iterator */
>>
> Incidentally, we've come across the same issue. First thing which
> struck me was this:
> 
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 524f4e4..759d4d2 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -874,7 +874,7 @@ static int write_partial_msg_pages(struct
> ceph_connection *c
> on)
> page = list_first_entry(&msg->pagelist->head,
> struct page, lru);
>  #ifdef CONFIG_BLOCK
> -   } else if (msg->bio) {
> +   } else if (msg->bio_iter) {
> struct bio_vec *bv;
> 
> bv = bio_iovec_idx(msg->bio_iter, msg->bio_seg);
> 
> We've called bio_list_init() a few lines above; however, it might
> return with a NULL bio_iter. So for consistency we should be
> checking for ->bio_iter here, as this is what we'll be using
> afterwards anyway.

Zheng is right, the only way bio_iter will be null following the
call to init_bio_iter() is if bio is also null, so it's roughly
equivalent either way.  I do think it would be reassuring to have
the check be against bio_iter as you suggest in this case, since
that's the pointer we're then dereferencing.

I'm reworking this code today, and will update it to check the
bio_iter pointer instead if this suggestion still applies.

-Alex

> Cheers,
> 
> Hannes



Re: Ceph questions regarding auth and return on PUT from radosgw

2012-06-11 Thread Yehuda Sadeh Weinraub
On Mon, Jun 11, 2012 at 5:32 AM, John Axel Eriksson  wrote:
>
> Also, when PUTting something through radosgw, does ceph/rgw return as
> soon as all data has been received or does it return
> when it has ensured N replicas? (I've seen quite a delay after all
> data has been sent before my PUT returns). I'm using nginx (1.2) by
> the way.
>

It sounds like you're seeing the effect of nginx buffering the
entire request before sending it to the radosgw process (via
fastcgi). The same happens with apache when using mod_fcgi, unlike
mod_fastcgi, which actually streams the written data to the backend and
thus acks are throttled accordingly.

Yehuda


Re: Random data corruption in VM, possibly caused by rbd

2012-06-11 Thread Guido Winkelmann
On Saturday, 9 June 2012, 20:04:20, Sage Weil wrote:
> On Fri, 8 Jun 2012, Guido Winkelmann wrote:
> > On Friday, 8 June 2012, 07:50:36, Josh Durgin wrote:
> > > On 06/08/2012 06:55 AM, Sage Weil wrote:
> > > > On Fri, 8 Jun 2012, Oliver Francke wrote:
> > > >> Hi Guido,
> > > >> 
> > > >> yeah, there is something weird going on. I just started to establish
> > > >> some
> > > >> test-VM's. Freshly imported from running *.qcow2 images.
> > > >> Kernel panic with INIT, seg-faults and other "funny" stuff.
> > > >> 
> > > >> Just added the rbd_cache=true in my config, voila. All is
> > > >> fast-n-up-n-running...
> > > >> All my testing was done with cache enabled... Since our errors all
> > > >> came
> > > >> from rbd_writeback from former ceph-versions...
> > > > 
> > > > Are you guys able to reproduce the corruption with 'debug osd = 20'
> > > > and
> > > > 
> > > > 'debug ms = 1'?  Ideally we'd like to:
> > > >   - reproduce from a fresh vm, with osd logs
> > > >   - identify the bad file
> > > >   - map that file to a block offset (see
> > > >   
> > > > http://ceph.com/qa/fiemap.[ch], linux_fiemap.h)
> > > >   
> > > >   - use that to identify the badness in the log
> > > > 
> > > > I suspect the cache is just masking the problem because it submits
> > > > fewer
> > > > IOs...
> > > 
> > > The cache also doesn't do sparse reads. Is it still reproducible with
> > > a fresh vm when you set filestore_fiemap_threshold = 0 for the osds,
> > > and run without rbd caching?
> > 
> > I have set filestore_fiemap_threshold = 0 on all osds and restarted them.
> > The problem is still there, and so bad I cannot even run this fiemap
> > utility that Sage posted. I guess I should have tried booting the VM from
> > a livecd instead...
> 
> Whoops,
> 
>   filestore fiemap threshold = 0
> 
> doesn't turn it off, but
> 
>   filestore fiemap = false

Okay, I changed "filestore fiemap threshold = 0" to "filestore fiemap = false" 
under [osd]. So far, the problem does not seem to resurface.
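
For reference, the relevant ceph.conf pieces mentioned in this thread look
roughly like this (section placement as described above; the exact layout of
the rest of the file will of course differ per setup):

  [global]
      debug osd = 20
      debug ms = 1

  [osd]
      # disable the FIEMAP-based sparse reads suspected above
      filestore fiemap = false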

Guido



Re: Ceph questions regarding auth and return on PUT from radosgw

2012-06-11 Thread John Axel Eriksson
Ok so that's the reason for the keys then? To be able to stop an osd
from connecting to the cluster? Won't "ceph osd down" or something
like that stop it from connecting?

Also, the rados command - doesn't that use the config file to connect
to the osds? If I have the correct config file won't that stop me from
accidentally connecting to some other cluster?

I can see how it might be more relevant when you have 1000 osds and
lots of people managing the cluster, but in my case the cluster won't
be all that big. At first I guess only 3 osds which may grow quite a
bit of course but I can't see us needing even 30 osds (we're not a
cloud hosting company and won't be giving access directly to the
storage from the outside).

Actually, it wouldn't be entirely necessary for us to run the radosgw,
however we do want an http interface and radosgw gives us one. I'm
not aware of any other http interface available right now (apart from
writing one ourselves, but we'd much rather use something that is
tested and works, we're a small company).

On Mon, Jun 11, 2012 at 2:51 PM, Wido den Hollander  wrote:
>
>
> On 06/11/2012 02:41 PM, John Axel Eriksson wrote:
>>
>> Oh sorry. I don't think I was clear on the auth question. What I meant
>> was if the admin.keyring and keys for the osd:s are really necessary
>> in a private ceph-cluster.
>
>
> I'd say: Yes
>
> With keys in place you can ensure that a rogue machine can't start bringing
> down your cluster.
>
> Scenario: You take a machine offline in a cluster, let it sit in storage for
> some while and a couple of months later somebody wonders what that machine
> does.
>
> Someone plugs it into a switch, connects power and boots it. Suddenly this
> old machine, which is way behind on software, starts participating in your
> cluster again and could potentially bring it all down.
>
> But it could be even simpler. You set up a second Ceph cluster for
> some tests, but while playing with the 'rados' command you accidentally
> connect to the wrong cluster and issue a "rmpool". Oops!
>
> With auth in place you have a barrier against such situations.
>
> Wido
>
>
>>
>> On Mon, Jun 11, 2012 at 2:40 PM, Wido den Hollander
>>  wrote:
>>>
>>> Hi,
>>>
>>>
>>> On 06/11/2012 02:32 PM, John Axel Eriksson wrote:


 Is there a point to having auth enabled if I run ceph on an internal
 network, only for use with radosgw (i.e the object storage part)?
 It seems to complicate the setup unnecessarily and ceph doesn't use
 encryption anyway as far as I understand, it's only auth.
 If my network is trusted and I know who has access (and I trust them)
 - is there a point to complicate the setup with key-based auth?

>>>
>>> The RADOS Gateway uses the S3 protocol and that requires authentication
>>> and
>>> authorization.
>>>
>>> When creating a bucket/pool and storing objects, it has to be mapped to a
>>> user inside the RADOS GW.
>>>
>>> I don't know what your exact use-case is, but if it's only internal,
>>> isn't
>>> it a possibility to use RADOS natively?
>>>
>>>
 Also, when PUTting something through radosgw, does ceph/rgw return as
 soon as all data has been received or does it return
 when it has ensured N replicas? (I've seen quite a delay after all
 data has been sent before my PUT returns). I'm using nginx (1.2) by
 the way.
>>>
>>>
>>>
>>> iirc it returns when all replicas have received and stored the object.
>>>
>>> Wido
>>>

 Thanks!

 John


Re: Ceph questions regarding auth and return on PUT from radosgw

2012-06-11 Thread Wido den Hollander



On 06/11/2012 02:41 PM, John Axel Eriksson wrote:

Oh sorry. I don't think I was clear on the auth question. What I meant
was if the admin.keyring and keys for the osd:s are really necessary
in a private ceph-cluster.


I'd say: Yes

With keys in place you can ensure that a rogue machine can't start bringing
down your cluster.


Scenario: You take a machine offline in a cluster, let it sit in storage 
for some while and a couple of months later somebody wonders what that 
machine does.


Someone plugs it into a switch, connects power and boots it. Suddenly this
old machine, which is way behind on software, starts participating in your
cluster again and could potentially bring it all down.


But it could be even simpler. You set up a second Ceph cluster
for some tests, but while playing with the 'rados' command you 
accidentally connect to the wrong cluster and issue a "rmpool". Oops!


With auth in place you have a barrier against such situations.

Wido
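
As a concrete illustration of the barrier Wido describes, enabling cephx at the
time came down to one ceph.conf setting plus a keyring per daemon. A minimal
sketch, where 'auth supported' is the option name of that era and the keyring
path is an assumption:

    [global]
        # require cephx authentication for daemons and clients
        auth supported = cephx

    [osd]
        # each OSD authenticates with its own key
        keyring = /etc/ceph/keyring.$name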



On Mon, Jun 11, 2012 at 2:40 PM, Wido den Hollander  wrote:

Hi,


On 06/11/2012 02:32 PM, John Axel Eriksson wrote:


Is there a point to having auth enabled if I run ceph on an internal
network, only for use with radosgw (i.e the object storage part)?
It seems to complicate the setup unnecessarily and ceph doesn't use
encryption anyway as far as I understand, it's only auth.
If my network is trusted and I know who has access (and I trust them)
- is there a point to complicate the setup with key-based auth?



The RADOS Gateway uses the S3 protocol and that requires authentication and
authorization.

When creating a bucket/pool and storing objects, it has to be mapped to a
user inside the RADOS GW.

I don't know what your exact use-case is, but if it's only internal, isn't
it a possibility to use RADOS natively?



Also, when PUTting something through radosgw, does ceph/rgw return as
soon as all data has been received or does it return
when it has ensured N replicas? (I've seen quite a delay after all
data has been sent before my PUT returns). I'm using nginx (1.2) by
the way.



iirc it returns when all replicas have received and stored the object.

Wido



Thanks!

John


Re: Ceph questions regarding auth and return on PUT from radosgw

2012-06-11 Thread John Axel Eriksson
Oh sorry. I don't think I was clear on the auth question. What I meant
was if the admin.keyring and keys for the osd:s are really necessary
in a private ceph-cluster.

On Mon, Jun 11, 2012 at 2:40 PM, Wido den Hollander  wrote:
> Hi,
>
>
> On 06/11/2012 02:32 PM, John Axel Eriksson wrote:
>>
>> Is there a point to having auth enabled if I run ceph on an internal
>> network, only for use with radosgw (i.e the object storage part)?
>> It seems to complicate the setup unnecessarily and ceph doesn't use
>> encryption anyway as far as I understand, it's only auth.
>> If my network is trusted and I know who has access (and I trust them)
>> - is there a point to complicate the setup with key-based auth?
>>
>
> The RADOS Gateway uses the S3 protocol and that requires authentication and
> authorization.
>
> When creating a bucket/pool and storing objects, it has to be mapped to a
> user inside the RADOS GW.
>
> I don't know what your exact use-case is, but if it's only internal, isn't
> it a possibility to use RADOS natively?
>
>
>> Also, when PUTting something through radosgw, does ceph/rgw return as
>> soon as all data has been received or does it return
>> when it has ensured N replicas? (I've seen quite a delay after all
>> data has been sent before my PUT returns). I'm using nginx (1.2) by
>> the way.
>
>
> iirc it returns when all replicas have received and stored the object.
>
> Wido
>
>>
>> Thanks!
>>
>> John


Re: Ceph questions regarding auth and return on PUT from radosgw

2012-06-11 Thread Wido den Hollander

Hi,

On 06/11/2012 02:32 PM, John Axel Eriksson wrote:

Is there a point to having auth enabled if I run ceph on an internal
network, only for use with radosgw (i.e the object storage part)?
It seems to complicate the setup unnecessarily and ceph doesn't use
encryption anyway as far as I understand, it's only auth.
If my network is trusted and I know who has access (and I trust them)
- is there a point to complicate the setup with key-based auth?



The RADOS Gateway uses the S3 protocol and that requires authentication 
and authorization.


When creating a bucket/pool and storing objects, it has to be mapped to 
a user inside the RADOS GW.
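
Such a user is created with the radosgw-admin tool; a minimal sketch, where the
uid and display name are made-up values:

    # create a RADOS GW user; a generated S3 access/secret key pair is printed
    radosgw-admin user create --uid=johndoe --display-name="John Doe"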


I don't know what your exact use-case is, but if it's only internal, 
isn't it a possibility to use RADOS natively?



Also, when PUTting something through radosgw, does ceph/rgw return as
soon as all data has been received or does it return
when it has ensured N replicas? (I've seen quite a delay after all
data has been sent before my PUT returns). I'm using nginx (1.2) by
the way.


iirc it returns when all replicas have received and stored the object.

Wido



Thanks!

John


Ceph questions regarding auth and return on PUT from radosgw

2012-06-11 Thread John Axel Eriksson
Is there a point to having auth enabled if I run ceph on an internal
network, only for use with radosgw (i.e the object storage part)?
It seems to complicate the setup unnecessarily and ceph doesn't use
encryption anyway as far as I understand, it's only auth.
If my network is trusted and I know who has access (and I trust them)
- is there a point to complicate the setup with key-based auth?

Also, when PUTting something through radosgw, does ceph/rgw return as
soon as all data has been received or does it return
when it has ensured N replicas? (I've seen quite a delay after all
data has been sent before my PUT returns). I'm using nginx (1.2) by
the way.

Thanks!

John


Re: [RFC, PATCH, RESEND] fs: push rcu_barrier() from deactivate_locked_super() to filesystems

2012-06-11 Thread Kirill A. Shutemov
On Sat, Jun 09, 2012 at 12:25:57AM -0700, Andrew Morton wrote:
> And...  it seems that I misread what's going on.  The individual
> filesystems are doing the rcu freeing of their inodes, so it is
> appropriate that they also call rcu_barrier() prior to running
> kmem_cache_free().  Which is what Kirill's patch does.  oops.

Ack? ;)
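
For context, the pattern being acked here, where each filesystem flushes its
delayed RCU inode frees before tearing down its inode cache, looks roughly like
the generic sketch below; 'foo' is a placeholder filesystem, not the actual
patch:

    #include <linux/slab.h>
    #include <linux/rcupdate.h>

    /* hypothetical filesystem's inode cache */
    static struct kmem_cache *foo_inode_cachep;

    static void foo_destroy_inodecache(void)
    {
            /*
             * Inodes were freed with call_rcu(); wait for every pending
             * RCU callback to run before the cache they free into goes away.
             */
            rcu_barrier();
            kmem_cache_destroy(foo_inode_cachep);
    }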

-- 
 Kirill A. Shutemov




Re: Journal size of each disk

2012-06-11 Thread Wido den Hollander

Hi,

On 06/11/2012 08:47 AM, eric_yh_c...@wiwynn.com wrote:

Dear all:

 I would like to know if the journal size influences the performance
of the disk.

 If each of my disks is 1TB, how much space should I prepare
for the journal?



Your journal should be able to hold the writes for a short period of
time, something like 10 to 20 seconds.


If your machine is on a 1Gbit line it will do something like 100MB/sec 
of writes.


100MB/sec * 20 seconds = 2000MB

So a journal of something like 2GB should be enough.

You can always scale it bigger to 4GB, that won't hurt anything.

So it doesn't depend on the size of your data disk, but depends on how 
fast the writes are coming in.
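
Expressed as a ceph.conf sketch, using the 1Gbit example above (the 'osd
journal size' option takes megabytes):

    [osd]
        # ~20 seconds of writes at ~100MB/sec: 100 * 20 = 2000
        osd journal size = 2000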


Wido


 Thanks for any comment.
