Wiki Spam

2012-05-31 Thread SPONEM, Benoît
Dear all,

Just for information, there is a lot of spam in Ceph's wiki
(http://ceph.com/wiki/Special:RecentChanges,
http://ceph.com/w/index.php?title=Special:LonelyPages&limit=250&offset=0).

Regards,
Benoit


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Stefan Priebe - Profihost AG
Hi Marc, Hi Stefan,

first thanks for all your help and time.

I found the commit which causes this problem and it is TCP related,
but I'm still wondering whether this behaviour of the commit is
actually intended.

The commit in question is:
git show c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
commit c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
Author: Jason Wang jasow...@redhat.com
Date:   Thu Feb 2 00:07:00 2012 +

tcp: properly initialize tcp memory limits

Commit 4acb4190 tries to fix the using uninitialized value
introduced by commit 3dc43e3,  but it would make the
per-socket memory limits too small.

This patch fixes this and also remove the redundant codes
introduced in 4acb4190.

Signed-off-by: Jason Wang jasow...@redhat.com
Acked-by: Glauber Costa glom...@parallels.com
Signed-off-by: David S. Miller da...@davemloft.net

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 4cb9cd2..7a7724d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -778,7 +778,6 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
 static __net_init int ipv4_sysctl_init_net(struct net *net)
 {
struct ctl_table *table;
-   unsigned long limit;

table = ipv4_net_table;
if (!net_eq(net, init_net)) {
@@ -815,11 +814,6 @@ static __net_init int ipv4_sysctl_init_net(struct
net *net)
net->ipv4.sysctl_rt_cache_rebuild_count = 4;

tcp_init_mem(net);
-   limit = nr_free_buffer_pages() / 8;
-   limit = max(limit, 128UL);
-   net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
-   net->ipv4.sysctl_tcp_mem[1] = limit;
-   net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;

net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
net_ipv4_ctl_path, table);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a34f5cf..37755cc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3229,7 +3229,6 @@ __setup("thash_entries=", set_thash_entries);

 void tcp_init_mem(struct net *net)
 {
-   /* Set per-socket limits to no more than 1/128 the pressure
threshold */
unsigned long limit = nr_free_buffer_pages() / 8;
limit = max(limit, 128UL);
net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
@@ -3298,7 +3297,8 @@ void __init tcp_init(void)
sysctl_max_syn_backlog = max(128, cnt / 256);

tcp_init_mem(init_net);
-   limit = nr_free_buffer_pages() / 8;
+   /* Set per-socket limits to no more than 1/128 the pressure
threshold */
+   limit = nr_free_buffer_pages() << (PAGE_SHIFT - 10);
limit = max(limit, 128UL);
max_share = min(4UL*1024*1024, limit);

Greets
Stefan


rbd rm image slow with big images ?

2012-05-31 Thread Alexandre DERUMIER
Hi,

I'm trying to delete some rbd images with rbd rm,
and it seems to be slow with big images.



I'm testing it by just creating a new image (1TB):

# time rbd -p pool1 create --size 100 image2

real0m0.031s
user0m0.015s
sys 0m0.010s


then just deleting it, without having written anything to the image:


# time rbd -p pool1 rm image2
Removing image: 100% complete...done.

real1m45.558s
user0m14.683s
sys 0m17.363s



Same test with 100GB:

# time rbd -p pool1 create --size 10 image2

real0m0.032s
user0m0.016s
sys 0m0.007s

# time rbd -p pool1 rm image2
Removing image: 100% complete...done.

real0m10.499s
user0m1.488s
sys 0m1.720s


I'm using a journal in tmpfs, 3 servers, 15 OSDs with one 15K disk each (xfs);
network bandwidth, disk I/O and CPU are all low.

Is this the normal behaviour? Maybe some xfs tuning could help?
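
A hedged back-of-the-envelope check (assuming the default 4 MB object size,
the "rb." object-name prefix of format 1 images, and that rbd rm walks every
object the image could map to whether or not it was written - which would make
the runtime scale with the provisioned size rather than with the stored data):

# A 1 TB image at the default 4 MB object size maps to up to 262144 RADOS objects:
echo $(( 1024 * 1024 / 4 ))             # 262144 potential data objects
# Inspect the image's object size and count the data objects that actually exist:
rbd -p pool1 info image2                # shows the image order / object size
rados -p pool1 ls | grep -c '^rb\.'     # data objects present (~0 for a never-written image)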
 


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Yehuda Sadeh
On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
 Hi Marc, Hi Stefan,

 first thanks for all your help and time.

 I found the commit which results in this problem and it is TCP related
 but i'm still wondering if the expected behaviour of this commit is
 expected?

 The commit in question is:
 git show c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
 commit c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
 Author: Jason Wang jasow...@redhat.com
 Date:   Thu Feb 2 00:07:00 2012 +

    tcp: properly initialize tcp memory limits

    Commit 4acb4190 tries to fix the using uninitialized value
    introduced by commit 3dc43e3,  but it would make the
    per-socket memory limits too small.

    This patch fixes this and also remove the redundant codes
    introduced in 4acb4190.

    Signed-off-by: Jason Wang jasow...@redhat.com
    Acked-by: Glauber Costa glom...@parallels.com
    Signed-off-by: David S. Miller da...@davemloft.net

 diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
 index 4cb9cd2..7a7724d 100644
 --- a/net/ipv4/sysctl_net_ipv4.c
 +++ b/net/ipv4/sysctl_net_ipv4.c
 @@ -778,7 +778,6 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
  static __net_init int ipv4_sysctl_init_net(struct net *net)
  {
        struct ctl_table *table;
 -       unsigned long limit;

        table = ipv4_net_table;
        if (!net_eq(net, init_net)) {
 @@ -815,11 +814,6 @@ static __net_init int ipv4_sysctl_init_net(struct
 net *net)
        net->ipv4.sysctl_rt_cache_rebuild_count = 4;

        tcp_init_mem(net);
 -       limit = nr_free_buffer_pages() / 8;
 -       limit = max(limit, 128UL);
 -       net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
 -       net->ipv4.sysctl_tcp_mem[1] = limit;
 -       net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;

        net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
                        net_ipv4_ctl_path, table);
 diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
 index a34f5cf..37755cc 100644
 --- a/net/ipv4/tcp.c
 +++ b/net/ipv4/tcp.c
 @@ -3229,7 +3229,6 @@ __setup("thash_entries=", set_thash_entries);

  void tcp_init_mem(struct net *net)
  {
 -       /* Set per-socket limits to no more than 1/128 the pressure
 threshold */
        unsigned long limit = nr_free_buffer_pages() / 8;
        limit = max(limit, 128UL);
        net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
 @@ -3298,7 +3297,8 @@ void __init tcp_init(void)
        sysctl_max_syn_backlog = max(128, cnt / 256);

        tcp_init_mem(init_net);
 -       limit = nr_free_buffer_pages() / 8;
 +       /* Set per-socket limits to no more than 1/128 the pressure
 threshold */
 +       limit = nr_free_buffer_pages() << (PAGE_SHIFT - 10);
        limit = max(limit, 128UL);
        max_share = min(4UL*1024*1024, limit);

Yeah, this might have affected the tcp performance. Looking at the
current Linus tree this function looks more like it looked beforehand,
so it was probably reverted one way or another.

Yehuda


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Stefan Priebe - Profihost AG
Am 31.05.2012 09:27, schrieb Stefan Majer:
 we have set them in /etc/sysctl.conf to:
 net.ipv4.tcp_mem = 1000 1000 1000

This does not help ;-(

 wow, this was fast !
 if i understand this commit correct it simply skips a in-kernel
 configuration of network related sysctl parameters, especialy
 net.ipv4.tcp_mem

I also tried these:
net.ipv4.tcp_rmem = 4096 524287 16777216
net.ipv4.tcp_wmem = 4096 524287 16777216
# grabbed values from 3.0.X
net.ipv4.tcp_mem = 1162962  1550617 2325924

Still no help. But if I use 3.4 and revert the commit, it works fine.
I wasn't able to find which other parts are influenced by this limit
while browsing through the source.

I only found:
net.ipv4.tcp_mem
and
net.ipv4.tcp_rmem
and
net.ipv4.tcp_wmem
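
A hedged sketch of one way to see at runtime what else differs between the
two kernels (the snapshot/diff workflow and output paths are assumptions;
a later mail in this thread does essentially this comparison by hand):

# Run once under each kernel, then diff the two snapshots.
grep -r . /proc/sys/net/ipv4/ 2>/dev/null | sort > /tmp/ipv4-sysctls-$(uname -r).txt
diff /tmp/ipv4-sysctls-3.0*.txt /tmp/ipv4-sysctls-3.4*.txt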

Greets
Stefan


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Stefan Majer
Hi Stefan,

then you should probably describe this in a short mail to Jason Wang
and ask him how to circumvent this commit with sysctl settings.
I'm pretty sure my sysctl setting reverts the first part of the
commit, so probably the second part is the evil one?

Greetings
Stefan

On Thu, May 31, 2012 at 10:04 AM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:

 Am 31.05.2012 09:27, schrieb Stefan Majer:
  we have set them in /etc/sysctl.conf to:
  net.ipv4.tcp_mem = 1000 1000 1000

 This does not help ;-(

  wow, this was fast !
  if i understand this commit correct it simply skips a in-kernel
  configuration of network related sysctl parameters, especialy
  net.ipv4.tcp_mem

 I also tied this one:
 net.ipv4.tcp_rmem = 4096 524287 16777216
 net.ipv4.tcp_wmem = 4096 524287 16777216
 # grabbed values from 3.0.X
 net.ipv4.tcp_mem = 1162962      1550617 2325924

 still - no help -. But if i use 3.4 and revert the commit it works fine.
 But i wasn't able to find which other parts are influenced by this limit
 while browsing through the source.

 I only found:
 net.ipv4.tcp_mem
 and
 net.ipv4.tcp_rmem
 and
 net.ipv4.tcp_wmem

 Greets
 Stefan




--
Stefan Majer


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Stefan Priebe - Profihost AG

Am 31.05.2012 10:09, schrieb Stefan Majer:

Hi Stefan,

then you should probably describe this in a short mail to Jason Wang
and ask him how to circumvent this commit with sysctl settings.


Done, hopefully he can help.


I'm pretty sure my sysctl setting reverts the first part of the
commit. So probably the second part is the evil one ?

Yes it seems like that

Stefan


Re: Wiki Spam

2012-05-31 Thread Mark Nelson

Doh!

Thanks for the heads-up.  We'll deal with it.

Thanks,
Mark

On 5/31/12 2:05 AM, SPONEM, Benoît wrote:

Dear all,

Just for information, there are a lot of spam in Ceph's wiki
(http://ceph.com/wiki/Special:RecentChanges,
http://ceph.com/w/index.php?title=Special:LonelyPages&limit=250&offset=0).

Regards,
Benoit


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Stefan Priebe - Profihost AG

Hi Mark, Hi Stefan,

I found a way to solve it by comparing /proc/sys/net on a patched and 
an unpatched kernel.


Strangely, the problem occurs when the values are too big (in the new kernel).

With the smaller values everything works fine even under 3.4. Any ideas 
how that can be? I thought these values should be tuned to a maximum for 
max performance.


- = old kernel
+ = new kernel

-/proc/sys/net/ipv4/tcp_rmem:4096   87380   6291456
+/proc/sys/net/ipv4/tcp_rmem:4096   87380   514873
-/proc/sys/net/ipv4/tcp_wmem:4096   16384   4194304
+/proc/sys/net/ipv4/tcp_wmem:4096   16384   514873


Stefan


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Mark Nelson

Hi Stefan,

Please do share!  I was planning on starting out on the wiki and 
eventually getting these kinds of things into the master docs.  If you 
(and others) have already done testing it would be really interesting to 
compare experiences.  So far I've been just kind of throwing stuff into:


http://ceph.com/wiki/Performance_analysis

In its current form it's pretty inadequate, but I'm hoping to 
eventually get back to it.  A lot of the work I've been doing recently 
is looking at underlying FS write behavior (specifically seeks) and if 
we can get any reasonable improvement through mkfs and mount options.


Mark

On 5/31/12 2:34 AM, Stefan Majer wrote:

Hi,

if Stefan confirms this as a solution it might be a good idea to 
collect some performance optimization hints for OSDs at 
http://ceph.com/docs/master

probably separated into:

Gigabit Ethernet based deployments
 with Jumbo Frames

 without Jumbo Frames
10 Gigabit Ethernet based deployments
 with Jumbo Frames

 without Jumbo Frames

I can share some of our configurations as well

Greetings
Stefan

On Thu, May 31, 2012 at 9:30 AM, Yehuda Sadeh yeh...@inktank.com wrote:


On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:
 Hi Marc, Hi Stefan,

 first thanks for all your help and time.

 I found the commit which results in this problem and it is TCP
related
 but i'm still wondering if the expected behaviour of this commit is
 expected?

 The commit in question is:
 git show c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
 commit c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
 Author: Jason Wang jasow...@redhat.com
 Date:   Thu Feb 2 00:07:00 2012 +

tcp: properly initialize tcp memory limits

Commit 4acb4190 tries to fix the using uninitialized value
introduced by commit 3dc43e3,  but it would make the
per-socket memory limits too small.

This patch fixes this and also remove the redundant codes
introduced in 4acb4190.

Signed-off-by: Jason Wang jasow...@redhat.com
Acked-by: Glauber Costa glom...@parallels.com
Signed-off-by: David S. Miller da...@davemloft.net

 diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
 index 4cb9cd2..7a7724d 100644
 --- a/net/ipv4/sysctl_net_ipv4.c
 +++ b/net/ipv4/sysctl_net_ipv4.c
 @@ -778,7 +778,6 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
  static __net_init int ipv4_sysctl_init_net(struct net *net)
  {
struct ctl_table *table;
 -   unsigned long limit;

table = ipv4_net_table;
if (!net_eq(net, init_net)) {
 @@ -815,11 +814,6 @@ static __net_init int
ipv4_sysctl_init_net(struct
 net *net)
        net->ipv4.sysctl_rt_cache_rebuild_count = 4;

tcp_init_mem(net);
 -   limit = nr_free_buffer_pages() / 8;
 -   limit = max(limit, 128UL);
 -   net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
 -   net->ipv4.sysctl_tcp_mem[1] = limit;
 -   net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;

        net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
net_ipv4_ctl_path, table);
 diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
 index a34f5cf..37755cc 100644
 --- a/net/ipv4/tcp.c
 +++ b/net/ipv4/tcp.c
 @@ -3229,7 +3229,6 @@ __setup("thash_entries=", set_thash_entries);

  void tcp_init_mem(struct net *net)
  {
 -   /* Set per-socket limits to no more than 1/128 the pressure
 threshold */
unsigned long limit = nr_free_buffer_pages() / 8;
limit = max(limit, 128UL);
        net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
 @@ -3298,7 +3297,8 @@ void __init tcp_init(void)
sysctl_max_syn_backlog = max(128, cnt / 256);

tcp_init_mem(init_net);
 -   limit = nr_free_buffer_pages() / 8;
 +   /* Set per-socket limits to no more than 1/128 the pressure
 threshold */
 +   limit = nr_free_buffer_pages() << (PAGE_SHIFT - 10);
limit = max(limit, 128UL);
max_share = min(4UL*1024*1024, limit);

Yeah, this might have affected the tcp performance. Looking at the
current linus tree this function looks more like it looked beforehand,
so it was probable reverted this way or another.

Yehuda




--
Stefan Majer





Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Stefan Priebe - Profihost AG

Am 31.05.2012 14:31, schrieb Mark Nelson:

Hi Stefan,

Please do share! I was planning on starting out on the wiki and
eventually getting these kinds of things into the master docs. If you
(and others) have already done testing it would be really interesting to
compare experiences. So far I've been just kind of throwing stuff into:

http://ceph.com/wiki/Performance_analysis

In it's current form it's pretty inadequate, but I'm hoping to
eventually get back to it. A lot of the work I've been doing recently is
looking at underlying FS write behavior (specifically seeks) and if we
can get any reasonable improvement through mkfs and mount options.


At least I'll start sharing once I have a fine-running system ;-) I plan 
to switch to 10GbE next week.


Stefan


SIGSEGV in cephfs-java, but probably in Ceph

2012-05-31 Thread Nam Dang
Dear all,

I am running a small benchmark for Ceph with multithreading and cephfs-java API.
I encountered this issue even when I use only two threads, and I used
only file-open and directory-creation operations.

The piece of code is simply:
String parent = filePath.substring(0, filePath.lastIndexOf('/'));
mount.mkdirs(parent, 0755); // create parents if the path does not exist
int fileID = mount.open(filePath, CephConstants.O_CREAT, 0666); //
open the file

Each thread mounts its own ceph mounting point (using
mount.mount(null)) and I don't have any interlocking mechanism across
the threads at all.
It appears the error is a SIGSEGV sent off by libcephfs. The message is as follows:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7ff6af978d39, pid=14063, tid=140697400411904
#
# JRE version: 6.0_26-b03
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode
linux-amd64 compressed oops)
# Problematic frame:
# C  [libcephfs.so.1+0x139d39]  Mutex::Lock(bool)+0x9
#
# An error report file with more information is saved as:
# /home/namd/cephBench/hs_err_pid14063.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

I have also attached the hs_err_pid14063.log for your reference.
An excerpt from the file:

Stack: [0x7ff6aa828000,0x7ff6aa929000],
sp=0x7ff6aa9274f0,  free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libcephfs.so.1+0x139d39]  Mutex::Lock(bool)+0x9

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  com.ceph.fs.CephMount.native_ceph_mkdirs(JLjava/lang/String;I)I+0
j  com.ceph.fs.CephMount.mkdirs(Ljava/lang/String;I)V+6
j  
Benchmark$CreateFileStats.executeOp(IILjava/lang/String;Lcom/ceph/fs/CephMount;)J+37
j  Benchmark$StatsDaemon.benchmarkOne()V+22
j  Benchmark$StatsDaemon.run()V+26
v  ~StubRoutines::call_stub

So I think the problem may be due to the locking mechanism of Ceph
internally. But Dr. Weil previously answered my email stating that the
mounting is done independently, so multithreading should not lead to
this problem. If there is any way to work around this, please let me
know.

Best regards,

Nam Dang
Email: n...@de.cs.titech.ac.jp
HP: (+81) 080-4465-1587
Yokota Lab, Dept. of Computer Science
Tokyo Institute of Technology
Tokyo, Japan


hs_err_pid14063.log
Description: Binary data


Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-05-31 Thread Nam Dang
It turned out my monitor went down without my knowing.
So my bad, it wasn't because of Ceph.

Best regards,

Nam Dang
Tokyo Institute of Technology
Tokyo, Japan


On Thu, May 31, 2012 at 10:08 PM, Nam Dang n...@de.cs.titech.ac.jp wrote:
 Dear all,

 I am running a small benchmark for Ceph with multithreading and cephfs-java 
 API.
 I encountered this issue even when I use only two threads, and I used
 only open file and creating directory operations.

 The piece of code is simply:
 String parent = filePath.substring(0, filePath.lastIndexOf('/'));
 mount.mkdirs(parent, 0755); // create parents if the path does not exist
 int fileID = mount.open(filePath, CephConstants.O_CREAT, 0666); //
 open the file

 Each thread mounts its own ceph mounting point (using
 mount.mount(null)) and I don't have any interlocking mechanism across
 the threads at all.
 It appears the error is SIGSEGV sent off by libcepfs. The message is as 
 follows:

 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGSEGV (0xb) at pc=0x7ff6af978d39, pid=14063, tid=140697400411904
 #
 # JRE version: 6.0_26-b03
 # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode
 linux-amd64 compressed oops)
 # Problematic frame:
 # C  [libcephfs.so.1+0x139d39]  Mutex::Lock(bool)+0x9
 #
 # An error report file with more information is saved as:
 # /home/namd/cephBench/hs_err_pid14063.log
 #
 # If you would like to submit a bug report, please visit:
 #   http://java.sun.com/webapps/bugreport/crash.jsp
 # The crash happened outside the Java Virtual Machine in native code.
 # See problematic frame for where to report the bug.

 I have also attached the hs_err_pid14063.log for your reference.
 An excerpt from the file:

 Stack: [0x7ff6aa828000,0x7ff6aa929000],
 sp=0x7ff6aa9274f0,  free space=1021k
 Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
 code)
 C  [libcephfs.so.1+0x139d39]  Mutex::Lock(bool)+0x9

 Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
 j  com.ceph.fs.CephMount.native_ceph_mkdirs(JLjava/lang/String;I)I+0
 j  com.ceph.fs.CephMount.mkdirs(Ljava/lang/String;I)V+6
 j  
 Benchmark$CreateFileStats.executeOp(IILjava/lang/String;Lcom/ceph/fs/CephMount;)J+37
 j  Benchmark$StatsDaemon.benchmarkOne()V+22
 j  Benchmark$StatsDaemon.run()V+26
 v  ~StubRoutines::call_stub

 So I think the probably may be due to the locking mechanism of ceph
 internally. But Dr. Weil previously answered my email stating that the
 mounting is done independently so multithreading should not lead to
 this problem. If there is anyway to work around this, please let me
 know.

 Best regards,

 Nam Dang
 Email: n...@de.cs.titech.ac.jp
 HP: (+81) 080-4465-1587
 Yokota Lab, Dept. of Computer Science
 Tokyo Institute of Technology
 Tokyo, Japan


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Yann Dupont

On 31/05/2012 09:30, Yehuda Sadeh wrote:

On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:

Hi Marc, Hi Stefan,



Hello, back today

Today, I upgraded my 2 last osd nodes with big storage, so now all my 
nodes are equivalent.


Using 3.4.0 kernel, I still have good results with rbd pool, but jumping 
values with data.




first thanks for all your help and time.

I found the commit which results in this problem and it is TCP related
but i'm still wondering if the expected behaviour of this commit is
expected?









Yeah, this might have affected the tcp performance. Looking at the
current linus tree this function looks more like it looked beforehand,
so it was probable reverted this way or another!

Yehuda


Well, I saw you probably found the culprit.

So I tried the latest (this morning) git kernel.

Now data gives good results :

root@label5:~#  rados -p data bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1  16   215   199   795.765   796  0.073769 0.0745517
2  16   430   414   827.833   860  0.060165 0.0753952
3  16   632   616   821.207   808  0.072241 0.0772463
4  16   838   822   821.883   824  0.129571 0.0768741
5  16  1039  1023   818.271   804  0.056867  0.077637
6  16  1254  1238   825.209   860  0.078801 0.0771122
7  16  1474  1458   833.023   880  0.062886 0.0764071
8  16  1669  1653   826.389   780   0.09632 0.0767323
9  16  1877  1861   827.003   832  0.083765 0.0770398
   10  16  2087  2071   828.294   840  0.051437  0.076937
   11  16  2309  2293   833.714   888  0.080584 0.0764829
   12  16  2535  2519   839.563   904  0.078095 0.0759574
   13  16  2762  2746   844.816   908  0.081323 0.0754571
   14  16  2984  2968   847.889   888  0.076973 0.0752921
   15  16  3203  3187   849.754   876  0.069877 0.0750613
   16  16  3437  3421   855.138   936  0.046845 0.0746941
   17  16  3655  3639   856.126   872  0.052258 0.0745157
   18  16  3862  3846   854.559   828  0.061542 0.0746875
   19  16  4085  4069   856.525   892  0.053889 0.0745582
min lat: 0.033007 max lat: 0.462951 avg lat: 0.0743988
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20  15  4308  4293   858.492   896  0.054176 0.0743988
Total time run:20.103415
Total writes made: 4309
Write size:4194304
Bandwidth (MB/sec):857.367

Average Latency:   0.0746302
Max latency:   0.462951
Min latency:   0.033007



But very strangely it's now rbd that isn't stable ?!

root@label5:~#  rados -p rbd bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
0   0 0 0 0 0 - 0
1  16   155   139   555.87   556  0.046232  0.109021
2  16   250   234   467.923   380  0.046793 0.0985316
3  16   250   234   311.955 0 - 0.0985316
4  16   250   234   233.965 0 - 0.0985316
5  16   250   234   187.173 0 - 0.0985316
6  16   266   250   166.645    16  0.038083  0.175697
7  16   266   250   142.839 0 -  0.175697
8  16   441   425   212.475   350   0.05512  0.298391
9  16   476   460   204.422   140   0.04372  0.280483
   10  16   531   515   205.976   220  0.125076  0.309449
   11  16   734   718261.06   812  0.127582  0.244134
   12  16   795   779   259.637   244  0.065158  0.234156
   13  16   818   802   246.742    92  0.054514  0.241704
   14  16   830   814   232.546    48  0.044386  0.239006
   15  16   837   821   218.909    28   3.41523  0.267521
   16  16  1043  1027   256.721   824   0.04898  0.248212
   17  16  1147  1131   266.088   416  0.048591  0.232725
   18  16  1147  1131   251.305 0 -  0.232725
   19  16  1202  1186   249.657   110  0.081777   0.25501
min lat: 0.033773 max lat: 5.92059 avg lat: 0.245711
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
   20  16  1296  1280   255.97   376  0.053797  0.245711
   21   9  1297  1288   245.305    32  0.708133 

Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Stefan Priebe - Profihost AG

Am 31.05.2012 15:21, schrieb Yann Dupont:

On 31/05/2012 09:30, Yehuda Sadeh wrote:

On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:

But very strangely it's now rbd that isn't stable ?!

root@label5:~# rados -p rbd bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 155 139 555.87 556 0.046232 0.109021
2 16 250 234 467.923 380 0.046793 0.0985316
3 16 250 234 311.955 0 - 0.0985316
4 16 250 234 233.965 0 - 0.0985316
5 16 250 234 187.173 0 - 0.0985316
6 16 266 250 166.645 16 0.038083 0.175697
7 16 266 250 142.839 0 - 0.175697
8 16 441 425 212.475 350 0.05512 0.298391
9 16 476 460 204.422 140 0.04372 0.280483
10 16 531 515 205.976 220 0.125076 0.309449
11 16 734 718 261.06 812 0.127582 0.244134
12 16 795 779 259.637 244 0.065158 0.234156
13 16 818 802 246.742 92 0.054514 0.241704
14 16 830 814 232.546 48 0.044386 0.239006
15 16 837 821 218.909 28 3.41523 0.267521
16 16 1043 1027 256.721 824 0.04898 0.248212
17 16 1147 1131 266.088 416 0.048591 0.232725
18 16 1147 1131 251.305 0 - 0.232725
19 16 1202 1186 249.657 110 0.081777 0.25501
min lat: 0.033773 max lat: 5.92059 avg lat: 0.245711
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 1296 1280 255.97 376 0.053797 0.245711
21 9 1297 1288 245.305 32 0.708133 0.248248
22 9 1297 1288 234.155 0 - 0.248248
23 9 1297 1288 223.975 0 - 0.248248
24 9 1297 1288 214.643 0 - 0.248248
25 9 1297 1288 206.057 0 - 0.248248
26 9 1297 1288 198.131 0 - 0.248248
Total time run: 26.829870
Total writes made: 1297
Write size: 4194304
Bandwidth (MB/sec): 193.367

Average Latency: 0.295922
Max latency: 7.36701
Min latency: 0.033773


Strange. I'm wondering if this has something to do with cache (that is,
operation I could have done before on nodes, as all my nodes are just
freshly rebooted).


Please test setting these values on all OSDs and Clients:
sysctl -w net.ipv4.tcp_rmem="4096 87380 514873"
sysctl -w net.ipv4.tcp_wmem="4096 16384 514873"

Stefan


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Yann Dupont

On 31/05/2012 15:37, Stefan Priebe - Profihost AG wrote:

Am 31.05.2012 15:21, schrieb Yann Dupont:

On 31/05/2012 09:30, Yehuda Sadeh wrote:

On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG
s.pri...@profihost.ag wrote:

But very strangely it's now rbd that isn't stable ?!

root@label5:~# rados -p rbd bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20
seconds.
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 155 139 555.87 556 0.046232 0.109021
2 16 250 234 467.923 380 0.046793 0.0985316
3 16 250 234 311.955 0 - 0.0985316
4 16 250 234 233.965 0 - 0.0985316
5 16 250 234 187.173 0 - 0.0985316
6 16 266 250 166.645 16 0.038083 0.175697
7 16 266 250 142.839 0 - 0.175697
8 16 441 425 212.475 350 0.05512 0.298391
9 16 476 460 204.422 140 0.04372 0.280483
10 16 531 515 205.976 220 0.125076 0.309449
11 16 734 718 261.06 812 0.127582 0.244134
12 16 795 779 259.637 244 0.065158 0.234156
13 16 818 802 246.742 92 0.054514 0.241704
14 16 830 814 232.546 48 0.044386 0.239006
15 16 837 821 218.909 28 3.41523 0.267521
16 16 1043 1027 256.721 824 0.04898 0.248212
17 16 1147 1131 266.088 416 0.048591 0.232725
18 16 1147 1131 251.305 0 - 0.232725
19 16 1202 1186 249.657 110 0.081777 0.25501
min lat: 0.033773 max lat: 5.92059 avg lat: 0.245711
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
20 16 1296 1280 255.97 376 0.053797 0.245711
21 9 1297 1288 245.305 32 0.708133 0.248248
22 9 1297 1288 234.155 0 - 0.248248
23 9 1297 1288 223.975 0 - 0.248248
24 9 1297 1288 214.643 0 - 0.248248
25 9 1297 1288 206.057 0 - 0.248248
26 9 1297 1288 198.131 0 - 0.248248
Total time run: 26.829870
Total writes made: 1297
Write size: 4194304
Bandwidth (MB/sec): 193.367

Average Latency: 0.295922
Max latency: 7.36701
Min latency: 0.033773


Strange. I'm wondering if this has something to do with cache (that is,
operation I could have done before on nodes, as all my nodes are just
freshly rebooted).


Please test setting these values on all OSDs and Clients:
 sysctl -w net.ipv4.tcp_rmem="4096 87380 514873"
 sysctl -w net.ipv4.tcp_wmem="4096 16384 514873"

Stefan


Same: stable for pool data (845 MB/s average), jumping with rbd (229 
MB/s average, with a max latency of 6).


I'm on the latest Linus git kernel
(commit af56e0aa35f3ae2a4c1a6d1000702df1dd78cb76), based on the fact that 
the patch was reverted in it.


I can try plain 3.4.0 with the 'culprit patch' manually reverted.

What puzzles me is that this morning, with 3.4.0, it was rbd that was 
stable, and now I have the exact opposite.


I'll begin to reboot with the old 3.4.0 kernel to see if things are 
reproducible.


Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr




Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-05-31 Thread Noah Watkins

On May 31, 2012, at 6:20 AM, Nam Dang wrote:

 It turned out my monitor went down without my knowing.
 So my bad, it wasn't because of Ceph.

I believe the segfault here is from a null client being dereferenced in the C 
wrappers. Which patch set are you using?

 
 Best regards,
 
 Nam Dang
 Tokyo Institute of Technology
 Tokyo, Japan
 
 
 On Thu, May 31, 2012 at 10:08 PM, Nam Dang n...@de.cs.titech.ac.jp wrote:
 Dear all,
 
 I am running a small benchmark for Ceph with multithreading and cephfs-java 
 API.
 I encountered this issue even when I use only two threads, and I used
 only open file and creating directory operations.
 
 The piece of code is simply:
 String parent = filePath.substring(0, filePath.lastIndexOf('/'));
 mount.mkdirs(parent, 0755); // create parents if the path does not exist
 int fileID = mount.open(filePath, CephConstants.O_CREAT, 0666); //
 open the file
 
 Each thread mounts its own ceph mounting point (using
 mount.mount(null)) and I don't have any interlocking mechanism across
 the threads at all.
 It appears the error is SIGSEGV sent off by libcepfs. The message is as 
 follows:
 
 #
 # A fatal error has been detected by the Java Runtime Environment:
 #
 #  SIGSEGV (0xb) at pc=0x7ff6af978d39, pid=14063, tid=140697400411904
 #
 # JRE version: 6.0_26-b03
 # Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode
 linux-amd64 compressed oops)
 # Problematic frame:
 # C  [libcephfs.so.1+0x139d39]  Mutex::Lock(bool)+0x9
 #
 # An error report file with more information is saved as:
 # /home/namd/cephBench/hs_err_pid14063.log
 #
 # If you would like to submit a bug report, please visit:
 #   http://java.sun.com/webapps/bugreport/crash.jsp
 # The crash happened outside the Java Virtual Machine in native code.
 # See problematic frame for where to report the bug.
 
 I have also attached the hs_err_pid14063.log for your reference.
 An excerpt from the file:
 
 Stack: [0x7ff6aa828000,0x7ff6aa929000],
 sp=0x7ff6aa9274f0,  free space=1021k
 Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
 code)
 C  [libcephfs.so.1+0x139d39]  Mutex::Lock(bool)+0x9
 
 Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
 j  com.ceph.fs.CephMount.native_ceph_mkdirs(JLjava/lang/String;I)I+0
 j  com.ceph.fs.CephMount.mkdirs(Ljava/lang/String;I)V+6
 j  
 Benchmark$CreateFileStats.executeOp(IILjava/lang/String;Lcom/ceph/fs/CephMount;)J+37
 j  Benchmark$StatsDaemon.benchmarkOne()V+22
 j  Benchmark$StatsDaemon.run()V+26
 v  ~StubRoutines::call_stub
 
 So I think the probably may be due to the locking mechanism of ceph
 internally. But Dr. Weil previously answered my email stating that the
 mounting is done independently so multithreading should not lead to
 this problem. If there is anyway to work around this, please let me
 know.
 
 Best regards,
 
 Nam Dang
 Email: n...@de.cs.titech.ac.jp
 HP: (+81) 080-4465-1587
 Yokota Lab, Dept. of Computer Science
 Tokyo Institute of Technology
 Tokyo, Japan


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Yann Dupont

On 31/05/2012 15:45, Yann Dupont wrote:

On 31/05/2012 15:37, Stefan Priebe - Profihost AG wrote:



what puzzles me is that this morning, with 3.4.0 it was rbd that was
stable, and now I have the exact contrary.

I'll begin to reboot with old 3.4.0 kernel to see if things are
reproductible.

Cheers,



I'd say my problem is probably not related. Freshly rebooting all osd 
nodes with the 3.4.0 kernel (the same kernel I used this morning) now gives 
pool data stable & rbd unstable. As with current git, and the exact 
opposite of the results I had Tuesday & this morning.


Go figure.

Could it have to do with previous usage in OSD ? or active mds ? or mon ?

As I already said, my OSDs are using btrfs with big metadata features, 
so going back to a 3.0 kernel needs a complete reformat of my OSDs first.


But I will do it if you think it can shed some light on this case.

Cheers,
--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr




Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-05-31 Thread Noah Watkins

On May 31, 2012, at 6:20 AM, Nam Dang wrote:

 Stack: [0x7ff6aa828000,0x7ff6aa929000],
 sp=0x7ff6aa9274f0,  free space=1021k
 Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
 code)
 C  [libcephfs.so.1+0x139d39]  Mutex::Lock(bool)+0x9
 
 Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
 j  com.ceph.fs.CephMount.native_ceph_mkdirs(JLjava/lang/String;I)I+0
 j  com.ceph.fs.CephMount.mkdirs(Ljava/lang/String;I)V+6
 j  
 Benchmark$CreateFileStats.executeOp(IILjava/lang/String;Lcom/ceph/fs/CephMount;)J+37
 j  Benchmark$StatsDaemon.benchmarkOne()V+22
 j  Benchmark$StatsDaemon.run()V+26
 v  ~StubRoutines::call_stub

Never mind my last comment. Hmm, I've seen this, but very rarely.

- Noah



differents ip/network link for osd replication and client-osd ?

2012-05-31 Thread Alexandre DERUMIER
Hi,
Is it possible to use different IPs / network links for

- replication between OSDs
- traffic between clients and OSDs

?

I would like to use different switches/network cards for OSD replication.

Regards,

Alexandre


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Mark Nelson

On 05/31/2012 09:42 AM, Yann Dupont wrote:

On 31/05/2012 15:45, Yann Dupont wrote:

On 31/05/2012 15:37, Stefan Priebe - Profihost AG wrote:



what puzzles me is that this morning, with 3.4.0 it was rbd that was
stable, and now I have the exact contrary.

I'll begin to reboot with old 3.4.0 kernel to see if things are
reproductible.

Cheers,



I'd say my problem is probably not related. Freshly rebooting all osd 
nodes with 3.4.0 kernel (the same kernel I used this morning) now 
gives pool data stable  rbd unstable. As with current git, and the 
exact opposite of results I had tuesday  this morning.


Go figure.

Could it have to do with previous usage in OSD ? or active mds ? or mon ?

As I already said, as my osd are using btrfs with big medata features, 
so going back in 3.0 kernel need a complete reformat of my OSD before.


But I will do it if you judge it can bring some light on this case.

Cheers,

Hi Yann,

Can you take a look at how many PGs are in each pool?

ceph osd pool get <pool> pg_num


Thanks,
Mark


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Yann Dupont

On 31/05/2012 17:32, Mark Nelson wrote:

ceph osd pool get <pool> pg_num


My setup is detailed in a previous mail, but as I changed some 
parameters this morning, here we go:


root@chichibu:~# ceph osd pool get data pg_num
PG_NUM: 576
root@chichibu:~# ceph osd pool get rbd pg_num
PG_NUM: 576



The pg num is quite low because I started with small OSDs (9 OSDs with 
200G each - internal disks) when I formatted. Now I have reduced to 8 OSDs 
(osd.4 is out), but with much larger (& faster) storage.



Now each of the 8 OSDs has 5T on it; for the moment, I try to keep the 
OSDs similar. Replication is set to 2.



The fs is btrfs formatted with big metadata (-l 64k -n64k), and mounted 
via space_cache,compress=lzo,nobarrier,noatime.


journal is on tmpfs :
 osd journal = /dev/shm/journal
 osd journal size = 6144

I know this is dangerous, remember It's NOT a production system for the 
moment.


No OSD is full, I don't have much data stored for the moment.

Concerning crush map, I'm not using the default one :

The 8 nodes are in 3 different locations (some kilometers away). 2 are 
in 1 place, 2 in another, and the last 4 in the principal place.


There is 10G between all the nodes and they are in the same VLAN, no 
router involved (but there is (negligible ?) latency between nodes)


I try to group hosts together to avoid problems when I lose a location 
(electrical problem, for example). I'm not sure I really customized the 
crush map as I should have.


here is the map :
# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 rack
type 3 pool

# buckets
host karuizawa {
id -5# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.2 weight 1.000
}
host hazelburn {
id -6# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.3 weight 1.000
}
rack loire {
id -3# do not change unnecessarily
# weight 2.000
alg straw
hash 0# rjenkins1
item karuizawa weight 1.000
item hazelburn weight 1.000
}
host carsebridge {
id -8# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.5 weight 1.000
}
host cameronbridge {
id -9# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.6 weight 1.000
}
rack chantrerie {
id -7# do not change unnecessarily
# weight 2.000
alg straw
hash 0# rjenkins1
item carsebridge weight 1.000
item cameronbridge weight 1.000
}
host chichibu {
id -2# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.0 weight 1.000
}
host glenesk {
id -4# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.1 weight 1.000
}
host braeval {
id -10# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.7 weight 1.000
}
host hanyu {
id -11# do not change unnecessarily
# weight 1.000
alg straw
hash 0# rjenkins1
item osd.8 weight 1.000
}
rack lombarderie {
id -12# do not change unnecessarily
# weight 4.000
alg straw
hash 0# rjenkins1
item chichibu weight 1.000
item glenesk weight 1.000
item braeval weight 1.000
item hanyu weight 1.000
}
pool default {
id -1# do not change unnecessarily
# weight 8.000
alg straw
hash 0# rjenkins1
item loire weight 2.000
item chantrerie weight 2.000
item lombarderie weight 4.000
}

# rules
rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map

Hope it helps,
cheers


--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr




Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-05-31 Thread Nam Dang
Hi Noah,

By the way, the test suite of cephfs-java has a bug. You should put
the permission value in the form of 0777 instead of 777 since the
number has to be octal. With 777 I got directories with weird
permission settings.
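
A quick illustration of the difference (hedged, but the arithmetic is plain):
a Java integer literal without a leading zero is decimal, so 777 is really
octal 1411, while 0777 is the intended rwxrwxrwx:

printf 'decimal 777 = octal %o\n' 777       # prints: decimal 777 = octal 1411
printf 'octal 0777  = decimal %d\n' 0777    # prints: octal 0777  = decimal 511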

Best regards
Nam Dang
Tokyo Institute of Technology
Tokyo, Japan


On Thu, May 31, 2012 at 11:43 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:

 On May 31, 2012, at 6:20 AM, Nam Dang wrote:

 Stack: [0x7ff6aa828000,0x7ff6aa929000],
 sp=0x7ff6aa9274f0,  free space=1021k
 Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
 code)
 C  [libcephfs.so.1+0x139d39]  Mutex::Lock(bool)+0x9

 Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
 j  com.ceph.fs.CephMount.native_ceph_mkdirs(JLjava/lang/String;I)I+0
 j  com.ceph.fs.CephMount.mkdirs(Ljava/lang/String;I)V+6
 j  
 Benchmark$CreateFileStats.executeOp(IILjava/lang/String;Lcom/ceph/fs/CephMount;)J+37
 j  Benchmark$StatsDaemon.benchmarkOne()V+22
 j  Benchmark$StatsDaemon.run()V+26
 v  ~StubRoutines::call_stub

 Nevermind to my last comment. Hmm, I've seen this, but very rarely.

 - Noah



Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-05-31 Thread Noah Watkins

On May 31, 2012, at 8:48 AM, Nam Dang wrote:

 Hi Noah,
 
 By the way, the test suite of cephfs-java has a bug. You should put
 the permission value in the form of 0777 instead of 777 since the
 number has to be octal. With 777 I got directories with weird
 permission settings.

Thanks Nam, I'll fix this up.

 
 Best regards
 Nam Dang
 Tokyo Institute of Technology
 Tokyo, Japan
 
 
 On Thu, May 31, 2012 at 11:43 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:
 
 On May 31, 2012, at 6:20 AM, Nam Dang wrote:
 
 Stack: [0x7ff6aa828000,0x7ff6aa929000],
 sp=0x7ff6aa9274f0,  free space=1021k
 Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
 code)
 C  [libcephfs.so.1+0x139d39]  Mutex::Lock(bool)+0x9
 
 Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
 j  com.ceph.fs.CephMount.native_ceph_mkdirs(JLjava/lang/String;I)I+0
 j  com.ceph.fs.CephMount.mkdirs(Ljava/lang/String;I)V+6
 j  
 Benchmark$CreateFileStats.executeOp(IILjava/lang/String;Lcom/ceph/fs/CephMount;)J+37
 j  Benchmark$StatsDaemon.benchmarkOne()V+22
 j  Benchmark$StatsDaemon.run()V+26
 v  ~StubRoutines::call_stub
 
 Nevermind to my last comment. Hmm, I've seen this, but very rarely.
 
 - Noah
 



Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Mark Nelson

On 05/31/2012 10:43 AM, Yann Dupont wrote:

On 31/05/2012 17:32, Mark Nelson wrote:

ceph osd pool get <pool> pg_num


My setup is detailed in a previous mail , But as I changed some
parameters this morning, here we go :

root@chichibu:~# ceph osd pool get data pg_num
PG_NUM: 576
root@chichibu:~# ceph osd pool get rbd pg_num
PG_NUM: 576



The pg num is quite low because I started with small OSD (9 osd with
200G each - internal disks) when I formatted. Now, I reduced to 8 osd,
(osd.4 is out) but with much larger ( faster) storage.


Now, each of the 8 OSD have 5T on it, I try, for the moment, to keep the
OSD similars. Replication is set to 2.


The fs is btrfs formatted with big metadata (-l 64k -n64k), and mounted
via space_cache,compress=lzo,nobarrier,noatime.

journal is on tmpfs :
osd journal = /dev/shm/journal
osd journal size = 6144

I know this is dangerous, remember It's NOT a production system for the
moment.

No OSD is full, I don't have much data stored for the moment.

Concerning crush map, I'm not using the default one :

The 8 nodes are in 3 different locations (some kilometers away). 2 are
in 1 place, 2 in another, and the 4 last in the principal place.

There is 10G between all the nodes and they are in the same VLAN, no
router involved (but there is (negligible ?) latency between nodes)

I try to group host together to avoid problem when I loose a location
(electrical problem, for example). Not sure I really customized the
crush map as I should have.

here is the map :
begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 rack
type 3 pool

# buckets
host karuizawa {
id -5 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.2 weight 1.000
}
host hazelburn {
id -6 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.3 weight 1.000
}
rack loire {
id -3 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item karuizawa weight 1.000
item hazelburn weight 1.000
}
host carsebridge {
id -8 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.5 weight 1.000
}
host cameronbridge {
id -9 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.6 weight 1.000
}
rack chantrerie {
id -7 # do not change unnecessarily
# weight 2.000
alg straw
hash 0 # rjenkins1
item carsebridge weight 1.000
item cameronbridge weight 1.000
}
host chichibu {
id -2 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.0 weight 1.000
}
host glenesk {
id -4 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.1 weight 1.000
}
host braeval {
id -10 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.7 weight 1.000
}
host hanyu {
id -11 # do not change unnecessarily
# weight 1.000
alg straw
hash 0 # rjenkins1
item osd.8 weight 1.000
}
rack lombarderie {
id -12 # do not change unnecessarily
# weight 4.000
alg straw
hash 0 # rjenkins1
item chichibu weight 1.000
item glenesk weight 1.000
item braeval weight 1.000
item hanyu weight 1.000
}
pool default {
id -1 # do not change unnecessarily
# weight 8.000
alg straw
hash 0 # rjenkins1
item loire weight 2.000
item chantrerie weight 2.000
item lombarderie weight 4.000
}

# rules
rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map

Hope it helps,
cheers




Hi Yann,

You might want to start out by running sar/iostat/collectl on the OSD 
nodes and seeing if anything looks funny during the slow test compared 
to the fast one.  If that doesn't reveal much, you could run blktrace on 
one of the OSDs during the tests and see if the IO to the disk looks 
different.  I can help out if you want to send me your blktrace results. 
 Similarly you could watch the network streams for both tests and see 
if anything looks different there.
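
A minimal sketch of that kind of capture on one OSD node (the device name 
/dev/sdb, the output paths and running the collectors alongside rados bench 
are assumptions):

# Start lightweight collectors, run the benchmark, then stop them.
iostat -xm 1 > /tmp/iostat-rbd.log &                                  # per-device utilisation and await
blktrace -d /dev/sdb -o - | blkparse -i - > /tmp/blktrace-rbd.txt &   # raw block-layer trace
rados -p rbd bench 20 write -t 16                                     # the benchmark under test
kill %1 %2                                                            # stop iostat and blktrace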


Thanks!
Mark


Re: [PATCH 01/13] libceph: eliminate connection state DEAD

2012-05-31 Thread Yehuda Sadeh
Reviewed-by: Yehuda Sadeh yeh...@inktank.com

On Wed, May 30, 2012 at 12:34 PM, Alex Elder el...@inktank.com wrote:
 The ceph connection state DEAD is never set and is therefore not
 needed.  Eliminate it.

 Signed-off-by: Alex Elder el...@inktank.com
 ---
  include/linux/ceph/messenger.h |    1 -
  net/ceph/messenger.c           |    6 --
  2 files changed, 0 insertions(+), 7 deletions(-)

 diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
 index 2521a95..aa506ca 100644
 --- a/include/linux/ceph/messenger.h
 +++ b/include/linux/ceph/messenger.h
 @@ -119,7 +119,6 @@ struct ceph_msg_pos {
  #define CLOSED         10 /* we've closed the connection */
  #define SOCK_CLOSED    11 /* socket state changed to closed */
  #define OPENING         13 /* open connection w/ (possibly new) peer */
 -#define DEAD            14 /* dead, about to kfree */
  #define BACKOFF         15

  /*
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 1a80907..42ca8aa 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -2087,12 +2087,6 @@ bad_tag:
  */
  static void queue_con(struct ceph_connection *con)
  {
 -       if (test_bit(DEAD, &con->state)) {
 -               dout("queue_con %p ignoring: DEAD\n",
 -                    con);
 -               return;
 -       }
 -
        if (!con->ops->get(con)) {
                dout("queue_con %p ref count 0\n", con);
                return;
 --
 1.7.5.4




Re: differents ip/network link for osd replication and client-osd ?

2012-05-31 Thread Sage Weil
On Thu, 31 May 2012, Alexandre DERUMIER wrote:
 Hi,
 Is it possible to use differents ip / network link for 
 
 - replication between osd
 - network between client and osd 
 
 ?
 
 I would like to use differents swichs/network card for osd replication.

Yep:

[osd]
public network = 1.2.3.4/24
cluster network = 192.168.0.0/16

will make ceph-osd choose IPs in those subnets.  You can also specify 
'public addr' or 'cluster addr' to a specific IP, although that's more 
tedious to configure.
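
A hedged way to double-check the split after restarting the OSDs (the grep 
patterns are assumptions about the output format):

netstat -tlnp | grep ceph-osd     # which addresses each ceph-osd daemon is listening on
ceph osd dump | grep '^osd\.'     # public and cluster addresses recorded per OSD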

sage


Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Sage Weil
On Thu, 31 May 2012, Yann Dupont wrote:
 On 31/05/2012 17:32, Mark Nelson wrote:
  ceph osd pool get <pool> pg_num
 
 My setup is detailed in a previous mail , But as I changed some parameters
 this morning, here we go :
 
 root@chichibu:~# ceph osd pool get data pg_num
 PG_NUM: 576
 root@chichibu:~# ceph osd pool get rbd pg_num
 PG_NUM: 576

Can you post 'ceph osd dump | grep ^pool' so we can see which CRUSH rules 
the pools are mapped to?

Thanks!
sage


 
 
 
 The pg num is quite low because I started with small OSD (9 osd with 200G each
 - internal disks) when I formatted. Now, I reduced to 8 osd, (osd.4 is out)
 but with much larger ( faster) storage.
 
 
 Now, each of the 8 OSD have 5T on it, I try, for the moment, to keep the OSD
 similars. Replication is set to 2.
 
 
 The fs is btrfs formatted with big metadata (-l 64k -n64k), and mounted via
 space_cache,compress=lzo,nobarrier,noatime.
 
 journal is on tmpfs :
  osd journal = /dev/shm/journal
  osd journal size = 6144
 
 I know this is dangerous, remember It's NOT a production system for the
 moment.
 
 No OSD is full, I don't have much data stored for the moment.
 
 Concerning crush map, I'm not using the default one :
 
 The 8 nodes are in 3 different locations (some kilometers away). 2 are in one
 place, 2 in another, and the last 4 in the principal place.
 
 There is 10G between all the nodes and they are in the same VLAN, with no router
 involved (but there is some, presumably negligible, latency between nodes).
 
 I try to group hosts together to avoid problems when I lose a location
 (an electrical problem, for example). Not sure I really customized the crush map
 as I should have; a crushtool sanity check is sketched after the map below.
 
 here is the map :
  begin crush map
 
 # devices
 device 0 osd.0
 device 1 osd.1
 device 2 osd.2
 device 3 osd.3
 device 4 device4
 device 5 osd.5
 device 6 osd.6
 device 7 osd.7
 device 8 osd.8
 
 # types
 type 0 osd
 type 1 host
 type 2 rack
 type 3 pool
 
 # buckets
 host karuizawa {
 id -5# do not change unnecessarily
 # weight 1.000
 alg straw
 hash 0# rjenkins1
 item osd.2 weight 1.000
 }
 host hazelburn {
 id -6# do not change unnecessarily
 # weight 1.000
 alg straw
 hash 0# rjenkins1
 item osd.3 weight 1.000
 }
 rack loire {
 id -3# do not change unnecessarily
 # weight 2.000
 alg straw
 hash 0# rjenkins1
 item karuizawa weight 1.000
 item hazelburn weight 1.000
 }
 host carsebridge {
 id -8# do not change unnecessarily
 # weight 1.000
 alg straw
 hash 0# rjenkins1
 item osd.5 weight 1.000
 }
 host cameronbridge {
 id -9# do not change unnecessarily
 # weight 1.000
 alg straw
 hash 0# rjenkins1
 item osd.6 weight 1.000
 }
 rack chantrerie {
 id -7# do not change unnecessarily
 # weight 2.000
 alg straw
 hash 0# rjenkins1
 item carsebridge weight 1.000
 item cameronbridge weight 1.000
 }
 host chichibu {
 id -2# do not change unnecessarily
 # weight 1.000
 alg straw
 hash 0# rjenkins1
 item osd.0 weight 1.000
 }
 host glenesk {
 id -4# do not change unnecessarily
 # weight 1.000
 alg straw
 hash 0# rjenkins1
 item osd.1 weight 1.000
 }
 host braeval {
 id -10# do not change unnecessarily
 # weight 1.000
 alg straw
 hash 0# rjenkins1
 item osd.7 weight 1.000
 }
 host hanyu {
 id -11# do not change unnecessarily
 # weight 1.000
 alg straw
 hash 0# rjenkins1
 item osd.8 weight 1.000
 }
 rack lombarderie {
 id -12# do not change unnecessarily
 # weight 4.000
 alg straw
 hash 0# rjenkins1
 item chichibu weight 1.000
 item glenesk weight 1.000
 item braeval weight 1.000
 item hanyu weight 1.000
 }
 pool default {
 id -1# do not change unnecessarily
 # weight 8.000
 alg straw
 hash 0# rjenkins1
 item loire weight 2.000
 item chantrerie weight 2.000
 item lombarderie weight 4.000
 }
 
 # rules
 rule data {
 ruleset 0
 type replicated
 min_size 1
 max_size 10
 step take default
 step chooseleaf firstn 0 type host
 step emit
 }
 rule metadata {
 ruleset 1
 type replicated
 min_size 1
 max_size 10
 step take default
 step chooseleaf firstn 0 type host
 step emit
 }
 rule rbd {
 ruleset 2
 type replicated
 min_size 1
 max_size 10
 step take default
 step chooseleaf firstn 0 type host
 step emit
 }
 
 # end crush map
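 
 One way to sanity-check such a map is to decompile it, edit, recompile and run
 crushtool's test mode (the exact test flags vary between versions, so treat
 this as a sketch):
 
 crushtool -d crushmap.bin -o crushmap.txt      # decompile the installed map
 crushtool -c crushmap.txt -o crushmap.new      # recompile after editing
 crushtool -i crushmap.new --test --rule 0 --num-rep 2 --show-utilization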
 
 Hope it helps,
 cheers
 
 
 -- 
 Yann Dupont - Service IRTS, DSI Université de Nantes
 Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
 
 
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 

Re: [PATCH 02/13] libceph: kill bad_proto ceph connection op

2012-05-31 Thread Yehuda Sadeh
Reviewed-by: Yehuda Sadeh yeh...@inktank.com

On Wed, May 30, 2012 at 12:34 PM, Alex Elder el...@inktank.com wrote:
 No code sets a bad_proto method in its ceph connection operations
 vector, so just get rid of it.

 Signed-off-by: Alex Elder el...@inktank.com
 ---
  include/linux/ceph/messenger.h |    3 ---
  net/ceph/messenger.c           |    5 -
  2 files changed, 0 insertions(+), 8 deletions(-)

 diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
 index aa506ca..74f6c9b 100644
 --- a/include/linux/ceph/messenger.h
 +++ b/include/linux/ceph/messenger.h
 @@ -31,9 +31,6 @@ struct ceph_connection_operations {
        int (*verify_authorizer_reply) (struct ceph_connection *con, int
 len);
        int (*invalidate_authorizer)(struct ceph_connection *con);

 -       /* protocol version mismatch */
 -       void (*bad_proto) (struct ceph_connection *con);
 -
        /* there was some error on the socket (disconnect, whatever) */
        void (*fault) (struct ceph_connection *con);

 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 42ca8aa..07af994 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -1356,11 +1356,6 @@ static void fail_protocol(struct ceph_connection
 *con)
  {
        reset_connection(con);
        set_bit(CLOSED, &con->state);  /* in case there's queued work */
 -
 -       mutex_unlock(&con->mutex);
 -       if (con->ops->bad_proto)
 -               con->ops->bad_proto(con);
 -       mutex_lock(&con->mutex);
  }

  static int process_connect(struct ceph_connection *con)
 --
 1.7.5.4

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/13] libceph: rename kvec_reset and kvec_add functions

2012-05-31 Thread Yehuda Sadeh
Reviewed-by: Yehuda Sadeh yeh...@inktank.com

On Wed, May 30, 2012 at 12:34 PM, Alex Elder el...@inktank.com wrote:
 The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add()
 are entirely private functions, so drop the ceph_ prefix in their
 name to make them slightly more wieldy.

 Signed-off-by: Alex Elder el...@inktank.com
 ---
  net/ceph/messenger.c |   48
 
  1 files changed, 24 insertions(+), 24 deletions(-)

 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 5ad1f0a..2e9054f 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -484,14 +484,14 @@ static u32 get_global_seq(struct ceph_messenger *msgr,
 u32 gt)
        return ret;
  }

 -static void ceph_con_out_kvec_reset(struct ceph_connection *con)
 +static void con_out_kvec_reset(struct ceph_connection *con)
  {
        con-out_kvec_left = 0;
        con-out_kvec_bytes = 0;
        con-out_kvec_cur = con-out_kvec[0];
  }

 -static void ceph_con_out_kvec_add(struct ceph_connection *con,
 +static void con_out_kvec_add(struct ceph_connection *con,
                                size_t size, void *data)
  {
        int index;
 @@ -532,7 +532,7 @@ static void prepare_write_message(struct ceph_connection
 *con)
        struct ceph_msg *m;
        u32 crc;

 -       ceph_con_out_kvec_reset(con);
 +       con_out_kvec_reset(con);
        con-out_kvec_is_msg = true;
        con-out_msg_done = false;

 @@ -540,9 +540,9 @@ static void prepare_write_message(struct ceph_connection
 *con)
         * TCP packet that's a good thing. */
        if (con-in_seq  con-in_seq_acked) {
                con-in_seq_acked = con-in_seq;
 -               ceph_con_out_kvec_add(con, sizeof (tag_ack), tag_ack);
 +               con_out_kvec_add(con, sizeof (tag_ack), tag_ack);
                con-out_temp_ack = cpu_to_le64(con-in_seq_acked);
 -               ceph_con_out_kvec_add(con, sizeof (con-out_temp_ack),
 +               con_out_kvec_add(con, sizeof (con-out_temp_ack),
                        con-out_temp_ack);
        }

 @@ -570,12 +570,12 @@ static void prepare_write_message(struct
 ceph_connection *con)
        BUG_ON(le32_to_cpu(m-hdr.front_len) != m-front.iov_len);

        /* tag + hdr + front + middle */
 -       ceph_con_out_kvec_add(con, sizeof (tag_msg), tag_msg);
 -       ceph_con_out_kvec_add(con, sizeof (m-hdr), m-hdr);
 -       ceph_con_out_kvec_add(con, m-front.iov_len, m-front.iov_base);
 +       con_out_kvec_add(con, sizeof (tag_msg), tag_msg);
 +       con_out_kvec_add(con, sizeof (m-hdr), m-hdr);
 +       con_out_kvec_add(con, m-front.iov_len, m-front.iov_base);

        if (m-middle)
 -               ceph_con_out_kvec_add(con, m-middle-vec.iov_len,
 +               con_out_kvec_add(con, m-middle-vec.iov_len,
                        m-middle-vec.iov_base);

        /* fill in crc (except data pages), footer */
 @@ -624,12 +624,12 @@ static void prepare_write_ack(struct ceph_connection
 *con)
             con-in_seq_acked, con-in_seq);
        con-in_seq_acked = con-in_seq;

 -       ceph_con_out_kvec_reset(con);
 +       con_out_kvec_reset(con);

 -       ceph_con_out_kvec_add(con, sizeof (tag_ack), tag_ack);
 +       con_out_kvec_add(con, sizeof (tag_ack), tag_ack);

        con-out_temp_ack = cpu_to_le64(con-in_seq_acked);
 -       ceph_con_out_kvec_add(con, sizeof (con-out_temp_ack),
 +       con_out_kvec_add(con, sizeof (con-out_temp_ack),
                                con-out_temp_ack);

        con-out_more = 1;  /* more will follow.. eventually.. */
 @@ -642,8 +642,8 @@ static void prepare_write_ack(struct ceph_connection
 *con)
  static void prepare_write_keepalive(struct ceph_connection *con)
  {
        dout(prepare_write_keepalive %p\n, con);
 -       ceph_con_out_kvec_reset(con);
 -       ceph_con_out_kvec_add(con, sizeof (tag_keepalive), tag_keepalive);
 +       con_out_kvec_reset(con);
 +       con_out_kvec_add(con, sizeof (tag_keepalive), tag_keepalive);
        set_bit(WRITE_PENDING, con-state);
  }

 @@ -688,8 +688,8 @@ static struct ceph_auth_handshake
 *get_connect_authorizer(struct ceph_connection
  */
  static void prepare_write_banner(struct ceph_connection *con)
  {
 -       ceph_con_out_kvec_add(con, strlen(CEPH_BANNER), CEPH_BANNER);
 -       ceph_con_out_kvec_add(con, sizeof (con-msgr-my_enc_addr),
 +       con_out_kvec_add(con, strlen(CEPH_BANNER), CEPH_BANNER);
 +       con_out_kvec_add(con, sizeof (con-msgr-my_enc_addr),
                                        con-msgr-my_enc_addr);

        con-out_more = 0;
 @@ -736,10 +736,10 @@ static int prepare_write_connect(struct
 ceph_connection *con)
        con-out_connect.authorizer_len = auth ?
                cpu_to_le32(auth-authorizer_buf_len) : 0;

 -       ceph_con_out_kvec_add(con, sizeof (con-out_connect),
 +       con_out_kvec_add(con, sizeof (con-out_connect),
                                        con-out_connect);
        if (auth  auth-authorizer_buf_len)
 -               

Re: poor OSD performance using kernel 3.4 = problem found

2012-05-31 Thread Yann Dupont

Le 31/05/2012 18:29, Sage Weil a écrit :


Can you post 'ceph osd dump | grep ^pool' so we can see which CRUSH rules
the pools are mapped to?



yes :

root@label5:~# ceph osd dump | grep ^pool
pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 576 
pgp_num 576 last_change 816 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 
576 pgp_num 576 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 576 
pgp_num 576 last_change 1 owner 0


cheers,


--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr


--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd rm image slow with big images ?

2012-05-31 Thread Wido den Hollander

Hi,

On 05/31/2012 09:12 AM, Alexandre DERUMIER wrote:

Hi,

I trying to delete some rbd images with rbd rm,
and it seem to be slow with big images.



I'm testing it with just create a new image (1TB):

# time rbd -p pool1 create --size 100 image2

real0m0.031s
user0m0.015s
sys 0m0.010s


then just delete it, without having writed nothing in image


# time rbd -p pool1 rm image2
Removing image: 100% complete...done.

real1m45.558s
user0m14.683s
sys 0m17.363s



same test with 100GB

# time rbd -p pool1 create --size 10 image2

real0m0.032s
user0m0.016s
sys 0m0.007s

# time rbd -p pool1 rm image2
Removing image: 100% complete...done.

real0m10.499s
user0m1.488s
sys 0m1.720s


I'm using journal in tmpfs, 3 servers, 15 osds with 1disk 15K (xfs)
network bandwith,diskio,cpu are low.

Is it the normal behaviour ? Maybe some xfs tuning could help ?


It's in the nature of RBD.

An RBD image consists of multiple 4MB (default) RADOS objects.

Let's say you have a disk of 40GB; that will contain roughly 10,000 4MB RADOS
objects. You can find those objects by doing: rados -p rbd ls


Now, when you create a new image only the header is written; no data
objects are written yet.


When you start writing to an RBD image you will be writing to one of the
4MB objects. If it doesn't exist yet, it will be created.


So when you install your VM it will create objects, but not all of them.

RBD knows which RADOS objects to access by three parameters:

* Image name
* Image size
* Stripe size (4MB)

So when your VM accesses bytes Y through Z on the disk, RBD knows which
objects to access by calculating this.
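
As a rough illustration of that calculation (a standalone sketch, not the actual
librbd code; the real thing also derives per-object names from the image's
block-name prefix):

#include <stdio.h>

int main(void)
{
	unsigned long long obj_size = 4ULL << 20;	/* 4MB default object size */
	unsigned long long image_size = 1ULL << 40;	/* the 1TB image from the test */
	unsigned long long start = 123456789ULL;	/* byte Y */
	unsigned long long end = 987654321ULL;		/* byte Z */

	unsigned long long first = start / obj_size;
	unsigned long long last = end / obj_size;
	unsigned long long total = (image_size + obj_size - 1) / obj_size;

	printf("bytes %llu..%llu map to objects %llu..%llu\n",
	       start, end, first, last);
	/* 1TB / 4MB = 262144, which is why a full delete issues so many ops */
	printf("deleting the whole image touches up to %llu objects\n", total);
	return 0;
}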


Now, when you start removing the image there is no way of knowing which
objects exist and which don't, so RBD will try to remove all of them.


In the case of a fresh image this results in 10,000 RADOS remove
operations for non-existent objects, and that is slow.


Wido




--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd rm image slow with big images ?

2012-05-31 Thread Stefan Priebe

One note:
he has written:
"then just delete it, without having writed nothing in image"


Am 31.05.2012 20:15, schrieb Wido den Hollander:

Hi,

On 05/31/2012 09:12 AM, Alexandre DERUMIER wrote:

Hi,

I trying to delete some rbd images with rbd rm,
and it seem to be slow with big images.



I'm testing it with just create a new image (1TB):

# time rbd -p pool1 create --size 100 image2

real 0m0.031s
user 0m0.015s
sys 0m0.010s


then just delete it, without having writed nothing in image


# time rbd -p pool1 rm image2
Removing image: 100% complete...done.

real 1m45.558s
user 0m14.683s
sys 0m17.363s



same test with 100GB

# time rbd -p pool1 create --size 10 image2

real 0m0.032s
user 0m0.016s
sys 0m0.007s

# time rbd -p pool1 rm image2
Removing image: 100% complete...done.

real 0m10.499s
user 0m1.488s
sys 0m1.720s


I'm using journal in tmpfs, 3 servers, 15 osds with 1disk 15K (xfs)
network bandwith,diskio,cpu are low.

Is it the normal behaviour ? Maybe some xfs tuning could help ?


It's in the nature of RBD.

A RBD image consists of multiple 4MB (default) RADOS objects.

Let's say you have a disk of 40GB, that will contain 10.000 4MB RADOS
objects, you can find those objects by doing: rados -p rbd ls

Now, when you create a new image only the header is writting, but no
object is written.

When you start writing to a RBD image you will be writing to one of the
4MB objects. When it doesn't exist it will be created.

So when you install your VM it will create objects, but not all of them.

RBD knows which RADOS objects to access by three parameters:

* Image name
* Image size
* Stripe size (4MB)

So when your VM access for byte Y until Z on the disk, RBD knows which
object to access by calculating this.

Now, when you start removing the image there is no way of knowing which
object exists and which doesn't, so RBD will try to remove all objects.

In the case of a fresh image this results in 10.000 RADOS remove
operations for non-existent objects and that is slow.

Wido




--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd rm image slow with big images ?

2012-05-31 Thread Sage Weil
On Thu, 31 May 2012, Wido den Hollander wrote:
 Hi,
  Is it the normal behaviour ? Maybe some xfs tuning could help ?
 
 It's in the nature of RBD.

Yes.

That said, the current implementation is also stupid: it's doing a single 
io at a time.  #2256 (next sprint) will parallelize this to make it go 
much faster (probably an order of magnitude?).
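
As a rough sketch of that idea (assuming librados' C aio calls such as
rados_aio_remove(); object names are made up and error handling is omitted):

#include <stdio.h>
#include <rados/librados.h>

#define WINDOW 16	/* number of removals kept in flight */

int main(void)
{
	rados_t cluster;
	rados_ioctx_t io;
	rados_completion_t inflight[WINDOW];
	char oid[64];
	int i, n = 1000;

	rados_create(&cluster, NULL);
	rados_conf_read_file(cluster, NULL);	/* default config search path */
	rados_connect(cluster);
	rados_ioctx_create(cluster, "rbd", &io);

	for (i = 0; i < n; i++) {
		int slot = i % WINDOW;

		if (i >= WINDOW) {		/* slot is busy: wait for the old op */
			rados_aio_wait_for_complete(inflight[slot]);
			rados_aio_release(inflight[slot]);
		}
		snprintf(oid, sizeof(oid), "rb.0.1234.%012x", (unsigned)i);	/* made-up names */
		rados_aio_create_completion(NULL, NULL, NULL, &inflight[slot]);
		rados_aio_remove(io, oid, inflight[slot]);
	}
	for (i = 0; i < WINDOW && i < n; i++) {	/* drain whatever is still in flight */
		rados_aio_wait_for_complete(inflight[i]);
		rados_aio_release(inflight[i]);
	}

	rados_ioctx_destroy(io);
	rados_shutdown(cluster);
	return 0;
}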

sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rbd rm image slow with big images ?

2012-05-31 Thread Wido den Hollander

On 05/31/2012 08:16 PM, Stefan Priebe wrote:

One note:
he has written:
"then just delete it, without having writed nothing in image"


That is true, but RBD doesn't know that.

There is no record of which object got created and which didn't, so the 
removal process has to issue a removal for each RBD object that might exist.


That is the nature of RBD. It makes it simple and reliable.

Wido




Am 31.05.2012 20:15, schrieb Wido den Hollander:

Hi,

On 05/31/2012 09:12 AM, Alexandre DERUMIER wrote:

Hi,

I trying to delete some rbd images with rbd rm,
and it seem to be slow with big images.



I'm testing it with just create a new image (1TB):

# time rbd -p pool1 create --size 100 image2

real 0m0.031s
user 0m0.015s
sys 0m0.010s


then just delete it, without having writed nothing in image


# time rbd -p pool1 rm image2
Removing image: 100% complete...done.

real 1m45.558s
user 0m14.683s
sys 0m17.363s



same test with 100GB

# time rbd -p pool1 create --size 10 image2

real 0m0.032s
user 0m0.016s
sys 0m0.007s

# time rbd -p pool1 rm image2
Removing image: 100% complete...done.

real 0m10.499s
user 0m1.488s
sys 0m1.720s


I'm using journal in tmpfs, 3 servers, 15 osds with 1disk 15K (xfs)
network bandwith,diskio,cpu are low.

Is it the normal behaviour ? Maybe some xfs tuning could help ?


It's in the nature of RBD.

A RBD image consists of multiple 4MB (default) RADOS objects.

Let's say you have a disk of 40GB, that will contain 10.000 4MB RADOS
objects, you can find those objects by doing: rados -p rbd ls

Now, when you create a new image only the header is writting, but no
object is written.

When you start writing to a RBD image you will be writing to one of the
4MB objects. When it doesn't exist it will be created.

So when you install your VM it will create objects, but not all of them.

RBD knows which RADOS objects to access by three parameters:

* Image name
* Image size
* Stripe size (4MB)

So when your VM access for byte Y until Z on the disk, RBD knows which
object to access by calculating this.

Now, when you start removing the image there is no way of knowing which
object exists and which doesn't, so RBD will try to remove all objects.

In the case of a fresh image this results in 10.000 RADOS remove
operations for non-existent objects and that is slow.

Wido




--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-05-31 Thread Greg Farnum
On Thursday, May 31, 2012 at 7:43 AM, Noah Watkins wrote:
 
 On May 31, 2012, at 6:20 AM, Nam Dang wrote:
 
   Stack: [0x7ff6aa828000,0x7ff6aa929000],
   sp=0x7ff6aa9274f0, free space=1021k
   Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native 
   code)
   C [libcephfs.so.1+0x139d39] Mutex::Lock(bool)+0x9
   
   Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
   j com.ceph.fs.CephMount.native_ceph_mkdirs(JLjava/lang/String;I)I+0
   j com.ceph.fs.CephMount.mkdirs(Ljava/lang/String;I)V+6
   j 
   Benchmark$CreateFileStats.executeOp(IILjava/lang/String;Lcom/ceph/fs/CephMount;)J+37
   j Benchmark$StatsDaemon.benchmarkOne()V+22
   j Benchmark$StatsDaemon.run()V+26
   v ~StubRoutines::call_stub
  
 
 
 
 Nevermind to my last comment. Hmm, I've seen this, but very rarely.
Noah, do you have any leads on this? Do you think it's a bug in your Java code 
or in the C/C++ libraries?
Nam: it definitely shouldn't be segfaulting just because a monitor went down. :)
-Greg

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-05-31 Thread Noah Watkins

On May 31, 2012, at 3:39 PM, Greg Farnum wrote:
 
 Nevermind to my last comment. Hmm, I've seen this, but very rarely.
 Noah, do you have any leads on this? Do you think it's a bug in your Java 
 code or in the C/++ libraries?

I _think_ this is because the JVM uses its own threading library, and Ceph
assumes pthreads and pthread-compatible mutexes -- is that assumption about Ceph
correct? Hence the error that looks like Mutex::Lock(bool) being referenced for
context during the segfault. To verify this, all that is needed is some
synchronization added to the Java side.

There are only two segfaults that I've ever encountered, one in which the C 
wrappers are used with an unmounted client, and the error Nam is seeing 
(although they could be related). I will re-submit an updated patch for the 
former, which should rule that out as the culprit.

Nam: where are you grabbing the Java patches from? I'll push some updates.


The only other scenario that comes to mind is related to signaling:

The RADOS Java wrappers suffered from an interaction between the JVM and RADOS 
client signal handlers, in which either the JVM or RADOS would replace the 
handlers for the other (not sure which order). Anyway, the solution was to link 
in the JVM libjsig.so signal chaining library. This might be the same thing we 
are seeing here, but I'm betting it is the first theory I mentioned.
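
For reference, the usual way to get that chaining is either to link with -ljsig
or to preload it when launching the JVM (the library path below is
platform-dependent, so treat it as a sketch):

  LD_PRELOAD=$JAVA_HOME/jre/lib/amd64/libjsig.so java -cp ... Benchmark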

- Noah
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


iozone test crashed on ceph

2012-05-31 Thread udit agarwal
Hi,
 I have set up a ceph system with a client, mon and mds on one machine, which is
connected to 2 osds. I ran an iozone test with a 10G file and it ran fine. But
when I ran an iozone test with a 5G file, the process got killed and our ceph
system hung. Can anyone please help me with this?

 Thanks in advance.

--Udit

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: RBD operations, pinging client that serves lingering tid

2012-05-31 Thread Sam Just
Those messages are harmless.  It's just debug output indicating that
the objecter is maintaining a watch on an rbd image header.  I'll tone
down the debug verbosity tomorrow.
-Sam

On Wed, May 30, 2012 at 6:54 AM, Guido Winkelmann
guido-c...@thisisnotatest.de wrote:
 Hi,

 Whenever I'm doing any operations on rbd volumes (like import, copy) using the
 rbd command line client, I'm getting these messages every couple of seconds:

 2012-05-30 15:53:08.010326 7f027aa47700  0 client.4159.objecter  pinging osd
 that serves lingering tid 1 (osd.2)
 2012-05-30 15:53:08.010344 7f027aa47700  0 client.4159.objecter  pinging osd
 that serves lingering tid 2 (osd.0)

 What does this mean? Is that anything to worry about?

 Yesterday, these messages were only mentioning osd.2, not osd.0...

        Guido
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: iozone test crashed on ceph

2012-05-31 Thread Sam Just
Hi,

Thanks for letting us know.  What version are you running?  Can you
post your ceph.conf to give us an idea of how your cluster is
configured?  Also, did any of the daemons crash?  If it's
reproducible, it would help to turn up osd and mds debugging to 20 and
post the logs.
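
Something along these lines in ceph.conf (plus a daemon restart) should do it:

[osd]
    debug osd = 20
    debug filestore = 20
    debug ms = 1
[mds]
    debug mds = 20
    debug ms = 1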

Thanks
-Sam

On Thu, May 31, 2012 at 5:58 PM, udit agarwal fzdu...@gmail.com wrote:
 Hi,
  I have set up ceph system with a client, mon and mds on one system which is
 connected to 2 osds. I ran iozone test with a 10G file and it ran fine. But 
 when
 I ran iozone test with a 5G file, the process got killed and our ceph system
 hanged. Can anyone please help me with this.

  Thanks in advance.

 --Udit

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 04/13] libceph: rename socket callbacks

2012-05-31 Thread Sage Weil
On Wed, 30 May 2012, Alex Elder wrote:
 Change the names of the three socket callback functions to make it
 more obvious they're specifically associated with a connection's
 socket (not the ceph connection that uses it).
 
 Signed-off-by: Alex Elder el...@inktank.com
 ---
  net/ceph/messenger.c |   28 ++--
  1 files changed, 14 insertions(+), 14 deletions(-)
 
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index fe3c2a1..5ad1f0a 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -153,46 +153,46 @@ EXPORT_SYMBOL(ceph_msgr_flush);
   */
 
  /* data available on socket, or listen socket received a connect */
 -static void ceph_data_ready(struct sock *sk, int count_unused)
 +static void ceph_sock_data_ready(struct sock *sk, int count_unused)
  {
   struct ceph_connection *con = sk-sk_user_data;
 
   if (sk-sk_state != TCP_CLOSE_WAIT) {
 - dout(ceph_data_ready on %p state = %lu, queueing work\n,
 + dout(%s on %p state = %lu, queueing work\n, __func__,

I think it's marginally better to do

dout(__func__ " on %p state = %lu, queueing work\n",

so that the concatenation happens at compile-time instead of runtime.

Otherwise, looks good!

Reviewed-by: Sage Weil s...@inktank.com

con, con-state);
   queue_con(con);
   }
  }
 
  /* socket has buffer space for writing */
 -static void ceph_write_space(struct sock *sk)
 +static void ceph_sock_write_space(struct sock *sk)
  {
   struct ceph_connection *con = sk-sk_user_data;
 
   /* only queue to workqueue if there is data we want to write,
* and there is sufficient space in the socket buffer to accept
 -  * more data.  clear SOCK_NOSPACE so that ceph_write_space()
 +  * more data.  clear SOCK_NOSPACE so that ceph_sock_write_space()
* doesn't get called again until try_write() fills the socket
* buffer. See net/ipv4/tcp_input.c:tcp_check_space()
* and net/core/stream.c:sk_stream_write_space().
*/
   if (test_bit(WRITE_PENDING, con-state)) {
   if (sk_stream_wspace(sk) = sk_stream_min_wspace(sk)) {
 - dout(ceph_write_space %p queueing write work\n,
 con);
 + dout(%s %p queueing write work\n, __func__, con);
   clear_bit(SOCK_NOSPACE, sk-sk_socket-flags);
   queue_con(con);
   }
   } else {
 - dout(ceph_write_space %p nothing to write\n, con);
 + dout(%s %p nothing to write\n, __func__, con);
   }
  }
 
  /* socket's state has changed */
 -static void ceph_state_change(struct sock *sk)
 +static void ceph_sock_state_change(struct sock *sk)
  {
   struct ceph_connection *con = sk-sk_user_data;
 
 - dout(ceph_state_change %p state = %lu sk_state = %u\n,
 + dout(%s %p state = %lu sk_state = %u\n, __func__,
con, con-state, sk-sk_state);
 
   if (test_bit(CLOSED, con-state))
 @@ -200,9 +200,9 @@ static void ceph_state_change(struct sock *sk)
 
   switch (sk-sk_state) {
   case TCP_CLOSE:
 - dout(ceph_state_change TCP_CLOSE\n);
 + dout(%s TCP_CLOSE\n, __func__);
   case TCP_CLOSE_WAIT:
 - dout(ceph_state_change TCP_CLOSE_WAIT\n);
 + dout(%s TCP_CLOSE_WAIT\n, __func__);
   if (test_and_set_bit(SOCK_CLOSED, con-state) == 0) {
   if (test_bit(CONNECTING, con-state))
   con-error_msg = connection failed;
 @@ -212,7 +212,7 @@ static void ceph_state_change(struct sock *sk)
   }
   break;
   case TCP_ESTABLISHED:
 - dout(ceph_state_change TCP_ESTABLISHED\n);
 + dout(%s TCP_ESTABLISHED\n, __func__);
   queue_con(con);
   break;
   default:/* Everything else is uninteresting */
 @@ -228,9 +228,9 @@ static void set_sock_callbacks(struct socket *sock,
  {
   struct sock *sk = sock-sk;
   sk-sk_user_data = con;
 - sk-sk_data_ready = ceph_data_ready;
 - sk-sk_write_space = ceph_write_space;
 - sk-sk_state_change = ceph_state_change;
 + sk-sk_data_ready = ceph_sock_data_ready;
 + sk-sk_write_space = ceph_sock_write_space;
 + sk-sk_state_change = ceph_sock_state_change;
  }
 
 
 -- 
 1.7.5.4
 
 
 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 05/13] libceph: rename kvec_reset and kvec_add functions

2012-05-31 Thread Sage Weil
Yep

On Wed, 30 May 2012, Alex Elder wrote:

 The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add()
 are entirely private functions, so drop the ceph_ prefix in their
 name to make them slightly more wieldy.
 
 Signed-off-by: Alex Elder el...@inktank.com
 ---
  net/ceph/messenger.c |   48 
  1 files changed, 24 insertions(+), 24 deletions(-)
 
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 5ad1f0a..2e9054f 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -484,14 +484,14 @@ static u32 get_global_seq(struct ceph_messenger *msgr,
 u32 gt)
   return ret;
  }
 
 -static void ceph_con_out_kvec_reset(struct ceph_connection *con)
 +static void con_out_kvec_reset(struct ceph_connection *con)
  {
   con-out_kvec_left = 0;
   con-out_kvec_bytes = 0;
   con-out_kvec_cur = con-out_kvec[0];
  }
 
 -static void ceph_con_out_kvec_add(struct ceph_connection *con,
 +static void con_out_kvec_add(struct ceph_connection *con,
   size_t size, void *data)
  {
   int index;
 @@ -532,7 +532,7 @@ static void prepare_write_message(struct ceph_connection
 *con)
   struct ceph_msg *m;
   u32 crc;
 
 - ceph_con_out_kvec_reset(con);
 + con_out_kvec_reset(con);
   con-out_kvec_is_msg = true;
   con-out_msg_done = false;
 
 @@ -540,9 +540,9 @@ static void prepare_write_message(struct ceph_connection
 *con)
* TCP packet that's a good thing. */
   if (con-in_seq  con-in_seq_acked) {
   con-in_seq_acked = con-in_seq;
 - ceph_con_out_kvec_add(con, sizeof (tag_ack), tag_ack);
 + con_out_kvec_add(con, sizeof (tag_ack), tag_ack);
   con-out_temp_ack = cpu_to_le64(con-in_seq_acked);
 - ceph_con_out_kvec_add(con, sizeof (con-out_temp_ack),
 + con_out_kvec_add(con, sizeof (con-out_temp_ack),
   con-out_temp_ack);
   }
 
 @@ -570,12 +570,12 @@ static void prepare_write_message(struct ceph_connection
 *con)
   BUG_ON(le32_to_cpu(m-hdr.front_len) != m-front.iov_len);
 
   /* tag + hdr + front + middle */
 - ceph_con_out_kvec_add(con, sizeof (tag_msg), tag_msg);
 - ceph_con_out_kvec_add(con, sizeof (m-hdr), m-hdr);
 - ceph_con_out_kvec_add(con, m-front.iov_len, m-front.iov_base);
 + con_out_kvec_add(con, sizeof (tag_msg), tag_msg);
 + con_out_kvec_add(con, sizeof (m-hdr), m-hdr);
 + con_out_kvec_add(con, m-front.iov_len, m-front.iov_base);
 
   if (m-middle)
 - ceph_con_out_kvec_add(con, m-middle-vec.iov_len,
 + con_out_kvec_add(con, m-middle-vec.iov_len,
   m-middle-vec.iov_base);
 
   /* fill in crc (except data pages), footer */
 @@ -624,12 +624,12 @@ static void prepare_write_ack(struct ceph_connection
 *con)
con-in_seq_acked, con-in_seq);
   con-in_seq_acked = con-in_seq;
 
 - ceph_con_out_kvec_reset(con);
 + con_out_kvec_reset(con);
 
 - ceph_con_out_kvec_add(con, sizeof (tag_ack), tag_ack);
 + con_out_kvec_add(con, sizeof (tag_ack), tag_ack);
 
   con-out_temp_ack = cpu_to_le64(con-in_seq_acked);
 - ceph_con_out_kvec_add(con, sizeof (con-out_temp_ack),
 + con_out_kvec_add(con, sizeof (con-out_temp_ack),
   con-out_temp_ack);
 
   con-out_more = 1;  /* more will follow.. eventually.. */
 @@ -642,8 +642,8 @@ static void prepare_write_ack(struct ceph_connection *con)
  static void prepare_write_keepalive(struct ceph_connection *con)
  {
   dout(prepare_write_keepalive %p\n, con);
 - ceph_con_out_kvec_reset(con);
 - ceph_con_out_kvec_add(con, sizeof (tag_keepalive), tag_keepalive);
 + con_out_kvec_reset(con);
 + con_out_kvec_add(con, sizeof (tag_keepalive), tag_keepalive);
   set_bit(WRITE_PENDING, con-state);
  }
 
 @@ -688,8 +688,8 @@ static struct ceph_auth_handshake
 *get_connect_authorizer(struct ceph_connection
   */
  static void prepare_write_banner(struct ceph_connection *con)
  {
 - ceph_con_out_kvec_add(con, strlen(CEPH_BANNER), CEPH_BANNER);
 - ceph_con_out_kvec_add(con, sizeof (con-msgr-my_enc_addr),
 + con_out_kvec_add(con, strlen(CEPH_BANNER), CEPH_BANNER);
 + con_out_kvec_add(con, sizeof (con-msgr-my_enc_addr),
   con-msgr-my_enc_addr);
 
   con-out_more = 0;
 @@ -736,10 +736,10 @@ static int prepare_write_connect(struct ceph_connection
 *con)
   con-out_connect.authorizer_len = auth ?
   cpu_to_le32(auth-authorizer_buf_len) : 0;
 
 - ceph_con_out_kvec_add(con, sizeof (con-out_connect),
 + con_out_kvec_add(con, sizeof (con-out_connect),
   con-out_connect);
   if (auth  auth-authorizer_buf_len)
 - ceph_con_out_kvec_add(con, auth-authorizer_buf_len,
 + con_out_kvec_add(con, auth-authorizer_buf_len,
   

Re: [PATCH 06/13] libceph: embed ceph messenger structure in ceph_client

2012-05-31 Thread Sage Weil
Reviewed-by: Sage Weil s...@inktank.com

On Wed, 30 May 2012, Alex Elder wrote:

 A ceph client has a pointer to a ceph messenger structure in it.
 There is always exactly one ceph messenger for a ceph client, so
 there is no need to allocate it separate from the ceph client
 structure.
 
 Switch the ceph_client structure to embed its ceph_messenger
 structure.
 
 Signed-off-by: Alex Elder el...@inktank.com
 ---
  fs/ceph/mds_client.c   |2 +-
  include/linux/ceph/libceph.h   |2 +-
  include/linux/ceph/messenger.h |9 +
  net/ceph/ceph_common.c |   18 +-
  net/ceph/messenger.c   |   30 +-
  net/ceph/mon_client.c  |6 +++---
  net/ceph/osd_client.c  |4 ++--
  7 files changed, 26 insertions(+), 45 deletions(-)
 
 diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
 index 200bc87..ad30261 100644
 --- a/fs/ceph/mds_client.c
 +++ b/fs/ceph/mds_client.c
 @@ -394,7 +394,7 @@ static struct ceph_mds_session *register_session(struct
 ceph_mds_client *mdsc,
   s-s_seq = 0;
   mutex_init(s-s_mutex);
 
 - ceph_con_init(mdsc-fsc-client-msgr, s-s_con);
 + ceph_con_init(mdsc-fsc-client-msgr, s-s_con);
   s-s_con.private = s;
   s-s_con.ops = mds_con_ops;
   s-s_con.peer_name.type = CEPH_ENTITY_TYPE_MDS;
 diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
 index 92eef7c..927361c 100644
 --- a/include/linux/ceph/libceph.h
 +++ b/include/linux/ceph/libceph.h
 @@ -131,7 +131,7 @@ struct ceph_client {
   u32 supported_features;
   u32 required_features;
 
 - struct ceph_messenger *msgr;   /* messenger instance */
 + struct ceph_messenger msgr;   /* messenger instance */
   struct ceph_mon_client monc;
   struct ceph_osd_client osdc;
 
 diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
 index 74f6c9b..3fbd4be 100644
 --- a/include/linux/ceph/messenger.h
 +++ b/include/linux/ceph/messenger.h
 @@ -211,10 +211,11 @@ extern int ceph_msgr_init(void);
  extern void ceph_msgr_exit(void);
  extern void ceph_msgr_flush(void);
 
 -extern struct ceph_messenger *ceph_messenger_create(
 - struct ceph_entity_addr *myaddr,
 - u32 features, u32 required);
 -extern void ceph_messenger_destroy(struct ceph_messenger *);
 +extern void ceph_messenger_init(struct ceph_messenger *msgr,
 + struct ceph_entity_addr *myaddr,
 + u32 supported_features,
 + u32 required_features,
 + bool nocrc);
 
  extern void ceph_con_init(struct ceph_messenger *msgr,
 struct ceph_connection *con);
 diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
 index cc91319..2de3ea1 100644
 --- a/net/ceph/ceph_common.c
 +++ b/net/ceph/ceph_common.c
 @@ -468,19 +468,15 @@ struct ceph_client *ceph_create_client(struct
 ceph_options *opt, void *private,
   /* msgr */
   if (ceph_test_opt(client, MYIP))
   myaddr = client-options-my_addr;
 - client-msgr = ceph_messenger_create(myaddr,
 -  client-supported_features,
 -  client-required_features);
 - if (IS_ERR(client-msgr)) {
 - err = PTR_ERR(client-msgr);
 - goto fail;
 - }
 - client-msgr-nocrc = ceph_test_opt(client, NOCRC);
 + ceph_messenger_init(client-msgr, myaddr,
 + client-supported_features,
 + client-required_features,
 + ceph_test_opt(client, NOCRC));
 
   /* subsystems */
   err = ceph_monc_init(client-monc, client);
   if (err  0)
 - goto fail_msgr;
 + goto fail;
   err = ceph_osdc_init(client-osdc, client);
   if (err  0)
   goto fail_monc;
 @@ -489,8 +485,6 @@ struct ceph_client *ceph_create_client(struct ceph_options
 *opt, void *private,
 
  fail_monc:
   ceph_monc_stop(client-monc);
 -fail_msgr:
 - ceph_messenger_destroy(client-msgr);
  fail:
   kfree(client);
   return ERR_PTR(err);
 @@ -515,8 +509,6 @@ void ceph_destroy_client(struct ceph_client *client)
 
   ceph_debugfs_client_cleanup(client);
 
 - ceph_messenger_destroy(client-msgr);
 -
   ceph_destroy_options(client-options);
 
   kfree(client);
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 2e9054f..19f1948 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -2243,18 +2243,14 @@ out:
 
 
  /*
 - * create a new messenger instance
 + * initialize a new messenger instance
   */
 -struct ceph_messenger *ceph_messenger_create(struct ceph_entity_addr *myaddr,
 -  u32 supported_features,
 -  u32 required_features)
 +void ceph_messenger_init(struct ceph_messenger *msgr,
 + struct ceph_entity_addr *myaddr,
 + u32 

Re: iozone test crashed on ceph

2012-05-31 Thread udit agarwal
Hi,
 Thanks for your reply.
 The output of 'modinfo ceph' is as follows:
 filename:   /lib/modules/3.1.10-1.9-desktop/kernel/fs/ceph/ceph.ko
license:GPL
description:Ceph filesystem for Linux
author: Patience Warnick patie...@newdream.net
author: Yehuda Sadeh yeh...@hq.newdream.net
author: Sage Weil s...@newdream.net
srcversion: AFEFF779535E750AFD4072D
depends:   
vermagic:   3.1.10-1.9-desktop SMP preempt mod_unload modversions

And my ceph.conf file is as follows:

[global]
   pid file = /var/run/ceph/$name.pid
   logger dir = /var/log/ceph
   log dir = /var/log/ceph
   user = root
[mon]
   mon data = /var/local/data/mon$id
;   debug ms = 1
;   debug mon = 20
;   debug paxos = 20
[mon.0]
   host = hp1
   mon addr = 192.168.20.6:6789
;[mon.1]
;   host = hp2
;   mon addr = 192.168.20.7:6789
;[mon.2]
;   host = bb1
;   mon addr = 192.168.20.2:6789
[mds]
;   debug ms = 1; message traffic
;   debug mds = 1   ; mds
;   debug mds balancer = 20 ; load balancing
;   debug mds log = 20  ; mds journaling
;   debug mds_migrator = 20 ; metadata migration
;   debug monc = 20 ; monitor interaction, startup
[mds.0]
   host = hp1
;[mds.1]
;   host = hp2
[osd]
   osd journal = /var/local/data/osd$id/journal
   osd journal size = 1
   filestore journal writeahead = true
   osd data = /var/local/data/osd$id
;   debug ms = 1; message traffic
;   debug osd = 20
;   debug filestore = 20; local object storage
;   debug journal = 20  ; local journaling
;   debug monc = 20 ; monitor interaction, startup
[osd.0]
   host = el1
   btrfs devs = /dev/sda3
[osd.1]
   host = el1
   btrfs devs = /dev/sdb
[osd.2]
   host = bb1
   btrfs devs = /dev/sda3

No, I don't think any of them crashed.
Thanks in advance and let me know if you need further info.

--Udit

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 07/13] libceph: embed ceph connection structure in mon_client

2012-05-31 Thread Sage Weil
On Wed, 30 May 2012, Alex Elder wrote:
 A monitor client has a pointer to a ceph connection structure in it.
 This is the only one of the three ceph client types that do it this
 way; the OSD and MDS clients embed the connection into their main
 structures.  There is always exactly one ceph connection for a
 monitor client, so there is no need to allocate it separate from the
 monitor client structure.
 
 So switch the ceph_mon_client structure to embed its
 ceph_connection structure.
 
 Signed-off-by: Alex Elder el...@inktank.com
 ---
  include/linux/ceph/mon_client.h |2 +-
  net/ceph/mon_client.c   |   47 --
  2 files changed, 21 insertions(+), 28 deletions(-)
 
 diff --git a/include/linux/ceph/mon_client.h b/include/linux/ceph/mon_client.h
 index 545f859..2113e38 100644
 --- a/include/linux/ceph/mon_client.h
 +++ b/include/linux/ceph/mon_client.h
 @@ -70,7 +70,7 @@ struct ceph_mon_client {
   bool hunting;
   int cur_mon;   /* last monitor i contacted */
   unsigned long sub_sent, sub_renew_after;
 - struct ceph_connection *con;
 + struct ceph_connection con;
   bool have_fsid;
 
   /* pending generic requests */
 diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
 index 704dc95..ac4d6b1 100644
 --- a/net/ceph/mon_client.c
 +++ b/net/ceph/mon_client.c
 @@ -106,9 +106,9 @@ static void __send_prepared_auth_request(struct
 ceph_mon_client *monc, int len)
   monc-pending_auth = 1;
   monc-m_auth-front.iov_len = len;
   monc-m_auth-hdr.front_len = cpu_to_le32(len);
 - ceph_con_revoke(monc-con, monc-m_auth);
 + ceph_con_revoke(monc-con, monc-m_auth);
   ceph_msg_get(monc-m_auth);  /* keep our ref */
 - ceph_con_send(monc-con, monc-m_auth);
 + ceph_con_send(monc-con, monc-m_auth);
  }
 
  /*
 @@ -117,8 +117,8 @@ static void __send_prepared_auth_request(struct
 ceph_mon_client *monc, int len)
  static void __close_session(struct ceph_mon_client *monc)
  {
   dout(__close_session closing mon%d\n, monc-cur_mon);
 - ceph_con_revoke(monc-con, monc-m_auth);
 - ceph_con_close(monc-con);
 + ceph_con_revoke(monc-con, monc-m_auth);
 + ceph_con_close(monc-con);
   monc-cur_mon = -1;
   monc-pending_auth = 0;
   ceph_auth_reset(monc-auth);
 @@ -142,9 +142,9 @@ static int __open_session(struct ceph_mon_client *monc)
   monc-want_next_osdmap = !!monc-want_next_osdmap;
 
   dout(open_session mon%d opening\n, monc-cur_mon);
 - monc-con-peer_name.type = CEPH_ENTITY_TYPE_MON;
 - monc-con-peer_name.num = cpu_to_le64(monc-cur_mon);
 - ceph_con_open(monc-con,
 + monc-con.peer_name.type = CEPH_ENTITY_TYPE_MON;
 + monc-con.peer_name.num = cpu_to_le64(monc-cur_mon);
 + ceph_con_open(monc-con,
 monc-monmap-mon_inst[monc-cur_mon].addr);
 
   /* initiatiate authentication handshake */
 @@ -226,8 +226,8 @@ static void __send_subscribe(struct ceph_mon_client *monc)
 
   msg-front.iov_len = p - msg-front.iov_base;
   msg-hdr.front_len = cpu_to_le32(msg-front.iov_len);
 - ceph_con_revoke(monc-con, msg);
 - ceph_con_send(monc-con, ceph_msg_get(msg));
 + ceph_con_revoke(monc-con, msg);
 + ceph_con_send(monc-con, ceph_msg_get(msg));
 
   monc-sub_sent = jiffies | 1;  /* never 0 */
   }
 @@ -247,7 +247,7 @@ static void handle_subscribe_ack(struct ceph_mon_client
 *monc,
   if (monc-hunting) {
   pr_info(mon%d %s session established\n,
   monc-cur_mon,
 - ceph_pr_addr(monc-con-peer_addr.in_addr));
 + ceph_pr_addr(monc-con.peer_addr.in_addr));
   monc-hunting = false;
   }
   dout(handle_subscribe_ack after %d seconds\n, seconds);
 @@ -461,7 +461,7 @@ static int do_generic_request(struct ceph_mon_client
 *monc,
   req-request-hdr.tid = cpu_to_le64(req-tid);
   __insert_generic_request(monc, req);
   monc-num_generic_requests++;
 - ceph_con_send(monc-con, ceph_msg_get(req-request));
 + ceph_con_send(monc-con, ceph_msg_get(req-request));
   mutex_unlock(monc-mutex);
 
   err = wait_for_completion_interruptible(req-completion);
 @@ -684,8 +684,8 @@ static void __resend_generic_request(struct
 ceph_mon_client *monc)
 
   for (p = rb_first(monc-generic_request_tree); p; p = rb_next(p)) {
   req = rb_entry(p, struct ceph_mon_generic_request, node);
 - ceph_con_revoke(monc-con, req-request);
 - ceph_con_send(monc-con, ceph_msg_get(req-request));
 + ceph_con_revoke(monc-con, req-request);
 + ceph_con_send(monc-con, ceph_msg_get(req-request));
   }
  }
 
 @@ -705,7 +705,7 @@ static void delayed_work(struct work_struct *work)
   __close_session(monc);
   

Re: [PATCH 08/13] libceph: start separating connection flags from state

2012-05-31 Thread Sage Weil
On Wed, 30 May 2012, Alex Elder wrote:
 A ceph_connection holds a mixture of connection state (as in state
 machine state) and connection flags in a single state field.  To
 make the distinction more clear, define a new flags field and use
 it rather than the state field to hold Boolean flag values.
 
 Signed-off-by: Alex Elder el...@inktank.com
 ---
  include/linux/ceph/messenger.h |   18 +
  net/ceph/messenger.c   |   50
 
  2 files changed, 37 insertions(+), 31 deletions(-)
 
 diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
 index 3fbd4be..920235e 100644
 --- a/include/linux/ceph/messenger.h
 +++ b/include/linux/ceph/messenger.h
 @@ -103,20 +103,25 @@ struct ceph_msg_pos {
  #define MAX_DELAY_INTERVAL   (5 * 60 * HZ)
 
  /*
 - * ceph_connection state bit flags
 + * ceph_connection flag bits
   */
 +
  #define LOSSYTX 0  /* we can close channel or drop messages on errors
 */
 -#define CONNECTING   1
 -#define NEGOTIATING  2
  #define KEEPALIVE_PENDING  3
  #define WRITE_PENDING4  /* we have data ready to send */
 +#define SOCK_CLOSED  11 /* socket state changed to closed */
 +#define BACKOFF 15
 +
 +/*
 + * ceph_connection states
 + */
 +#define CONNECTING   1
 +#define NEGOTIATING  2
  #define STANDBY  8  /* no outgoing messages, socket closed.  we
 keep
   * the ceph_connection around to maintain shared
   * state with the peer. */
  #define CLOSED   10 /* we've closed the connection */
 -#define SOCK_CLOSED  11 /* socket state changed to closed */
  #define OPENING 13 /* open connection w/ (possibly new) peer */
 -#define BACKOFF 15

Later it might be worth prefixing these with FLAG_ and/or STATE_.

Reviewed-by: Sage Weil s...@inktank.com


 
  /*
   * A single connection with another host.
 @@ -133,7 +138,8 @@ struct ceph_connection {
 
   struct ceph_messenger *msgr;
   struct socket *sock;
 - unsigned long state;/* connection state (see flags above) */
 + unsigned long flags;
 + unsigned long state;
   const char *error_msg;  /* error message, if any */
 
   struct ceph_entity_addr peer_addr; /* peer address */
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 19f1948..29055df 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -176,7 +176,7 @@ static void ceph_sock_write_space(struct sock *sk)
* buffer. See net/ipv4/tcp_input.c:tcp_check_space()
* and net/core/stream.c:sk_stream_write_space().
*/
 - if (test_bit(WRITE_PENDING, con-state)) {
 + if (test_bit(WRITE_PENDING, con-flags)) {
   if (sk_stream_wspace(sk) = sk_stream_min_wspace(sk)) {
   dout(%s %p queueing write work\n, __func__, con);
   clear_bit(SOCK_NOSPACE, sk-sk_socket-flags);
 @@ -203,7 +203,7 @@ static void ceph_sock_state_change(struct sock *sk)
   dout(%s TCP_CLOSE\n, __func__);
   case TCP_CLOSE_WAIT:
   dout(%s TCP_CLOSE_WAIT\n, __func__);
 - if (test_and_set_bit(SOCK_CLOSED, con-state) == 0) {
 + if (test_and_set_bit(SOCK_CLOSED, con-flags) == 0) {
   if (test_bit(CONNECTING, con-state))
   con-error_msg = connection failed;
   else
 @@ -393,9 +393,9 @@ void ceph_con_close(struct ceph_connection *con)
ceph_pr_addr(con-peer_addr.in_addr));
   set_bit(CLOSED, con-state);  /* in case there's queued work */
   clear_bit(STANDBY, con-state);  /* avoid connect_seq bump */
 - clear_bit(LOSSYTX, con-state);  /* so we retry next connect */
 - clear_bit(KEEPALIVE_PENDING, con-state);
 - clear_bit(WRITE_PENDING, con-state);
 + clear_bit(LOSSYTX, con-flags);  /* so we retry next connect */
 + clear_bit(KEEPALIVE_PENDING, con-flags);
 + clear_bit(WRITE_PENDING, con-flags);
   mutex_lock(con-mutex);
   reset_connection(con);
   con-peer_global_seq = 0;
 @@ -612,7 +612,7 @@ static void prepare_write_message(struct ceph_connection
 *con)
   prepare_write_message_footer(con);
   }
 
 - set_bit(WRITE_PENDING, con-state);
 + set_bit(WRITE_PENDING, con-flags);
  }
 
  /*
 @@ -633,7 +633,7 @@ static void prepare_write_ack(struct ceph_connection *con)
   con-out_temp_ack);
 
   con-out_more = 1;  /* more will follow.. eventually.. */
 - set_bit(WRITE_PENDING, con-state);
 + set_bit(WRITE_PENDING, con-flags);
  }
 
  /*
 @@ -644,7 +644,7 @@ static void prepare_write_keepalive(struct ceph_connection
 *con)
   dout(prepare_write_keepalive %p\n, con);
   con_out_kvec_reset(con);
   con_out_kvec_add(con, sizeof (tag_keepalive), tag_keepalive);
 - set_bit(WRITE_PENDING, con-state);
 + set_bit(WRITE_PENDING, con-flags);
  }
 
  /*
 @@ -673,7 +673,7 @@ 

Re: [PATCH 09/13] libceph: start tracking connection socket state

2012-05-31 Thread Sage Weil
On Wed, 30 May 2012, Alex Elder wrote:
 Start explicitly keeping track of the state of a ceph connection's
 socket, separate from the state of the connection itself.  Create
 placeholder functions to encapsulate the state transitions.
 
 
 | NEW* |  transient initial state
 
 | con_sock_state_init()
 v
 --
 | CLOSED |  initialized, but no socket (and no
 --  TCP connection)
  ^  \
  |   \ con_sock_state_connecting()
  |--
  |  \
  + con_sock_state_closed()   \
  |\   \
  | \   \
  |  --- \
  |  | CLOSING |  socket event;   \
  |  ---  await close  \
  |   ^|
  |   ||
  |   + con_sock_state_closing()   |
  |  / \   |
  | /   ---|
  |/   \   v
  |   /--
  |  /-| CONNECTING |  socket created, TCP
  |  |   / --  connect initiated
  |  |   | con_sock_state_connected()
  |  |   v
 -
 | CONNECTED |  TCP connection established
 -

Can we put this beautiful picture in the header next to the states?

Reviewed-by: Sage Weil s...@inktank.com

 
 Make the socket state an atomic variable, reinforcing that it's a
  distinct transition with no possible intermediate/both states.
 This is almost certainly overkill at this point, though the
 transitions into CONNECTED and CLOSING state do get called via
 socket callback (the rest of the transitions occur with the
 connection mutex held).  We can back out the atomicity later.
 
 Signed-off-by: Alex Elder el...@inktank.com
 ---
  include/linux/ceph/messenger.h |8 -
  net/ceph/messenger.c   |   63
 
  2 files changed, 69 insertions(+), 2 deletions(-)
 
 diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
 index 920235e..5e852f4 100644
 --- a/include/linux/ceph/messenger.h
 +++ b/include/linux/ceph/messenger.h
 @@ -137,14 +137,18 @@ struct ceph_connection {
   const struct ceph_connection_operations *ops;
 
   struct ceph_messenger *msgr;
 +
 + atomic_t sock_state;
   struct socket *sock;
 + struct ceph_entity_addr peer_addr; /* peer address */
 + struct ceph_entity_addr peer_addr_for_me;
 +
   unsigned long flags;
   unsigned long state;
   const char *error_msg;  /* error message, if any */
 
 - struct ceph_entity_addr peer_addr; /* peer address */
   struct ceph_entity_name peer_name; /* peer name */
 - struct ceph_entity_addr peer_addr_for_me;
 +
   unsigned peer_features;
   u32 connect_seq;  /* identify the most recent connection
attempt for this connection, client */
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 29055df..7e11b07 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -29,6 +29,14 @@
   * the sender.
   */
 
 +/* State values for ceph_connection-sock_state; NEW is assumed to be 0 */
 +
 +#define CON_SOCK_STATE_NEW   0   /* - CLOSED */
 +#define CON_SOCK_STATE_CLOSED1   /* - CONNECTING */
 +#define CON_SOCK_STATE_CONNECTING2   /* - CONNECTED or - CLOSING
 */
 +#define CON_SOCK_STATE_CONNECTED 3   /* - CLOSING or - CLOSED */
 +#define CON_SOCK_STATE_CLOSING   4   /* - CLOSED */
 +
  /* static tag bytes (protocol control messages) */
  static char tag_msg = CEPH_MSGR_TAG_MSG;
  static char tag_ack = CEPH_MSGR_TAG_ACK;
 @@ -147,6 +155,54 @@ void ceph_msgr_flush(void)
  }
  EXPORT_SYMBOL(ceph_msgr_flush);
 
 +/* Connection socket state transition functions */
 +
 +static void con_sock_state_init(struct ceph_connection *con)
 +{
 + int old_state;
 +
 + old_state = atomic_xchg(con-sock_state, CON_SOCK_STATE_CLOSED);
 + if (WARN_ON(old_state != CON_SOCK_STATE_NEW))
 + printk(%s: unexpected old state %d\n, __func__, old_state);
 +}
 +
 +static void con_sock_state_connecting(struct ceph_connection *con)
 +{
 + int old_state;
 +
 + old_state = atomic_xchg(con-sock_state, CON_SOCK_STATE_CONNECTING);
 + if (WARN_ON(old_state != CON_SOCK_STATE_CLOSED))
 + printk(%s: unexpected old state %d\n, __func__, old_state);
 +}
 +
 +static void con_sock_state_connected(struct ceph_connection *con)
 +{
 + int old_state;
 +
 + old_state = atomic_xchg(con-sock_state, CON_SOCK_STATE_CONNECTED);
 + if (WARN_ON(old_state != CON_SOCK_STATE_CONNECTING))
 + printk(%s: unexpected old state %d\n, __func__, old_state);
 +}
 +
 +static void con_sock_state_closing(struct 

Re: [PATCH 11/13] libceph: init monitor connection when opening

2012-05-31 Thread Sage Weil
yep!

On Wed, 30 May 2012, Alex Elder wrote:

 Hold off initializing a monitor client's connection until just
 before it gets opened for use.
 
 Signed-off-by: Alex Elder el...@inktank.com
 ---
  net/ceph/mon_client.c |   13 ++---
  1 files changed, 6 insertions(+), 7 deletions(-)
 
 diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
 index ac4d6b1..77da480 100644
 --- a/net/ceph/mon_client.c
 +++ b/net/ceph/mon_client.c
 @@ -119,6 +119,7 @@ static void __close_session(struct ceph_mon_client *monc)
   dout(__close_session closing mon%d\n, monc-cur_mon);
   ceph_con_revoke(monc-con, monc-m_auth);
   ceph_con_close(monc-con);
 + monc-con.private = NULL;
   monc-cur_mon = -1;
   monc-pending_auth = 0;
   ceph_auth_reset(monc-auth);
 @@ -141,9 +142,13 @@ static int __open_session(struct ceph_mon_client *monc)
   monc-sub_renew_after = jiffies;  /* i.e., expired */
   monc-want_next_osdmap = !!monc-want_next_osdmap;
 
 - dout(open_session mon%d opening\n, monc-cur_mon);
 + ceph_con_init(monc-client-msgr, monc-con);
 + monc-con.private = monc;
 + monc-con.ops = mon_con_ops;
   monc-con.peer_name.type = CEPH_ENTITY_TYPE_MON;
   monc-con.peer_name.num = cpu_to_le64(monc-cur_mon);
 +
 + dout(open_session mon%d opening\n, monc-cur_mon);
   ceph_con_open(monc-con,
 monc-monmap-mon_inst[monc-cur_mon].addr);
 
 @@ -760,10 +765,6 @@ int ceph_monc_init(struct ceph_mon_client *monc, struct
 ceph_client *cl)
   goto out;
 
   /* connection */
 - ceph_con_init(monc-client-msgr, monc-con);
 - monc-con.private = monc;
 - monc-con.ops = mon_con_ops;
 -
   /* authentication */
   monc-auth = ceph_auth_init(cl-options-name,
   cl-options-key);
 @@ -836,8 +837,6 @@ void ceph_monc_stop(struct ceph_mon_client *monc)
   mutex_lock(monc-mutex);
   __close_session(monc);
 
 - monc-con.private = NULL;
 -
   mutex_unlock(monc-mutex);
 
   ceph_auth_destroy(monc-auth);
 -- 
 1.7.5.4
 
 
 
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 12/13] libceph: fully initialize connection in con_init()

2012-05-31 Thread Sage Weil
Reviewed-by: Sage Weil s...@inktank.com

On Wed, 30 May 2012, Alex Elder wrote:

 Move the initialization of a ceph connection's private pointer,
 operations vector pointer, and peer name information into
 ceph_con_init().  Rearrange the arguments so the connection pointer
 is first.  Hide the byte-swapping of the peer entity number inside
 ceph_con_init()
 
 Signed-off-by: Alex Elder el...@inktank.com
 ---
  fs/ceph/mds_client.c   |7 ++-
  include/linux/ceph/messenger.h |6 --
  net/ceph/messenger.c   |9 -
  net/ceph/mon_client.c  |8 +++-
  net/ceph/osd_client.c  |7 ++-
  5 files changed, 19 insertions(+), 18 deletions(-)
 
 diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
 index ad30261..ecd7f15 100644
 --- a/fs/ceph/mds_client.c
 +++ b/fs/ceph/mds_client.c
 @@ -394,11 +394,8 @@ static struct ceph_mds_session *register_session(struct
 ceph_mds_client *mdsc,
 	s->s_seq = 0;
 	mutex_init(&s->s_mutex);
 
-	ceph_con_init(&mdsc->fsc->client->msgr, &s->s_con);
-	s->s_con.private = s;
-	s->s_con.ops = &mds_con_ops;
-	s->s_con.peer_name.type = CEPH_ENTITY_TYPE_MDS;
-	s->s_con.peer_name.num = cpu_to_le64(mds);
+	ceph_con_init(&s->s_con, s, &mds_con_ops, &mdsc->fsc->client->msgr,
+		      CEPH_ENTITY_TYPE_MDS, mds);
 
 	spin_lock_init(&s->s_gen_ttl_lock);
 	s->s_cap_gen = 0;
 diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
 index 5e852f4..dd27837 100644
 --- a/include/linux/ceph/messenger.h
 +++ b/include/linux/ceph/messenger.h
 @@ -227,8 +227,10 @@ extern void ceph_messenger_init(struct ceph_messenger
 *msgr,
   u32 required_features,
   bool nocrc);
 
 -extern void ceph_con_init(struct ceph_messenger *msgr,
 -   struct ceph_connection *con);
 +extern void ceph_con_init(struct ceph_connection *con, void *private,
 + const struct ceph_connection_operations *ops,
 + struct ceph_messenger *msgr, __u8 entity_type,
 + __u64 entity_num);
  extern void ceph_con_open(struct ceph_connection *con,
 struct ceph_entity_addr *addr);
  extern bool ceph_con_opened(struct ceph_connection *con);
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index 7e11b07..cdf8299 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -514,15 +514,22 @@ void ceph_con_put(struct ceph_connection *con)
  /*
   * initialize a new connection.
   */
 -void ceph_con_init(struct ceph_messenger *msgr, struct ceph_connection *con)
 +void ceph_con_init(struct ceph_connection *con, void *private,
 + const struct ceph_connection_operations *ops,
 + struct ceph_messenger *msgr, __u8 entity_type, __u64 entity_num)
  {
 	dout("con_init %p\n", con);
 	memset(con, 0, sizeof(*con));
+	con->private = private;
 	atomic_set(&con->nref, 1);
+	con->ops = ops;
 	con->msgr = msgr;
 
 	con_sock_state_init(con);
 
+	con->peer_name.type = (__u8) entity_type;
+	con->peer_name.num = cpu_to_le64(entity_num);
+
 	mutex_init(&con->mutex);
 	INIT_LIST_HEAD(&con->out_queue);
 	INIT_LIST_HEAD(&con->out_sent);
 diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
 index 77da480..9b4cef9 100644
 --- a/net/ceph/mon_client.c
 +++ b/net/ceph/mon_client.c
 @@ -142,11 +142,9 @@ static int __open_session(struct ceph_mon_client *monc)
 	monc->sub_renew_after = jiffies;  /* i.e., expired */
 	monc->want_next_osdmap = !!monc->want_next_osdmap;
 
-	ceph_con_init(&monc->client->msgr, &monc->con);
-	monc->con.private = monc;
-	monc->con.ops = &mon_con_ops;
-	monc->con.peer_name.type = CEPH_ENTITY_TYPE_MON;
-	monc->con.peer_name.num = cpu_to_le64(monc->cur_mon);
+	ceph_con_init(&monc->con, monc, &mon_con_ops,
+		      &monc->client->msgr,
+		      CEPH_ENTITY_TYPE_MON, monc->cur_mon);
 
 	dout("open_session mon%d opening\n", monc->cur_mon);
 	ceph_con_open(&monc->con,
 diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
 index e30efbc..1f3951a 100644
 --- a/net/ceph/osd_client.c
 +++ b/net/ceph/osd_client.c
 @@ -640,11 +640,8 @@ static struct ceph_osd *create_osd(struct ceph_osd_client
 *osdc, int onum)
 	INIT_LIST_HEAD(&osd->o_osd_lru);
 	osd->o_incarnation = 1;
 
-	ceph_con_init(&osdc->client->msgr, &osd->o_con);
-	osd->o_con.private = osd;
-	osd->o_con.ops = &osd_con_ops;
-	osd->o_con.peer_name.type = CEPH_ENTITY_TYPE_OSD;
-	osd->o_con.peer_name.num = cpu_to_le64(onum);
+	ceph_con_init(&osd->o_con, osd, &osd_con_ops, &osdc->client->msgr,
+		      CEPH_ENTITY_TYPE_OSD, onum);
 
 	INIT_LIST_HEAD(&osd->o_keepalive_item);
   return osd;
 -- 
 1.7.5.4
 

Re: [PATCH 13/13] libceph: set CLOSED state bit in con_init

2012-05-31 Thread Sage Weil
Reviewed-by: Sage Weil s...@inktank.com

On Wed, 30 May 2012, Alex Elder wrote:

 Once a connection is fully initialized, it is really in a CLOSED
 state, so make that explicit by setting the bit in its state field.
 
 It is possible for a connection in NEGOTIATING state to get a
 failure, leading to ceph_fault() and ultimately ceph_con_close().
 Clear that bit if it is set in that case, to reflect that the
 connection truly is closed and is no longer participating in a
 connect sequence.
 
 Issue a warning if ceph_con_open() is called on a connection that
 is not in CLOSED state.
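 
 In other words, after this change the only state a connection may be opened
 from is CLOSED, and every close (including the fault path) returns it to
 CLOSED. A tiny user-space model of just that invariant (an illustration
 only, not the messenger code):
 
 #include <assert.h>
 #include <stdio.h>
 
 enum con_state { CON_CLOSED, CON_OPENING };
 
 struct conn { enum con_state state; };
 
 static void conn_init(struct conn *c)  { c->state = CON_CLOSED; }  /* init sets CLOSED */
 static void conn_close(struct conn *c) { c->state = CON_CLOSED; }  /* close always ends in CLOSED */
 
 static int conn_open(struct conn *c)
 {
 	if (c->state != CON_CLOSED) {	/* the WARN_ON case in the patch below */
 		fprintf(stderr, "conn_open: unexpected state %d\n", c->state);
 		return -1;
 	}
 	c->state = CON_OPENING;
 	return 0;
 }
 
 int main(void)
 {
 	struct conn c;
 
 	conn_init(&c);
 	assert(conn_open(&c) == 0);	/* CLOSED -> OPENING is legal */
 	conn_close(&c);			/* fault path lands back in CLOSED */
 	assert(conn_open(&c) == 0);	/* so reopening is legal again */
 	assert(conn_open(&c) != 0);	/* opening a non-CLOSED connection warns */
 	return 0;
 }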
 
 Signed-off-by: Alex Elder el...@inktank.com
 ---
  net/ceph/messenger.c |8 +++-
  1 files changed, 7 insertions(+), 1 deletions(-)
 
 diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
 index cdf8299..85bfe12 100644
 --- a/net/ceph/messenger.c
 +++ b/net/ceph/messenger.c
 @@ -452,10 +452,13 @@ void ceph_con_close(struct ceph_connection *con)
 	dout("con_close %p peer %s\n", con,
 	     ceph_pr_addr(&con->peer_addr.in_addr));
 	set_bit(CLOSED, &con->state);  /* in case there's queued work */
+	clear_bit(NEGOTIATING, &con->state);
 	clear_bit(STANDBY, &con->state);  /* avoid connect_seq bump */
+
 	clear_bit(LOSSYTX, &con->flags);  /* so we retry next connect */
 	clear_bit(KEEPALIVE_PENDING, &con->flags);
 	clear_bit(WRITE_PENDING, &con->flags);
+
 	mutex_lock(&con->mutex);
 	reset_connection(con);
 	con->peer_global_seq = 0;
 @@ -472,7 +475,8 @@ void ceph_con_open(struct ceph_connection *con, struct
 ceph_entity_addr *addr)
  {
 	dout("con_open %p %s\n", con, ceph_pr_addr(&addr->in_addr));
 	set_bit(OPENING, &con->state);
-	clear_bit(CLOSED, &con->state);
+	WARN_ON(!test_and_clear_bit(CLOSED, &con->state));
+
 	memcpy(&con->peer_addr, addr, sizeof(*addr));
 	con->delay = 0;      /* reset backoff memory */
   queue_con(con);
 @@ -534,6 +538,8 @@ void ceph_con_init(struct ceph_connection *con, void
 *private,
 	INIT_LIST_HEAD(&con->out_queue);
 	INIT_LIST_HEAD(&con->out_sent);
 	INIT_DELAYED_WORK(&con->work, con_work);
+
+	set_bit(CLOSED, &con->state);
  }
  EXPORT_SYMBOL(ceph_con_init);
 
 -- 
 1.7.5.4
 


Re: rbd rm image slow with big images ?

2012-05-31 Thread Alexandre DERUMIER

That said, the current implementation is also stupid: it's doing a single 
io at a time. #2256 (next sprint) will parallelize this to make it go 
much faster (probably an order of magnitude?). 

Ah, OK, that's why I see so little IO and network traffic during the delete.

Thanks Sage and Wido for the explanations, that's very clear!
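
To illustrate the idea behind #2256, here is a small self-contained sketch in C
(not librados or rbd code: delete_one() and the object count are made-up
stand-ins for per-object removals) of keeping a bounded window of deletions in
flight instead of issuing them one at a time:

/* Toy model of windowed-parallel object removal.  delete_one() stands in
 * for one backing-object delete; WINDOW is how many removals are kept in
 * flight at once.  Build with: cc -pthread -o rmdemo rmdemo.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_OBJECTS 64	/* a real 1 TB image has far more backing objects */
#define WINDOW      8	/* deletions in flight at any one time */

struct job { int first; };

static void delete_one(int obj)
{
	usleep(10000);			/* pretend to wait for one removal */
	printf("removed object %d\n", obj);
}

static void *worker(void *arg)
{
	struct job *j = arg;
	int obj;

	for (obj = j->first; obj < NUM_OBJECTS; obj += WINDOW)
		delete_one(obj);
	return NULL;
}

int main(void)
{
	pthread_t tid[WINDOW];
	struct job jobs[WINDOW];
	int i;

	for (i = 0; i < WINDOW; i++) {
		jobs[i].first = i;
		pthread_create(&tid[i], NULL, worker, &jobs[i]);
	}
	for (i = 0; i < WINDOW; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

With WINDOW removals outstanding, the total time is bounded by the slowest
stripe of objects rather than the sum of every per-object round trip, which is
roughly in line with the order-of-magnitude estimate above.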



----- Original Message ----- 

From: Sage Weil s...@inktank.com 
To: Wido den Hollander w...@widodh.nl 
Cc: Alexandre DERUMIER aderum...@odiso.com, ceph-devel@vger.kernel.org 
Sent: Thursday, 31 May 2012, 20:19:44 
Subject: Re: rbd rm image slow with big images? 

On Thu, 31 May 2012, Wido den Hollander wrote: 
 Hi, 
  Is it the normal behaviour? Maybe some XFS tuning could help? 
 
 It's in the nature of RBD. 

Yes. 

That said, the current implementation is also stupid: it's doing a single 
io at a time. #2256 (next sprint) will parallelize this to make it go 
much faster (probably an order of magnitude?). 

sage 



-- 

Alexandre Derumier 
Systems Engineer 
Phone: 03 20 68 88 90 
Fax: 03 20 68 90 81 
45 Bvd du Général Leclerc 59100 Roubaix - France 
12 rue Marivaux 75002 Paris - France 



Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-05-31 Thread Nam Dang
I pulled the Java lib from https://github.com/noahdesu/ceph/tree/wip-java-cephfs
However, I use Ceph 0.47.1 installed directly from Ubuntu's repository
with apt-get, not the build that includes the Java library; I assumed
that would be fine, since the Java lib is just a wrapper.

 There are only two segfaults that I've ever encountered, one in which the C
 wrappers are used with an unmounted client, and the error Nam is seeing
 (although they could be related). I will re-submit an updated patch for the
 former, which should rule that out as the culprit.

No, this occurs when I call mount(null) with the monitor taken down.
The library should throw an Exception instead, but since the SIGSEGV
originates from libcephfs.so, I guess it's more related to Ceph's
internal code.

Best regards,

Nam Dang
Tokyo Institute of Technology
Tokyo, Japan


On Fri, Jun 1, 2012 at 8:58 AM, Noah Watkins jayh...@cs.ucsc.edu wrote:

 On May 31, 2012, at 3:39 PM, Greg Farnum wrote:

 Nevermind to my last comment. Hmm, I've seen this, but very rarely.
 Noah, do you have any leads on this? Do you think it's a bug in your Java 
 code or in the C/++ libraries?

 I _think_ this is because the JVM uses its own threading library, and Ceph 
 assumes pthreads and pthread compatible mutexes--is that assumption about 
 Ceph correct? Hence the error that looks like Mutex::lock(bool) being 
 referenced for context during the segfault. To verify this, all that is needed 
 is some synchronization added to the Java.

 There are only two segfaults that I've ever encountered, one in which the C 
 wrappers are used with an unmounted client, and the error Nam is seeing 
 (although they could be related). I will re-submit an updated patch for the 
 former, which should rule that out as the culprit.

 Nam: where are you grabbing the Java patches from? I'll push some updates.


 The only other scenario that comes to mind is related to signaling:

 The RADOS Java wrappers suffered from an interaction between the JVM and 
 RADOS client signal handlers, in which either the JVM or RADOS would replace 
 the handlers for the other (not sure which order). Anyway, the solution was 
 to link in the JVM libjsig.so signal chaining library. This might be the same 
 thing we are seeing here, but I'm betting it is the first theory I mentioned.

 - Noah


Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-05-31 Thread Nam Dang
I made a mistake in the previous email. As Noah said, this problem is
due to the wrapper being used with an unsuccessfully mounted client.
However, I think if the mount fails, the wrapper should throw an
exception instead of letting the client continue.
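
For comparison, this is roughly what the check looks like through the
libcephfs C API: a minimal sketch (assuming a readable default ceph.conf and
doing no real work after mounting) that stops when ceph_mount() fails instead
of carrying on with an unmounted client. The Java wrapper would surface the
same negative return as an exception.

#include <stdio.h>
#include <cephfs/libcephfs.h>

int main(void)
{
	struct ceph_mount_info *cmount;
	int ret;

	ret = ceph_create(&cmount, NULL);	/* NULL = default client id */
	if (ret < 0) {
		fprintf(stderr, "ceph_create failed: %d\n", ret);
		return 1;
	}
	ceph_conf_read_file(cmount, NULL);	/* NULL = default config search path */

	ret = ceph_mount(cmount, "/");
	if (ret < 0) {
		/* e.g. monitors unreachable: report it and stop, do not keep
		 * using cmount as if it were mounted -- this is the case the
		 * JNI layer should turn into a Java exception */
		fprintf(stderr, "ceph_mount failed: %d\n", ret);
		ceph_shutdown(cmount);
		return 1;
	}

	/* filesystem calls are only safe from here on */
	ceph_unmount(cmount);
	ceph_shutdown(cmount);
	return 0;
}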

Best regards,
Nam Dang
Tokyo Institute of Technology
Tokyo, Japan


On Fri, Jun 1, 2012 at 1:44 PM, Nam Dang n...@de.cs.titech.ac.jp wrote:
 I pulled the Java lib from 
 https://github.com/noahdesu/ceph/tree/wip-java-cephfs
 However, I use Ceph 0.47.1 installed directly from Ubuntu's repository
 with apt-get, not the build that includes the Java library; I assumed
 that would be fine, since the Java lib is just a wrapper.

 There are only two segfaults that I've ever encountered, one in which the C
 wrappers are used with an unmounted client, and the error Nam is seeing
 (although they could be related). I will re-submit an updated patch for the
 former, which should rule that out as the culprit.

 No, this occurs when I call mount(null) with the monitor taken down.
 The library should throw an Exception instead, but since the SIGSEGV
 originates from libcephfs.so, I guess it's more related to Ceph's
 internal code.

 Best regards,

 Nam Dang
 Tokyo Institute of Technology
 Tokyo, Japan


 On Fri, Jun 1, 2012 at 8:58 AM, Noah Watkins jayh...@cs.ucsc.edu wrote:

 On May 31, 2012, at 3:39 PM, Greg Farnum wrote:

 Nevermind to my last comment. Hmm, I've seen this, but very rarely.
 Noah, do you have any leads on this? Do you think it's a bug in your Java 
 code or in the C/++ libraries?

 I _think_ this is because the JVM uses its own threading library, and Ceph 
 assumes pthreads and pthread compatible mutexes--is that assumption about 
 Ceph correct? Hence the error that looks like Mutex::lock(bool) being 
 referenced for context during the segfault. To verify this, all that is needed 
 is some synchronization added to the Java.

 There are only two segfaults that I've ever encountered, one in which the C 
 wrappers are used with an unmounted client, and the error Nam is seeing 
 (although they could be related). I will re-submit an updated patch for the 
 former, which should rule that out as the culprit.

 Nam: where are you grabbing the Java patches from? I'll push some updates.


 The only other scenario that comes to mind is related to signaling:

 The RADOS Java wrappers suffered from an interaction between the JVM and 
 RADOS client signal handlers, in which either the JVM or RADOS would replace 
 the handlers for the other (not sure which order). Anyway, the solution was 
 to link in the JVM libjsig.so signal chaining library. This might be the 
 same thing we are seeing here, but I'm betting it is the first theory I 
 mentioned.

 - Noah


Re: SIGSEGV in cephfs-java, but probably in Ceph

2012-05-31 Thread Noah Watkins

On May 31, 2012, at 9:44 PM, Nam Dang wrote:

 I pulled the Java lib from 
 https://github.com/noahdesu/ceph/tree/wip-java-cephfs
 However, I use ceph 0.47.1 installed directly from Ubuntu's repository
 with apt-get, not the one that I built with the java library. I
 assumed that since the java lib is just a wrapper.
 
 There are only two segfaults that I've ever encountered, one in which the C 
 wrappers are used with an unmounted client, and the error Nam is seeing 
 (although they
 could be related). I will re-submit an updated patch for the former, which 
 should rule that out as the culprit.
 
 No, this occurs when I call mount(null) with the monitor being taken
 down. The library should throw an Exception instead,

I agree. I'll push changes to the tree soon. Thanks.

- Noah