Wiki Spam
Dear all, Just for information, there is a lot of spam in Ceph's wiki (http://ceph.com/wiki/Special:RecentChanges, http://ceph.com/w/index.php?title=Special:LonelyPages&limit=250&offset=0). Regards, Benoit
Re: poor OSD performance using kernel 3.4 => problem found
Hi Marc, Hi Stefan, first thanks for all your help and time. I found the commit which causes this problem. It is TCP related, but I'm still wondering whether this behaviour is expected. The commit in question is:

git show c43b874d5d714f271b80d4c3f49e05d0cbf51ed2

commit c43b874d5d714f271b80d4c3f49e05d0cbf51ed2
Author: Jason Wang jasow...@redhat.com
Date: Thu Feb 2 00:07:00 2012 +0000

    tcp: properly initialize tcp memory limits

    Commit 4acb4190 tries to fix the using uninitialized value
    introduced by commit 3dc43e3, but it would make the per-socket
    memory limits too small. This patch fixes this and also removes
    the redundant code introduced in 4acb4190.

    Signed-off-by: Jason Wang jasow...@redhat.com
    Acked-by: Glauber Costa glom...@parallels.com
    Signed-off-by: David S. Miller da...@davemloft.net

diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 4cb9cd2..7a7724d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -778,7 +778,6 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
 static __net_init int ipv4_sysctl_init_net(struct net *net)
 {
 	struct ctl_table *table;
-	unsigned long limit;

 	table = ipv4_net_table;
 	if (!net_eq(net, init_net)) {
@@ -815,11 +814,6 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
 	net->ipv4.sysctl_rt_cache_rebuild_count = 4;

 	tcp_init_mem(net);
-	limit = nr_free_buffer_pages() / 8;
-	limit = max(limit, 128UL);
-	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
-	net->ipv4.sysctl_tcp_mem[1] = limit;
-	net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;

 	net->ipv4.ipv4_hdr = register_net_sysctl_table(net, net_ipv4_ctl_path, table);

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a34f5cf..37755cc 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3229,7 +3229,6 @@ __setup("thash_entries=", set_thash_entries);
 void tcp_init_mem(struct net *net)
 {
-	/* Set per-socket limits to no more than 1/128 the pressure threshold */
 	unsigned long limit = nr_free_buffer_pages() / 8;
 	limit = max(limit, 128UL);
 	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
@@ -3298,7 +3297,8 @@ void __init tcp_init(void)
 	sysctl_max_syn_backlog = max(128, cnt / 256);

 	tcp_init_mem(init_net);
-	limit = nr_free_buffer_pages() / 8;
+	/* Set per-socket limits to no more than 1/128 the pressure threshold */
+	limit = nr_free_buffer_pages() << (PAGE_SHIFT - 10);
 	limit = max(limit, 128UL);
 	max_share = min(4UL*1024*1024, limit);

Greets Stefan
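For reference, the limits this code computes end up in procfs, so a regression between kernels can be spotted without reading the source; a minimal shell sketch (standard Linux paths, nothing Ceph-specific assumed):

# tcp_mem is in pages (low / pressure / high); tcp_rmem and tcp_wmem
# are in bytes (min / default / max) - the values this commit changes.
cat /proc/sys/net/ipv4/tcp_mem
cat /proc/sys/net/ipv4/tcp_rmem
cat /proc/sys/net/ipv4/tcp_wmem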
rbd rm image slow with big images ?
Hi, I'm trying to delete some rbd images with rbd rm, and it seems to be slow with big images. I'm testing it by just creating a new image (1TB):

# time rbd -p pool1 create --size 100 image2
real 0m0.031s
user 0m0.015s
sys 0m0.010s

then just deleting it, without having written anything to the image:

# time rbd -p pool1 rm image2
Removing image: 100% complete...done.
real 1m45.558s
user 0m14.683s
sys 0m17.363s

Same test with 100GB:

# time rbd -p pool1 create --size 10 image2
real 0m0.032s
user 0m0.016s
sys 0m0.007s

# time rbd -p pool1 rm image2
Removing image: 100% complete...done.
real 0m10.499s
user 0m1.488s
sys 0m1.720s

I'm using a journal in tmpfs; 3 servers, 15 osds, each with one 15K disk (xfs). Network bandwidth, disk I/O and CPU are all low. Is this the normal behaviour? Maybe some xfs tuning could help?
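One plausible explanation is that rbd rm has to issue a delete for every backing RADOS object the image could have, whether or not it was ever written, so the cost scales with the image size rather than the data written. A hedged sketch for counting those objects (the "rb." prefix matches the format-1 object naming rbd used at the time; check the block_name_prefix reported by "rbd info" before relying on it):

# Inspect the image's object prefix, then count the backing objects
# that "rbd rm" would have to touch.
rbd -p pool1 info image2
rados -p pool1 ls | grep '^rb\.' | wc -l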
Re: poor OSD performance using kernel 3.4 => problem found
On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote:
Hi Marc, Hi Stefan, first thanks for all your help and time. I found the commit which causes this problem. It is TCP related, but I'm still wondering whether this behaviour is expected. The commit in question is: c43b874d5d714f271b80d4c3f49e05d0cbf51ed2 ("tcp: properly initialize tcp memory limits")
[commit message and diff snipped; see the previous message]

Yeah, this might have affected the tcp performance. Looking at the current linus tree this function looks more like it looked beforehand, so it was probably reverted one way or another.

Yehuda
Re: poor OSD performance using kernel 3.4 => problem found
Am 31.05.2012 09:27, schrieb Stefan Majer:
we have set them in /etc/sysctl.conf to:
net.ipv4.tcp_mem = 1000 1000 1000

This does not help ;-(

wow, this was fast!
if i understand this commit correctly it simply skips an in-kernel configuration of network related sysctl parameters, especially net.ipv4.tcp_mem

I also tried this one:
net.ipv4.tcp_rmem = 4096 524287 16777216
net.ipv4.tcp_wmem = 4096 524287 16777216
# grabbed values from 3.0.X
net.ipv4.tcp_mem = 1162962 1550617 2325924

Still no help :-(. But if I use 3.4 and revert the commit, it works fine. I wasn't able to find which other parts are influenced by this limit while browsing through the source. I only found net.ipv4.tcp_mem, net.ipv4.tcp_rmem and net.ipv4.tcp_wmem.

Greets Stefan
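When testing values like these, it is worth confirming the kernel actually accepted them, since a typo in /etc/sysctl.conf fails quietly at boot; a minimal sketch:

# Re-apply /etc/sysctl.conf and print what the kernel is really using;
# the live values, not the file, are authoritative.
sysctl -p
sysctl net.ipv4.tcp_mem net.ipv4.tcp_rmem net.ipv4.tcp_wmem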
Re: poor OSD performance using kernel 3.4 => problem found
Hi Stefan, then you should probably describe this in a short mail to Jason Wang and ask him how to circumvent this commit with sysctl settings. I'm pretty sure my sysctl setting reverts the first part of the commit. So probably the second part is the evil one?

Greetings Stefan

On Thu, May 31, 2012 at 10:04 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote:
[quoted sysctl discussion snipped; see the previous message]

-- Stefan Majer
Re: poor OSD performance using kernel 3.4 => problem found
Am 31.05.2012 10:09, schrieb Stefan Majer:
Hi Stefan, then you should probably describe this in a short mail to Jason Wang and ask him how to circumvent this commit with sysctl settings.

Done, hopefully he can help.

I'm pretty sure my sysctl setting reverts the first part of the commit. So probably the second part is the evil one?

Yes, it seems like that.

Stefan
Re: Wiki Spam
Doh! Thanks for the heads-up. We'll deal with it. Thanks, Mark

On 5/31/12 2:05 AM, SPONEM, Benoît wrote:
[original spam report quoted in full; snipped - see the message above]
Re: poor OSD performance using kernel 3.4 => problem found
Hi Mark, Hi Stefan, I found a way to solve it by comparing /proc/sys/net between a patched and an unpatched kernel. Strangely, the problem occurs when the values are too big (in the new kernel). With the smaller values everything works fine, even under 3.4. Any ideas how that can be? I thought these values should be tuned to a maximum for max performance.

- = old kernel
+ = new kernel

-/proc/sys/net/ipv4/tcp_rmem: 4096 87380 6291456
+/proc/sys/net/ipv4/tcp_rmem: 4096 87380 514873
-/proc/sys/net/ipv4/tcp_wmem: 4096 16384 4194304
+/proc/sys/net/ipv4/tcp_wmem: 4096 16384 514873

Stefan
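A sketch of the comparison described above, assuming you can boot each kernel in turn (the file names are placeholders):

# On each kernel, snapshot every IPv4 TCP-related sysctl into a file
# named after the running kernel, then diff the two snapshots.
grep -r . /proc/sys/net/ipv4/ 2>/dev/null | sort > sysctls-$(uname -r).txt
diff sysctls-3.0*.txt sysctls-3.4*.txt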
Re: poor OSD performance using kernel 3.4 => problem found
Hi Stefan, Please do share! I was planning on starting out on the wiki and eventually getting these kinds of things into the master docs. If you (and others) have already done testing it would be really interesting to compare experiences. So far I've been just kind of throwing stuff into: http://ceph.com/wiki/Performance_analysis In its current form it's pretty inadequate, but I'm hoping to eventually get back to it. A lot of the work I've been doing recently is looking at underlying FS write behavior (specifically seeks) and whether we can get any reasonable improvement through mkfs and mount options.

Mark

On 5/31/12 2:34 AM, Stefan Majer wrote:
Hi, if Stefan confirms this as a solution it might be a good idea to collect some performance optimization hints for osds at http://ceph.com/docs/master, probably separated into:

Gigabit Ethernet based deployments
  with Jumbo Frames
  without Jumbo Frames
10 Gigabit Ethernet based deployments
  with Jumbo Frames
  without Jumbo Frames

I can share some of our configurations as well. Greetings Stefan

On Thu, May 31, 2012 at 9:30 AM, Yehuda Sadeh yeh...@inktank.com wrote:
On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote:
[quoted commit c43b874d5d71 ("tcp: properly initialize tcp memory limits") and its diff snipped; see the earlier messages in this thread]

Yeah, this might have affected the tcp performance. Looking at the current linus tree this function looks more like it looked beforehand, so it was probably reverted one way or another.

Yehuda

-- Stefan Majer
Re: poor OSD performance using kernel 3.4 => problem found
Am 31.05.2012 14:31, schrieb Mark Nelson:
[quoted; see the previous message]

At least I'll start sharing once I have a fine-running system ;-) I plan to switch to 10GbE next week.

Stefan
SIGSEGV in cephfs-java, but probably in Ceph
Dear all, I am running a small benchmark for Ceph with multithreading and the cephfs-java API. I encountered this issue even when I use only two threads, and I used only file-open and directory-creation operations. The piece of code is simply:

String parent = filePath.substring(0, filePath.lastIndexOf('/'));
mount.mkdirs(parent, 0755); // create parents if the path does not exist
int fileID = mount.open(filePath, CephConstants.O_CREAT, 0666); // open the file

Each thread mounts its own ceph mount point (using mount.mount(null)) and I don't have any interlocking mechanism across the threads at all. It appears the error is a SIGSEGV raised by libcephfs. The message is as follows:

# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x7ff6af978d39, pid=14063, tid=140697400411904
#
# JRE version: 6.0_26-b03
# Java VM: Java HotSpot(TM) 64-Bit Server VM (20.1-b02 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C [libcephfs.so.1+0x139d39] Mutex::Lock(bool)+0x9
#
# An error report file with more information is saved as:
# /home/namd/cephBench/hs_err_pid14063.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.

I have also attached the hs_err_pid14063.log for your reference. An excerpt from the file:

Stack: [0x7ff6aa828000,0x7ff6aa929000], sp=0x7ff6aa9274f0, free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libcephfs.so.1+0x139d39] Mutex::Lock(bool)+0x9
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j com.ceph.fs.CephMount.native_ceph_mkdirs(JLjava/lang/String;I)I+0
j com.ceph.fs.CephMount.mkdirs(Ljava/lang/String;I)V+6
j Benchmark$CreateFileStats.executeOp(IILjava/lang/String;Lcom/ceph/fs/CephMount;)J+37
j Benchmark$StatsDaemon.benchmarkOne()V+22
j Benchmark$StatsDaemon.run()V+26
v ~StubRoutines::call_stub

So I think the problem may be due to the locking mechanism of ceph internally. But Dr. Weil previously answered my email stating that the mounting is done independently, so multithreading should not lead to this problem. If there is any way to work around this, please let me know.

Best regards,
Nam Dang
Email: n...@de.cs.titech.ac.jp
HP: (+81) 080-4465-1587
Yokota Lab, Dept. of Computer Science
Tokyo Institute of Technology, Tokyo, Japan
Re: SIGSEGV in cephfs-java, but probably in Ceph
It turned out my monitor went down without my knowing. So my bad, it wasn't because of Ceph.

Best regards,
Nam Dang
Tokyo Institute of Technology, Tokyo, Japan

On Thu, May 31, 2012 at 10:08 PM, Nam Dang n...@de.cs.titech.ac.jp wrote:
[original report quoted in full; snipped - see the previous message]
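A silently dead monitor is easy to rule out up front; a quick check before (or during) a benchmark run, using standard ceph CLI calls:

# Overall cluster state plus an explicit monitor quorum summary.
ceph -s
ceph mon stat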
Re: poor OSD performance using kernel 3.4 => problem found
On 31/05/2012 09:30, Yehuda Sadeh wrote:
On Thu, May 31, 2012 at 12:10 AM, Stefan Priebe - Profihost AG s.pri...@profihost.ag wrote:
Hi Marc, Hi Stefan,

Hello, back today. Today I upgraded my 2 last osd nodes with big storage, so now all my nodes are equivalent. Using the 3.4.0 kernel, I still have good results with the rbd pool, but jumping values with data.

first thanks for all your help and time. I found the commit which results in this problem and it is TCP related, but i'm still wondering whether this behaviour is expected. [snipped]

Yeah, this might have affected the tcp performance. Looking at the current linus tree this function looks more like it looked beforehand, so it was probably reverted one way or another.

Yehuda

Well, I saw you probably found the culprit. So I tried the latest (this morning) git kernel. Now data gives good results:

root@label5:~# rados -p data bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
  sec Cur ops started finished avg MB/s cur MB/s last lat  avg lat
    0       0       0        0        0        0        -        0
    1      16     215      199  795.765      796 0.073769 0.0745517
    2      16     430      414  827.833      860 0.060165 0.0753952
    3      16     632      616  821.207      808 0.072241 0.0772463
    4      16     838      822  821.883      824 0.129571 0.0768741
    5      16    1039     1023  818.271      804 0.056867 0.077637
    6      16    1254     1238  825.209      860 0.078801 0.0771122
    7      16    1474     1458  833.023      880 0.062886 0.0764071
    8      16    1669     1653  826.389      780 0.09632  0.0767323
    9      16    1877     1861  827.003      832 0.083765 0.0770398
   10      16    2087     2071  828.294      840 0.051437 0.076937
   11      16    2309     2293  833.714      888 0.080584 0.0764829
   12      16    2535     2519  839.563      904 0.078095 0.0759574
   13      16    2762     2746  844.816      908 0.081323 0.0754571
   14      16    2984     2968  847.889      888 0.076973 0.0752921
   15      16    3203     3187  849.754      876 0.069877 0.0750613
   16      16    3437     3421  855.138      936 0.046845 0.0746941
   17      16    3655     3639  856.126      872 0.052258 0.0745157
   18      16    3862     3846  854.559      828 0.061542 0.0746875
   19      16    4085     4069  856.525      892 0.053889 0.0745582
min lat: 0.033007 max lat: 0.462951 avg lat: 0.0743988
  sec Cur ops started finished avg MB/s cur MB/s last lat  avg lat
   20      15    4308     4293  858.492      896 0.054176 0.0743988
Total time run:        20.103415
Total writes made:     4309
Write size:            4194304
Bandwidth (MB/sec):    857.367
Average Latency:       0.0746302
Max latency:           0.462951
Min latency:           0.033007

But very strangely it's now rbd that isn't stable?!

root@label5:~# rados -p rbd bench 20 write -t 16
Maintaining 16 concurrent writes of 4194304 bytes for at least 20 seconds.
  sec Cur ops started finished avg MB/s cur MB/s last lat  avg lat
    0       0       0        0        0        0        -        0
    1      16     155      139   555.87      556 0.046232 0.109021
    2      16     250      234  467.923      380 0.046793 0.0985316
    3      16     250      234  311.955        0        - 0.0985316
    4      16     250      234  233.965        0        - 0.0985316
    5      16     250      234  187.173        0        - 0.0985316
    6      16     266      250  166.645       16 0.038083 0.175697
    7      16     266      250  142.839        0        - 0.175697
    8      16     441      425  212.475      350 0.05512  0.298391
    9      16     476      460  204.422      140 0.04372  0.280483
   10      16     531      515  205.976      220 0.125076 0.309449
   11      16     734      718  261.06       812 0.127582 0.244134
   12      16     795      779  259.637      244 0.065158 0.234156
   13      16     818      802  246.742       92 0.054514 0.241704
   14      16     830      814  232.546       48 0.044386 0.239006
   15      16     837      821  218.909       28 3.41523  0.267521
   16      16    1043     1027  256.721      824 0.04898  0.248212
   17      16    1147     1131  266.088      416 0.048591 0.232725
   18      16    1147     1131  251.305        0        - 0.232725
   19      16    1202     1186  249.657      110 0.081777 0.25501
min lat: 0.033773 max lat: 5.92059 avg lat: 0.245711
  sec Cur ops started finished avg MB/s cur MB/s last lat  avg lat
   20      16    1296     1280  255.97       376 0.053797 0.245711
   21       9    1297     1288  245.305       32 0.708133
Re: poor OSD performance using kernel 3.4 => problem found
Am 31.05.2012 15:21, schrieb Yann Dupont:
But very strangely it's now rbd that isn't stable?!
[rbd bench output snipped; see the previous message]
Strange. I'm wondering if this has something to do with cache (that is, operations I could have done before on the nodes, as all my nodes were just freshly rebooted).

Please test setting these values on all OSDs and clients:

sysctl -w net.ipv4.tcp_rmem="4096 87380 514873"
sysctl -w net.ipv4.tcp_wmem="4096 16384 514873"

Stefan
Re: poor OSD performance using kernel 3.4 => problem found
On 31/05/2012 15:37, Stefan Priebe - Profihost AG wrote:
[quoted rbd bench output snipped]
Please test setting these values on all OSDs and clients:
sysctl -w net.ipv4.tcp_rmem="4096 87380 514873"
sysctl -w net.ipv4.tcp_wmem="4096 16384 514873"
Stefan

Same. Stable for pool data (845 MB/s average), jumping with rbd (229 average, with a max latency of 6). I'm on the latest linus git kernel (commit af56e0aa35f3ae2a4c1a6d1000702df1dd78cb76), based on the fact that the patch was reverted in it. I can try plain 3.4.0 with the 'culprit patch' manually reverted. What puzzles me is that this morning, with 3.4.0, it was rbd that was stable, and now I have the exact opposite. I'll begin to reboot with the old 3.4.0 kernel to see if things are reproducible.

Cheers,
-- Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
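To make runs like these easier to compare, the same bench can be driven against both pools back to back with the output captured; a minimal sketch using the same command line as above:

# Bench each pool with identical parameters and keep timestamped logs,
# so the only variable between runs is the pool (and its CRUSH rule).
for pool in data rbd; do
    rados -p $pool bench 20 write -t 16 | tee bench-$pool-$(date +%s).log
done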
Re: SIGSEGV in cephfs-java, but probably in Ceph
On May 31, 2012, at 6:20 AM, Nam Dang wrote:
It turned out my monitor went down without my knowing. So my bad, it wasn't because of Ceph.

I believe the segfault here is from a null client being dereferenced in the C wrappers. Which patch set are you using?

[rest of the original report quoted in full; snipped]
Re: poor OSD performance using kernel 3.4 => problem found
On 31/05/2012 15:45, Yann Dupont wrote:
On 31/05/2012 15:37, Stefan Priebe - Profihost AG wrote:
What puzzles me is that this morning, with 3.4.0, it was rbd that was stable, and now I have the exact opposite. I'll begin to reboot with the old 3.4.0 kernel to see if things are reproducible. Cheers,

I'd say my problem is probably not related. Freshly rebooting all osd nodes with the 3.4.0 kernel (the same kernel I used this morning) now gives pool data stable, rbd unstable. As with current git, and the exact opposite of the results I had this morning. Go figure. Could it have to do with previous usage of the OSDs? Or the active mds? Or the mon? As I already said, my osds are using btrfs with big metadata features, so going back to a 3.0 kernel needs a complete reformat of my OSDs first. But I will do it if you judge it can bring some light on this case.

Cheers,
-- Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
Re: SIGSEGV in cephfs-java, but probably in Ceph
On May 31, 2012, at 6:20 AM, Nam Dang wrote:
Stack: [0x7ff6aa828000,0x7ff6aa929000], sp=0x7ff6aa9274f0, free space=1021k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C [libcephfs.so.1+0x139d39] Mutex::Lock(bool)+0x9
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j com.ceph.fs.CephMount.native_ceph_mkdirs(JLjava/lang/String;I)I+0
j com.ceph.fs.CephMount.mkdirs(Ljava/lang/String;I)V+6
j Benchmark$CreateFileStats.executeOp(IILjava/lang/String;Lcom/ceph/fs/CephMount;)J+37
j Benchmark$StatsDaemon.benchmarkOne()V+22
j Benchmark$StatsDaemon.run()V+26
v ~StubRoutines::call_stub

Nevermind my last comment. Hmm, I've seen this, but very rarely.

- Noah
different ip/network links for osd replication and client-osd?
Hi, Is it possible to use different IPs / network links for
- replication between OSDs
- the network between clients and OSDs?
I would like to use different switches/network cards for OSD replication. Regards, Alexandre
Re: poor OSD performance using kernel 3.4 => problem found
On 05/31/2012 09:42 AM, Yann Dupont wrote:
[quoted; see the previous message]

Hi Yann, Can you take a look at how many PGs are in each pool?

ceph osd pool get <pool> pg_num

Thanks, Mark
Re: poor OSD performance using kernel 3.4 => problem found
On 31/05/2012 17:32, Mark Nelson wrote:
ceph osd pool get <pool> pg_num

My setup is detailed in a previous mail, but as I changed some parameters this morning, here we go:

root@chichibu:~# ceph osd pool get data pg_num
PG_NUM: 576
root@chichibu:~# ceph osd pool get rbd pg_num
PG_NUM: 576

The pg num is quite low because I started with small OSDs (9 osds with 200G each - internal disks) when I formatted. Now I have reduced to 8 osds (osd.4 is out), but with much larger & faster storage: each of the 8 OSDs now has 5T on it, and I try, for the moment, to keep the OSDs similar. Replication is set to 2. The fs is btrfs formatted with big metadata (-l 64k -n 64k) and mounted with space_cache,compress=lzo,nobarrier,noatime. The journal is on tmpfs:

osd journal = /dev/shm/journal
osd journal size = 6144

I know this is dangerous; remember it's NOT a production system for the moment. No OSD is full, I don't have much data stored for the moment.

Concerning the crush map, I'm not using the default one: the 8 nodes are in 3 different locations (some kilometers away). 2 are in one place, 2 in another, and the 4 last in the principal place. There is 10G between all the nodes and they are in the same VLAN, no router involved (but there is (negligible?) latency between nodes). I try to group hosts together to avoid problems when I lose a location (electrical problem, for example). Not sure I really customized the crush map as I should have. Here is the map:

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 device4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8

# types
type 0 osd
type 1 host
type 2 rack
type 3 pool

# buckets
host karuizawa {
	id -5		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.2 weight 1.000
}
host hazelburn {
	id -6		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.3 weight 1.000
}
rack loire {
	id -3		# do not change unnecessarily
	# weight 2.000
	alg straw
	hash 0	# rjenkins1
	item karuizawa weight 1.000
	item hazelburn weight 1.000
}
host carsebridge {
	id -8		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.5 weight 1.000
}
host cameronbridge {
	id -9		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.6 weight 1.000
}
rack chantrerie {
	id -7		# do not change unnecessarily
	# weight 2.000
	alg straw
	hash 0	# rjenkins1
	item carsebridge weight 1.000
	item cameronbridge weight 1.000
}
host chichibu {
	id -2		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 1.000
}
host glenesk {
	id -4		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 1.000
}
host braeval {
	id -10		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.7 weight 1.000
}
host hanyu {
	id -11		# do not change unnecessarily
	# weight 1.000
	alg straw
	hash 0	# rjenkins1
	item osd.8 weight 1.000
}
rack lombarderie {
	id -12		# do not change unnecessarily
	# weight 4.000
	alg straw
	hash 0	# rjenkins1
	item chichibu weight 1.000
	item glenesk weight 1.000
	item braeval weight 1.000
	item hanyu weight 1.000
}
pool default {
	id -1		# do not change unnecessarily
	# weight 8.000
	alg straw
	hash 0	# rjenkins1
	item loire weight 2.000
	item chantrerie weight 2.000
	item lombarderie weight 4.000
}

# rules
rule data {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
rule metadata {
	ruleset 1
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}
rule rbd {
	ruleset 2
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

# end crush map

Hope it helps, cheers
-- Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
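For anyone wanting to reproduce or tweak a map like this, the usual round trip is to dump the compiled map, decompile it to text, edit, and inject it back; a sketch using standard ceph/crushtool commands (file names are placeholders):

# Fetch the live CRUSH map and decompile it to editable text...
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt
# ...edit crush.txt, then recompile and load it back.
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new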
Re: SIGSEGV in cephfs-java, but probably in Ceph
Hi Noah, By the way, the test suite of cephfs-java has a bug. You should put the permission value in the form of 0777 instead of 777, since the number has to be octal. With 777 I got directories with weird permission settings.

Best regards,
Nam Dang
Tokyo Institute of Technology, Tokyo, Japan

On Thu, May 31, 2012 at 11:43 PM, Noah Watkins jayh...@cs.ucsc.edu wrote:
[quoted stack trace snipped; see the earlier messages in this thread]
Nevermind my last comment. Hmm, I've seen this, but very rarely.
- Noah
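The underlying pitfall: a Java integer literal without a leading 0 is decimal, while mode bits are interpreted octally, so 777 is not the permission you think it is. A one-line shell illustration:

# 777 decimal rendered in octal - i.e. the mode bits actually applied.
printf '%o\n' 777     # prints 1411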
Re: SIGSEGV in cephfs-java, but probably in Ceph
On May 31, 2012, at 8:48 AM, Nam Dang wrote:
Hi Noah, By the way, the test suite of cephfs-java has a bug. You should put the permission value in the form of 0777 instead of 777, since the number has to be octal. With 777 I got directories with weird permission settings.

Thanks Nam, I'll fix this up.

[rest of the quoted thread snipped]
Re: poor OSD performance using kernel 3.4 => problem found
On 05/31/2012 10:43 AM, Yann Dupont wrote:
[quoted setup description and CRUSH map snipped; see the previous message]

Hi Yann, You might want to start out by running sar/iostat/collectl on the OSD nodes and seeing if anything looks funny during the slow test compared to the fast one. If that doesn't reveal much, you could run blktrace on one of the OSDs during the tests and see if the IO to the disk looks different. I can help out if you want to send me your blktrace results. Similarly you could watch the network streams for both tests and see if anything looks different there.

Thanks! Mark
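A sketch of one way to run the capture Mark suggests, on an OSD node during the bench (/dev/sdb is a placeholder for the OSD data disk; blktrace needs root and a mounted debugfs):

# Per-device I/O statistics at 1-second samples, plus a block-level trace.
iostat -x 1 > iostat-during-bench.log &
blktrace -d /dev/sdb -o osd-trace &
# ...run "rados bench" from the client, then stop both collectors...
kill %1 %2
blkparse -i osd-trace | less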
Re: [PATCH 01/13] libceph: eliminate connection state DEAD
Reviewed-by: Yehuda Sadeh yeh...@inktank.com

On Wed, May 30, 2012 at 12:34 PM, Alex Elder el...@inktank.com wrote:
The ceph connection state DEAD is never set and is therefore not needed. Eliminate it.

Signed-off-by: Alex Elder el...@inktank.com
---
 include/linux/ceph/messenger.h |    1 -
 net/ceph/messenger.c           |    6 ------
 2 files changed, 0 insertions(+), 7 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 2521a95..aa506ca 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -119,7 +119,6 @@ struct ceph_msg_pos {
 #define CLOSED		10 /* we've closed the connection */
 #define SOCK_CLOSED	11 /* socket state changed to closed */
 #define OPENING		13 /* open connection w/ (possibly new) peer */
-#define DEAD		14 /* dead, about to kfree */
 #define BACKOFF		15

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 1a80907..42ca8aa 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -2087,12 +2087,6 @@ bad_tag:
  */
 static void queue_con(struct ceph_connection *con)
 {
-	if (test_bit(DEAD, &con->state)) {
-		dout("queue_con %p ignoring: DEAD\n",
-		     con);
-		return;
-	}
-
 	if (!con->ops->get(con)) {
 		dout("queue_con %p ref count 0\n", con);
 		return;
--
1.7.5.4
Re: different ip/network links for osd replication and client-osd?
On Thu, 31 May 2012, Alexandre DERUMIER wrote:
Hi, Is it possible to use different IPs / network links for
- replication between OSDs
- the network between clients and OSDs?
I would like to use different switches/network cards for OSD replication.

Yep:

[osd]
	public network = 1.2.3.4/24
	cluster network = 192.168.0.0/16

will make ceph-osd choose IPs in those subnets. You can also specify 'public addr' or 'cluster addr' to set a specific IP, although that's more tedious to configure.

sage
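After restarting the daemons with those options, the bindings can be checked from the OSD map; a hedged sketch (the exact output format varies by version):

# Each osd line in the dump lists the addresses the daemon bound, so
# the public and cluster networks can be verified per OSD.
ceph osd dump | grep '^osd'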
Re: poor OSD performance using kernel 3.4 => problem found
On Thu, 31 May 2012, Yann Dupont wrote:
On 31/05/2012 17:32, Mark Nelson wrote:
ceph osd pool get <pool> pg_num

My setup is detailed in a previous mail, but as I changed some parameters this morning, here we go:
root@chichibu:~# ceph osd pool get data pg_num
PG_NUM: 576
root@chichibu:~# ceph osd pool get rbd pg_num
PG_NUM: 576

Can you post 'ceph osd dump | grep ^pool' so we can see which CRUSH rules the pools are mapped to? Thanks!

sage

[rest of the quoted setup description and CRUSH map snipped; see the earlier message]
Re: [PATCH 02/13] libceph: kill bad_proto ceph connection op
Reviewed-by: Yehuda Sadeh yeh...@inktank.com

On Wed, May 30, 2012 at 12:34 PM, Alex Elder el...@inktank.com wrote:
No code sets a bad_proto method in its ceph connection operations vector, so just get rid of it.

Signed-off-by: Alex Elder el...@inktank.com
---
 include/linux/ceph/messenger.h |    3 ---
 net/ceph/messenger.c           |    5 -----
 2 files changed, 0 insertions(+), 8 deletions(-)

diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index aa506ca..74f6c9b 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -31,9 +31,6 @@ struct ceph_connection_operations {
 	int (*verify_authorizer_reply) (struct ceph_connection *con, int len);
 	int (*invalidate_authorizer)(struct ceph_connection *con);

-	/* protocol version mismatch */
-	void (*bad_proto) (struct ceph_connection *con);
-
 	/* there was some error on the socket (disconnect, whatever) */
 	void (*fault) (struct ceph_connection *con);

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 42ca8aa..07af994 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1356,11 +1356,6 @@ static void fail_protocol(struct ceph_connection *con)
 {
 	reset_connection(con);
 	set_bit(CLOSED, &con->state);  /* in case there's queued work */
-
-	mutex_unlock(&con->mutex);
-	if (con->ops->bad_proto)
-		con->ops->bad_proto(con);
-	mutex_lock(&con->mutex);
 }

 static int process_connect(struct ceph_connection *con)
--
1.7.5.4
Re: [PATCH 05/13] libceph: rename kvec_reset and kvec_add functions
Reviewed-by: Yehuda Sadeh yeh...@inktank.com On Wed, May 30, 2012 at 12:34 PM, Alex Elder el...@inktank.com wrote: The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add() are entirely private functions, so drop the ceph_ prefix in their name to make them slightly more wieldy. Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c | 48 1 files changed, 24 insertions(+), 24 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 5ad1f0a..2e9054f 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -484,14 +484,14 @@ static u32 get_global_seq(struct ceph_messenger *msgr, u32 gt) return ret; } -static void ceph_con_out_kvec_reset(struct ceph_connection *con) +static void con_out_kvec_reset(struct ceph_connection *con) { con-out_kvec_left = 0; con-out_kvec_bytes = 0; con-out_kvec_cur = con-out_kvec[0]; } -static void ceph_con_out_kvec_add(struct ceph_connection *con, +static void con_out_kvec_add(struct ceph_connection *con, size_t size, void *data) { int index; @@ -532,7 +532,7 @@ static void prepare_write_message(struct ceph_connection *con) struct ceph_msg *m; u32 crc; - ceph_con_out_kvec_reset(con); + con_out_kvec_reset(con); con-out_kvec_is_msg = true; con-out_msg_done = false; @@ -540,9 +540,9 @@ static void prepare_write_message(struct ceph_connection *con) * TCP packet that's a good thing. */ if (con-in_seq con-in_seq_acked) { con-in_seq_acked = con-in_seq; - ceph_con_out_kvec_add(con, sizeof (tag_ack), tag_ack); + con_out_kvec_add(con, sizeof (tag_ack), tag_ack); con-out_temp_ack = cpu_to_le64(con-in_seq_acked); - ceph_con_out_kvec_add(con, sizeof (con-out_temp_ack), + con_out_kvec_add(con, sizeof (con-out_temp_ack), con-out_temp_ack); } @@ -570,12 +570,12 @@ static void prepare_write_message(struct ceph_connection *con) BUG_ON(le32_to_cpu(m-hdr.front_len) != m-front.iov_len); /* tag + hdr + front + middle */ - ceph_con_out_kvec_add(con, sizeof (tag_msg), tag_msg); - ceph_con_out_kvec_add(con, sizeof (m-hdr), m-hdr); - ceph_con_out_kvec_add(con, m-front.iov_len, m-front.iov_base); + con_out_kvec_add(con, sizeof (tag_msg), tag_msg); + con_out_kvec_add(con, sizeof (m-hdr), m-hdr); + con_out_kvec_add(con, m-front.iov_len, m-front.iov_base); if (m-middle) - ceph_con_out_kvec_add(con, m-middle-vec.iov_len, + con_out_kvec_add(con, m-middle-vec.iov_len, m-middle-vec.iov_base); /* fill in crc (except data pages), footer */ @@ -624,12 +624,12 @@ static void prepare_write_ack(struct ceph_connection *con) con-in_seq_acked, con-in_seq); con-in_seq_acked = con-in_seq; - ceph_con_out_kvec_reset(con); + con_out_kvec_reset(con); - ceph_con_out_kvec_add(con, sizeof (tag_ack), tag_ack); + con_out_kvec_add(con, sizeof (tag_ack), tag_ack); con-out_temp_ack = cpu_to_le64(con-in_seq_acked); - ceph_con_out_kvec_add(con, sizeof (con-out_temp_ack), + con_out_kvec_add(con, sizeof (con-out_temp_ack), con-out_temp_ack); con-out_more = 1; /* more will follow.. eventually.. 
*/ @@ -642,8 +642,8 @@ static void prepare_write_ack(struct ceph_connection *con) static void prepare_write_keepalive(struct ceph_connection *con) { dout(prepare_write_keepalive %p\n, con); - ceph_con_out_kvec_reset(con); - ceph_con_out_kvec_add(con, sizeof (tag_keepalive), tag_keepalive); + con_out_kvec_reset(con); + con_out_kvec_add(con, sizeof (tag_keepalive), tag_keepalive); set_bit(WRITE_PENDING, con-state); } @@ -688,8 +688,8 @@ static struct ceph_auth_handshake *get_connect_authorizer(struct ceph_connection */ static void prepare_write_banner(struct ceph_connection *con) { - ceph_con_out_kvec_add(con, strlen(CEPH_BANNER), CEPH_BANNER); - ceph_con_out_kvec_add(con, sizeof (con-msgr-my_enc_addr), + con_out_kvec_add(con, strlen(CEPH_BANNER), CEPH_BANNER); + con_out_kvec_add(con, sizeof (con-msgr-my_enc_addr), con-msgr-my_enc_addr); con-out_more = 0; @@ -736,10 +736,10 @@ static int prepare_write_connect(struct ceph_connection *con) con-out_connect.authorizer_len = auth ? cpu_to_le32(auth-authorizer_buf_len) : 0; - ceph_con_out_kvec_add(con, sizeof (con-out_connect), + con_out_kvec_add(con, sizeof (con-out_connect), con-out_connect); if (auth auth-authorizer_buf_len) -
Re: poor OSD performance using kernel 3.4 = problem found
On 31/05/2012 18:29, Sage Weil wrote:
> Can you post 'ceph osd dump | grep ^pool' so we can see which CRUSH
> rules the pools are mapped to?

Yes:

root@label5:~# ceph osd dump | grep ^pool
pool 0 'data' rep size 2 crush_ruleset 0 object_hash rjenkins pg_num 576 pgp_num 576 last_change 816 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 crush_ruleset 1 object_hash rjenkins pg_num 576 pgp_num 576 last_change 1 owner 0
pool 2 'rbd' rep size 2 crush_ruleset 2 object_hash rjenkins pg_num 576 pgp_num 576 last_change 1 owner 0

cheers,

--
Yann Dupont - Service IRTS, DSI Université de Nantes
Tel : 02.53.48.49.20 - Mail/Jabber : yann.dup...@univ-nantes.fr
Re: rbd rm image slow with big images ?
Hi,

On 05/31/2012 09:12 AM, Alexandre DERUMIER wrote:
> Hi,
> I'm trying to delete some rbd images with rbd rm, and it seems to be slow
> with big images.
>
> I'm testing it with just creating a new image (1TB):
>
> # time rbd -p pool1 create --size 1000000 image2
> real	0m0.031s
> user	0m0.015s
> sys	0m0.010s
>
> then just deleting it, without having written anything to the image:
>
> # time rbd -p pool1 rm image2
> Removing image: 100% complete...done.
> real	1m45.558s
> user	0m14.683s
> sys	0m17.363s
>
> same test with 100GB:
>
> # time rbd -p pool1 create --size 100000 image2
> real	0m0.032s
> user	0m0.016s
> sys	0m0.007s
>
> # time rbd -p pool1 rm image2
> Removing image: 100% complete...done.
> real	0m10.499s
> user	0m1.488s
> sys	0m1.720s
>
> I'm using journal in tmpfs, 3 servers, 15 osds with 1 disk 15K (xfs);
> network bandwidth, disk io and cpu are low.
>
> Is this the normal behaviour? Maybe some xfs tuning could help?

It's in the nature of RBD.

An RBD image consists of multiple 4MB (default) RADOS objects. Let's say you have a disk of 40GB; that will contain 10,240 4MB RADOS objects. You can find those objects by doing:

rados -p rbd ls

Now, when you create a new image, only the header is written; no data object is written. When you start writing to an RBD image you will be writing to one of the 4MB objects, and when one doesn't exist yet it will be created. So when you install your VM it will create objects, but not all of them.

RBD knows which RADOS objects to access from three parameters:

* Image name
* Image size
* Stripe size (4MB)

So when your VM accesses bytes Y through Z on the disk, RBD calculates which object to access.

Now, when you start removing the image there is no way of knowing which objects exist and which don't, so RBD will try to remove all of them. In the case of a fresh image this results in 10,240 RADOS remove operations for non-existent objects, and that is slow.

Wido
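For illustration, the offset-to-object calculation described above amounts to a shift and a mask. A standalone sketch follows: the 22-bit order corresponds to the default 4 MB objects, and the rb.0.1234 name prefix is invented; real RBD object-naming details vary by image format.

	#include <stdint.h>
	#include <stdio.h>

	#define OBJ_ORDER 22U  /* 1 << 22 = 4 MiB, the default object size */

	int main(void)
	{
		uint64_t offset = 5368709120ULL;  /* byte 5 GiB into the image */

		uint64_t objno  = offset >> OBJ_ORDER;                 /* object index */
		uint64_t objoff = offset & ((1ULL << OBJ_ORDER) - 1);  /* offset inside it */

		/* data objects are named <block_name_prefix>.<index>; the
		 * prefix below is made up for illustration */
		printf("rb.0.1234.%012llx at internal offset %llu\n",
		       (unsigned long long)objno, (unsigned long long)objoff);
		return 0;
	}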
Re: rbd rm image slow with big images ?
One note: he has written:

> then just delete it, without having writed nothing in image

On 31.05.2012 20:15, Wido den Hollander wrote:
> [Wido's explanation of the RBD object layout and removal, quoted in
> full above, trimmed]
Re: rbd rm image slow with big images ?
On Thu, 31 May 2012, Wido den Hollander wrote:
>> Is it the normal behaviour ? Maybe some xfs tuning could help ?
>
> It's in the nature of RBD.

Yes. That said, the current implementation is also stupid: it's doing a single io at a time. #2256 (next sprint) will parallelize this to make it go much faster (probably an order of magnitude?).

sage
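To make the planned parallelization concrete, here is a rough userspace sketch of issuing removals with a window of operations in flight via the librados C API. The pool name, object-name prefix and object count are invented, error handling is elided, and it assumes the AIO calls shown (rados_aio_remove and friends) are available in your librados:

	#include <rados/librados.h>
	#include <stdint.h>
	#include <stdio.h>

	#define WINDOW 16  /* removals kept in flight at once */

	int main(void)
	{
		rados_t cluster;
		rados_ioctx_t io;
		rados_completion_t batch[WINDOW];
		uint64_t num_objs = 25600;  /* e.g. 100 GB / 4 MB; multiple of WINDOW */
		uint64_t i;

		rados_create(&cluster, NULL);
		rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");
		rados_connect(cluster);
		rados_ioctx_create(cluster, "pool1", &io);

		for (i = 0; i < num_objs; i++) {
			char oid[64];

			snprintf(oid, sizeof(oid), "rb.0.1234.%012llx",
				 (unsigned long long)i);
			rados_aio_create_completion(NULL, NULL, NULL,
						    &batch[i % WINDOW]);
			rados_aio_remove(io, oid, batch[i % WINDOW]);

			/* window full: reap the whole batch before continuing */
			if (i % WINDOW == WINDOW - 1) {
				int j;

				for (j = 0; j < WINDOW; j++) {
					rados_aio_wait_for_complete(batch[j]);
					rados_aio_release(batch[j]);
				}
			}
		}

		rados_ioctx_destroy(io);
		rados_shutdown(cluster);
		return 0;
	}

Compiled with something like cc -o fastrm fastrm.c -lrados. Batching like this trades a little latency at batch boundaries for far less per-object round-trip serialization, which is the gist of the order-of-magnitude estimate.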
Re: rbd rm image slow with big images ?
On 05/31/2012 08:16 PM, Stefan Priebe wrote:
> One note: he has written:
>
>> then just delete it, without having writed nothing in image

That is true, but RBD doesn't know that. There is no record of which objects got created and which didn't, so the removal process has to issue a removal for each RBD object that might exist.

That is the nature of RBD. It makes it simple and reliable.

Wido

> [earlier quoted text trimmed]
Re: SIGSEGV in cephfs-java, but probably in Ceph
On Thursday, May 31, 2012 at 7:43 AM, Noah Watkins wrote:
> On May 31, 2012, at 6:20 AM, Nam Dang wrote:
>
>> Stack: [0x7ff6aa828000,0x7ff6aa929000], sp=0x7ff6aa9274f0, free space=1021k
>>
>> Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
>> C [libcephfs.so.1+0x139d39] Mutex::Lock(bool)+0x9
>>
>> Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
>> j com.ceph.fs.CephMount.native_ceph_mkdirs(JLjava/lang/String;I)I+0
>> j com.ceph.fs.CephMount.mkdirs(Ljava/lang/String;I)V+6
>> j Benchmark$CreateFileStats.executeOp(IILjava/lang/String;Lcom/ceph/fs/CephMount;)J+37
>> j Benchmark$StatsDaemon.benchmarkOne()V+22
>> j Benchmark$StatsDaemon.run()V+26
>> v ~StubRoutines::call_stub
>
> Nevermind to my last comment.

Hmm, I've seen this, but very rarely. Noah, do you have any leads on this? Do you think it's a bug in your Java code or in the C/C++ libraries?

Nam: it definitely shouldn't be segfaulting just because a monitor went down. :)

-Greg
Re: SIGSEGV in cephfs-java, but probably in Ceph
On May 31, 2012, at 3:39 PM, Greg Farnum wrote:
> Hmm, I've seen this, but very rarely. Noah, do you have any leads on
> this? Do you think it's a bug in your Java code or in the C/C++ libraries?

I _think_ this is because the JVM uses its own threading library, and Ceph assumes pthreads and pthread-compatible mutexes -- is that assumption about Ceph correct? Hence the error that looks like Mutex::Lock(bool) being referenced for context during the segfault. To verify this, all that is needed is some synchronization added on the Java side.

There are only two segfaults that I've ever encountered: one in which the C wrappers are used with an unmounted client, and the error Nam is seeing (although they could be related). I will re-submit an updated patch for the former, which should rule that out as the culprit.

Nam: where are you grabbing the Java patches from? I'll push some updates.

The only other scenario that comes to mind is related to signaling: the RADOS Java wrappers suffered from an interaction between the JVM and RADOS client signal handlers, in which either the JVM or RADOS would replace the handlers for the other (not sure which order). Anyway, the solution was to link in the JVM's libjsig.so signal-chaining library. This might be the same thing we are seeing here, but I'm betting it is the first theory I mentioned.

- Noah
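For reference, the signal-chaining workaround mentioned above is enabled by preloading the JVM's libjsig library when launching the benchmark; something like the following, where the paths vary by JVM installation and architecture and are only illustrative:

	# preload the JVM's signal-chaining library so native libraries like
	# libcephfs/librados chain their signal handlers instead of clobbering
	# the JVM's own handlers
	LD_PRELOAD=$JAVA_HOME/jre/lib/amd64/libjsig.so \
	    java -Djava.library.path=/usr/lib/jni Benchmark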
iozone test crashed on ceph
Hi,

I have set up a ceph system with a client, mon and mds on one system, which is connected to 2 osds. I ran an iozone test with a 10G file and it ran fine. But when I ran the iozone test with a 5G file, the process got killed and our ceph system hung.

Can anyone please help me with this? Thanks in advance.

--Udit
Re: RBD operations, pinging client that serves lingering tid
Those messages are harmless. It's just debug output indicating that the objecter is maintaining a watch on an rbd image header. I'll tone down the debug verbosity tomorrow.

-Sam

On Wed, May 30, 2012 at 6:54 AM, Guido Winkelmann guido-c...@thisisnotatest.de wrote:
> Hi,
>
> Whenever I'm doing any operations on rbd volumes (like import, copy) using
> the rbd command line client, I'm getting these messages every couple of
> seconds:
>
> 2012-05-30 15:53:08.010326 7f027aa47700 0 client.4159.objecter pinging osd that serves lingering tid 1 (osd.2)
> 2012-05-30 15:53:08.010344 7f027aa47700 0 client.4159.objecter pinging osd that serves lingering tid 2 (osd.0)
>
> What does this mean? Is that anything to worry about?
>
> Yesterday, these messages were only mentioning osd.2, not osd.0...
>
> Guido
Re: iozone test crashed on ceph
Hi,

Thanks for letting us know. What version are you running? Can you post your ceph.conf to give us an idea of how your cluster is configured? Also, did any of the daemons crash?

If it's reproducible, it would help to turn up osd and mds debugging to 20 and post the logs.

Thanks,
-Sam

On Thu, May 31, 2012 at 5:58 PM, udit agarwal fzdu...@gmail.com wrote:
> Hi, I have set up a ceph system with a client, mon and mds on one system,
> which is connected to 2 osds. I ran an iozone test with a 10G file and it
> ran fine. But when I ran the iozone test with a 5G file, the process got
> killed and our ceph system hung. Can anyone please help me with this?
> Thanks in advance.
>
> --Udit
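Concretely, turning up the debugging usually means adding something like this to the [osd] and [mds] sections of ceph.conf on the affected nodes before reproducing (the same keys appear, commented out, in the configuration posted later in this thread):

	[osd]
	        debug osd = 20
	        debug filestore = 20
	        debug ms = 1

	[mds]
	        debug mds = 20
	        debug ms = 1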
Re: [PATCH 04/13] libceph: rename socket callbacks
On Wed, 30 May 2012, Alex Elder wrote: Change the names of the three socket callback functions to make it more obvious they're specifically associated with a connection's socket (not the ceph connection that uses it). Signed-off-by: Alex Elder el...@inktank.com --- net/ceph/messenger.c | 28 ++-- 1 files changed, 14 insertions(+), 14 deletions(-) diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index fe3c2a1..5ad1f0a 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -153,46 +153,46 @@ EXPORT_SYMBOL(ceph_msgr_flush); */ /* data available on socket, or listen socket received a connect */ -static void ceph_data_ready(struct sock *sk, int count_unused) +static void ceph_sock_data_ready(struct sock *sk, int count_unused) { struct ceph_connection *con = sk-sk_user_data; if (sk-sk_state != TCP_CLOSE_WAIT) { - dout(ceph_data_ready on %p state = %lu, queueing work\n, + dout(%s on %p state = %lu, queueing work\n, __func__, I think it's marginally better to do dout(__func__ on %p state = %lu, queueing work\n, so that the concatenation happens at compile-time instead of runtime. Otherwise, looks good! Reviewed-by: Sage Weil s...@inktank.com con, con-state); queue_con(con); } } /* socket has buffer space for writing */ -static void ceph_write_space(struct sock *sk) +static void ceph_sock_write_space(struct sock *sk) { struct ceph_connection *con = sk-sk_user_data; /* only queue to workqueue if there is data we want to write, * and there is sufficient space in the socket buffer to accept - * more data. clear SOCK_NOSPACE so that ceph_write_space() + * more data. clear SOCK_NOSPACE so that ceph_sock_write_space() * doesn't get called again until try_write() fills the socket * buffer. See net/ipv4/tcp_input.c:tcp_check_space() * and net/core/stream.c:sk_stream_write_space(). 
*/ if (test_bit(WRITE_PENDING, con-state)) { if (sk_stream_wspace(sk) = sk_stream_min_wspace(sk)) { - dout(ceph_write_space %p queueing write work\n, con); + dout(%s %p queueing write work\n, __func__, con); clear_bit(SOCK_NOSPACE, sk-sk_socket-flags); queue_con(con); } } else { - dout(ceph_write_space %p nothing to write\n, con); + dout(%s %p nothing to write\n, __func__, con); } } /* socket's state has changed */ -static void ceph_state_change(struct sock *sk) +static void ceph_sock_state_change(struct sock *sk) { struct ceph_connection *con = sk-sk_user_data; - dout(ceph_state_change %p state = %lu sk_state = %u\n, + dout(%s %p state = %lu sk_state = %u\n, __func__, con, con-state, sk-sk_state); if (test_bit(CLOSED, con-state)) @@ -200,9 +200,9 @@ static void ceph_state_change(struct sock *sk) switch (sk-sk_state) { case TCP_CLOSE: - dout(ceph_state_change TCP_CLOSE\n); + dout(%s TCP_CLOSE\n, __func__); case TCP_CLOSE_WAIT: - dout(ceph_state_change TCP_CLOSE_WAIT\n); + dout(%s TCP_CLOSE_WAIT\n, __func__); if (test_and_set_bit(SOCK_CLOSED, con-state) == 0) { if (test_bit(CONNECTING, con-state)) con-error_msg = connection failed; @@ -212,7 +212,7 @@ static void ceph_state_change(struct sock *sk) } break; case TCP_ESTABLISHED: - dout(ceph_state_change TCP_ESTABLISHED\n); + dout(%s TCP_ESTABLISHED\n, __func__); queue_con(con); break; default:/* Everything else is uninteresting */ @@ -228,9 +228,9 @@ static void set_sock_callbacks(struct socket *sock, { struct sock *sk = sock-sk; sk-sk_user_data = con; - sk-sk_data_ready = ceph_data_ready; - sk-sk_write_space = ceph_write_space; - sk-sk_state_change = ceph_state_change; + sk-sk_data_ready = ceph_sock_data_ready; + sk-sk_write_space = ceph_sock_write_space; + sk-sk_state_change = ceph_sock_state_change; } -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
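One caveat on the compile-time concatenation suggested above: in ISO C, __func__ is a predefined identifier (effectively a static const char array), not a string literal, so it cannot take part in literal concatenation; only the old GCC-specific __FUNCTION__ behaved as a literal, and GCC 3.4 dropped that. A minimal sketch of the two forms, with dout() stubbed out as printf for illustration:

	#include <stdio.h>

	/* stand-in for the kernel's dout(); the real macro differs */
	#define dout(fmt, ...) printf(fmt, ##__VA_ARGS__)

	static void queue_work_demo(void)
	{
		/* portable form: __func__ is formatted at runtime via %s */
		dout("%s: queueing work\n", __func__);

		/*
		 * The concatenated form below is NOT valid ISO C, because
		 * __func__ is not a string literal:
		 *
		 *     dout(__func__ ": queueing work\n");
		 */
	}

	int main(void)
	{
		queue_work_demo();
		return 0;
	}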
Re: [PATCH 05/13] libceph: rename kvec_reset and kvec_add functions
Yep.

On Wed, 30 May 2012, Alex Elder wrote:
> The functions ceph_con_out_kvec_reset() and ceph_con_out_kvec_add()
> are entirely private functions, so drop the ceph_ prefix in their
> name to make them slightly more wieldy.
>
> Signed-off-by: Alex Elder el...@inktank.com
> ---
> [patch body trimmed; identical to the copy quoted in full in Yehuda's
> review of PATCH 05/13 above]
Re: [PATCH 06/13] libceph: embed ceph messenger structure in ceph_client
Reviewed-by: Sage Weil s...@inktank.com On Wed, 30 May 2012, Alex Elder wrote: A ceph client has a pointer to a ceph messenger structure in it. There is always exactly one ceph messenger for a ceph client, so there is no need to allocate it separate from the ceph client structure. Switch the ceph_client structure to embed its ceph_messenger structure. Signed-off-by: Alex Elder el...@inktank.com --- fs/ceph/mds_client.c |2 +- include/linux/ceph/libceph.h |2 +- include/linux/ceph/messenger.h |9 + net/ceph/ceph_common.c | 18 +- net/ceph/messenger.c | 30 +- net/ceph/mon_client.c |6 +++--- net/ceph/osd_client.c |4 ++-- 7 files changed, 26 insertions(+), 45 deletions(-) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 200bc87..ad30261 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -394,7 +394,7 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc, s-s_seq = 0; mutex_init(s-s_mutex); - ceph_con_init(mdsc-fsc-client-msgr, s-s_con); + ceph_con_init(mdsc-fsc-client-msgr, s-s_con); s-s_con.private = s; s-s_con.ops = mds_con_ops; s-s_con.peer_name.type = CEPH_ENTITY_TYPE_MDS; diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h index 92eef7c..927361c 100644 --- a/include/linux/ceph/libceph.h +++ b/include/linux/ceph/libceph.h @@ -131,7 +131,7 @@ struct ceph_client { u32 supported_features; u32 required_features; - struct ceph_messenger *msgr; /* messenger instance */ + struct ceph_messenger msgr; /* messenger instance */ struct ceph_mon_client monc; struct ceph_osd_client osdc; diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h index 74f6c9b..3fbd4be 100644 --- a/include/linux/ceph/messenger.h +++ b/include/linux/ceph/messenger.h @@ -211,10 +211,11 @@ extern int ceph_msgr_init(void); extern void ceph_msgr_exit(void); extern void ceph_msgr_flush(void); -extern struct ceph_messenger *ceph_messenger_create( - struct ceph_entity_addr *myaddr, - u32 features, u32 required); -extern void ceph_messenger_destroy(struct ceph_messenger *); +extern void ceph_messenger_init(struct ceph_messenger *msgr, + struct ceph_entity_addr *myaddr, + u32 supported_features, + u32 required_features, + bool nocrc); extern void ceph_con_init(struct ceph_messenger *msgr, struct ceph_connection *con); diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c index cc91319..2de3ea1 100644 --- a/net/ceph/ceph_common.c +++ b/net/ceph/ceph_common.c @@ -468,19 +468,15 @@ struct ceph_client *ceph_create_client(struct ceph_options *opt, void *private, /* msgr */ if (ceph_test_opt(client, MYIP)) myaddr = client-options-my_addr; - client-msgr = ceph_messenger_create(myaddr, - client-supported_features, - client-required_features); - if (IS_ERR(client-msgr)) { - err = PTR_ERR(client-msgr); - goto fail; - } - client-msgr-nocrc = ceph_test_opt(client, NOCRC); + ceph_messenger_init(client-msgr, myaddr, + client-supported_features, + client-required_features, + ceph_test_opt(client, NOCRC)); /* subsystems */ err = ceph_monc_init(client-monc, client); if (err 0) - goto fail_msgr; + goto fail; err = ceph_osdc_init(client-osdc, client); if (err 0) goto fail_monc; @@ -489,8 +485,6 @@ struct ceph_client *ceph_create_client(struct ceph_options *opt, void *private, fail_monc: ceph_monc_stop(client-monc); -fail_msgr: - ceph_messenger_destroy(client-msgr); fail: kfree(client); return ERR_PTR(err); @@ -515,8 +509,6 @@ void ceph_destroy_client(struct ceph_client *client) ceph_debugfs_client_cleanup(client); - ceph_messenger_destroy(client-msgr); - 
ceph_destroy_options(client-options); kfree(client); diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 2e9054f..19f1948 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -2243,18 +2243,14 @@ out: /* - * create a new messenger instance + * initialize a new messenger instance */ -struct ceph_messenger *ceph_messenger_create(struct ceph_entity_addr *myaddr, - u32 supported_features, - u32 required_features) +void ceph_messenger_init(struct ceph_messenger *msgr, + struct ceph_entity_addr *myaddr, + u32
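The design choice behind this patch is the classic embedded-struct-versus-pointer trade-off: embedding fits exactly when there is always one instance with the same lifetime as its container, which is the case for the messenger. A schematic sketch with invented structure names, not the actual kernel code:

	#include <stdlib.h>

	struct messenger { int placeholder; };

	/* before: separately allocated messenger -- two allocations and an
	 * extra failure path if the second allocation fails */
	struct client_before {
		struct messenger *msgr;
	};

	/* after: embedded messenger -- one allocation, and the messenger's
	 * lifetime is exactly the client's lifetime */
	struct client_after {
		struct messenger msgr;
	};

	int main(void)
	{
		struct client_after *c = calloc(1, sizeof(*c));

		if (!c)
			return 1;
		/* initialize in place (cf. ceph_messenger_init) instead of
		 * allocating (cf. the removed ceph_messenger_create) */
		c->msgr.placeholder = 0;
		free(c);
		return 0;
	}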
Re: iozone test crashed on ceph
Hi,

Thanks for your reply. The output of 'modinfo ceph' is as follows:

filename:       /lib/modules/3.1.10-1.9-desktop/kernel/fs/ceph/ceph.ko
license:        GPL
description:    Ceph filesystem for Linux
author:         Patience Warnick patie...@newdream.net
author:         Yehuda Sadeh yeh...@hq.newdream.net
author:         Sage Weil s...@newdream.net
srcversion:     AFEFF779535E750AFD4072D
depends:
vermagic:       3.1.10-1.9-desktop SMP preempt mod_unload modversions

And my ceph.conf file is as follows:

[global]
	pid file = /var/run/ceph/$name.pid
	logger dir = /var/log/ceph
	log dir = /var/log/ceph
	user = root

[mon]
	mon data = /var/local/data/mon$id
	; debug ms = 1
	; debug mon = 20
	; debug paxos = 20

[mon.0]
	host = hp1
	mon addr = 192.168.20.6:6789

;[mon.1]
;	host = hp2
;	mon addr = 192.168.20.7:6789

;[mon.2]
;	host = bb1
;	mon addr = 192.168.20.2:6789

[mds]
	; debug ms = 1              ; message traffic
	; debug mds = 1             ; mds
	; debug mds balancer = 20   ; load balancing
	; debug mds log = 20        ; mds journaling
	; debug mds_migrator = 20   ; metadata migration
	; debug monc = 20           ; monitor interaction, startup

[mds.0]
	host = hp1

;[mds.1]
;	host = hp2

[osd]
	osd journal = /var/local/data/osd$id/journal
	osd journal size = 1
	filestore journal writeahead = true
	osd data = /var/local/data/osd$id
	; debug ms = 1              ; message traffic
	; debug osd = 20
	; debug filestore = 20      ; local object storage
	; debug journal = 20        ; local journaling
	; debug monc = 20           ; monitor interaction, startup

[osd.0]
	host = el1
	btrfs devs = /dev/sda3

[osd.1]
	host = el1
	btrfs devs = /dev/sdb

[osd.2]
	host = bb1
	btrfs devs = /dev/sda3

No, I don't think any of the daemons crashed.

Thanks in advance, and let me know if you need further info.

--Udit
Re: [PATCH 07/13] libceph: embed ceph connection structure in mon_client
On Wed, 30 May 2012, Alex Elder wrote: A monitor client has a pointer to a ceph connection structure in it. This is the only one of the three ceph client types that do it this way; the OSD and MDS clients embed the connection into their main structures. There is always exactly one ceph connection for a monitor client, so there is no need to allocate it separate from the monitor client structure. So switch the ceph_mon_client structure to embed its ceph_connection structure. Signed-off-by: Alex Elder el...@inktank.com --- include/linux/ceph/mon_client.h |2 +- net/ceph/mon_client.c | 47 -- 2 files changed, 21 insertions(+), 28 deletions(-) diff --git a/include/linux/ceph/mon_client.h b/include/linux/ceph/mon_client.h index 545f859..2113e38 100644 --- a/include/linux/ceph/mon_client.h +++ b/include/linux/ceph/mon_client.h @@ -70,7 +70,7 @@ struct ceph_mon_client { bool hunting; int cur_mon; /* last monitor i contacted */ unsigned long sub_sent, sub_renew_after; - struct ceph_connection *con; + struct ceph_connection con; bool have_fsid; /* pending generic requests */ diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c index 704dc95..ac4d6b1 100644 --- a/net/ceph/mon_client.c +++ b/net/ceph/mon_client.c @@ -106,9 +106,9 @@ static void __send_prepared_auth_request(struct ceph_mon_client *monc, int len) monc-pending_auth = 1; monc-m_auth-front.iov_len = len; monc-m_auth-hdr.front_len = cpu_to_le32(len); - ceph_con_revoke(monc-con, monc-m_auth); + ceph_con_revoke(monc-con, monc-m_auth); ceph_msg_get(monc-m_auth); /* keep our ref */ - ceph_con_send(monc-con, monc-m_auth); + ceph_con_send(monc-con, monc-m_auth); } /* @@ -117,8 +117,8 @@ static void __send_prepared_auth_request(struct ceph_mon_client *monc, int len) static void __close_session(struct ceph_mon_client *monc) { dout(__close_session closing mon%d\n, monc-cur_mon); - ceph_con_revoke(monc-con, monc-m_auth); - ceph_con_close(monc-con); + ceph_con_revoke(monc-con, monc-m_auth); + ceph_con_close(monc-con); monc-cur_mon = -1; monc-pending_auth = 0; ceph_auth_reset(monc-auth); @@ -142,9 +142,9 @@ static int __open_session(struct ceph_mon_client *monc) monc-want_next_osdmap = !!monc-want_next_osdmap; dout(open_session mon%d opening\n, monc-cur_mon); - monc-con-peer_name.type = CEPH_ENTITY_TYPE_MON; - monc-con-peer_name.num = cpu_to_le64(monc-cur_mon); - ceph_con_open(monc-con, + monc-con.peer_name.type = CEPH_ENTITY_TYPE_MON; + monc-con.peer_name.num = cpu_to_le64(monc-cur_mon); + ceph_con_open(monc-con, monc-monmap-mon_inst[monc-cur_mon].addr); /* initiatiate authentication handshake */ @@ -226,8 +226,8 @@ static void __send_subscribe(struct ceph_mon_client *monc) msg-front.iov_len = p - msg-front.iov_base; msg-hdr.front_len = cpu_to_le32(msg-front.iov_len); - ceph_con_revoke(monc-con, msg); - ceph_con_send(monc-con, ceph_msg_get(msg)); + ceph_con_revoke(monc-con, msg); + ceph_con_send(monc-con, ceph_msg_get(msg)); monc-sub_sent = jiffies | 1; /* never 0 */ } @@ -247,7 +247,7 @@ static void handle_subscribe_ack(struct ceph_mon_client *monc, if (monc-hunting) { pr_info(mon%d %s session established\n, monc-cur_mon, - ceph_pr_addr(monc-con-peer_addr.in_addr)); + ceph_pr_addr(monc-con.peer_addr.in_addr)); monc-hunting = false; } dout(handle_subscribe_ack after %d seconds\n, seconds); @@ -461,7 +461,7 @@ static int do_generic_request(struct ceph_mon_client *monc, req-request-hdr.tid = cpu_to_le64(req-tid); __insert_generic_request(monc, req); monc-num_generic_requests++; - ceph_con_send(monc-con, ceph_msg_get(req-request)); + 
ceph_con_send(monc-con, ceph_msg_get(req-request)); mutex_unlock(monc-mutex); err = wait_for_completion_interruptible(req-completion); @@ -684,8 +684,8 @@ static void __resend_generic_request(struct ceph_mon_client *monc) for (p = rb_first(monc-generic_request_tree); p; p = rb_next(p)) { req = rb_entry(p, struct ceph_mon_generic_request, node); - ceph_con_revoke(monc-con, req-request); - ceph_con_send(monc-con, ceph_msg_get(req-request)); + ceph_con_revoke(monc-con, req-request); + ceph_con_send(monc-con, ceph_msg_get(req-request)); } } @@ -705,7 +705,7 @@ static void delayed_work(struct work_struct *work) __close_session(monc);
Re: [PATCH 08/13] libceph: start separating connection flags from state
On Wed, 30 May 2012, Alex Elder wrote: A ceph_connection holds a mixture of connection state (as in state machine state) and connection flags in a single state field. To make the distinction more clear, define a new flags field and use it rather than the state field to hold Boolean flag values. Signed-off-by: Alex Elder el...@inktank.com --- include/linux/ceph/messenger.h | 18 + net/ceph/messenger.c | 50 2 files changed, 37 insertions(+), 31 deletions(-) diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h index 3fbd4be..920235e 100644 --- a/include/linux/ceph/messenger.h +++ b/include/linux/ceph/messenger.h @@ -103,20 +103,25 @@ struct ceph_msg_pos { #define MAX_DELAY_INTERVAL (5 * 60 * HZ) /* - * ceph_connection state bit flags + * ceph_connection flag bits */ + #define LOSSYTX 0 /* we can close channel or drop messages on errors */ -#define CONNECTING 1 -#define NEGOTIATING 2 #define KEEPALIVE_PENDING 3 #define WRITE_PENDING4 /* we have data ready to send */ +#define SOCK_CLOSED 11 /* socket state changed to closed */ +#define BACKOFF 15 + +/* + * ceph_connection states + */ +#define CONNECTING 1 +#define NEGOTIATING 2 #define STANDBY 8 /* no outgoing messages, socket closed. we keep * the ceph_connection around to maintain shared * state with the peer. */ #define CLOSED 10 /* we've closed the connection */ -#define SOCK_CLOSED 11 /* socket state changed to closed */ #define OPENING 13 /* open connection w/ (possibly new) peer */ -#define BACKOFF 15 Later it might be work prefixing these with FLAG_ and/or STATE_. Reviewed-by: Sage Weil s...@inktank.com /* * A single connection with another host. @@ -133,7 +138,8 @@ struct ceph_connection { struct ceph_messenger *msgr; struct socket *sock; - unsigned long state;/* connection state (see flags above) */ + unsigned long flags; + unsigned long state; const char *error_msg; /* error message, if any */ struct ceph_entity_addr peer_addr; /* peer address */ diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 19f1948..29055df 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -176,7 +176,7 @@ static void ceph_sock_write_space(struct sock *sk) * buffer. See net/ipv4/tcp_input.c:tcp_check_space() * and net/core/stream.c:sk_stream_write_space(). 
*/ - if (test_bit(WRITE_PENDING, con-state)) { + if (test_bit(WRITE_PENDING, con-flags)) { if (sk_stream_wspace(sk) = sk_stream_min_wspace(sk)) { dout(%s %p queueing write work\n, __func__, con); clear_bit(SOCK_NOSPACE, sk-sk_socket-flags); @@ -203,7 +203,7 @@ static void ceph_sock_state_change(struct sock *sk) dout(%s TCP_CLOSE\n, __func__); case TCP_CLOSE_WAIT: dout(%s TCP_CLOSE_WAIT\n, __func__); - if (test_and_set_bit(SOCK_CLOSED, con-state) == 0) { + if (test_and_set_bit(SOCK_CLOSED, con-flags) == 0) { if (test_bit(CONNECTING, con-state)) con-error_msg = connection failed; else @@ -393,9 +393,9 @@ void ceph_con_close(struct ceph_connection *con) ceph_pr_addr(con-peer_addr.in_addr)); set_bit(CLOSED, con-state); /* in case there's queued work */ clear_bit(STANDBY, con-state); /* avoid connect_seq bump */ - clear_bit(LOSSYTX, con-state); /* so we retry next connect */ - clear_bit(KEEPALIVE_PENDING, con-state); - clear_bit(WRITE_PENDING, con-state); + clear_bit(LOSSYTX, con-flags); /* so we retry next connect */ + clear_bit(KEEPALIVE_PENDING, con-flags); + clear_bit(WRITE_PENDING, con-flags); mutex_lock(con-mutex); reset_connection(con); con-peer_global_seq = 0; @@ -612,7 +612,7 @@ static void prepare_write_message(struct ceph_connection *con) prepare_write_message_footer(con); } - set_bit(WRITE_PENDING, con-state); + set_bit(WRITE_PENDING, con-flags); } /* @@ -633,7 +633,7 @@ static void prepare_write_ack(struct ceph_connection *con) con-out_temp_ack); con-out_more = 1; /* more will follow.. eventually.. */ - set_bit(WRITE_PENDING, con-state); + set_bit(WRITE_PENDING, con-flags); } /* @@ -644,7 +644,7 @@ static void prepare_write_keepalive(struct ceph_connection *con) dout(prepare_write_keepalive %p\n, con); con_out_kvec_reset(con); con_out_kvec_add(con, sizeof (tag_keepalive), tag_keepalive); - set_bit(WRITE_PENDING, con-state); + set_bit(WRITE_PENDING, con-flags); } /* @@ -673,7 +673,7 @@
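The distinction this patch draws can be shown in miniature: one word holds exactly one state-machine state at a time, while the other holds any combination of independent booleans. A userspace schematic with plain masks instead of the kernel's set_bit/test_bit:

	enum con_state {
		CON_STATE_CLOSED,
		CON_STATE_CONNECTING,
		CON_STATE_NEGOTIATING,
		CON_STATE_STANDBY,
	};

	#define FLAG_LOSSYTX           (1UL << 0)
	#define FLAG_KEEPALIVE_PENDING (1UL << 1)
	#define FLAG_WRITE_PENDING     (1UL << 2)

	struct conn {
		enum con_state state;  /* exactly one value at a time */
		unsigned long flags;   /* zero or more FLAG_* bits */
	};

	int main(void)
	{
		struct conn c = { CON_STATE_CONNECTING, 0 };

		c.flags |= FLAG_WRITE_PENDING;  /* a flag changes, the state does not */
		return (c.state == CON_STATE_CONNECTING &&
			(c.flags & FLAG_WRITE_PENDING)) ? 0 : 1;
	}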
Re: [PATCH 09/13] libceph: start tracking connection socket state
On Wed, 30 May 2012, Alex Elder wrote: Start explicitly keeping track of the state of a ceph connection's socket, separate from the state of the connection itself. Create placeholder functions to encapsulate the state transitions. | NEW* | transient initial state | con_sock_state_init() v -- | CLOSED | initialized, but no socket (and no -- TCP connection) ^ \ | \ con_sock_state_connecting() |-- | \ + con_sock_state_closed() \ |\ \ | \ \ | --- \ | | CLOSING | socket event; \ | --- await close \ | ^| | || | + con_sock_state_closing() | | / \ | | / ---| |/ \ v | /-- | /-| CONNECTING | socket created, TCP | | / -- connect initiated | | | con_sock_state_connected() | | v - | CONNECTED | TCP connection established - Can we put this beautiful pictures in the header next to the states? Reviewed-by: Sage Weil s...@inktank.com Make the socket state an atomic variable, reinforcing that it's a distinct transtion with no possible intermediate/both states. This is almost certainly overkill at this point, though the transitions into CONNECTED and CLOSING state do get called via socket callback (the rest of the transitions occur with the connection mutex held). We can back out the atomicity later. Signed-off-by: Alex Elder el...@inktank.com --- include/linux/ceph/messenger.h |8 - net/ceph/messenger.c | 63 2 files changed, 69 insertions(+), 2 deletions(-) diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h index 920235e..5e852f4 100644 --- a/include/linux/ceph/messenger.h +++ b/include/linux/ceph/messenger.h @@ -137,14 +137,18 @@ struct ceph_connection { const struct ceph_connection_operations *ops; struct ceph_messenger *msgr; + + atomic_t sock_state; struct socket *sock; + struct ceph_entity_addr peer_addr; /* peer address */ + struct ceph_entity_addr peer_addr_for_me; + unsigned long flags; unsigned long state; const char *error_msg; /* error message, if any */ - struct ceph_entity_addr peer_addr; /* peer address */ struct ceph_entity_name peer_name; /* peer name */ - struct ceph_entity_addr peer_addr_for_me; + unsigned peer_features; u32 connect_seq; /* identify the most recent connection attempt for this connection, client */ diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 29055df..7e11b07 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -29,6 +29,14 @@ * the sender. 
*/ +/* State values for ceph_connection-sock_state; NEW is assumed to be 0 */ + +#define CON_SOCK_STATE_NEW 0 /* - CLOSED */ +#define CON_SOCK_STATE_CLOSED1 /* - CONNECTING */ +#define CON_SOCK_STATE_CONNECTING2 /* - CONNECTED or - CLOSING */ +#define CON_SOCK_STATE_CONNECTED 3 /* - CLOSING or - CLOSED */ +#define CON_SOCK_STATE_CLOSING 4 /* - CLOSED */ + /* static tag bytes (protocol control messages) */ static char tag_msg = CEPH_MSGR_TAG_MSG; static char tag_ack = CEPH_MSGR_TAG_ACK; @@ -147,6 +155,54 @@ void ceph_msgr_flush(void) } EXPORT_SYMBOL(ceph_msgr_flush); +/* Connection socket state transition functions */ + +static void con_sock_state_init(struct ceph_connection *con) +{ + int old_state; + + old_state = atomic_xchg(con-sock_state, CON_SOCK_STATE_CLOSED); + if (WARN_ON(old_state != CON_SOCK_STATE_NEW)) + printk(%s: unexpected old state %d\n, __func__, old_state); +} + +static void con_sock_state_connecting(struct ceph_connection *con) +{ + int old_state; + + old_state = atomic_xchg(con-sock_state, CON_SOCK_STATE_CONNECTING); + if (WARN_ON(old_state != CON_SOCK_STATE_CLOSED)) + printk(%s: unexpected old state %d\n, __func__, old_state); +} + +static void con_sock_state_connected(struct ceph_connection *con) +{ + int old_state; + + old_state = atomic_xchg(con-sock_state, CON_SOCK_STATE_CONNECTED); + if (WARN_ON(old_state != CON_SOCK_STATE_CONNECTING)) + printk(%s: unexpected old state %d\n, __func__, old_state); +} + +static void con_sock_state_closing(struct
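The atomic_xchg pattern in the con_sock_state_*() helpers has a direct userspace analog: swap in the new state unconditionally, then verify the old state was the expected one. A minimal C11 sketch with invented names (the kernel uses atomic_t and atomic_xchg rather than <stdatomic.h>):

	#include <stdatomic.h>
	#include <stdio.h>

	enum { SOCK_NEW, SOCK_CLOSED, SOCK_CONNECTING, SOCK_CONNECTED, SOCK_CLOSING };

	static _Atomic int sock_state = SOCK_NEW;

	/* swap in the next state, warn if the previous state was unexpected */
	static void transition(int next, int expected, const char *name)
	{
		int old = atomic_exchange(&sock_state, next);

		if (old != expected)
			fprintf(stderr, "%s: unexpected old state %d\n", name, old);
	}

	int main(void)
	{
		transition(SOCK_CLOSED, SOCK_NEW, "init");  /* NEW -> CLOSED */
		transition(SOCK_CONNECTING, SOCK_CLOSED, "connecting");
		transition(SOCK_CONNECTED, SOCK_CONNECTING, "connected");
		transition(SOCK_CLOSING, SOCK_CONNECTED, "closing");
		transition(SOCK_CLOSED, SOCK_CLOSING, "closed");
		return 0;
	}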
Re: [PATCH 11/13] libceph: init monitor connection when opening
yep!

On Wed, 30 May 2012, Alex Elder wrote:
> Hold off initializing a monitor client's connection until just
> before it gets opened for use.
>
> Signed-off-by: Alex Elder el...@inktank.com
> ---
>  net/ceph/mon_client.c |   13 ++++++-------
>  1 files changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c
> index ac4d6b1..77da480 100644
> --- a/net/ceph/mon_client.c
> +++ b/net/ceph/mon_client.c
> @@ -119,6 +119,7 @@ static void __close_session(struct ceph_mon_client *monc)
>  	dout("__close_session closing mon%d\n", monc->cur_mon);
>  	ceph_con_revoke(&monc->con, monc->m_auth);
>  	ceph_con_close(&monc->con);
> +	monc->con.private = NULL;
>  	monc->cur_mon = -1;
>  	monc->pending_auth = 0;
>  	ceph_auth_reset(monc->auth);
> @@ -141,9 +142,13 @@ static int __open_session(struct ceph_mon_client *monc)
>  		monc->sub_renew_after = jiffies;  /* i.e., expired */
>  		monc->want_next_osdmap = !!monc->want_next_osdmap;
>
> -		dout("open_session mon%d opening\n", monc->cur_mon);
> +		ceph_con_init(&monc->client->msgr, &monc->con);
> +		monc->con.private = monc;
> +		monc->con.ops = &mon_con_ops;
>  		monc->con.peer_name.type = CEPH_ENTITY_TYPE_MON;
>  		monc->con.peer_name.num = cpu_to_le64(monc->cur_mon);
> +
> +		dout("open_session mon%d opening\n", monc->cur_mon);
>  		ceph_con_open(&monc->con,
>  			      &monc->monmap->mon_inst[monc->cur_mon].addr);
>
> @@ -760,10 +765,6 @@ int ceph_monc_init(struct ceph_mon_client *monc, struct ceph_client *cl)
>  		goto out;
>
>  	/* connection */
> -	ceph_con_init(&monc->client->msgr, &monc->con);
> -	monc->con.private = monc;
> -	monc->con.ops = &mon_con_ops;
> -
>  	/* authentication */
>  	monc->auth = ceph_auth_init(cl->options->name, cl->options->key);
> @@ -836,8 +837,6 @@ void ceph_monc_stop(struct ceph_mon_client *monc)
>  	mutex_lock(&monc->mutex);
>  	__close_session(monc);
> -	monc->con.private = NULL;
> -
>  	mutex_unlock(&monc->mutex);
>
>  	ceph_auth_destroy(monc->auth);
> --
> 1.7.5.4
Re: [PATCH 12/13] libceph: fully initialize connection in con_init()
Reviewed-by: Sage Weil s...@inktank.com On Wed, 30 May 2012, Alex Elder wrote: Move the initialization of a ceph connection's private pointer, operations vector pointer, and peer name information into ceph_con_init(). Rearrange the arguments so the connection pointer is first. Hide the byte-swapping of the peer entity number inside ceph_con_init() Signed-off-by: Alex Elder el...@inktank.com --- fs/ceph/mds_client.c |7 ++- include/linux/ceph/messenger.h |6 -- net/ceph/messenger.c |9 - net/ceph/mon_client.c |8 +++- net/ceph/osd_client.c |7 ++- 5 files changed, 19 insertions(+), 18 deletions(-) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index ad30261..ecd7f15 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -394,11 +394,8 @@ static struct ceph_mds_session *register_session(struct ceph_mds_client *mdsc, s-s_seq = 0; mutex_init(s-s_mutex); - ceph_con_init(mdsc-fsc-client-msgr, s-s_con); - s-s_con.private = s; - s-s_con.ops = mds_con_ops; - s-s_con.peer_name.type = CEPH_ENTITY_TYPE_MDS; - s-s_con.peer_name.num = cpu_to_le64(mds); + ceph_con_init(s-s_con, s, mds_con_ops, mdsc-fsc-client-msgr, + CEPH_ENTITY_TYPE_MDS, mds); spin_lock_init(s-s_gen_ttl_lock); s-s_cap_gen = 0; diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h index 5e852f4..dd27837 100644 --- a/include/linux/ceph/messenger.h +++ b/include/linux/ceph/messenger.h @@ -227,8 +227,10 @@ extern void ceph_messenger_init(struct ceph_messenger *msgr, u32 required_features, bool nocrc); -extern void ceph_con_init(struct ceph_messenger *msgr, - struct ceph_connection *con); +extern void ceph_con_init(struct ceph_connection *con, void *private, + const struct ceph_connection_operations *ops, + struct ceph_messenger *msgr, __u8 entity_type, + __u64 entity_num); extern void ceph_con_open(struct ceph_connection *con, struct ceph_entity_addr *addr); extern bool ceph_con_opened(struct ceph_connection *con); diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index 7e11b07..cdf8299 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -514,15 +514,22 @@ void ceph_con_put(struct ceph_connection *con) /* * initialize a new connection. 
*/ -void ceph_con_init(struct ceph_messenger *msgr, struct ceph_connection *con) +void ceph_con_init(struct ceph_connection *con, void *private, + const struct ceph_connection_operations *ops, + struct ceph_messenger *msgr, __u8 entity_type, __u64 entity_num) { dout(con_init %p\n, con); memset(con, 0, sizeof(*con)); + con-private = private; atomic_set(con-nref, 1); + con-ops = ops; con-msgr = msgr; con_sock_state_init(con); + con-peer_name.type = (__u8) entity_type; + con-peer_name.num = cpu_to_le64(entity_num); + mutex_init(con-mutex); INIT_LIST_HEAD(con-out_queue); INIT_LIST_HEAD(con-out_sent); diff --git a/net/ceph/mon_client.c b/net/ceph/mon_client.c index 77da480..9b4cef9 100644 --- a/net/ceph/mon_client.c +++ b/net/ceph/mon_client.c @@ -142,11 +142,9 @@ static int __open_session(struct ceph_mon_client *monc) monc-sub_renew_after = jiffies; /* i.e., expired */ monc-want_next_osdmap = !!monc-want_next_osdmap; - ceph_con_init(monc-client-msgr, monc-con); - monc-con.private = monc; - monc-con.ops = mon_con_ops; - monc-con.peer_name.type = CEPH_ENTITY_TYPE_MON; - monc-con.peer_name.num = cpu_to_le64(monc-cur_mon); + ceph_con_init(monc-con, monc, mon_con_ops, + monc-client-msgr, + CEPH_ENTITY_TYPE_MON, monc-cur_mon); dout(open_session mon%d opening\n, monc-cur_mon); ceph_con_open(monc-con, diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index e30efbc..1f3951a 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -640,11 +640,8 @@ static struct ceph_osd *create_osd(struct ceph_osd_client *osdc, int onum) INIT_LIST_HEAD(osd-o_osd_lru); osd-o_incarnation = 1; - ceph_con_init(osdc-client-msgr, osd-o_con); - osd-o_con.private = osd; - osd-o_con.ops = osd_con_ops; - osd-o_con.peer_name.type = CEPH_ENTITY_TYPE_OSD; - osd-o_con.peer_name.num = cpu_to_le64(onum); + ceph_con_init(osd-o_con, osd, osd_con_ops, osdc-client-msgr, + CEPH_ENTITY_TYPE_OSD, onum); INIT_LIST_HEAD(osd-o_keepalive_item); return osd; -- 1.7.5.4 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to
Re: [PATCH 13/13] libceph: set CLOSED state bit in con_init
Reviewed-by: Sage Weil s...@inktank.com

On Wed, 30 May 2012, Alex Elder wrote:
> Once a connection is fully initialized, it is really in a CLOSED
> state, so make that explicit by setting the bit in its state field.
>
> It is possible for a connection in NEGOTIATING state to get a
> failure, leading to ceph_fault() and ultimately ceph_con_close().
> Clear that bit if it is set in that case, to reflect that the
> connection truly is closed and is no longer participating in a
> connect sequence.
>
> Issue a warning if ceph_con_open() is called on a connection that
> is not in CLOSED state.
>
> Signed-off-by: Alex Elder el...@inktank.com
> ---
>  net/ceph/messenger.c |    8 +++++++-
>  1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index cdf8299..85bfe12 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -452,10 +452,13 @@ void ceph_con_close(struct ceph_connection *con)
>  	dout("con_close %p peer %s\n", con,
>  	     ceph_pr_addr(&con->peer_addr.in_addr));
>  	set_bit(CLOSED, &con->state);  /* in case there's queued work */
> +	clear_bit(NEGOTIATING, &con->state);
>  	clear_bit(STANDBY, &con->state);  /* avoid connect_seq bump */
> +
>  	clear_bit(LOSSYTX, &con->flags);  /* so we retry next connect */
>  	clear_bit(KEEPALIVE_PENDING, &con->flags);
>  	clear_bit(WRITE_PENDING, &con->flags);
> +
>  	mutex_lock(&con->mutex);
>  	reset_connection(con);
>  	con->peer_global_seq = 0;
> @@ -472,7 +475,8 @@ void ceph_con_open(struct ceph_connection *con, struct ceph_entity_addr *addr)
>  {
>  	dout("con_open %p %s\n", con, ceph_pr_addr(&addr->in_addr));
>  	set_bit(OPENING, &con->state);
> -	clear_bit(CLOSED, &con->state);
> +	WARN_ON(!test_and_clear_bit(CLOSED, &con->state));
> +
>  	memcpy(&con->peer_addr, addr, sizeof(*addr));
>  	con->delay = 0;  /* reset backoff memory */
>  	queue_con(con);
> @@ -534,6 +538,8 @@ void ceph_con_init(struct ceph_connection *con, void *private,
>  	INIT_LIST_HEAD(&con->out_queue);
>  	INIT_LIST_HEAD(&con->out_sent);
>  	INIT_DELAYED_WORK(&con->work, con_work);
> +
> +	set_bit(CLOSED, &con->state);
>  }
>  EXPORT_SYMBOL(ceph_con_init);
> --
> 1.7.5.4
Re: rbd rm image slow with big images ?
>> That said, the current implementation is also stupid: it's doing a
>> single io at a time. #2256 (next sprint) will parallelize this to make
>> it go much faster (probably an order of magnitude?).

Ah, OK, that's why I see low io/network activity during the delete.

Thanks Sage and Wido for the explanations, that's very clear!

----- Original Message -----
From: Sage Weil s...@inktank.com
To: Wido den Hollander w...@widodh.nl
Cc: Alexandre DERUMIER aderum...@odiso.com, ceph-devel@vger.kernel.org
Sent: Thursday, 31 May 2012 20:19:44
Subject: Re: rbd rm image slow with big images ?

[quoted reply trimmed]

--
Alexandre Derumier
Ingénieur Système
Fixe : 03 20 68 88 90
Fax : 03 20 68 90 81
45 Bvd du Général Leclerc 59100 Roubaix - France
12 rue Marivaux 75002 Paris - France
Re: SIGSEGV in cephfs-java, but probably in Ceph
I pulled the Java lib from https://github.com/noahdesu/ceph/tree/wip-java-cephfs

However, I use ceph 0.47.1 installed directly from Ubuntu's repository with apt-get, not the one that I built with the java library. I assumed that would be fine, since the java lib is just a wrapper.

> There are only two segfaults that I've ever encountered: one in which
> the C wrappers are used with an unmounted client, and the error Nam is
> seeing (although they could be related). I will re-submit an updated
> patch for the former, which should rule that out as the culprit.

No, this occurs when I call mount(null) with the monitor being taken down. The library should throw an Exception instead, but since the SIGSEGV originates from libcephfs.so, I guess it's more related to Ceph's internal code.

Best regards,
Nam Dang
Tokyo Institute of Technology
Tokyo, Japan

On Fri, Jun 1, 2012 at 8:58 AM, Noah Watkins jayh...@cs.ucsc.edu wrote:
> [quoted reply trimmed]
Re: SIGSEGV in cephfs-java, but probably in Ceph
I made a mistake in the previous email. As Noah said, this problem is due to the wrapper being used with a client whose mount failed. However, I think if the mount fails, the wrapper should throw an exception instead of letting the client continue.

Best regards,
Nam Dang
Tokyo Institute of Technology
Tokyo, Japan

On Fri, Jun 1, 2012 at 1:44 PM, Nam Dang n...@de.cs.titech.ac.jp wrote:
> No, this occurs when I call mount(null) with the monitor being taken down. The library should throw an Exception instead, but since the SIGSEGV originates from libcephfs.so, I guess it's more related to Ceph's internal code.
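For the other segfault class Noah mentioned (C wrappers used with an unmounted client), a hedged sketch of a guard the native side could apply before every filesystem call, using libcephfs's ceph_is_mounted(); the JNI binding names here are again illustrative:

#include <jni.h>
#include <cephfs/libcephfs.h>

// Returns true if the client is mounted; otherwise raises an
// IllegalStateException on the Java side and returns false.
static bool require_mounted(JNIEnv *env, struct ceph_mount_info *cmount)
{
    if (ceph_is_mounted(cmount))
        return true;
    jclass ex = env->FindClass("java/lang/IllegalStateException");
    if (ex)
        env->ThrowNew(ex, "client is not mounted");
    return false;
}

// Hypothetical binding for mkdir; every native entry point would start
// with the same guard before touching libcephfs internals.
extern "C" JNIEXPORT jint JNICALL
Java_CephMount_nativeMkdir(JNIEnv *env, jobject obj, jlong handle,
                           jstring path, jint mode)
{
    struct ceph_mount_info *cmount = (struct ceph_mount_info *)handle;
    if (!require_mounted(env, cmount))
        return -1;                     // Java exception is already pending

    const char *p = env->GetStringUTFChars(path, NULL);
    int ret = ceph_mkdir(cmount, p, mode);   // safe: mount verified above
    env->ReleaseStringUTFChars(path, p);
    return ret;
}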
Re: SIGSEGV in cephfs-java, but probably in Ceph
On May 31, 2012, at 9:44 PM, Nam Dang wrote:
> No, this occurs when I call mount(null) with the monitor being taken down.
> The library should throw an Exception instead,

I agree. I'll push changes to the tree soon.

Thanks.
- Noah