Re: [Gluster-users] RDMA Problems with GlusterFS 3.1.1

2010-12-13 Thread Artem Trunov
Hi, Jeremy, Raghavendra

Thanks! That patch worked just fine for me as well.

cheers,
artem

On Mon, Dec 13, 2010 at 2:39 AM, Jeremy Stout  wrote:
> I recompiled GlusterFS with the unaccepted patch and have not seen any
> RDMA error messages yet. I'll run some benchmarking tests over the next
> couple of days to verify stability.
>
> Thank you.
>
> On Fri, Dec 10, 2010 at 12:22 AM, Raghavendra G wrote:
>> Hi Artem,
>>
>> you can check the maximum limits using the patch I sent earlier in this
>> thread. Also, the patch http://patches.gluster.com/patch/5844/ (not yet
>> accepted) checks whether the number of CQEs passed to ibv_create_cq
>> exceeds the value allowed by the device; if it does, it will create the
>> CQ with the maximum limit the device allows.
>>
>> regards,
>> - Original Message -
>> From: "Artem Trunov" 
>> To: "Raghavendra G" 
>> Cc: "Jeremy Stout" , gluster-users@gluster.org
>> Sent: Thursday, December 9, 2010 7:13:40 PM
>> Subject: Re: [Gluster-users] RDMA Problems with GlusterFS 3.1.1
>>
>> [quoted message and log trimmed; the full text appears in the 2010-12-09
>> message at the end of this thread]
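
For anyone hitting the same send_cq failure: the behaviour the patch adds
amounts to clamping the requested CQE count to the device limit before
calling ibv_create_cq. Below is a minimal standalone sketch of that logic
using standard libibverbs calls. It is not the actual patch; the
requested_cqe value is made up, and error handling is kept to a minimum.
Compile with: gcc -o cq_clamp cq_clamp.c -libverbs

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) {
        fprintf(stderr, "could not open %s\n", ibv_get_device_name(devs[0]));
        return 1;
    }

    struct ibv_device_attr attr;
    ibv_query_device(ctx, &attr);

    /* Clamp the requested CQE count to the device maximum. */
    int requested_cqe = 1024 * 128;   /* made-up request, may exceed max_cqe */
    int cqe = requested_cqe > attr.max_cqe ? attr.max_cqe : requested_cqe;

    struct ibv_cq *cq = ibv_create_cq(ctx, cqe, NULL, NULL, 0);
    printf("max_cqe = %d, requested = %d, created with %d: %s\n",
           attr.max_cqe, requested_cqe, cqe, cq ? "ok" : "failed");

    if (cq)
        ibv_destroy_cq(cq);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}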

Re: [Gluster-users] RDMA Problems with GlusterFS 3.1.1

2010-12-10 Thread Artem Trunov
Hi all

To add some info:

1) I can query the adapter settings with "ibv_devinfo -v", which reports
the max_cq and max_cqe values (a programmatic version of this query is
sketched just below this list).

2) I can vary max_cq via the ib_mthca module parameter num_cq, but that
does not affect max_cqe.
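
The same numbers can also be read programmatically. Below is a minimal
sketch of the query that ibv_devinfo performs, using standard libibverbs
calls; the fields come from struct ibv_device_attr. Compile with:
gcc -o query_limits query_limits.c -libverbs

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int n = 0;
    struct ibv_device **devs = ibv_get_device_list(&n);
    if (!devs || n == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_device_attr attr;

    /* Print the limits relevant to the CQ creation failure in this thread. */
    if (ctx && ibv_query_device(ctx, &attr) == 0)
        printf("%s: max_cq = %d, max_cqe = %d, max_mr = %d, max_mr_size = %llu\n",
               ibv_get_device_name(devs[0]), attr.max_cq, attr.max_cqe,
               attr.max_mr, (unsigned long long) attr.max_mr_size);

    if (ctx)
        ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}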

cheers
Artem.

On Fri, Dec 10, 2010 at 1:41 PM, Artem Trunov  wrote:
> [quoted message trimmed; it appears in full as the next message in this
> thread]

Re: [Gluster-users] RDMA Problems with GlusterFS 3.1.1

2010-12-10 Thread Artem Trunov
Hi, Raghavendra, Jeremy

Thanks. I have tried the patch and also OFED 1.5.2, and got pretty much
the same result Jeremy had:

[2010-12-10 13:32:59.69007] E [rdma.c:2047:rdma_create_cq]
rpc-transport/rdma: max_mr_size = 18446744073709551615, max_cq =
65408, max_cqe = 131071, max_mr = 131056

Aren't these parameters configurable at the driver level? I am a bit
new to InfiniBand, so I don't know...

How do you suggest I proceed? Should I try the unaccepted patch?

cheers
Artem.

On Fri, Dec 10, 2010 at 6:22 AM, Raghavendra G  wrote:
> [quoted text trimmed; Raghavendra's reply is quoted in full earlier in
> this thread, and the original message with the complete log appears below
> in the 2010-12-09 message]

Re: [Gluster-users] RDMA Problems with GlusterFS 3.1.1

2010-12-09 Thread Artem Trunov
Hi, Raghavendra, Jeremy

This has been a very interesting debugging thread for me, since I have the
same symptoms but am unsure of their origin. Please see the log from my
mount command at the end of this message.

I have installed 3.1.1. My OFED is 1.5.1; does that differ significantly
from the already-mentioned 1.5.2?

On hardware limitations: I have a Mellanox InfiniHost III Lx 20Gb/s, and
its specs say:

"Supports 16 million QPs, EEs & CQs"

Is this enough? How can I query the actual values of max_cq and max_cqe?

In general, how should I proceed? What are my other debugging options?
Should I follow Jeremy's path and patch the Gluster code?

cheers
Artem.

Log:

-
[2010-12-09 15:15:53.847595] W [io-stats.c:1644:init] test-volume:
dangling volume. check volfile
[2010-12-09 15:15:53.847643] W [dict.c:1204:data_to_str] dict: @data=(nil)
[2010-12-09 15:15:53.847657] W [dict.c:1204:data_to_str] dict: @data=(nil)
[2010-12-09 15:15:53.858574] E [rdma.c:2066:rdma_create_cq]
rpc-transport/rdma: test-volume-client-1: creation of send_cq failed
[2010-12-09 15:15:53.858805] E [rdma.c:3771:rdma_get_device]
rpc-transport/rdma: test-volume-client-1: could not create CQ
[2010-12-09 15:15:53.858821] E [rdma.c:3957:rdma_init]
rpc-transport/rdma: could not create rdma device for mthca0
[2010-12-09 15:15:53.858893] E [rdma.c:4789:init]
test-volume-client-1: Failed to initialize IB Device
[2010-12-09 15:15:53.858909] E
[rpc-transport.c:971:rpc_transport_load] rpc-transport: 'rdma'
initialization failed
pending frames:

patchset: v3.1.1
signal received: 11
time of crash: 2010-12-09 15:15:53
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.1.1
/lib64/libc.so.6[0x32aca302d0]
/lib64/libc.so.6(strcmp+0x0)[0x32aca79140]
/usr/lib64/glusterfs/3.1.1/rpc-transport/rdma.so[0x2c4fef6c]
/usr/lib64/glusterfs/3.1.1/rpc-transport/rdma.so(init+0x2f)[0x2c50013f]
/usr/lib64/libgfrpc.so.0(rpc_transport_load+0x389)[0x3fcca0cac9]
/usr/lib64/libgfrpc.so.0(rpc_clnt_new+0xfe)[0x3fcca1053e]
/usr/lib64/glusterfs/3.1.1/xlator/protocol/client.so(client_init_rpc+0xa1)[0x2b194f01]
/usr/lib64/glusterfs/3.1.1/xlator/protocol/client.so(init+0x129)[0x2b1950d9]
/usr/lib64/libglusterfs.so.0(xlator_init+0x58)[0x3fcc617398]
/usr/lib64/libglusterfs.so.0(glusterfs_graph_init+0x31)[0x3fcc640291]
/usr/lib64/libglusterfs.so.0(glusterfs_graph_activate+0x38)[0x3fcc6403c8]
/usr/sbin/glusterfs(glusterfs_process_volfp+0xfa)[0x40373a]
/usr/sbin/glusterfs(mgmt_getspec_cbk+0xc5)[0x406125]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa2)[0x3fcca0f542]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x8d)[0x3fcca0f73d]
/usr/lib64/libgfrpc.so.0(rpc_transport_notify+0x2c)[0x3fcca0a95c]
/usr/lib64/glusterfs/3.1.1/rpc-transport/socket.so(socket_event_poll_in+0x3f)[0x2ad6ef9f]
/usr/lib64/glusterfs/3.1.1/rpc-transport/socket.so(socket_event_handler+0x170)[0x2ad6f130]
/usr/lib64/libglusterfs.so.0[0x3fcc637917]
/usr/sbin/glusterfs(main+0x39b)[0x40470b]
/lib64/libc.so.6(__libc_start_main+0xf4)[0x32aca1d994]
/usr/sbin/glusterfs[0x402e29]




On Fri, Dec 3, 2010 at 1:53 PM, Raghavendra G  wrote:
> From the logs it is evident that the completion queue creation fails
> because the number of completion queue elements we request in
> ibv_create_cq (1024 * send_count) is greater than the maximum supported
> by the IB hardware (max_cqe = 131071).
>
> - Original Message -
> From: "Jeremy Stout" 
> To: "Raghavendra G" 
> Cc: gluster-users@gluster.org
> Sent: Friday, December 3, 2010 4:20:04 PM
> Subject: Re: [Gluster-users] RDMA Problems with GlusterFS 3.1.1
>
> I patched the source code and rebuilt GlusterFS. Here are the full logs:
> Server:
> [2010-12-03 07:08:55.945804] I [glusterd.c:275:init] management: Using
> /etc/glusterd as working directory
> [2010-12-03 07:08:55.947692] E [rdma.c:2047:rdma_create_cq]
> rpc-transport/rdma: max_mr_size = 18446744073709551615, max_cq =
> 65408, max_cqe = 131071, max_mr = 131056
> [2010-12-03 07:08:55.953226] E [rdma.c:2079:rdma_create_cq]
> rpc-transport/rdma: rdma.management: creation of send_cq failed
> [2010-12-03 07:08:55.953509] E [rdma.c:3785:rdma_get_device]
> rpc-transport/rdma: rdma.management: could not create CQ
> [2010-12-03 07:08:55.953582] E [rdma.c:3971:rdma_init]
> rpc-transport/rdma: could not create rdma device for mthca0
> [2010-12-03 07:08:55.953668] E [rdma.c:4803:init] rdma.management:
> Failed to initialize IB Device
> [2010-12-03 07:08:55.953691] E
> [rpc-transport.c:971:rpc_transport_load] rpc-transport: 'rdma'
> initialization failed
> [2010-12-03 07:08:55.953780] I [glusterd.c:96:glusterd_uuid_init]
> glusterd: generated UUID: 4eb47ca7-227c-49c4-97bd-25ac177b2f6a
> Given volfile:
> +--+
>  1: volume management
>  2:     type mgmt/glusterd
>  3:     option working-d
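
To make Raghavendra's diagnosis above concrete, here is a tiny check of
the arithmetic. The send_count value of 128 is assumed for illustration;
the thread does not state the actual value:

#include <stdio.h>

int main(void)
{
    int max_cqe = 131071;              /* device limit from the logs above */
    int send_count = 128;              /* assumed value, for illustration only */
    int requested = 1024 * send_count; /* CQE count passed to ibv_create_cq */

    printf("requested = %d, max_cqe = %d -> %s\n",
           requested, max_cqe,
           requested > max_cqe ? "ibv_create_cq fails" : "ok");
    return 0;
}

With these numbers the request comes out to 131072, just over the 131071
limit, which matches the send_cq creation failure in the logs.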