We are currently trying to track down this bug, as well as one other involving potential data corruption under heavy load. For what it's worth, I haven't seen this bug since some patches were committed a while back.
Can you include some more detailed information about your hardware setup, the types of NICs specifically? We've found some bugs that occur on slower NICs but not on faster NICs, so knowing what hardware you are running might help us out here.

Tomorrow I can sit down and look at this further; I'm also going to cc this to the pvfs2-dev list.

~Kyle

On Sun, Mar 30, 2008 at 1:05 PM, Eric J. Walter <[EMAIL PROTECTED]> wrote:
> Dear pvfs2-users,
>
> I have been trying to get pvfs2 working over InfiniBand for a few
> weeks now and have made a lot of progress. I am still stuck on one
> last thing I can't seem to fix.
>
> Basically, everything will be fine for a while (a few days), and then
> I see the following in one of the pvfs2-server logs (when the
> debugging mask is set to "all"):
>
> [E 03/30 11:50] Error: encourage_recv_incoming: mop_id 680cc0 in RTS_DONE message not found.
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(error+0xbd) [0x45d9ed]
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45b571]
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45d281]
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(BMI_testcontext+0x120) [0x43cd40]
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x43508d]
> [E 03/30 11:50] [bt] /lib64/tls/libpthread.so.0 [0x354b90610a]
> [E 03/30 11:50] [bt] /lib64/tls/libc.so.6(__clone+0x73) [0x354b0c68c3]
>
> At this point all mounts hang, all servers and clients require a
> restart/remount, and all jobs using this space need to be restarted.
>
> Only one server at a time ever suffers this problem; we have 3
> servers total for I/O (one handles both metadata and I/O), and the
> message can occur on any of the 3.
>
> It seems that this occurs only when the number of clients accessing
> the filesystem grows beyond, say, 15-20, or perhaps it is a
> filesystem load issue? I haven't been able to tell...
>
> I am using the CVS version from 03/23/08 (I also tried version
> 2.6.3, but it had other problems mentioned on the pvfs2-users mailing
> list, so I decided to go with the CVS version).
>
> I am using OFED version 1.1 on a cluster of dual-core/processor
> Opterons running kernel 2.6.9-42.ELsmp. We have 114 clients which
> mount the pvfs2 file space over InfiniBand and use it as scratch
> space. They don't use MPI-IO/ROMIO; they write directly to the pvfs2
> file space mounted via IB (I guess they write through the kernel
> interface). The errors seem to occur when more than 15-20 processors'
> worth of jobs try to read/write to the pvfs2 scratch space, or they
> could just be random.
>
> Does anyone have clues for how to debug this further or track down
> what the problem is?
>
> Any suggestions are welcome.
>
> Thanks,
>
> Eric J. Walter
> Department of Physics
> College of William and Mary

--
Kyle Schochenmaier

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
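To gather the hardware details Kyle asks about above, the standard OFED diagnostics are usually enough. A minimal sketch, assuming the OFED 1.1 userland tools (ibstat, ibv_devinfo, ofed_info) are installed on the nodes:

    # Collect the IB NIC details requested above (run on servers and a few clients).
    uname -r                      # running kernel (2.6.9-42.ELsmp here)
    ofed_info | head -n 1         # OFED stack version
    ibstat                        # HCA model, firmware, port state and rate
    ibv_devinfo -v                # verbose device attributes from libibverbs
    lspci | grep -i infiniband    # PCI view of the HCA

Running these on both the servers and a sample of the clients makes it easy to spot mismatched firmware or link rates across the fabric.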
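For reference, the server-side debugging mask Eric mentions is controlled by the EventLogging setting in the pvfs2 server configuration file. A minimal sketch, with an illustrative LogFile path; the change typically requires a server restart to take effect:

    # Excerpt from a pvfs2 server config file (sketch only).
    <Defaults>
        # "all" produces the verbose trace quoted above; a narrower
        # comma-separated mask cuts log volume on production systems.
        EventLogging all
        LogFile /var/log/pvfs2-server.log
    </Defaults>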
