We are currently trying to track down this bug, as well as one other involving potential data corruption under heavy load. For what it's worth, I haven't seen this bug since some patches were committed a while back.
Can you include some more detailed information about your hardware setup, the types of NICs specifically? We've found some bugs that occur on slower NICs but not on faster NICs, so knowing what hardware you are running might help us out here.

Tomorrow I can sit down and look at this further; I'm also going to cc this to the pvfs2-dev list.

~Kyle

On Sun, Mar 30, 2008 at 1:05 PM, Eric J. Walter <[EMAIL PROTECTED]> wrote:
> Dear pvfs2-users,
>
> I have been trying to get pvfs2 working over InfiniBand for a few
> weeks now and have made a lot of progress. I am still stuck on one
> last thing I can't seem to fix.
>
> Basically, everything will be fine for a while (a few days), and then
> I see the following in one of the pvfs2-server logs (when the
> debugging mask is set to "all"):
>
> [E 03/30 11:50] Error: encourage_recv_incoming: mop_id 680cc0 in RTS_DONE message not found.
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(error+0xbd) [0x45d9ed]
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45b571]
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x45d281]
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server(BMI_testcontext+0x120) [0x43cd40]
> [E 03/30 11:50] [bt] /share/apps/pvfs2_032308CVS/sbin/pvfs2-server [0x43508d]
> [E 03/30 11:50] [bt] /lib64/tls/libpthread.so.0 [0x354b90610a]
> [E 03/30 11:50] [bt] /lib64/tls/libc.so.6(__clone+0x73) [0x354b0c68c3]
>
> At this point all mounts hang, all servers and clients require a
> restart/remount, and all jobs using this space need to be restarted.
>
> Only one server at a time ever suffers this problem; we have 3
> servers total for I/O (one handles both metadata and I/O), and the
> message can occur on any of the 3.
>
> It seems that this occurs only when the number of clients accessing
> the filesystem grows beyond, say, 15-20, or perhaps it is a
> filesystem load issue? I haven't been able to tell...
>
> I am using the CVS version from 03/23/08 (I also tried version
> 2.6.3, but it had other problems mentioned on the pvfs2-users mailing
> list, so I decided to go with the CVS version).
>
> I am using OFED version 1.1 on a cluster of dual-core/processor
> Opterons running kernel 2.6.9-42.ELsmp. We have 114 clients which
> mount the pvfs2 file space over InfiniBand and use it as scratch
> space. They don't use MPI-IO/ROMIO; they write directly to the pvfs2
> file space mounted via IB (I guess they write through the kernel
> interface). The errors seem to occur when more than 15-20 processors'
> worth of jobs try to read/write to the pvfs2 scratch space, or they
> could just be random.
>
> Does anyone have clues for how to debug this further or track down
> what the problem is?
>
> Any suggestions are welcome.
>
> Thanks,
>
> Eric J. Walter
> Department of Physics
> College of William and Mary

--
Kyle Schochenmaier

_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
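To gather the hardware details Kyle asks about above, the standard OFED diagnostics are usually enough. A minimal sketch, assuming the OFED 1.1 userland tools (ibstat, ibv_devinfo, ofed_info) are installed on the nodes:

    # Collect the IB NIC details requested above (run on servers and a few clients).
    uname -r                      # running kernel (2.6.9-42.ELsmp here)
    ofed_info | head -n 1         # OFED stack version
    ibstat                        # HCA model, firmware, port state and rate
    ibv_devinfo -v                # verbose device attributes from libibverbs
    lspci | grep -i infiniband    # PCI view of the HCA

Running these on both the servers and a sample of the clients makes it easy to spot mismatched firmware or link rates across the fabric.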
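For reference, the server-side debugging mask Eric mentions is controlled by the EventLogging setting in the pvfs2 server configuration file. A minimal sketch, with an illustrative LogFile path; the change typically requires a server restart to take effect:

    # Excerpt from a pvfs2 server config file (sketch only).
    <Defaults>
        # "all" produces the verbose trace quoted above; a narrower
        # comma-separated mask cuts log volume on production systems.
        EventLogging all
        LogFile /var/log/pvfs2-server.log
    </Defaults>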
