If you want to check whether it is an OSCAR problem, try running the "Test Cluster Setup" step from the install wizard again. I believe it runs a simple program via MPICH and LAM across the entire cluster. If those tests all pass, then both your network and basic MPICH setup are probably fine, and it's an MPI debugging question :)
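If the cluster test passes and this does turn into a debugging exercise, the errno numbers in the quoted errors are one place to start. Here is a minimal Python sketch, assuming the nodes run Linux (where errno 32 is EPIPE and 104 is ECONNRESET). The function name decode_errnos and the small lookup table are my own, for illustration only; they are not part of MPICH or OSCAR.

```python
import re

# Linux errno numbers that appear in the quoted output.
# Assumption: the nodes run Linux; other systems number these differently.
LINUX_ERRNO = {
    32: ("EPIPE", "wrote to a socket whose other end is already closed"),
    104: ("ECONNRESET", "the peer reset the connection"),
}

# Matches lines such as
#   "p4_1171: p4_error: net_recv read, errno = : 104"
#   "rm_l_4_1189: (2106.532930) net_send: could not write to fd=5, errno = 32"
ERRNO_RE = re.compile(r"^(?P<proc>\w+):.*errno = :? ?(?P<num>\d+)\s*$")


def decode_errnos(log_text):
    """Return (process, errno number, errno name) for each line reporting an errno."""
    results = []
    for line in log_text.splitlines():
        match = ERRNO_RE.match(line.strip())
        if match is None:
            continue  # line does not report an errno
        num = int(match.group("num"))
        name = LINUX_ERRNO.get(num, ("unknown", ""))[0]
        results.append((match.group("proc"), num, name))
    return results


if __name__ == "__main__":
    sample = (
        "p4_1171: p4_error: net_recv read, errno = : 104\n"
        "rm_l_4_1189: (2106.532930) net_send: could not write to fd=5, errno = 32\n"
    )
    for proc, num, name in decode_errnos(sample):
        print(proc, num, name, "-", LINUX_ERRNO.get(num, ("", "not in table"))[1])
```

Both codes only say that the other end of a socket went away, which fits the idea that one process died first and the rest followed. They don't say why it died, so you would still need to find the process that failed first.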
Out of curiosity, is it homegrown code or a "standard" package like Gaussian, for example? Either way, you are going to have to provide much more detailed information about exactly what you are trying to do to get much help from a mailing list, and, as Robin said, a list more closely related to your specific problem is probably the best place to start. I would say a computational geophysics list would be best, if such a thing exists. The MPICH guys will have a hard time offering more than general debugging advice unless you post a fair hunk of your code. A geophysics list might be able to help more, especially if you are using a "standard" code.

On 1/5/07, Ben Turner - Dayboro Geophysical <[EMAIL PROTECTED]> wrote:
> Thanks Robin, I'll give these ideas a try and try a mpich list.
> Cheers
> Ben
> ----- Original Message -----
> From: "Robin Humble" <[EMAIL PROTECTED]>
> To: <oscar-users@lists.sourceforge.net>
> Sent: Saturday, January 06, 2007 1:03 PM
> Subject: Re: [Oscar-users] p4_errors net-recv wakeup_slave etc
>
>
> > On Sat, Jan 06, 2007 at 11:51:44AM +1000, Ben Turner - Dayboro Geophysical
> > wrote:
> >>I have oscar-4-2 installed on my ibm eserver cluster. I am trying to run a
> >>parallel program on five nodes. The process runs for a while successfully
> >>but then comes up with the following set of errors and fails. I have
> >>searched the archives but can't seem to find any answers. Does anybody
> >>have an idea?
> >
> > best guess is that it's a bug in your MPI code.
> >
> > the below messages are errors from mpich which look fairly generic and
> > don't really convey much information, at least to me. usually they just
> > mean one of your threads died 'cos the code ran into a problem, but it
> > could be many things.
> >
> > suggestions:
> > you could try compiling your code against LAM instead of mpich which
> > might produce different errors that make more sense to you.
> > you could try asking on an mpich mailing list.
> > run each thread of your code inside a debugger so you can see where it
> > crashes.
> > there's an outside chance that it's a networking problem with the
> > cluster, but if the code runs for a while before failing this seems
> > unlikely.
> > there are a bunch of other things it might be, but the above seem the
> > most likely.
> >
> > unfortunately none of the above is really OSCAR related in any way so
> > it probably isn't the right place to ask your questions...
> >
> > cheers,
> > robin
> >
> >>
> >>p4_1171: (2106.532602) net_recv failed for fd = 3
> >>p4_1171: p4_error: net_recv read, errno = : 104
> >>rm_l_4_1189: (2106.532930) net_send: could not write to fd=5, errno = 32
> >>p2_1234: (2111.528337) net_recv failed for fd = 3
> >>p2_1234: p4_error: net_recv read, errno = : 104
> >>rm_l_2_1252: (2111.528627) net_send: could not write to fd=5, errno = 32
> >>p3_1166: p4_error: net_recv read: probable EOF on socket: 1
> >>rm_l_3_1184: (2109.069751) net_send: could not write to fd=5, errno = 32
> >>p1_3576: (2114.044709) net_recv failed for fd = 3
> >>p1_3576: p4_error: net_recv read, errno = : 104
> >>rm_l_1_3594: (2114.044992) net_send: could not write to fd=5, errno = 32
> >>bm_list_7639: (2116.924922) wakeup_slave: unable to interrupt slave 0 pid 7638
> >>bm_list_7639: (2116.925033) wakeup_slave: unable to interrupt slave 0 pid 7638
> >>bm_list_7639: (2116.925086) wakeup_slave: unable to interrupt slave 0 pid 7638
> >>bm_list_7639: (2116.925135) wakeup_slave: unable to interrupt slave 0 pid 7638
> >>bm_list_7639: (2116.925181) wakeup_slave: unable to interrupt slave 0 pid 7638
> >>p5_1098: p4_error: net_recv read: probable EOF on socket: 1
> >>rm_l_5_1116: (2104.011456) net_send: could not write to fd=5, errno = 32
> >>
> >>Cheers
> >>Ben
> >
> > -------------------------------------------------------------------------
> > Take Surveys. Earn Cash. Influence the Future of IT
> > Join SourceForge.net's Techsay panel and you'll get the chance to share
> > your opinions on IT & business topics through brief surveys - and earn cash
> > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > _______________________________________________
> > Oscar-users mailing list
> > Oscar-users@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/oscar-users