If you want to test whether it is an OSCAR problem, try running the
"Test Cluster Setup" step from the install wizard again.  I believe it
runs a simple code via MPICH and LAM over the entire cluster.  If
those tests all pass, then both your network and basic MPICH setup are
probably fine and it's an MPI debugging question :)

Out of curiosity, is it homegrown code or is it a "standard" package
like Gaussian, for example?

Either way, you are going to have to provide a lot more detailed
information about what exactly you are trying to do to get much help
from a mailing list, and, like Robin said, a list more closely
related to your specific problem would probably be the best place to
start.

I would say that a computational geophysics list would be the best
place, if such a thing exists.  The MPICH guys are going to have a
hard time doing much beyond giving general debugging advice unless
you post a fair hunk of your code.  A geophysics list might be able
to help more, especially if it is a "standard code" you are using.
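
For what it's worth, the errno values in the messages quoted below can
be decoded mechanically.  Here is a rough Python sketch, assuming the
nodes run Linux (where 104 is ECONNRESET and 32 is EPIPE).  The
function name and the small lookup table are mine, made up for
illustration; they are not part of MPICH or OSCAR.

```python
import re

# Linux errno values (assumption: the cluster nodes run Linux).
LINUX_ERRNO = {
    32: ("EPIPE", "Broken pipe"),
    104: ("ECONNRESET", "Connection reset by peer"),
}

# Matches "errno = 32" as well as the p4 style "errno = : 104".
ERRNO_RE = re.compile(r"errno\s*=\s*:?\s*(\d+)")


def decode_errnos(log_text):
    """Return (process_tag, errno, name, description) for each line with an errno."""
    results = []
    for line in log_text.splitlines():
        # Drop mail quote markers such as "> >>" and surrounding whitespace.
        line = line.lstrip("> ").strip()
        match = ERRNO_RE.search(line)
        if not match:
            continue
        code = int(match.group(1))
        name, desc = LINUX_ERRNO.get(code, ("UNKNOWN", "not in table"))
        tag = line.split(":", 1)[0]  # e.g. "p4_1171" or "rm_l_4_1189"
        results.append((tag, code, name, desc))
    return results


# A few lines copied from the error output in this thread.
SAMPLE = """\
p4_1171: p4_error: net_recv read, errno = : 104
rm_l_4_1189: (2106.532930) net_send: could not write to fd=5, errno = 32
p3_1166: p4_error: net_recv read: probable EOF on socket: 1
"""

if __name__ == "__main__":
    for tag, code, name, desc in decode_errnos(SAMPLE):
        print(f"{tag}: errno {code} = {name} ({desc})")
```

Both codes say the other end of the socket went away, which fits
Robin's guess that one process died and the rest failed as their
connections dropped.  It does not tell you which process died first or
why.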

On 1/5/07, Ben Turner - Dayboro Geophysical <[EMAIL PROTECTED]> wrote:
> Thanks Robin, I'll give these ideas a try and try a mpich list.
> Cheers
> Ben
> ----- Original Message -----
> From: "Robin Humble" <[EMAIL PROTECTED]>
> To: <oscar-users@lists.sourceforge.net>
> Sent: Saturday, January 06, 2007 1:03 PM
> Subject: Re: [Oscar-users] p4_errors net-recv wakeup_slave etc
>
>
> > On Sat, Jan 06, 2007 at 11:51:44AM +1000, Ben Turner - Dayboro Geophysical
> > wrote:
> >>I have oscar-4-2 installed on my ibm eserver cluster. I am trying to run a
> >>parallel
> >>program on five nodes. The process runs for a while successfully but then
> >>comes up with the following set of errors and fails. I have searched the
> >>archives but
> >>can't seem to find any answers. Does anybody have an idea?
> >
> > best guess is that it's a bug in your MPI code.
> >
> > the below messages are errors from mpich which look fairly generic and
> > don't really convey much information, at least to me. usually they just
> > mean one of your threads died 'cos the code ran into a problem, but it
> > could be many things.
> >
> > suggestions:
> > you could try compiling your code against LAM instead of mpich which
> > might produce different errors that make more sense to you.
> > you could try asking on an mpich mailing list.
> > run each thread of your code inside a debugger so you can see where it
> > crashes.
> > there's an outside chance that it's a networking problem with the
> > cluster, but if the code runs for a while before failing this seems
> > unlikely.
> > there are a bunch of other things it might be, but the above seem the
> > most likely.
> >
> > unfortunately none of the above is really OSCAR related in any way so
> > it probably isn't the right place to ask your questions...
> >
> > cheers,
> > robin
> >
> >>
> >>p4_1171: (2106.532602) net_recv failed for fd = 3
> >>p4_1171: p4_error: net_recv read, errno = : 104
> >>rm_l_4_1189: (2106.532930) net_send: could not write to fd=5, errno = 32
> >>p2_1234: (2111.528337) net_recv failed for fd = 3
> >>p2_1234: p4_error: net_recv read, errno = : 104
> >>rm_l_2_1252: (2111.528627) net_send: could not write to fd=5, errno = 32
> >>p3_1166: p4_error: net_recv read: probable EOF on socket: 1
> >>rm_l_3_1184: (2109.069751) net_send: could not write to fd=5, errno = 32
> >>p1_3576: (2114.044709) net_recv failed for fd = 3
> >>p1_3576: p4_error: net_recv read, errno = : 104
> >>rm_l_1_3594: (2114.044992) net_send: could not write to fd=5, errno = 32
> >>bm_list_7639: (2116.924922) wakeup_slave: unable to interrupt slave 0 pid
> >>7638
> >>bm_list_7639: (2116.925033) wakeup_slave: unable to interrupt slave 0 pid
> >>7638
> >>bm_list_7639: (2116.925086) wakeup_slave: unable to interrupt slave 0 pid
> >>7638
> >>bm_list_7639: (2116.925135) wakeup_slave: unable to interrupt slave 0 pid
> >>7638
> >>bm_list_7639: (2116.925181) wakeup_slave: unable to interrupt slave 0 pid
> >>7638
> >>p5_1098: p4_error: net_recv read: probable EOF on socket: 1
> >>rm_l_5_1116: (2104.011456) net_send: could not write to fd=5, errno = 32
> >>
> >>Cheers
> >>Ben
> >
> > -------------------------------------------------------------------------
> > Take Surveys. Earn Cash. Influence the Future of IT
> > Join SourceForge.net's Techsay panel and you'll get the chance to share
> > your
> > opinions on IT & business topics through brief surveys - and earn cash
> > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > _______________________________________________
> > Oscar-users mailing list
> > Oscar-users@lists.sourceforge.net
> > https://lists.sourceforge.net/lists/listinfo/oscar-users
> >
> >
> >
>
>
>
>
