We were able to solve the ssh problem, but now MPI is not able to use the yalla component. We are running the following command:
mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1 /root/app2 : -n 1 --hostfile /root/host2 /root/backend

The command is run in a chroot environment on JARVICENAE27; the other node is JARVICENAE125. JARVICENAE125 is able to select yalla, since it is a remote node and thus is not trying to run the job in the chroot environment. But JARVICENAE27 is throwing a few MXM-related errors and yalla is not selected. Following are the verbose logs of the command (a sketch for checking InfiniBand device visibility inside the chroot appears below the logs). Any idea what might be wrong?

[1432283901.548917] sys.c:719 MXM WARN Conflicting CPU frequencies detected, using: 2601.00
[JARVICENAE125:00909] mca: base: components_register: registering pml components
[JARVICENAE125:00909] mca: base: components_register: found loaded component v
[JARVICENAE125:00909] mca: base: components_register: component v register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded component bfo
[JARVICENAE125:00909] mca: base: components_register: component bfo register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded component cm
[JARVICENAE125:00909] mca: base: components_register: component cm register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded component ob1
[JARVICENAE125:00909] mca: base: components_register: component ob1 register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded component yalla
[JARVICENAE125:00909] mca: base: components_register: component yalla register function successful
[JARVICENAE125:00909] mca: base: components_open: opening pml components
[JARVICENAE125:00909] mca: base: components_open: found loaded component v
[JARVICENAE125:00909] mca: base: components_open: component v open function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded component bfo
[JARVICENAE125:00909] mca: base: components_open: component bfo open function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded component cm
[JARVICENAE125:00909] mca: base: components_open: component cm open function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded component ob1
[JARVICENAE125:00909] mca: base: components_open: component ob1 open function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded component yalla
[JARVICENAE125:00909] mca: base: components_open: component yalla open function successful
[JARVICENAE125:00909] select: component v not in the include list
[JARVICENAE125:00909] select: component bfo not in the include list
[JARVICENAE125:00909] select: initializing pml component cm
[JARVICENAE27:06474] mca: base: components_register: registering pml components
[JARVICENAE27:06474] mca: base: components_register: found loaded component v
[JARVICENAE27:06474] mca: base: components_register: component v register function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded component bfo
[JARVICENAE27:06474] mca: base: components_register: component bfo register function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded component cm
[JARVICENAE27:06474] mca: base: components_register: component cm register function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded component ob1
[JARVICENAE27:06474] mca: base: components_register: component ob1 register function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded component yalla
[JARVICENAE27:06474] mca: base: components_register: component yalla register function successful
[JARVICENAE27:06474] mca: base: components_open: opening pml components
[JARVICENAE27:06474] mca: base: components_open: found loaded component v
[JARVICENAE27:06474] mca: base: components_open: component v open function successful
[JARVICENAE27:06474] mca: base: components_open: found loaded component bfo
[JARVICENAE27:06474] mca: base: components_open: component bfo open function successful
[JARVICENAE27:06474] mca: base: components_open: found loaded component cm
libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
[1432283901.559929] sys.c:719 MXM WARN Conflicting CPU frequencies detected, using: 2601.00
[1432283901.561294] [JARVICENAE27:6474 :0] ib_dev.c:573 MXM ERROR There are no Mellanox cards detected.
[JARVICENAE27:06474] mca: base: close: component cm closed
[JARVICENAE27:06474] mca: base: close: unloading component cm
[JARVICENAE27:06474] mca: base: components_open: found loaded component ob1
[JARVICENAE27:06474] mca: base: components_open: component ob1 open function successful
[JARVICENAE27:06474] mca: base: components_open: found loaded component yalla
[1432283901.561732] [JARVICENAE27:6474 :0] ib_dev.c:573 MXM ERROR There are no Mellanox cards detected.
[JARVICENAE27:06474] mca: base: components_open: component yalla open function failed
[JARVICENAE27:06474] mca: base: close: component yalla closed
[JARVICENAE27:06474] mca: base: close: unloading component yalla
[JARVICENAE27:06474] select: component v not in the include list
[JARVICENAE27:06474] select: component bfo not in the include list
[JARVICENAE27:06474] select: initializing pml component ob1
[JARVICENAE27:06474] select: init returned priority 20
[JARVICENAE27:06474] selected ob1 best priority 20
[JARVICENAE27:06474] select: component ob1 selected
[JARVICENAE27:06474] mca: base: close: component v closed
[JARVICENAE27:06474] mca: base: close: unloading component v
[JARVICENAE27:06474] mca: base: close: component bfo closed
[JARVICENAE27:06474] mca: base: close: unloading component bfo
[JARVICENAE125:00909] select: init returned priority 30
[JARVICENAE125:00909] select: initializing pml component ob1
[JARVICENAE125:00909] select: init returned failure for component ob1
[JARVICENAE125:00909] select: initializing pml component yalla
[JARVICENAE125:00909] select: init returned priority 50
[JARVICENAE125:00909] selected yalla best priority 50
[JARVICENAE125:00909] select: component cm not selected / finalized
[JARVICENAE125:00909] select: component yalla selected
[JARVICENAE125:00909] mca: base: close: component v closed
[JARVICENAE125:00909] mca: base: close: unloading component v
[JARVICENAE125:00909] mca: base: close: component bfo closed
[JARVICENAE125:00909] mca: base: close: unloading component bfo
[JARVICENAE125:00909] mca: base: close: component cm closed
[JARVICENAE125:00909] mca: base: close: unloading component cm
[JARVICENAE125:00909] mca: base: close: component ob1 closed
[JARVICENAE125:00909] mca: base: close: unloading component ob1
[JARVICENAE27:06474] check:select: modex not reqd

On Wed, May 13, 2015 at 8:02 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Okay, so we see two nodes have been allocated:
>
> 1. JARVICENAE27 - appears to be the node where mpirun is running
>
> 2. 10.3.0.176
>
> Does that match what you expected?
>
> If you cannot ssh (without a password) between machines, then we will not
> be able to run.
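For what it's worth, the MXM error above ("There are no Mellanox cards detected") on JARVICENAE27 looks consistent with the InfiniBand device nodes and sysfs entries simply not being visible inside the chroot. The following is only a sketch of how one might check and expose them, under the assumption that the chroot lives at a placeholder path /chroot (not a path taken from this thread) and that ibv_devinfo is installed both on the host and inside the chroot:

    # On JARVICENAE27, outside the chroot: confirm the host itself sees the HCA.
    ibv_devinfo | head

    # Make the device nodes and sysfs visible inside the chroot.
    # /chroot is a placeholder for the real chroot directory.
    mount --bind /dev /chroot/dev      # provides /dev/infiniband/*
    mount --bind /sys /chroot/sys      # libibverbs/MXM read /sys/class/infiniband*
    mount -t proc proc /chroot/proc

    # Re-check from inside the chroot; if the mlx device shows up here,
    # MXM should be able to open it and the yalla open should no longer fail.
    chroot /chroot ibv_devinfo | head

If the device is still not visible inside the chroot, MXM has nothing to open, the yalla open function fails, and mpirun falls back to ob1 on that node, which is exactly what the verbose output above shows.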
>
>
> On May 13, 2015, at 12:21 AM, Rahul Yadav <robora...@gmail.com> wrote:
>
> I get following output with verbose
>
> [JARVICENAE27:00654] mca: base: components_register: registering ras components
> [JARVICENAE27:00654] mca: base: components_register: found loaded component loadleveler
> [JARVICENAE27:00654] mca: base: components_register: component loadleveler register function successful
> [JARVICENAE27:00654] mca: base: components_register: found loaded component simulator
> [JARVICENAE27:00654] mca: base: components_register: component simulator register function successful
> [JARVICENAE27:00654] mca: base: components_register: found loaded component slurm
> [JARVICENAE27:00654] mca: base: components_register: component slurm register function successful
> [JARVICENAE27:00654] mca: base: components_open: opening ras components
> [JARVICENAE27:00654] mca: base: components_open: found loaded component loadleveler
> [JARVICENAE27:00654] mca: base: components_open: component loadleveler open function successful
> [JARVICENAE27:00654] mca: base: components_open: found loaded component simulator
> [JARVICENAE27:00654] mca: base: components_open: found loaded component slurm
> [JARVICENAE27:00654] mca: base: components_open: component slurm open function successful
> [JARVICENAE27:00654] mca:base:select: Auto-selecting ras components
> [JARVICENAE27:00654] mca:base:select:( ras) Querying component [loadleveler]
> [JARVICENAE27:00654] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
> [JARVICENAE27:00654] mca:base:select:( ras) Querying component [simulator]
> [JARVICENAE27:00654] mca:base:select:( ras) Skipping component [simulator]. Query failed to return a module
> [JARVICENAE27:00654] mca:base:select:( ras) Querying component [slurm]
> [JARVICENAE27:00654] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
> [JARVICENAE27:00654] mca:base:select:( ras) No component selected!
>
> ====================== ALLOCATED NODES ======================
> JARVICENAE27: slots=1 max_slots=0 slots_inuse=0 state=UP
> 10.3.0.176: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>
> Also, I am not able to ssh to other machine from one machine in chroot
> environment. Can that be a problem ?
>
> Thanks
> Rahul
>
> On Thu, May 7, 2015 at 8:06 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Try adding --mca ras_base_verbose 10 to your cmd line and let's see what
>> it thinks it is doing. Which OMPI version are you using - master?
>>
>>
>> On May 6, 2015, at 11:24 PM, Rahul Yadav <robora...@gmail.com> wrote:
>>
>> Hi,
>>
>> We have been trying to run MPI jobs (consisting of two different
>> binaries, one each) in two nodes, using the hostfile option as following
>>
>> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1
>> /root/app2 : -n 1 --hostfile /root/host2 /root/backend
>>
>> We are doing this in a chroot environment. We have set the HPCX env in the
>> chroot'ed environment itself. /root/host1 and /root/host2 (inside the chroot
>> env) contain the IPs of the two nodes respectively.
>>
>> We are getting the following error
>>
>> " all nodes which are allocated for this job are already filled "
>>
>> However, when we use chroot but don't use the hostfile option (both processes
>> run in the same node), OR we use the hostfile option but outside chroot, it works.
>>
>> Anyone has any idea if chroot can cause the above error and how to solve it ?
>>
>> Thanks
>> Rahul
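Since passwordless ssh between the nodes came up earlier in the thread as a hard requirement for mpirun to launch on the remote node, a quick way to confirm it works from inside the chroot is sketched below. /chroot is again a placeholder path, and 10.3.0.176 is the remote node listed in the allocation above:

    # BatchMode=yes makes ssh fail immediately instead of prompting for a password,
    # so a missing key setup inside the chroot shows up right away.
    chroot /chroot ssh -o BatchMode=yes 10.3.0.176 hostname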