By the way, what is the rationale for running in a chroot environment? Is it a Docker-like environment? Does "ibv_devinfo -v" work for you from inside the chroot env?
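If it does not, my guess, based on the libibverbs warning about /sys/class/infiniband_verbs/uverbs0 further down in your log, is that the chroot cannot see the host's device and sysfs trees. A rough sketch of what I would try (the chroot path below is only a placeholder, adjust it to your setup):

    # placeholder path - replace with the actual location of your chroot
    CHROOT=/path/to/chroot

    # expose the host's device, sysfs and procfs trees inside the chroot
    mount --bind /dev  "$CHROOT/dev"
    mount --bind /sys  "$CHROOT/sys"
    mount --bind /proc "$CHROOT/proc"

    # check whether the Mellanox HCA is now visible from inside the chroot
    chroot "$CHROOT" ibv_devinfo -v

You may also need the Mellanox userspace provider libraries (libmlx4/libmlx5) installed inside the chroot; without them libibverbs prints the "no userspace device-specific driver" warning even when the uverbs device node is present. This is only a guess from the log, of course.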
On Tue, May 26, 2015 at 7:08 AM, Rahul Yadav <robora...@gmail.com> wrote:
> Yes Ralph, MXM cards are on the node. The command runs fine if I run it outside of the chroot environment.
>
> Thanks
> Rahul
>
> On Mon, May 25, 2015 at 9:03 PM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Well, it isn’t finding any MXM cards on NAE27 - do you have any there?
>>
>> You can’t use yalla without MXM cards on all nodes.
>>
>> On May 25, 2015, at 8:51 PM, Rahul Yadav <robora...@gmail.com> wrote:
>>
>> We were able to solve the ssh problem.
>>
>> But now MPI is not able to use the yalla component. We are running the following command:
>>
>> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1 /root/app2 : -n 1 --hostfile /root/host2 /root/backend
>>
>> The command is run in the chroot environment on JARVICENAE27, and the other node is JARVICENAE125. JARVICENAE125 is able to select yalla since it is a remote node and thus is not trying to run the job in the chroot environment. But JARVICENAE27 is throwing a few MXM-related errors and yalla is not selected.
>>
>> Following are the logs of the command with verbose output enabled.
>>
>> Any idea what might be wrong?
>>
>> [1432283901.548917] sys.c:719 MXM WARN Conflicting CPU frequencies detected, using: 2601.00
>> [JARVICENAE125:00909] mca: base: components_register: registering pml components
>> [JARVICENAE125:00909] mca: base: components_register: found loaded component v
>> [JARVICENAE125:00909] mca: base: components_register: component v register function successful
>> [JARVICENAE125:00909] mca: base: components_register: found loaded component bfo
>> [JARVICENAE125:00909] mca: base: components_register: component bfo register function successful
>> [JARVICENAE125:00909] mca: base: components_register: found loaded component cm
>> [JARVICENAE125:00909] mca: base: components_register: component cm register function successful
>> [JARVICENAE125:00909] mca: base: components_register: found loaded component ob1
>> [JARVICENAE125:00909] mca: base: components_register: component ob1 register function successful
>> [JARVICENAE125:00909] mca: base: components_register: found loaded component yalla
>> [JARVICENAE125:00909] mca: base: components_register: component yalla register function successful
>> [JARVICENAE125:00909] mca: base: components_open: opening pml components
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component v
>> [JARVICENAE125:00909] mca: base: components_open: component v open function successful
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component bfo
>> [JARVICENAE125:00909] mca: base: components_open: component bfo open function successful
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component cm
>> [JARVICENAE125:00909] mca: base: components_open: component cm open function successful
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component ob1
>> [JARVICENAE125:00909] mca: base: components_open: component ob1 open function successful
>> [JARVICENAE125:00909] mca: base: components_open: found loaded component yalla
>> [JARVICENAE125:00909] mca: base: components_open: component yalla open function successful
>> [JARVICENAE125:00909] select: component v not in the include list
>> [JARVICENAE125:00909] select: component bfo not in the include list
>> [JARVICENAE125:00909] select: initializing pml component cm
>> [JARVICENAE27:06474] mca: base: components_register: registering pml components
>> [JARVICENAE27:06474] mca: base: components_register: found loaded component v
>> [JARVICENAE27:06474] mca: base: components_register: component v register function successful
>> [JARVICENAE27:06474] mca: base: components_register: found loaded component bfo
>> [JARVICENAE27:06474] mca: base: components_register: component bfo register function successful
>> [JARVICENAE27:06474] mca: base: components_register: found loaded component cm
>> [JARVICENAE27:06474] mca: base: components_register: component cm register function successful
>> [JARVICENAE27:06474] mca: base: components_register: found loaded component ob1
>> [JARVICENAE27:06474] mca: base: components_register: component ob1 register function successful
>> [JARVICENAE27:06474] mca: base: components_register: found loaded component yalla
>> [JARVICENAE27:06474] mca: base: components_register: component yalla register function successful
>> [JARVICENAE27:06474] mca: base: components_open: opening pml components
>> [JARVICENAE27:06474] mca: base: components_open: found loaded component v
>> [JARVICENAE27:06474] mca: base: components_open: component v open function successful
>> [JARVICENAE27:06474] mca: base: components_open: found loaded component bfo
>> [JARVICENAE27:06474] mca: base: components_open: component bfo open function successful
>> [JARVICENAE27:06474] mca: base: components_open: found loaded component cm
>> libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
>> [1432283901.559929] sys.c:719 MXM WARN Conflicting CPU frequencies detected, using: 2601.00
>> [1432283901.561294] [JARVICENAE27:6474 :0] ib_dev.c:573 MXM ERROR There are no Mellanox cards detected.
>> [JARVICENAE27:06474] mca: base: close: component cm closed
>> [JARVICENAE27:06474] mca: base: close: unloading component cm
>> [JARVICENAE27:06474] mca: base: components_open: found loaded component ob1
>> [JARVICENAE27:06474] mca: base: components_open: component ob1 open function successful
>> [JARVICENAE27:06474] mca: base: components_open: found loaded component yalla
>> [1432283901.561732] [JARVICENAE27:6474 :0] ib_dev.c:573 MXM ERROR There are no Mellanox cards detected.
>> [JARVICENAE27:06474] mca: base: components_open: component yalla open function failed
>> [JARVICENAE27:06474] mca: base: close: component yalla closed
>> [JARVICENAE27:06474] mca: base: close: unloading component yalla
>> [JARVICENAE27:06474] select: component v not in the include list
>> [JARVICENAE27:06474] select: component bfo not in the include list
>> [JARVICENAE27:06474] select: initializing pml component ob1
>> [JARVICENAE27:06474] select: init returned priority 20
>> [JARVICENAE27:06474] selected ob1 best priority 20
>> [JARVICENAE27:06474] select: component ob1 selected
>> [JARVICENAE27:06474] mca: base: close: component v closed
>> [JARVICENAE27:06474] mca: base: close: unloading component v
>> [JARVICENAE27:06474] mca: base: close: component bfo closed
>> [JARVICENAE27:06474] mca: base: close: unloading component bfo
>> [JARVICENAE125:00909] select: init returned priority 30
>> [JARVICENAE125:00909] select: initializing pml component ob1
>> [JARVICENAE125:00909] select: init returned failure for component ob1
>> [JARVICENAE125:00909] select: initializing pml component yalla
>> [JARVICENAE125:00909] select: init returned priority 50
>> [JARVICENAE125:00909] selected yalla best priority 50
>> [JARVICENAE125:00909] select: component cm not selected / finalized
>> [JARVICENAE125:00909] select: component yalla selected
>> [JARVICENAE125:00909] mca: base: close: component v closed
>> [JARVICENAE125:00909] mca: base: close: unloading component v
>> [JARVICENAE125:00909] mca: base: close: component bfo closed
>> [JARVICENAE125:00909] mca: base: close: unloading component bfo
>> [JARVICENAE125:00909] mca: base: close: component cm closed
>> [JARVICENAE125:00909] mca: base: close: unloading component cm
>> [JARVICENAE125:00909] mca: base: close: component ob1 closed
>> [JARVICENAE125:00909] mca: base: close: unloading component ob1
>> [JARVICENAE27:06474] check:select: modex not reqd
>>
>> On Wed, May 13, 2015 at 8:02 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Okay, so we see two nodes have been allocated:
>>>
>>> 1. JARVICENAE27 - appears to be the node where mpirun is running
>>>
>>> 2. 10.3.0.176
>>>
>>> Does that match what you expected?
>>>
>>> If you cannot ssh (without a password) between machines, then we will not be able to run.
>>>
>>> On May 13, 2015, at 12:21 AM, Rahul Yadav <robora...@gmail.com> wrote:
>>>
>>> I get the following output with verbose enabled:
>>>
>>> [JARVICENAE27:00654] mca: base: components_register: registering ras components
>>> [JARVICENAE27:00654] mca: base: components_register: found loaded component loadleveler
>>> [JARVICENAE27:00654] mca: base: components_register: component loadleveler register function successful
>>> [JARVICENAE27:00654] mca: base: components_register: found loaded component simulator
>>> [JARVICENAE27:00654] mca: base: components_register: component simulator register function successful
>>> [JARVICENAE27:00654] mca: base: components_register: found loaded component slurm
>>> [JARVICENAE27:00654] mca: base: components_register: component slurm register function successful
>>> [JARVICENAE27:00654] mca: base: components_open: opening ras components
>>> [JARVICENAE27:00654] mca: base: components_open: found loaded component loadleveler
>>> [JARVICENAE27:00654] mca: base: components_open: component loadleveler open function successful
>>> [JARVICENAE27:00654] mca: base: components_open: found loaded component simulator
>>> [JARVICENAE27:00654] mca: base: components_open: found loaded component slurm
>>> [JARVICENAE27:00654] mca: base: components_open: component slurm open function successful
>>> [JARVICENAE27:00654] mca:base:select: Auto-selecting ras components
>>> [JARVICENAE27:00654] mca:base:select:( ras) Querying component [loadleveler]
>>> [JARVICENAE27:00654] mca:base:select:( ras) Skipping component [loadleveler]. Query failed to return a module
>>> [JARVICENAE27:00654] mca:base:select:( ras) Querying component [simulator]
>>> [JARVICENAE27:00654] mca:base:select:( ras) Skipping component [simulator]. Query failed to return a module
>>> [JARVICENAE27:00654] mca:base:select:( ras) Querying component [slurm]
>>> [JARVICENAE27:00654] mca:base:select:( ras) Skipping component [slurm]. Query failed to return a module
>>> [JARVICENAE27:00654] mca:base:select:( ras) No component selected!
>>>
>>> ====================== ALLOCATED NODES ======================
>>> JARVICENAE27: slots=1 max_slots=0 slots_inuse=0 state=UP
>>> 10.3.0.176: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>>>
>>> Also, I am not able to ssh from one machine to the other in the chroot environment. Can that be a problem?
>>>
>>> Thanks
>>> Rahul
>>>
>>> On Thu, May 7, 2015 at 8:06 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>>
>>>> Try adding --mca ras_base_verbose 10 to your cmd line and let’s see what it thinks it is doing. Which OMPI version are you using - master?
>>>>
>>>> On May 6, 2015, at 11:24 PM, Rahul Yadav <robora...@gmail.com> wrote:
>>>>
>>>> Hi,
>>>>
>>>> We have been trying to run MPI jobs (consisting of two different binaries, one on each node) across two nodes, using the hostfile option as follows:
>>>>
>>>> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1 /root/app2 : -n 1 --hostfile /root/host2 /root/backend
>>>>
>>>> We are doing this in a chroot environment. We have set up the HPCX environment inside the chroot'ed environment itself. /root/host1 and /root/host2 (inside the chroot env) contain the IPs of the two nodes respectively.
>>>>
>>>> We are getting the following error:
>>>>
>>>> "all nodes which are allocated for this job are already filled"
>>>>
>>>> However, when we use chroot but don't use the hostfile option (both processes run on the same node), OR when we use the hostfile option but outside chroot, it works.
>>>>
>>>> Does anyone have any idea whether chroot can cause the above error and how to solve it?
>>>>
>>>> Thanks
>>>> Rahul

--
Kind Regards,
M.