We were able to solve the SSH problem.

But now MPI is not able to use the yalla component. We are running the
following command:

mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1
/root/app2 : -n 1 --hostfile /root/host2 /root/backend
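
For reference, /root/host1 and /root/host2 each list one of the two nodes.
A hypothetical sketch of what such a hostfile contains (the slots=1 field is
an assumption; the real files may just hold the bare IP):

# /root/host1 -- illustrative contents only
10.3.0.176 slots=1

/root/host2 names the other node in the same way.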

The command is run in a chroot environment on JARVICENAE27; the other node
is JARVICENAE125. JARVICENAE125 is able to select yalla, since it is a
remote node and therefore does not run its part of the job inside the
chroot. JARVICENAE27, however, throws a few MXM-related errors and yalla is
not selected there.
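
For what it's worth, a quick sanity check from inside the chroot on
JARVICENAE27 -- assuming the libibverbs utilities are installed there --
would be something like:

# run inside the chroot on JARVICENAE27
ls /dev/infiniband /sys/class/infiniband   # are the IB device nodes and sysfs entries visible?
ibv_devinfo                                # should list the Mellanox HCA if its userspace driver is found

The libibverbs warning in the logs ("no userspace device-specific driver
found") suggests the device is partly visible but the Mellanox provider
library is missing or not found inside the chroot.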

Following are the logs of the command with verbose output enabled.

Any idea what might be wrong?
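
For reference, a sketch of how the PML verbosity below can be enabled --
assuming Open MPI's standard pml_base_verbose MCA parameter (the exact
level on our run may have been different):

mpirun --allow-run-as-root --mca pml yalla --mca pml_base_verbose 10 \
    -n 1 --hostfile /root/host1 /root/app2 : -n 1 --hostfile /root/host2 /root/backend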

[1432283901.548917]         sys.c:719  MXM  WARN  Conflicting CPU
frequencies detected, using: 2601.00
[JARVICENAE125:00909] mca: base: components_register: registering pml
components
[JARVICENAE125:00909] mca: base: components_register: found loaded
component v
[JARVICENAE125:00909] mca: base: components_register: component v register
function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded
component bfo
[JARVICENAE125:00909] mca: base: components_register: component bfo
register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded
component cm
[JARVICENAE125:00909] mca: base: components_register: component cm register
function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded
component ob1
[JARVICENAE125:00909] mca: base: components_register: component ob1
register function successful
[JARVICENAE125:00909] mca: base: components_register: found loaded
component yalla
[JARVICENAE125:00909] mca: base: components_register: component yalla
register function successful
[JARVICENAE125:00909] mca: base: components_open: opening pml components
[JARVICENAE125:00909] mca: base: components_open: found loaded component v
[JARVICENAE125:00909] mca: base: components_open: component v open function
successful
[JARVICENAE125:00909] mca: base: components_open: found loaded component bfo
[JARVICENAE125:00909] mca: base: components_open: component bfo open
function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded component cm
[JARVICENAE125:00909] mca: base: components_open: component cm open
function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded component ob1
[JARVICENAE125:00909] mca: base: components_open: component ob1 open
function successful
[JARVICENAE125:00909] mca: base: components_open: found loaded component
yalla
[JARVICENAE125:00909] mca: base: components_open: component yalla open
function successful
[JARVICENAE125:00909] select: component v not in the include list
[JARVICENAE125:00909] select: component bfo not in the include list
[JARVICENAE125:00909] select: initializing pml component cm
[JARVICENAE27:06474] mca: base: components_register: registering pml
components
[JARVICENAE27:06474] mca: base: components_register: found loaded component
v
[JARVICENAE27:06474] mca: base: components_register: component v register
function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded component
bfo
[JARVICENAE27:06474] mca: base: components_register: component bfo register
function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded component
cm
[JARVICENAE27:06474] mca: base: components_register: component cm register
function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded component
ob1
[JARVICENAE27:06474] mca: base: components_register: component ob1 register
function successful
[JARVICENAE27:06474] mca: base: components_register: found loaded component
yalla
[JARVICENAE27:06474] mca: base: components_register: component yalla
register function successful
[JARVICENAE27:06474] mca: base: components_open: opening pml components
[JARVICENAE27:06474] mca: base: components_open: found loaded component v
[JARVICENAE27:06474] mca: base: components_open: component v open function
successful
[JARVICENAE27:06474] mca: base: components_open: found loaded component bfo
[JARVICENAE27:06474] mca: base: components_open: component bfo open
function successful
[JARVICENAE27:06474] mca: base: components_open: found loaded component cm
libibverbs: Warning: no userspace device-specific driver found for
/sys/class/infiniband_verbs/uverbs0
[1432283901.559929]         sys.c:719  MXM  WARN  Conflicting CPU
frequencies detected, using: 2601.00
[1432283901.561294] [JARVICENAE27:6474 :0]      ib_dev.c:573  MXM  ERROR
There are no Mellanox cards detected.
[JARVICENAE27:06474] mca: base: close: component cm closed
[JARVICENAE27:06474] mca: base: close: unloading component cm
[JARVICENAE27:06474] mca: base: components_open: found loaded component ob1
[JARVICENAE27:06474] mca: base: components_open: component ob1 open
function successful
[JARVICENAE27:06474] mca: base: components_open: found loaded component
yalla
[1432283901.561732] [JARVICENAE27:6474 :0]      ib_dev.c:573  MXM  ERROR
There are no Mellanox cards detected.
[JARVICENAE27:06474] mca: base: components_open: component yalla open
function failed
[JARVICENAE27:06474] mca: base: close: component yalla closed
[JARVICENAE27:06474] mca: base: close: unloading component yalla
[JARVICENAE27:06474] select: component v not in the include list
[JARVICENAE27:06474] select: component bfo not in the include list
[JARVICENAE27:06474] select: initializing pml component ob1
[JARVICENAE27:06474] select: init returned priority 20
[JARVICENAE27:06474] selected ob1 best priority 20
[JARVICENAE27:06474] select: component ob1 selected
[JARVICENAE27:06474] mca: base: close: component v closed
[JARVICENAE27:06474] mca: base: close: unloading component v
[JARVICENAE27:06474] mca: base: close: component bfo closed
[JARVICENAE27:06474] mca: base: close: unloading component bfo
[JARVICENAE125:00909] select: init returned priority 30
[JARVICENAE125:00909] select: initializing pml component ob1
[JARVICENAE125:00909] select: init returned failure for component ob1
[JARVICENAE125:00909] select: initializing pml component yalla
[JARVICENAE125:00909] select: init returned priority 50
[JARVICENAE125:00909] selected yalla best priority 50
[JARVICENAE125:00909] select: component cm not selected / finalized
[JARVICENAE125:00909] select: component yalla selected
[JARVICENAE125:00909] mca: base: close: component v closed
[JARVICENAE125:00909] mca: base: close: unloading component v
[JARVICENAE125:00909] mca: base: close: component bfo closed
[JARVICENAE125:00909] mca: base: close: unloading component bfo
[JARVICENAE125:00909] mca: base: close: component cm closed
[JARVICENAE125:00909] mca: base: close: unloading component cm
[JARVICENAE125:00909] mca: base: close: component ob1 closed
[JARVICENAE125:00909] mca: base: close: unloading component ob1
[JARVICENAE27:06474] check:select: modex not reqd


On Wed, May 13, 2015 at 8:02 AM, Ralph Castain <r...@open-mpi.org> wrote:

> Okay, so we see two nodes have been allocated:
>
> 1. JARVICENAE27 - appears to be the node where mpirun is running
>
> 2. 10.3.0.176
>
> Does that match what you expected?
>
> If you cannot ssh (without a password) between machines, then we will not
> be able to run.
>
>
> On May 13, 2015, at 12:21 AM, Rahul Yadav <robora...@gmail.com> wrote:
>
> I get following output with verbose
>
> [JARVICENAE27:00654] mca: base: components_register: registering ras
> components
> [JARVICENAE27:00654] mca: base: components_register: found loaded
> component loadleveler
> [JARVICENAE27:00654] mca: base: components_register: component loadleveler
> register function successful
> [JARVICENAE27:00654] mca: base: components_register: found loaded
> component simulator
> [JARVICENAE27:00654] mca: base: components_register: component simulator
> register function successful
> [JARVICENAE27:00654] mca: base: components_register: found loaded
> component slurm
> [JARVICENAE27:00654] mca: base: components_register: component slurm
> register function successful
> [JARVICENAE27:00654] mca: base: components_open: opening ras components
> [JARVICENAE27:00654] mca: base: components_open: found loaded component
> loadleveler
> [JARVICENAE27:00654] mca: base: components_open: component loadleveler
> open function successful
> [JARVICENAE27:00654] mca: base: components_open: found loaded component
> simulator
> [JARVICENAE27:00654] mca: base: components_open: found loaded component
> slurm
> [JARVICENAE27:00654] mca: base: components_open: component slurm open
> function successful
> [JARVICENAE27:00654] mca:base:select: Auto-selecting ras components
> [JARVICENAE27:00654] mca:base:select:(  ras) Querying component
> [loadleveler]
> [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component
> [loadleveler]. Query failed to return a module
> [JARVICENAE27:00654] mca:base:select:(  ras) Querying component [simulator]
> [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component
> [simulator]. Query failed to return a module
> [JARVICENAE27:00654] mca:base:select:(  ras) Querying component [slurm]
> [JARVICENAE27:00654] mca:base:select:(  ras) Skipping component [slurm].
> Query failed to return a module
> [JARVICENAE27:00654] mca:base:select:(  ras) No component selected!
>
> ======================   ALLOCATED NODES   ======================
>        JARVICENAE27: slots=1 max_slots=0 slots_inuse=0 state=UP
>        10.3.0.176: slots=1 max_slots=0 slots_inuse=0 state=UNKNOWN
>
> Also, I am not able to ssh to other machine from one machine in chroot
> environment. Can that be a problem ?
>
> Thanks
> Rahul
>
> On Thu, May 7, 2015 at 8:06 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Try adding --mca ras_base_verbose 10 to your cmd line and let’s see what
>> it thinks it is doing. Which OMPI version are you using - master?
>>
>>
>> On May 6, 2015, at 11:24 PM, Rahul Yadav <robora...@gmail.com> wrote:
>>
>> Hi,
>>
>> We have been trying to run MPI jobs (consisting of two different
>> binaries, one each ) in two nodes,  using hostfile option as following
>>
>> mpirun --allow-run-as-root --mca pml yalla -n 1 --hostfile /root/host1
>> /root/app2 : -n 1 --hostfile /root/host2 /root/backend
>>
>> We are doing this in chroot environment. We have set the HPCX env in
>> chroot'ed environment itself. /root/host1 and /root/host2 (inside chroot
>> env) contains IPs of two nodes respectively.
>>
>> We are getting following error
>>
>> " all nodes which are allocated for this job are already filled "
>>
>> However when we use chroot but don't use hostfile option (both processes
>> run in same node) OR we use hostfile option but outside chroot, it works.
>>
>> Anyone has any idea if chroot can cause above error and how to solve it ?
>>
>> Thanks
>> Rahul
