Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-15 Thread Tena Sakai
Hi Gus,

Please read my comments inline.


On 2/14/11 7:05 PM, "Gus Correa"  wrote:

> Hi Tena
>
> Answers inline.
>
> Tena Sakai wrote:
>> Hi Gus,
>>
>>> Hence, I don't understand why the lack of symmetry in the
>>> firewall protection.
>>> Either vixen's is too loose, or dasher's is too tight, I'd venture to say.
>>> Maybe dasher was installed later, just got whatever boilerplate firewall
>>> that comes with RedHat, CentOS, Fedora.
>>> If there is a gateway for this LAN somewhere with another firewall,
>>> which is probably the case,
>>
>> You are correct.  We had a system administrator, but we lost
>> that person and I installed dasher from scratch myself and
>> I did use the boilerplate firewall from the CentOS 5.5 distribution.
>>
>
> I read your answer to Ashley and Reuti saying that you
> turned the firewall off and OpenMPI now works with vixen and dasher.
> That's good news!
>
>>> Do you have Internet access from either machine?
>>
>> Yes, I do.
>
> The LAN gateway is probably doing NAT.
I think that's the case.

> I would guess it also has its own firewall.
Yes, I believe so.

> Is there anybody there that could tell you about this?
I am afraid not...  Every time I ask something, I get
run-around or disinformation.

>
>>
>>> Vixen has yet another private IP 10.1.1.2 (eth0),
>>> with a bit weird combination of broadcast address 192.168.255.255(?),
>>> mask 255.0.0.0.
>>> vixen is/was part of another group of machines, via this other IP,
>>> cluster perhaps?
>>
>> We have a Rocks HPC cluster.  The cluster head is called blitzen
>> and there are 8 nodes in the cluster.  We have completely outgrown
>> this setting.  For example, I am running an application for last
>> 2 weeks with 4 of 8 nodes and the other 4 nodes have been used
>> by my colleagues and I expect my jobs to run another 2-3 weeks.
>> Which is why I am interested in cloud.
>>
>> Vixen is not part of the Rocks cluster, but it is an nfs server,
>> as well as database server.  Here's ifconfig of blitzen:
>>
>>   [tsakai@blitzen Rmpi]$ ifconfig
>>   eth0  Link encap:Ethernet  HWaddr 00:19:B9:E0:C0:0B
>> inet addr:10.1.1.1  Bcast:10.255.255.255  Mask:255.0.0.0
>> inet6 addr: fe80::219:b9ff:fee0:c00b/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> RX packets:58859908 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:38795319 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:14637456238 (13.6 GiB)  TX bytes:25487423161 (23.7 GiB)
>> Interrupt:193 Memory:ec00-ec012100
>>
>>   eth1  Link encap:Ethernet  HWaddr 00:19:B9:E0:C0:0D
>> inet addr:172.16.1.106  Bcast:172.16.3.255  Mask:255.255.252.0
>> inet6 addr: fe80::219:b9ff:fee0:c00d/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> RX packets:99465693 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:46026372 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:44685802310 (41.6 GiB)  TX bytes:28223858173 (26.2 GiB)
>> Interrupt:193 Memory:ea00-ea012100
>>
>>   lo    Link encap:Local Loopback
>> inet addr:127.0.0.1  Mask:255.0.0.0
>> inet6 addr: ::1/128 Scope:Host
>> UP LOOPBACK RUNNING  MTU:16436  Metric:1
>> RX packets:80078179 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:80078179 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:0
>> RX bytes:27450135463 (25.5 GiB)  TX bytes:27450135463 (25.5 GiB)
>>
>> And here's the same thing of vixen:
>> [tsakai@vixen Rmpi]$ cat moo
>>   eth0  Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:31
>> inet addr:10.1.1.2  Bcast:192.168.255.255  Mask:255.0.0.0
>> inet6 addr: fe80::21a:a0ff:fe1c:31/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> RX packets:61942079 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:61950934 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:47837093368 (44.5 GiB)  TX bytes:54525223424 (50.7 GiB)
>> Interrupt:185 Memory:ea00-ea012100
>>
>>   eth1  Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:33
>> inet addr:172.16.1.107  Bcast:172.16.3.255  Mask:255.255.252.0
>> inet6 addr: fe80::21a:a0ff:fe1c:33/64 Scope:Link
>> UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> RX packets:5204606192 errors:0 dropped:0 overruns:0 frame:0
>> TX packets:8935890067 errors:0 dropped:0 overruns:0 carrier:0
>> collisions:0 txqueuelen:1000
>> RX bytes:371146631795 (345.6 GiB)  TX bytes:13424275898600 (12.2 TiB)
>> Interrupt:193 Memory:ec00-ec012100
>>
>>   

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Gus Correa

Hi Tena

Answers inline.

Tena Sakai wrote:

Hi Gus,


Hence, I don't understand why the lack of symmetry in the
firewall protection.
Either vixen's is too loose, or dasher's is too tight, I'd venture to say.
Maybe dasher was installed later, just got whatever boilerplate firewall
that comes with RedHat, CentOS, Fedora.
If there is a gateway for this LAN somewhere with another firewall,
which is probably the case,


You are correct.  We had a system administrator, but we lost
that person and I installed dasher from scratch myself and
I did use the boilerplate firewall from the CentOS 5.5 distribution.



I read your answer to Ashley and Reuti saying that you
turned the firewall off and OpenMPI now works with vixen and dasher.
That's good news!


Do you have Internet access from either machine?


Yes, I do.


The LAN gateway is probably doing NAT.
I would guess it also has its own firewall.
Is there anybody there that could tell you about this?




Vixen has yet another private IP 10.1.1.2 (eth0),
with a bit weird combination of broadcast address 192.168.255.255(?),
mask 255.0.0.0.
vixen is/was part of another group of machines, via this other IP,
cluster perhaps?


We have a Rocks HPC cluster.  The cluster head is called blitzen
and there are 8 nodes in the cluster.  We have completely outgrown
this setting.  For example, I am running an application for last
2 weeks with 4 of 8 nodes and the other 4 nodes have been used
by my colleagues and I expect my jobs to run another 2-3 weeks.
Which is why I am interested in cloud.

Vixen is not part of the Rocks cluster, but it is an nfs server,
as well as database server.  Here's ifconfig of blitzen:

  [tsakai@blitzen Rmpi]$ ifconfig
  eth0  Link encap:Ethernet  HWaddr 00:19:B9:E0:C0:0B
inet addr:10.1.1.1  Bcast:10.255.255.255  Mask:255.0.0.0
inet6 addr: fe80::219:b9ff:fee0:c00b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:58859908 errors:0 dropped:0 overruns:0 frame:0
TX packets:38795319 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:14637456238 (13.6 GiB)  TX bytes:25487423161 (23.7 GiB)
Interrupt:193 Memory:ec00-ec012100
  
  eth1  Link encap:Ethernet  HWaddr 00:19:B9:E0:C0:0D
inet addr:172.16.1.106  Bcast:172.16.3.255  Mask:255.255.252.0
inet6 addr: fe80::219:b9ff:fee0:c00d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:99465693 errors:0 dropped:0 overruns:0 frame:0
TX packets:46026372 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:44685802310 (41.6 GiB)  TX bytes:28223858173 (26.2 GiB)
Interrupt:193 Memory:ea00-ea012100
  
  lo    Link encap:Local Loopback
inet addr:127.0.0.1  Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING  MTU:16436  Metric:1
RX packets:80078179 errors:0 dropped:0 overruns:0 frame:0
TX packets:80078179 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:27450135463 (25.5 GiB)  TX bytes:27450135463 (25.5 GiB)

And here's the same thing of vixen:
[tsakai@vixen Rmpi]$ cat moo
  eth0  Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:31
inet addr:10.1.1.2  Bcast:192.168.255.255  Mask:255.0.0.0
inet6 addr: fe80::21a:a0ff:fe1c:31/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:61942079 errors:0 dropped:0 overruns:0 frame:0
TX packets:61950934 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:47837093368 (44.5 GiB)  TX bytes:54525223424 (50.7 GiB)
Interrupt:185 Memory:ea00-ea012100
  
  eth1  Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:33
inet addr:172.16.1.107  Bcast:172.16.3.255  Mask:255.255.252.0
inet6 addr: fe80::21a:a0ff:fe1c:33/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:5204606192 errors:0 dropped:0 overruns:0 frame:0
TX packets:8935890067 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:371146631795 (345.6 GiB)  TX bytes:13424275898600 (12.2 TiB)
Interrupt:193 Memory:ec00-ec012100
  
  lo    Link encap:Local Loopback
inet addr:127.0.0.1  Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING  MTU:16436  Metric:1
RX packets:244240818 errors:0 dropped:0 overruns:0 frame:0
TX packets:244240818 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1190988294201 (1.0 TiB)  TX bytes:1190988294201 (1.0 TiB)

I think you are also correct as to:


a bit weird combination 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Tena Sakai
Hi Gus,

> Hence, I don't understand why the lack of symmetry in the
> firewall protection.
> Either vixen's is too loose, or dasher's is too tight, I'd venture to say.
> Maybe dasher was installed later, just got whatever boilerplate firewall
> that comes with RedHat, CentOS, Fedora.
> If there is a gateway for this LAN somewhere with another firewall,
> which is probably the case,

You are correct.  We had a system administrator, but we lost
that person and I installed dasher from scratch myself and
I did use the boilerplate firewall from the CentOS 5.5 distribution.

> Do you have Internet access from either machine?

Yes, I do.

> Vixen has yet another private IP 10.1.1.2 (eth0),
> with a bit weird combination of broadcast address 192.168.255.255(?),
> mask 255.0.0.0.
> vixen is/was part of another group of machines, via this other IP,
> cluster perhaps?

We have a Rocks HPC cluster.  The cluster head is called blitzen
and there are 8 nodes in the cluster.  We have completely outgrown
this setting.  For example, I am running an application for last
2 weeks with 4 of 8 nodes and the other 4 nodes have been used
by my colleagues and I expect my jobs to run another 2-3 weeks.
Which is why I am interested in cloud.

Vixen is not part of the Rocks cluster, but it is an nfs server,
as well as database server.  Here's ifconfig of blitzen:

  [tsakai@blitzen Rmpi]$ ifconfig
  eth0  Link encap:Ethernet  HWaddr 00:19:B9:E0:C0:0B
inet addr:10.1.1.1  Bcast:10.255.255.255  Mask:255.0.0.0
inet6 addr: fe80::219:b9ff:fee0:c00b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:58859908 errors:0 dropped:0 overruns:0 frame:0
TX packets:38795319 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:14637456238 (13.6 GiB)  TX bytes:25487423161 (23.7 GiB)
Interrupt:193 Memory:ec00-ec012100

  eth1  Link encap:Ethernet  HWaddr 00:19:B9:E0:C0:0D
inet addr:172.16.1.106  Bcast:172.16.3.255  Mask:255.255.252.0
inet6 addr: fe80::219:b9ff:fee0:c00d/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:99465693 errors:0 dropped:0 overruns:0 frame:0
TX packets:46026372 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:44685802310 (41.6 GiB)  TX bytes:28223858173 (26.2 GiB)
Interrupt:193 Memory:ea00-ea012100

  lo    Link encap:Local Loopback
inet addr:127.0.0.1  Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING  MTU:16436  Metric:1
RX packets:80078179 errors:0 dropped:0 overruns:0 frame:0
TX packets:80078179 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:27450135463 (25.5 GiB)  TX bytes:27450135463 (25.5 GiB)

And here's the same thing of vixen:
[tsakai@vixen Rmpi]$ cat moo
  eth0  Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:31
inet addr:10.1.1.2  Bcast:192.168.255.255  Mask:255.0.0.0
inet6 addr: fe80::21a:a0ff:fe1c:31/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:61942079 errors:0 dropped:0 overruns:0 frame:0
TX packets:61950934 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:47837093368 (44.5 GiB)  TX bytes:54525223424 (50.7 GiB)
Interrupt:185 Memory:ea00-ea012100

  eth1  Link encap:Ethernet  HWaddr 00:1A:A0:1C:00:33
inet addr:172.16.1.107  Bcast:172.16.3.255  Mask:255.255.252.0
inet6 addr: fe80::21a:a0ff:fe1c:33/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:5204606192 errors:0 dropped:0 overruns:0 frame:0
TX packets:8935890067 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:371146631795 (345.6 GiB)  TX bytes:13424275898600 (12.2 TiB)
Interrupt:193 Memory:ec00-ec012100

  lo    Link encap:Local Loopback
inet addr:127.0.0.1  Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING  MTU:16436  Metric:1
RX packets:244240818 errors:0 dropped:0 overruns:0 frame:0
TX packets:244240818 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:1190988294201 (1.0 TiB)  TX bytes:1190988294201 (1.0 TiB)

I think you are also correct as to:

> a bit weird combination of broadcast address 192.168.255.255 (?),
> and mask 255.0.0.0.

I think they are both misconfigured.  I will fix them when I can.
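
The broadcast address follows directly from the address and the mask,
so the mismatch can be checked mechanically.  What follows is only a
minimal Python sketch, not something posted in the thread; the helper
name expected_broadcast is made up for illustration, and the values
are copied from the ifconfig listings above.

```python
import ipaddress


def expected_broadcast(addr: str, mask: str) -> str:
    """Return the broadcast address implied by an IPv4 address and netmask."""
    iface = ipaddress.IPv4Interface(f"{addr}/{mask}")
    return str(iface.network.broadcast_address)


# (host, interface, address, mask, broadcast reported by ifconfig)
REPORTED = [
    ("blitzen", "eth0", "10.1.1.1", "255.0.0.0", "10.255.255.255"),
    ("blitzen", "eth1", "172.16.1.106", "255.255.252.0", "172.16.3.255"),
    ("vixen", "eth0", "10.1.1.2", "255.0.0.0", "192.168.255.255"),
    ("vixen", "eth1", "172.16.1.107", "255.255.252.0", "172.16.3.255"),
]

for host, ifname, addr, mask, reported in REPORTED:
    expected = expected_broadcast(addr, mask)
    status = "ok" if expected == reported else "MISMATCH"
    print(f"{host:8} {ifname}  reported={reported:16} expected={expected:16} {status}")
```

By this calculation only vixen's eth0 stands out: with mask 255.0.0.0
the broadcast would be expected to be 10.255.255.255, as on blitzen's
eth0.  Whether the 255.0.0.0 mask itself is what the site intends is a
separate question that the calculation cannot answer.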

> What is in your ${TORQUE}/server_priv/nodes file?
> IPs or names (vixen & dashen).

We don't use TORQUE.  We do use SGE from blitzen.

> Are they on a DNS server or do you resolve their 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Tena Sakai
Hi Gus,

Thank you very much for your help and reply.

I agree with each and every point you make.  I look forward to
the day I can write a little 'HowTo.'

Regards,

Tena


On 2/14/11 1:47 PM, "Gus Correa"  wrote:

> Hi Tena
>
> Answers inline.
> This is getting big!
>
> Tena Sakai wrote:
>> Hi Gus,
>>
>> Thank you for your reply, comments, and suggestions.
>>
>> EC2 does have support, but it is with extra charge and I am
>> discouraged to use it for budgetary reasons.  Also, I have
>> heard that their support is a bit toward virtualization
>> and amazon environment specific.  I may have to override
>> all these and ask them for help...
>>
>
> Ask them at least for super-saving free shipping.
> I do this all the time.
> The downside is that delivery takes 4-9 business days at least ... :)
>
>> (Incidentally, I really like openmpi mailing list.  The
>>  atmosphere you people generate and sustain is quite wonderful
>
> +1 4 that!
>
> The best friendly+knowledgeable+helpful combination amongst
> all 10+ mailing lists I subscribe to.
> Kudos to Jeff, Ralph, and the other developers for keeping it this way.
>
>>  and I hope one day
>> I can be a contributing member.)
>>
>
> How about your bringing the world of MPI and cloud computing
> into the list with your ongoing postings?
> I think it is a sound contribution.
> When you get it to work, maybe you can write up a little 'HowTo'. :)
>
>> As to your suggestions:
>> 1) This is a good idea.  I will do hostname via mpirun.  Increasing
>>    complexity from the simplest will probably reveal something I
>>    don't know.
>> 2) I have run R serially on EC2 with this very ami.  I have not seen
>>    any problem and many others have done the same.
>>
>> Also, here is an idea I came up with in my sleep that I want to check
>> out.
>
> I hope Amazon EC2 is not making nightmares out of your dreams.
>
>> The ami I have been using is a centos 5.5, which I have built
>> from ground up.  EC2 has something called Amazon Linux ami.  I
>> don't know what distribution that is and I am sure it doesn't have
>> R, nor openmpi.  But I thought I would load these components I
>> need to the Amazon Linux (again as you suggest by starting the
>> simplest case) and see if I can reproduce the behavior I have
>> been experiencing on different (and Amazon "official" ami).
>>
>
> That would be a possible way to go, but given my EC2-blindness,
> I can only wonder whether it has a chance of working.
> I have yet to understand whether you copy your compiled tools
> (OpenMPI, R, etc) from your local machines to EC2,
> or if you build/compile them directly on the EC2 environment.
> Also, it's not clear to me if the OS in EC2 is an image
> from your local machines' OS/Linux distro, or independent of them,
> or if you can choose to have it either way.
>
> In another posting, Ashley Pittman reported
> using OpenMPI in Amazon EC2 without problems,
> suggested a path, and gave several tips for it.
> That is probably a more promising path,
> which you may want to try.
>
>> I will report as I discover interesting/relevant finding.
>>
>> Regards,
>>
>> Tena
>>
>
> Best,
> Gus Correa
>
>>
>> On 2/12/11 9:42 PM, "Gustavo Correa"  wrote:
>>
>>> Hi Tena
>>>
>>> Thank you for taking the time to explain the details of
>>> the EC2 procedure.
>>>
>>> I am afraid everything in my bag of tricks was used.
>>> As Ralph and Jeff suggested, this seems to be a very specific
>>> problem with EC2.
>>>
>>> The difference in behavior when you run as root vs. when you
>>> run as Tena suggests that there is some restriction on regular users
>>> in EC2 that isn't present in common machines (Linux or other), I guess.
>>> This may be yet another  'stone to turn', as you like to say.
>>> It also suggests that there is nothing wrong in principle with your
>>> openMPI setup or with your program, otherwise root would not be able to run
>>> it.
>>>
>>> Besides Ralph's suggestion of trying the EC2 mailing list archive,
>>> I wonder if EC2 has any type of user support where you could ask
>>> for help.
>>> After all, it is a paid service, isn't it?
>>> (OpenMPI is not paid and has a great customer service, doesn't it?  :) )
>>> You have a well documented case to present,
>>> and the very peculiar fact that the program fails for normal users but runs
>>> for root.
>>> This should help the EC2 support to start looking for a solution.
>>>
>>> I am running out of suggestions of what you could try on your own.
>>> But let me try:
>>>
>>> 1) You may try to reduce the problem to its least common denominator,
>>> perhaps by trying to run non-R based MPI programs on EC2, maybe the
>>> hello_c.c, ring_c.c, and connectivity_c.c programs in the OpenMPI
>>> examples directory.
>>> This would be to avoid the extra layer of complexity introduced by R.
>>> Even simpler would be to run 'hostname' with mpiexec (mpiexec -np 2
>>> hostname).
>>> I.e. go in a progression of increasing complexity, see where you 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Tena Sakai
Thank you, Jeff.  That's good to know, but for now
I am going to stick to -app option.  For this old
dog can't learn new tricks so fast.

Regards,

Tena


On 2/14/11 6:18 AM, "Jeff Squyres"  wrote:

> On Feb 13, 2011, at 10:54 PM, Tena Sakai wrote:
> 
>> I have a file app.ac3, which looks like:
>>  [tsakai@vixen Rmpi]$ cat app.ac3
>>  -H dasher.egcrc.org  -np 1 hostname
>>  -H dasher.egcrc.org  -np 1 hostname
>>  -H vixen.egcrc.org   -np 1 hostname
>>  -H vixen.egcrc.org   -np 1 hostname
> 
> Note that you don't *have* to use app files.  For example, the moral
> equivalent of the above is:
> 
> mpirun --host dasher.egcrc.org,vixen.egcrc.org -np 4 hostname
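
Jeff's equivalence can be worked through mechanically: collect the
distinct hosts, add up the -np counts, and keep the common command.
The Python sketch below does only that, for app files laid out exactly
like app.ac3 ("-H host -np n command"); the function name to_host_form
is hypothetical, and this is not how mpirun itself parses app files.

```python
import shlex

# Contents of app.ac3 as shown in the thread.
APP_FILE = """\
-H dasher.egcrc.org  -np 1 hostname
-H dasher.egcrc.org  -np 1 hostname
-H vixen.egcrc.org   -np 1 hostname
-H vixen.egcrc.org   -np 1 hostname
"""


def to_host_form(app_text):
    """Collapse a simple app file into one 'mpirun --host ... -np N cmd' argv."""
    hosts, total, command = [], 0, None
    for line in app_text.splitlines():
        tokens = shlex.split(line)
        if not tokens:
            continue  # skip blank lines
        # Only the "-H <host> -np <n> <command...>" layout is understood here.
        if len(tokens) < 5 or tokens[0] != "-H" or tokens[2] != "-np":
            raise ValueError(f"unsupported app file line: {line!r}")
        host, count, cmd = tokens[1], int(tokens[3]), tokens[4:]
        if command is None:
            command = cmd
        elif cmd != command:
            raise ValueError("lines run different commands; cannot collapse")
        if host not in hosts:
            hosts.append(host)  # keep first-seen order
        total += count
    if command is None:
        raise ValueError("empty app file")
    return ["mpirun", "--host", ",".join(hosts), "-np", str(total)] + command


print(" ".join(to_host_form(APP_FILE)))
```

For app.ac3 this prints the same command line Jeff gave.  How the four
processes are then placed on the two hosts is decided by mpirun's
mapping policy, which this sketch does not model.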




Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Tena Sakai
Hi Ashley,

You nailed it precisely.  I turned dasher's firewall off per
your instruction and the problem went away.

  [tsakai@dasher Rmpi]$  cat app.ac3
  -H dasher.egcrc.org  -np 1 hostname
  -H dasher.egcrc.org  -np 1 hostname
  -H vixen.egcrc.org   -np 1 hostname
  -H vixen.egcrc.org   -np 1 hostname
  [tsakai@dasher Rmpi]$
  [tsakai@dasher Rmpi]$ mpirun -app app.ac3
  dasher.egcrc.org
  dasher.egcrc.org
  vixen.egcrc.org
  vixen.egcrc.org
  [tsakai@dasher Rmpi]$

Incidentally, all machines we have are inside the firewall
and none is publicly reachable without the use of vpn.

Many thanks to you.
Also, thanks to Reuti who raised this possibility in the
first place.

Regards,

Tena

On 2/14/11 1:37 PM, "Ashley Pittman"  wrote:

> 
> On 14 Feb 2011, at 21:10, Tena Sakai wrote:
>> Regarding firewall, they are different:
> 
>> 
>> I don't understand what they mean.
> 
> vixen has a normal, or empty config and as such has no firewall, dasher has a
> number of firewall rules configured which could easily be the cause of the
> problem on these two machines.  To be able to run OpenMPI across these two
> machines you'll need to disable the firewall on dasher.
> 
> To disable the firewall, the command (as root) is "service iptables stop" to
> turn it off until the next boot, or "chkconfig iptables off" to keep it off
> from the next boot onward.  Obviously you should check with your network
> administrator before doing this.
> 
> Ashley.




Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Gus Correa

Hi Tena

Answers inline.
This is getting big!

Tena Sakai wrote:

Hi Gus,

Thank you for your reply, comments, and suggestions.

EC2 does have support, but it is with extra charge and I am
discouraged to use it for budgetary reasons.  Also, I have
heard that their support is a bit toward virtualization
and amazon environment specific.  I may have to override
all these and ask them for help...



Ask them at least for super-saving free shipping.
I do this all the time.
The downside is that delivery takes 4-9 business days at least ... :)


(Incidentally, I really like openmpi mailing list.  The
 atmosphere you people generate and sustain is quite wonderful


+1 4 that!

The best friendly+knowledgeable+helpful combination amongst
all 10+ mailing lists I subscribe to.
Kudos to Jeff, Ralph, and the other developers for keeping it this way.

 and I hope one day 
I can be a contributing member.)




How about your bringing the world of MPI and cloud computing
into the list with your ongoing postings?
I think it is a sound contribution.
When you get it to work, maybe you can write up a little 'HowTo'. :)


As to your suggestions:
1) This is a good idea.  I will do hostname via mpirun.  Increasing
   complexity from the simplest will probably reveal something I
   don't know.
2) I have run R serially on EC2 with this very ami.  I have not seen
   any problem and many others have done the same.

Also, here is an idea I came up with in my sleep that I want to check
out.  


I hope Amazon EC2 is not making nightmares out of your dreams.


The ami I have been using is a centos 5.5, which I have built
from ground up.  EC2 has something called Amazon Linux ami.  I
don't know what distribution that is and I am sure it doesn't have
R, nor openmpi.  But I thought I would load these components I
need to the Amazon Linux (again as you suggest by starting the
simplest case) and see if I can reproduce the behavior I have
been experiencing on different (and Amazon "official" ami).



That would be a possible way to go, but given my EC2-blindness,
I can only wonder whether it has a chance of working.
I have yet to understand whether you copy your compiled tools
(OpenMPI, R, etc) from your local machines to EC2,
or if you build/compile them directly on the EC2 environment.
Also, it's not clear to me if the OS in EC2 is an image
from your local machines' OS/Linux distro, or independent of them,
or if you can choose to have it either way.

In another posting, Ashley Pittman reported
using OpenMPI in Amazon EC2 without problems,
suggested a path, and gave several tips for it.
That is probably a more promising path,
which you may want to try.


I will report as I discover interesting/relevant finding.

Regards,

Tena



Best,
Gus Correa



On 2/12/11 9:42 PM, "Gustavo Correa"  wrote:


Hi Tena

Thank you for taking the time to explain the details of
the EC2 procedure.

I am afraid everything in my bag of tricks was used.
As Ralph and Jeff suggested, this seems to be a very specific
problem with EC2.

The difference in behavior when you run as root vs. when you
run as Tena suggests that there is some restriction on regular users
in EC2 that isn't present in common machines (Linux or other), I guess.
This may be yet another  'stone to turn', as you like to say.
It also suggests that there is nothing wrong in principle with your
openMPI setup or with your program, otherwise root would not be able to run
it.

Besides Ralph's suggestion of trying the EC2 mailing list archive,
I wonder if EC2 has any type of user support where you could ask
for help.
After all, it is a paid service, isn't it?
(OpenMPI is not paid and has a great customer service, doesn't it?  :) )
You have a well documented case to present,
and the very peculiar fact that the program fails for normal users but runs
for root.
This should help the EC2 support to start looking for a solution.

I am running out of suggestions of what you could try on your own.
But let me try:

1) You may try to reduce the problem to its least common denominator,
perhaps by trying to run non-R based MPI programs on EC2, maybe the hello_c.c,
ring_c.c, and connectivity_c.c programs in the OpenMPI examples directory.
This would be to avoid the extra layer of complexity introduced by R.
Even simpler would be to run 'hostname' with mpiexec (mpiexec -np 2 hostname).
I.e. go in a progression of increasing complexity, see where you hit the wall.
This may shed some light on what is going on.

I don't know if this suggestion may  really help, though.
It is not clear to me where the thing fails, whether it is during program
execution,
or while mpiexec is setting up the environment for the program to run.
If it is very early in the process, before the program starts, my suggestion
won't work.
Jeff and Ralph, who know OpenMPI inside out, may have better advice in this
regard.

2) Another thing would be to try to run R on EC2 in serial mode, without
mpiexec,
interactively or via 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Ashley Pittman

On 14 Feb 2011, at 21:10, Tena Sakai wrote:
> Regarding firewall, they are different:

> 
> I don't understand what they mean.

vixen has a normal, or empty config and as such has no firewall, dasher has a 
number of firewall rules configured which could easily be the cause of the 
problem on these two machines.  To be able to run OpenMPI across these two 
machines you'll need to disable the firewall on dasher.
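
To make the effect of dasher's rules concrete, here is a rough Python
sketch that walks the RH-Firewall-1-INPUT chain in order and reports
the first matching target.  It is a deliberately simplified model, not
a real iptables evaluator: it only understands the options that appear
in dasher's rule set, and port 40000 is just a hypothetical stand-in
for whatever dynamically assigned TCP port a process on vixen might
try to reach on dasher.

```python
import shlex

# The RH-Firewall-1-INPUT rules from dasher's /etc/sysconfig/iptables.
RULES = """\
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
-A RH-Firewall-1-INPUT -p icmp --icmp-type any -j ACCEPT
-A RH-Firewall-1-INPUT -p 50 -j ACCEPT
-A RH-Firewall-1-INPUT -p 51 -j ACCEPT
-A RH-Firewall-1-INPUT -p udp --dport 5353 -d 224.0.0.251 -j ACCEPT
-A RH-Firewall-1-INPUT -p udp -m udp --dport 631 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m tcp --dport 631 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
"""

KEPT = {"-A", "-i", "-p", "-d", "-j", "--dport", "--state"}  # options we evaluate
IGNORED = {"-m", "--icmp-type", "--reject-with"}             # skipped with their value


def parse_rule(line):
    """Turn one '-A ...' line into a dict of the options this model cares about."""
    tokens = shlex.split(line)
    rule, i = {}, 0
    while i < len(tokens):
        if tokens[i] in KEPT:
            rule[tokens[i]] = tokens[i + 1]
            i += 2
        elif tokens[i] in IGNORED:
            i += 2
        else:
            i += 1
    return rule


def verdict(iface, proto, dport, state, dst="dasher"):
    """Return the target of the first rule matching the described packet."""
    for line in RULES.splitlines():
        r = parse_rule(line)
        if "-i" in r and r["-i"] != iface:
            continue
        if "-p" in r and r["-p"] != proto:
            continue
        if "-d" in r and r["-d"] != dst:
            continue
        if "--dport" in r and int(r["--dport"]) != dport:
            continue
        if "--state" in r and state not in r["--state"].split(","):
            continue
        return r["-j"]
    return "ACCEPT"  # chain policy if nothing matched


print("new ssh connection      :", verdict("eth1", "tcp", 22, "NEW"))
print("new connection to 40000 :", verdict("eth1", "tcp", 40000, "NEW"))
print("reply on open connection:", verdict("eth1", "tcp", 40000, "ESTABLISHED"))
```

In this model a new inbound connection to any TCP port other than 22,
80 or 631 falls through to the final REJECT, while replies on
connections that dasher itself opened are accepted.  That would be
consistent with ssh from vixen working while processes started on
vixen cannot connect back to an mpirun running on dasher.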

To disable the firewall, the command (as root) is "service iptables stop" to
turn it off until the next boot, or "chkconfig iptables off" to keep it off
from the next boot onward.  Obviously you should check with your network
administrator before doing this.

Ashley.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Kevin . Buckley

This probably shows my lack of understanding as to how OpenMPI
negotiates the connectivity between nodes when given a choice
of interfaces but anyway:

 does dasher have any network interfaces that vixen does not?

The scenario I am imagining would be that you ssh into dasher
from vixen using a "network" that both share and similarly, when
you mpirun from vixen, the network that OpenMPI uses is constrained
by the interfaces that can be seen from vixen, so you are fine.

However when you are on dasher, mpirun sees another interface which
it takes a liking to and so tries to use that, but that interface
is not available to vixen so the OpenMPI processes spawned there
terminate when they can't find that interface so as to talk back
to dasher's controlling process.

I know that you are no longer working with VMs but it's along those
lines that I was thinking: extra network interfaces that you assume
won't be used but which are and which could then be overcome by use
of an explicit

 --mca btl_tcp_if_exclude virbr0

or some such construction (virbr0 used as an example here).
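
The comparison I have in mind could be sketched in Python as below.
The interface lists are purely hypothetical (dasher's interfaces have
not been posted, and virbr0 is again only an example), and the helper
name exclude_arg is made up.  As far as I understand it, setting
btl_tcp_if_exclude replaces the default exclusion list, so the sketch
keeps the loopback interface in the list as well.

```python
def exclude_arg(local_ifaces, remote_ifaces, loopback="lo"):
    """Build an mpirun argument excluding interfaces only the local host has."""
    only_local = sorted(set(local_ifaces) - set(remote_ifaces) - {loopback})
    if not only_local:
        return []  # nothing extra to exclude
    return ["--mca", "btl_tcp_if_exclude", ",".join([loopback] + only_local)]


# Hypothetical example: dasher has an extra bridge interface that vixen lacks.
dasher_ifaces = ["lo", "eth0", "eth1", "virbr0"]
vixen_ifaces = ["lo", "eth0", "eth1"]

print(" ".join(exclude_arg(dasher_ifaces, vixen_ifaces)))
```

The resulting words would be added to the mpirun command line on the
host that has the extra interface.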

Kevin

-- 
Kevin M. Buckley  Room:  CO327
School of Engineering and Phone: +64 4 463 5971
 Computer Science
Victoria University of Wellington
New Zealand



Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Tena Sakai
Hi Gus,

Thank you for your response.

I have verified that
 1) /etc/hosts files on both machines vixen and dasher are identical
 2) both machines have nothing but comments in hosts.allow and hosts.deny
Regarding firewall, they are different:
On vixen this is how it looks:
  [root@vixen ec2]# cat /etc/sysconfig/iptables
  cat: /etc/sysconfig/iptables: No such file or directory
  [root@vixen ec2]#
  [root@vixen ec2]# /sbin/iptables --list
  Chain INPUT (policy ACCEPT)
  target     prot opt source               destination

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination
  [root@vixen ec2]#

On dasher:
  [tsakai@dasher Rmpi]$ sudo cat /etc/sysconfig/iptables
  # Firewall configuration written by system-config-securitylevel
  # Manual customization of this file is not recommended.
  *filter
  :INPUT ACCEPT [0:0]
  :FORWARD ACCEPT [0:0]
  :OUTPUT ACCEPT [0:0]
  :RH-Firewall-1-INPUT - [0:0]
  -A INPUT -j RH-Firewall-1-INPUT
  -A FORWARD -j RH-Firewall-1-INPUT
  -A RH-Firewall-1-INPUT -i lo -j ACCEPT
  -A RH-Firewall-1-INPUT -p icmp --icmp-type any -j ACCEPT
  -A RH-Firewall-1-INPUT -p 50 -j ACCEPT
  -A RH-Firewall-1-INPUT -p 51 -j ACCEPT
  -A RH-Firewall-1-INPUT -p udp --dport 5353 -d 224.0.0.251 -j ACCEPT
  -A RH-Firewall-1-INPUT -p udp -m udp --dport 631 -j ACCEPT
  -A RH-Firewall-1-INPUT -p tcp -m tcp --dport 631 -j ACCEPT
  -A RH-Firewall-1-INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
  -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
  -A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT
  -A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
  COMMIT
  [tsakai@dasher Rmpi]$
  [tsakai@dasher Rmpi]$ sudo /sbin/iptables --list
  [sudo] password for tsakai:
  Chain INPUT (policy ACCEPT)
  target     prot opt source               destination
  RH-Firewall-1-INPUT  all  --  anywhere             anywhere

  Chain FORWARD (policy ACCEPT)
  target     prot opt source               destination
  RH-Firewall-1-INPUT  all  --  anywhere             anywhere

  Chain OUTPUT (policy ACCEPT)
  target     prot opt source               destination

  Chain RH-Firewall-1-INPUT (2 references)
  target     prot opt source               destination
  ACCEPT     all  --  anywhere             anywhere
  ACCEPT     icmp --  anywhere             anywhere            icmp any
  ACCEPT     esp  --  anywhere             anywhere
  ACCEPT     ah   --  anywhere             anywhere
  ACCEPT     udp  --  anywhere             224.0.0.251         udp dpt:mdns
  ACCEPT     udp  --  anywhere             anywhere            udp dpt:ipp
  ACCEPT     tcp  --  anywhere             anywhere            tcp dpt:ipp
  ACCEPT     all  --  anywhere             anywhere            state RELATED,ESTABLISHED
  ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:ssh
  ACCEPT     tcp  --  anywhere             anywhere            state NEW tcp dpt:http
  REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited
  [tsakai@dasher Rmpi]$

I don't understand what they mean.  Can you see any clue as to
why vixen can and dasher cannot run mpirun with the app file:
  -H dasher.egcrc.org  -np 1 hostname
  -H dasher.egcrc.org  -np 1 hostname
  -H vixen.egcrc.org   -np 1 hostname
  -H vixen.egcrc.org   -np 1 hostname

Many thanks.

Tena

On 2/14/11 11:15 AM, "Gus Correa"  wrote:

> Tena Sakai wrote:
>> Hi Reuti,
>>
>>> a) can you ssh from dasher to vixen?
>> Yes, no problem.
>>   [tsakai@dasher Rmpi]$
>>   [tsakai@dasher Rmpi]$ hostname
>>   dasher.egcrc.org
>>   [tsakai@dasher Rmpi]$
>>   [tsakai@dasher Rmpi]$ ssh vixen
>>   Last login: Mon Feb 14 10:39:20 2011 from dasher.egcrc.org
>>   [tsakai@vixen ~]$
>>   [tsakai@vixen ~]$ hostname
>>   vixen.egcrc.org
>>   [tsakai@vixen ~]$
>>
>>> b) firewall on vixen?
>> There is no firewall on vixen that I know of, but I don't
>> know how I can definitively show it one way or the other.
>> Can you please suggest how I can do this?
>>
>> Regards,
>>
>> Tena
>>
>>
>
> Hi Tena
>
> Besides Reuti's suggestions:
>
> Check the consistency of /etc/hosts on both machines.
> Check if there are restrictions in /etc/hosts.allow and
> /etc/hosts.deny on both machines.
> Check if both the MPI directories and your home/work directory
> are mounted/available on both machines.
> (We may have been through this checklist before, sorry if I forgot.)
>
> Firewall info (not very friendly syntax ...):
>
> iptables --list
>
> or maybe better:
>
> cat /etc/sysconfig/iptables
>
> I hope it helps,
> Gus Correa
>
>> On 2/14/11 4:38 AM, "Reuti"  wrote:
>>
>>> Hi,
>>>
>>> Am 14.02.2011 um 04:54 schrieb Tena Sakai:
>>>
>>>> I have digressed and started downward descent...
>>>>
>>>> I was trying to make a simple and clear case.  Everything
>>>> I

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Gus Correa

Tena Sakai wrote:

Hi Reuti,


a) can you ssh from dasher to vixen?

Yes, no problem.
  [tsakai@dasher Rmpi]$
  [tsakai@dasher Rmpi]$ hostname
  dasher.egcrc.org
  [tsakai@dasher Rmpi]$
  [tsakai@dasher Rmpi]$ ssh vixen
  Last login: Mon Feb 14 10:39:20 2011 from dasher.egcrc.org
  [tsakai@vixen ~]$
  [tsakai@vixen ~]$ hostname
  vixen.egcrc.org
  [tsakai@vixen ~]$


b) firewall on vixen?

There is no firewall on vixen that I know of, but I don't
know how I can definitively show it one way or the other.
Can you please suggest how I can do this?

Regards,

Tena




Hi Tena

Besides Reuti's suggestions:

Check the consistency of /etc/hosts on both machines.
Check if there are restrictions in /etc/hosts.allow and
/etc/hosts.deny on both machines.
Check if both the MPI directories and your home/work directory
are mounted/available on both machines.
(We may have been through this checklist before, sorry if I forgot.)

Firewall info (not very friendly syntax ...):

iptables --list

or maybe better:

cat /etc/sysconfig/iptables
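
As a rough aid for reading that file, here is a minimal sketch that lists
the TCP ports explicitly accepted in it.  The helper name accepted_tcp_ports
is made up for illustration, and it assumes the iptables-save style format
that RedHat/CentOS use for /etc/sysconfig/iptables.  If the chain ends in a
catch-all REJECT rule, new connections to any port not listed would be
refused, which could matter for MPI daemons that connect back on other ports.

```shell
# Hypothetical helper: print the TCP destination ports that a saved
# iptables rule file explicitly accepts, one per line.
accepted_tcp_ports() {
    # $1: path to a rule file in iptables-save format
    sed -n 's/.*-p tcp.*--dport \([0-9:]*\).*-j ACCEPT.*/\1/p' "$1"
}
```

Usage would be something like "accepted_tcp_ports /etc/sysconfig/iptables"
(reading that file may need root).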

I hope it helps,
Gus Correa


On 2/14/11 4:38 AM, "Reuti"  wrote:


Hi,

Am 14.02.2011 um 04:54 schrieb Tena Sakai:


I have digressed and started downward descent...

I was trying to make a simple and clear case.  Everything
I write in this very mail is about local machines.  There
is no virtual machines involved.  I am talking about two
machines, vixen and dasher, which share the same file
structure.  Vixen is a nfs server and dasher is an nfs
client.  I have just installed openmpi 1.4.3 on dasher,
which is the same version I have on vixen.

I have a file app.ac3, which looks like:
 [tsakai@vixen Rmpi]$ cat app.ac3
 -H dasher.egcrc.org  -np 1 hostname
 -H dasher.egcrc.org  -np 1 hostname
 -H vixen.egcrc.org   -np 1 hostname
 -H vixen.egcrc.org   -np 1 hostname
 [tsakai@vixen Rmpi]$

Vixen can run this without any problem:
 [tsakai@vixen Rmpi]$ mpirun -app app.ac3
 vixen.egcrc.org
 vixen.egcrc.org
 dasher.egcrc.org
 dasher.egcrc.org
 [tsakai@vixen Rmpi]$

But I can't run this very command from dasher:
 [tsakai@vixen Rmpi]$
 [tsakai@vixen Rmpi]$ ssh dasher
 Last login: Sun Feb 13 19:26:57 2011 from vixen.egcrc.org
 [tsakai@dasher ~]$
 [tsakai@dasher ~]$ cd Notes/R/parallel/Rmpi/
 [tsakai@dasher Rmpi]$
 [tsakai@dasher Rmpi]$ mpirun -app app.ac3
 mpirun: killing job...

a) can you ssh from dasher to vixen?

b) firewall on vixen?

-- Reuti



 --
 mpirun noticed that the job aborted, but has no info as to the process
 that caused that situation.
 --
 --
 mpirun was unable to cleanly terminate the daemons on the nodes shown
 below. Additional manual cleanup may be required - please refer to
 the "orte-clean" tool for assistance.
 --
   vixen.egcrc.org - daemon did not report back when launched
 [tsakai@dasher Rmpi]$

After I issued the mpirun command, it hung and I had to Control-C out
of it, at which point it generated all the lines from " mpirun: killing job..."
and below.

A strange thing is that dasher has no problem executing the same
thing via ssh:
 [tsakai@dasher Rmpi]$ ssh vixen.egcrc.org hostname
 vixen.egcrc.org
 [tsakai@dasher Rmpi]$

In fact, dasher can run it via mpirun so long as no foreign machine
is present in the app file.  Ie.,
 [tsakai@dasher Rmpi]$ cat app.ac4
 -H dasher.egcrc.org  -np 1 hostname
 -H dasher.egcrc.org  -np 1 hostname
 # -H vixen.egcrc.org   -np 1 hostname
 # -H vixen.egcrc.org   -np 1 hostname
 [tsakai@dasher Rmpi]$
 [tsakai@dasher Rmpi]$ mpirun -app app.ac4
 dasher.egcrc.org
 dasher.egcrc.org
 [tsakai@dasher Rmpi]$

Can you please tell me why I can go one way (from vixen to dasher)
and not the other way (dasher to vixen)?

Thank you.

Tena


On 2/12/11 9:42 PM, "Gustavo Correa"  wrote:


Hi Tena

Thank you for taking the time to explain the details of
the EC2 procedure.

I am afraid everything in my bag of tricks was used.
As Ralph and Jeff suggested, this seems to be a very specific
problem with EC2.

The difference in behavior when you run as root vs. when you
run as Tena, tells that there is some use restriction to regular users
in EC2 that isn't present in common machines (Linux or other), I guess.
This may be yet another  'stone to turn', as you like to say.
It also suggests that there is nothing wrong in principle with your
openMPI setup or with your program, otherwise root would not be able to run
it.

Besides Ralph's suggestion of trying the EC2 mailing list archive,
I wonder if EC2 has any type of user support where you could ask
for help.
After all, it is a paid service, isn't it?
(OpenMPI is not paid and has a great customer service, doesn't it?  :) )
You have a well documented case to present,
and the very peculiar fact 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Tena Sakai
Hi Reuti,

> a) can you ssh from dasher to vixen?
Yes, no problem.
  [tsakai@dasher Rmpi]$
  [tsakai@dasher Rmpi]$ hostname
  dasher.egcrc.org
  [tsakai@dasher Rmpi]$
  [tsakai@dasher Rmpi]$ ssh vixen
  Last login: Mon Feb 14 10:39:20 2011 from dasher.egcrc.org
  [tsakai@vixen ~]$
  [tsakai@vixen ~]$ hostname
  vixen.egcrc.org
  [tsakai@vixen ~]$

> b) firewall on vixen?
There is no firewall on vixen that I know of, but I don't
know how I can definitively show it one way or the other.
Can you please suggest how I can do this?

Regards,

Tena


On 2/14/11 4:38 AM, "Reuti"  wrote:

> Hi,
>
> Am 14.02.2011 um 04:54 schrieb Tena Sakai:
>
>> I have digressed and started downward descent...
>>
>> I was trying to make a simple and clear case.  Everything
>> I write in this very mail is about local machines.  There
>> is no virtual machines involved.  I am talking about two
>> machines, vixen and dasher, which share the same file
>> structure.  Vixen is a nfs server and dasher is an nfs
>> client.  I have just installed openmpi 1.4.3 on dasher,
>> which is the same version I have on vixen.
>>
>> I have a file app.ac3, which looks like:
>>  [tsakai@vixen Rmpi]$ cat app.ac3
>>  -H dasher.egcrc.org  -np 1 hostname
>>  -H dasher.egcrc.org  -np 1 hostname
>>  -H vixen.egcrc.org   -np 1 hostname
>>  -H vixen.egcrc.org   -np 1 hostname
>>  [tsakai@vixen Rmpi]$
>>
>> Vixen can run this without any problem:
>>  [tsakai@vixen Rmpi]$ mpirun -app app.ac3
>>  vixen.egcrc.org
>>  vixen.egcrc.org
>>  dasher.egcrc.org
>>  dasher.egcrc.org
>>  [tsakai@vixen Rmpi]$
>>
>> But I can't run this very command from dasher:
>>  [tsakai@vixen Rmpi]$
>>  [tsakai@vixen Rmpi]$ ssh dasher
>>  Last login: Sun Feb 13 19:26:57 2011 from vixen.egcrc.org
>>  [tsakai@dasher ~]$
>>  [tsakai@dasher ~]$ cd Notes/R/parallel/Rmpi/
>>  [tsakai@dasher Rmpi]$
>>  [tsakai@dasher Rmpi]$ mpirun -app app.ac3
>>  mpirun: killing job...
>
> a) can you ssh from dasher to vixen?
>
> b) firewall on vixen?
>
> -- Reuti
>
>
>>  --
>>  mpirun noticed that the job aborted, but has no info as to the process
>>  that caused that situation.
>>  --
>>  --
>>  mpirun was unable to cleanly terminate the daemons on the nodes shown
>>  below. Additional manual cleanup may be required - please refer to
>>  the "orte-clean" tool for assistance.
>>  --
>>    vixen.egcrc.org - daemon did not report back when launched
>>  [tsakai@dasher Rmpi]$
>>
>> After I issued the mpirun command, it hung and I had to Control-C out
>> of it, at which point it generated all the lines from " mpirun: killing job..."
>> and below.
>>
>> A strange thing is that dasher has no problem executing the same
>> thing via ssh:
>>  [tsakai@dasher Rmpi]$ ssh vixen.egcrc.org hostname
>>  vixen.egcrc.org
>>  [tsakai@dasher Rmpi]$
>>
>> In fact, dasher can run it via mpirun so long as no foreign machine
>> is present in the app file.  Ie.,
>>  [tsakai@dasher Rmpi]$ cat app.ac4
>>  -H dasher.egcrc.org  -np 1 hostname
>>  -H dasher.egcrc.org  -np 1 hostname
>>  # -H vixen.egcrc.org   -np 1 hostname
>>  # -H vixen.egcrc.org   -np 1 hostname
>>  [tsakai@dasher Rmpi]$
>>  [tsakai@dasher Rmpi]$ mpirun -app app.ac4
>>  dasher.egcrc.org
>>  dasher.egcrc.org
>>  [tsakai@dasher Rmpi]$
>>
>> Can you please tell me why I can go one way (from vixen to dasher)
>> and not the other way (dasher to vixen)?
>>
>> Thank you.
>>
>> Tena
>>
>>
>> On 2/12/11 9:42 PM, "Gustavo Correa"  wrote:
>>
>>> Hi Tena
>>>
>>> Thank you for taking the time to explain the details of
>>> the EC2 procedure.
>>>
>>> I am afraid everything in my bag of tricks was used.
>>> As Ralph and Jeff suggested, this seems to be a very specific
>>> problem with EC2.
>>>
>>> The difference in behavior when you run as root vs. when you
>>> run as Tena, tells that there is some use restriction to regular users
>>> in EC2 that isn't present in common machines (Linux or other), I guess.
>>> This may be yet another  'stone to turn', as you like to say.
>>> It also suggests that there is nothing wrong in principle with your
>>> openMPI setup or with your program, otherwise root would not be able to run
>>> it.
>>>
>>> Besides Ralph's suggestion of trying the EC2 mailing list archive,
>>> I wonder if EC2 has any type of user support where you could ask
>>> for help.
>>> After all, it is a paid service, isn't it?
>>> (OpenMPI is not paid and has a great customer service, doesn't it?  :) )
>>> You have a well documented case to present,
>>> and the very peculiar fact that the program fails for normal users but runs
>>> for root.
>>> This should help the EC2 support to start looking for a solution.
>>>
>>> I am running out of suggestions of what you 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Tena Sakai
Thank you, Ashley, for clarification between sudo and su.
I live in a sphere of ignorance, but I feel I am slightly
enlightened.

Regards,

Tena


On 2/14/11 1:39 AM, "Ashley Pittman"  wrote:

> 
> "sudo" and "su" are two similar commands for doing nearly identical things,
> you should be running one or the other but there is no need to run both.
> "sudo -s" is probably the command you should have used.  It's a very common
> mistake.
> 
> sudo is a command for allowing you to run commands as another user, either
> using your own or no password.  su is a command to allow you to run commands
> as another user using their password.  What sudo su is doing is running a
> command as root which is then running a shell as root, "sudo -s" is a much
> better way of achieving the same effect.
> 
> Ashley.
> 
> On 13 Feb 2011, at 22:16, Tena Sakai wrote:
> 
>> Thank you, Ashley, for your comments.
>> 
>> I do have a question.
>> I was using 'sudo su' to document the problem I am running
>> into for people who read this mailing list, as well as for
>> my own record.  Why would you say I shouldn't be doing so?
>> 
>> Regards,
>> 
>> Tena
>> 
>> 
>> On 2/13/11 1:29 PM, "Ashley Pittman"  wrote:
>> 
>>> On 12 Feb 2011, at 14:06, Ralph Castain wrote:
>>> 
>>>> Have you searched the email archive and/or web for openmpi and Amazon cloud?
>>>> Others have previously worked through many of these problems for that
>>>> environment - might be worth a look to see if someone already solved this, or
>>>> at least a contact point for someone who is already running in that
>>>> environment.
>>> 
>>> I've run Open MPI on Amazon ec2 for over a year and never experienced any
>>> problems like the original poster describes.
>>> 
>>>> IIRC, there are some unique problems with running on that platform.
>>> 
>>> 
>>> None that I'm aware of.
>>> 
>>> EC2 really is no different from any other environment I've used, either real
>>> or virtual, a simple download, ./configure, make and make install has always
>>> resulted in a working OpenMPI assuming a shared install location and home
>>> directory (for launching applications from).
>>> 
>>> When I'm using EC2 I tend to re-name machines into something that is easier to
>>> follow, typically "cloud[0-15].ec2" assuming I am running 16 machines, I
>>> change the hostname of each host and then write a /etc/hosts file to convert
>>> from hostname to internal IP address.  I then export /home from cloud0.ec2 to
>>> all the other nodes and configure OpenMPI with --prefix=/home/ashley/install
>>> so that the code is installed everywhere.
>>> 
>>> For EC2 Instances I commonly use Fedora but have also used Ubuntu and Solaris,
>>> all have been fundamentally similar.
>>> 
>>> My other tip for using EC2 would be to use a persistent "home" folder by
>>> renting a disk partition and attaching it to the first instance you boot in a
>>> session.  You pay for this by Gb/Month, I was able to use a 5Gb device which I
>>> mounted at /home in cloud0.ec2 and NFS exported to the other instances, again
>>> at /home.  You'll need to add "ForwardAgent yes" to your personal .ssh/config
>>> to allow you to hop around inside the virtual cluster without entering a
>>> password.  The persistent devices are called "Volumes" in EC2 speak, there is
>>> no need to create snapshots unless you want to share your volume with other
>>> people.
>>> 
>>> Ashley.
>>> 
>>> Ps, I would recommend reading up on sudo and su, "sudo su" is not a command
>>> you should be typing.
>> 
>> 
>> ___
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users




Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Jeff Squyres
On Feb 13, 2011, at 10:54 PM, Tena Sakai wrote:

> I have a file app.ac3, which looks like:
>  [tsakai@vixen Rmpi]$ cat app.ac3
>  -H dasher.egcrc.org  -np 1 hostname
>  -H dasher.egcrc.org  -np 1 hostname
>  -H vixen.egcrc.org   -np 1 hostname
>  -H vixen.egcrc.org   -np 1 hostname

Note that you don't *have* to use app files.  For example, the moral equivalent 
of the above is:

mpirun --host dasher.egcrc.org,vixen.egcrc.org -np 4 hostname
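
If you want to go from an existing app file to that form, a small sketch
along these lines could build the host list.  The function name
hosts_from_appfile is hypothetical, and it assumes each line names its host
with -H as in the file quoted above; lines starting with # are skipped.

```shell
# Hypothetical helper: collect the unique hosts named with -H in an
# app file and print them as a comma-separated list.
hosts_from_appfile() {
    # $1: path to the app file
    awk '!/^[[:space:]]*#/ {
             for (i = 1; i < NF; i++)
                 if ($i == "-H") print $(i + 1)
         }' "$1" | sort -u | paste -sd, -
}
```

The result could then be passed along as
mpirun --host "$(hosts_from_appfile app.ac3)" -np 4 hostname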

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Jeff Squyres
On Feb 13, 2011, at 2:37 PM, Tena Sakai wrote:

> Also, here is an idea I came up with in my sleep that I want to check
> out.  The ami I have been using is a centos 5.5, which I have built
> from ground up.  EC2 has something called Amazon Linux ami.  I
> don't know what distribution that is and I am sure it doesn't have
> R, nor openmpi.  But I thought I would load these components I
> need to the Amazon Linux (again as you suggest by starting the
> simplest case) and see if I can reproduce the behavior I have
> been experiencing on different (and Amazon "official" ami).

This is most telling to me (that you have a custom-built Linux).  Now that I'm 
back at a proper keyboard, I checked why pipe(2) would fail, and it only has 3 
reasons in both Linux and OS X:

- pointer is invalid (which is not the case here)
- process' file descriptor table is full
- kernel's file descriptor table is full

It would be quite surprising to run into either of the last 2 cases in a stock 
Linux kernel build.

One further thought -- ensure that SELinux is disabled (all the extra security 
stuff).  I'm guessing that Open MPI *can* run with SELinux if SELinux is 
configured in a specific way, but I have no direct experience with that.
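
A quick way to look at the last two cases and at SELinux might be something
like the sketch below.  The function name fd_report is made up, and it
assumes a Linux system that exposes /proc/sys/fs/file-nr; the SELinux line
only appears if the getenforce utility is installed.

```shell
# Hypothetical helper: report file descriptor limits and SELinux mode.
fd_report() {
    # Per-process limit on open file descriptors for this shell.
    echo "per-process fd limit: $(ulimit -n)"

    # Kernel-wide counters: allocated handles, unused handles, maximum.
    if [ -r /proc/sys/fs/file-nr ]; then
        read -r allocated unused maximum < /proc/sys/fs/file-nr
        echo "kernel fd handles: $allocated allocated, $maximum max"
    fi

    # SELinux mode (Enforcing, Permissive or Disabled), if available.
    if command -v getenforce >/dev/null 2>&1; then
        echo "SELinux mode: $(getenforce)"
    fi
}
```

Running fd_report as root and again as the regular user on the same
instance would show whether the two accounts see different limits.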

-- 
Jeff Squyres
jsquy...@cisco.com




Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Reuti
Hi,

Am 14.02.2011 um 04:54 schrieb Tena Sakai:

> I have digressed and started downward descent...
> 
> I was trying to make a simple and clear case.  Everything
> I write in this very mail is about local machines.  There
> is no virtual machines involved.  I am talking about two
> machines, vixen and dasher, which share the same file
> structure.  Vixen is a nfs server and dasher is an nfs
> client.  I have just installed openmpi 1.4.3 on dasher,
> which is the same version I have on vixen.
> 
> I have a file app.ac3, which looks like:
>  [tsakai@vixen Rmpi]$ cat app.ac3
>  -H dasher.egcrc.org  -np 1 hostname
>  -H dasher.egcrc.org  -np 1 hostname
>  -H vixen.egcrc.org   -np 1 hostname
>  -H vixen.egcrc.org   -np 1 hostname
>  [tsakai@vixen Rmpi]$
> 
> Vixen can run this without any problem:
>  [tsakai@vixen Rmpi]$ mpirun -app app.ac3
>  vixen.egcrc.org
>  vixen.egcrc.org
>  dasher.egcrc.org
>  dasher.egcrc.org
>  [tsakai@vixen Rmpi]$
> 
> But I can't run this very command from dasher:
>  [tsakai@vixen Rmpi]$
>  [tsakai@vixen Rmpi]$ ssh dasher
>  Last login: Sun Feb 13 19:26:57 2011 from vixen.egcrc.org
>  [tsakai@dasher ~]$
>  [tsakai@dasher ~]$ cd Notes/R/parallel/Rmpi/
>  [tsakai@dasher Rmpi]$
>  [tsakai@dasher Rmpi]$ mpirun -app app.ac3
>  mpirun: killing job...

a) can you ssh from dasher to vixen?

b) firewall on vixen?

-- Reuti


>  --
>  mpirun noticed that the job aborted, but has no info as to the process
>  that caused that situation.
>  --
>  --
>  mpirun was unable to cleanly terminate the daemons on the nodes shown
>  below. Additional manual cleanup may be required - please refer to
>  the "orte-clean" tool for assistance.
>  --
>    vixen.egcrc.org - daemon did not report back when launched
>  [tsakai@dasher Rmpi]$
> 
> After I issued the mpirun command, it hung and I had to Control-C out
> of it, at which point it generated all the lines from " mpirun: killing job..."
> and below.
> 
> A strange thing is that dasher has no problem executing the same
> thing via ssh:
>  [tsakai@dasher Rmpi]$ ssh vixen.egcrc.org hostname
>  vixen.egcrc.org
>  [tsakai@dasher Rmpi]$
> 
> In fact, dasher can run it via mpirun so long as no foreign machine
> is present in the app file.  Ie.,
>  [tsakai@dasher Rmpi]$ cat app.ac4
>  -H dasher.egcrc.org  -np 1 hostname
>  -H dasher.egcrc.org  -np 1 hostname
>  # -H vixen.egcrc.org   -np 1 hostname
>  # -H vixen.egcrc.org   -np 1 hostname
>  [tsakai@dasher Rmpi]$
>  [tsakai@dasher Rmpi]$ mpirun -app app.ac4
>  dasher.egcrc.org
>  dasher.egcrc.org
>  [tsakai@dasher Rmpi]$
> 
> Can you please tell me why I can go one way (from vixen to dasher)
> and not the other way (dasher to vixen)?
> 
> Thank you.
> 
> Tena
> 
> 
> On 2/12/11 9:42 PM, "Gustavo Correa"  wrote:
> 
>> Hi Tena
>> 
>> Thank you for taking the time to explain the details of
>> the EC2 procedure.
>> 
>> I am afraid everything in my bag of tricks was used.
>> As Ralph and Jeff suggested, this seems to be a very specific
>> problem with EC2.
>> 
>> The difference in behavior when you run as root vs. when you
>> run as Tena, tells that there is some use restriction to regular users
>> in EC2 that isn't present in common machines (Linux or other), I guess.
>> This may be yet another  'stone to turn', as you like to say.
>> It also suggests that there is nothing wrong in principle with your
>> openMPI setup or with your program, otherwise root would not be able to run
>> it.
>> 
>> Besides Ralph's suggestion of trying the EC2 mailing list archive,
>> I wonder if EC2 has any type of user support where you could ask
>> for help.
>> After all, it is a paid service, isn't it?
>> (OpenMPI is not paid and has a great customer service, doesn't it?  :) )
>> You have a well documented case to present,
>> and the very peculiar fact that the program fails for normal users but runs
>> for root.
>> This should help the EC2 support to start looking for a solution.
>> 
>> I am running out of suggestions of what you could try on your own.
>> But let me try:
>> 
>> 1) You may try to reduce the problem to its lowest common denominator,
>> perhaps by trying to run non-R based MPI programs on EC2, maybe the hello_c.c,
>> ring_c.c, and connectivity_c.c programs in the OpenMPI examples directory.
>> This would be to avoid the extra layer of complexity introduced by R.
>> Even simpler would be to run 'hostname' with mpiexec (mpiexec -np 2 hostname).
>> I.e. go in a progression of increasing complexity, see where you hit the wall.
>> This may shed some light on what is going on.
>> 
>> I don't know if this suggestion may  really help, though.
>> It is not clear to me where the thing fails, whether it is during program
>> 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-14 Thread Ashley Pittman

"sudo" and "su" are two similar commands for doing nearly identical things, you 
should be running one or the other but there is no need to run both.  "sudo -s" 
is probably the command you should have used.  It's a very common mistake.

sudo is a command for allowing you to run commands as another user, either 
using your own or no password.  su is a command to allow you to run commands as 
another user using their password.  What sudo su is doing is running a command 
as root which is then running a shell as root, "sudo -s" is a much better way 
of achieving the same effect.

Ashley.

On 13 Feb 2011, at 22:16, Tena Sakai wrote:

> Thank you, Ashley, for your comments.
> 
> I do have a question.
> I was using 'sudo su' to document the problem I am running
> into for people who read this mailing list, as well as for
> my own record.  Why would you say I shouldn't be doing so?
> 
> Regards,
> 
> Tena
> 
> 
> On 2/13/11 1:29 PM, "Ashley Pittman"  wrote:
> 
>> On 12 Feb 2011, at 14:06, Ralph Castain wrote:
>> 
>>> Have you searched the email archive and/or web for openmpi and Amazon cloud?
>>> Others have previously worked through many of these problems for that
>>> environment - might be worth a look to see if someone already solved this, or
>>> at least a contact point for someone who is already running in that
>>> environment.
>> 
>> I've run Open MPI on Amazon ec2 for over a year and never experienced any
>> problems like the original poster describes.
>> 
>>> IIRC, there are some unique problems with running on that platform.
>> 
>> 
>> None that I'm aware of.
>> 
>> EC2 really is no different from any other environment I've used, either real
>> or virtual, a simple download, ./configure, make and make install has always
>> resulted in a working OpenMPI assuming a shared install location and home
>> directory (for launching applications from).
>> 
>> When I'm using EC2 I tend to re-name machines into something that is easier to
>> follow, typically "cloud[0-15].ec2" assuming I am running 16 machines, I
>> change the hostname of each host and then write a /etc/hosts file to convert
>> from hostname to internal IP address.  I then export /home from cloud0.ec2 to
>> all the other nodes and configure OpenMPI with --prefix=/home/ashley/install
>> so that the code is installed everywhere.
>> 
>> For EC2 Instances I commonly use Fedora but have also used Ubuntu and Solaris,
>> all have been fundamentally similar.
>> 
>> My other tip for using EC2 would be to use a persistent "home" folder by
>> renting a disk partition and attaching it to the first instance you boot in a
>> session.  You pay for this by Gb/Month, I was able to use a 5Gb device which I
>> mounted at /home in cloud0.ec2 and NFS exported to the other instances, again
>> at /home.  You'll need to add "ForwardAgent yes" to your personal .ssh/config
>> to allow you to hop around inside the virtual cluster without entering a
>> password.  The persistent devices are called "Volumes" in EC2 speak, there is
>> no need to create snapshots unless you want to share your volume with other
>> people.
>> 
>> Ashley.
>> 
>> Ps, I would recommend reading up on sudo and su, "sudo su" is not a command
>> you should be typing.
> 
> 

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-13 Thread Tena Sakai
Hi Gus,

I have digressed and started downward descent...

I was trying to make a simple and clear case.  Everything
I write in this very mail is about local machines.  There
is no virtual machines involved.  I am talking about two
machines, vixen and dasher, which share the same file
structure.  Vixen is a nfs server and dasher is an nfs
client.  I have just installed openmpi 1.4.3 on dasher,
which is the same version I have on vixen.

I have a file app.ac3, which looks like:
  [tsakai@vixen Rmpi]$ cat app.ac3
  -H dasher.egcrc.org  -np 1 hostname
  -H dasher.egcrc.org  -np 1 hostname
  -H vixen.egcrc.org   -np 1 hostname
  -H vixen.egcrc.org   -np 1 hostname
  [tsakai@vixen Rmpi]$

Vixen can run this without any problem:
  [tsakai@vixen Rmpi]$ mpirun -app app.ac3
  vixen.egcrc.org
  vixen.egcrc.org
  dasher.egcrc.org
  dasher.egcrc.org
  [tsakai@vixen Rmpi]$

But I can't run this very command from dasher:
  [tsakai@vixen Rmpi]$
  [tsakai@vixen Rmpi]$ ssh dasher
  Last login: Sun Feb 13 19:26:57 2011 from vixen.egcrc.org
  [tsakai@dasher ~]$
  [tsakai@dasher ~]$ cd Notes/R/parallel/Rmpi/
  [tsakai@dasher Rmpi]$
  [tsakai@dasher Rmpi]$ mpirun -app app.ac3
  mpirun: killing job...

  --
  mpirun noticed that the job aborted, but has no info as to the process
  that caused that situation.
  --
  --
  mpirun was unable to cleanly terminate the daemons on the nodes shown
  below. Additional manual cleanup may be required - please refer to
  the "orte-clean" tool for assistance.
  --
      vixen.egcrc.org - daemon did not report back when launched
  [tsakai@dasher Rmpi]$

After I issued the mpirun command, it hung and I had to Control-C out
of it, at which point it generated all the lines from " mpirun: killing job..."
and below.

A strange thing is that dasher has no problem executing the same
thing via ssh:
  [tsakai@dasher Rmpi]$ ssh vixen.egcrc.org hostname
  vixen.egcrc.org
  [tsakai@dasher Rmpi]$

In fact, dasher can run it via mpirun so long as no foreign machine
is present in the app file.  Ie.,
  [tsakai@dasher Rmpi]$ cat app.ac4
  -H dasher.egcrc.org  -np 1 hostname
  -H dasher.egcrc.org  -np 1 hostname
  # -H vixen.egcrc.org   -np 1 hostname
  # -H vixen.egcrc.org   -np 1 hostname
  [tsakai@dasher Rmpi]$
  [tsakai@dasher Rmpi]$ mpirun -app app.ac4
  dasher.egcrc.org
  dasher.egcrc.org
  [tsakai@dasher Rmpi]$

Can you please tell me why I can go one way (from vixen to dasher)
and not the other way (dasher to vixen)?

Thank you.

Tena


On 2/12/11 9:42 PM, "Gustavo Correa"  wrote:

> Hi Tena
>
> Thank you for taking the time to explain the details of
> the EC2 procedure.
>
> I am afraid everything in my bag of tricks was used.
> As Ralph and Jeff suggested, this seems to be a very specific
> problem with EC2.
>
> The difference in behavior when you run as root vs. when you
> run as Tena, tells that there is some use restriction to regular users
> in EC2 that isn't present in common machines (Linux or other), I guess.
> This may be yet another  'stone to turn', as you like to say.
> It also suggests that there is nothing wrong in principle with your
> openMPI setup or with your program, otherwise root would not be able to run
> it.
>
> Besides Ralph's suggestion of trying the EC2 mailing list archive,
> I wonder if EC2 has any type of user support where you could ask
> for help.
> After all, it is a paid service, isn't it?
> (OpenMPI is not paid and has a great customer service, doesn't it?  :) )
> You have a well documented case to present,
> and the very peculiar fact that the program fails for normal users but runs
> for root.
> This should help the EC2 support to start looking for a solution.
>
> I am running out of suggestions of what you could try on your own.
> But let me try:
>
> 1) You may try to reduce the problem to its lowest common denominator,
> perhaps by trying to run non-R based MPI programs on EC2, maybe the hello_c.c,
> ring_c.c, and connectivity_c.c programs in the OpenMPI examples directory.
> This would be to avoid the extra layer of complexity introduced by R.
> Even simpler would be to run 'hostname' with mpiexec (mpiexec -np 2 hostname).
> I.e. go in a progression of increasing complexity, see where you hit the wall.
> This may shed some light on what is going on.
>
> I don't know if this suggestion may  really help, though.
> It is not clear to me where the thing fails, whether it is during program
> execution,
> or while mpiexec is setting up the environment for the program to run.
> If it is very early in the process, before the program starts, my suggestion
> won't work.
> Jeff and Ralph, who know OpenMPI inside out, may have better advice in this
> regard.
>
> 2) Another thing 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-13 Thread Tena Sakai
Thank you, Ashley, for your comments.

I do have a question.
I was using 'sudo su' to document the problem I am running
into for people who read this mailing list, as well as for
my own record.  Why would you say I shouldn't be doing so?

Regards,

Tena


On 2/13/11 1:29 PM, "Ashley Pittman"  wrote:

> On 12 Feb 2011, at 14:06, Ralph Castain wrote:
> 
>> Have you searched the email archive and/or web for openmpi and Amazon cloud?
>> Others have previously worked through many of these problems for that
>> environment - might be worth a look to see if someone already solved this, or
>> at least a contact point for someone who is already running in that
>> environment.
> 
> I've run Open MPI on Amazon ec2 for over a year and never experienced any
> problems like the original poster describes.
> 
>> IIRC, there are some unique problems with running on that platform.
> 
> 
> None that I'm aware of.
> 
> EC2 really is no different from any other environment I've used, either real
> or virtual, a simple download, ./configure, make and make install has always
> resulted in a working OpenMPI assuming a shared install location and home
> directory (for launching applications from).
> 
> When I'm using EC2 I tend to re-name machines into something that is easier to
> follow, typically "cloud[0-15].ec2" assuming I am running 16 machines, I
> change the hostname of each host and then write a /etc/hosts file to convert
> from hostname to internal IP address.  I then export /home from cloud0.ec2 to
> all the other nodes and configure OpenMPI with --prefix=/home/ashley/install
> so that the code is installed everywhere.
> 
> For EC2 Instances I commonly use Fedora but have also used Ubuntu and Solaris,
> all have been fundamentally similar.
> 
> My other tip for using EC2 would be to use a persistent "home" folder by
> renting a disk partition and attaching it to the first instance you boot in a
> session.  You pay for this by Gb/Month, I was able to use a 5Gb device which I
> mounted at /home in cloud0.ec2 and NFS exported to the other instances, again
> at /home.  You'll need to add "ForwardAgent yes" to your personal .ssh/config
> to allow you to hop around inside the virtual cluster without entering a
> password.  The persistent devices are called "Volumes" in EC2 speak, there is
> no need to create snapshots unless you want to share your volume with other
> people.
> 
> Ashley.
> 
> Ps, I would recommend reading up on sudo and su, "sudo su" is not a command
> you should be typing.




Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-13 Thread Ashley Pittman
On 12 Feb 2011, at 14:06, Ralph Castain wrote:

> Have you searched the email archive and/or web for openmpi and Amazon cloud? 
> Others have previously worked through many of these problems for that 
> environment - might be worth a look to see if someone already solved this, or 
> at least a contact point for someone who is already running in that 
> environment.

I've run Open MPI on Amazon ec2 for over a year and never experienced any 
problems like the original poster describes.

> IIRC, there are some unique problems with running on that platform.


None that I'm aware of.

EC2 really is no different from any other environment I've used, either real or
virtual.  A simple download, ./configure, make and make install has always
resulted in a working OpenMPI, assuming a shared install location and home
directory (for launching applications from).

When I'm using EC2 I tend to rename machines to something that is easier to
follow, typically "cloud[0-15].ec2" assuming I am running 16 machines.  I change
the hostname of each host and then write an /etc/hosts file to convert from
hostname to internal IP address.  I then export /home from cloud0.ec2 to all
the other nodes and configure OpenMPI with --prefix=/home/ashley/install so
that the code is installed everywhere.
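
A minimal sketch of the /etc/hosts step described above, assuming the internal
IP addresses of the instances are already known.  The addresses are made up for
illustration and hosts_lines is a hypothetical helper name, not part of any EC2
tool:

```shell
#!/bin/sh
# Print one /etc/hosts line per instance, naming them cloud0.ec2, cloud1.ec2, ...
# Arguments: the internal IP addresses, in the order the names are assigned.
hosts_lines() {
    i=0
    for ip in "$@"; do
        printf '%s\tcloud%d.ec2 cloud%d\n' "$ip" "$i" "$i"
        i=$((i + 1))
    done
}

# Example with two made-up internal addresses.  Once the output looks right,
# it would be appended to /etc/hosts on every node (as root).
hosts_lines 10.0.0.11 10.0.0.12
```

The same list of names can then be reused for the NFS export of /home and for
the host list given to mpirun.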

For EC2 instances I commonly use Fedora but have also used Ubuntu and Solaris;
all have been fundamentally similar.

My other tip for using EC2 would be to use a persistent "home" folder by
renting a disk partition and attaching it to the first instance you boot in a
session.  You pay for this by GB/month; I was able to use a 5 GB device which I
mounted at /home on cloud0.ec2 and NFS-exported to the other instances, again
at /home.  You'll need to add "ForwardAgent yes" to your personal .ssh/config
to allow you to hop around inside the virtual cluster without entering a
password.  The persistent devices are called "Volumes" in EC2 speak; there is
no need to create snapshots unless you want to share your volume with other
people.
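
A minimal sketch of the "ForwardAgent yes" setting mentioned above.  The host
pattern cloud*.ec2 follows the naming used in this message, and
add_forward_agent is a hypothetical helper name.  Agent forwarding only helps
if an ssh-agent holding your key is running on the machine you start from:

```shell
#!/bin/sh
# Append an agent-forwarding stanza for the virtual cluster to an ssh config
# file.  Argument: path of the config file (normally ~/.ssh/config).
add_forward_agent() {
    cfg=$1
    mkdir -p "$(dirname "$cfg")"
    cat >> "$cfg" <<'EOF'
Host cloud*.ec2
    ForwardAgent yes
EOF
    # ssh refuses config files that other users can write to.
    chmod 600 "$cfg"
}

# Usage: add_forward_agent "$HOME/.ssh/config"
```

Restricting the stanza to the cluster's host pattern avoids forwarding the
agent to unrelated machines.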

Ashley.

PS: I would recommend reading up on sudo and su; "sudo su" is not a command you
should be typing.

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk




Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-13 Thread Tena Sakai
Hi Gus,

Thank you for your reply, comments, and suggestions.

EC2 does have support, but it costs extra and I am
discouraged from using it for budgetary reasons.  Also, I have
heard that their support is geared toward virtualization
and Amazon-environment-specific issues.  I may have to set
all that aside and ask them for help...

(Incidentally, I really like openmpi mailing list.  The
 atmosphere you people generate and sustain is quite wonderful
 and I hope one day I can be a contributing member.)

As to your suggestions:
1) This is a good idea.  I will do hostname via mpirun.  Increasing
   complexity from the simplest will probably reveal something I
   don't know.
2) I have run R serially on EC2 with this very ami.  I have not seen
   any problem and many others have done the same.

Also, here is an idea I came up with in my sleep that I want to check
out.  The AMI I have been using is CentOS 5.5, which I built
from the ground up.  EC2 has something called the Amazon Linux AMI.  I
don't know what distribution that is, and I am sure it has neither
R nor openmpi.  But I thought I would load the components I
need onto Amazon Linux (again, as you suggest, starting with the
simplest case) and see if I can reproduce the behavior I have
been experiencing on a different (and Amazon "official") AMI.

I will report as I discover interesting/relevant findings.

Regards,

Tena


On 2/12/11 9:42 PM, "Gustavo Correa"  wrote:

> Hi Tena
>
> Thank you for taking the time to explain the details of
> the EC2 procedure.
>
> I am afraid everything in my bag of tricks was used.
> As Ralph and Jeff suggested, this seems to be a very specific
> problem with EC2.
>
> The difference in behavior when you run as root vs. when you
> run as Tena, tells that there is some use restriction to regular users
> in EC2 that isn't present in common machines (Linux or other), I guess.
> This may be yet another  'stone to turn', as you like to say.
> It also suggests that there is nothing wrong in principle with your
> openMPI setup or with your program, otherwise root would not be able to run
> it.
>
> Besides Ralph suggestion of trying the EC2 mailing list archive,
> I wonder if EC2 has any type of user support where you could ask
> for help.
> After all, it is a paid service, isn't it?
> (OpenMPI is not paid and has a great customer service, doesn't it?  :) )
> You have a well documented case to present,
> and the very peculiar fact that the program fails for normal users but runs
> for root.
> This should help the EC2 support to start looking for a solution.
>
> I am running out of suggestions of what you could try on your own.
> But let me try:
>
> 1) You may try to reduce the problem to its lowest common denominator,
> perhaps by trying to run non-R based MPI programs on EC2, maybe the hello_c.c,
> ring_c.c, and connectivity_c.c programs in the OpenMPI examples directory.
> This would be to avoid the extra layer of complexity introduced by R.
> Even simpler would be to run 'hostname' with mpiexec (mpiexec -np 2 hostname).
> I.e. go in a progression of increasing complexity, see where you hit the wall.
> This may shed some light on what is going on.
>
> I don't know if this suggestion may  really help, though.
> It is not clear to me where the thing fails, whether it is during program
> execution,
> or while mpiexec is setting up the environment for the program to run.
> If it is very early in the process, before the program starts, my suggestion
> won't work.
> Jeff and Ralph, who know OpenMPI inside out, may have better advice in this
> regard.
>
> 2) Another thing would be to try to run R on EC2 in serial mode, without
> mpiexec,
> interactively or via script, to see who EC2 doesn't like: R or OpenMPI (but
> maybe it's both).
>
> Gus Correa
>
> On Feb 11, 2011, at 9:54 PM, Tena Sakai wrote:
>
>> Hi Gus,
>>
>> Thank you for your tips.
>>
>> I didn't find any smoking gun or anything comes close.
>> Here's the upshot:
>>
>>  [tsakai@ip-10-114-239-188 ~]$ ulimit -a
>>  core file size  (blocks, -c) 0
>>  data seg size   (kbytes, -d) unlimited
>>  scheduling priority (-e) 0
>>  file size   (blocks, -f) unlimited
>>  pending signals (-i) 61504
>>  max locked memory   (kbytes, -l) 32
>>  max memory size (kbytes, -m) unlimited
>>  open files  (-n) 1024
>>  pipe size(512 bytes, -p) 8
>>  POSIX message queues (bytes, -q) 819200
>>  real-time priority  (-r) 0
>>  stack size  (kbytes, -s) 8192
>>  cpu time   (seconds, -t) unlimited
>>  max user processes  (-u) 61504
>>  virtual memory  (kbytes, -v) unlimited
>>  file locks  (-x) unlimited
>>  [tsakai@ip-10-114-239-188 ~]$
>>  [tsakai@ip-10-114-239-188 ~]$ sudo su
>>  bash-3.2#
>>  bash-3.2# ulimit -a
>>  core file size  (blocks, -c) 0
>>  data seg size   (kbytes, -d) unlimited
>>  

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-13 Thread Gustavo Correa
Hi Tena

Thank you for taking the time to explain the details of
the EC2 procedure.

I am afraid everything in my bag of tricks was used.
As Ralph and Jeff suggested, this seems to be a very specific
problem with EC2.

The difference in behavior when you run as root vs. when you
run as Tena suggests that there is some usage restriction on regular users
in EC2 that isn't present on common machines (Linux or other), I guess.
This may be yet another 'stone to turn', as you like to say.
It also suggests that there is nothing wrong in principle with your
OpenMPI setup or with your program, otherwise root would not be able to run it.

Besides Ralph's suggestion of trying the EC2 mailing list archive,
I wonder if EC2 has any type of user support where you could ask
for help.
After all, it is a paid service, isn't it?
(OpenMPI is not paid and has great customer service, doesn't it?  :) )
You have a well-documented case to present,
and the very peculiar fact that the program fails for normal users but runs for
root.
This should help EC2 support start looking for a solution.

I am running out of suggestions of what you could try on your own.
But let me try:

1) You may try to reduce the problem to its lowest common denominator,
perhaps by trying to run non-R based MPI programs on EC2, maybe the hello_c.c,
ring_c.c, and connectivity_c.c programs in the OpenMPI examples directory.
This would avoid the extra layer of complexity introduced by R.
Even simpler would be to run 'hostname' with mpiexec (mpiexec -np 2 hostname).
I.e. go in a progression of increasing complexity and see where you hit the wall.
This may shed some light on what is going on.

I don't know if this suggestion will really help, though.
It is not clear to me where the thing fails: during program execution,
or while mpiexec is setting up the environment for the program to run.
If it is very early in the process, before the program starts, my suggestion
won't work.
Jeff and Ralph, who know OpenMPI inside out, may have better advice in this
regard.

2) Another thing would be to try to run R on EC2 in serial mode, without
mpiexec, interactively or via script, to see which one EC2 doesn't like:
R or OpenMPI (but maybe it's both).
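
A minimal sketch of the "progression of increasing complexity" from
suggestions 1) and 2), assuming a POSIX shell.  run_steps is a hypothetical
helper name, and the commands in the usage comment assume that hello_c has
already been built from the OpenMPI examples directory:

```shell
#!/bin/sh
# Run the given commands in order and stop at the first one that fails,
# so the failing layer (shell, mpiexec, MPI program, R) is easy to identify.
run_steps() {
    for step in "$@"; do
        echo "== $step"
        if ! sh -c "$step"; then
            echo "FAILED at: $step"
            return 1
        fi
    done
    echo "all steps passed"
}

# Simplest first: no MPI at all, then mpiexec with a non-MPI command, then an
# MPI example program, and R only at the very end.
# run_steps 'hostname' \
#           'mpiexec -np 2 hostname' \
#           'mpiexec -np 2 ./hello_c' \
#           'Rscript -e "1+1"'
```

Running the same list once as the regular user and once as root would show
at which step the two accounts start to differ.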

Gus Correa

On Feb 11, 2011, at 9:54 PM, Tena Sakai wrote:

> Hi Gus,
> 
> Thank you for your tips.
> 
> I didn't find any smoking gun or anything comes close.
> Here's the upshot:
> 
>  [tsakai@ip-10-114-239-188 ~]$ ulimit -a
>  core file size  (blocks, -c) 0
>  data seg size   (kbytes, -d) unlimited
>  scheduling priority (-e) 0
>  file size   (blocks, -f) unlimited
>  pending signals (-i) 61504
>  max locked memory   (kbytes, -l) 32
>  max memory size (kbytes, -m) unlimited
>  open files  (-n) 1024
>  pipe size(512 bytes, -p) 8
>  POSIX message queues (bytes, -q) 819200
>  real-time priority  (-r) 0
>  stack size  (kbytes, -s) 8192
>  cpu time   (seconds, -t) unlimited
>  max user processes  (-u) 61504
>  virtual memory  (kbytes, -v) unlimited
>  file locks  (-x) unlimited
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ sudo su
>  bash-3.2#
>  bash-3.2# ulimit -a
>  core file size  (blocks, -c) 0
>  data seg size   (kbytes, -d) unlimited
>  scheduling priority (-e) 0
>  file size   (blocks, -f) unlimited
>  pending signals (-i) 61504
>  max locked memory   (kbytes, -l) 32
>  max memory size (kbytes, -m) unlimited
>  open files  (-n) 1024
>  pipe size(512 bytes, -p) 8
>  POSIX message queues (bytes, -q) 819200
>  real-time priority  (-r) 0
>  stack size  (kbytes, -s) 8192
>  cpu time   (seconds, -t) unlimited
>  max user processes  (-u) unlimited
>  virtual memory  (kbytes, -v) unlimited
>  file locks  (-x) unlimited
>  bash-3.2#
>  bash-3.2#
>  bash-3.2# ulimit -a > root_ulimit-a
>  bash-3.2# exit
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
>  14c14
>  < max user processes  (-u) unlimited
>  ---
>  > max user processes  (-u) 61504
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>  480 0   762674
>  762674
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ sudo su
>  bash-3.2#
>  bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>  512 0   762674
>  762674
>  bash-3.2# exit
>  exit
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
>  -bash: sysctl: command not found
>  

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-12 Thread Tena Sakai
Hi Ralph,

Yes, I have.  In the openmpi email archive I found a few threads
related to EC2, but nothing relevant to what I am experiencing.
On the EC2 discussion list I found a few mentions of openmpi, including
one case where I may be able to help, but nothing that is relevant
to my situation.

Regards,

Tena


On 2/12/11 6:06 AM, "Ralph Castain"  wrote:

> Have you searched the email archive and/or web for openmpi and Amazon cloud?
> Others have previously worked through many of these problems for that
> environment - might be worth a look to see if someone already solved this, or
> at least a contact point for someone who is already running in that
> environment.
>
> IIRC, there are some unique problems with running on that platform.
>
>
> On Feb 12, 2011, at 12:38 AM, Tena Sakai wrote:
>
>> Hi Gus,
>>
>> Thank you for all your suggestions.
>>
>> I fixed the limits as you suggested and ran the test and
>> I am still getting the same failure.  More on that in a
>> bit.  But here is a bit of my response to what you mentioned.
>>
>>> the IP number you checked now is not the same as in your
>>> message with the MPI failure/errors.
>>> Not sure if I understand which computers we're talking about,
>>> or where these computers are (at Amazon?),
>>> or if they change depending on each session you use to run your programs,
>>> if they are identical machines with the same limits or if they differ.
>>
>> Everything I mentioned in last 2-3 days is on Amazon EC2 cloud.  I
>> have no problem running the same thing locally (vixen is my local
>> machine):
>>
>>  [tsakai@vixen Rmpi]$ cat app.ac1
>>  -H vixen   -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 5
>>  -H vixen   -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 6
>>  -H blitzen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 7
>>  -H blitzen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 8
>>  [tsakai@vixen Rmpi]$
>>  [tsakai@vixen Rmpi]$ mpirun --app app.ac1
>>  5 vixen.egcrc.org
>>  8 vixen.egcrc.org
>>  13 blitzen.egcrc.org
>>  21 blitzen.egcrc.org
>>  [tsakai@vixen Rmpi]$ # these lines are correct result.
>>  [tsakai@vixen Rmpi]$
>>
>> With Amazon EC2, where the strange behavior happens, is a virtualized
>> environment.  They charge by hours.  I launch an instance of a machine
>> when I need it and I shut them down when I am done.  Each time I get
>> different IP addresses (2 per instance, one on internal network and
>> the other for public interface).  That is why I don't show consistent
>> ip address or dns.  Every time I shutdown the machine, what I did on
>> that instance disappears and on next instance I have to recreate it
>> from scratch --case in point is ~/home/.ssh/config--, which is what
>> I have been doing (unless I take 'snapshot' of the image and save it
>> to a persistent storage (and doing snapshot is a bit of work)).
>>
>>> One of the error messages mentions LD_LIBRARY_PATH.
>>> Is it set to point to the OpenMPI lib directory?
>>> Remember, OpenMPI requires both PATH and LD_LIBRARY_PATH properly
>>> set.
>>
>> Yes, I have been setting LD_LIBRARY_PATH manually every time, because
>> I have neglected to put it into my bash startup file as part of AMI
>> (Amazon Machine Image) building.
>>
>> Now what I have done is get onto an instance as tsakai, save output
>> from 'ulimit -a', set /etc/security/limits.conf parameters as you
>> suggest, get off and re-log onto the instance (thereby activating
>> those ulimit parameters), and ran the same (actually simpler) test,
>> as tsakai and as root.
>>
>>  [tsakai@vixen Rmpi]$
>>  [tsakai@vixen Rmpi]$ # 2ec2 below is a script/wrapper around ssh to
>>  [tsakai@vixen Rmpi]$ # make ssh invocation line shorter.
>>  [tsakai@vixen Rmpi]$
>>  [tsakai@vixen ec2]$ 2ec2 ec2-50-16-55-64.compute-1.amazonaws.com
>>  The authenticity of host 'ec2-50-16-55-64.compute-1.amazonaws.com (50.16.55.64)' can't be established.
>>  RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>  Are you sure you want to continue connecting (yes/no)? yes
>>  Last login: Tue Feb  8 22:52:54 2011 from 10.201.197.188
>>  [tsakai@ip-10-114-138-129 ~]$
>>  [tsakai@ip-10-114-138-129 ~]$ ulimit -a > mylimit.1
>>  [tsakai@ip-10-114-138-129 ~]$
>>  [tsakai@ip-10-114-138-129 ~]$ sudo su
>>  bash-3.2#
>>  bash-3.2# cat - >> /etc/security/limits.conf
>>  *   -   memlock -1
>>  *   -   stack   -1
>>  *   -   nofile  4096
>>  bash-3.2#
>>  bash-3.2# tail /etc/security/limits.conf
>>  #@student        hard    nproc           20
>>  #@faculty        soft    nproc           20
>>  #@faculty        hard    nproc           50
>>  #ftp             hard    nproc           0
>>  #@student        -       maxlogins       4
>>
>>  # End of file
>>  *   -   memlock -1
>>  *   -   stack   -1
>>  *   -   nofile  4096
>>  bash-3.2#
>>  bash-3.2# exit
>>  exit
>>  [tsakai@ip-10-114-138-129 ~]$
>>  [tsakai@ip-10-114-138-129 ~]$ # logout and log back in to activate the
>>  

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-12 Thread Ralph Castain
Have you searched the email archive and/or web for openmpi and Amazon cloud? 
Others have previously worked through many of these problems for that 
environment - might be worth a look to see if someone already solved this, or 
at least a contact point for someone who is already running in that environment.

IIRC, there are some unique problems with running on that platform.


On Feb 12, 2011, at 12:38 AM, Tena Sakai wrote:

> Hi Gus,
> 
> Thank you for all your suggestions.
> 
> I fixed the limits as you suggested and ran the test and
> I am still getting the same failure.  More on that in a
> bit.  But here is a bit of my response to what you mentioned.
> 
>> the IP number you checked now is not the same as in your
>> message with the MPI failure/errors.
>> Not sure if I understand which computers we're talking about,
>> or where these computers are (at Amazon?),
>> or if they change depending on each session you use to run your programs,
>> if they are identical machines with the same limits or if they differ.
> 
> Everything I mentioned in last 2-3 days is on Amazon EC2 cloud.  I
> have no problem running the same thing locally (vixen is my local
> machine):
> 
>  [tsakai@vixen Rmpi]$ cat app.ac1
>  -H vixen   -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 5
>  -H vixen   -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 6
>  -H blitzen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 7
>  -H blitzen -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/fib.R 8
>  [tsakai@vixen Rmpi]$
>  [tsakai@vixen Rmpi]$ mpirun --app app.ac1
>  5 vixen.egcrc.org
>  8 vixen.egcrc.org
>  13 blitzen.egcrc.org
>  21 blitzen.egcrc.org
>  [tsakai@vixen Rmpi]$ # these lines are correct result.
>  [tsakai@vixen Rmpi]$
> 
> With Amazon EC2, where the strange behavior happens, is a virtualized
> environment.  They charge by hours.  I launch an instance of a machine
> when I need it and I shut them down when I am done.  Each time I get
> different IP addresses (2 per instance, one on internal network and
> the other for public interface).  That is why I don't show consistent
> ip address or dns.  Every time I shutdown the machine, what I did on
> that instance disappears and on next instance I have to recreate it
> from scratch --case in point is ~/home/.ssh/config--, which is what
> I have been doing (unless I take 'snapshot' of the image and save it
> to a persistent storage (and doing snapshot is a bit of work)).
> 
>> One of the error messages mentions LD_LIBRARY_PATH.
>> Is it set to point to the OpenMPI lib directory?
>> Remember, OpenMPI requires both PATH and LD_LIBRARY_PATH properly
>> set.
> 
> Yes, I have been setting LD_LIBRARY_PATH manually every time, because
> I have neglected to put it into my bash startup file as part of AMI
> (Amazon Machine Image) building.
> 
> Now what I have done is get onto an instance as tsakai, save output
> from 'ulimit -a', set /etc/security/limits.conf parameters as you
> suggest, get off and re-log onto the instance (thereby activating
> those ulimit parameters), and ran the same (actually simpler) test,
> as tsakai and as root.
> 
>  [tsakai@vixen Rmpi]$
>  [tsakai@vixen Rmpi]$ # 2ec2 below is a script/wrapper around ssh to
>  [tsakai@vixen Rmpi]$ # make ssh invocation line shorter.
>  [tsakai@vixen Rmpi]$
>  [tsakai@vixen ec2]$ 2ec2 ec2-50-16-55-64.compute-1.amazonaws.com
>  The authenticity of host 'ec2-50-16-55-64.compute-1.amazonaws.com (50.16.55.64)' can't be established.
>  RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>  Are you sure you want to continue connecting (yes/no)? yes
>  Last login: Tue Feb  8 22:52:54 2011 from 10.201.197.188
>  [tsakai@ip-10-114-138-129 ~]$
>  [tsakai@ip-10-114-138-129 ~]$ ulimit -a > mylimit.1
>  [tsakai@ip-10-114-138-129 ~]$
>  [tsakai@ip-10-114-138-129 ~]$ sudo su
>  bash-3.2#
>  bash-3.2# cat - >> /etc/security/limits.conf
>  *   -   memlock -1
>  *   -   stack   -1
>  *   -   nofile  4096
>  bash-3.2#
>  bash-3.2# tail /etc/security/limits.conf
>  #@student        hard    nproc           20
>  #@faculty        soft    nproc           20
>  #@faculty        hard    nproc           50
>  #ftp             hard    nproc           0
>  #@student        -       maxlogins       4
> 
>  # End of file
>  *   -   memlock -1
>  *   -   stack   -1
>  *   -   nofile  4096
>  bash-3.2#
>  bash-3.2# exit
>  exit
>  [tsakai@ip-10-114-138-129 ~]$
>  [tsakai@ip-10-114-138-129 ~]$ # logout and log back in to activate the
>  [tsakai@ip-10-114-138-129 ~]$ # new setting.
>  [tsakai@ip-10-114-138-129 ~]$ exit
>  logout
>  [tsakai@vixen ec2]$
>  [tsakai@vixen ec2]$ # I am back on vixen and about to relogging back onto
>  [tsakai@vixen ec2]$ # the instance which is still running.
>  [tsakai@vixen ec2]$
>  [tsakai@vixen ec2]$ 2ec2 ec2-50-16-55-64.compute-1.amazonaws.com
>  Last login: Fri Feb 11 23:50:47 2011 from 63.193.205.1
>  [tsakai@ip-10-114-138-129 ~]$
>  [tsakai@ip-10-114-138-129 ~]$ 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Jeff Squyres (jsquyres)
Sounds about right. I'm not near a keyboard to check the reasons why pipe(2) 
would fail. 

Specifically, OMPI is failing when it is trying to set up stdin/stdout/stderr
forwarding for your job. Very strange.
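
For reference, pipe(2) can fail with EMFILE when the calling process has
reached its open-file limit, or with ENFILE when the system-wide limit is
reached.  A minimal, Linux-only sketch for checking the first case follows;
fd_usage is a hypothetical helper name, and the limit it prints is that of the
shell running the sketch, which is an assumption about the process of interest:

```shell
#!/bin/sh
# Compare the number of descriptors a process has open with the soft
# open-file limit of the current shell.  Relies on /proc/<pid>/fd.
fd_usage() {
    pid=$1
    used=$(ls "/proc/$pid/fd" 2>/dev/null | wc -l | tr -d ' ')
    limit=$(ulimit -n)
    echo "pid=$pid open_fds=$used soft_limit=$limit"
}

# Report on the current shell; pass the PID of a running mpirun instead to
# inspect that process (its own limit may differ from this shell's).
fd_usage $$
```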

Sent from my PDA. No type good. 

On Feb 11, 2011, at 9:56 PM, "Tena Sakai"  wrote:

> Hi Gus,
> 
> Thank you for your tips.
> 
> I didn't find any smoking gun or anything comes close.
> Here's the upshot:
> 
>  [tsakai@ip-10-114-239-188 ~]$ ulimit -a
>  core file size  (blocks, -c) 0
>  data seg size   (kbytes, -d) unlimited
>  scheduling priority (-e) 0
>  file size   (blocks, -f) unlimited
>  pending signals (-i) 61504
>  max locked memory   (kbytes, -l) 32
>  max memory size (kbytes, -m) unlimited
>  open files  (-n) 1024
>  pipe size(512 bytes, -p) 8
>  POSIX message queues (bytes, -q) 819200
>  real-time priority  (-r) 0
>  stack size  (kbytes, -s) 8192
>  cpu time   (seconds, -t) unlimited
>  max user processes  (-u) 61504
>  virtual memory  (kbytes, -v) unlimited
>  file locks  (-x) unlimited
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ sudo su
>  bash-3.2#
>  bash-3.2# ulimit -a
>  core file size  (blocks, -c) 0
>  data seg size   (kbytes, -d) unlimited
>  scheduling priority (-e) 0
>  file size   (blocks, -f) unlimited
>  pending signals (-i) 61504
>  max locked memory   (kbytes, -l) 32
>  max memory size (kbytes, -m) unlimited
>  open files  (-n) 1024
>  pipe size(512 bytes, -p) 8
>  POSIX message queues (bytes, -q) 819200
>  real-time priority  (-r) 0
>  stack size  (kbytes, -s) 8192
>  cpu time   (seconds, -t) unlimited
>  max user processes  (-u) unlimited
>  virtual memory  (kbytes, -v) unlimited
>  file locks  (-x) unlimited
>  bash-3.2#
>  bash-3.2#
>  bash-3.2# ulimit -a > root_ulimit-a
>  bash-3.2# exit
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
>  14c14
>  < max user processes  (-u) unlimited
>  ---
>  > max user processes  (-u) 61504
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>  480 0   762674
>  762674
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ sudo su
>  bash-3.2#
>  bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>  512 0   762674
>  762674
>  bash-3.2# exit
>  exit
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
>  -bash: sysctl: command not found
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ /sbin/!!
>  /sbin/sysctl -a |grep fs.file-max
>  error: permission denied on key 'kernel.cad_pid'
>  error: permission denied on key 'kernel.cap-bound'
>  fs.file-max = 762674
>  [tsakai@ip-10-114-239-188 ~]$
>  [tsakai@ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max
>  fs.file-max = 762674
>  [tsakai@ip-10-114-239-188 ~]$
> 
> I see a bit of difference between root and tsakai, but I cannot
> believe such small difference results in somewhat a catastrophic
> failure as I have reported.  Would you agree with me?
> 
> Regards,
> 
> Tena
> 
> On 2/11/11 6:06 PM, "Gus Correa"  wrote:
> 
>> Hi Tena
>> 
>> Please read one answer inline.
>> 
>> Tena Sakai wrote:
>>> Hi Jeff,
>>> Hi Gus,
>>> 
>>> Thanks for your replies.
>>> 
>>> I have pretty much ruled out PATH issues by setting tsakai's PATH
>>> as identical to that of root.  In that setting I reproduced the
>>> same result as before: root can run mpirun correctly and tsakai
>>> cannot.
>>> 
>>> I have also checked out permission on /tmp directory.  tsakai has
>>> no problem creating files under /tmp.
>>> 
>>> I am trying to come up with a strategy to show that each and every
>>> programs in the PATH has "world" executable permission.  It is a
>>> stone to turn over, but I am not holding my breath.
>>> 
>>>> ... you are running out of file descriptors. Are file descriptors
>>>> limited on a per-process basis, perchance?
>>> 
>>> I have never heard there is such restriction on Amazon EC2.  There
>>> are folks who keep running instances for a long, long time.  Whereas
>>> in my case, I launch 2 instances, check things out, and then turn
>>> the instances off.  (Given that the state of California has a huge
>>> debts, our funding is very tight.)  So, I really doubt that's the
>>> case.  I have run mpirun unsuccessfully as user tsakai and immediately
>>> after successfully as root.  Still, I would be happy if you can tell
>>> 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Tena Sakai
Hi Gus,

Thank you for your tips.

I didn't find any smoking gun or anything that comes close.
Here's the upshot:

  [tsakai@ip-10-114-239-188 ~]$ ulimit -a
  core file size  (blocks, -c) 0
  data seg size   (kbytes, -d) unlimited
  scheduling priority (-e) 0
  file size   (blocks, -f) unlimited
  pending signals (-i) 61504
  max locked memory   (kbytes, -l) 32
  max memory size (kbytes, -m) unlimited
  open files  (-n) 1024
  pipe size(512 bytes, -p) 8
  POSIX message queues (bytes, -q) 819200
  real-time priority  (-r) 0
  stack size  (kbytes, -s) 8192
  cpu time   (seconds, -t) unlimited
  max user processes  (-u) 61504
  virtual memory  (kbytes, -v) unlimited
  file locks  (-x) unlimited
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ sudo su
  bash-3.2#
  bash-3.2# ulimit -a
  core file size  (blocks, -c) 0
  data seg size   (kbytes, -d) unlimited
  scheduling priority (-e) 0
  file size   (blocks, -f) unlimited
  pending signals (-i) 61504
  max locked memory   (kbytes, -l) 32
  max memory size (kbytes, -m) unlimited
  open files  (-n) 1024
  pipe size(512 bytes, -p) 8
  POSIX message queues (bytes, -q) 819200
  real-time priority  (-r) 0
  stack size  (kbytes, -s) 8192
  cpu time   (seconds, -t) unlimited
  max user processes  (-u) unlimited
  virtual memory  (kbytes, -v) unlimited
  file locks  (-x) unlimited
  bash-3.2#
  bash-3.2#
  bash-3.2# ulimit -a > root_ulimit-a
  bash-3.2# exit
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ ulimit -a > tsakai_ulimit-a
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ diff root_ulimit-a tsakai_ulimit-a
  14c14
  < max user processes  (-u) unlimited
  ---
  > max user processes  (-u) 61504
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
  480 0   762674
  762674
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ sudo su
  bash-3.2#
  bash-3.2# cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
  512 0   762674
  762674
  bash-3.2# exit
  exit
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ sysctl -a |grep fs.file-max
  -bash: sysctl: command not found
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ /sbin/!!
  /sbin/sysctl -a |grep fs.file-max
  error: permission denied on key 'kernel.cad_pid'
  error: permission denied on key 'kernel.cap-bound'
  fs.file-max = 762674
  [tsakai@ip-10-114-239-188 ~]$
  [tsakai@ip-10-114-239-188 ~]$ sudo /sbin/sysctl -a | grep fs.file-max
  fs.file-max = 762674
  [tsakai@ip-10-114-239-188 ~]$

I see a small difference between root and tsakai, but I cannot
believe such a small difference results in the rather catastrophic
failure I have reported.  Would you agree with me?

Regards,

Tena

On 2/11/11 6:06 PM, "Gus Correa"  wrote:

> Hi Tena
>
> Please read one answer inline.
>
> Tena Sakai wrote:
>> Hi Jeff,
>> Hi Gus,
>>
>> Thanks for your replies.
>>
>> I have pretty much ruled out PATH issues by setting tsakai's PATH
>> as identical to that of root.  In that setting I reproduced the
>> same result as before: root can run mpirun correctly and tsakai
>> cannot.
>>
>> I have also checked out permission on /tmp directory.  tsakai has
>> no problem creating files under /tmp.
>>
>> I am trying to come up with a strategy to show that each and every
>> programs in the PATH has "world" executable permission.  It is a
>> stone to turn over, but I am not holding my breath.
>>
>>> ... you are running out of file descriptors. Are file descriptors
>>> limited on a per-process basis, perchance?
>>
>> I have never heard there is such restriction on Amazon EC2.  There
>> are folks who keep running instances for a long, long time.  Whereas
>> in my case, I launch 2 instances, check things out, and then turn
>> the instances off.  (Given that the state of California has a huge
>> debts, our funding is very tight.)  So, I really doubt that's the
>> case.  I have run mpirun unsuccessfully as user tsakai and immediately
>> after successfully as root.  Still, I would be happy if you can tell
>> me a way to tell number of file descriptors used or remaining.
>>
>> Your mentioned file descriptors made me think of something under
>> /dev.  But I don't know exactly what I am fishing.  Do you have
>> some suggestions?
>>
>
> 1) If the environment has anything to do with Linux,
> check:
>
> cat /proc/sys/fs/file-nr /proc/sys/fs/file-max
>
>
> or
>
> sysctl -a |grep fs.file-max
>
> This max can be set (fs.file-max=whatever_is_reasonable)
> in /etc/sysctl.conf
>
> 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Gus Correa

Hi Tena

Please read one answer inline.

Tena Sakai wrote:

Hi Jeff,
Hi Gus,

Thanks for your replies.

I have pretty much ruled out PATH issues by setting tsakai's PATH
as identical to that of root.  In that setting I reproduced the
same result as before: root can run mpirun correctly and tsakai
cannot.

I have also checked out permission on /tmp directory.  tsakai has
no problem creating files under /tmp.

I am trying to come up with a strategy to show that each and every
program in the PATH has "world" executable permission.  It is a
stone to turn over, but I am not holding my breath.


... you are running out of file descriptors. Are file descriptors
limited on a per-process basis, perchance?


I have never heard of such a restriction on Amazon EC2.  There
are folks who keep running instances for a long, long time, whereas
in my case, I launch 2 instances, check things out, and then turn
the instances off.  (Given that the state of California has huge
debts, our funding is very tight.)  So, I really doubt that's the
case.  I have run mpirun unsuccessfully as user tsakai and immediately
afterwards successfully as root.  Still, I would be happy if you can
tell me a way to find the number of file descriptors used or remaining.

Your mention of file descriptors made me think of something under
/dev.  But I don't know exactly what I am fishing for.  Do you have
some suggestions?



1) If the environment has anything to do with Linux,
check:

cat /proc/sys/fs/file-nr /proc/sys/fs/file-max


or

sysctl -a |grep fs.file-max

This max can be set (fs.file-max=whatever_is_reasonable)
in /etc/sysctl.conf

See 'man sysctl' and 'man sysctl.conf'

2) Another possible source of limits.

Check "ulimit -a" (bash) or "limit" (tcsh).

If you need to change look at:

/etc/security/limits.conf

(See also 'man limits.conf')

**

Since "root can but Tena cannot",
I would check 2) first,
as they are the 'per user/per group' limits,
whereas 1) is kernel/system-wise.
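
The two checks above could be combined into one small script. This is only a sketch, assuming a Linux node with /proc mounted; the helper names fd_limits and fd_in_use are made up for illustration and are not part of OpenMPI or any system tool. Running it once as tsakai and once as root, then comparing the output, would show whether the per-user limits differ.

```shell
#!/bin/sh
# Sketch only: report descriptor limits and usage for the current user.
# fd_limits and fd_in_use are hypothetical helper names.

# Per-process limits on open files for this shell (soft, then hard).
fd_limits() {
    printf 'soft=%s hard=%s\n' "$(ulimit -S -n)" "$(ulimit -H -n)"
}

# Count descriptors currently open in a process (default: this shell).
# Relies on /proc, so it is Linux-specific.
fd_in_use() {
    ls "/proc/${1:-$$}/fd" 2>/dev/null | wc -l
}

fd_limits
echo "open descriptors in this shell: $(fd_in_use)"

# System-wide counters: allocated, unused, maximum.
if [ -r /proc/sys/fs/file-nr ]; then
    cat /proc/sys/fs/file-nr
fi
```

If the soft limit reported for tsakai is much lower than root's, /etc/security/limits.conf is the first place to look.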

I hope this helps,
Gus Correa

PS - I know you are a wise and careful programmer,
but here we had cases of programs that would
fail because of too many files that were open and never closed,
eventually exceeding the max available/permissible.
So, it does happen.


I wish I could reproduce this (weird) behavior on a different
set of machines.  I certainly cannot in my local environment.  Sigh!

Regards,

Tena


On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)"  wrote:


It is concerning if the pipe system call fails - I can't think of why that
would happen. That's not usually a permissions issue but rather a deeper
indication that something is either seriously wrong on your system or you are
running out of file descriptors. Are file descriptors limited on a per-process
basis, perchance?

Sent from my PDA. No type good.

On Feb 11, 2011, at 10:08 AM, "Gus Correa"  wrote:


Hi Tena

Since root can but you can't,
is it a directory permission problem perhaps?
Check the execution directory permission (on both machines,
if this is not an NFS-mounted dir).
I am not sure, but IIRC OpenMPI also uses /tmp for
under-the-hood stuff, so it is worth checking permissions there also.
Just a naive guess.

Congrats for all the progress with the cloudy MPI!

Gus Correa

Tena Sakai wrote:

Hi,
I have made a bit more progress.  I think I can say ssh authenti-
cation problem is behind me now.  I am still having a problem running
mpirun, but the latest discovery, which I can reproduce, is that
I can run mpirun as root.  Here's the session log:
 [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
 Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
 [tsakai@ip-10-195-198-31 ~]$
 [tsakai@ip-10-195-198-31 ~]$ ll
 total 8
 -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
 -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
 [tsakai@ip-10-195-198-31 ~]$
 [tsakai@ip-10-195-198-31 ~]$ ll .ssh
 total 16
 -rw--- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
 -rw--- 1 tsakai tsakai  102 Feb 11 00:34 config
 -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
 -rw--- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
 [tsakai@ip-10-195-198-31 ~]$
 [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
 Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
 [tsakai@ip-10-100-243-195 ~]$
 [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
 [tsakai@ip-10-100-243-195 ~]$ hostname
 ip-10-100-243-195
 [tsakai@ip-10-100-243-195 ~]$
 [tsakai@ip-10-100-243-195 ~]$ ll
 total 8
 -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
 -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
 [tsakai@ip-10-100-243-195 ~]$
 [tsakai@ip-10-100-243-195 ~]$
 [tsakai@ip-10-100-243-195 ~]$ cat app.ac
 -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
 -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
 -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
 -H ip-10-100-243-195.ec2.internal -np 1 Rscript 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Tena Sakai
Hi Jeff,
Hi Gus,

Thanks for your replies.

I have pretty much ruled out PATH issues by setting tsakai's PATH
as identical to that of root.  In that setting I reproduced the
same result as before: root can run mpirun correctly and tsakai
cannot.

I have also checked out permission on /tmp directory.  tsakai has
no problem creating files under /tmp.

I am trying to come up with a strategy to show that each and every
program in the PATH has "world" executable permission.  It is a
stone to turn over, but I am not holding my breath.
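
One way to turn that stone over is sketched below. It assumes a find that supports -maxdepth (GNU and BSD find both do); the function name not_world_executable is hypothetical. It prints every regular file in the PATH directories that lacks the "other" execute bit, so an empty result means nothing was found at that level.

```shell
#!/bin/sh
# Sketch only: list regular files in PATH directories that are not
# executable by "other". not_world_executable is a hypothetical name.

# $1: colon-separated directory list (defaults to $PATH).
# The body runs in a subshell so the IFS change does not leak out.
not_world_executable() (
    IFS=:
    for d in ${1:-$PATH}; do
        if [ -d "$d" ]; then
            # -type f skips symlinks; their targets are checked only if
            # they live in a PATH directory themselves.
            find "$d" -maxdepth 1 -type f ! -perm -o=x 2>/dev/null
        fi
    done
)

not_world_executable
```

This checks only the files themselves, not the directories leading to them or any shared libraries they load.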

> ... you are running out of file descriptors. Are file descriptors
> limited on a per-process basis, perchance?

I have never heard of such a restriction on Amazon EC2.  There
are folks who keep running instances for a long, long time, whereas
in my case, I launch 2 instances, check things out, and then turn
the instances off.  (Given that the state of California has huge
debts, our funding is very tight.)  So, I really doubt that's the
case.  I have run mpirun unsuccessfully as user tsakai and immediately
afterwards successfully as root.  Still, I would be happy if you can
tell me a way to find the number of file descriptors used or remaining.

Your mention of file descriptors made me think of something under
/dev.  But I don't know exactly what I am fishing for.  Do you have
some suggestions?

I wish I could reproduce this (weird) behavior on a different
set of machines.  I certainly cannot in my local environment.  Sigh!

Regards,

Tena


On 2/11/11 3:17 PM, "Jeff Squyres (jsquyres)"  wrote:

> It is concerning if the pipe system call fails - I can't think of why that
> would happen. That's not usually a permissions issue but rather a deeper
> indication that something is either seriously wrong on your system or you are
> running out of file descriptors. Are file descriptors limited on a per-process
> basis, perchance?
>
> Sent from my PDA. No type good.
>
> On Feb 11, 2011, at 10:08 AM, "Gus Correa"  wrote:
>
>> Hi Tena
>>
>> Since root can but you can't,
>> is it a directory permission problem perhaps?
>> Check the execution directory permission (on both machines,
>> if this is not an NFS-mounted dir).
>> I am not sure, but IIRC OpenMPI also uses /tmp for
>> under-the-hood stuff, so it is worth checking permissions there also.
>> Just a naive guess.
>>
>> Congrats for all the progress with the cloudy MPI!
>>
>> Gus Correa
>>
>> Tena Sakai wrote:
>>> Hi,
>>> I have made a bit more progress.  I think I can say ssh authenti-
>>> cation problem is behind me now.  I am still having a problem running
>>> mpirun, but the latest discovery, which I can reproduce, is that
>>> I can run mpirun as root.  Here's the session log:
>>>  [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>>>  Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ ll
>>>  total 8
>>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ ll .ssh
>>>  total 16
>>>  -rw--- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
>>>  -rw--- 1 tsakai tsakai  102 Feb 11 00:34 config
>>>  -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>>>  -rw--- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>>>  Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
>>>  [tsakai@ip-10-100-243-195 ~]$ hostname
>>>  ip-10-100-243-195
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ ll
>>>  total 8
>>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ cat app.ac
>>>  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>>>  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>>>  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>>>  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ # go back to machine A
>>>  [tsakai@ip-10-100-243-195 ~]$
>>>  [tsakai@ip-10-100-243-195 ~]$ exit
>>>  logout
>>>  Connection to ip-10-100-243-195.ec2.internal closed.
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ hostname
>>>  ip-10-195-198-31
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun
>>>  [tsakai@ip-10-195-198-31 ~]$
>>>  [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac
>>>  --
>>>  mpirun was unable to launch the specified application as it encountered an
>>> error:
>>>  

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Jeff Squyres (jsquyres)
It is concerning if the pipe system call fails - I can't think of why that 
would happen. That's not usually a permissions issue but rather a deeper
indication that something is either seriously wrong on your system or you are 
running out of file descriptors. Are file descriptors limited on a per-process 
basis, perchance?
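
To illustrate the descriptor point: creating a pipe needs two free descriptor slots, so a process whose open-file limit is already used up cannot create one. The sketch below tries a trivial pipeline under two different limits. It is an illustration of the failure class only, not a diagnosis of the mpirun error, and pipe_under_limit is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: show that creating a pipe needs free descriptor slots.
# pipe_under_limit is a hypothetical helper name.

# Try to build a two-stage pipeline with the open-file limit lowered to $1.
# The parentheses run everything in a subshell, so the lowered limit
# never affects the calling shell.
pipe_under_limit() {
    (
        ulimit -n "$1" || exit 2
        echo probe | cat >/dev/null
    ) 2>/dev/null
}

for limit in 64 3; do
    if pipe_under_limit "$limit"; then
        echo "limit $limit: pipe created"
    else
        echo "limit $limit: pipe could not be created"
    fi
done
```

With descriptors 0 to 2 already taken by stdin, stdout and stderr, a limit of 3 should leave no room for the two ends of a pipe, while 64 leaves plenty.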

Sent from my PDA. No type good. 

On Feb 11, 2011, at 10:08 AM, "Gus Correa"  wrote:

> Hi Tena
> 
> Since root can but you can't,
> is it a directory permission problem perhaps?
> Check the execution directory permission (on both machines,
> if this is not an NFS-mounted dir).
> I am not sure, but IIRC OpenMPI also uses /tmp for
> under-the-hood stuff, so it is worth checking permissions there also.
> Just a naive guess.
> 
> Congrats for all the progress with the cloudy MPI!
> 
> Gus Correa
> 
> Tena Sakai wrote:
>> Hi,
>> I have made a bit more progress.  I think I can say ssh authenti-
>> cation problem is behind me now.  I am still having a problem running
>> mpirun, but the latest discovery, which I can reproduce, is that
>> I can run mpirun as root.  Here's the session log:
>>  [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
>>  Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ ll
>>  total 8
>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ ll .ssh
>>  total 16
>>  -rw--- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
>>  -rw--- 1 tsakai tsakai  102 Feb 11 00:34 config
>>  -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
>>  -rw--- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
>>  Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
>>  [tsakai@ip-10-100-243-195 ~]$
>>  [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
>>  [tsakai@ip-10-100-243-195 ~]$ hostname
>>  ip-10-100-243-195
>>  [tsakai@ip-10-100-243-195 ~]$
>>  [tsakai@ip-10-100-243-195 ~]$ ll
>>  total 8
>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
>>  [tsakai@ip-10-100-243-195 ~]$
>>  [tsakai@ip-10-100-243-195 ~]$
>>  [tsakai@ip-10-100-243-195 ~]$ cat app.ac
>>  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
>>  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
>>  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
>>  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
>>  [tsakai@ip-10-100-243-195 ~]$
>>  [tsakai@ip-10-100-243-195 ~]$ # go back to machine A
>>  [tsakai@ip-10-100-243-195 ~]$
>>  [tsakai@ip-10-100-243-195 ~]$ exit
>>  logout
>>  Connection to ip-10-100-243-195.ec2.internal closed.
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ hostname
>>  ip-10-195-198-31
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac
>>  --
>>  mpirun was unable to launch the specified application as it encountered an
>> error:
>>  Error: pipe function call failed when setting up I/O forwarding subsystem
>>  Node: ip-10-195-198-31
>>  while attempting to start process rank 0.
>>  --
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ # try it as root
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ sudo su
>>  bash-3.2#
>>  bash-3.2# pwd
>>  /home/tsakai
>>  bash-3.2#
>>  bash-3.2# ls -l /root/.ssh/config
>>  -rw--- 1 root root 103 Feb 11 00:56 /root/.ssh/config
>>  bash-3.2#
>>  bash-3.2# cat /root/.ssh/config
>>  Host *
>>  IdentityFile /root/.ssh/.derobee/.kagi
>>  IdentitiesOnly yes
>>  BatchMode yes
>>  bash-3.2#
>>  bash-3.2# pwd
>>  /home/tsakai
>>  bash-3.2#
>>  bash-3.2# ls -l
>>  total 8
>>  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
>>  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
>>  bash-3.2#
>>  bash-3.2# # now is the time for mpirun
>>  bash-3.2#
>>  bash-3.2# mpirun --app ./app.ac
>>  13 ip-10-100-243-195
>>  21 ip-10-100-243-195
>>  5 ip-10-195-198-31
>>  8 ip-10-195-198-31
>>  bash-3.2#
>>  bash-3.2# # It works (being root)!
>>  bash-3.2#
>>  bash-3.2# exit
>>  exit
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai
>>  [tsakai@ip-10-195-198-31 ~]$
>>  [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac
>>  --
>>  mpirun was unable to launch the specified application as it encountered an
>> error:
>>  Error: pipe function call failed when setting up I/O forwarding 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Gus Correa

Hi Tena

Since root can but you can't,
is it a directory permission problem perhaps?
Check the execution directory permission (on both machines,
if this is not an NFS-mounted dir).
I am not sure, but IIRC OpenMPI also uses /tmp for
under-the-hood stuff, so it is worth checking permissions there also.
Just a naive guess.
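
A quick way to run that check on each machine is sketched here. It assumes nothing beyond a POSIX shell; can_write_dir is a hypothetical helper name, and the probe file it creates is removed again right away.

```shell
#!/bin/sh
# Sketch only: check that the current user can create files in the
# directories a launch is likely to touch. can_write_dir is a
# hypothetical helper name.

can_write_dir() {
    dir=$1
    probe="$dir/.write_probe.$$"    # scratch file, removed again below
    # Subshell: a failed redirection must not abort the calling shell.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
    else
        echo "NOT writable: $dir"
        return 1
    fi
}

# The execution directory and the temporary directory mentioned above.
for d in "$PWD" "${TMPDIR:-/tmp}"; do
    can_write_dir "$d" || true
done
ls -ld "$PWD" "${TMPDIR:-/tmp}"
```

Run it as the failing user, from the directory where mpirun is started.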

Congrats for all the progress with the cloudy MPI!

Gus Correa

Tena Sakai wrote:

Hi,

I have made a bit more progress.  I think I can say ssh authenti-
cation problem is behind me now.  I am still having a problem running
mpirun, but the latest discovery, which I can reproduce, is that
I can run mpirun as root.  Here's the session log:

  [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
  Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ ll
  total 8
  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ ll .ssh
  total 16
  -rw--- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
  -rw--- 1 tsakai tsakai  102 Feb 11 00:34 config
  -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
  -rw--- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
  Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
  [tsakai@ip-10-100-243-195 ~]$ hostname
  ip-10-100-243-195
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ ll
  total 8
  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ cat app.ac
  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ # go back to machine A
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ exit
  logout
  Connection to ip-10-100-243-195.ec2.internal closed.
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ hostname
  ip-10-195-198-31
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac
  --
  mpirun was unable to launch the specified application as it encountered an
error:

  Error: pipe function call failed when setting up I/O forwarding subsystem
  Node: ip-10-195-198-31

  while attempting to start process rank 0.
  --
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # try it as root
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ sudo su
  bash-3.2#
  bash-3.2# pwd
  /home/tsakai
  bash-3.2#
  bash-3.2# ls -l /root/.ssh/config
  -rw--- 1 root root 103 Feb 11 00:56 /root/.ssh/config
  bash-3.2#
  bash-3.2# cat /root/.ssh/config
  Host *
  IdentityFile /root/.ssh/.derobee/.kagi
  IdentitiesOnly yes
  BatchMode yes
  bash-3.2#
  bash-3.2# pwd
  /home/tsakai
  bash-3.2#
  bash-3.2# ls -l
  total 8
  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
  bash-3.2#
  bash-3.2# # now is the time for mpirun
  bash-3.2#
  bash-3.2# mpirun --app ./app.ac
  13 ip-10-100-243-195
  21 ip-10-100-243-195
  5 ip-10-195-198-31
  8 ip-10-195-198-31
  bash-3.2#
  bash-3.2# # It works (being root)!
  bash-3.2#
  bash-3.2# exit
  exit
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac
  --
  mpirun was unable to launch the specified application as it encountered an
error:

  Error: pipe function call failed when setting up I/O forwarding subsystem
  Node: ip-10-195-198-31

  while attempting to start process rank 0.
  --
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # I don't get it.
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ exit
  logout
  [tsakai@vixen ec2]$

So, why does it say "pipe function call failed when setting up
I/O forwarding subsystem Node: ip-10-195-198-31"?
The node it is referring to is not the remote machine.  It is
what I call machine A.  I first thought maybe this was a problem
with the PATH variable.  But I don't think so.  I compared root's
PATH to that of tsakai and made them

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-11 Thread Tena Sakai
Hi,

I have made a bit more progress.  I think I can say the ssh
authentication problem is behind me now.  I am still having a problem
running mpirun, but the latest discovery, which I can reproduce, is
that I can run mpirun as root.  Here's the session log:

  [tsakai@vixen ec2]$ 2ec2 ec2-184-73-104-242.compute-1.amazonaws.com
  Last login: Fri Feb 11 00:41:11 2011 from 10.100.243.195
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ ll
  total 8
  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ ll .ssh
  total 16
  -rw------- 1 tsakai tsakai  232 Feb  5 23:19 authorized_keys
  -rw------- 1 tsakai tsakai  102 Feb 11 00:34 config
  -rw-r--r-- 1 tsakai tsakai 1302 Feb 11 00:36 known_hosts
  -rw------- 1 tsakai tsakai  887 Feb  8 22:03 tsakai
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ ssh ip-10-100-243-195.ec2.internal
  Last login: Fri Feb 11 00:36:20 2011 from 10.195.198.31
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ # I am on machine B
  [tsakai@ip-10-100-243-195 ~]$ hostname
  ip-10-100-243-195
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ ll
  total 8
  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:44 app.ac
  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:47 fib.R
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ cat app.ac
  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 5
  -H ip-10-195-198-31.ec2.internal -np 1 Rscript /home/tsakai/fib.R 6
  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 7
  -H ip-10-100-243-195.ec2.internal -np 1 Rscript /home/tsakai/fib.R 8
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ # go back to machine A
  [tsakai@ip-10-100-243-195 ~]$
  [tsakai@ip-10-100-243-195 ~]$ exit
  logout
  Connection to ip-10-100-243-195.ec2.internal closed.
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ hostname
  ip-10-195-198-31
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # Execute mpirun
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ mpirun -app app.ac
  --
  mpirun was unable to launch the specified application as it encountered an
error:

  Error: pipe function call failed when setting up I/O forwarding subsystem
  Node: ip-10-195-198-31

  while attempting to start process rank 0.
  --
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # try it as root
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ sudo su
  bash-3.2#
  bash-3.2# pwd
  /home/tsakai
  bash-3.2#
  bash-3.2# ls -l /root/.ssh/config
  -rw------- 1 root root 103 Feb 11 00:56 /root/.ssh/config
  bash-3.2#
  bash-3.2# cat /root/.ssh/config
  Host *
  IdentityFile /root/.ssh/.derobee/.kagi
  IdentitiesOnly yes
  BatchMode yes
  bash-3.2#
  bash-3.2# pwd
  /home/tsakai
  bash-3.2#
  bash-3.2# ls -l
  total 8
  -rw-rw-r-- 1 tsakai tsakai 274 Feb 11 00:47 app.ac
  -rwxr-xr-x 1 tsakai tsakai 379 Feb 11 00:48 fib.R
  bash-3.2#
  bash-3.2# # now is the time for mpirun
  bash-3.2#
  bash-3.2# mpirun --app ./app.ac
  13 ip-10-100-243-195
  21 ip-10-100-243-195
  5 ip-10-195-198-31
  8 ip-10-195-198-31
  bash-3.2#
  bash-3.2# # It works (being root)!
  bash-3.2#
  bash-3.2# exit
  exit
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # try it one more time as tsakai
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ mpirun --app app.ac
  --
  mpirun was unable to launch the specified application as it encountered an
error:

  Error: pipe function call failed when setting up I/O forwarding subsystem
  Node: ip-10-195-198-31

  while attempting to start process rank 0.
  --
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ # I don't get it.
  [tsakai@ip-10-195-198-31 ~]$
  [tsakai@ip-10-195-198-31 ~]$ exit
  logout
  [tsakai@vixen ec2]$

So, why does it say "pipe function call failed when setting up
I/O forwarding subsystem Node: ip-10-195-198-31"?
The node it is referring to is not the remote machine.  It is
what I call machine A.  I first thought maybe this was a problem
with the PATH variable.  But I don't think so.  I compared root's
PATH to that of tsakai, made them identical, and retried.
I got the same behavior.

If you could enlighten me as to why this is happening, I would really
appreciate it.

Thank you.

Tena


On 2/10/11 4:12 PM, "Tena Sakai"  wrote:

> Hi Jeff,
>
> Thanks for the firewall tip.  I tried it while allowing all TCP traffic
> and got an interesting and perplexing result.  Here's what's interesting
> (BTW, I got rid of 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-10 Thread Tena Sakai
Hi Jeff,

Thanks for the firewall tip.  I tried it while allowing all TCP traffic
and got an interesting and perplexing result.  Here's what's interesting
(BTW, I got rid of "LogLevel DEBUG3" from ~/.ssh/config on this run):

   [tsakai@ip-10-203-21-132 ~]$
   [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2
   Host key verification failed.

--
   A daemon (pid 2743) died unexpectedly with status 255 while attempting
   to launch so we are aborting.

   There may be more information reported by the environment (see above).

   This may be because the daemon was unable to find all the needed shared
   libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
   location of the shared libraries on the remote nodes and this will
   automatically be forwarded to the remote nodes.

--

--
   mpirun noticed that the job aborted, but has no info as to the process
   that caused that situation.

--
   mpirun: clean termination accomplished

   [tsakai@ip-10-203-21-132 ~]$
   [tsakai@ip-10-203-21-132 ~]$ env | grep LD_LIB
   [tsakai@ip-10-203-21-132 ~]$
   [tsakai@ip-10-203-21-132 ~]$ # Let's set LD_LIBRARY_PATH to /usr/local/lib
   [tsakai@ip-10-203-21-132 ~]$
   [tsakai@ip-10-203-21-132 ~]$
   [tsakai@ip-10-203-21-132 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
   [tsakai@ip-10-203-21-132 ~]$
   [tsakai@ip-10-203-21-132 ~]$ # I better to this on machine B as well
   [tsakai@ip-10-203-21-132 ~]$
   [tsakai@ip-10-203-21-132 ~]$ ssh -i tsakai ip-10-195-171-159
   Warning: Identity file tsakai not accessible: No such file or directory.
   Last login: Thu Feb 10 18:31:20 2011 from 10.203.21.132
   [tsakai@ip-10-195-171-159 ~]$
   [tsakai@ip-10-195-171-159 ~]$ export LD_LIBRARY_PATH='/usr/local/lib'
   [tsakai@ip-10-195-171-159 ~]$
   [tsakai@ip-10-195-171-159 ~]$ env | grep LD_LIB
   LD_LIBRARY_PATH=/usr/local/lib
   [tsakai@ip-10-195-171-159 ~]$
   [tsakai@ip-10-195-171-159 ~]$ # OK, now go bak to machine A
   [tsakai@ip-10-195-171-159 ~]$ exit
   logout
   Connection to ip-10-195-171-159 closed.
   [tsakai@ip-10-203-21-132 ~]$
   [tsakai@ip-10-203-21-132 ~]$ hostname
   ip-10-203-21-132
   [tsakai@ip-10-203-21-132 ~]$ # try mpirun again
   [tsakai@ip-10-203-21-132 ~]$
   [tsakai@ip-10-203-21-132 ~]$ mpirun --app app.ac2
   Host key verification failed.

--
   A daemon (pid 2789) died unexpectedly with status 255 while attempting
   to launch so we are aborting.

   There may be more information reported by the environment (see above).

   This may be because the daemon was unable to find all the needed shared
   libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
   location of the shared libraries on the remote nodes and this will
   automatically be forwarded to the remote nodes.

--

--
   mpirun noticed that the job aborted, but has no info as to the process
   that caused that situation.

--
   mpirun: clean termination accomplished

   [tsakai@ip-10-203-21-132 ~]$
   [tsakai@ip-10-203-21-132 ~]$ # I thought openmpi library was in /usr/local/lib...
   [tsakai@ip-10-203-21-132 ~]$
   [tsakai@ip-10-203-21-132 ~]$ ll -t /usr/local/lib | less
   total 16604
   lrwxrwxrwx 1 root root  16 Feb  8 23:06 libfuse.so -> libfuse.so.2.8.5
   lrwxrwxrwx 1 root root  16 Feb  8 23:06 libfuse.so.2 -> libfuse.so.2.8.5
   lrwxrwxrwx 1 root root  25 Feb  8 23:06 libmca_common_sm.so -> libmca_common_sm.so.1.0.0
   lrwxrwxrwx 1 root root  25 Feb  8 23:06 libmca_common_sm.so.1 -> libmca_common_sm.so.1.0.0
   lrwxrwxrwx 1 root root  15 Feb  8 23:06 libmpi.so -> libmpi.so.0.0.2
   lrwxrwxrwx 1 root root  15 Feb  8 23:06 libmpi.so.0 -> libmpi.so.0.0.2
   lrwxrwxrwx 1 root root  19 Feb  8 23:06 libmpi_cxx.so -> libmpi_cxx.so.0.0.1
   lrwxrwxrwx 1 root root  19 Feb  8 23:06 libmpi_cxx.so.0 -> libmpi_cxx.so.0.0.1
   lrwxrwxrwx 1 root root  19 Feb  8 23:06 libmpi_f77.so -> libmpi_f77.so.0.0.1
   lrwxrwxrwx 1 root root  19 Feb  8 23:06 libmpi_f77.so.0 -> libmpi_f77.so.0.0.1
   lrwxrwxrwx 1 root root  19 Feb  8 23:06 libmpi_f90.so -> libmpi_f90.so.0.0.1
   lrwxrwxrwx 1 root root  19 Feb  8 23:06 libmpi_f90.so.0 -> libmpi_f90.so.0.0.1
   lrwxrwxrwx 1 root root  20 Feb  8 23:06 libopen-pal.so -> libopen-pal.so.0.0.0
   lrwxrwxrwx 1 root root  20 Feb  8 23:06 libopen-pal.so.0 -> libopen-pal.so.0.0.0
   lrwxrwxrwx 1 root root  20 Feb  8 23:06 libopen-rte.so -> libopen-rte.so.0.0.0
   

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-10 Thread Reuti
Hi,

On 10.02.2011 at 22:03, Tena Sakai wrote:

> Hi Reuti,
> 
> Thanks for suggesting "LogLevel DEBUG3."  I did so and complete
> session is captured in the attached file.
> 
> What I did is very similar to what I have done before: verify
> that ssh works and then run the mpirun command.  In my somewhat lengthy
> session log, there are two responses from "LogLevel DEBUG3":  first
> from an scp invocation and then from the mpirun invocation.  They both
> say
>    debug1: Authentication succeeded (publickey).

Yes. I hoped to see the point where "Permission denied." is output, but now
that I reread your post, it could mean that this was solved already.

I agree with Jeff that right now it looks like a firewall issue.
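
A rough way to test for a firewall between two nodes is to try opening a TCP connection by hand. The sketch below assumes bash (for its /dev/tcp redirection) and the coreutils timeout command; port_open is a hypothetical helper name. Port 22 on the loopback address is only a stand-in: the port mpirun listens on is chosen at run time, so a real test would probe that host and port from the other node while something is listening there.

```shell
#!/bin/bash
# Sketch only: test whether a TCP connection to host:port can be opened.
# Uses bash's /dev/tcp redirection and coreutils' timeout; port_open is a
# hypothetical helper name.

port_open() {
    # $1 = host, $2 = port, $3 = timeout in seconds (default 3)
    timeout "${3:-3}" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example: is sshd on this machine reachable over the loopback interface?
if port_open 127.0.0.1 22; then
    echo "127.0.0.1:22 reachable"
else
    echo "127.0.0.1:22 blocked or closed"
fi
```

A failure here does not distinguish a firewall from a port with no listener, so it is best used against a port that is known to be served.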

-- Reuti


> From the mpirun invocation, I see a line:
>
>    debug1: Sending command:  orted --daemonize -mca ess env -mca
>    orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
>    2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
> The IP address at the end of the line is indeed that of machine B.
> After that it hung and I pressed Control-C to get out of it, which
> gave me more lines.  But the lines after
>    debug1: Sending command:  orted bla bla bla
> don't look good to me.  But, in truth, I have no idea what they
> mean.
> 
> If you could shed some light, I would appreciate it very much.
> 
> Regards,
> 
> Tena
> 
> 
> On 2/10/11 10:57 AM, "Reuti"  wrote:
> 
>> Hi,
>> 
>> On 10.02.2011 at 19:11, Tena Sakai wrote:
>> 
>>>> your local machine is Linux like, but the execution hosts
>>>> are Macs? I saw the /Users/tsakai/... in your output.
>>>
>>> No, my environment is entirely linux.  The path to my home
>>> directory on one host (blitzen) has been known as /Users/tsakai,
>>> even though it is an nfs mount from vixen (which is known to
>>> itself as /home/tsakai).  For historical reasons, I have
>>> chosen to give a symbolic link named /Users to vixen's /Home,
>>> so that I can use a consistent path for both vixen and blitzen.
>> 
>> okay. Sometimes the protection of the home directory must be adjusted too,
>> but as you can do it from the command line this shouldn't be an issue.
>> 
>> 
>>>> Is this a private cluster (or at least private interfaces)?
>>>> It would also be an option to use hostbased authentication,
>>>> which will avoid setting any known_hosts file or passphraseless
>>>> ssh-keys for each user.
>>>
>>> No, it is not a private cluster.  It is Amazon EC2.  When I
>>> ssh from my local machine (vixen) I use its public interface,
>>> but to address from one amazon cluster node to the other I
>>> use nodes' private dns names: domU-12-31-39-07-35-21 and
>>> domU-12-31-39-06-74-E2.  Both public and private dns names
>>> change from one launch to another.  I am using passphraseless
>>> ssh-keys for authentication in all cases, i.e., from vixen to
>>> Amazon node A, from Amazon node A to Amazon node B, and from
>>> Amazon node B back to A.  (Please see my initial post.  There
>>> is a session dialogue for this.)  They all work without an
>>> authentication dialogue, except a brief initial dialogue:
>>>    The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>>>    can't be established.
>>>    RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>    Are you sure you want to continue connecting (yes/no)?
>>> to which I say "yes."
>>> But I am unclear about what you mean by "hostbased authentication".
>>> Doesn't that mean with a password?  If so, it is not an option.
>> 
>> No. It's convenient inside a private cluster as it won't fill each users'
>> known_hosts file and you don't need to create any ssh-keys. But when the
>> hostname changes every time it might also create new hostkeys. It uses
>> hostkeys (private and public), this way it works for all users. Just for
>> reference:
>> 
>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>> 
>> You could look into it later.
>> 
>> ==
>> 
>> - Can you try to use a command when connecting from A to B? E.g. ssh
>> `domU-12-31-39-06-74-E2 ls`. Is this working too?
>> 
>> - What about putting:
>> 
>> LogLevel DEBUG3
>> 
>> In your ~/.ssh/config. Maybe we can see what he's trying to negotiate before
>> it fails in verbose mode.
>> 
>> 
>> -- Reuti
>> 
>> 
>> 
>>> Regards,
>>> 
>>> Tena
>>> 
>>> 
>>> On 2/10/11 2:27 AM, "Reuti"  wrote:
>>> 
>>>> Hi,
>>>> 
>>>> your local machine is Linux like, but the execution hosts are Macs? I saw
>>>> the
>>>> /Users/tsakai/... in your output.
>>>> 
>>>> a) executing a command on them is also working, e.g.: ssh
>>>> domU-12-31-39-07-35-21 ls
>>>> 
>>>> On 10.02.2011 at 07:08, Tena Sakai wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I have made a bit of progress(?)...
>>>>> I made a config file in my .ssh directory on the cloud.  It looks like:
>>>>>   # machine A
>>>>>   Host domU-12-31-39-07-35-21.compute-1.internal
>>>> 
>>>> This is just an abbreviation or nickname above. To use the specified
>>>> settings,
 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-10 Thread Jeff Squyres
Your prior mails were about ssh issues, but this one sounds like you might have 
firewall issues.

That is, the "orted" command attempts to open a TCP socket back to mpirun for 
various command and control reasons.  If it is blocked from doing so by a 
firewall, Open MPI won't run.  In general, you can either disable your firewall 
or you can set up a trust relationship for TCP connections within your cluster.
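
One way to test the firewall theory is to try a plain TCP connection from the remote node to the address that appears in the --hnp-uri argument of the orted command line. Below is a minimal Python sketch of that check. The helper names `parse_hnp_uri` and `can_connect` are made up for illustration, and the parser assumes the URI carries a single tcp:// endpoint, as in the log line quoted in this thread.

```python
import socket
from typing import Tuple


def parse_hnp_uri(uri: str) -> Tuple[str, int]:
    """Extract (host, port) from a string such as
    '3344891904.0;tcp://10.194.95.239:54256'.

    Assumption: exactly one tcp:// endpoint is present.
    """
    _, _, endpoint = uri.partition("tcp://")
    host, _, port = endpoint.rpartition(":")
    return host, int(port)


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For example, `can_connect(*parse_hnp_uri('3344891904.0;tcp://10.194.95.239:54256'))` run on the other node while mpirun is still waiting would show whether that port is reachable. The port is chosen per run, so the value has to be taken from the current debug output.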



On Feb 10, 2011, at 1:03 PM, Tena Sakai wrote:

> Hi Reuti,
> 
> Thanks for suggesting "LogLevel DEBUG3."  I did so, and the complete
> session is captured in the attached file.
> 
> What I did is much the same as what I have done before: verify
> that ssh works and then run the mpirun command.  In my somewhat lengthy
> session log, there are two responses from "LogLevel DEBUG3":  first
> from an scp invocation and then from the mpirun invocation.  They both
> say
>     debug1: Authentication succeeded (publickey).
> 
> From the mpirun invocation, I see a line:
> 
>     debug1: Sending command:  orted --daemonize -mca ess env -mca
>     orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
>     2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
> The IP address at the end of the line is indeed that of machine B.
> After that it hung and I Control-C'ed out of it, which
> gave me more lines.  But the lines after
>     debug1: Sending command:  orted bla bla bla
> don't look good to me.  In truth, I have no idea what they
> mean.
> 
> If you could shed some light, I would appreciate it very much.
> 
> Regards,
> 
> Tena
> 
> 
> On 2/10/11 10:57 AM, "Reuti"  wrote:
> 
>> Hi,
>> 
>> On 10.02.2011 at 19:11, Tena Sakai wrote:
>> 
>>>> your local machine is Linux like, but the execution hosts
>>>> are Macs? I saw the /Users/tsakai/... in your output.
>>> 
>>> No, my environment is entirely linux.  The path to my home
>>> directory on one host (blitzen) has been known as /Users/tsakai,
>>> despite it is an nfs mount from vixen (which is known to
>>> itself as /home/tsakai).  For historical reasons, I have
>>> chosen to give a symbolic link named /Users to vixen's /Home,
>>> so that I can use consistent path for both vixen and blitzen.
>> 
>> okay. Sometimes the protection of the home directory must be adjusted too, 
>> but
>> as you can do it from the command line this shouldn't be an issue.
>> 
>> 
>>>> Is this a private cluster (or at least private interfaces)?
>>>> It would also be an option to use hostbased authentication,
>>>> which will avoid setting any known_hosts file or passphraseless
>>>> ssh-keys for each user.
>>> 
>>> No, it is not a private cluster.  It is Amazon EC2.  When I
>>> Ssh from my local machine (vixen) I use its public interface,
>>> but to address from one amazon cluster node to the other I
>>> use nodes' private dns names: domU-12-31-39-07-35-21 and
>>> domU-12-31-39-06-74-E2.  Both public and private dns names
>>> change from a launch to another.  I am using passphrasesless
>>> ssh-keys for authentication in all cases, i.e., from vixen to
>>> Amazon node A, from amazon node A to amazon node B, and from
>>> Amazon node B back to A.  (Please see my initail post.  There
>>> is a session dialogue for this.)  They all work without authen-
>>> tication dialogue, except a brief initial dialogue:
>>>   The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>>>   can't be established.
>>>RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>>>Are you sure you want to continue connecting (yes/no)?
>>> to which I say "yes."
>>> But I am unclear with what you mean by "hostbased authentication"?
>>> Doesn't that mean with password?  If so, it is not an option.
>> 
>> No. It's convenient inside a private cluster as it won't fill each users'
>> known_hosts file and you don't need to create any ssh-keys. But when the
>> hostname changes every time it might also create new hostkeys. It uses
>> hostkeys (private and public), this way it works for all users. Just for
>> reference:
>> 
>> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
>> 
>> You could look into it later.
>> 
>> ==
>> 
>> - Can you try to use a command when connecting from A to B? E.g. ssh
>> `domU-12-31-39-06-74-E2 ls`. Is this working too?
>> 
>> - What about putting:
>> 
>> LogLevel DEBUG3
>> 
>> In your ~/.ssh/config. Maybe we can see what he's trying to negotiate before
>> it fails in verbose mode.
>> 
>> 
>> -- Reuti
>> 
>> 
>> 
>>> Regards,
>>> 
>>> Tena
>>> 
>>> 
>>> On 2/10/11 2:27 AM, "Reuti"  wrote:
>>> 
>>>> Hi,
>>>> 
>>>> your local machine is Linux like, but the execution hosts are Macs? I saw
>>>> the
>>>> /Users/tsakai/... in your output.
>>>> 
>>>> a) executing a command on them is also working, e.g.: ssh
>>>> domU-12-31-39-07-35-21 ls
>>>> 
>>>> On 10.02.2011 at 07:08, Tena Sakai wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I have made a bit of progress(?)...
>>>>> I made a config file in my .ssh directory on the cloud. 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-10 Thread Tena Sakai
Hi Reuti,

Thanks for suggesting "LogLevel DEBUG3."  I did so, and the complete
session is captured in the attached file.

What I did is much the same as what I have done before: verify
that ssh works and then run the mpirun command.  In my somewhat lengthy
session log, there are two responses from "LogLevel DEBUG3":  first
from an scp invocation and then from the mpirun invocation.  They both
say
    debug1: Authentication succeeded (publickey).

From the mpirun invocation, I see a line:

    debug1: Sending command:  orted --daemonize -mca ess env -mca
    orte_ess_jobid 3344891904 -mca orte_ess_vpid 1 -mca orte_ess_num_procs
    2 --hnp-uri "3344891904.0;tcp://10.194.95.239:54256"
The IP address at the end of the line is indeed that of machine B.
After that it hung and I Control-C'ed out of it, which
gave me more lines.  But the lines after
    debug1: Sending command:  orted bla bla bla
don't look good to me.  In truth, I have no idea what they
mean.

If you could shed some light, I would appreciate it very much.

Regards,

Tena


On 2/10/11 10:57 AM, "Reuti"  wrote:

> Hi,
> 
> On 10.02.2011 at 19:11, Tena Sakai wrote:
> 
>>> your local machine is Linux like, but the execution hosts
>>> are Macs? I saw the /Users/tsakai/... in your output.
>> 
>> No, my environment is entirely linux.  The path to my home
>> directory on one host (blitzen) has been known as /Users/tsakai,
>> despite it is an nfs mount from vixen (which is known to
>> itself as /home/tsakai).  For historical reasons, I have
>> chosen to give a symbolic link named /Users to vixen's /Home,
>> so that I can use consistent path for both vixen and blitzen.
> 
> okay. Sometimes the protection of the home directory must be adjusted too, but
> as you can do it from the command line this shouldn't be an issue.
> 
> 
>>> Is this a private cluster (or at least private interfaces)?
>>> It would also be an option to use hostbased authentication,
>>> which will avoid setting any known_hosts file or passphraseless
>>> ssh-keys for each user.
>> 
>> No, it is not a private cluster.  It is Amazon EC2.  When I
>> Ssh from my local machine (vixen) I use its public interface,
>> but to address from one amazon cluster node to the other I
>> use nodes' private dns names: domU-12-31-39-07-35-21 and
>> domU-12-31-39-06-74-E2.  Both public and private dns names
>> change from a launch to another.  I am using passphrasesless
>> ssh-keys for authentication in all cases, i.e., from vixen to
>> Amazon node A, from amazon node A to amazon node B, and from
>> Amazon node B back to A.  (Please see my initail post.  There
>> is a session dialogue for this.)  They all work without authen-
>> tication dialogue, except a brief initial dialogue:
>>The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>>can't be established.
>> RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
>> Are you sure you want to continue connecting (yes/no)?
>> to which I say "yes."
>> But I am unclear with what you mean by "hostbased authentication"?
>> Doesn't that mean with password?  If so, it is not an option.
> 
> No. It's convenient inside a private cluster as it won't fill each users'
> known_hosts file and you don't need to create any ssh-keys. But when the
> hostname changes every time it might also create new hostkeys. It uses
> hostkeys (private and public), this way it works for all users. Just for
> reference:
> 
> http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html
> 
> You could look into it later.
> 
> ==
> 
> - Can you try to use a command when connecting from A to B? E.g. ssh
> `domU-12-31-39-06-74-E2 ls`. Is this working too?
> 
> - What about putting:
> 
> LogLevel DEBUG3
> 
> In your ~/.ssh/config. Maybe we can see what he's trying to negotiate before
> it fails in verbose mode.
> 
> 
> -- Reuti
> 
> 
> 
>> Regards,
>> 
>> Tena
>> 
>> 
>> On 2/10/11 2:27 AM, "Reuti"  wrote:
>> 
>>> Hi,
>>> 
>>> your local machine is Linux like, but the execution hosts are Macs? I saw
>>> the
>>> /Users/tsakai/... in your output.
>>> 
>>> a) executing a command on them is also working, e.g.: ssh
>>> domU-12-31-39-07-35-21 ls
>>> 
>>> On 10.02.2011 at 07:08, Tena Sakai wrote:
>>> 
>>>> Hi,
>>>> 
>>>> I have made a bit of progress(?)...
>>>> I made a config file in my .ssh directory on the cloud.  It looks like:
>>>>   # machine A
>>>>   Host domU-12-31-39-07-35-21.compute-1.internal
>>> 
>>> This is just an abbreviation or nickname above. To use the specified
>>> settings,
>>> it's necessary to specify exactly this name. When the settings are the same
>>> anyway for all machines, you can use:
>>> 
>>> Host *
>>>IdentityFile /home/tsakai/.ssh/tsakai
>>>IdentitiesOnly yes
>>>BatchMode yes
>>> 
>>> instead.
>>> 
>>> Is this a private cluster (or at least private interfaces)? It would also be
>>> an option to use hostbased authentication, which will avoid setting any
>>> known_hosts file or 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-10 Thread Reuti
Hi,

On 10.02.2011 at 19:11, Tena Sakai wrote:

>> your local machine is Linux like, but the execution hosts
>> are Macs? I saw the /Users/tsakai/... in your output.
> 
> No, my environment is entirely linux.  The path to my home
> directory on one host (blitzen) has been known as /Users/tsakai,
> despite it is an nfs mount from vixen (which is known to
> itself as /home/tsakai).  For historical reasons, I have
> chosen to give a symbolic link named /Users to vixen's /Home,
> so that I can use consistent path for both vixen and blitzen.

Okay. Sometimes the permissions on the home directory must be adjusted too, but 
as you can log in from the command line this shouldn't be an issue.
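
For reference, the permission issue mentioned here usually comes from sshd's StrictModes setting, which rejects key logins when the home directory, ~/.ssh or authorized_keys is writable by group or others. A minimal Python sketch of such a check follows. The helper names are hypothetical, and it only looks at the write bits, not at file ownership.

```python
import os
import stat


def loose_permissions(path: str) -> bool:
    """Return True if path is writable by group or by others."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IWGRP | stat.S_IWOTH))


def check_ssh_paths(home: str) -> dict:
    """Map each existing path that sshd typically inspects to True
    when its mode looks too permissive."""
    candidates = [
        home,
        os.path.join(home, ".ssh"),
        os.path.join(home, ".ssh", "authorized_keys"),
    ]
    return {p: loose_permissions(p) for p in candidates if os.path.exists(p)}
```

Calling `check_ssh_paths(os.path.expanduser('~'))` on each node lists the paths found and flags any that would deserve a chmod.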


>> Is this a private cluster (or at least private interfaces)?
>> It would also be an option to use hostbased authentication,
>> which will avoid setting any known_hosts file or passphraseless
>> ssh-keys for each user.
> 
> No, it is not a private cluster.  It is Amazon EC2.  When I
> Ssh from my local machine (vixen) I use its public interface,
> but to address from one amazon cluster node to the other I
> use nodes' private dns names: domU-12-31-39-07-35-21 and
> domU-12-31-39-06-74-E2.  Both public and private dns names
> change from a launch to another.  I am using passphrasesless
> ssh-keys for authentication in all cases, i.e., from vixen to
> Amazon node A, from amazon node A to amazon node B, and from
> Amazon node B back to A.  (Please see my initail post.  There
> is a session dialogue for this.)  They all work without authen-
> tication dialogue, except a brief initial dialogue:
>The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
>can't be established.
> RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
> Are you sure you want to continue connecting (yes/no)?
> to which I say "yes."
> But I am unclear with what you mean by "hostbased authentication"?
> Doesn't that mean with password?  If so, it is not an option.

No. It's convenient inside a private cluster as it won't fill each user's 
known_hosts file and you don't need to create any ssh-keys. But when the 
hostname changes every time it might also create new hostkeys. It uses hostkeys 
(private and public), so it works for all users. Just for reference:

http://arc.liv.ac.uk/SGE/howto/hostbased-ssh.html

You could look into it later.

==

- Can you try to use a command when connecting from A to B? E.g. 
`ssh domU-12-31-39-06-74-E2 ls`. Is this working too?

- What about putting:

LogLevel DEBUG3

in your ~/.ssh/config? Maybe in verbose mode we can see what it's trying to 
negotiate before it fails.
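
The two checks above can also be combined in one non-interactive test. Below is a minimal Python sketch, assuming an OpenSSH client on the PATH. `build_ssh_command` and `run_remote` are made-up helper names. BatchMode=yes makes ssh fail instead of prompting, and the LogLevel option is passed with -o rather than through ~/.ssh/config.

```python
import subprocess
from typing import List


def build_ssh_command(host: str, remote_cmd: str, identity: str = "") -> List[str]:
    """Assemble an ssh command line that never prompts and logs verbosely."""
    cmd = ["ssh", "-o", "BatchMode=yes", "-o", "LogLevel=DEBUG3"]
    if identity:
        cmd += ["-i", identity]
    return cmd + [host, remote_cmd]


def run_remote(host: str, remote_cmd: str, identity: str = "",
               timeout: float = 30.0) -> int:
    """Run remote_cmd on host, print ssh's debug output, return the exit status."""
    result = subprocess.run(
        build_ssh_command(host, remote_cmd, identity),
        capture_output=True, text=True, timeout=timeout,
    )
    print(result.stderr)  # the debug1/debug2/debug3 lines arrive on stderr
    return result.returncode
```

For instance, `run_remote('domU-12-31-39-06-74-E2', 'ls', identity='/home/tsakai/.ssh/tsakai')` returns 0 only if the login and the remote command both succeed without any prompt.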


-- Reuti



> Regards,
> 
> Tena
> 
> 
> On 2/10/11 2:27 AM, "Reuti"  wrote:
> 
>> Hi,
>> 
>> your local machine is Linux like, but the execution hosts are Macs? I saw the
>> /Users/tsakai/... in your output.
>> 
>> a) executing a command on them is also working, e.g.: ssh
>> domU-12-31-39-07-35-21 ls
>> 
>> On 10.02.2011 at 07:08, Tena Sakai wrote:
>> 
>>> Hi,
>>> 
>>> I have made a bit of progress(?)...
>>> I made a config file in my .ssh directory on the cloud.  It looks like:
>>># machine A
>>>Host domU-12-31-39-07-35-21.compute-1.internal
>> 
>> This is just an abbreviation or nickname above. To use the specified 
>> settings,
>> it's necessary to specify exactly this name. When the settings are the same
>> anyway for all machines, you can use:
>> 
>> Host *
>>IdentityFile /home/tsakai/.ssh/tsakai
>>IdentitiesOnly yes
>>BatchMode yes
>> 
>> instead.
>> 
>> Is this a private cluster (or at least private interfaces)? It would also be
>> an option to use hostbased authentication, which will avoid setting any
>> known_hosts file or passphraseless ssh-keys for each user.
>> 
>> -- Reuti
>> 
>> 
>>>HostName domU-12-31-39-07-35-21
>>>BatchMode yes
>>>IdentityFile /home/tsakai/.ssh/tsakai
>>>ChallengeResponseAuthentication no
>>>IdentitiesOnly yes
>>> 
>>># machine B
>>>Host domU-12-31-39-06-74-E2.compute-1.internal
>>>HostName domU-12-31-39-06-74-E2
>>>BatchMode yes
>>>IdentityFile /home/tsakai/.ssh/tsakai
>>>ChallengeResponseAuthentication no
>>>IdentitiesOnly yes
>>> 
>>> This file exists on both machine A and machine B.
>>> 
>>> Now When I issue mpirun command as below:
>>>[tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>>> 
>>> It hungs.  I control-C out of it and I get:
>>>mpirun: killing job...
>>> 
>>> 
>>> --
>>>mpirun noticed that the job aborted, but has no info as to the process
>>>that caused that situation.
>>> 
>>> --
>>> 
>>> --
>>>mpirun was unable to cleanly terminate the daemons on the nodes shown
>>>below. Additional manual cleanup may be required - 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-10 Thread Tena Sakai
Hi David,

Thank you for your reply.

> just to be sure:
> having the ssh public keys in other computer's authorized_key file.
> ssh keys generated without passphrases

Yes, as evidenced by my session dialogue, invoking ssh manually is
not a problem.  I cannot use the mpirun command (which I believe
uses ssh as its infrastructure) in the same setting, i.e., with a private
key and a public key, the latter in the destination's authorized_keys
file.

Regards,

Tena


On 2/9/11 10:58 PM, "David Zhang"  wrote:

I don't really know what the problem is.  It seems like you're doing things 
correctly.  I'm almost sure you've done all of the following, but just to be 
sure:
having the ssh public keys in other computer's authorized_key file.
ssh keys generated without passphrases

On Wed, Feb 9, 2011 at 10:08 PM, Tena Sakai  wrote:
Hi,

I have made a bit of progress(?)...
I made a config file in my .ssh directory on the cloud.  It looks like:
# machine A
Host domU-12-31-39-07-35-21.compute-1.internal
HostName domU-12-31-39-07-35-21
BatchMode yes
IdentityFile /home/tsakai/.ssh/tsakai
ChallengeResponseAuthentication no
IdentitiesOnly yes

# machine B
Host domU-12-31-39-06-74-E2.compute-1.internal
HostName domU-12-31-39-06-74-E2
BatchMode yes
IdentityFile /home/tsakai/.ssh/tsakai
ChallengeResponseAuthentication no
IdentitiesOnly yes

This file exists on both machine A and machine B.

Now When I issue mpirun command as below:
[tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2

It hungs.  I control-C out of it and I get:

mpirun: killing job...

--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
domU-12-31-39-07-35-21.compute-1.internal - daemon did not report back 
when launched

Am I making progress?

Does this mean I am past authentication and something else is the problem?
Does someone have an example .ssh/config file I can look at?  There are so
many keyword-argument paris for this config file and I would like to look at
some very basic one that works.


Thank you.

Tena Sakai
tsa...@gallo.ucsf.edu 

On 2/9/11 7:52 PM, "Tena Sakai"  wrote:

Hi

I have an app.ac1 file like below:
[tsakai@vixen local]$ cat app.ac1
-H vixen.egcrc.org -np 1 Rscript 
/Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
-H vixen.egcrc.org -np 1 Rscript 
/Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
-H blitzen.egcrc.org   -np 1 Rscript 
/Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
-H blitzen.egcrc.org   -np 1 Rscript 
/Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8

The program I run is
Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
Where x is [5..8].  The machines vixen and blitzen each run 2 runs.

Here’s the program fib.R:
[ tsakai@vixen local]$ cat fib.R
# fib() computes, given index n, fibonacci number iteratively
# here's the first dozen sequence (indexed from 0..11)
# 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89

fib <- function( n ) {
    a <- 0
    b <- 1
    for ( i in 1:n ) {
        t <- b
        b <- a
        a <- a + t
    }
    a
}

arg <- commandArgs( TRUE )
myHost <- system( 'hostname', intern=TRUE )
cat( fib(arg), myHost, '\n' )

It reads an argument from command line and produces a fibonacci number that
corresponds to that index, followed by the machine name.  Pretty simple stuff.

Here’s the run output:
[tsakai@vixen local]$ mpirun -app app.ac1
5 vixen.egcrc.org 
8 vixen.egcrc.org 
13 blitzen.egcrc.org 
21 blitzen.egcrc.org 

Which is exactly what I expect.  So far so good.

Now I want to run the same thing on cloud.  I launch 2 instances of the same
virtual machine, to which I get to by:
[tsakai@vixen local]$ ssh -A -i ~/.ssh/tsakai machine-instance-A-public-dns

Now I am on machine A:
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B without 
password authentication,
[tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-10 Thread Tena Sakai
Hi Reuti,

> your local machine is Linux like, but the execution hosts
> are Macs? I saw the /Users/tsakai/... in your output.

No, my environment is entirely Linux.  The path to my home
directory on one host (blitzen) is known as /Users/tsakai,
even though it is an NFS mount from vixen (which knows it
as /home/tsakai).  For historical reasons, I have
chosen to create a symbolic link named /Users to vixen's /home,
so that I can use a consistent path on both vixen and blitzen.

> Is this a private cluster (or at least private interfaces)?
> It would also be an option to use hostbased authentication,
> which will avoid setting any known_hosts file or passphraseless
> ssh-keys for each user.

No, it is not a private cluster.  It is Amazon EC2.  When I
ssh from my local machine (vixen) I use its public interface,
but to address one Amazon cluster node from the other I
use the nodes' private DNS names: domU-12-31-39-07-35-21 and
domU-12-31-39-06-74-E2.  Both public and private DNS names
change from one launch to the next.  I am using passphraseless
ssh keys for authentication in all cases, i.e., from vixen to
Amazon node A, from Amazon node A to Amazon node B, and from
Amazon node B back to A.  (Please see my initial post.  There
is a session dialogue for this.)  They all work without an
authentication dialogue, except a brief initial dialogue:
    The authenticity of host 'domu-xx-xx-xx-xx-xx-x (10.xx.xx.xx)'
    can't be established.
    RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
    Are you sure you want to continue connecting (yes/no)?
to which I say "yes."
But I am unclear on what you mean by "hostbased authentication".
Doesn't that mean with a password?  If so, it is not an option.

Regards,

Tena


On 2/10/11 2:27 AM, "Reuti"  wrote:

> Hi,
> 
> your local machine is Linux like, but the execution hosts are Macs? I saw the
> /Users/tsakai/... in your output.
> 
> a) executing a command on them is also working, e.g.: ssh
> domU-12-31-39-07-35-21 ls
> 
> On 10.02.2011 at 07:08, Tena Sakai wrote:
> 
>> Hi,
>> 
>> I have made a bit of progress(?)...
>> I made a config file in my .ssh directory on the cloud.  It looks like:
>> # machine A
>> Host domU-12-31-39-07-35-21.compute-1.internal
> 
> This is just an abbreviation or nickname above. To use the specified settings,
> it's necessary to specify exactly this name. When the settings are the same
> anyway for all machines, you can use:
> 
> Host *
> IdentityFile /home/tsakai/.ssh/tsakai
> IdentitiesOnly yes
> BatchMode yes
> 
> instead.
> 
> Is this a private cluster (or at least private interfaces)? It would also be
> an option to use hostbased authentication, which will avoid setting any
> known_hosts file or passphraseless ssh-keys for each user.
> 
> -- Reuti
> 
> 
>> HostName domU-12-31-39-07-35-21
>> BatchMode yes
>> IdentityFile /home/tsakai/.ssh/tsakai
>> ChallengeResponseAuthentication no
>> IdentitiesOnly yes
>> 
>> # machine B
>> Host domU-12-31-39-06-74-E2.compute-1.internal
>> HostName domU-12-31-39-06-74-E2
>> BatchMode yes
>> IdentityFile /home/tsakai/.ssh/tsakai
>> ChallengeResponseAuthentication no
>> IdentitiesOnly yes
>> 
>> This file exists on both machine A and machine B.
>> 
>> Now When I issue mpirun command as below:
>> [tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>> 
>> It hungs.  I control-C out of it and I get:
>> mpirun: killing job...
>> 
>> 
>> --
>> mpirun noticed that the job aborted, but has no info as to the process
>> that caused that situation.
>> 
>> --
>> 
>> --
>> mpirun was unable to cleanly terminate the daemons on the nodes shown
>> below. Additional manual cleanup may be required - please refer to
>> the "orte-clean" tool for assistance.
>> 
>> --
>> domU-12-31-39-07-35-21.compute-1.internal - daemon did not report
>> back when launched
>> 
>> Am I making progress?
>> 
>> Does this mean I am past authentication and something else is the problem?
>> Does someone have an example .ssh/config file I can look at?  There are so
>> many keyword-argument paris for this config file and I would like to look at
>> some very basic one that works.
>> 
>> Thank you.
>> 
>> Tena Sakai
>> tsa...@gallo.ucsf.edu
>> 
>> On 2/9/11 7:52 PM, "Tena Sakai"  wrote:
>> 
>>> Hi
>>> 
>>> I have an app.ac1 file like below:
>>> [tsakai@vixen local]$ cat app.ac1
>>> -H vixen.egcrc.org   -np 1 Rscript
>>> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
>>> -H vixen.egcrc.org   -np 1 Rscript
>>> 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-10 Thread David Zhang
I don't really know what the problem is.  It seems like you're doing things
correctly.  I'm almost sure you've done all of the following, but just to be
sure:
- having the ssh public keys in the other computer's authorized_keys file
- ssh keys generated without passphrases

On Wed, Feb 9, 2011 at 10:08 PM, Tena Sakai  wrote:

>  Hi,
>
> I have made a bit of progress(?)...
> I made a config file in my .ssh directory on the cloud.  It looks like:
> # machine A
> Host domU-12-31-39-07-35-21.compute-1.internal
> HostName domU-12-31-39-07-35-21
> BatchMode yes
> IdentityFile /home/tsakai/.ssh/tsakai
> ChallengeResponseAuthentication no
> IdentitiesOnly yes
>
> # machine B
> Host domU-12-31-39-06-74-E2.compute-1.internal
> HostName domU-12-31-39-06-74-E2
> BatchMode yes
> IdentityFile /home/tsakai/.ssh/tsakai
> ChallengeResponseAuthentication no
> IdentitiesOnly yes
>
> This file exists on both machine A and machine B.
>
> Now When I issue mpirun command as below:
> [tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2
>
> It hungs.  I control-C out of it and I get:
>
> mpirun: killing job...
>
>
> --
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
>
> --
>
> --
> mpirun was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
>
> --
> domU-12-31-39-07-35-21.compute-1.internal - daemon did not report
> back when launched
>
> Am I making progress?
>
> Does this mean I am past authentication and something else is the problem?
> Does someone have an example .ssh/config file I can look at?  There are so
> many keyword-argument paris for this config file and I would like to look
> at
> some very basic one that works.
>
>
> Thank you.
>
> Tena Sakai
> tsa...@gallo.ucsf.edu
>
> On 2/9/11 7:52 PM, "Tena Sakai"  wrote:
>
> Hi
>
> I have an app.ac1 file like below:
> [tsakai@vixen local]$ cat app.ac1
> -H vixen.egcrc.org   -np 1 Rscript
> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
> -H vixen.egcrc.org   -np 1 Rscript
> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
> -H blitzen.egcrc.org -np 1 Rscript
> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
> -H blitzen.egcrc.org -np 1 Rscript
> /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8
>
> The program I run is
> Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
> Where x is [5..8].  The machines vixen and blitzen each run 2 runs.
>
> Here’s the program fib.R:
> [ tsakai@vixen local]$ cat fib.R
> # fib() computes, given index n, fibonacci number iteratively
> # here's the first dozen sequence (indexed from 0..11)
> # 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
>
> fib <- function( n ) {
>     a <- 0
>     b <- 1
>     for ( i in 1:n ) {
>         t <- b
>         b <- a
>         a <- a + t
>     }
>     a
> }
>
> arg <- commandArgs( TRUE )
> myHost <- system( 'hostname', intern=TRUE )
> cat( fib(arg), myHost, '\n' )
>
> It reads an argument from command line and produces a fibonacci number that
> corresponds to that index, followed by the machine name.  Pretty simple
> stuff.
>
> Here’s the run output:
> [tsakai@vixen local]$ mpirun -app app.ac1
> 5 vixen.egcrc.org
> 8 vixen.egcrc.org
> 13 blitzen.egcrc.org
> 21 blitzen.egcrc.org
>
> Which is exactly what I expect.  So far so good.
>
> Now I want to run the same thing on cloud.  I launch 2 instances of the
> same
> virtual machine, to which I get to by:
> [tsakai@vixen local]$ ssh -A -i ~/.ssh/tsakai machine-instance-A-public-dns
>
> Now I am on machine A:
> [tsakai@domU-12-31-39-00-D1-F2 ~]$
>
> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B without
> password authentication,
> [tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
> [tsakai@domU-12-31-39-00-D1-F2 ~]$
> [tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
> domU-12-31-39-00-D1-F2
> [tsakai@domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai
> domU-12-31-39-0C-C8-01
> Last login: Wed Feb  9 20:51:48 2011 from 10.254.214.4
> [tsakai@domU-12-31-39-0C-C8-01 ~]$
> [tsakai@domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
> [tsakai@domU-12-31-39-0C-C8-01 ~]$ hostname
> domU-12-31-39-0C-C8-01
> [tsakai@domU-12-31-39-0C-C8-01 ~]$
> [tsakai@domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine A
> without using password
> 

Re: [OMPI users] How does authentication between nodes work without password? (Newbie alert on)

2011-02-10 Thread Tena Sakai
Hi,

I have made a bit of progress(?)...
I made a config file in my .ssh directory on the cloud.  It looks like:
# machine A
Host domU-12-31-39-07-35-21.compute-1.internal
HostName domU-12-31-39-07-35-21
BatchMode yes
IdentityFile /home/tsakai/.ssh/tsakai
ChallengeResponseAuthentication no
IdentitiesOnly yes

# machine B
Host domU-12-31-39-06-74-E2.compute-1.internal
HostName domU-12-31-39-06-74-E2
BatchMode yes
IdentityFile /home/tsakai/.ssh/tsakai
ChallengeResponseAuthentication no
IdentitiesOnly yes

This file exists on both machine A and machine B.

Now when I issue the mpirun command as below:
[tsakai@domU-12-31-39-06-74-E2 ~]$ mpirun -app app.ac2

It hangs.  I Control-C out of it and I get:
mpirun: killing job...

--
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--
--
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to
the "orte-clean" tool for assistance.
--
domU-12-31-39-07-35-21.compute-1.internal - daemon did not report back when launched

Am I making progress?

Does this mean I am past authentication and something else is the problem?
Does someone have an example .ssh/config file I can look at?  There are so
many keyword-argument pairs for this config file and I would like to look at
a very basic one that works.
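
On the config question, one detail worth checking in a file like the one above is how ssh picks a Host stanza. According to the ssh_config documentation, the Host pattern is compared with the name given on the command line, while HostName only sets where to connect. So a stanza written for the long .compute-1.internal name is not applied when the short name is typed. Below is a minimal Python sketch of that matching rule. `stanza_applies` is a made-up helper, and `fnmatchcase` only approximates ssh's `*` and `?` wildcards.

```python
from fnmatch import fnmatchcase


def stanza_applies(host_pattern: str, name_on_command_line: str) -> bool:
    """Approximate ssh's Host matching: the pattern is tested against the
    name the user typed, not against the HostName value or a resolved name."""
    return fnmatchcase(name_on_command_line, host_pattern)


long_name = "domU-12-31-39-07-35-21.compute-1.internal"
short_name = "domU-12-31-39-07-35-21"

print(stanza_applies(long_name, short_name))  # False: stanza is skipped
print(stanza_applies(long_name, long_name))   # True: exact name matches
print(stanza_applies("*", short_name))        # True: wildcard matches any host
```

If the settings are identical for every node, a single wildcard stanza avoids the question of which name is typed.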

Thank you.

Tena Sakai
tsa...@gallo.ucsf.edu

On 2/9/11 7:52 PM, "Tena Sakai"  wrote:

Hi

I have an app.ac1 file like below:
[tsakai@vixen local]$ cat app.ac1
-H vixen.egcrc.org   -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 5
-H vixen.egcrc.org   -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 6
-H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 7
-H blitzen.egcrc.org -np 1 Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R 8

The program I run is
Rscript /Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R x
where x is in [5..8].  The machines vixen and blitzen each handle 2 runs.
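
Since the EC2 host names change with every launch, an app file of this shape may be easier to generate than to edit by hand. Below is a minimal Python sketch. The helper name `appfile_lines` is hypothetical, and it assumes one process per line with the same -H/-np layout as app.ac1 above.

```python
from typing import Iterable, List, Tuple


def appfile_lines(jobs: Iterable[Tuple[str, int]], script: str) -> List[str]:
    """Build one app-file line per (host, argument) pair, one process each."""
    return [f"-H {host} -np 1 Rscript {script} {arg}" for host, arg in jobs]


jobs = [
    ("vixen.egcrc.org", 5),
    ("vixen.egcrc.org", 6),
    ("blitzen.egcrc.org", 7),
    ("blitzen.egcrc.org", 8),
]
script = "/Users/tsakai/Notes/R/parallel/Rmpi/local/fib.R"
print("\n".join(appfile_lines(jobs, script)))
```

Writing the joined string to a file gives something that can be passed to mpirun -app, with the host list replaced by whatever names the current launch produced.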

Here’s the program fib.R:
[tsakai@vixen local]$ cat fib.R
# fib() computes, given index n, fibonacci number iteratively
# here's the first dozen sequence (indexed from 0..11)
# 1, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89

fib <- function( n ) {
    a <- 0
    b <- 1
    for ( i in 1:n ) {
        t <- b
        b <- a
        a <- a + t
    }
    a
}

arg <- commandArgs( TRUE )
myHost <- system( 'hostname', intern=TRUE )
cat( fib(arg), myHost, '\n' )

It reads an argument from command line and produces a fibonacci number that
corresponds to that index, followed by the machine name.  Pretty simple stuff.
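For checking the numbers on a node where R is not installed, the same loop can be written in plain shell. This is a sketch that mirrors the a/b/t update in fib.R, not a replacement for it. One difference is an assumption worth noting: for n = 0 this version prints 0, while the R loop over 1:0 still runs twice and gives 1.

```shell
# fib N
# Iterative Fibonacci using the same update order as fib.R:
# t takes the old b, b takes the old a, a becomes a + t.
fib() {
    n=$1
    a=0
    b=1
    i=1
    while [ "$i" -le "$n" ]; do
        t=$b
        b=$a
        a=$((a + t))
        i=$((i + 1))
    done
    echo "$a"
}
```

Calling echo "$(fib 5) $(hostname)" prints the value for index 5 followed by the machine name, in the same layout as the Rscript output.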

Here’s the run output:
[tsakai@vixen local]$ mpirun -app app.ac1
5 vixen.egcrc.org
8 vixen.egcrc.org
13 blitzen.egcrc.org
21 blitzen.egcrc.org

Which is exactly what I expect.  So far so good.

Now I want to run the same thing on the cloud.  I launch 2 instances of the same
virtual machine, which I reach by:
[tsakai@vixen local]$ ssh -A -i ~/.ssh/tsakai machine-instance-A-public-dns

Now I am on machine A:
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ # and I can go to machine B without password authentication,
[tsakai@domU-12-31-39-00-D1-F2 ~]$ # i.e., use public/private key
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname
domU-12-31-39-00-D1-F2
[tsakai@domU-12-31-39-00-D1-F2 ~]$ ssh -i .ssh/tsakai domU-12-31-39-0C-C8-01
Last login: Wed Feb  9 20:51:48 2011 from 10.254.214.4
[tsakai@domU-12-31-39-0C-C8-01 ~]$
[tsakai@domU-12-31-39-0C-C8-01 ~]$ # I am now on machine B
[tsakai@domU-12-31-39-0C-C8-01 ~]$ hostname
domU-12-31-39-0C-C8-01
[tsakai@domU-12-31-39-0C-C8-01 ~]$
[tsakai@domU-12-31-39-0C-C8-01 ~]$ # now show I can get to machine A without using password
[tsakai@domU-12-31-39-0C-C8-01 ~]$
[tsakai@domU-12-31-39-0C-C8-01 ~]$ ssh -i .ssh/tsakai domU-12-31-39-00-D1-F2
The authenticity of host 'domu-12-31-39-00-d1-f2 (10.254.214.4)' can't be established.
RSA key fingerprint is e3:ad:75:b1:a4:63:7f:0f:c4:0b:10:71:f3:2f:21:81.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'domu-12-31-39-00-d1-f2' (RSA) to the list of known hosts.
Last login: Wed Feb  9 20:49:34 2011 from 10.215.203.239
[tsakai@domU-12-31-39-00-D1-F2 ~]$
[tsakai@domU-12-31-39-00-D1-F2 ~]$ hostname