Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-29 Thread Roland Dreier
 > Is it possible for the /sys/class/infiniband directory to exist and be
 > empty? In which cases?

Do "modprobe ib_core" on a system with no hardware drivers loaded (or no
RDMA hardware installed).


Re: [OMPI devel] Communication between entities

2008-05-29 Thread Ralph H Castain
I see, thanks for the explanation!

I'm afraid you'll have no choice, though, but to relay the message via the
local daemon. I know that creates a window of vulnerability, but it cannot
be helped.

Passing full contact info for all daemons to all procs would take us back a
few steps and cause a whole lot of sockets to be opened...


On 5/29/08 8:04 AM, "Leonardo Fialho"  wrote:

> Ralph,
> 
> I want to implement a receiver-based message log (the RADIC
> architecture) that stores the log file on another node, so that no
> stable storage is necessary.
> 
> I developed a wrapper around the PML that manages the messages and
> stores them locally (or in stable storage), but now I need to migrate
> this "log file" to another node. Only the PML needs this file (to
> generate it and to recover after a failure), but the ORTE daemons store
> and manage the files in order to launch them when a node dies.
> 
> In this approach the ORTE daemons are treated as application
> "protectors", and the applications are the "protected".
> 
> Thanks,
> Leonardo
> 
> 
> Ralph H Castain wrote:
>> There is no way to send a message to a daemon located on another node
>> without relaying it through the local daemon. The application procs have no
>> knowledge of the contact info for any daemon other than their own, so even
>> using the direct routed module would not work.
>> 
>> Can you provide some reason why the normal relay is unacceptable? And why
>> the PML would want to communicate with a daemon, which, after all, is -not-
>> an MPI process and has no idea what a PML is?
>> 
>> 
>> On 5/29/08 7:41 AM, "Leonardo Fialho"  wrote:
>> 
>>   
>>> Hi All,
>>> 
>>> If, inside a PML component, I need to send a message to the ORTE daemon
>>> located on another node, how can I do it?
>>> 
>>> Is it safe to create a thread to manage this communication independently,
>>> or does Open MPI have a service for this (like the RML in the ORTE
>>> environment)?
>>> 
>>> I saw a socket connection between the application and the local ORTE
>>> daemon, but I don't want to send the message to the local ORTE daemon and
>>> then have it send the same message to the remote ORTE daemon...
>>> 
>>> Thanks,
>>> 
>> 
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>   
> 





Re: [OMPI devel] Communication between entities

2008-05-29 Thread Leonardo Fialho

Ralph,

I want to implement a receiver-based message log (the RADIC
architecture) that stores the log file on another node, so that no
stable storage is necessary.


I developed a wrapper around the PML that manages the messages and
stores them locally (or in stable storage), but now I need to migrate
this "log file" to another node. Only the PML needs this file (to
generate it and to recover after a failure), but the ORTE daemons store
and manage the files in order to launch them when a node dies.


In this approach the ORTE daemons are treated as application
"protectors", and the applications are the "protected".


Thanks,
Leonardo


Ralph H Castain wrote:

There is no way to send a message to a daemon located on another node
without relaying it through the local daemon. The application procs have no
knowledge of the contact info for any daemon other than their own, so even
using the direct routed module would not work.

Can you provide some reason why the normal relay is unacceptable? And why
the PML would want to communicate with a daemon, which, after all, is -not-
an MPI process and has no idea what a PML is?


On 5/29/08 7:41 AM, "Leonardo Fialho"  wrote:

  

Hi All,

If, inside a PML component, I need to send a message to the ORTE daemon
located on another node, how can I do it?

Is it safe to create a thread to manage this communication independently,
or does Open MPI have a service for this (like the RML in the ORTE
environment)?

I saw a socket connection between the application and the local ORTE
daemon, but I don't want to send the message to the local ORTE daemon and
then have it send the same message to the remote ORTE daemon...

Thanks,








--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478



Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-29 Thread Jeff Squyres

On May 29, 2008, at 3:27 AM, Pavel Shamis (Pasha) wrote:

>> I got some more feedback from Roland off-list explaining that if
>> /sys/class/infiniband does exist and is non-empty and
>> /sys/class/infiniband_verbs/abi_version does not exist, then this is
>> definitely a case where we want to warn because it implies that config
>> is screwed up -- RDMA devices are present but not usable.
>
> Is it possible for the /sys/class/infiniband directory to exist and be
> empty? In which cases?


Roland consistently said "...and not empty" in e-mails to me, so that's
what I assumed.


However, Pasha just did a test: on a machine with a ConnectX HCA, he
manually removed the mlx4 driver and started the openibd service.
/sys/class/infiniband was created, but it was empty.


I guess this is a situation that we want to warn about -- we can
simplify the whole deal by making the overriding assumption: if the
drivers are loaded at all (such that /sys/class/infiniband/ exists at
all), OMPI should expect to be able to find some RDMA devices.  If it
doesn't find any, it should issue a warning.
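
The rule above, together with the abi_version case discussed earlier in
this thread, could be sketched roughly as follows. This is only an
illustration in Python, not Open MPI code: the function name
rdma_sysfs_status, the status labels it returns, and the sysfs_root
parameter (there so the logic can be tried against a scratch directory)
are made up for the example. Only the two sysfs paths come from this
thread.

```python
import os


def rdma_sysfs_status(sysfs_root="/sys/class"):
    """Classify the RDMA-related sysfs state found under sysfs_root.

    Returns one of four illustrative labels (hypothetical names):
      "no-drivers"          - infiniband/ is absent, so stay silent
      "warn-no-devices"     - infiniband/ exists but is empty
      "warn-verbs-unusable" - devices listed, but abi_version is missing
      "ok"                  - devices listed and abi_version present
    """
    ib_dir = os.path.join(sysfs_root, "infiniband")
    abi_file = os.path.join(sysfs_root, "infiniband_verbs", "abi_version")

    # No directory at all: the drivers are not loaded, nothing to warn about.
    if not os.path.isdir(ib_dir):
        return "no-drivers"

    # Directory exists but lists no devices: drivers loaded, no devices found.
    if not os.listdir(ib_dir):
        return "warn-no-devices"

    # Devices are listed but the verbs ABI file is missing: not usable.
    if not os.path.exists(abi_file):
        return "warn-verbs-unusable"

    return "ok"


if __name__ == "__main__":
    print(rdma_sysfs_status())
```

Pointing sysfs_root at a temporary directory makes it easy to reproduce
the empty-directory case from Pasha's test without touching a real
machine's drivers.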


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Communication between entities

2008-05-29 Thread Ralph H Castain
There is no way to send a message to a daemon located on another node
without relaying it through the local daemon. The application procs have no
knowledge of the contact info for any daemon other than their own, so even
using the direct routed module would not work.

Can you provide some reason why the normal relay is unacceptable? And why
the PML would want to communicate with a daemon, which, after all, is -not-
an MPI process and has no idea what a PML is?


On 5/29/08 7:41 AM, "Leonardo Fialho"  wrote:

> Hi All,
> 
> If, inside a PML component, I need to send a message to the ORTE daemon
> located on another node, how can I do it?
> 
> Is it safe to create a thread to manage this communication independently,
> or does Open MPI have a service for this (like the RML in the ORTE
> environment)?
> 
> I saw a socket connection between the application and the local ORTE
> daemon, but I don't want to send the message to the local ORTE daemon and
> then have it send the same message to the remote ORTE daemon...
> 
> Thanks,





[OMPI devel] Communication between entities

2008-05-29 Thread Leonardo Fialho

Hi All,

If, inside a PML component, I need to send a message to the ORTE daemon
located on another node, how can I do it?


Is it safe to create a thread to manage this communication independently,
or does Open MPI have a service for this (like the RML in the ORTE
environment)?


I saw a socket connection between the application and the local ORTE
daemon, but I don't want to send the message to the local ORTE daemon and
then have it send the same message to the remote ORTE daemon...


Thanks,

--
Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478



Re: [OMPI devel] RFC: Linuxes shipping libibverbs

2008-05-29 Thread Pavel Shamis (Pasha)


> I got some more feedback from Roland off-list explaining that if
> /sys/class/infiniband does exist and is non-empty and
> /sys/class/infiniband_verbs/abi_version does not exist, then this is
> definitely a case where we want to warn because it implies that config
> is screwed up -- RDMA devices are present but not usable.

Is it possible for the /sys/class/infiniband directory to exist and be
empty? In which cases?


Re: [OMPI devel] Notes from mem hooks call today

2008-05-29 Thread Patrick Geoffray

Hi Roland,

Roland Dreier wrote:

Stick in a separate library then?

I don't think we want the complexity in the kernel -- I personally would
argue against merging it upstream; and given that the userspace solution
is actually faster, it becomes pretty hard to justify.


Memory registration has always been expensive, so it's not in the 
critical path (not used for small messages and a system call overhead is 
nothing for large messages in MPI). Sure, you can have the kernel notify 
the user space through mapped flags, but it's a bit ugly IMHO.


There are cases where the basic registration already uses the same
infrastructure as a regcache. For example, on Solaris, MacOSX and Linux
PowerPC, you really want to register segments as large as possible to
limit the IOMMU overhead. You also don't want to register the same page
multiple times with overlapping registrations, because the IOMMU space
is limited. In short, you already have a registration cache in the driver.
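
To make the overlapping-registration point concrete, here is a toy
sketch in Python of a page-granular cache that never registers the same
page twice. It is not how any real driver or library does it: the class
name ToyRegCache, the 4096-byte PAGE_SIZE, and the idea of simply
counting pages instead of calling a real registration routine are all
assumptions made for the illustration.

```python
PAGE_SIZE = 4096  # assumed page size, for illustration only


class ToyRegCache:
    """Toy page-granular registration cache (hypothetical, not a real API)."""

    def __init__(self):
        self.pages = set()         # page numbers already "registered"
        self.pages_registered = 0  # pages that needed a new registration

    def register(self, addr, length):
        """Register [addr, addr + length) and return how many pages were new."""
        if length <= 0:
            raise ValueError("length must be positive")
        first = addr // PAGE_SIZE
        last = (addr + length - 1) // PAGE_SIZE
        # Only pages not yet in the cache would consume extra IOMMU space.
        new_pages = [p for p in range(first, last + 1) if p not in self.pages]
        self.pages.update(new_pages)
        self.pages_registered += len(new_pages)
        return len(new_pages)


if __name__ == "__main__":
    cache = ToyRegCache()
    print(cache.register(0, 8192))     # pages 0 and 1 are new
    print(cache.register(4096, 8192))  # page 1 is cached, only page 2 is new
    print(cache.register(100, 50))     # entirely inside cached page 0
```

A real cache would also have to handle deregistration and invalidation
when memory is unmapped, which this sketch leaves out.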


However, if the user space is expected to call register/deregister 
often, then I agree that the cache better be in user space.


The big picture is that it's not really important where the regcache 
lives, as long as it's out of MPI.


Patrick