Re: [ewg] Re: [ofa-general] OFED Jan 28 meeting summary on RC3 readiness

2008-01-31 Thread Gleb Natapov
On Thu, Jan 31, 2008 at 01:30:23PM -0500, Doug Ledford wrote:
> > And I'm really not trying to come across harsh here, but if the distros are
> > willing to pull the OFED code, why should OFA bother trying to merge 
> > anything
> > upstream? 
> 
> I pull *some* OFED code.  I don't pull it all.  There are things in OFED
> I won't accept until they've gone upstream.  Hence, RDS is not in our
> offering.  We made the mistake of taking SDP long ago and we'll carry
What about XRC?

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] RE: [ofa-general] OFED Jan 14 meeting summary on RC2 readiness

2008-01-23 Thread Gleb Natapov
On Tue, Jan 22, 2008 at 03:04:39PM -0800, Roland Dreier wrote:
>  > > I guess you mean just implement XRC without allowing multiple
>  > > processes to share an XRC domain?  That actually seems like a sensible
>  > > thing to implement as well...
>  > 
>  > This is part of the current XRC implementation -- just give -1 as the fd 
> value
>  > in ibv_open_xrc_domain().
> 
> I *think* Gleb's point was that the XRC implementation could be much
> simpler if this were the *only* case supported -- you wouldn't need
> all the complexity of kernel receive QPs etc I guess.  Gleb, is that
> what you meant?
Yes, that is exactly what I meant. Just to clarify my position on the XRC
API: I am not against it (even if I think the usefulness of sharing QPs
between processes is overestimated; but then, why should I be right?), I
just want to be sure that it will not be changed a couple of months after
the release of OFED 1.3 because the kernel people will not accept it into
the kernel as is.
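
For the record, the single-process case being discussed would look something
like this (a sketch against the XRC API from Jack's patches as I understand
it; the exact signature may differ):

    #include <infiniband/verbs.h>

    /* fd = -1 asks for a process-private XRC domain: there is no file
     * backing it, so the domain cannot be shared with other processes. */
    struct ibv_xrc_domain *open_private_xrc_domain(struct ibv_context *ctx)
    {
            return ibv_open_xrc_domain(ctx, -1, 0);
    }

If, as Roland says, this were the only supported case, most of the
complexity would go away.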

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] RE: [ofa-general] OFED Jan 14 meeting summary on RC2 readiness

2008-01-20 Thread Gleb Natapov
On Thu, Jan 17, 2008 at 05:25:14PM -0800, Roland Dreier wrote:
>  > Well, I can't speak for everyone, but in my opinion if someone wants to run
>  > an MPI job so huge that XRC absolutely has to be used to be able to actually
>  > finish it, then he should seriously rethink his application design.
> 
> But where do you think the crossover is where XRC starts to help MPI?
>  > In other words do I need a 10,000 process job on 32-core systems for it
> to matter, or is there a significant advantage for running a 2048
> process job on 256 8-core systems?
Let's do the math:
N   - number of processes
C   - number of cores per node
QPS - QP size (assume 4K)
N/C - number of nodes

In the non-XRC case each process creates a QP to every other process, so
the number of QPs created by each process is N (well, N - 1, but we don't
care), and the memory consumed by QPs on one node is:
N * C * QPS

In the XRC case each process creates a send QP for each node plus a
receive QP for each process, so the memory consumed by QPs on one node is:
(N/C * C + N) * QPS => 2 * N * QPS

Looking at your two examples:
1. N=10000 C=32
non-XRC memory consumption: 1250M
XRC memory consumption: 78.125M

2. N=2048 C=8
non-XRC memory consumption: 64M
XRC memory consumption: 16M

As you can see the benefit grows fast with the number of cores.
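
If anyone wants to plug in other job shapes, here is the same arithmetic as
a trivial C program (assuming the 4K QP size from above):

    #include <stdio.h>

    /* Per-node QP memory, following the formulas above. */
    static void qp_memory(double n, double c)
    {
            const double qps = 4096.0, mb = 1024.0 * 1024.0;
            double rc  = n * c * qps;   /* non-XRC: N QPs in each of C processes */
            double xrc = 2.0 * n * qps; /* XRC: (N/C * C + N) * QPS */

            printf("N=%.0f C=%.0f: non-XRC %gM, XRC %gM\n",
                   n, c, rc / mb, xrc / mb);
    }

    int main(void)
    {
            qp_memory(10000, 32); /* prints 1250M vs 78.125M */
            qp_memory(2048, 8);   /* prints 64M vs 16M */
            return 0;
    }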

But it seems that applications running at a big scale rarely (if at all)
create all-to-all connections during their run. Just one fun observation:
let's assume that creating one connection takes 500ms; then in your first
example creating connections from one process to all other processes will
take 1.4 hours.

Memory consumed by the QPs is not the only thing that limits scalability,
BTW. If each process communicates with all other processes, it had better
prepost enough receive buffers. With XRC, if a recv QP is shared by local
processes and one of them goes RNR, all the other processes can't receive
on this QP either. And with XRC/SRQ we pretty much rely on HW flow
control, so this scenario will happen. Thus if you want to minimize RNRs
you should prepost more buffers as the job grows.
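
Roughly what I mean by preposting, as standard verbs code (a sketch; buffer
registration and sizing are left out):

    #include <stdint.h>
    #include <infiniband/verbs.h>

    /* Keep 'depth' receive WRs posted on the shared receive queue; per the
     * argument above, depth has to grow with the job size to avoid RNRs. */
    static int prepost_srq(struct ibv_srq *srq, struct ibv_mr *mr,
                           char *buf, uint32_t buf_size, int depth)
    {
            int i;

            for (i = 0; i < depth; i++) {
                    struct ibv_sge sge = {
                            .addr   = (uintptr_t)(buf + i * buf_size),
                            .length = buf_size,
                            .lkey   = mr->lkey,
                    };
                    struct ibv_recv_wr wr = {
                            .wr_id   = i,
                            .sg_list = &sge,
                            .num_sge = 1,
                    }, *bad;

                    if (ibv_post_srq_recv(srq, &wr, &bad))
                            return -1;
            }
            return 0;
    }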

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] RE: [ofa-general] OFED Jan 14 meeting summary on RC2 readiness

2008-01-17 Thread Gleb Natapov
On Wed, Jan 16, 2008 at 09:35:39PM -0800, Roland Dreier wrote:
>  > Roland, you said that the XRC API is ugly; are you going to push it upstream
>  > in its present form?
> 
> That's a good question.  Since there is no 'present form' for XRC as
> far as I can tell, it's hard to make a definitive answer.  Certainly I
There is a proposed API, and Jack is working on an implementation. The API
is not pretty at all, but it seems that, with the way XRC is implemented in
HW, it is hard to think of a better one. It is very important to decide
whether the API is good enough for the kernel proper before releasing
OFED 1.3. After that the damage will be done.

> haven't made up my mind in advance one way or another.  In addition to
> seeing how the code ends up, I think the other big piece of the puzzle
> is to hear from the Open MPI team and other consumers of the API and
> find out how big the benefit is.
> 
Well, I can't speak for everyone, but in my opinion if someone wants to run
an MPI job so huge that XRC absolutely has to be used to be able to actually
finish it, then he should seriously rethink his application design. This is
only my opinion of course; I am sure if you ask Mellanox they will
tell you that XRC is the best thing that has happened to networking since
the invention of InfiniBand :). XRC can be used not just for scalability,
BTW. It can be used as a way to post differently sized buffers to the same
QP, and this is very useful, but for this kind of usage the ugliest parts
of the API are not needed. I will be glad to hear other people's opinions
too (I know Mellanox's).

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] RE: [ofa-general] OFED Jan 14 meeting summary on RC2 readiness

2008-01-15 Thread Gleb Natapov
On Tue, Jan 15, 2008 at 01:02:12PM -0800, Sean Hefty wrote:
> I would rather see OFED pull code from upstream with patches added
> on only for backports and fixes.
This is a very important point actually. Is there any guarantee that the
XRC API will be pushed to the kernel as is? What if the kernel maintainers
refuse to accept it in its present form? Will applications using XRC from
OFED have to support two different XRC APIs as a result?

Roland, you said that the XRC API is ugly; are you going to push it upstream
in its present form?

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: OFED Sep 10 meeting summary on OFED 1.3 development status

2007-09-10 Thread Gleb Natapov
>   XRC - 90% 
When can we expect to see this patch posted to the ofa list?

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: Scalable reliable connection

2007-07-31 Thread Gleb Natapov
On Mon, Jul 30, 2007 at 03:50:54PM +0300, Michael S. Tsirkin wrote:
> With SRC:
>   O(N ^ 2 * J)
> 
>   This is achieved by using a single send queue (per job, out of O(N * J) jobs)
>   to send data to all J jobs running on a specific node (out of O(N) nodes).
>   Hardware uses the new "SRQ number" field in the packet header to
>   multiplex receive WRs and WCs to the private memory of each job.
> 
But since the send queue cannot be used for receiving packets, additional
receive QPs have to be created, one per job, so with SRC it is actually
O(N ^ 2 * J + N * J)
unless I am missing something.
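
Putting rough numbers on it (N and J invented for illustration):

    #include <stdio.h>

    int main(void)
    {
            long n = 256, j = 8;   /* 256 nodes, 8 jobs per node */
            long send = n * n * j; /* O(N^2 * J) send queues     */
            long recv = n * j;     /* the extra receive QPs      */

            printf("send %ld + recv %ld = %ld QPs\n", send, recv, send + recv);
            /* prints: send 524288 + recv 2048 = 526336 QPs */
            return 0;
    }

The extra term doesn't change the asymptotics, of course, but the QPs are
real.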

> This is a similar idea to IB RD.
Except that with RD there is no need to jump through hoops and create
separate QPs for sending and receiving packets in order to achieve
scalability.

> Q: Why not use RD then?
> A: Because no hardware supports it.
Wrong answer :) There was no HW for SRC either, but Mellanox decided to
implement SRC instead of RD. The reasons Dror provided for this were:
a) RD is hard to do.
   Not a very convincing reason IMO. Not doing RD just pushes the
   complexity from HW to SW. And there are HW implementations of RD,
   though not for IB.
b) RD, as defined by the IB spec, will not achieve good performance.
   This reason is serious, but can the spec be changed to allow a
   high-performance implementation? Spec compliance is not something that
   stopped Mellanox from doing things before :)

Thanks for the protocol explanation.

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [ofa-general] RFC: SRC API

2007-07-30 Thread Gleb Natapov
On Mon, Jul 30, 2007 at 03:10:57PM +0300, Michael S. Tsirkin wrote:
> > > It seems what you are missing is what SRC is, not how to use the API.
> > 
> > So tell us.
> 
> This calls for a separate document. From feedback from Sonoma I really assumed
> people have it figured out.
> 
> Let's open a separate thread, and there I will try writing up
> what SRC is from the protocol point of view.
> 
No problem. Start it :)

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [ofa-general] RFC: SRC API

2007-07-30 Thread Gleb Natapov
On Mon, Jul 30, 2007 at 02:21:30PM +0300, Michael S. Tsirkin wrote:
> > Quoting Gleb Natapov <[EMAIL PROTECTED]>:
> > Subject: Re: [ofa-general] RFC: SRC API
> > 
> > On Mon, Jul 30, 2007 at 12:16:39PM +0300, Michael S. Tsirkin wrote:
> > > More code examples:
> > > 
> > > Create an SRC QP, part of SRC domain:
> > > 
> > >   attr.qp_type = IBV_QPT_SRC;
> > >   attr.src_domain = d;
> > >   qp = ibv_create_qp(pd, &attr);
> > > 
> > > Given remote SRQ number, send data to this SRQ over an SRC QP:
> > > 
> > >   wr.src_remote_srq_num = src_remote_srq_num;
>  > >   ibv_post_send(qp, &wr, &bad_wr);
> > > 
> > > Note: SRQ number needs to be exchanged as part of CM private data
> > >   or some other protocol.
> > > 
> > You are too brief. I can come up with one-liners based on the API by
> > myself. I am trying to understand how sharing of SRC between processes
> > will work and your example doesn't show this.
> 
> It seems what you are missing is what SRC is, not how to use the API.
So tell us, because it seems I am not the only one, judging by the
presentation I've got from Ishai. In that presentation he proposes to
create separate receive QPs and send QPs. Is this how it is meant to work
when an SRC domain is shared between processes? Because frankly, I don't
see how it can be used in any other way.

> I'll have a working example when I get closer to implementation.
> For now you'll have to look up Dror's preso if you want to
> understand what SRC is.
I have looked at Dror's presentation more than once. If we are talking
about the same presentation, there are not many details there beyond the
additional field in the header carrying the destination SRQ number, so HW
will be able to demux a packet into the right SRQ.

> 
> > Can I connect the same
> > SRC to different QPs? If yes, can I send a packet to any SRQ connected to
> > the SRC through any QP connected to the same SRC?
> 
> Yes to both.
And can I attach an SRQ to an SRC domain without creating a QP? I suppose yes.

> 
> > If yes how is this
> > different from having regular QPs?
> 
> With regular QP you can only send to a single SRQ.
> But again, look at Dror's preso.
> 
Yes, but I can use the same QP for sending and receiving (it is a Queue
Pair, after all). Now I'll have to create a QP for send and a QP for
receive. The overall number of QPs may still be smaller, though...

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [ofa-general] RFC: SRC API

2007-07-30 Thread Gleb Natapov
On Mon, Jul 30, 2007 at 12:16:39PM +0300, Michael S. Tsirkin wrote:
> More code examples:
> 
> Create an SRC QP, part of SRC domain:
> 
>   attr.qp_type = IBV_QPT_SRC;
>   attr.src_domain = d;
>   qp = ibv_create_qp(pd, &attr);
> 
> Given remote SRQ number, send data to this SRQ over an SRC QP:
> 
>   wr.src_remote_srq_num = src_remote_srq_num;
>   ibv_post_send(qp, &wr, &bad_wr);
> 
> Note: SRQ number needs to be exchanged as part of CM private data
>   or some other protocol.
> 
You are too brief. I can come up with one-liners based on the API by
myself. I am trying to understand how sharing of SRC between processes
will work, and your example doesn't show this. Can I connect the same
SRC to different QPs? If yes, can I send a packet to any SRQ connected to
the SRC through any QP connected to the same SRC? If yes, how is this
different from having regular QPs?

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [ofa-general] RFC: SRC API

2007-07-30 Thread Gleb Natapov
On Mon, Jul 30, 2007 at 12:01:40PM +0300, Michael S. Tsirkin wrote:
> Some code examples:
>   /* create a domain and share it: */
> 
>   struct ibv_src_domain * d = ibv_get_new_src_domain(ctx);
>   int fd = open(path, O_CREAT | O_RDWR, mode);
>   ibv_share_src_domain(d, fd);
> 
>   /* get a reference to a shared domain: */
> 
>   int fd = open(path, O_CREAT | O_RDWR, mode);
>   struct ibv_src_domain * d = ibv_get_shared_src_domain(ctx, fd);
> 
>   /* once done: */
>   ibv_put_src_domain(d);
> 
> Note: when all users do put, domain is destroyed.
> 
OK. I am more interested in how an SRC domain is connected to a QP in
different processes, and how a process will be able to send to different
processes through one QP, etc.
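
To make the question concrete, here is how I currently read the proposal,
stitched together from your two examples (the file path and the split of
work between the two processes are my guesses, not anything you specified):

    #include <fcntl.h>
    #include <infiniband/verbs.h>

    /* Process A: create the domain, share it, create a send-side SRC QP. */
    static void process_a(struct ibv_context *ctx, struct ibv_pd *pd)
    {
            struct ibv_src_domain *d = ibv_get_new_src_domain(ctx);
            int fd = open("/tmp/my-src-domain", O_CREAT | O_RDWR, 0600);
            struct ibv_qp_init_attr attr = {
                    .qp_type    = IBV_QPT_SRC,
                    .src_domain = d,
                    /* ... CQs, cap, ... */
            };
            struct ibv_qp *qp;

            ibv_share_src_domain(d, fd);
            qp = ibv_create_qp(pd, &attr);
            /* connect qp; then, per destination process:
             *   wr.src_remote_srq_num = <that process's SRQ number>;
             * and post the send on this one qp. */
    }

    /* Process B: attach to the same domain through the same file. */
    static void process_b(struct ibv_context *ctx)
    {
            int fd = open("/tmp/my-src-domain", O_CREAT | O_RDWR, 0600);
            struct ibv_src_domain *d = ibv_get_shared_src_domain(ctx, fd);

            /* ... create its own SRC QPs/SRQs against d ... */
            ibv_put_src_domain(d);
    }

Is that the intended flow, or does B somehow reuse A's QP?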

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [ofa-general] RFC: SRC API

2007-07-30 Thread Gleb Natapov
On Mon, Jul 30, 2007 at 11:52:21AM +0300, Michael S. Tsirkin wrote:
> > On Sun, Jul 29, 2007 at 05:04:31PM +0300, Michael S. Tsirkin wrote:
> > > Hello!
> > > Here is an API proposal for support of the SRC
> > > (scalable reliable connected) protocol extension in libibverbs.
> > > 
> > > This adds APIs to:
> > > - manage SRC domains
> > > 
> > > - share SRC domains between processes,
> > >   by means of creating a 1:1 association
> > >   between an SRC domain and a file.
> > > 
> > > Notes:
> > > - The file is specified by means of a file descriptor,
> > >   this makes it possible for the user to manage file
> > >   creation/deletion in the most flexible manner
> > >   (e.g. tmpfile can be used).
> > > 
> > > - I envision implementing this sharing mechanism in kernel by means
> > >   of a per-device tree, with inode as a key and domain object
> > >   as a value.
> > >  
> > > Please comment.
> > Can you provide pseudo code of an application using this API?
> > Especially the QP sharing part.
> 
> There's no QP sharing here.
> You mean SRC domain sharing?
> 
Yes. Sorry.

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [ofa-general] RFC: SRC API

2007-07-30 Thread Gleb Natapov
On Sun, Jul 29, 2007 at 05:04:31PM +0300, Michael S. Tsirkin wrote:
> Hello!
> Here is an API proposal for support of the SRC
> (scalable reliable connected) protocol extension in libibverbs.
> 
> This adds APIs to:
> - manage SRC domains
> 
> - share SRC domains between processes,
>   by means of creating a 1:1 association
>   between an SRC domain and a file.
> 
> Notes:
> - The file is specified by means of a file descriptor,
>   this makes it possible for the user to manage file
>   creation/deletion in the most flexible manner
>   (e.g. tmpfile can be used).
> 
> - I envision implementing this sharing mechanism in kernel by means
>   of a per-device tree, with inode as a key and domain object
>   as a value.
>  
> Please comment.
Can you provide pseudo code of an application using this API?
Especially the QP sharing part.

> 
> Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]>
> 
> ---
> 
> diff --git a/include/infiniband/verbs.h b/include/infiniband/verbs.h
> index acc1b82..503f201 100644
> --- a/include/infiniband/verbs.h
> +++ b/include/infiniband/verbs.h
> @@ -370,6 +370,11 @@ struct ibv_ah_attr {
>   uint8_t port_num;
>  };
>  
> +struct ibv_src_domain {
> + struct ibv_context *context;
> + uint32_t handle;
> +};
> +
>  enum ibv_srq_attr_mask {
>   IBV_SRQ_MAX_WR  = 1 << 0,
>   IBV_SRQ_LIMIT   = 1 << 1
> @@ -389,7 +394,8 @@ struct ibv_srq_init_attr {
>  enum ibv_qp_type {
>   IBV_QPT_RC = 2,
>   IBV_QPT_UC,
> - IBV_QPT_UD
> + IBV_QPT_UD,
> + IBV_QPT_SRC
>  };
>  
>  struct ibv_qp_cap {
> @@ -408,6 +414,7 @@ struct ibv_qp_init_attr {
>   struct ibv_qp_cap   cap;
>   enum ibv_qp_type qp_type;
>   int sq_sig_all;
> + struct ibv_src_domain  *src_domain;
>  };
>  
>  enum ibv_qp_attr_mask {
> @@ -526,6 +533,7 @@ struct ibv_send_wr {
>   uint32_t remote_qkey;
>   } ud;
>   } wr;
> + uint32_t src_remote_srq_num;
>  };
>  
>  struct ibv_recv_wr {
> @@ -553,6 +561,10 @@ struct ibv_srq {
>   pthread_mutex_t mutex;
>   pthread_cond_t  cond;
>   uint32_t events_completed;
> +
> + uint32_t src_srq_num;
> + struct ibv_src_domain  *src_domain;
> + struct ibv_cq  *src_cq;
>  };
>  
>  struct ibv_qp {
> @@ -570,6 +582,8 @@ struct ibv_qp {
>   pthread_mutex_t mutex;
>   pthread_cond_t  cond;
>   uint32_t events_completed;
> +
> + struct ibv_src_domain  *src_domain;
>  };
>  
>  struct ibv_comp_channel {
> @@ -912,6 +926,25 @@ struct ibv_srq *ibv_create_srq(struct ibv_pd *pd,
>  struct ibv_srq_init_attr *srq_init_attr);
>  
>  /**
> + * ibv_create_src_srq - Creates a SRQ associated with the specified protection
> + *   domain and src domain.
> + * @pd: The protection domain associated with the SRQ.
> + * @src_domain: The SRC domain associated with the SRQ.
> + * @src_cq: CQ to report completions for SRC packets on.
> + *
> + * @srq_init_attr: A list of initial attributes required to create the SRQ.
> + *
> + * srq_attr->max_wr and srq_attr->max_sge are read to determine the
> + * requested size of the SRQ, and set to the actual values allocated
> + * on return.  If ibv_create_srq() succeeds, then max_wr and max_sge
> + * will always be at least as large as the requested values.
> + */
> +struct ibv_srq *ibv_create_src_srq(struct ibv_pd *pd,
> +                                   struct ibv_src_domain *src_domain,
> +                                   struct ibv_cq *src_cq,
> +                                   struct ibv_srq_init_attr *srq_init_attr);
> +
> +/**
>   * ibv_modify_srq - Modifies the attributes for the specified SRQ.
>   * @srq: The SRQ to modify.
>   * @srq_attr: On input, specifies the SRQ attributes to modify.  On output,
> @@ -1074,6 +1107,44 @@ int ibv_detach_mcast(struct ibv_qp *qp, union ibv_gid *gid, uint16_t lid);
>   */
>  int ibv_fork_init(void);
>  
> +/**
> + * ibv_get_new_src_domain - Allocate an SRC domain
> + * Returns a reference to an SRC domain.
> + * Use ibv_put_src_domain to free the reference.
> + * @context: Device context
> + */
> +struct ibv_src_domain *ibv_get_new_src_domain(struct ibv_context *context);
> +
> +/**
> + * ibv_share_src_domain - associate the src domain with a file.
> + * Establishes a connection between an SRC domain object and a file descriptor.
> + *
> + * @d: SRC domain to share
> + * @fd: descriptor for a file to associate with the domain
> + */
> +int ibv_share_src_domain(struct ibv_src_domain *d, int fd);
> +
> +/**
> + * ibv_unshare_src_domain - disassociate the src domain from a file.
> + * Subsequent calls to ibv_get_shared_src_domain will fail.
> + * @d: SRC domain to unshare
> + */
> +int ibv_unshare_src_domain(struct ibv_src_domain *d);
> +
> +/**
> + * ibv_get_shared_src_domain - get a reference to a shared SRC domain
> + * @

[ewg] Re: [ofa-general] Toward next OFED release (1.3)

2007-06-26 Thread Gleb Natapov
On Tue, Jun 26, 2007 at 10:47:16AM -0700, Roland Dreier wrote:
> 
>  > What about allowing coherent memory for the CQ to be allocated inside the
>  > kernel, to fix the issue with Altix machines?
> 
> Sorry... I've been remiss in posting about this.  I would actually
> prefer to see an extension to the dma_map_sg() interface (a new flag
> perhaps?) that would set the right magic bit in the DMA address on
> altix.  The refactoring of ib_umem_get() to be called by low-level
> drivers makes this a fairly clean approach, and it avoids the problems
> with using dma_alloc_coherent() to allocate userspace buffers (for
> example, dma_alloc_coherent() uses up kernel virtual addresses, which
> may be scarce on 32 bit architectures).
> 
While this makes sense, it would be hard to push into the kernel proper,
wouldn't it? Are you going to do that?
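
Just to check I understand the shape of it, something like this, where both
names are entirely made up placeholders for whatever the extension ends up
being called:

    #include <linux/dma-mapping.h>
    #include <linux/scatterlist.h>

    /* Hypothetical sketch only: an attribute-taking variant of dma_map_sg()
     * that lets a ULP ask for the Altix barrier bit in the DMA address.
     * dma_map_sg_attrs() and DMA_ATTR_BARRIER do not exist today. */
    static int map_cq_pages(struct device *dev, struct scatterlist *sg,
                            int nents)
    {
            return dma_map_sg_attrs(dev, sg, nents, DMA_BIDIRECTIONAL,
                                    DMA_ATTR_BARRIER /* hypothetical */);
    }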

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


[ewg] Re: [ofa-general] Toward next OFED release (1.3)

2007-06-26 Thread Gleb Natapov
On Tue, Jun 26, 2007 at 05:27:21PM +0300, Tziporet Koren wrote:
> libibverbs:
> * New verbs: 
> * Scalable Reliable Connected Transport (with Mellanox ConnectX)
> * Shared Send Queue
> * Reliable Multicast ?
> 
What about allowing coherent memory for the CQ to be allocated inside the
kernel, to fix the issue with Altix machines?

--
Gleb.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg