Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-26 Thread Doug Ledford
On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote:
> Thanks again everyone for the review comments. I've updated the patch set
> accordingly. The main changes are in the first patch to use a read-write
> semaphore instead of an SRCU, and with the reference counting of shared
> ib_cm_ids.
> Please let me know if I missed anything, or if there are other issues with
> the series.

Hi Haggai,

I know you are probably busy reworking this right now on the basis of
Jason's comments.  However, my biggest issue with this patch set right
now is not technical (well, it is, but it's only partially technical).

This is a core feature more than anything else.  Namespaces for RDMA
devices is not unique to IB or RoCE in any way.  Yet no thought has been
given to how this will work universally across all of the RDMA capable
devices (mainly I'm talking about iWARP here...I don't think this is an
issue for usNIC as if you want namespace support there, you just start
the user space app in a given namespace and you are probably 90% of the
way there since the user space application gets its own device and so
its own MAC/IP and all of the RDMA transfers are UDP, so the
application's namespace should get inherited by all the rest, but Cisco
would need to confirm that, hence why I say 90% of the way there, it
needs confirmed).

So, while you are reworking things right now, you would ideally contact
Steve Wise and/or Tatyana Nikolova and discuss the iWARP story on this.
I know there won't be a lot of overlap between IB and iWARP, but last
time you were asked you didn't even know if this setup could be extended
to iWARP.

For this next statement, I know I'm directing this to you Haggai, but
please don't take it that way.  I'm really using your patch set to make
a broader point to everyone on the list.

When I look at patches for support for a given feature, one of the
things I'm going to look at is whether or not that feature is specific
to a given hardware type, or if it's a generic feature.  If it's a
generic feature, then I'm going to want to know that the person
submitting it has designed it well.  A pre-requisite of designing a
generic feature well is that it considers all hardware types, not just
your specific hardware type.  So when you come back with the next
version of this patch set, please have an answer for how it should work
on each hardware type even if you don't have implementation patches for
each hardware type.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD



signature.asc
Description: This is a digitally signed message part


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-26 Thread Jason Gunthorpe
On Tue, May 26, 2015 at 09:34:40AM -0400, Doug Ledford wrote:

> This is a core feature more than anything else.  Namespaces for RDMA
> devices is not unique to IB or RoCE in any way.  Yet no thought has been
> given to how this will work universally across all of the RDMA
> capable

I think if Haggai is able to follow the prescription I gave, then things
will be general.

 - All rdma cm ids are associated with a netdev
 - The output flow uses that netdev to restrict, configure and
   determine the output RDMA device QP
 - The input flow locates the netdev as step one and then uses the
   (netdev,ip,port) tuple to find the rdma listener, which is in turn
   tied to a netdev and is restricted/configured by it.

The technology specific part is the two maps: from (input
device,packet) to netdev, from netdev to (output device,packet)

After the above clean up is done, namespace enabling is basically
providing those two mapping functions for each technology in a way
that can locate delegatable netdevs.

The trivial case for all the ethernet techs is to provide the above
maps that can take the (input device,VLAN) and locate the correct
child VLAN specific netdev. The existing code to support VLAN should
pretty much immediately enable basic namespace support for all the
ethernet families.
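
To make the flow above concrete, here is a minimal sketch of the two
lookups. It is a user-space Python model, not kernel code, and every name
in it (NetDev, CmRouter, input_map, the device and listener names) is
hypothetical and chosen only for illustration. It assumes the
technology-specific input map is keyed on (input device, VLAN).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class NetDev:
    """Stand-in for a net_device; 'netns' names the namespace that owns it."""
    name: str
    netns: str


@dataclass
class CmRouter:
    # Technology-specific map: (input RDMA device, VLAN id) -> netdev.
    input_map: Dict[Tuple[str, int], NetDev] = field(default_factory=dict)
    # Generic map: (netdev, ip, port) -> listener.
    listeners: Dict[Tuple[NetDev, str, int], str] = field(default_factory=dict)

    def listen(self, netdev: NetDev, ip: str, port: int, listener: str) -> None:
        self.listeners[(netdev, ip, port)] = listener

    def route(self, device: str, vlan: int, ip: str, port: int) -> Optional[str]:
        # Step one: locate the netdev from the input device and packet.
        netdev = self.input_map.get((device, vlan))
        if netdev is None:
            return None
        # Step two: the (netdev, ip, port) tuple selects the listener.
        return self.listeners.get((netdev, ip, port))


router = CmRouter()
eth0_10 = NetDev("eth0.10", netns="container-a")
eth0_20 = NetDev("eth0.20", netns="container-b")
router.input_map[("rdma0", 10)] = eth0_10
router.input_map[("rdma0", 20)] = eth0_20
# Same IP and port in both containers; the netdev keeps them apart.
router.listen(eth0_10, "192.0.2.1", 7000, "listener-a")
router.listen(eth0_20, "192.0.2.1", 7000, "listener-b")
```

Because the netdev is resolved first, two listeners on the same IP and
port do not collide as long as they sit on different child netdevs.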

The big open question for ethernet is how to work without relying on
VLAN to create delegated netdevs - typically one would use a bridge and
veth's, which do not seem very RDMA compatible. But that doesn't need
to be answered right now.

Remember, this isn't RDMA namespaces, this is netdev namespace support
for RDMA-CM -> very different things.

Basically, I'm happy with the generality story, if the clean up work I
outlined turns out..

> issue for usNIC as if you want namespace support there, you just start
> the user space app in a given namespace and you are probably 90% of
> the

usNIC has no kernel facing functionality, and no interaction with
RDMA-CM, so it is irrelevant to any discussion about RDMA-CM :(

Jason
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-26 Thread Doug Ledford
On Tue, 2015-05-26 at 10:59 -0600, Jason Gunthorpe wrote:
> On Tue, May 26, 2015 at 09:34:40AM -0400, Doug Ledford wrote:
> 
> > This is a core feature more than anything else.  Namespaces for RDMA
> > devices is not unique to IB or RoCE in any way.  Yet no thought has been
> > given to how this will work universally across all of the RDMA
> > capable
> 
> I think if Haggai is able to follow the prescription I gave, then things
> will be general.
> 
>  - All rdma cm ids are associated with a netdev
>  - The output flow uses that netdev to restrict, configure and
>    determine the output RDMA device QP
>  - The input flow locates the netdev as step one and then uses the
>    (netdev,ip,port) tuple to find the rdma listener, which is in turn
>    tied to a netdev and is restricted/configured by it.
> 
> The technology specific part is the two maps: from (input
> device,packet) to netdev, from netdev to (output device,packet)
> 
> After the above clean up is done, namespace enabling is basically
> providing those two mapping functions for each technology in a way
> that can locate delegatable netdevs.
> 
> The trivial case for all the ethernet techs is to provide the above
> maps that can take the (input device,VLAN) and locate the correct
> child VLAN specific netdev. The existing code to support VLAN should
> pretty much immediately enable basic namespace support for all the
> ethernet families.
> 
> The big open question for ethernet is how to work without relying on
> VLAN to create delegated netdevs - typically one would use a bridge and
> veth's, which do not seem very RDMA compatible. But that doesn't need
> to be answered right now.
> 
> Remember, this isn't RDMA namespaces, this is netdev namespace support
> for RDMA-CM -> very different things.

That was the point of my email.  This is a very myopic view of the
feature.  It *should* at least have an idea of these other things too.

> Basically, I'm happy with the generality story, if the clean up work I
> outlined turns out..
> 
> > issue for usNIC as if you want namespace support there, you just start
> > the user space app in a given namespace and you are probably 90% of
> > the
> 
> usNIC has no kernel facing functionality, and no interaction with
> RDMA-CM, so it is irrelevant to any discussion about RDMA-CM :(

Whether usNIC has a kernel facing functionality or not is irrelevant.
This feature isn't kernel only; it affects user space applications
launched in a namespace too.  And, again, my point was that this
discussion is about RDMA-CM and it should be broader (even if the
implementation isn't broader).  Due to the implementation of usNIC I
suspect it would "just work", but it would be better to know so.

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD





RE: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-26 Thread Christian Benvenuti (benve)
> -----Original Message-----
> From: netdev-ow...@vger.kernel.org [mailto:netdev-ow...@vger.kernel.org] On
> Behalf Of Doug Ledford
> Sent: Tuesday, May 26, 2015 6:35 AM
> To: Haggai Eran
> Cc: linux-r...@vger.kernel.org; netdev@vger.kernel.org; Liran Liss; Guy
> Shapiro; Shachar Raindel; Yotam Kenneth
> Subject: Re: [PATCH v4 for-next 00/12] Add network namespace support in the
> RDMA-CM

...

> I don't think this is an issue for usNIC as if you
> want namespace support there, you just start the user space app in a given
> namespace and you are probably 90% of the way there since the user space
> application gets its own device and so its own MAC/IP and all of the RDMA
> transfers are UDP, so the application's namespace should get inherited by all
> the rest, but Cisco would need to confirm that, hence why I say 90% of the way
> there, it needs confirmed).

This is correct. 

Thanks
/Chris


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-26 Thread Jason Gunthorpe
On Tue, May 26, 2015 at 01:46:36PM -0400, Doug Ledford wrote:

> > Remember, this isn't RDMA namespaces, this is netdev namespace support
> > for RDMA-CM -> very different things.
> 
> That was the point of my email.  This is a very myopic view of the
> feature.  It *should* at least have an idea of these other things too.

Everything you talked about seems covered: iwarp/roce/ib now have a
fairly clear uniform story for CM. usNIC doesn't use core code.

I doubt a larger discussion about a 'rdma namespace' is going to
substantially change these patches, they are really netdev focused.
Anyhow, I've been saving that discussion for when the roce and
umad/uverbs namespace stuff is re-sent. It seems more appropriate at
that point.

I don't know about you, but I am exhausted looking at these huge patch
sets, and narrowing the focus is the only way I see to get through.
This series has hopefully narrowed to: 'fix the flow in netdev
handling for rdma-cm'.

Jason


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Haggai Eran
On 26/05/2015 16:34, Doug Ledford wrote:
> On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote:
>> Thanks again everyone for the review comments. I've updated the patch set
>> accordingly. The main changes are in the first patch to use a read-write
>> semaphore instead of an SRCU, and with the reference counting of shared
>> ib_cm_ids.
>> Please let me know if I missed anything, or if there are other issues with
>> the series.
> 
> Hi Haggai,
> 
> I know you are probably busy reworking this right now on the basis of
> Jason's comments.  However, my biggest issue with this patch set right
> now is not technical (well, it is, but it's only partially technical).
Hi,

I'm sorry about the late reply. We had a holiday here, and then some
other tasks took precedence. I've only got back to working on this today.

> 
> This is a core feature more than anything else.  Namespaces for RDMA
> devices is not unique to IB or RoCE in any way.  Yet no thought has been
> given to how this will work universally across all of the RDMA capable
> devices (mainly I'm talking about iWARP here...
I don't agree. It is true that we are not planning to provide an iWarp
implementation for network namespaces, as we lack the capacity and the
expertise. However, I think that the changes we proposed to the rdma_cm
module will work with iWarp too. Perhaps with some of Jason's
suggestions it will be smoother, but even in the current design, I think
that if iWarp drivers can provide iw_cm with the network device on which
a request is received, then it should be simple to modify it for
namespace support without significant change to rdma_cm.

> I don't think this is an
> issue for usNIC as if you want namespace support there, you just start
> the user space app in a given namespace and you are probably 90% of the
> way there since the user space application gets its own device and so
> its own MAC/IP and all of the RDMA transfers are UDP, so the
> application's namespace should get inherited by all the rest, but Cisco
> would need to confirm that, hence why I say 90% of the way there, it
> needs confirmed).
> 
> So, while you are reworking things right now, you would ideally contact
> Steve Wise and/or Tatyana Nikolova and discuss the iWARP story on this.
> I know there won't be a lot of overlap between IB and iWARP, but last
> time you were asked you didn't even know if this setup could be extended
> to iWARP.
> 
> For this next statement, I know I'm directing this to you Haggai, but
> please don't take it that way.  I'm really using your patch set to make
> a broader point to everyone on the list.
> 
> When I look at patches for support for a given feature, one of the
> things I'm going to look at is whether or not that feature is specific
> to a given hardware type, or if it's a generic feature.  If it's a
> generic feature, then I'm going to want to know that the person
> submitting it has designed it well.  A pre-requisite of designing a
> generic feature well is that it considers all hardware types, not just
> your specific hardware type.  So when you come back with the next
> version of this patch set, please have an answer for how it should work
> on each hardware type even if you don't have implementation patches for
> each hardware type.

Well, because the RDMA subsystem supports a very diverse set of devices,
I think there are few people who know the details of all hardware types
well. If we are going to evolve the generic parts of the stack, we have
to cooperate. We have to rely on the knowledge of people on the mailing
list to say whether the feature is well designed for all hardware types,
or whether changes are warranted. In this specific case, the patches have
been on the list since February. I think it is enough time to allow
anyone who is interested in network namespace support to chime in.

Regards,
Haggai



Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Haggai Eran
On 26/05/2015 19:59, Jason Gunthorpe wrote:
> The big open question for ethernet is how to work without relying on
> VLAN to create delegated netdevs - typically one would use a bridge and
> veth's, which do not seem very RDMA compatible. But that doesn't need
> to be answered right now.

I think in Ethernet the first step would be to support macvlan devices.
Like IPoIB child devices, they are directly attached to an RDMA device,
so they don't require handling a complex virtual bridging topology as
veths do.


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Haggai Eran
On 26/05/2015 20:46, Doug Ledford wrote:
>> Remember, this isn't RDMA namespaces, this is netdev namespace support
>> for RDMA-CM -> very different things.
> That was the point of my email.  This is a very myopic view of the
> feature.  It *should* at least have an idea of these other things too.

We did give some thought to the question of whether an RDMA namespace is
needed, and concluded that it isn't. RDMA resources such as QP numbers,
memory keys, etc. are allocated by the devices. So different containers
wouldn't care if they share the "QP number namespace", etc. RDMA CM
ports are different because they are chosen by the applications, but
they map directly to the network namespace, so they don't require their
own namespace.
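
As a rough illustration of that last point, here is a minimal sketch in
which CM ports are tracked per network namespace. It is a user-space
Python model, not the rdma_cm implementation; PortSpace, bind and the
namespace names are hypothetical.

```python
from collections import defaultdict


class PortSpace:
    """Tracks bound RDMA CM ports, keyed by network namespace."""

    def __init__(self):
        # netns name -> set of ports bound in that namespace
        self._bound = defaultdict(set)

    def bind(self, netns: str, port: int) -> bool:
        """Return True if the bind succeeds, False if the port is taken."""
        if port in self._bound[netns]:
            return False
        self._bound[netns].add(port)
        return True


ports = PortSpace()
first = ports.bind("container-a", 7471)
other_ns = ports.bind("container-b", 7471)   # same port, different namespace
same_ns = ports.bind("container-a", 7471)    # same port, same namespace
```

Under this model the second container can bind the same port without a
conflict, and only a repeated bind inside one namespace is refused.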

Regards,
Haggai



Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Doug Ledford
On Thu, 2015-05-28 at 16:07 +0300, Haggai Eran wrote:
> On 26/05/2015 16:34, Doug Ledford wrote:
> > On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote:
> > This is a core feature more than anything else.  Namespaces for RDMA
> > devices is not unique to IB or RoCE in any way.  Yet no thought has been
> > given to how this will work universally across all of the RDMA capable
> > devices (mainly I'm talking about iWARP here...
> I don't agree. It is true that we are not planning to provide an iWarp
> implementation for network namespaces, as we lack the capacity and the
> expertise. However, I think that the changes we proposed to the rdma_cm
> module will work with iWarp too. Perhaps with some of Jason's
> suggestions it will be smoother, but even in the current design, I think
> that if iWarp drivers can provide iw_cm with the network device on which
> a request is received, then it should be simple to modify it for
> namespace support without significant change to rdma_cm.

My request wasn't for a functional implementation, just a statement that
you had in fact thought about it and, as you say here, would expect it
to work (and preferably why as well).

> Well, because the RDMA subsystem supports a very diverse set of devices,
> I think there are few people who know the details of all hardware types
> well. If we are going to evolve the generic parts of the stack, we have
> to cooperate. We have to rely on the knowledge of people on the mailing
> list to say whether the feature is well designed for all hardware types,
> or whether changes are warranted. In this specific case, the patches have
> been on the list since February. I think it is enough time to allow
> anyone who is interested in network namespace support to chime in.

You would think that, but sometimes important information comes from
totally different places.  See mine and Jason's comments back and forth
in the SRIOV thread started by Or.

Long story short:

ip link add dev ib0 name ib0.1 type ipoib

is totally broken on at least all Red Hat OSes.  It will require
reworking of the network scripts and NetworkManager assumptions to make
it work.  It will also break DHCP on the interface as pkey/guid are the
only items that uniquely identify DHCP clients.  The net result of our
talks was that it is likely that each interface on the same pkey will
require an alias GUID per child interface in order to keep things
workable.
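
To illustrate the DHCP part, here is a minimal sketch that models the
client identity as nothing more than the (pkey, guid) pair. That is an
assumption made for illustration; it is not the actual client-id wire
format, and the interface names, P_Keys and GUID values are made up.

```python
from collections import defaultdict


def dhcp_identity(pkey: int, guid: int) -> tuple:
    """Hypothetical model of the identity a DHCP server sees for a client."""
    return (pkey, guid)


# (interface name, pkey, port GUID); all values are made up.
interfaces = [
    ("ib0", 0xFFFF, 0x0002C90300A1B2C3),
    ("ib0.1", 0xFFFF, 0x0002C90300A1B2C3),     # child: same pkey, same GUID
    ("ib0.8001", 0x8001, 0x0002C90300A1B2C3),  # child on a different pkey
]

clients = defaultdict(list)
for name, pkey, guid in interfaces:
    clients[dhcp_identity(pkey, guid)].append(name)

# Identities claimed by more than one interface.
collisions = {ident: names for ident, names in clients.items() if len(names) > 1}
```

In this model ib0 and ib0.1 are indistinguishable to the server, while
the child on a different pkey is not; an alias GUID per child would give
each of them a distinct identity.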

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD





Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Jason Gunthorpe
On Thu, May 28, 2015 at 04:22:36PM +0300, Haggai Eran wrote:
> wouldn't care if they share the "QP number namespace", etc. RDMA CM
> ports are different because they are chosen by the applications, but
> they map directly to the network namespace, so they don't require their
> own namespace.

Different containers should have restricted access to the PKey and GID
tables, and to the presence of the device itself. Just like in the SRIOV
case.

That is what the 'RDMA Namespace' would control.

Jason


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Or Gerlitz

On 5/28/2015 5:07 PM, Doug Ledford wrote:

> You would think that, but sometimes important information comes from
> totally different places.  See mine and Jason's comments back and forth
> in the SRIOV thread started by Or.
>
> Long story short:
>
> ip link add dev ib0 name ib0.1 type ipoib
>
> is totally broken on at least all Red Hat OSes.  It will require
> reworking of the network scripts and NetworkManager assumptions to make
> it work.  It will also break DHCP on the interface as pkey/guid are the
> only items that uniquely identify DHCP clients.  The net result of our
> talks was that it is likely that each interface on the same pkey will
> require an alias GUID per child interface in order to keep things workable.



Doug,

Just to make sure we're on the same page, you're saying that the IPoIB
DHCP scheme (client + server) used in RH products uses a Client-ID that is
either eight bytes long, or 20 bytes long with the four upper bytes masked
out (which of them?), and hence is broken when multiple entities use the
same ID?


Anything else except for that (you said "reworking of the network 
scripts and NetworkManager assumptions to make it work")??


On one hand, we realized that the implementation of same-PKEY IPoIB
children, which has existed for a while, is broken with the RH DHCP scheme
and should be enhanced. OTOH, these children can serve as nice building
blocks for IPoIB containers or a virtio-IPoIB scheme.


Note that out of the eleven patches that make up the series, only ONE
relates directly to IPoIB; the rest are applicable either to all the
transports supported by the RDMA stack, or to IPoIB + RoCE.


Under some assumptions and with some changes, people can test it with a
DHCP scheme different from RH's, or with a non-DHCP-based IP address
assignment scheme.


So we have a very nice effort and work done by developers, to bring RDMA 
into containers, accompanied by reviewers providing lots of their brain 
power to make it robust.


I don't see why we should stop the whole RDMA containers support train 
just b/c we found out the IPoIB DHCP bug which was there for few years 
before this effort started.


How about letting this series go in after the rest of the reviewers'
comments are addressed, such that under IPoIB it will work in a small set
of environments, while with macvlan-based RoCE support, to be introduced
later, it will work in a wider set of environments?


Or.


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Jason Gunthorpe
On Thu, May 28, 2015 at 07:21:11PM +0300, Or Gerlitz wrote:

> Anything else except for that (you said "reworking of the network scripts
> and NetworkManager assumptions to make it work")??

IPv6 becomes very broken: child interfaces will generate the same IPv6
addresses for radv and link local, resulting in duplicate address
scenarios.

About the only thing that will work properly is statically assigned
IPv4 addresses.
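
A minimal sketch of why the link-local addresses collide, assuming the
64-bit port GUID is used directly as the IPv6 interface identifier. That
is a simplification (it ignores any universal/local bit handling), the
GUID values are made up, and link_local_from_guid is a hypothetical
helper, not a kernel function.

```python
import ipaddress

LINK_LOCAL_PREFIX = 0xFE80 << 112   # fe80::/64
GUID_MASK = 0xFFFFFFFFFFFFFFFF      # low 64 bits carry the interface id


def link_local_from_guid(guid: int) -> ipaddress.IPv6Address:
    """Simplified: place the GUID in the low 64 bits of fe80::/64."""
    return ipaddress.IPv6Address(LINK_LOCAL_PREFIX | (guid & GUID_MASK))


parent = link_local_from_guid(0x0002C90300A1B2C3)
child = link_local_from_guid(0x0002C90300A1B2C3)     # child shares the port GUID
aliased = link_local_from_guid(0x0002C90300A1B2C4)   # child with an alias GUID
```

With a shared GUID the parent and child end up with the same address;
giving the child its own GUID yields a distinct one.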

> I don't see why we should stop the whole RDMA containers support train just
> b/c we found out the IPoIB DHCP bug which was there for few years before
> this effort started.

I don't think that is what Doug said.

Jason


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Doug Ledford
On Thu, 2015-05-28 at 11:43 -0600, Jason Gunthorpe wrote:
> On Thu, May 28, 2015 at 07:21:11PM +0300, Or Gerlitz wrote:
> 
> > Anything else except for that (you said "reworking of the network scripts
> > and NetworkManager assumptions to make it work")??
> 
> IPv6 becomes very broken: child interfaces will generate the same IPv6
> addresses for radv and link local, resulting in duplicate address
> scenarios.
> 
> About the only thing that will work properly is statically assigned
> IPv4 addresses.
> 
> > I don't see why we should stop the whole RDMA containers support train just
> > b/c we found out the IPoIB DHCP bug which was there for few years before
> > this effort started.
> 
> I don't think that is what Doug said.

Indeed.  There is no need to scrap things, but if the design as it
stands, and the intended means of creating objects for use in
containers, is going to result in an unworkable network, then we have to
re-evaluate how the container constructs are created, and that then has
possible consequences for how we would get from an incoming packet to
the proper container.

I'm not trying to stop the "support train" here, but at the same time,
if the train is headed for a bridge that's out

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD





Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Or Gerlitz
On Thu, May 28, 2015 at 9:22 PM, Doug Ledford  wrote:

>> I don't think that is what Doug said.

> Indeed.  There is no need to scrap things, but if the design as it
> stands, and the intended means of creating objects for use in
> containers, is going to result in an unworkable network, then we have to
> re-evaluate how the container constructs are created, and that then has
> possible consequences for how we would get from an incoming packet to
> the proper container.

To be precise, do we agree that the issue here isn't "in the design as
it stands" but rather in a problem we found in the intended way of
assigning IP addresses through DHCP for the containers?

> I'm not trying to stop the "support train" here, but at the same time,
> if the train is headed for a bridge that's out

So what is your concrete suggestion here? Where should we go from here?

Or.


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-05-28 Thread Doug Ledford
On Thu, 2015-05-28 at 22:05 +0300, Or Gerlitz wrote:
> On Thu, May 28, 2015 at 9:22 PM, Doug Ledford  wrote:
> 
> >> I don't think that is what Doug said.
> 
> > Indeed.  There is no need to scrap things, but if the design as it
> > stands, and the intended means of creating objects for use in
> > containers, is going to result in an unworkable network, then we have to
> > re-evaluate how the container constructs are created, and that then has
> > possible consequences for how we would get from an incoming packet to
> > the proper container.
> 
> To be precise, do we agree that the issue here isn't "in the design as
> it stands" but rather in a problem we found in the intended way of
> assigning IP addresses through DHCP for the containers?

No, I would say the problem *is* in the design.  But the problem is the
selected means of identifying the netdev to get to the namespace (and
the proposed means of creating non-default namespace devices to exist in
the container), not the namespace design itself.

> > I'm not trying to stop the "support train" here, but at the same time,
> > if the train is headed for a bridge that's out
> 
> So what is your concrete suggestion here? Where should we go from here?

This excerpt is from the commit log of patch 3/12:

The IB device and port, together with the P_Key and the IP address should
be enough to uniquely identify the ULP net device.

The problem here is that this is wrong.  If we allow more than one
device per pkey with the same GUID, then DHCP breaks, which is bad in
and of itself, but it also breaks ipv6 link local addressing.  Which
means that this hunk in patch 4/12:

+#if IS_ENABLED(CONFIG_IPV6)
+        case AF_INET6:
+                if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1))
+                        return true;
+
+                break;
+#endif

can now be tricked into returning true for incorrect devices.

Where do we go from here?

First, I'm inclined to say we should modify the add_child portion of
IPoIB to refuse to add links to a PKey if that GUID is already present
on that PKey.  You could then use different PKeys on the default GUID
for separate namespaces.  If you need separate namespaces on the same
PKey, then enable alias GUIDs for use on the local adapter and require
one GUID per namespace on the same PKey.

Then I'm inclined to say that we should map for namespaces using device,
port, guid/gid, pkey.  And in this situation, since a unique guid/gid on
any given pkey maps to a unique dhcp identifier and a unique ipv6
lladdr, this becomes freely interchangeable with device, port, pkey,
address mappings that this patchset was built around.
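
A minimal sketch of the two steps above, as a user-space Python model
rather than the IPoIB add_child code. ChildRegistry and its methods are
hypothetical names, and the device name, GIDs and P_Keys are made up.

```python
class ChildRegistry:
    """Maps (device, port, gid, pkey) to a netdev name, one entry per key."""

    def __init__(self):
        self._by_key = {}

    def add_child(self, device, port, gid, pkey, netdev):
        key = (device, port, gid, pkey)
        if key in self._by_key:
            # Refuse a second interface with the same GID on the same P_Key.
            raise ValueError(f"{netdev}: GID already present on P_Key {pkey:#06x}")
        self._by_key[key] = netdev

    def lookup(self, device, port, gid, pkey):
        """Return the netdev for this key, or None if nothing matches."""
        return self._by_key.get((device, port, gid, pkey))


GID = "fe80::2:c903:a1:b2c3"     # default GUID (made up)
ALIAS = "fe80::2:c903:a1:b2c4"   # alias GUID (made up)

reg = ChildRegistry()
reg.add_child("ibdev0", 1, GID, 0xFFFF, "ib0")
reg.add_child("ibdev0", 1, GID, 0x8001, "ib0.8001")   # different P_Key: allowed
reg.add_child("ibdev0", 1, ALIAS, 0xFFFF, "ib0.1")    # alias GUID: allowed

try:
    reg.add_child("ibdev0", 1, GID, 0xFFFF, "ib0.2")  # same GID, same P_Key
    rejected = False
except ValueError:
    rejected = True
```

Since duplicates are refused at creation time, the lookup never has to
consult an IP address to choose between two candidates.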

-- 
Doug Ledford 
  GPG KeyID: 0E572FDD





Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-03 Thread Haggai Eran
On 29/05/2015 00:55, Doug Ledford wrote:
> On Thu, 2015-05-28 at 22:05 +0300, Or Gerlitz wrote:
>> So what is your concrete suggestion here? Where should we go from here?
> 
> This excerpt is from the commit log of patch 3/12:
> 
> The IB device and port, together with the P_Key and the IP address should
> be enough to uniquely identify the ULP net device.
> 
> The problem here is that this is wrong.  If we allow more than one
> device per pkey with the same GUID, then DHCP breaks, which is bad in
> and of itself, but it also breaks ipv6 link local addressing.  Which
> means that this hunk in patch 4/12:
> 
> +#if IS_ENABLED(CONFIG_IPV6)
> +        case AF_INET6:
> +                if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1))
> +                        return true;
> +
> +                break;
> +#endif
> 
> can now be tricked into returning true for incorrect devices.
> 
> Where do we go from here?
> 
> First, I'm inclined to say we should modify the add_child portion of
> IPoIB to refuse to add links to a PKey if that GUID is already present
> on that PKey.  You could then use different PKeys on the default GUID
> for separate namespaces.  If you need separate namespaces on the same
> PKey, then enable alias GUIDs for use on the local adapter and require
> one GUID per namespace on the same PKey.
I don't think blocking the current add_child implementation is needed. I
agree IPv6 SLAAC and DHCP currently don't work well, and adding alias
GUID for child interfaces is important, but the current implementation
can be used with static IPv4 addresses, so I don't think it must be
disabled.

> Then I'm inclined to say that we should map for namespaces using device,
> port, guid/gid, pkey.  And in this situation, since a unique guid/gid on
> any given pkey maps to a unique dhcp identifier and a unique ipv6
> lladdr, this becomes freely interchangeable with device, port, pkey,
> address mappings that this patchset was built around.

What if we change the namespaces patches to map (device, port, GID,
P_Key, IP) to netdev / namespace? That is, to use both the GID and the
IP address. This would allow people to use namespaces with the current
implementation (provided they have a valid configuration with no
conflicting IP addresses), and once alias GUIDs are added, the GUIDs
will be used to uniquely resolve the namespace even with such
misconfigurations.
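
A minimal sketch of what I mean, as a user-space Python model and not
the code from the patches. Child, find_netdev and all the values below
are hypothetical.

```python
from typing import List, NamedTuple, Optional


class Child(NamedTuple):
    netdev: str
    device: str
    port: int
    gid: str
    pkey: int
    ip: str


def find_netdev(children: List[Child], device: str, port: int,
                gid: str, pkey: int, ip: str) -> Optional[str]:
    """Return the one netdev matching all five fields, or None if the
    match is missing or ambiguous."""
    wanted = (device, port, gid, pkey, ip)
    matches = [c.netdev for c in children
               if (c.device, c.port, c.gid, c.pkey, c.ip) == wanted]
    return matches[0] if len(matches) == 1 else None


GID = "fe80::2:c903:a1:b2c3"     # made-up values
ALIAS = "fe80::2:c903:a1:b2c4"

# Current implementation, valid configuration: shared GID, distinct IPs.
shared_gid = [
    Child("ib0", "ibdev0", 1, GID, 0xFFFF, "192.0.2.1"),
    Child("ib0.1", "ibdev0", 1, GID, 0xFFFF, "192.0.2.2"),
]
# Current implementation, misconfigured: shared GID and conflicting IPs.
conflicting_ip = [
    Child("ib0", "ibdev0", 1, GID, 0xFFFF, "192.0.2.1"),
    Child("ib0.1", "ibdev0", 1, GID, 0xFFFF, "192.0.2.1"),
]
# With alias GUIDs: the GID resolves the netdev despite conflicting IPs.
alias_gid = [
    Child("ib0", "ibdev0", 1, GID, 0xFFFF, "192.0.2.1"),
    Child("ib0.1", "ibdev0", 1, ALIAS, 0xFFFF, "192.0.2.1"),
]
```

In the first and third cases the lookup resolves to a single netdev; only
the misconfigured case with a shared GID stays ambiguous.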

Regards,
Haggai


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-03 Thread Haggai Eran
On 28/05/2015 18:46, Jason Gunthorpe wrote:
> On Thu, May 28, 2015 at 04:22:36PM +0300, Haggai Eran wrote:
>> wouldn't care if they share the "QP number namespace", etc. RDMA CM
>> ports are different because they are chosen by the applications, but
>> they map directly to the network namespace, so they don't require their
>> own namespace.
> 
> Different containers should have restricted access to the PKey and GID
> tables, and to the presence of the device itself. Just like in the SRIOV
> case.
> 
> That is what the 'RDMA Namespace' would control.

We were thinking here that there is room for an RDMA cgroup. It would
limit the amount of RDMA resources a container can use. It could also be
used for the restrictions you mentioned, but maybe they are more
suitable for a namespace. I'm not sure. In RoCE, for instance,
restricted access to the GID table can be derived from the network
namespace directly, but perhaps not in InfiniBand.

Regards,
Haggai


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-03 Thread Jason Gunthorpe
On Wed, Jun 03, 2015 at 01:03:01PM +0300, Haggai Eran wrote:
> > Then I'm inclined to say that we should map for namespaces using device,
> > port, guid/gid, pkey.  And in this situation, since a unique guid/gid on
> > any given pkey maps to a unique dhcp identifier and a unique ipv6
> > lladdr, this becomes freely interchangeable with device, port, pkey,
> > address mappings that this patchset was built around.
> 
> What if we change the namespaces patches to map (device, port, GID,
> P_Key, IP) to netdev / namespace? That is, to use both the GID and the
> IP address.

As I keep saying, you are not supposed to use the IP address as a key
to find the netdev, that is the wrong way to use the Linux netdev
model.

Requiring unique GID/PKey allows the implementation to avoid this
wrongness, which would be simpler and more correct.

That is the appeal of blocking this scenario when children are created.

Jason


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-03 Thread Or Gerlitz
On Wed, Jun 3, 2015 at 7:14 PM, Jason Gunthorpe wrote:
> On Wed, Jun 03, 2015 at 01:03:01PM +0300, Haggai Eran wrote:
>> > Then I'm inclined to say that we should map for namespaces using device,
>> > port, guid/gid, pkey.  And in this situation, since a unique guid/gid on
>> > any given pkey maps to a unique dhcp identifier and a unique ipv6
>> > lladdr, this becomes freely interchangeable with device, port, pkey,
>> > address mappings that this patchset was built around.
>>
>> What if we change the namespaces patches to map (device, port, GID,
>> P_Key, IP) to netdev / namespace? That is, to use both the GID and the
>> IP address.
>
> As I keep saying, you are not supposed to use the IP address as a key
> to find the netdev, that is the wrong way to use the Linux netdev
> model.
>
> Requiring unique GID/PKey allows the implementation to avoid this
> wrongness, which would be simplifying and more correct.
>
> That is the appeal to blocking this scenario when children are created.

Jason,

The IPoIB RTNL children were added around release 3.6/3.7 of the upstream
kernel and are part of the kernel UAPI. They work perfectly well in a
bunch of schemes:

1. when static IP address assignment is used

2. under a PV scheme, when the guest has a para-virtual Eth NIC and the
host does routing between the back-end (e.g. tap or the like) and the
IPoIB child. Or when the host does tunneling (vxlan and the like) and
sends down the encapsulated packet through a host IP address assigned
to the IPoIB child

3. and a few more

Indeed the DHCP story isn't working there, and to get DHCP to work
something has to be done. But this issue can't serve as grounds for
blocking the existing UAPI and introducing a regression to working
systems.

Or.


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-03 Thread Jason Gunthorpe
On Wed, Jun 03, 2015 at 10:05:34PM +0300, Or Gerlitz wrote:

> Indeed the DHCP story isn't working there and to get DHCP work
> something has to be done. But this issue can't serve for blocking the
> existing UAPI and introduce regression to working systems.

It is not DHCP that concerns me, it is the fact we can't combine net
namespaces, RDMA-CM and duplicate GUID IPoIB children together without
adding hacks to the kernel. Searching netdevs by IP is a hack.

I'm mostly fine with it as an optional capability, similar to macvlan,
I just don't see how to cleanly integrate it with RDMA CM and
namespaces. And I don't see what RDMA CM is supposed to do when
it hits this case.

So, any ideas that don't involve the search-by-IP hack?

[And yes, as discussed with Haggai, it is not the worst hack in the
 world, and maybe we can live with it, but let's understand the
 trade-offs carefully]

Also, now that this has been brought up, I think you need to make a
patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to
modify addrconf_ifid_infiniband to return error if the IPoIB child is
sharing a guid. It was not good at all to push the child patches
forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them.
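
To make the breakage concrete, here is a small sketch of how the SLAAC interface identifier is derived for IPoIB, modeled on my understanding of addrconf_ifid_infiniband (take the low 8 bytes of the 20-byte hardware address, i.e. the port GUID, and set the universal/local bit). The helper names `ipoib_hw_addr` and `slaac_link_local` are made up for illustration.

```python
import ipaddress

INFINIBAND_ALEN = 20  # 1 byte flags + 3 bytes QPN + 16 bytes GID


def ipoib_hw_addr(qpn: int, gid: bytes) -> bytes:
    """Build a 20-byte IPoIB hardware address from a 24-bit QPN and a 16-byte GID."""
    if not 0 <= qpn < (1 << 24) or len(gid) != 16:
        raise ValueError("need a 24-bit QPN and a 16-byte GID")
    return bytes([0]) + qpn.to_bytes(3, "big") + gid


def slaac_link_local(hw_addr: bytes) -> ipaddress.IPv6Address:
    """Derive the IPv6 link-local address from the GUID part of the address only."""
    if len(hw_addr) != INFINIBAND_ALEN:
        raise ValueError("not an IPoIB hardware address")
    eui = bytearray(hw_addr[12:20])  # port GUID: the last 8 bytes
    eui[0] |= 0x02                   # set the universal/local bit
    return ipaddress.IPv6Address(bytes.fromhex("fe80000000000000") + bytes(eui))


# Parent and child share the GID and differ only in QPN.
gid = ipaddress.IPv6Address("fe80::2:c903:1:1").packed
parent = ipoib_hw_addr(0x000048, gid)
child = ipoib_hw_addr(0x00004a, gid)
```

Because the QPN never enters the computation, `slaac_link_local(parent)` and `slaac_link_local(child)` come out identical, which is the conflict being described.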

Jason


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-03 Thread Or Gerlitz
On Wed, Jun 3, 2015 at 10:53 PM, Jason Gunthorpe wrote:
> On Wed, Jun 03, 2015 at 10:05:34PM +0300, Or Gerlitz wrote:
>
>> Indeed the DHCP story isn't working there and to get DHCP work
>> something has to be done. But this issue can't serve for blocking the
>> existing UAPI and introduce regression to working systems.
>
> It is not DHCP that concerns me, it is the fact we can't combine net
> namespaces, RDMA-CM and duplicate GUID IPoIB children together without
> adding hacks to the kernel. Searching netdevs by IP is a hack.
>
> I'm mostly fine with it as an optional capability, similar to macvlan,
> I just don't see how to cleanly integrate it with RDMA CM and
> namespaces. And I don't see what RDMA CM is supposed to do when
> it hits this case.
>
> So, any ideas that don't involve the searching for IP hack??
>
> [And yes, as discussed with Haggai, it is not the worst hack in the
>  world, and maybe we can live with it, but let's understand the
>  trade-offs carefully]

As Haggai wrote, if we let the IP address approach fly, we have
support for RDMA in containers using the RDMA-CM in IPoIB environments.
This will let people test, use, experiment, fix, interact (and even
use it in production when a static IP address assignment scheme is used).

Later, usage of alias GUIDs for IPoIB RTNL children would allow us to
remove the IP lookup.

Later, the next stage/s in Matan's work on the RoCE GID table would
allow us to support MACVLAN and hence RoCE too.

This is how the Linux kernel has evolved since the 2.5 failure with
giant releases: doing things in relatively small steps.


> Also, now that this has been brought up, I think you need to make a
> patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to
> modify addrconf_ifid_infiniband to return error if the IPoIB child is
> sharing a guid. It was not good at all to push the child patches
> forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them.

Until the alias GUID support is introduced, maybe we can patch
addrconf_ifid_infiniband to use the QPN value from the device HW
address to come up with a unique IPv6 link-local address, agreed? Where
do you think we can place the 24-bit QPN?

Or.


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-03 Thread Jason Gunthorpe
On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote:
> As Haggai wrote, if we let the using IP address thing to fly up, we have
> support for RDMA in containers using the RDMA-CM at IPoIB environments.
> This will let people test, use, experiment, fix, interact (and even
> production-it when static IP address assignment scheme is used).

Sure, I think we all understand the goal, and you've explained some
reasonable use cases for the child support.

> Later, usage of alias GUIDs for IPoIB RTNL childs would allow to
> remove the IP thing.

How do we remove it? Along with same-guid child support? What is your
idea here?

> > Also, now that this has been brought up, I think you need to make a
> > patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to
> > modify addrconf_ifid_infiniband to return error if the IPoIB child is
> > sharing a guid. It was not good at all to push the child patches
> > forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them.
> 
> Till the alias GUID thing is introduced, maybe we can patch
> addrconf_ifid_infiniband to use the QPN value from the device HW
> address to come up with unique IPv6 link local address, agree? where
> you think we can place the 24 bits QPN?

I don't know if that is a good idea; an unstable SLAAC address is not
in the spirit of the RFCs. The safest bet is to return an error and
disable SLAAC completely.

But I'm just guessing here - I only feel strongly that something
should be done to address this issue in the existing kernel.

Jason


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-03 Thread Jason Gunthorpe
On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote:

> > I'm mostly fine with it as an optional capability, similar to macvlan,
> > I just don't see how to cleanly integrate it with RDMA CM and
> > namespaces. And I don't see what RDMA CM is supposed to do when
> > it hits this case.
> >
> > So, any ideas that don't involve the searching for IP hack??
> >
> > [And yes, as discussed with Haggai, it is not the worst hack in the
> >  world, and maybe we can live with it, but let's understand the
> >  trade-offs carefully]
> 
> As Haggai wrote, if we let the using IP address thing to fly up, we have
> support for RDMA in containers using the RDMA-CM at IPoIB environments.
> This will let people test, use, experiment, fix, interact (and even
> production-it when static IP address assignment scheme is used).

I just noticed ipvlan got merged a few months ago. That certainly
changed my view on this topic. It is basically a software
version of the same-guid ipoib children scheme. Similar issues: same MAC
address as the parent, IPv6 SLAAC is disabled (?), DHCP has a similar
issue (solved with RFC4361 and a broadcast fallback, it seems).

The l2/l3 distinction in ipvlan is also very interesting. The L3 mode
solves some of the security type issues. What do you think, Haggai?

Is there any chance standard things like ipvlan and macvlan could be
used with rdma-cm if their master devices are IPoIB? Are we even on
the right path to do that someday? Is that the plan for roce?

Any thoughts on whether we still need ipoib same-guid children if
ipvlan is available?

Jason


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-03 Thread Haggai Eran
On 04/06/2015 02:48, Jason Gunthorpe wrote:
> On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote:
> 
>>> I'm mostly fine with it as an optional capability, similar to macvlan,
>>> I just don't see how to cleanly integrate it with RDMA CM and
>>> namespaces. And I don't see what RDMA CM is supposed to do when
>>> it hits this case.
>>>
>>> So, any ideas that don't involve the searching for IP hack??
>>>
>>> [And yes, as discussed with Haggai, it is not the worst hack in the
>>>  world, and maybe we can live with it, but let's understand the
>>>  trade-offs carefully]
>>
>> As Haggai wrote, if we let the using IP address thing to fly up, we have
>> support for RDMA in containers using the RDMA-CM at IPoIB environments.
>> This will let people test, use, experiment, fix, interact (and even
>> production-it when static IP address assignment scheme is used).
> 
> I just noticed ipvlan got merged a few months ago.. That certainly
> changed my view on this topic. It is basically a software
> version of the same-guid ipoib children scheme. Similar issues: Same MAC
> address as the parent, IPv6 SLAAC is disabled (?),  DHCP has similar
> issue (solved with RFC4361, and broadcasting fallback, it seems)..
> 
> The l2/l3 distinction in ipvlan is also very interesting. The L3 mode
> solves some of the security type issues. What do you think Haggai?
I think some issues ipvlan is trying to solve would also affect us using
the alias GUIDs solution. ipvlan tries to solve, among others, the problem
of a limited MAC filter table in NICs, and to avoid using promiscuous mode.
But the GID table is also limited, and we don't have something like
promiscuous mode for GIDs in InfiniBand. For large scale use of
containers we would need to also allow the current model.

As for L3 mode, it does seem more restrictive, as all routing decisions
are done in the controlling namespace. Our current ipoib child interface
implementation is more like the L2 version of ipvlan.

> 
> Is there any chance standard things like ipvlan and macvlan could be
> used with rdma-cm if their master devices are IPoIB? 
These standard interfaces seem very much connected with Ethernet (both
have an ARPHDR_ETHER-only check for their upper devices). I think
macvlan's functionality would be covered by adding alias GUIDs to ipoib,
and ipvlan L2 is covered by the current behavior. Perhaps it would be
beneficial to try and make ipvlan more generic so that it would work
over ipoib, giving us support for L3 mode.

As for rdma-cm support, the patch I had for ipoib attempts to scan each
child's upper devices in order to support such topologies. We only
tested it with bonding, but I think it would also work with such devices.

> Are we even on
> the right path to do that someday? Is that the plan for roce?
Yes, for RoCE our initial goal was to support namespaces in RDMA
CM through macvlan devices. As long as we can update the RoCE gid table
correctly for macvlan and ipvlan devices, the RDMA CM implementation
shouldn't care where the details come from.

> Any thoughts on the idea we still need ipoib same-guid children if
> ipvlan is available?
If we port ipvlan to work over IPoIB interfaces and not just Ethernet,
then ipvlan L2 would provide exactly the same functionality. The only
difference I can think of is that ipvlan would use a single UD QP for
all devices (and in connected-mode, a single RC QP between a pair of
hosts), while ipoib would use a QP per child device, and multiple RC QPs
for such pairs.

Regards,
Haggai


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-04 Thread Haggai Eran
On 04/06/2015 00:45, Jason Gunthorpe wrote:
> On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote:
>> As Haggai wrote, if we let the using IP address thing to fly up, we have
>> support for RDMA in containers using the RDMA-CM at IPoIB environments.
>> This will let people test, use, experiment, fix, interact (and even
>> production-it when static IP address assignment scheme is used).
> 
> Sure, I think we all understand the goal, and you've explained some
> reasonable use cases for the child support.
> 
>> Later, usage of alias GUIDs for IPoIB RTNL childs would allow to
>> remove the IP thing.
> 
> How do we remove it? Along with same-guid child support? What is your
> idea here?
> 
>>> Also, now that this has been brought up, I think you need to make a
>>> patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to
>>> modify addrconf_ifid_infiniband to return error if the IPoIB child is
>>> sharing a guid. It was not good at all to push the child patches
>>> forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them.
>>
>> Till the alias GUID thing is introduced, maybe we can patch
>> addrconf_ifid_infiniband to use the QPN value from the device HW
>> address to come up with unique IPv6 link local address, agree? where
>> you think we can place the 24 bits QPN?
> 
> I don't know if that is a good idea, an unstable SLAAC is not in
> spirit with the RFCs. The safest bet is to return error and disable
> SLAAC completely.
Maybe this is a silly question, but doesn't DAD already disable SLAAC
addresses when there's a conflict?

Haggai


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-04 Thread Jason Gunthorpe
On Thu, Jun 04, 2015 at 12:41:33PM +0300, Haggai Eran wrote:
> On 04/06/2015 00:45, Jason Gunthorpe wrote:

> > I don't know if that is a good idea, an unstable SLAAC is not in
> > spirit with the RFCs. The safest bet is to return error and disable
> > SLAAC completely.

> Maybe this is a silly question, but doesn't DAD already disable SLAAC
> addresses when there's a conflict?

Yes, DAD should certainly trigger and disable the child, but the
kernel should not rely on DAD for correctness; it is a safety net, and
it isn't guaranteed to be 100% reliable.

Jason


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-04 Thread Jason Gunthorpe
On Thu, Jun 04, 2015 at 09:24:37AM +0300, Haggai Eran wrote:
> > The l2/l3 distinction in ipvlan is also very interesting. The L3 mode
> > solves some of the security type issues. What do you think Haggai?

> I think some issues ipvlan is trying to solve would also affect us using
> the alias GUIDs solution. ipvlan tries to solve among other the problem
> of a limited MAC filter table in NICs, and avoid using promiscuous mode.
> But the GID table is also limited, and we don't have something like
> promiscuous mode for GIDs in InfiniBand. For large scale use of
> containers we would need to also allow the current model.

Yes, that is certainly true.

> As for L3 mode, it does seem more restrictive, as all routing decisions
> are done in the controlling namespace. Our current ipoib child interface
> implementation is more like the L2 version of ipvlan.

The ipoib children are exactly like macvlan, because they all have
unique LLADDRs.

It doesn't start acting like ipvlan until we reach the rdma-cm patches,
where we see the IP stack side act like macvlan and the rdma-cm
side try to act like ipvlan - that is why it is so ugly/hacky.

> > Is there any chance standard things like ipvlan and macvlan could be
> > used with rdma-cm if their master devices are IPoIB?

> These standard interfaces seem very much connected with Ethernet (both
> have an ARPHDR_ETHER-only check for their upper devices). I think
> macvlan's functionality would be covered by adding alias GUIDs to ipoib,
> and ipvlan L2 is covered by the current behavior. Perhaps it would be
> beneficial to try and make ipvlan more generic so that it would work
> over ipoib, giving us support for L3 mode.

Yes, macvlan seems very well covered already by IPoIB child
interfaces, and I don't see too many reasons to worry about changing
that.

ipvlan on the other hand, as you observe, is valuable for many reasons.

> As for rdma-cm support, the patch I had for ipoib attempts to scan each
> child's upper devices in order to support such topologies. We only
> tested it with bonding, but I think it would also work with such devices.

.. it is so sketchy :|

Firstly: I still think the prior discussion is right, and proceeding
along the reworking of the ingress side of rdma-cm and focusing on the
device,guid,pkey makes 100% sense and will progress things right
away. Every other variation seems to build on that.

But when we get into bonding and the various vlan things, we lose
encapsulation - snooping the children list to guess what the bonding
driver is doing seems very hacky.

Discussion idea: Can we actually use the netstack to process the
RDMA-CM packets? It looks like the netstack wants a skb to do this
mid-layer work, so rdma-cm would have to synthesize a skb for the CM
packets and pass it through netdev to apply all the transformations
and access the various internal states (eg from ipvlan, bonding,
etc). rdma-cm would have to 'catch' the skb once it is done traveling
and resume its normal processing. Very similar to your notion of using
UDP, but without any on-the-wire change.

This would fit in that same ingress spot I suggested adding the
routing lookup, instead of routing we want the full stack to have a go
at figuring out the final netdev.

This seems the most general because it will work for all the *vlan
type drivers, bonding, and all of the RDMA technologies. (each would
have a slightly different way to make the skb, but same basic idea)

Lots and lots of details to do that, but conceptually it seems pretty
solid?

> Yes, for RoCE our goal for the start was to support namespaces in RDMA
> CM through macvlan devices. As long as we can update the RoCE gid table
> correctly for macvlan and ipvlan devices, the RDMA CM implementation
> shouldn't care where the details come from.

Hurm, the gid index tagged on the QP1 packet should not be directly
used for much on ingress. rdma-cm will have to recover the mac address
and vlan to use that as a guide.

Synchronizing the gid table and all the internal state in macvlan,
ipvlan, bonding seems very hard, I do not envy your task :(

> > Any thoughts on the idea we still need ipoib same-guid children if
> > ipvlan is available?
> If we port ipvlan to work over IPoIB interfaces and not just Ethernet,
> then ipvlan L2 would provide exactly the same functionality. The only
> difference I can think of is that ipvlan would use a single UD QP for
> all devices (and in connected-mode, a single RC QP between a pair of
> hosts), while ipoib would use a QP per child device, and multiple RC QPs
> for such pairs.

Agree with this.

Jason


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-08 Thread Haggai Eran
On 04/06/2015 19:40, Jason Gunthorpe wrote:
> Discussion idea: Can we actually use the netstack to process the
> RDMA-CM packets? It looks like the netstack wants a skb to do this
> mid-layer work, so rdma-cm would have to synthesize a skb for the CM
> packets and pass it through netdev to apply all the transformations
> and access the various internal states (eg from ipvlan, bonding,
> etc). rdma-cm would have to 'catch' the skb once it is done traveling
> and resume its normal processing. Very similar to your notion of using
> UDP, but without any on-the-wire change.
> 
> This would fit in that same ingress spot I suggested adding the
> routing lookup, instead of routing we want the full stack to have a go
> at figuring out the final netdev.
> 
> This seems the most general because it will work for all the *vlan
> type drivers, bonding, and all of the RDMA technologies. (each would
> have a slightly different way to make the skb, but same basic idea)
> 
> Lots and lots of details to do that, but conceptually it seems pretty
> solid?

The problem is that the network stack can do all sorts of changes to the
packets (like NAT), and it may be the case that the hardware can't
reflect these changes later on when creating a QP. I think it would be
best to stick with resolving the net_dev using the request parameters,
and the simpler routing lookup. This way RDMA CM remains in control, and
if the user configures routing in an unexpected way, it can just block
the request.

Haggai


Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM

2015-06-08 Thread Jason Gunthorpe
On Mon, Jun 08, 2015 at 10:52:34AM +0300, Haggai Eran wrote:
> On 04/06/2015 19:40, Jason Gunthorpe wrote:
> > Discussion idea: Can we actually use the netstack to process the
> > RDMA-CM packets? It looks like the netstack wants a skb to do this
> > mid-layer work, so rdma-cm would have to synthesize a skb for the CM
> > packets and pass it through netdev to apply all the transformations
> > and access the various internal states (eg from ipvlan, bonding,
> > etc). rdma-cm would have to 'catch' the skb once it is done traveling
> > and resume its normal processing. Very similar to your notion of using
> > UDP, but without any on-the-wire change.
> > 
> > This would fit in that same ingress spot I suggested adding the
> > routing lookup, instead of routing we want the full stack to have a go
> > at figuring out the final netdev.
> > 
> > This seems the most general because it will work for all the *vlan
> > type drivers, bonding, and all of the RDMA technologies. (each would
> > have a slightly different way to make the skb, but same basic idea)
> > 
> > Lots and lots of details to do that, but conceptually it seems pretty
> > solid?
> 
> The problem is that the network stack can do all sort of changes to the
> packets (like NAT), and it may be the case that the hardware can't
> reflect these changes later on when creating a QP.

Yes, I am aware of that, but there are also a lot of things netdev can
do that we can realize, like netfilter rules to block packets, for
instance.

Ignoring NAT is a bad choice as well, the best would be to drop on
NAT. It would be easy to detect if the netstack mangled the REQ skb
packet, for instance.
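
A toy model of the "drop on NAT" idea, under the assumption that the stack can be treated as a chain of hooks that either pass, rewrite or drop the REQ. `ReqHeader`, `process_req` and the hook functions are hypothetical names; a real implementation would push a synthesized skb through netdev rather than call Python functions.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class ReqHeader:
    """The addressing fields of a CM REQ that the stack might rewrite."""
    src_ip: str
    dst_ip: str
    dst_port: int


# A hook returns the (possibly rewritten) header, or None to drop it.
Hook = Callable[[ReqHeader], Optional[ReqHeader]]


def process_req(req: ReqHeader, hooks: Iterable[Hook]) -> Optional[ReqHeader]:
    """Run the REQ through every hook; reject it if dropped or mangled."""
    current = req
    for hook in hooks:
        result = hook(current)
        if result is None:
            return None          # filtered, e.g. by a netfilter-style rule
        current = result
    if current != req:
        return None              # addresses were rewritten (NAT): drop
    return current


def allow_all(req: ReqHeader) -> Optional[ReqHeader]:
    return req


def block_port_7471(req: ReqHeader) -> Optional[ReqHeader]:
    return None if req.dst_port == 7471 else req


def dnat_to_backend(req: ReqHeader) -> Optional[ReqHeader]:
    return replace(req, dst_ip="192.168.0.10")
```

The comparison at the end is the whole point: filtering decisions made by the stack are honored, while a rewrite that the hardware could not reflect later is turned into a reject instead of being silently ignored.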

We can't track netdev after the QP is created, but totally ignoring
one thing while re-implementing others seems like a bad idea, long
term...

> I think it would be best to stick with resolving the net_dev using
> the request parameters, and the simpler routing lookup. This way
> RDMA CM remains in control, and if the user configures routing in an
> unexpected way, it can just block the request.

As I said, I think that is fine for the immediate IB support, but when
you start talking about roce and emulating macvlan and ipvlan.. Then
it starts to look really bad. At least think it through carefully
before posting those series.

Jason