Re: rdma kernel tree

2015-05-27 Thread Or Gerlitz
Doug,

Do you have everything in place such that what you stage for upstream
is subject to linux-next auto merge tests and such?

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs

2015-05-27 Thread Or Gerlitz

On 5/27/2015 12:11 AM, Jason Gunthorpe wrote:

On Tue, May 26, 2015 at 04:32:58PM -0400, Doug Ledford wrote:


  - ifcfg/udev/networkmanager: So what happens when I do
 ip link add link ib0 name ib0.1 type ipoib
And get two IPoIB interfaces with the same GUID? I doubt any sane
user would want to apply the same config to those two interfaces.

No, they probably don't want to apply the same rules to both interfaces.
I'm not entirely sure I agree with the argument though.  I fully
expected this to fail without a pkey argument on the ip command
line.

Does that matter to the above tools? Are they using PKey,GUID as their
key?


The net stack doesn't allow users to do the same thing with Ethernet
devices, so I'm not sure we shouldn't be disallowing this as opposed to
creating duplicate devices that are identical in all ways except name.

The netstack doesn't allow it for ethernet because it would create a
2nd identical LLADDR, and LLADDRs must be unique.

Because the QPN is part of the LLADDR IB can create two interfaces on
the same physical port that are completely separated by hardware. Read
Haggi's email, he explains how they plan to use this to create
interfaces that can be delegated to namespaces. It is not a bad idea
really..

So prepare for a world where each namespace has a child IPoIB
interface with a unique QPN, but the same Pkey and GUID as the
host. The breakage from assuming GUID == unique will become a problem.


Unbreaking it is a UAPI change, not impossible, but do we really care
enough about 8 or 20 to push for that?

In truth, at least right now, it's all moot.  Since we can't set the
subnet prefix, the qpn, or the flags, anything above 8 bytes is
immutable regardless of how many bytes we pass in.  So even if we say we
aren't going to change the UAPI and for everything to 20, the real world
result is that 8 works exactly the same and has no functional
difference.

Not quite, in the 20 byte format the 8 bytes of the GUID are in the
last 8/20 bytes, so the app would have to place 12 zeros and then the
GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID)

This is why the question of 'what is ILFA_VF_MAC' is so important,
every option presented (MAC,GUID,LLADDR) are incompatible with each
other.


I agree with Doug that to be practical here, libvirt and Co. would 
really want to use rtnetlink based provisioning of IB VFs, at least in a 
similar manner done for Eth VFs.


So with this assumption at hand, my vote goes to having user-space to 
provide the eight bytes of vGUID through the ndo_set_vf_mac call into 
IPoIB.


I don't see the real value of user space providing the four zero bytes 
(19-16) and the 8 bytes of the subnet prefix provided by the SM.


My personal thinking is that the important thing to address is 
consistency between what the virtualization system provisions on the 
host (ndo_set_vf_mac) to the DHCP server scheme they build.


Do we have a go here?

Also few comments on DHCP:

If we're talking on different vlans/Eth or pkey/IB - it's totally OK for 
two entities (== IPoIB instances under IB) on the physical subnet to use 
the same identifier (IB/GUID, Eth/MAC) if they are on two different L2 
broadcast domains. The DHCP server is expected to have a different 
mapping scheme per such virtual L2 subnet.


For SRIOV, we don't expect two VFs on the network to use the same vGUID, 
so DHCP wise we should be OK. Today the Client-ID works fine for SRIOV 
schemes which are based on 8byte vGUIDs.


Re two IPoIB child devices using the same GUID and the same pkey, we can 
enhance the system and take advantage of IB Alias GUIDs which today are 
only used for SRIOV for Para-Virtual and other environments too, thanks 
for the heads up on the necessity of doing so.





What does get return? If we accept 8 or 20, then get must return 20.

The get has to return 20 regardless.  It's the only accepted means of
getting all 20 bytes of the LLADDR.

You are conflating IFLA_ADDRESS and IFLA_VF_MAC.

IFLA_VF_MAC could be 8 byte and IFLA_ADDRESS could be 20, I think that
makes no sense, but it wouldn't break existing stuff.



Just to make sure we're on the same page, this thread deals with using 
rtnetlink's IFLA_VF_MAC(== struct ifla_vf_mac) for provisioning vGUID 
for IB VFs, through the PF IPoIB interface, not attempting to use 
IFLA_ADDRESS.


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device

2015-05-27 Thread Or Gerlitz
On Wed, May 27, 2015 at 1:07 AM, Jason Gunthorpe
jguntho...@obsidianresearch.com wrote:
 On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote:
  Jason, you (U2 BTW) play really, really hard - refusing to say **one**
  word on your approach towards the built-in udata mechanism for uverbs
  which I asked you to comment on.

  And I asked to see the user space side and you have angrily refused
  every time.

 AFAIR I never ever refused to show any piece of code which went under
 my hands towards Linux to any-one.

 For future reference, when someone asks a question and you go off on
 an tangental rant and ignore the question, then that process repeats,
 still without answering the question - most english speakers would
 call that refusing to answer the question. It is not looked upon kindly.

Jason,

It's not that you asked to see the code ala hey, do you happen to
have a git with
the user space code for people to inspect while doing the review on
the kernel part, but
rather U2 saying in a definitive manner that posting the user space
code should be
imposed as pre-requirement to acceptance of the kernel parts.

In parallel, U2 totally rejected our usage of udata @ on the spot and
when I mentioned
that it's a feature which was designed for that purpose exactly and
from day one, it took
me three reminders to get a you know what, maybe that can fly
comment from you.

So here I started to realize that there's something in the attitude
that goes beyond
the details, and I made the you're not the boss comment.


 I'm really confused why you didn't just post the github links last
 week, the patches are all a month old on there. Was it really so
 offensive to you that we wanted to review the kernel UAPI patches and
 verbs patches together?

  So I guess we are both playing hard.

 I disagree, you act as sort of being the boss here, stating every now
 and then your preferences and way of engineering things as the
 ultimate guidelines for Linux and/or RDMA engineering.

 Lets be clear Or, I have given you (and others) some very pointed
 comments and advice, privately and publicly. That is not 'being the
 boss' that is contributing to fix our community.

 When it comes to my patch comments, I give direction on what I want to
 see to provide my Reviewed-By.

 If you don't like it, then find someone else to review your code.

 I'm busy, and I don't work for you. If I don't want to review some
 patches because my questions have been ignored, then that is entirely
 my perogative.

 Pinging me *three times this week* on this stupid timestamp thing, is
 somewhere between annoying and offensive.

See above, why I made these pings.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


out of order packets when querying all ports in a switch through OpenSM

2015-05-27 Thread Jesus Camacho Villanueva
Hello,

I am trying to send a port info request to all the ports in a switch,
one after the other.

I am sending all requests to the same switch in sequence through the
same directed routing path:

Request port number 1
Request port number 2
Request port number 3
Request port number 4

But sometimes I get responses out of order:

Response port number 2
Response port number 1
Response port number 3
Response port number 4

Is it possible that the switch is processing port requests out of
order (or maybe I am doing something wrong)?

Best regards,
Jesus
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: out of order packets when querying all ports in a switch through OpenSM

2015-05-27 Thread Hal Rosenstock
Hi Jesus,

On 5/27/2015 8:56 AM, Jesus Camacho Villanueva wrote:
 Hello,
 
 I am trying to send a port info request to all the ports in a switch,
 one after the other.
 
 I am sending all requests to the same switch in sequence through the
 same directed routing path:
 
 Request port number 1
 Request port number 2
 Request port number 3
 Request port number 4
 
 But sometimes I get responses out of order:
 
 Response port number 2
 Response port number 1
 Response port number 3
 Response port number 4
 
 Is it possible that the switch is processing port requests out of
 order (or maybe I am doing something wrong)?

One way this could occur is that request to switch for port number 1
either does not make it to switch or response does not make it back and
it times out and is retransmitted by kernel. This depends on how umad is
setup in terms of retries and timeout.

Is it always the same switch or switch type ? What switch is it ?

-- Hal

 Best regards,
 Jesus
 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rdma kernel tree

2015-05-27 Thread Doug Ledford
On Wed, 2015-05-27 at 12:50 +0300, Or Gerlitz wrote:
 Doug,
 
 Do you have everything in place such that what you stage for upstream
 is subject to linux-next auto merge tests and such?
 
 Or.

No, I have 0day in place, but not linux-next to my knowledge (unless
Stephen already set it up, but I haven't asked him to yet).

Stephen, can you set it up so that the linux-next tag from
git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git is added
to the linux-next testing please?

-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD



signature.asc
Description: This is a digitally signed message part


Re: [PATCH 02/14] IB/mad: Create an RMPP Base header

2015-05-27 Thread ira.weiny
On Wed, May 20, 2015 at 04:03:33PM -0600, Jason Gunthorpe wrote:
 On Wed, May 20, 2015 at 05:23:45PM -0400, ira.weiny wrote:
  On Wed, May 20, 2015 at 12:27:35PM -0600, Jason Gunthorpe wrote:
   On Wed, May 20, 2015 at 04:13:23AM -0400, ira.we...@intel.com wrote:
-struct ib_rmpp_mad {
+struct ib_rmpp_base {
struct ib_mad_hdr   mad_hdr;
struct ib_rmpp_hdr  rmpp_hdr;
+} __packed;
+
+struct ib_rmpp_mad {
+   struct ib_rmpp_base base;
u8  data[IB_MGMT_RMPP_DATA];
 };
   
   Why can't we just use:
   
u8 data[];
   
   And replace various sizeof(ib_rmpp_mad) with a rmpp_payload_size()
   call?
  
  I don't think it makes much difference.  I think there is just 1 place we 
  use
  that.
 
 There are many lines that are just churning rmpp_mad to
 rmpp_base, using a flex array avoids that entirely.

Ok.

 
 Similarly patch 12 doesn't have to introduce a opa specific structure anymore.

Ok.

 
 The same sort of comment probably applies to the mad_hdr/etc as well,
 but I didn't look very closely.

This would mean changing all the existing drivers which I'm not really keen on
doing.  Existing drivers using IBTA MADs expect the MADs to be exactly 256
bytes and that is exactly how struct ib_mad is defined.  So I think we need to
keep the definitions for drivers to use.

However, I think we can clean up the use of sizeof(ib_mad) in the MAD code such
that it is agnostic to the device/port type once the MAD size patch is in
place.  I'll look at doing that early in the series.

Ira

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs

2015-05-27 Thread Jason Gunthorpe
On Wed, May 27, 2015 at 01:53:11PM -0400, Doug Ledford wrote:

  This basic idea is exactly why I think we should stick with the 20
  byte LLADDR for ILFA_VF_MAC. It gives a route for the PF to tell the
  VF what QPN to use for IPoIB (if we ever see HW support to implement that)
  
  If we use 8 bytes then that route is closed off forever.
 
 And that's exactly as it should be.  If we allow setting all 20 bytes
 via the VF_MAC calls, then we violate the guests should behave like
 they are on bare metal as much as possible rule.  As a host, we get a
 GUID and if we want to control the QPNs for IPoIB (and indeed if we
 want

No, that doesn't really work. The PF's QPN pool is shared between all
VF guests. The VF HCA does a call to the PF driver to get a free QPN
to use. The VM cannot unilaterally select a QPN without global
coordination.

In other words, if a QPN is to be forced for IPoIB then it must come
from the PF, which must get it from the orchestration software, which
must globally coordinate to ensure the QPN is available on every PF
the VM may want to run on. Similar to how the MAC must be globally
unique.

Will this ever be done? Probably not. But if it was, this interface
would be the sanest way to go.

Anyhow - as far as I'm concerned, fixing the IFLA_VF_MAC to have a
proper length nullifies my concern. If we really needed we can switch
to a 20 byte format with good compatability options in the future once
we have a proper length.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device

2015-05-27 Thread Jason Gunthorpe
On Wed, May 27, 2015 at 02:54:12PM +0300, Or Gerlitz wrote:
 On Wed, May 27, 2015 at 1:07 AM, Jason Gunthorpe
 jguntho...@obsidianresearch.com wrote:
  On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote:
   Jason, you (U2 BTW) play really, really hard - refusing to say **one**
   word on your approach towards the built-in udata mechanism for uverbs
   which I asked you to comment on.
 
   And I asked to see the user space side and you have angrily refused
   every time.
 
  AFAIR I never ever refused to show any piece of code which went under
  my hands towards Linux to any-one.
 
  For future reference, when someone asks a question and you go off on
  an tangental rant and ignore the question, then that process repeats,
  still without answering the question - most english speakers would
  call that refusing to answer the question. It is not looked upon kindly.
 
 Jason,
 
 It's not that you asked to see the code ala hey, do you happen to
 have a git with the user space code for people to inspect while
 doing the review on the kernel part, but rather U2 saying in a
 definitive manner that posting the user space code should be imposed
 as pre-requirement to acceptance of the kernel parts.

I really didn't Or:

First ask:
 'We can't really look at the uapi changes here without also seeing the
  verbs side changes.'
 (I know others on the list feel the same, so I use 'we')

Second ask:
 'Lets see the verbs side and I'll let you know.'
(.. to your questions based on my review comments ..)

Third ask:
 'I think we need to have the same policy.'
(.. To Yann's point that other kernel communities have a mandatory
  UAPI policy)

Fourth (exasperated) ask:
 'Was it not clear? Yann and I asked to see the user space side before
  reviewing this series further.'

I know you are ESL, and I cut you alot of slack, but *come on* - that
is incredibly soft language, and certainly not bossing and imposing in
a definitive manner a blanket requirement on all patches.

Advice: You would be well served to spend a bit more time on your
emails. I have no idea what 'but rather U2 saying' means, for
instance. Sometimes I just guess at what you are trying to say :|

That is the price we pay for an inclusive international community, but
everyone needs to be careful before starting a flame war based on
percived slight in the text and phrasing of a message. email is hard.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs

2015-05-27 Thread Doug Ledford
On Wed, 2015-05-27 at 11:11 -0600, Jason Gunthorpe wrote:
 On Wed, May 27, 2015 at 10:14:06AM -0400, Doug Ledford wrote:
   Because the QPN is part of the LLADDR IB can create two interfaces on
   the same physical port that are completely separated by hardware. Read
   Haggi's email, he explains how they plan to use this to create
   interfaces that can be delegated to namespaces. It is not a bad idea
   really.. 
  
  Yes, it is actually.  The whole reason we went to GUID matching long ago
  was because of this exact issue.
 
 I reflected on this some more last night, and yes, I am leaning toward
 'bad idea' direction too.
 
 Too much stuff breaks if you create multiple children with the same
 pkey/guid:
  - RDMA CM cannot disambiguate CM packets between them
  - DHCP cannot tell them apart
  - Net scripts/network manager won't work
  - IPv6 becomes totally broken
 
 That means the namespace stuff will have to create children using GUID
 aliases..

Glad we agree on that ;-)

  The *only* way this will ever be a workable item is if we A) reserve a
  number of queue pairs from the driver specifically for IPoIB use and B)
  specify which queue pairs go to which IPoIB devices at IPoIB module
  load
 
 This basic idea is exactly why I think we should stick with the 20
 byte LLADDR for ILFA_VF_MAC. It gives a route for the PF to tell the
 VF what QPN to use for IPoIB (if we ever see HW support to implement that)
 
 If we use 8 bytes then that route is closed off forever.

And that's exactly as it should be.  If we allow setting all 20 bytes
via the VF_MAC calls, then we violate the guests should behave like
they are on bare metal as much as possible rule.  As a host, we get a
GUID and if we want to control the QPNs for IPoIB (and indeed if we want
to control how many IPoIB interfaces and on what P_Keys) then we must
create config files in /etc/sysconfig/network-scripts (on Red Hat,
similar requirements on other distros) that would instruct the OS to
create exactly what we want.  But, the key point is that we are only
given a GUID, and we must create everything else from our config files.
Guests should be the same way.  They only get the GUID to start, then
the guest disk image with its self contained configuration will take
over and control the rest.

   Not quite, in the 20 byte format the 8 bytes of the GUID are in the
   last 8/20 bytes, so the app would have to place 12 zeros and then the
   GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID)
   
   This is why the question of 'what is ILFA_VF_MAC' is so important,
   every option presented (MAC,GUID,LLADDR) are incompatible with each
   other.
  
  For Ethernet devices, it's the MAC.  The equivalent of MAC on IB is the
  GUID.  I would leave it at that.
 
 Yes, both arguments can be made:
   - Our netlink end point is targetting an IPoIB interface, and
 the equivelent to an Ethernet MAC in IPoIB language is the LLADDR.
   - Our netlink interface is targetting the hardware under the IPoIB
 interface and that MAC equivilent is the GUID
 
  IPoIB devices are constructs on top of
  the GUID/link, and you can have 10 IPoIB interfaces between the parent
  and children, but we don't need to specify all of those LLADDRs, we just
  need to give a unique GUID and allow the guest OS to create their own
  IPoIB devices on top of that.
 
 As I've said, I would like to see netdev review that idea before we
 merge any patches..
 
 There are pragmatic downsides to the 8 byte choice: Userspace
 completely looses the ability to size the address without a table
 based on link type. That is terrible in the context of netlink's
 design.

Well, let's just be clear: netlink/iproute2 screwed the pooch on their
implementation of this stuff.  Any solution that doesn't include fixing
this up in some way is not really a good solution.

  For instance iproute2 would need IB specific code to format
 the 'ip link show' (review print_vfinfo in iproute2) and to length
 check 'ip link set vf mac'
 
 If we do use 8, then it would be ideal (and my strong preference) to
 also fix the IFLA_VF_MAC message to have a working length. I think
 that could be done compatibly with a bit of work. At least that way
 iproute2 can be kept clean when it learns to do IB, and we could have
 the option again of using 20 someday if we need.
 
 So to be clear, to go with the 8 byte option I suggest:
  - Engage netdev/iproute and confirm they are philosophically OK
with IFLA_VF_MAC != IFLA_ADDRESS
  - Make a kernel patch to properly size the IFLA_VF_MAC message
  - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo
instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat)
  - Drop in the IB patch

Sounds like a reasonable plan.

Or, this is your patch set, are you going to pick up these action items?

-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD



signature.asc
Description: This is a digitally signed message part


Re: out of order packets when querying all ports in a switch through OpenSM

2015-05-27 Thread Jesus Camacho Villanueva
The switches are unmanaged.
Is it possible to ensure in-order packets with managed switches?

Jesus

On Wed, May 27, 2015 at 4:30 PM, Hal Rosenstock h...@dev.mellanox.co.il wrote:
 On 5/27/2015 10:16 AM, Jesus Camacho Villanueva wrote:
 Hi Hal,

 Thanks for your quick response :)

 What you say here makes sense. I have seen this behaviour in one of
 the switches, but I don't discard that this problem can arise in other
 switches.

 Note that for MADs, transaction ID is looser than traditional
 transaction ID semantics in that the combination of TID, SGID, and
 MgmtClass is different from that of any currently executing operation
 and does not imply ordering. If there is ordering required, it is the
 responsibility of the manager to enforce that.

 The switch is: Infiniscale-IV Mellanox Technologies 4xQDR

 Is it a managed or unmanaged switch ?

 -- Hal
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RESEND PATCH V3 for-next 0/3] HW Device hot-removal support

2015-05-27 Thread Doug Ledford
On Tue, 2015-05-19 at 16:17 +, Liran Liss wrote:
  From: Hefty, Sean [mailto:sean.he...@intel.com]
 
   these remaining resources may be device-specific.
   The proposed framework first of all allows a provider to indicate
   whether hot-removal is supported (i.e., the presence of the
   'disassociate_ucontext' callback), and if so, allow the provider to
   perform the proper cleanup so that the corresponding user-space driver
   will continue to function.
  
  The approach seems strange.  The driver knows that it is being removed.  It
  was informed up front which open contexts were associated with user space
  processes.  But the driver calls up to indicate that it is being removed, 
  so that
  the upper layer can call back down to tell the driver to process the 
  removal.
  
  I wasn't asking what this series did.  I was asking why the uverbs driver 
  just
  can't delete the underlying resources.  That's the natural thing to expect.
 
 I suppose that the main issue would be handling existing user memory mappings,
 which cannot be just invalidated -- the user-space driver may not be aware of 
 the
 device removal and may access this memory concurrently, and we don't want it
 to crash.

In this case, you are mapping it out of the device BAR space and into a
random kernel page, yes?  So, if the driver doesn't catch the
DRIVER_FATAL event and process that to mean don't bother touching this
RDMA device any more, it's going to write to a mailbox that no longer
responds and have infinite timeouts, yes?  Essentially meaning all
mailbox type operations just go into lala land from here on out, right?

 The meaning of these mappings is device specific: some devices only write 
 them,
 while others read them, expecting some specific format. That's why we need
 device-specific code for this.
 
 While it is true that the device initiates the removal process, the current 
 flow is
 to let every ib_client first detach itself before device-specific cleanups. 
 In this regard,
 ib_uverbs is just another client.
 In addition, the per-open (fd) state is held in ib_ucontext, which is 
 maintained by
 ib_uverbs. The device driver doesn't hold a duplicate list of open HCA 
 handles, so
 it relies on ib_uverbs to iterate the relevant contexts and trigger the 
 device-specific
 stuff.
 


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD



signature.asc
Description: This is a digitally signed message part


Re: out of order packets when querying all ports in a switch through OpenSM

2015-05-27 Thread Hal Rosenstock
On 5/27/2015 11:31 AM, Jesus Camacho Villanueva wrote:
 The switches are unmanaged.
 Is it possible to ensure in-order packets with managed switches?

No; same issue as I previously mentioned exists with both.

Unmanaged switches just use firmware whereas managed switches have CPU
and kernel in addition to firmware so the system is more complex.

-- Hal

 Jesus
 
 On Wed, May 27, 2015 at 4:30 PM, Hal Rosenstock h...@dev.mellanox.co.il 
 wrote:
 On 5/27/2015 10:16 AM, Jesus Camacho Villanueva wrote:
 Hi Hal,

 Thanks for your quick response :)

 What you say here makes sense. I have seen this behaviour in one of
 the switches, but I don't discard that this problem can arise in other
 switches.

 Note that for MADs, transaction ID is looser than traditional
 transaction ID semantics in that the combination of TID, SGID, and
 MgmtClass is different from that of any currently executing operation
 and does not imply ordering. If there is ordering required, it is the
 responsibility of the manager to enforce that.

 The switch is: Infiniscale-IV Mellanox Technologies 4xQDR

 Is it a managed or unmanaged switch ?

 -- Hal
 

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications

2015-05-27 Thread Doug Ledford
On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote:
 On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote:
 
  +   struct srcu_struct  disassociate_srcu;
 
 There is no need for rcu for this, use a rw sem.

The rcu was used becuase it's on the hot path I assume.  Do we have
numbers on whether a rwsem vs. an rcu matters performance wise?  If the
rcu actually helps performance, then I'm inclined to leave it, but if it
doesn't, then I'd agree that a rwsem is simpler and easier to deal with.

  @@ -1326,6 +1327,13 @@ ssize_t ib_uverbs_create_comp_channel(struct 
  ib_uverbs_file *file,
  return -EFAULT;
  }
   
  +   /* Taking ref count on uverbs_file to make sure that file won't be
  +* freed till that event file is closed. It will enable accessing the
  +* uverbs_device fields as part of closing the events file and making
  +* sure that uverbs device is available by that time as well.
  +* Note: similar is already done for the async event file.
  +   */
  +   kref_get(file-ref);
 
 Is this a bug today? It doesn't look like it, but this stuff does look wrong.
 
 Woulnd't this would make more sense for ib_uverbs_alloc_event_file to
 unconditionally grab the kref and unconditionally release it on
 release? 
 
 The existing code for this looks broken, in ib_uverbs_get_context all
 the error paths between ib_uverbs_alloc_event_file and the
 kref_get(file-ref) are wrong - the will result in fput() which will
 call ib_uverbs_event_close, which will try to do kref_put and
 ib_unregister_event_handler - which are no longer paired.
 
 [I recommend moving the kref_get and ib_register_event_handler into
  ib_uverbs_alloc_event_file, so the 'create' and 'destroy' code paths
  are clearly paired instead of being partially open coded in call
  sites]
 
 Fix all this in a seperate patch to add the needed change in kref
 semantics please.

Seconded.

  -   if (!try_module_get(dev-ib_dev-owner)) {
  -   ret = -ENODEV;
  +   mutex_lock(dev-disassociate_mutex);
  +   if (dev-disassociated) {
  +   ret = -EIO;
  goto err;
  }
   
  -   file = kmalloc(sizeof *file, GFP_KERNEL);
  +   /* In case IB device supports disassociate ucontext, there is no hard
  +* dependency between uverbs device and its low level device.
  +*/
  +   module_dependent = !(dev-flags  UVERBS_FLAG_DISASSOCIATE);
  +
  +   if (module_dependent) {
  +   if (!try_module_get(dev-ib_dev-owner)) {
  +   ret = -ENODEV;
  +   goto err;
 
 Again? Why I do I keep pointing this same basic thing to Mellanox
 people:
 
  If you hold a X then you hold the ref to X as well.
 
 So, if the core code is holding function pointers to module code, then
 the core code holds a module ref. When the core code null's those
 function pointers, then it can release the module ref.

Seconded.

 This might work today like this (I'm not entirely sure), but it makes
 no sense at all.
 
 I'll look more closely in a few weeks once the rwsem change is done.


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD



signature.asc
Description: This is a digitally signed message part


Re: [PATCH 05/36] HMM: introduce heterogeneous memory management v3.

2015-05-27 Thread Jerome Glisse
On Wed, May 27, 2015 at 11:20:05AM +0530, Aneesh Kumar K.V wrote:
 j.gli...@gmail.com writes:

Noted your grammar fixes.

  diff --git a/mm/Kconfig b/mm/Kconfig
  index 52ffb86..189e48f 100644
  --- a/mm/Kconfig
  +++ b/mm/Kconfig
  @@ -653,3 +653,18 @@ config DEFERRED_STRUCT_PAGE_INIT
when kswapd starts. This has a potential performance impact on
processes running early in the lifetime of the systemm until kswapd
finishes the initialisation.
  +
  +if STAGING
  +config HMM
  +   bool Enable heterogeneous memory management (HMM)
  +   depends on MMU
  +   select MMU_NOTIFIER
  +   select GENERIC_PAGE_TABLE
 
 What is GENERIC_PAGE_TABLE ?

Let over of when patch 0006 what a seperate feature that was introduced
before this patch. I failed to remove that chunk. Just ignore it.

Cheers,
Jérôme
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: out of order packets when querying all ports in a switch through OpenSM

2015-05-27 Thread Jesus Camacho Villanueva
Hi Hal,

Thanks for your quick response :)

What you say here makes sense. I have seen this behaviour in one of
the switches, but I don't discard that this problem can arise in other
switches.

The switch is: Infiniscale-IV Mellanox Technologies 4xQDR

Jesus

On Wed, May 27, 2015 at 3:55 PM, Hal Rosenstock h...@dev.mellanox.co.il wrote:
 Hi Jesus,

 On 5/27/2015 8:56 AM, Jesus Camacho Villanueva wrote:
 Hello,

 I am trying to send a port info request to all the ports in a switch,
 one after the other.

 I am sending all requests to the same switch in sequence through the
 same directed routing path:

 Request port number 1
 Request port number 2
 Request port number 3
 Request port number 4

 But sometimes I get responses out of order:

 Response port number 2
 Response port number 1
 Response port number 3
 Response port number 4

 Is it possible that the switch is processing port requests out of
 order (or maybe I am doing something wrong)?

 One way this could occur is that request to switch for port number 1
 either does not make it to switch or response does not make it back and
 it times out and is retransmitted by kernel. This depends on how umad is
 setup in terms of retries and timeout.

 Is it always the same switch or switch type ? What switch is it ?

 -- Hal

 Best regards,
 Jesus
 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: out of order packets when querying all ports in a switch through OpenSM

2015-05-27 Thread Hal Rosenstock
On 5/27/2015 10:16 AM, Jesus Camacho Villanueva wrote:
 Hi Hal,
 
 Thanks for your quick response :)
 
 What you say here makes sense. I have seen this behaviour in one of
 the switches, but I don't discard that this problem can arise in other
 switches.

Note that for MADs, transaction ID is looser than traditional
transaction ID semantics in that the combination of TID, SGID, and
MgmtClass is different from that of any currently executing operation
and does not imply ordering. If there is ordering required, it is the
responsibility of the manager to enforce that.

 The switch is: Infiniscale-IV Mellanox Technologies 4xQDR

Is it a managed or unmanaged switch ?

-- Hal

 
 Jesus
 
 On Wed, May 27, 2015 at 3:55 PM, Hal Rosenstock h...@dev.mellanox.co.il 
 wrote:
 Hi Jesus,

 On 5/27/2015 8:56 AM, Jesus Camacho Villanueva wrote:
 Hello,

 I am trying to send a port info request to all the ports in a switch,
 one after the other.

 I am sending all requests to the same switch in sequence through the
 same directed routing path:

 Request port number 1
 Request port number 2
 Request port number 3
 Request port number 4

 But sometimes I get responses out of order:

 Response port number 2
 Response port number 1
 Response port number 3
 Response port number 4

 Is it possible that the switch is processing port requests out of
 order (or maybe I am doing something wrong)?

 One way this could occur is that request to switch for port number 1
 either does not make it to switch or response does not make it back and
 it times out and is retransmitted by kernel. This depends on how umad is
 setup in terms of retries and timeout.

 Is it always the same switch or switch type ? What switch is it ?

 -- Hal

 Best regards,
 Jesus
 --
 To unsubscribe from this list: send the line unsubscribe linux-rdma in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html


 

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs

2015-05-27 Thread Doug Ledford
On Tue, 2015-05-26 at 15:11 -0600, Jason Gunthorpe wrote:
 On Tue, May 26, 2015 at 04:32:58PM -0400, Doug Ledford wrote:
 
  Not so much ethernet world as netdevice world.  The iproute2 program is
  used to configure any and all netdevices, ethernet or otherwise.  Right
  now, we can abuse it to do the same here, but it uses the netdevice ndo_
  ops, not rtnetlink to accomplish what it does, so we are limited in how
  we do thing if we want to maintain tool usage.
 
 Hmm? iproute2 does it over rtnetlink?

Yes, it does.  Sorry for being imprecise.  It uses the specific netlink
packet that maps back to the predefined ndo_ entry point and so you
can't use that netlink on anything other than a netdevice with a defined
ndo_ entry point.  We can't use it, for example, to extend the new RDMA
netlink operations that Kaike posted without also making modifications
to iproute2 to know about the new netlink.

   The LLADDR for IPoIB *is* 20 bytes.
   
   Truncating it down is *broken userspace*:
- DHCP: Not sending the full 20 bytes in the client request means the
  server cannot unicast the reply. This causes all sorts of problems
  and is discouraged in the RFCs these days.
  
  Reference?  The RFCs I've read (4390 - 4361 - 3315) list a number of
  options (three at the moment), but the LLADDR options all call for using
 
 I'm talking about this part from RFC 4390:
 
As described above, the link-layer address is unavailable to the DHCP
server because the link-layer address is larger than the chaddr
field length.  As a result, the server cannot unicast its reply to
the client.  Therefore, a DHCP client MUST request that the server
send a broadcast reply by setting the BROADCAST flag when IPoIB
Address Resolution Protocol (ARP) is not possible, i.e., in
situations where the client does not know its IP address.
 
 AFAIK, nobody ever solved this, and it actually does cause real world
 problems for cloud stuff as there is limited randomness in the
 TID. This is the network side of DHCP.
 
  a LLADDR from a device that is a permanent part of the machine (not
  common with add in cards), so the option most commonly used in IB is
  option 2, DUID Assigned by Vendor, aka GUID.  According to that,
  truncating to 8 bytes is precisely what you are supposed to do.  And, at
  least in all current Red Hat products, that's exactly how dhcp client
  creates the client-id.
 
 Using the GUID as the client-id is sort of OK from a policy
 perspective (ie what IP should I use), but it doesn't help the network
 side, and it breaks down completely when you create child interfaces.

Actually, no, it doesn't.  My standard test network inside Red Hat uses
child interfaces and dhcp exclusively and it all works as expected.
This is because each child interface has its own unique pkey and all of
the devices include the pkey in the broadcast address, so one child does
not see another child's broadcast reply, not the parent's.

It does, however, mean that you only want a single link on a given pkey
for a given device, which is why adding a second ipoib link without a
unique pkey seems so broken to me.

 Basically, the dhcp server not having the LLADDR at all is a pretty
 big hack.. No other network I know of runs DHCP like that.

That's how they decided to solve the issue in the RFCs, so that's what
we have.

- ifcfg/udev/networkmanager: So what happens when I do
   ip link add link ib0 name ib0.1 type ipoib
  And get two IPoIB interfaces with the same GUID? I doubt any sane
  user would want to apply the same config to those two interfaces.
  
  No, they probably don't want to apply the same rules to both interfaces.
  I'm not entirely sure I agree with the argument though.  I fully
  expected this to fail without a pkey argument on the ip command
  line.
 
 Does that matter to the above tools? Are they using PKey,GUID as their
 key?

They are pkey aware, yes.  There is the parent device, and all
non-default pkey devices are listed as a pkey device and given a pkey
number and they share the same GUID.  They are considered children of
the parent, just like with vlan setups.

  The net stack doesn't allow users to do the same thing with Ethernet
  devices, so I'm not sure we shouldn't be disallowing this as opposed to
  creating duplicate devices that are identical in all ways except name.
 
 The netstack doesn't allow it for ethernet because it would create a
 2nd identical LLADDR, and LLADDRs must be unique.

And as far as the configuration scripts (at least on Red Hat) as well as
NetworkManager is concerned, the unique requirements are GUID/P_Key.
But, the P_Key doesn't show up in the LLADDR, only in the broadcast
address, so only the GUID is checked in the LLADDR.  All IPoIB devices
are either parent devices (meaning no PKEY field specified in the config
file) in which case they are the first started and match the GUID, or
they are PKEY devices and are started only after their parent 

Re: rdma kernel tree

2015-05-27 Thread Or Gerlitz
On Wed, May 27, 2015 at 4:13 PM, Doug Ledford dledf...@redhat.com wrote:
 On Wed, 2015-05-27 at 12:50 +0300, Or Gerlitz wrote:
 Doug,

 Do you have everything in place such that what you stage for upstream
 is subject to linux-next auto merge tests and such?

 Or.

 No, I have 0day in place, but not linux-next to my knowledge (unless
 Stephen already set it up, but I haven't asked him to yet).

Oh, please rush to make sure this is in place,


 Stephen, can you set it up so that the linux-next tag from
 git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git is added
 to the linux-next testing please?

 --
 Doug Ledford dledf...@redhat.com
   GPG KeyID: 0E572FDD

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications

2015-05-27 Thread Jason Gunthorpe
On Wed, May 27, 2015 at 12:04:29PM -0400, Doug Ledford wrote:
 On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote:
  On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote:
  
   + struct srcu_struct  disassociate_srcu;
  
  There is no need for rcu for this, use a rw sem.
 
 The rcu was used becuase it's on the hot path I assume.

Perhaps, I looked at that a bit, it was used on syscall paths, but
that wasn't even the big reason I made the comment..

The use of RCU is *strange*. There is no pointer, there is no call to
rcu_derference. Stuff protected by the write lock is being touched
without any read side locking (ie ib_uverbs_event_read sure looks
funky). There is no Copy or Update phase as far as I can see. It fails
two of the tests in Documentation/RCU/checklist.txt.

I'm pretty sure what is intended here is that disassociated is
actually an unlocked atomic boolean and the RCU is being used
(abused?) to create both exclusion and a rendezvous..

I gave up at that point, if RCU is not being used properly, asking for
rwsem is a way to force the author to deal with their locking
sanely.

Jasno
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs

2015-05-27 Thread Jason Gunthorpe
On Wed, May 27, 2015 at 10:14:06AM -0400, Doug Ledford wrote:
  Because the QPN is part of the LLADDR IB can create two interfaces on
  the same physical port that are completely separated by hardware. Read
  Haggi's email, he explains how they plan to use this to create
  interfaces that can be delegated to namespaces. It is not a bad idea
  really.. 
 
 Yes, it is actually.  The whole reason we went to GUID matching long ago
 was because of this exact issue.

I reflected on this some more last night, and yes, I am leaning toward
'bad idea' direction too.

Too much stuff breaks if you create multiple children with the same
pkey/guid:
 - RDMA CM cannot disambiguate CM packets between them
 - DHCP cannot tell them apart
 - Net scripts/network manager won't work
 - IPv6 becomes totally broken

That means the namespace stuff will have to create children using GUID
aliases..

 The *only* way this will ever be a workable item is if we A) reserve a
 number of queue pairs from the driver specifically for IPoIB use and B)
 specify which queue pairs go to which IPoIB devices at IPoIB module
 load

This basic idea is exactly why I think we should stick with the 20
byte LLADDR for ILFA_VF_MAC. It gives a route for the PF to tell the
VF what QPN to use for IPoIB (if we ever see HW support to implement that)

If we use 8 bytes then that route is closed off forever.

  Not quite, in the 20 byte format the 8 bytes of the GUID are in the
  last 8/20 bytes, so the app would have to place 12 zeros and then the
  GUID to follow the 20 byte format (or 4 zeros, the prefix, then the GUID)
  
  This is why the question of 'what is ILFA_VF_MAC' is so important,
  every option presented (MAC,GUID,LLADDR) are incompatible with each
  other.
 
 For Ethernet devices, it's the MAC.  The equivalent of MAC on IB is the
 GUID.  I would leave it at that.

Yes, both arguments can be made:
  - Our netlink end point is targetting an IPoIB interface, and
the equivelent to an Ethernet MAC in IPoIB language is the LLADDR.
  - Our netlink interface is targetting the hardware under the IPoIB
interface and that MAC equivilent is the GUID

 IPoIB devices are constructs on top of
 the GUID/link, and you can have 10 IPoIB interfaces between the parent
 and children, but we don't need to specify all of those LLADDRs, we just
 need to give a unique GUID and allow the guest OS to create their own
 IPoIB devices on top of that.

As I've said, I would like to see netdev review that idea before we
merge any patches..

There are pragmatic downsides to the 8 byte choice: Userspace
completely looses the ability to size the address without a table
based on link type. That is terrible in the context of netlink's
design. For instance iproute2 would need IB specific code to format
the 'ip link show' (review print_vfinfo in iproute2) and to length
check 'ip link set vf mac'

If we do use 8, then it would be ideal (and my strong preference) to
also fix the IFLA_VF_MAC message to have a working length. I think
that could be done compatibly with a bit of work. At least that way
iproute2 can be kept clean when it learns to do IB, and we could have
the option again of using 20 someday if we need.

So to be clear, to go with the 8 byte option I suggest:
 - Engage netdev/iproute and confirm they are philosophically OK
   with IFLA_VF_MAC != IFLA_ADDRESS
 - Make a kernel patch to properly size the IFLA_VF_MAC message
 - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo
   instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat)
 - Drop in the IB patch
 
Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device

2015-05-27 Thread Or Gerlitz
On Wed, May 27, 2015 at 9:48 PM, Jason Gunthorpe
jguntho...@obsidianresearch.com wrote:
 On Wed, May 27, 2015 at 02:54:12PM +0300, Or Gerlitz wrote:
 On Wed, May 27, 2015 at 1:07 AM, Jason Gunthorpe
 jguntho...@obsidianresearch.com wrote:
  On Tue, May 26, 2015 at 11:39:04PM +0300, Or Gerlitz wrote:
   Jason, you (U2 BTW) play really, really hard - refusing to say **one**
   word on your approach towards the built-in udata mechanism for uverbs
   which I asked you to comment on.
 
   And I asked to see the user space side and you have angrily refused
   every time.
 
  AFAIR I never ever refused to show any piece of code which went under
  my hands towards Linux to any-one.

  For future reference, when someone asks a question and you go off on
  an tangental rant and ignore the question, then that process repeats,
  still without answering the question - most english speakers would
  call that refusing to answer the question. It is not looked upon kindly.

 Jason,

 It's not that you asked to see the code ala hey, do you happen to
 have a git with the user space code for people to inspect while
 doing the review on the kernel part, but rather U2 saying in a
 definitive manner that posting the user space code should be imposed
 as pre-requirement to acceptance of the kernel parts.

 I really didn't Or:

 First ask:
  'We can't really look at the uapi changes here without also seeing the
   verbs side changes.'
  (I know others on the list feel the same, so I use 'we')

 Second ask:
  'Lets see the verbs side and I'll let you know.'
 (.. to your questions based on my review comments ..)

 Third ask:
  'I think we need to have the same policy.'
 (.. To Yann's point that other kernel communities have a mandatory
   UAPI policy)

 Fourth (exasperated) ask:
  'Was it not clear? Yann and I asked to see the user space side before
   reviewing this series further.'

 I know you are ESL, and I cut you alot of slack, but *come on* - that
 is incredibly soft language, and certainly not bossing and imposing in
 a definitive manner a blanket requirement on all patches.

Jason,

ESL indeed am I, and in that respect, this clarification, even if
being tedious to set or read, helps.

Still, I'd like to further try and get you from where the bossing thing came:

(1) show me the user space code prior to acceptance of the kernel
part never was a requirement on this community since the day we were
born (Q4/2004)

(2) instantly rejecting a usage of a mechanism existing just for that
use case since the first year of our life (2005)

is something perceived by me as two people (that's the U2) that come
and say, game's over, the old, non-functioning boss is gone, new boss
(== we) in town and forget about everything you knew before.

So you say that is was wrong perception, I hope so. Let's see how Doug
see your feedback, namely either as community reviewer feedback or as
new rules being set overnight, waiting.

This series is (1) simple compared to other stuff being reviewed here
nowadays, and (2) has very nice value to latency sensitive
applications, so two wins, lets get it done.

 Advice: You would be well served to spend a bit more time on your
 emails. I have no idea what 'but rather U2 saying' means, for
 instance. Sometimes I just guess at what you are trying to say :|

point taken

 That is the price we pay for an inclusive international community, but
 everyone needs to be careful before starting a flame war based on
 percived slight in the text and phrasing of a message. email is hard.

point taken. In this case (as you can see from my response above) I am
not convinced yet that this was false positive.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications

2015-05-27 Thread Yishai Hadas

On 5/27/2015 7:04 PM, Doug Ledford wrote:

On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote:

On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote:


+   struct srcu_struct  disassociate_srcu;


There is no need for rcu for this, use a rw sem.


The rcu was used becuase it's on the hot path I assume.  Do we have
numbers on whether a rwsem vs. an rcu matters performance wise?  If the
rcu actually helps performance, then I'm inclined to leave it, but if it
doesn't, then I'd agree that a rwsem is simpler and easier to deal with.



That's correct, it was chosen from performance reasons to enable 
parallel commands as part of ib_uverbs_write with minimum 
synchronization overhead comparing the rwsem.


Most of its usage is for read, upon low level device removal there is 
only one point when synchronize_srcu is used.



@@ -1326,6 +1327,13 @@ ssize_t ib_uverbs_create_comp_channel(struct 
ib_uverbs_file *file,
return -EFAULT;
}

+   /* Taking ref count on uverbs_file to make sure that file won't be
+* freed till that event file is closed. It will enable accessing the
+* uverbs_device fields as part of closing the events file and making
+* sure that uverbs device is available by that time as well.
+* Note: similar is already done for the async event file.
+   */
+   kref_get(file-ref);


Is this a bug today? It doesn't look like it, but this stuff does look wrong.

Woulnd't this would make more sense for ib_uverbs_alloc_event_file to
unconditionally grab the kref and unconditionally release it on
release?

The existing code for this looks broken, in ib_uverbs_get_context all
the error paths between ib_uverbs_alloc_event_file and the
kref_get(file-ref) are wrong - the will result in fput() which will
call ib_uverbs_event_close, which will try to do kref_put and
ib_unregister_event_handler - which are no longer paired.

[I recommend moving the kref_get and ib_register_event_handler into
  ib_uverbs_alloc_event_file, so the 'create' and 'destroy' code paths
  are clearly paired instead of being partially open coded in call
  sites]

Fix all this in a seperate patch to add the needed change in kref
semantics please.


Seconded.


OK, will add a separate patch to handle that.



-   if (!try_module_get(dev-ib_dev-owner)) {
-   ret = -ENODEV;
+   mutex_lock(dev-disassociate_mutex);
+   if (dev-disassociated) {
+   ret = -EIO;
goto err;
}

-   file = kmalloc(sizeof *file, GFP_KERNEL);
+   /* In case IB device supports disassociate ucontext, there is no hard
+* dependency between uverbs device and its low level device.
+*/
+   module_dependent = !(dev-flags  UVERBS_FLAG_DISASSOCIATE);
+
+   if (module_dependent) {
+   if (!try_module_get(dev-ib_dev-owner)) {
+   ret = -ENODEV;
+   goto err;


Again? Why I do I keep pointing this same basic thing to Mellanox
people:

  If you hold a X then you hold the ref to X as well.

So, if the core code is holding function pointers to module code, then
the core code holds a module ref. When the core code null's those
function pointers, then it can release the module ref.


Seconded.


The module get/put mechanism was previously used here and wasn't 
introduced by that patch.


In case the low level driver can be unloaded (e.g. rmmod mlx4_ib) 
unconditionally to active uverbs clients we should prevent the module 
dependency by ignoring the get/put mechanism. Hope that it clarifies the 
usage here.





This might work today like this (I'm not entirely sure), but it makes
no sense at all.

I'll look more closely in a few weeks once the rwsem change is done.





--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 for-next 1/3] IB/uverbs: Enable device removal when there are active user space applications

2015-05-27 Thread Or Gerlitz
On Wed, May 27, 2015 at 8:43 PM, Jason Gunthorpe
jguntho...@obsidianresearch.com wrote:
 On Wed, May 27, 2015 at 12:04:29PM -0400, Doug Ledford wrote:
 On Mon, 2015-05-25 at 10:54 -0600, Jason Gunthorpe wrote:
  On Wed, May 13, 2015 at 02:10:36PM +0300, Yishai Hadas wrote:
 
   + struct srcu_struct  disassociate_srcu;
 
  There is no need for rcu for this, use a rw sem.

 The rcu was used becuase it's on the hot path I assume.

 Perhaps, I looked at that a bit, it was used on syscall paths, but
 that wasn't even the big reason I made the comment..

Doug, what hot path we have in uverbs?! IB's stack hot path goes from
user-space to the HW, right?

Or.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rdma kernel tree

2015-05-27 Thread Doug Ledford
On Thu, 2015-05-28 at 11:02 +1000, Stephen Rothwell wrote:
 Hi Doug,
 
 On Wed, 27 May 2015 09:13:29 -0400 Doug Ledford dledf...@redhat.com wrote:
 
  Stephen, can you set it up so that the linux-next tag from
  git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git is added
  to the linux-next testing please?
 
 I haven't fetched tags in the past, only branches (it is a little
 tricky, but I can work on it) and there is no linux-next tag in that
 tree anyway (I assume that you mean for-next).

Sorry, you are correct, for-next is the right tag.

 Also, how does this relate to Roland's infiniband tree
 (git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git#for-next)
 that is already in linux-next?

Roland has stepped down as maintainer.  The community selected me to try
to pick up where he has left off.

 Added from today.
 Thanks for adding your subsystem tree as a participant of linux-next.  As
 you may know, this is not a judgment of your code.  The purpose of
 linux-next is for integration testing and to lower the impact of
 conflicts between subsystems in the next merge window. 
 
 You will need to ensure that the patches/commits in your tree/series have
 been:
  * submitted under GPL v2 (or later) and include the Contributor's
 Signed-off-by,
  * posted to the relevant mailing list,
  * reviewed by you (or another maintainer of your subsystem tree),
  * successfully unit tested, and 
  * destined for the current or next Linux merge window.
 
 Basically, this should be just what you would send to Linus (or ask him
 to fetch).  It is allowed to be rebased if you deem it necessary.
 


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD



signature.asc
Description: This is a digitally signed message part


Re: rdma kernel tree

2015-05-27 Thread Doug Ledford
On Thu, 2015-05-28 at 11:40 +1000, Stephen Rothwell wrote:
 Hi Doug, Roland,
 
 On Wed, 27 May 2015 21:14:03 -0400 Doug Ledford dledf...@redhat.com wrote:
 
   Also, how does this relate to Roland's infiniband tree
   (git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git#for-next)
   that is already in linux-next?
  
  Roland has stepped down as maintainer.  The community selected me to try
  to pick up where he has left off.
 
 So, I assume that I should remove the infiniband tree from linux-next?

Here's the link to Roland's announcement:

http://marc.info/?l=linux-rdmam=142860130010529w=2

Based upon his comments there, I would say it's a good bet that he
doesn't plan to continue hosting his own tree now that the switch over
has happened.  But, of course, he can speak for himself.

-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD



signature.asc
Description: This is a digitally signed message part


RE: [PATCH for-4.1 10/11] RDMA/ocrdma: Prevent returning failures in the resource destroy path

2015-05-27 Thread Selvin Xavier
Hi Jason
 My reply is delayed since  I was on vacation.

 -Original Message-
 From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com]
 Sent: Tuesday, May 19, 2015 11:53 PM
 To: Selvin Xavier
 Cc: Doug Ledford; linux-rdma@vger.kernel.org
 Subject: Re: [PATCH for-4.1 10/11] RDMA/ocrdma: Prevent returning
failures
 in the resource destroy path

 On Mon, May 18, 2015 at 05:16:14PM +0530, Selvin Xavier wrote:

  The driver is returning success for other destroy routines also and
  free the corresponding driver resources.  As part of the changes
  suggested by you, we will change all the destroy routines to return
  error status for all the FW command failures and avoid freeing the
  driver resources in case of failure.  Since this change need more
  testing, I will drop this patch from the series for 4.1 RC.  I will
  cut a patch for 4.2 after our testing.

 Didn't we discuss this already? Destroy routines cannot fail.

 If the card's firmware fails during destroy then that is an
unrecoverable error
 and your driver and firmware are no longer synchronized. Panic the
kernel or
 do some kind of (synchronous) card-reset recovery.

 Don't expect the caller to handle this, and don't leak IB core memory
when it
 happens.

Got it .   Will do the same in all destroy APIs.

 Jason
Thanks,
Selvin
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: rdma kernel tree

2015-05-27 Thread Stephen Rothwell
Hi Doug,

On Wed, 27 May 2015 09:13:29 -0400 Doug Ledford dledf...@redhat.com wrote:

 Stephen, can you set it up so that the linux-next tag from
 git://git.kernel.org/pub/scm/linux/kernel/git/dledford/rdma.git is added
 to the linux-next testing please?

I haven't fetched tags in the past, only branches (it is a little
tricky, but I can work on it) and there is no linux-next tag in that
tree anyway (I assume that you mean for-next).

Also, how does this relate to Roland's infiniband tree
(git://git.kernel.org/pub/scm/linux/kernel/git/roland/infiniband.git#for-next)
that is already in linux-next?

Added from today.
Thanks for adding your subsystem tree as a participant of linux-next.  As
you may know, this is not a judgment of your code.  The purpose of
linux-next is for integration testing and to lower the impact of
conflicts between subsystems in the next merge window. 

You will need to ensure that the patches/commits in your tree/series have
been:
 * submitted under GPL v2 (or later) and include the Contributor's
Signed-off-by,
 * posted to the relevant mailing list,
 * reviewed by you (or another maintainer of your subsystem tree),
 * successfully unit tested, and 
 * destined for the current or next Linux merge window.

Basically, this should be just what you would send to Linus (or ask him
to fetch).  It is allowed to be rebased if you deem it necessary.

-- 
Cheers,
Stephen Rothwell 
s...@canb.auug.org.au


pgpwzMf1pUkqU.pgp
Description: OpenPGP digital signature


Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs

2015-05-27 Thread Jason Gunthorpe
On Thu, May 28, 2015 at 12:58:59AM +0300, Or Gerlitz wrote:
 On Wed, May 27, 2015 at 8:53 PM, Doug Ledford dledf...@redhat.com wrote:
 
  On Wed, 2015-05-27 at 11:11 -0600, Jason Gunthorpe wrote:
  Well, let's just be clear: netlink/iproute2 screwed the pooch on their
  implementation of this stuff.  Any solution that doesn't include fixing
  this up in some way is not really a good solution.
 
   For instance iproute2 would need IB specific code to format
  the 'ip link show' (review print_vfinfo in iproute2) and to length
  check 'ip link set vf mac'
 
  If we do use 8, then it would be ideal (and my strong preference) to
  also fix the IFLA_VF_MAC message to have a working length. I think
  that could be done compatibly with a bit of work. At least that way
  iproute2 can be kept clean when it learns to do IB, and we could have
  the option again of using 20 someday if we need.
 
 
  Sounds like a reasonable plan.
  Or, this is your patch set, are you going to pick up these action items?
 
 Let see
 
  So to be clear, to go with the 8 byte option I suggest:
 
   - Engage netdev/iproute and confirm they are philosophically OK
 with IFLA_VF_MAC != IFLA_ADDRESS
 
 the last thing netdev are is having philosophical views, they are very
 practical (and non-perfect and happy and lively community), the one
 thing before the last  netdev are is caring for the rdma subsystem. If
 something has to change @ their direct part, we should come with
 patches.

Yes, you need to talk in patches, I will suggest you start with this
(totally untested, quickly thrown together):

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 8de36824018d..715b59e9925a 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1172,7 +1172,7 @@ static int rtnl_fill_ifinfo(struct sk_buff *skb, struct 
net_device *dev,
nla_nest_cancel(skb, vfinfo);
goto nla_put_failure;
}
-   if (nla_put(skb, IFLA_VF_MAC, sizeof(vf_mac), vf_mac) 
||
+   if (nla_put(skb, IFLA_VF_MAC, sizeof(u32) + 
dev-addr_len, vf_mac) ||
nla_put(skb, IFLA_VF_VLAN, sizeof(vf_vlan), 
vf_vlan) ||
nla_put(skb, IFLA_VF_RATE, sizeof(vf_rate),
vf_rate) ||
@@ -1292,7 +1292,8 @@ static const struct nla_policy 
ifla_vfinfo_policy[IFLA_VF_INFO_MAX+1] = {
 };
 
 static const struct nla_policy ifla_vf_policy[IFLA_VF_MAX+1] = {
-   [IFLA_VF_MAC]   = { .len = sizeof(struct ifla_vf_mac) },
+   [IFLA_VF_MAC]   = { .type = NLA_BINARY,
+   .len = sizeof(struct ifla_vf_mac) },
[IFLA_VF_VLAN]  = { .len = sizeof(struct ifla_vf_vlan) },
[IFLA_VF_TX_RATE]   = { .len = sizeof(struct ifla_vf_tx_rate) },
[IFLA_VF_SPOOFCHK]  = { .len = sizeof(struct ifla_vf_spoofchk) },
@@ -1447,6 +1448,21 @@ static int do_setvfinfo(struct net_device *dev, struct 
nlattr *attr)
switch (nla_type(vf)) {
case IFLA_VF_MAC: {
struct ifla_vf_mac *ivm;
+   int sz;
+
+   /* Legacy compatability, old iproute2 will pass in 32
+* bytes for everything, assume it is a 6 byte MAC
+* because that is all old iproute really supported.
+*/
+   sz = nla_len(attr);
+   if (sz == 36)
+   sz = 10;
+
+   if (sz != (sizeof(u32) + dev-addr_len)) {
+   err = -EINVAL;
+   break;
+   }
+
ivm = nla_data(vf);
err = -EOPNOTSUPP;
if (ops-ndo_set_vf_mac)

And make the argument why the above is OK.
 - Reader side only ever read 6 bytes anyhow, so it doesn't matter
   that ifla_vf_mac is undersized in the RTA_DATA
 - Confirm any other users of this (openstack?) you know are similarly
   OK.
 - Old write side will always send the full 36 byte struct, so
   assume 6 bytes.
 - Justify this as needing to add IB support next which does not use a
   6 byte addr_len
   New code that wants to work with IB will have to use correct lengths.

Then based on that iproute2 can be fixed in the two areas I pointed
out, doing that at the same time will probably earn some rep.

Finally you can make a patch that switches addr_len to something else
- that would trigger the philosophical question of 'what actually is a
IFLA_VF_MAC', if everyone is OK with changing it then golden.

   - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo
 instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat)
 
 I was thinking to patch iproute to sense the link type: if eth print
 six bytes, if ipoib print 8 bytes, simple.

This 

Re: [PATCH RFC 0/3] Support standard SRIOV configuration for IB VFs

2015-05-27 Thread Or Gerlitz
On Wed, May 27, 2015 at 8:53 PM, Doug Ledford dledf...@redhat.com wrote:

 On Wed, 2015-05-27 at 11:11 -0600, Jason Gunthorpe wrote:
 Well, let's just be clear: netlink/iproute2 screwed the pooch on their
 implementation of this stuff.  Any solution that doesn't include fixing
 this up in some way is not really a good solution.

  For instance iproute2 would need IB specific code to format
 the 'ip link show' (review print_vfinfo in iproute2) and to length
 check 'ip link set vf mac'

 If we do use 8, then it would be ideal (and my strong preference) to
 also fix the IFLA_VF_MAC message to have a working length. I think
 that could be done compatibly with a bit of work. At least that way
 iproute2 can be kept clean when it learns to do IB, and we could have
 the option again of using 20 someday if we need.


 Sounds like a reasonable plan.
 Or, this is your patch set, are you going to pick up these action items?

Let see

 So to be clear, to go with the 8 byte option I suggest:

  - Engage netdev/iproute and confirm they are philosophically OK
with IFLA_VF_MAC != IFLA_ADDRESS

the last thing netdev are is having philosophical views, they are very
practical (and non-perfect and happy and lively community), the one
thing before the last  netdev are is caring for the rdma subsystem. If
something has to change @ their direct part, we should come with
patches.

  - Make a kernel patch to properly size the IFLA_VF_MAC message

You mean the below structure which is expected by the kernel after
they see the IFLA_VF_MAC NL attribute

struct ifla_vf_mac {
 __u32 vf;
 __u8 mac[32]; /* MAX_ADDR_LEN */
 };

How you thought to patch things such that the size of the address
provided by user-space will propagate into the kernel w.o breaking the
NL ABI here?

Why not just take the eight lower bytes and set them NL -- ipoib --  PF driver

  - Make a iproute patch to use the IFLA_VF_MAC size in print_vfinfo
instead of hardcoded ETH_ALEN (using len == 32 mean len 6 for compat)

I was thinking to patch iproute to sense the link type: if eth print
six bytes, if ipoib print 8 bytes, simple.

  - Drop in the IB patch

sounds good.
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next 09/10] IB/mlx4: Add timestamp_mask and hca_core_clock to query_device

2015-05-27 Thread Jason Gunthorpe
On Thu, May 28, 2015 at 12:33:58AM +0300, Or Gerlitz wrote:

 (2) instantly rejecting a usage of a mechanism existing just for that
 use case since the first year of our life (2005)

I didn't reject anything, I said I wanted to see the user side to
continue my review.

 This series is (1) simple compared to other stuff being reviewed here
 nowadays,

What? It is 11 patches, long, introduces several UAPI changes, does
not implement a standardized feature, adds new uses of latent kernel
functions and exists to support a unique feature of a single hardware
vendor that few understand the usecase for.

I would describe this as one of the toughest to review series on
patchworks right now.

Jason
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html