Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Wed, Apr 10, 2013 at 12:32:31AM -0400, Michael R. Hines wrote:
> On 04/09/2013 11:24 PM, Michael S. Tsirkin wrote:
> >Which mechanism do you refer to? Your patches still seem to pin
> >each page in guest memory at some point, which will break all COW.
> >In particular any pagemap tricks to detect duplicates on source
> >that I suggested won't work.
> 
> Sorry, I misspoke. I'm referring to dynamic server page registration.
> 
> Of course it does not eliminate pinning - but it does mitigate the
> footprint of the VM, a feature that was requested.
> 
> I have implemented it and documented it.
> 
> - Michael

Okay, but GIFT is supposed to be used on send side: it's only allowed
with local/remote read access, and serves to reduce memory usage
on send side.
For example, disable zero page detection and look at memory usage
on send side before and after migration.
Dynamic registration on the receive side is nice but seems
completely unrelated ...
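
A minimal sketch of the constraint described above (GIFT only together with
read-style access), assuming stand-in flag names and bit values; the
DEMO_* constants are hypothetical and are not taken from the patch or from
libibverbs:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for verbs access flags; bit values are illustrative. */
enum demo_access {
    DEMO_ACCESS_LOCAL_WRITE   = 1 << 0,
    DEMO_ACCESS_REMOTE_WRITE  = 1 << 1,
    DEMO_ACCESS_REMOTE_READ   = 1 << 2,
    DEMO_ACCESS_REMOTE_ATOMIC = 1 << 3,
    DEMO_ACCESS_GIFT          = 1 << 6  /* assumed bit, not from the patch */
};

/* GIFT is only meaningful when nothing may write to the region. */
static bool demo_gift_flags_valid(int flags)
{
    const int writers = DEMO_ACCESS_LOCAL_WRITE |
                        DEMO_ACCESS_REMOTE_WRITE |
                        DEMO_ACCESS_REMOTE_ATOMIC;

    if (!(flags & DEMO_ACCESS_GIFT))
        return true;                 /* no GIFT: nothing to check here */
    return (flags & writers) == 0;   /* GIFT: reject any write access */
}

/* Usage: the send-side registration from this thread passes the check. */
void demo_gift_examples(void)
{
    assert(demo_gift_flags_valid(DEMO_ACCESS_GIFT | DEMO_ACCESS_REMOTE_READ));
    assert(!demo_gift_flags_valid(DEMO_ACCESS_GIFT | DEMO_ACCESS_LOCAL_WRITE));
}
```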

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael R. Hines

On 04/09/2013 11:24 PM, Michael S. Tsirkin wrote:
> Which mechanism do you refer to? Your patches still seem to pin each
> page in guest memory at some point, which will break all COW. In
> particular any pagemap tricks to detect duplicates on source that I
> suggested won't work.

Sorry, I misspoke. I'm referring to dynamic server page registration.

Of course it does not eliminate pinning - but it does mitigate the
footprint of the VM, a feature that was requested.


I have implemented it and documented it.

- Michael


> > On 04/09/2013 03:03 PM, Michael S. Tsirkin wrote:
> > >presumably is_dup_page reads the page, so should not break COW ...
> > >
> > >I'm not sure about the cgroups swap limit - you might have
> > >too many non COW pages so attempting to fault them all in
> > >makes you exceed the limit. You really should look at
> > >what is going on in the pagemap, to see if there's
> > >measurable gain from the patch.
> > >
> > >On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:
> > >>Well, I have the "is_dup_page()" commented out...when RDMA is
> > >>activated.
> > >>
> > >>Is there something else in QEMU that could be touching the page that
> > >>I don't know about?
> > >>
> > >>- Michael
> > >>
> > >>On 04/05/2013 05:03 PM, Roland Dreier wrote:
> > >>>On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines wrote:
> > >>>>Sorry, I was wrong. Ignore the comments about cgroups. That's still
> > >>>>broken (i.e. trying to register RDMA memory while using a cgroup swap
> > >>>>limit causes the process to get killed).
> > >>>>
> > >>>>But the GIFT flag patch works (my understanding is that the GIFT flag
> > >>>>allows the adapter to transmit stale memory information; it does not
> > >>>>have anything to do with cgroups specifically).
> > >>>The point of the GIFT patch is to avoid triggering copy-on-write so
> > >>>that memory doesn't blow up during migration.  If that doesn't work
> > >>>then there's no point to the patch.
> > >>>
> > >>>  - R.





Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin

On Tue, Apr 09, 2013 at 09:26:59PM -0400, Michael R. Hines wrote:
> With respect, I'm going to offload testing this patch back to the author =)
> because I'm trying to address all of Paolo's other minor issues
> with the RDMA patch before we can merge.

Fair enough, this likely means it won't happen anytime soon though.

> Since dynamic page registration (as you requested) is now fully
> implemented, this patch is less urgent since we now have a
> mechanism in place to avoid page pinning on both sides of the migration.
> 
> - Michael
> 

Which mechanism do you refer to? Your patches still seem to pin
each page in guest memory at some point, which will break all
COW. In particular any pagemap tricks to detect duplicates
on source that I suggested won't work.

> On 04/09/2013 03:03 PM, Michael S. Tsirkin wrote:
> >presumably is_dup_page reads the page, so should not break COW ...
> >
> >I'm not sure about the cgroups swap limit - you might have
> >too many non COW pages so attempting to fault them all in
> >makes you exceed the limit. You really should look at
> >what is going on in the pagemap, to see if there's
> >measurable gain from the patch.
> >
> >
> >On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:
> >>Well, I have the "is_dup_page()" commented out...when RDMA is
> >>activated.
> >>
> >>Is there something else in QEMU that could be touching the page that
> >>I don't know about?
> >>
> >>- Michael
> >>
> >>
> >>On 04/05/2013 05:03 PM, Roland Dreier wrote:
> >>>On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines wrote:
> >>>>Sorry, I was wrong. Ignore the comments about cgroups. That's still
> >>>>broken (i.e. trying to register RDMA memory while using a cgroup swap
> >>>>limit causes the process to get killed).
> >>>>
> >>>>But the GIFT flag patch works (my understanding is that the GIFT flag
> >>>>allows the adapter to transmit stale memory information; it does not
> >>>>have anything to do with cgroups specifically).
> >>>The point of the GIFT patch is to avoid triggering copy-on-write so
> >>>that memory doesn't blow up during migration.  If that doesn't work
> >>>then there's no point to the patch.
> >>>
> >>>  - R.
> >>>


RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-09 Thread Weiny, Ira
> -----Original Message-----
> From: Hefty, Sean
> Sent: Tuesday, April 09, 2013 6:30 PM
> To: Weiny, Ira; Jeff Squyres (jsquyres)
> Cc: Hal Rosenstock; Roland Dreier; linux-rdma@vger.kernel.org; Upinder
> Malhi (umalhi)
> Subject: RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
> 
> > If the IBTA were to release new MTU enumerations which values would
> > you recommend then?
> 
> I don't think there's a great solution here.  We're mixing IBTA encoded values
> with non-IBTA values.  We could reserve the 6-bit encoded values for IB, and
> use direct values for others (or at least jump beyond the 6-bit range).  Or we
> can stop matching new IBTA MTU encodings (e.g. IB_MTU_1500 = 6).  Or we
> go back in time and make mtu an int.
> 

I thought reserving the 6 bits for IB and allowing the enum values to match
the MTU was a pretty good compromise, especially since PathRecord is defined
in sa.h, which is provided by libibverbs.  That allows the IB MTU enum to
be used there.

OTOH, now that we have moved toward decent defines in the libibumad library we
could define the MTU enum there.  But then we again go down the path of
defining things in multiple places and confusing the users...  :-(

As an aside I like the use of RDMA_MTU_* for these values.  Again to 
distinguish them from the IBTA values.  But I know that is poor form.

Ira



RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-09 Thread Hefty, Sean
> If the IBTA were to release new MTU enumerations which values would you
> recommend then?

I don't think there's a great solution here.  We're mixing IBTA encoded values 
with non-IBTA values.  We could reserve the 6-bit encoded values for IB, and 
use direct values for others (or at least jump beyond the 6-bit range).  Or we 
can stop matching new IBTA MTU encodings (e.g. IB_MTU_1500 = 6).  Or we go back 
in time and make mtu an int.

- Sean


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael R. Hines

With respect, I'm going to offload testing this patch back to the author =)
because I'm trying to address all of Paolo's other minor issues
with the RDMA patch before we can merge.

Since dynamic page registration (as you requested) is now fully
implemented, this patch is less urgent since we now have a
mechanism in place to avoid page pinning on both sides of the migration.

- Michael

On 04/09/2013 03:03 PM, Michael S. Tsirkin wrote:
>presumably is_dup_page reads the page, so should not break COW ...
>
>I'm not sure about the cgroups swap limit - you might have
>too many non COW pages so attempting to fault them all in
>makes you exceed the limit. You really should look at
>what is going on in the pagemap, to see if there's
>measurable gain from the patch.
>
>On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:
>>Well, I have the "is_dup_page()" commented out...when RDMA is
>>activated.
>>
>>Is there something else in QEMU that could be touching the page that
>>I don't know about?
>>
>>- Michael
>>
>>On 04/05/2013 05:03 PM, Roland Dreier wrote:
>>>On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines wrote:
>>>>Sorry, I was wrong. Ignore the comments about cgroups. That's still
>>>>broken (i.e. trying to register RDMA memory while using a cgroup swap
>>>>limit causes the process to get killed).
>>>>
>>>>But the GIFT flag patch works (my understanding is that the GIFT flag
>>>>allows the adapter to transmit stale memory information; it does not
>>>>have anything to do with cgroups specifically).
>>>The point of the GIFT patch is to avoid triggering copy-on-write so
>>>that memory doesn't blow up during migration.  If that doesn't work
>>>then there's no point to the patch.
>>>
>>>  - R.





RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-09 Thread Weiny, Ira
> -----Original Message-----
> From: Jeff Squyres (jsquyres) [mailto:jsquy...@cisco.com]
> Subject: Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
> 
> On Apr 8, 2013, at 6:16 PM, "Hefty, Sean"  wrote:
> 
> > Why can't IB_MTU_1500 = 1500?
> 

Sean,

If the IBTA were to release new MTU enumerations which values would you 
recommend then?

Ira

> 
> It certainly could.  Additionally, since Roland was a little concerned about
> the "IB" prefix (since 1500 and 9000 are not IBTA-sanctioned MTUs), they could
> have a different prefix -- perhaps RDMA_MTU_1500.
> 
> Although I admit that it would be weird to have an enum that contains values
> with different prefixes:
> 
> enum ib_mtu {
>     IB_MTU_256    = 1,
>     IB_MTU_512    = 2,
>     IB_MTU_1024   = 3,
>     IB_MTU_2048   = 4,
>     IB_MTU_4096   = 5,
>     RDMA_MTU_1500 = 1500,
>     RDMA_MTU_9000 = 9000
> };
> 
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-09 Thread Jeff Squyres (jsquyres)
On Apr 9, 2013, at 4:10 PM, "Weiny, Ira"  wrote:

>> Just to re-state: our issue is that there does not seem to be any other way
>> to get the max UD message size without knowing the actual MTU (are we
>> incorrect about that?).  Hence, using the IB-defined values is not really
>> sufficient.
> 
> I guess I am confused.  Is this patch trying to support RoCE or a VNIC?


Both, actually.

The RoCE driver lies about its MTU (IIRC, it claims IB_MTU_1024, even if the 
MTU is actually 1500).  So AFAIK, there's no way to know what the UD max 
message size is on RoCE, because the max message size attribute on port refers 
to RC, not UD.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-09 Thread Jeff Squyres (jsquyres)
On Apr 8, 2013, at 6:16 PM, "Hefty, Sean"  wrote:

> Why can't IB_MTU_1500 = 1500?


It certainly could.  Additionally, since Roland was a little concerned about 
the "IB" prefix (since 1500 and 9000 are not IBTA-sanctioned MTUs), they could 
have a different prefix -- perhaps RDMA_MTU_1500.  

Although I admit that it would be weird to have an enum that contains values 
with different prefixes:

enum ib_mtu {
    IB_MTU_256    = 1,
    IB_MTU_512    = 2,
    IB_MTU_1024   = 3,
    IB_MTU_2048   = 4,
    IB_MTU_4096   = 5,
    RDMA_MTU_1500 = 1500,
    RDMA_MTU_9000 = 9000
};
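
A rough sketch of how a consumer might decode such a mixed enum, assuming the
idea raised elsewhere in this thread of keeping the 6-bit encoded range for
IBTA values and treating anything beyond it as a direct byte count. The
helper name demo_mtu_enum_to_bytes is hypothetical and is not an existing
verbs call:

```c
#include <assert.h>

/* Mirrors the enum quoted above; the RDMA_MTU_* names are only a proposal. */
enum ib_mtu {
    IB_MTU_256    = 1,
    IB_MTU_512    = 2,
    IB_MTU_1024   = 3,
    IB_MTU_2048   = 4,
    IB_MTU_4096   = 5,
    RDMA_MTU_1500 = 1500,
    RDMA_MTU_9000 = 9000
};

/* Returns the MTU in bytes, or -1 for a value this sketch does not know. */
static int demo_mtu_enum_to_bytes(enum ib_mtu mtu)
{
    int v = (int)mtu;

    if (v >= 1 && v <= 5)
        return 128 << v;   /* IBTA encoding: 1 -> 256 ... 5 -> 4096 */
    if (v >= 64)
        return v;          /* beyond the 6-bit range: value is the byte count */
    return -1;             /* 0 and 6..63 left for future IBTA encodings */
}

/* Usage: both kinds of value decode through the same helper. */
void demo_mtu_examples(void)
{
    assert(demo_mtu_enum_to_bytes(IB_MTU_1024) == 1024);
    assert(demo_mtu_enum_to_bytes(RDMA_MTU_1500) == 1500);
    assert(demo_mtu_enum_to_bytes((enum ib_mtu)6) == -1);
}
```

The cost of this scheme is that any new IBTA encoding between 6 and 63 stays
unambiguous, while an enum value such as IB_MTU_1500 = 6 would not.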

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/



Re: [PATCH V4 for-next 1/5] IB/core: Add RSS and TSS QP groups

2013-04-09 Thread Or Gerlitz
> This patch introduces the concept of RSS and TSS QP groups which
> allows for implementing them by low level drivers and using it
> by IPoIB and later also by user space ULPs.
>
> A QP group is a set of QPs consisting of a parent QP and two disjoint sets
> of RSS and TSS QPs. The creation of a QP group is a two stage process:
>
> In the 1st stage, the parent QP is created.
>
> In the 2nd stage the children QPs of the parent are created.
>
> Each child QP indicates if it is an RSS or TSS QP. Both the TSS
> and RSS sets of QPs should have contiguous QP numbers.
>
> It is forbidden to modify parent QP state before all RSS/TSS children
> were created. In the same manner it is disallowed to destroy the parent
> QP unless all RSS/TSS children were destroyed.
>
> A few new elements/concepts are introduced to support this:
>
> Three new device capabilities that can be set by the low level driver:
>
> - IB_DEVICE_QPG which is set to indicate QP groups are supported.
>
> - IB_DEVICE_UD_RSS which is set to indicate that the device supports
> RSS, that is applying hash function on incoming TCP/UDP/IP packets and
> dispatching them to multiple "rings" (child QPs).
>
> - IB_DEVICE_UD_TSS which is set to indicate that the device supports
> "HW TSS" which means that the HW is capable of over-riding the source
> UD QPN present in sent IB datagram header (DTH) with the parent's QPN.
>
> Low level drivers not supporting HW TSS could still support QP groups; such
> a combination is referred to as "SW TSS". In this case, the low level driver
> fills in the qpg_tss_mask_sz field of struct ib_qp_cap returned from
> ib_create_qp, so that this mask can be used to retrieve the parent QPN from
> incoming packets carrying a child QPN (thanks to the contiguous QP numbers
> requirement).
>
> - max rss table size device attribute, which is the maximal size of the RSS
> indirection table  supported by the device
>
> - qp group type attribute for qp creation saying whether this is a parent QP
> or rx/tx (rss/tss) child QP or none of the above for non rss/tss QPs.
>
> - per qp group type, another attribute is added, for parent QPs, the number
> of rx/tx child QPs and for child QPs pointer to the parent.
>
> - IB_QP_GROUP_RSS attribute mask, which should be used when modifying
> the parent QP state from reset to init
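
The "SW TSS" mask in the quoted description could work roughly as sketched
here. This assumes the parent QPN is aligned so that clearing the low
qpg_tss_mask_sz bits of a child QPN yields the parent; that alignment, and
the helper name demo_tss_parent_qpn, are assumptions for illustration and are
not taken from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Recover the (assumed aligned) parent QPN from a child QPN. */
static uint32_t demo_tss_parent_qpn(uint32_t child_qpn, unsigned int mask_sz)
{
    assert(mask_sz < 24);                     /* QPNs are 24 bits wide */
    uint32_t low_bits = (1u << mask_sz) - 1;  /* bits that select the child */
    return child_qpn & ~low_bits & 0xFFFFFFu;
}

/* Usage: with a 3-bit mask, children 0x48..0x4f map back to parent 0x48. */
void demo_tss_examples(void)
{
    assert(demo_tss_parent_qpn(0x4B, 3) == 0x48);
    assert(demo_tss_parent_qpn(0x48, 3) == 0x48);
}
```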


On Tue, Apr 9, 2013 at 8:06 PM, Hefty, Sean  wrote:

> I have no issue with RSS/TSS.  But the 'qp group' interface to using this
> seems kludgy.

Let's try to be more specific.

> On a node, this is multiple send/receive queues grouped together to form a
> larger construct.  On the wire, this is a single QP - maybe?  I'm still not
> clear on that.  From what's written, all the send queues appear as a single
> QPN.  The receive queues appear as different QPNs.

Starting with RSS QP groups: it's a group made of one parent QP and N
RSS child QPs.

On the wire everything is sent to the RSS parent QP, however, when the
HW receives a packet for which this QP/QPN is the destination, it
applies a hash function on the packet header and subject to the hash
result dispatches the packet to one of the N child QPs.
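
The dispatch step described above can be sketched as follows. This is only a
toy model: it assumes contiguous child QPNs and uses a simple FNV-1a hash as a
stand-in for whatever hash the hardware applies, and every demo_* name is
hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 32-bit FNV-1a hash over header bytes (stand-in for the HW hash). */
static uint32_t demo_flow_hash(const uint8_t *hdr, size_t len)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++) {
        h ^= hdr[i];
        h *= 16777619u;
    }
    return h;
}

/* Pick one of n_children contiguous child QPNs for a packet header. */
static uint32_t demo_rss_pick_child(uint32_t first_child_qpn,
                                    uint32_t n_children,
                                    const uint8_t *hdr, size_t len)
{
    assert(n_children > 0);
    return first_child_qpn + demo_flow_hash(hdr, len) % n_children;
}

/* Usage: packets of one flow always land on the same child QP. */
void demo_rss_examples(void)
{
    const uint8_t flow[] = { 10, 0, 0, 1, 10, 0, 0, 2, 0x1f, 0x90 };
    uint32_t qpn = demo_rss_pick_child(0x100, 4, flow, sizeof(flow));

    assert(qpn >= 0x100 && qpn < 0x104);
    assert(qpn == demo_rss_pick_child(0x100, 4, flow, sizeof(flow)));
}
```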

The design applies to IB UD QPs and Raw Ethernet Packet QP types.
Under IB the QPN of the parent is on the wire. Under Eth there are no
QPNs on the wire, but the HW has some "steering rule" which causes
certain packets to be steered to that RSS parent, and the RSS parent
in turn makes a further dispatching decision (hashing) to determine which
of the child RSS QPs will actually receive that packet.

With IPoIB, the remote side is provided with the RSS parent QPN as
part of the IPoIB HW address provided in the ARP reply payload, so
packets are sent to that QPN. With RAW Packet Eth QPs, the remote side
isn't aware of QPNs at all; everything goes through a steering rule that
directs packets to the RSS parent.

You can send packets over RSS packet QP but not receive packets.

So for RSS, the remote side isn't aware of that QP group at all.

Makes sense?

As for TSS QP groups, generally speaking, the only cases
that really matter are applications/drivers that care about the source
QPN of a packet.

But let's get there after hopefully agreeing on what an RSS QP group is.

Or.

>
> Signed-off-by: Shlomo Pongratz 
> ---
>  drivers/infiniband/core/uverbs_cmd.c |1 +
>  drivers/infiniband/core/verbs.c  |  118 ++
>  drivers/infiniband/hw/amso1100/c2_provider.c |3 +
>  drivers/infiniband/hw/cxgb3/iwch_provider.c  |2 +
>  drivers/infiniband/hw/cxgb4/qp.c |3 +
>  drivers/infiniband/hw/ehca/ehca_qp.c |3 +
>  drivers/infiniband/hw/ipath/ipath_qp.c   |3 +
>  drivers/infiniband/hw/mlx4/qp.c  |3 +
>  drivers/infiniband/hw/mthca/mthca_provider.c |3 +
>  drivers/infiniband/hw/nes/nes_verbs.c|3 +
>  drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  |5 +
>  drivers/infiniband/hw/qib/qib_qp.c   |5 +
>  include/r

Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Tue, Apr 09, 2013 at 11:34:09PM +0300, Michael S. Tsirkin wrote:
> On Fri, Apr 05, 2013 at 04:17:36PM -0400, Michael R. Hines wrote:
> > The userland part of the patch was missing (IBV_ACCESS_GIFT).
> > 
> > I added that flag to /usr/include in addition to this patch and did
> > a test RDMA migrate and it seems to work without any problems.
> > 
> > I also removed the IBV_*_WRITE flags on the sender-side and
> > activated cgroups with the "memory.memsw.limit_in_bytes" activated
> > and the migration with RDMA also succeeded without any problems
> > (both with *and* without GIFT also worked).
> > 
> > Any additional tests you would like?
> > 
> > 
> > - Michael
> 
> RDMA can't really work with swap so not sure how that's relevant.
> 
> Please check memory.usage_in_bytes - is it lower with
> the GIFT flag?  I think this is what we really care about.

Oh, and there's no reason to set memsw.limit_in_bytes, I think.

> -- 
> MST


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Fri, Apr 05, 2013 at 04:17:36PM -0400, Michael R. Hines wrote:
> The userland part of the patch was missing (IBV_ACCESS_GIFT).
> 
> I added that flag to /usr/include in addition to this patch and did
> a test RDMA migrate and it seems to work without any problems.
> 
> I also removed the IBV_*_WRITE flags on the sender-side and
> activated cgroups with the "memory.memsw.limit_in_bytes" activated
> and the migration with RDMA also succeeded without any problems
> (both with *and* without GIFT also worked).
> 
> Any additional tests you would like?
> 
> 
> - Michael

RDMA can't really work with swap so not sure how that's relevant.

Please check memory.usage_in_bytes - is it lower with
the GIFT flag?  I think this is what we really care about.
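
One way to take that measurement is to read the counter before and after the
migration and compare. A minimal sketch, assuming the cgroup v1 memory
controller file memory.usage_in_bytes is readable at some path for the QEMU
cgroup (the mount point and group name depend on the system and are not
given in this thread); the demo_* helpers are hypothetical:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read a single unsigned decimal value from a file such as
 * .../memory.usage_in_bytes. Returns 0 on success, -1 on failure.
 */
static int demo_read_u64(const char *path, unsigned long long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%llu", out) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Signed change in usage; negative means usage went down. */
static long long demo_usage_delta(unsigned long long before,
                                  unsigned long long after)
{
    return (long long)after - (long long)before;
}

/* Usage: a missing file is reported as an error, not as zero usage. */
void demo_usage_examples(void)
{
    unsigned long long v = 0;

    assert(demo_read_u64("/nonexistent/memory.usage_in_bytes", &v) == -1);
    assert(demo_usage_delta(4096, 1024) == -3072);
}
```

Running the same migration with and without the flag and comparing the two
deltas would show whether GIFT actually lowers usage.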

-- 
MST


Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support

2013-04-09 Thread Or Gerlitz
On Tue, Apr 9, 2013 at 8:06 PM, Hefty, Sean  wrote:

> I have no issue with RSS/TSS.  But the 'qp group' interface to using this
> seems kludgy.

OK, so let's take it up on the patch that has the QP group description.

> On a node, this is multiple send/receive queues grouped together to form a
> larger construct.  On the wire, this is a single QP - maybe?  I'm still not
> clear on that.  From what's written, all the send queues appear as a single
> QPN.  The receive queues appear as different QPNs.


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Fri, Apr 05, 2013 at 01:43:49PM -0700, Roland Dreier wrote:
> On Fri, Apr 5, 2013 at 1:17 PM, Michael R. Hines
>  wrote:
> > I also removed the IBV_*_WRITE flags on the sender-side and activated
> > cgroups with the "memory.memsw.limit_in_bytes" activated and the migration
> > with RDMA also succeeded without any problems (both with *and* without GIFT
> > also worked).
> 
> Not sure I'm interpreting this correctly.  Are you saying that things
> worked without actually setting the GIFT flag?   In which case why are
> we adding this flag?
> 
>  - R.

We are adding the flag to reduce memory usage when there are lots of COW pages.
There's no guarantee there will be COW pages, so I expect things to work
both with and without breaking COW, just using much more memory when we
break COW.

-- 
MST


RE: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.

2013-04-09 Thread Weiny, Ira
> -----Original Message-----
> From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-
> Subject: Re: [PATCH 2/2] Ad IB_MTU_1500|9000 enums.
> 
> On Apr 4, 2013, at 1:57 PM, "Weiny, Ira"  wrote:
> 
> >> In hindsight, the user space API never should have exposed the mtu as
> >> an enum...
> >>
> >> Since an enum is an int, and we're never going to have anything with
> >> an mtu <= 5 bytes, couldn't we just store all new mtu values directly
> >> as their byte value?
> >
> > That seems like a pretty good idea.
> 
> 
> Agreed, but changing to an int would seem to have some fairly serious
> backwards compatibility issues.
> 
> What is the right way to move forward here?
> 
> Just to re-state: our issue is that there does not seem to be any other way to
> get the max UD message size without knowing the actual MTU (are we
> incorrect about that?).  Hence, using the IB-defined values is not really
> sufficient.

I guess I am confused.  Is this patch trying to support RoCE or a VNIC?

Ira




Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Fri, Apr 05, 2013 at 02:03:33PM -0700, Roland Dreier wrote:
> On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines
>  wrote:
> > Sorry, I was wrong. Ignore the comments about cgroups. That's still broken
> > (i.e. trying to register RDMA memory while using a cgroup swap limit causes
> > the process to get killed).
> >
> > But the GIFT flag patch works (my understanding is that the GIFT flag allows the
> > adapter to transmit stale memory information; it does not have anything to
> > do with cgroups specifically).
> 
> The point of the GIFT patch is to avoid triggering copy-on-write so
> that memory doesn't blow up during migration.  If that doesn't work
> then there's no point to the patch.
> 
>  - R.

Absolutely. Checking whether an OOM gets triggered looks like a heavy-handed
approach to testing the feature, though.
It's relevant, but there could be many other reasons for it to trigger.
See Documentation/cgroups/memory.txt section "Troubleshooting".

It's easier to just check whether this patch reduces the memory consumption,
that's the point really.

-- 
MST


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
presumably is_dup_page reads the page, so should not break COW ...

I'm not sure about the cgroups swap limit - you might have
too many non COW pages so attempting to fault them all in
makes you exceed the limit. You really should look at
what is going on in the pagemap, to see if there's
measurable gain from the patch.


On Fri, Apr 05, 2013 at 05:32:30PM -0400, Michael R. Hines wrote:
> Well, I have the "is_dup_page()" commented out...when RDMA is
> activated.
> 
> Is there something else in QEMU that could be touching the page that
> I don't know about?
> 
> - Michael
> 
> 
> On 04/05/2013 05:03 PM, Roland Dreier wrote:
> >On Fri, Apr 5, 2013 at 1:51 PM, Michael R. Hines wrote:
> >>Sorry, I was wrong. Ignore the comments about cgroups. That's still broken
> >>(i.e. trying to register RDMA memory while using a cgroup swap limit causes
> >>the process to get killed).
> >>
> >>But the GIFT flag patch works (my understanding is that the GIFT flag allows the
> >>adapter to transmit stale memory information; it does not have anything to
> >>do with cgroups specifically).
> >The point of the GIFT patch is to avoid triggering copy-on-write so
> >that memory doesn't blow up during migration.  If that doesn't work
> >then there's no point to the patch.
> >
> >  - R.
> >


Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Tue, Apr 09, 2013 at 01:56:00PM -0400, Michael R. Hines wrote:
> On 04/09/2013 12:39 PM, Michael S. Tsirkin wrote:
> >On Fri, Apr 05, 2013 at 04:54:39PM -0400, Michael R. Hines wrote:
> >>To be more specific, here's what I did:
> >>
> >>1. apply kernel module patch - re-insert module
> >>1. QEMU does: ibv_reg_mr(IBV_ACCESS_GIFT | IBV_ACCESS_REMOTE_READ)
> >>2. Start the RDMA migration
> >>3. Migration completes without any errors
> >>
> >>This test does *not* work with a cgroup swap limit, however. The
> >>process gets killed. (Both with and without GIFT)
> >>
> >>- Michael
> >Try to attach a debugger and see where it is when it gets killed?
> >
> 
> It's killed by cgroups - not a CPU exception.
> 
> The same test works fine using TCP migration with cgroups -
> everything is fine there.
> 
> The memory that RDMA attempted to register hits some kind of cgroups policy
> which results in a kernel message saying that the cgroup swap limit was hit
> and then it goes ahead and kills the process altogether.
> 
> It's not a QEMU problem - it seems to be a kernel bug.

Maybe cgroup swap limit really is buggy. That's interesting, but not
really related to this patch.  What's interesting is whether we save
lots of memory by using this patch.
Couldn't you dump the pagemap for the qemu process and calculate real
memory usage before and after applying the patch?
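
A minimal sketch of that kind of pagemap check, assuming a 64-bit Linux system
where /proc/self/pagemap is readable. It inspects the calling process only; for
QEMU the same read would go against /proc/<pid>/pagemap. It relies on bit 63
of each 64-bit pagemap entry meaning "page present in RAM"; the demo_* names
are hypothetical:

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Bit 63 of a pagemap entry is set when the page is present in RAM. */
static int demo_pagemap_present(uint64_t entry)
{
    return (int)((entry >> 63) & 1);
}

/*
 * Count resident pages of the current process in [addr, addr + len).
 * Returns -1 if the pagemap cannot be read.
 */
static long demo_count_present(const void *addr, size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    long present = 0;
    int fd;

    if (len == 0)
        return 0;
    if (ps <= 0)
        return -1;

    uintptr_t page = (uintptr_t)ps;
    uintptr_t first = (uintptr_t)addr / page;
    uintptr_t last = ((uintptr_t)addr + len + page - 1) / page;

    fd = open("/proc/self/pagemap", O_RDONLY);
    if (fd < 0)
        return -1;

    for (uintptr_t vpn = first; vpn < last; vpn++) {
        uint64_t entry;
        off_t off = (off_t)(vpn * sizeof(entry)); /* one entry per page */

        if (pread(fd, &entry, sizeof(entry), off) != (ssize_t)sizeof(entry)) {
            close(fd);
            return -1;
        }
        present += demo_pagemap_present(entry);
    }
    close(fd);
    return present;
}

/* Usage: the bit test is pure; an empty range needs no file access. */
void demo_pagemap_examples(void)
{
    assert(demo_pagemap_present(1ULL << 63) == 1);
    assert(demo_pagemap_present(1ULL << 62) == 0);
    assert(demo_count_present(NULL, 0) == 0);
}
```

Multiplying the count by the page size before and after applying the patch
gives the resident footprint to compare.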


-- 
MST


Re: [RFC/PATCH v3] IPoIB: Leave space in skb linear buffer for IP headers

2013-04-09 Thread Markus Stockhausen

> 
>-IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN,
>+/* add 128 bytes of tailroom for IP/TCP headers */
>+IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 128,

Hello,

the version 3 of the patch finally works. I can see the performance
gains but I cannot feel them (in real life). Here are the results
of my testbed:

Test 1:
netperf/netserver message size 16K

kernel 3.5 default :  5.1 GBit/s
kernel 3.5 + patch v3  :  7.7 GBit/s
kernel 3.5 + max MTU 3K: 10.8 GBit/s

Test 2:
Disk write performance
VM with disk mounted on IB async NFS server

block size  | default  | patch v3 | max MTU 3K
------------+----------+----------+-----------
    1 KB    |  10 MB/s |  10 MB/s |  10 MB/s
    2 KB    |  20 MB/s |  21 MB/s |  20 MB/s
    4 KB    |  40 MB/s |  40 MB/s |  43 MB/s
    8 KB    |  68 MB/s |  70 MB/s |  78 MB/s
   16 KB    | 105 MB/s | 105 MB/s | 120 MB/s
   32 KB    | 150 MB/s | 150 MB/s | 170 MB/s
   64 KB    | 200 MB/s | 210 MB/s | 260 MB/s
  128 KB    | 270 MB/s | 290 MB/s | 400 MB/s
  256 KB    | 300 MB/s | 310 MB/s | 430 MB/s
  512 KB    | 305 MB/s | 320 MB/s | 470 MB/s
 1024 KB    | 310 MB/s | 325 MB/s | 500 MB/s
 2048 KB    | 310 MB/s | 325 MB/s | 510 MB/s
 4096 KB    | 370 MB/s | 325 MB/s | 510 MB/s
 8192 KB    | 400 MB/s | 325 MB/s | 520 MB/s


As you can see netperf throughput increases while NFS does not
even care about the optimizations. Maybe it does not work well
with fragmented SKBs. The MAX MTU 3K values once again are
forced through a hack inside ipoib_main.c.

Out of curiosity I changed the block splitting in your v3 patch
from small head with large fragment to large head with small
fragment in this line.

IPOIB_UD_HEAD_SIZE  = IB_GRH_BYTES + IPOIB_ENCAP_LEN + 3072

In my 2044 MTU case this brings the netperf & NFS throughput to
the same levels as the dirty hack. Of course this no longer
reflects a head but equals more or less to something like a
new constant IPOIB_UD_FIXED_SKB_SIZE.
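
The effect of the two head sizes can be illustrated with a simplified model of
the receive split: bytes up to the head size go into the linear buffer and the
remainder into a page fragment. This is a sketch, not the driver code; it
assumes a GRH of 40 bytes and an IPoIB encapsulation header of 4 bytes, and
the demo_* names are hypothetical:

```c
#include <assert.h>

enum {
    DEMO_IB_GRH_BYTES    = 40, /* IB Global Route Header */
    DEMO_IPOIB_ENCAP_LEN = 4   /* IPoIB encapsulation header */
};

struct demo_split {
    unsigned int linear; /* bytes placed in the skb linear buffer */
    unsigned int frag;   /* bytes placed in the page fragment */
};

/* Split a received packet of pkt_len bytes around head_size. */
static struct demo_split demo_rx_split(unsigned int pkt_len,
                                       unsigned int head_size)
{
    struct demo_split s;

    s.linear = pkt_len < head_size ? pkt_len : head_size;
    s.frag = pkt_len - s.linear;
    return s;
}

/* Usage: a full 2044 MTU packet is 40 + 4 + 2044 = 2088 bytes on receive. */
void demo_split_examples(void)
{
    unsigned int pkt = DEMO_IB_GRH_BYTES + DEMO_IPOIB_ENCAP_LEN + 2044;
    unsigned int v3_head = DEMO_IB_GRH_BYTES + DEMO_IPOIB_ENCAP_LEN + 128;
    unsigned int big_head = DEMO_IB_GRH_BYTES + DEMO_IPOIB_ENCAP_LEN + 3072;
    struct demo_split a = demo_rx_split(pkt, v3_head);
    struct demo_split b = demo_rx_split(pkt, big_head);

    assert(a.linear == 172 && a.frag == 1916); /* v3: mostly fragment */
    assert(b.linear == 2088 && b.frag == 0);   /* +3072: fully linear */
}
```

Under this model the +3072 variant never produces a fragment at 2044 MTU,
which matches the description of it behaving like a fixed skb size.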

I guess 4K MTU will not see any further gains, but avoiding the
skb_pull calls should improve speed as well. Maybe a final
adaptation could put the cherry on the cake.

Markus




Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael R. Hines

On 04/09/2013 12:39 PM, Michael S. Tsirkin wrote:
>On Fri, Apr 05, 2013 at 04:54:39PM -0400, Michael R. Hines wrote:
>>To be more specific, here's what I did:
>>
>>1. apply kernel module patch - re-insert module
>>1. QEMU does: ibv_reg_mr(IBV_ACCESS_GIFT | IBV_ACCESS_REMOTE_READ)
>>2. Start the RDMA migration
>>3. Migration completes without any errors
>>
>>This test does *not* work with a cgroup swap limit, however. The
>>process gets killed. (Both with and without GIFT)
>>
>>- Michael
>Try to attach a debugger and see where it is when it gets killed?



It's killed by cgroups - not a CPU exception.

The same test works fine using TCP migration with cgroups - everything 
is fine there.


The memory that RDMA attempted to register hits some kind of cgroups policy
which results in a kernel message saying that the cgroup swap limit was hit
and then it goes ahead and kills the process altogether.

It's not a QEMU problem - it seems to be a kernel bug.



Re: [PATCHv2] rdma: add a new IB_ACCESS_GIFT flag

2013-04-09 Thread Michael S. Tsirkin
On Fri, Apr 05, 2013 at 04:54:39PM -0400, Michael R. Hines wrote:
> To be more specific, here's what I did:
> 
> 1. apply kernel module patch - re-insert module
> 2. QEMU does: ibv_reg_mr(IBV_ACCESS_GIFT | IBV_ACCESS_REMOTE_READ)
> 3. Start the RDMA migration
> 4. Migration completes without any errors
> 
> This test does *not* work with a cgroup swap limit, however. The
> process gets killed. (Both with and without GIFT)
> 
> - Michael

Try to attach a debugger and see where it is when it gets killed?

> On 04/05/2013 04:43 PM, Roland Dreier wrote:
> >On Fri, Apr 5, 2013 at 1:17 PM, Michael R. Hines
> > wrote:
> >>I also removed the IBV_*_WRITE flags on the sender-side and activated
> >>cgroups with the "memory.memsw.limit_in_bytes" activated and the migration
> >>with RDMA also succeeded without any problems (both with *and* without GIFT
> >>also worked).
> >Not sure I'm interpreting this correctly.  Are you saying that things
> >worked without actually setting the GIFT flag?   In which case why are
> >we adding this flag?
> >
> >  - R.
> >
> 


RE: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support

2013-04-09 Thread Hefty, Sean
> any feedback?

I have no issue with RSS/TSS.  But the 'qp group' interface to using this seems 
kludgy.

On a node, this is multiple send/receive queues grouped together to form a 
larger construct.  On the wire, this is a single QP - maybe?  I'm still not 
clear on that.  From what's written, all the send queues appear as a single 
QPN.  The receive queues appear as different QPNs.



Re: [RFC/PATCH v3] IPoIB: Leave space in skb linear buffer for IP headers

2013-04-09 Thread Roland Dreier
On Tue, Apr 9, 2013 at 6:13 AM, Luick, Dean  wrote:
> Can you go through the "else" of the first if (page is NULL), then enter the 
> second if? If so, isn't the page lost?

Thanks, good catch.  I'll fix that up.


Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 16:23, Hal Rosenstock wrote:
>> So these values are exactly the same as in "ibv_devinfo" and can be set
>> in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu.
>>
>> I've found the PortInfo with the command
>> "smpquery portinfo -C mlx4_0 3 1"
>> where I'm using the first HCA to contact the SM. I tell the SM the
>> destination LID ('3' here in my case) and the destination port ('1').
>>
>> Is there another method to set the max MTU?
> 
> That doesn't set max MTU (MTUCap) but merely reads it (for that port).

Sorry, copy and paste error. I meant the mlx4 file:
/sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

But you've answered that by "vendor specific". Thanks for the valuable
information!

What would interest us most is whether the MTU can be changed live without
any service disruption. It looks like the mlx4 driver can't provide that.
Perhaps switches can do that.

Cheers,
Sebastian



Re: tune ib stack

2013-04-09 Thread Hal Rosenstock
On 4/9/2013 9:56 AM, Sebastian Riemer wrote:
> On 09.04.2013 15:34, Hal Rosenstock wrote:
>> On 4/9/2013 9:16 AM, Sebastian Riemer wrote:
>>> On 09.04.2013 14:49, Hal Rosenstock wrote:
>>>> On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
>>>>> Hello. I have some servers, with mellanox ConnectX-3 and have some
>>>>> questions:
>>>>> Why max_mtu differs with active_mtu?
>>>>
>>>> What does peer port say for max MTU ?
>>>>
>>>>> How can i set active mtu?
>>>>
>>>> SM sets active MTU to min of peer ports max MTUs.
>>>
>>> So with "peer port max MTU" do you mean this file?:
>>>
>>> /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
>>
>> I meant NeighborMTU from PortInfo as active MTU and MTUCap there is
>> supported MTU.
> 
> So these values are exactly the same as in "ibv_devinfo" and can be set
> in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu.
> 
> I've found the PortInfo with the command
> "smpquery portinfo -C mlx4_0 3 1"
> where I'm using the first HCA to contact the SM. I tell the SM the
> destination LID ('3' here in my case) and the destination port ('1').
> 
> Is there another method to set the max MTU?

That doesn't set max MTU (MTUCap) but merely reads it (for that port).

> I know that switches can also set the max MTU for their switch ports
> where most of them use 2048 as default.

You would need to contact your CA and/or switch vendor(s) (see below).

> How to change these switch port MTUs for unmanaged switches?
> 
> On managed switches this can be done over the web front-end.

Yes. MTUCap is RO in terms of the SM so there are only "out of band"
mechanisms to change this which are vendor specific like a web front end.

-- Hal

> Cheers,
> Sebastian
> 



Re: [PATCH V3 for-next 0/5] IB/IPoIB: Add multi-queue TSS and RSS support

2013-04-09 Thread Or Gerlitz

On 03/04/2013 23:12, Hefty, Sean wrote:
>> Hi Sean, Ping. You had concerns on the suggested concept, we want to
>> know if we addressed them, can you comment?
>
> I'm in meetings this week until tomorrow.  I'll try to take a look at the
> updated patches then or Friday.

any feedback?


Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 15:34, Hal Rosenstock wrote:
> On 4/9/2013 9:16 AM, Sebastian Riemer wrote:
>> On 09.04.2013 14:49, Hal Rosenstock wrote:
>>> On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
>>>> Hello. I have some servers, with mellanox ConnectX-3 and have some
>>>> questions:
>>>> Why max_mtu differs with active_mtu?
>>>
>>> What does peer port say for max MTU ?
>>>
>>>> How can i set active mtu?
>>>
>>> SM sets active MTU to min of peer ports max MTUs.
>>
>> So with "peer port max MTU" do you mean this file?:
>>
>> /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
> 
> I meant NeighborMTU from PortInfo as active MTU and MTUCap there is
> supported MTU.

So these values are exactly the same as in "ibv_devinfo" and can be set
in /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu.

I've found the PortInfo with the command
"smpquery portinfo -C mlx4_0 3 1"
where I'm using the first HCA to contact the SM. I tell the SM the
destination LID ('3' here in my case) and the destination port ('1').

Is there another method to set the max MTU?

I know that switches can also set the max MTU for their switch ports
where most of them use 2048 as default.
How to change these switch port MTUs for unmanaged switches?

On managed switches this can be done over the web front-end.

Cheers,
Sebastian


Re: tune ib stack

2013-04-09 Thread Hal Rosenstock
On 4/9/2013 9:16 AM, Sebastian Riemer wrote:
> On 09.04.2013 14:49, Hal Rosenstock wrote:
>> On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
>>> Hello. I have some servers, with mellanox ConnectX-3 and have some 
>>> questions:
>>> Why max_mtu differs with active_mtu? 
>>
>> What does peer port say for max MTU ?
>>
>>> How can i set active mtu?
>>
>> SM sets active MTU to min of peer ports max MTUs.
> 
> So with "peer port max MTU" do you mean this file?:
> 
> /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

I meant NeighborMTU from PortInfo as active MTU and MTUCap there is
supported MTU.

-- Hal


Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 14:49, Hal Rosenstock wrote:
> On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
>> Hello. I have some servers, with mellanox ConnectX-3 and have some questions:
>> Why max_mtu differs with active_mtu? 
> 
> What does peer port say for max MTU ?
> 
>> How can i set active mtu?
> 
> SM sets active MTU to min of peer ports max MTUs.

So with "peer port max MTU" do you mean this file?:

/sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

I've seen that it can be set as well. I've got two ConnectX-2 machines
connected back2back. In general these have 4K max and active.

So let's try something:

Host1:
$ echo 2048 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
# Port is not active, let's reactivate it.
$ echo 1 > /sys/class/infiniband/mlx4_0/device/enable

ibv_devinfo Host1:
max_mtu:2048 (4)
active_mtu: 2048 (4)

Host2:
max_mtu:4096 (5)
active_mtu: 2048 (4)

Both had "4096 (5)" before everywhere.
So that's the recommended way to reduce the MTU?

I've heard that reducing the MTU in a fabric can help fighting
congestion issues. As congestion control doesn't work yet, could this
help against congestion?

Cheers,
Sebastian
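
The ibv_devinfo values above fit the rule quoted at the top of this
message (SM sets active MTU to min of peer ports max MTUs). A minimal
sketch of that arithmetic follows. The helper name active_mtu_code is
made up for illustration, and the table is the standard IB MTU encoding
that ibv_devinfo prints in parentheses. It only models the rule and does
not talk to any hardware.

```python
# IB MTU codes as printed in parentheses by ibv_devinfo, e.g. "4096 (5)".
IB_MTU_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def active_mtu_code(local_max_code, peer_max_code):
    """Hypothetical helper: the largest MTU code both link ends support."""
    return min(local_max_code, peer_max_code)


# Values from the back-to-back test: Host1 max_mtu 2048 (4), Host2 4096 (5).
host1_max, host2_max = 4, 5
code = active_mtu_code(host1_max, host2_max)
print(f"active_mtu: {IB_MTU_BYTES[code]} ({code})")  # active_mtu: 2048 (4)
```

With both ports at code 5, as before the change, the same function
gives 4096 (5) on both hosts.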


RE: [RFC/PATCH v3] IPoIB: Leave space in skb linear buffer for IP headers

2013-04-09 Thread Luick, Dean
> From: Roland Dreier 
> + if (wc->byte_len > IPOIB_UD_HEAD_SIZE) {
> + page = priv->rx_ring[wr_id].page;
> + priv->rx_ring[wr_id].page = NULL;
> + } else {
> + page = NULL;
> + }
> +
>   /*
>* If we can't allocate a new RX buffer, dump
>* this packet and reuse the old buffer.
>*/
>   if (unlikely(!ipoib_alloc_rx_skb(dev, wr_id))) {
>   ++dev->stats.rx_dropped;
> + priv->rx_ring[wr_id].page = page;
>   goto repost;
>   }


Can you go through the "else" of the first if (page is NULL), then enter the 
second if? If so, isn't the page lost?


Dean
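
The path asked about above can be traced with a small model of the
quoted hunk. This is only a sketch: rx_completion and the dictionary
standing in for priv->rx_ring[wr_id] are illustrative, the value of
IPOIB_UD_HEAD_SIZE is a placeholder (the hunk does not show it), and
the success path of ipoib_alloc_rx_skb() is not modelled at all.

```python
IPOIB_UD_HEAD_SIZE = 128  # placeholder; the real value is not in the quoted hunk


def rx_completion(ring_entry, byte_len, alloc_ok):
    """Follow the quoted hunk and return the page left in the ring entry."""
    if byte_len > IPOIB_UD_HEAD_SIZE:
        # Large packet: take ownership of the page from the ring.
        page = ring_entry["page"]
        ring_entry["page"] = None
    else:
        # Small packet: the ring entry still holds its page.
        page = None

    if not alloc_ok:
        # "dump this packet and reuse the old buffer"
        ring_entry["page"] = page
    return ring_entry["page"]


large = rx_completion({"page": "page A"}, byte_len=2048, alloc_ok=False)
small = rx_completion({"page": "page A"}, byte_len=64, alloc_ok=False)
print("large packet, alloc fails:", large)  # page A (restored)
print("small packet, alloc fails:", small)  # None (ring pointer overwritten)
```

In the small-packet case the ring entry's pointer is overwritten with
NULL while the page is still allocated, which is the lost page in
question.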


Re: tune ib stack

2013-04-09 Thread Hal Rosenstock
On 4/9/2013 8:15 AM, Sebastian Riemer wrote:
> On 09.04.2013 13:51, Vasiliy Tolstov wrote:
>>> Something like this:
>>> echo 4096 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
>>
>> After doing this all srp connections down and port is down. I need to
>> restart openibd
> 
> Sorry for that! It's much easier to set the IP MTU. Managed switches
> support setting the RDMA MTU. So it could be possible that it is a
> setting in the SM config. But I'm not sure.

IP MTU is different than link MTU. For UD mode, it's link MTU - 4. For
RC (connected) mode, this can be a much larger number than the link MTU
as the HCA does the segmentation/reassembly down to the path MTU.

> $ man opensm
> says that it can be set in the partitions.conf

Yes, MTU for the IPoIB interface is set in the partition file. This
would need configuring for the larger (4K) MTU assuming all ports
support the 4K MTU. If not, some ports won't be able to join the IPoIB
broadcast (or other) IB multicast groups and IPoIB won't work.

-- Hal
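
To put numbers on the UD case described above, here is a small sketch
of the "link MTU - 4" arithmetic. The function name ipoib_ud_ip_mtu is
made up for illustration. The constant 4 is the figure given above,
which I take to be the 4-byte IPoIB encapsulation header. Connected
mode is deliberately left out, since its IP MTU is not derived from
the link MTU this way.

```python
IPOIB_ENCAP_LEN = 4  # bytes subtracted from the link MTU in UD mode


def ipoib_ud_ip_mtu(link_mtu):
    """Hypothetical helper: IP MTU of an IPoIB interface in UD mode."""
    return link_mtu - IPOIB_ENCAP_LEN


for link_mtu in (2048, 4096):
    print(f"link MTU {link_mtu} -> IPoIB UD IP MTU {ipoib_ud_ip_mtu(link_mtu)}")
# link MTU 2048 -> IPoIB UD IP MTU 2044
# link MTU 4096 -> IPoIB UD IP MTU 4092
```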


Re: tune ib stack

2013-04-09 Thread Hal Rosenstock
On 4/9/2013 7:12 AM, Vasiliy Tolstov wrote:
> Hello. I have some servers, with mellanox ConnectX-3 and have some questions:
> Why max_mtu differs with active_mtu? 

What does peer port say for max MTU ?

> How can i set active mtu?

SM sets active MTU to min of peer ports max MTUs.

> Why ibstatus says that i have only 10 Gb/s ?

Because the link negotiated at 10 Gb/s.

> All cables support 40 Gb/s.

Do ports support 40 Gb/s also ? What do peer ports say for supported and
enabled link speeds ?

-- Hal

> Thanks for any help.
> 
> Linux xen28 3.8.6-1-xen #1 SMP Fri Apr 5 18:48:02 UTC 2013 (713918b)
> x86_64 x86_64 x86_64 GNU/Linux
> 
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80::::0025:90ff:ff17:9b25
> base lid:0x34
> sm lid:  0x4
> state:   4: ACTIVE
> phys state:  5: LinkUp
> rate:10 Gb/sec (4X)
> link_layer:  InfiniBand
> 
> 
> 06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
> hca_id: mlx4_0
> transport:  InfiniBand (0)
> fw_ver: 2.10.700
> node_guid:  0025:90ff:ff17:9b24
> sys_image_guid: 0025:90ff:ff17:9b27
> vendor_id:  0x02c9
> vendor_part_id: 4099
> hw_ver: 0x0
> board_id:   SM_218101000
> phys_port_cnt:  1
> port:   1
> state:  PORT_ACTIVE (4)
> max_mtu:4096 (5)
> active_mtu: 2048 (4)
> sm_lid: 4
> port_lid:   52
> port_lmc:   0x00
> link_layer: IB
> 
> 
> --
> Vasiliy Tolstov,
> e-mail: v.tols...@selfip.ru
> jabber: v...@selfip.ru



Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 13:51, Vasiliy Tolstov wrote:
>> Something like this:
>> echo 4096 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu
> 
> After doing this all srp connections down and port is down. I need to
> restart openibd

Sorry for that! It's much easier to set the IP MTU. Managed switches
support setting the RDMA MTU. So it could be possible that it is a
setting in the SM config. But I'm not sure.

$ man opensm
says that it can be set in the partitions.conf

>> You should see "40 Gb/sec (4X QDR)" here. Perhaps the OFED is too old so
>> that FDR and ConnectX 3 aren't supported, yet. "10 Gb/sec (4X)" seems to
>> be the default case if a rate isn't supported.
> 
> Yes, in older card with ConnectX i see this, but in case of ConnectX-3 only 10 Gb

The kernel version is okay. It depends on the user space.
There is a support note in OFED 3.5:
- ConnectX-3 (fw-ConnectX3 Rev 2.11.0500) (FDR and FDR10 Modes are
Supported)

Before OFED 3.5 these HCAs aren't supported. A look at the related
source code could be worth a try.

Cheers,
Sebastian


Re: tune ib stack

2013-04-09 Thread Vasiliy Tolstov
2013/4/9 Sebastian Riemer :
> Because 2048 is the default and 4096 is the max. supported MTU by the
> hardware.
>
>> How can i set active mtu?
>
> Something like this:
> echo 4096 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

After doing this, all SRP connections go down and the port is down. I need to
restart openibd.

06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
Subsystem: Mellanox Technologies Device 0017
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop-
ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort-
SERR-

> Could be a bug. Which OFED/Kernel (if using in-tree IB modules) do you use?
> Mine says with ConnectX2 QDR: "40 Gb/sec (4X QDR)"

I'm using the stock 3.8.6 kernel with Xen patches on top, and the
modules provided with the kernel (only ib_srp comes from Bart's GitHub
repo).


> You should see "40 Gb/sec (4X QDR)" here. Perhaps the OFED is too old so
> that FDR and ConnectX 3 aren't supported, yet. "10 Gb/sec (4X)" seems to
> be the default case if a rate isn't supported.

Yes, on an older ConnectX card I see this, but with ConnectX-3 only 10 Gb.

--
Vasiliy Tolstov,
e-mail: v.tols...@selfip.ru
jabber: v...@selfip.ru


Re: tune ib stack

2013-04-09 Thread Sebastian Riemer
On 09.04.2013 13:12, Vasiliy Tolstov wrote:
> Hello. I have some servers, with mellanox ConnectX-3 and have some questions:
> Why max_mtu differs with active_mtu?

Because 2048 is the default and 4096 is the max. supported MTU by the
hardware.

> How can i set active mtu?

Something like this:
echo 4096 > /sys/class/infiniband/mlx4_0/device/mlx4_port1_mtu

> Why ibstatus says that i have only 10 Gb/s ?

Could be a bug. Which OFED/Kernel (if using in-tree IB modules) do you use?

Mine says with ConnectX2 QDR: "40 Gb/sec (4X QDR)"

> All cables support 40 Gb/s.
> 
> Thanks for any help.
> 
> Linux xen28 3.8.6-1-xen #1 SMP Fri Apr 5 18:48:02 UTC 2013 (713918b)
> x86_64 x86_64 x86_64 GNU/Linux
> 
> Infiniband device 'mlx4_0' port 1 status:
> default gid: fe80::::0025:90ff:ff17:9b25
> base lid:0x34
> sm lid:  0x4
> state:   4: ACTIVE
> phys state:  5: LinkUp
> rate:10 Gb/sec (4X)
> link_layer:  InfiniBand

You should see "40 Gb/sec (4X QDR)" here. Perhaps the OFED is too old so
that FDR and ConnectX 3 aren't supported, yet. "10 Gb/sec (4X)" seems to
be the default case if a rate isn't supported.

Cheers,
Sebastian



tune ib stack

2013-04-09 Thread Vasiliy Tolstov
Hello. I have some servers, with mellanox ConnectX-3 and have some questions:
Why max_mtu differs with active_mtu? How can i set active mtu?
Why ibstatus says that i have only 10 Gb/s ?

All cables support 40 Gb/s.

Thanks for any help.

Linux xen28 3.8.6-1-xen #1 SMP Fri Apr 5 18:48:02 UTC 2013 (713918b)
x86_64 x86_64 x86_64 GNU/Linux

Infiniband device 'mlx4_0' port 1 status:
default gid: fe80::::0025:90ff:ff17:9b25
base lid:0x34
sm lid:  0x4
state:   4: ACTIVE
phys state:  5: LinkUp
rate:10 Gb/sec (4X)
link_layer:  InfiniBand


06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.10.700
node_guid:  0025:90ff:ff17:9b24
sys_image_guid: 0025:90ff:ff17:9b27
vendor_id:  0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id:   SM_218101000
phys_port_cnt:  1
port:   1
state:  PORT_ACTIVE (4)
max_mtu:4096 (5)
active_mtu: 2048 (4)
sm_lid: 4
port_lid:   52
port_lmc:   0x00
link_layer: IB


--
Vasiliy Tolstov,
e-mail: v.tols...@selfip.ru
jabber: v...@selfip.ru