Re: [PATCH net] af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag

2015-09-19 Thread Aaron Conole
Sergei Shtylyov  writes:

> Hello.
> ...
>Your patch doesn't comply to the Linux CodingStyle.
> ...

I'll fix and post v2 - apologies for messing up that check.

-Aaron

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [patch net-next RFC 3/6] rocker: switch to local transaction phase enum

2015-09-19 Thread Vivien Didelot
Hi Jiri,

On Sep. Saturday 19 (38) 02:29 PM, Jiri Pirko wrote:
> Since switchdev_trans_ph anum is going to be removed, and rocker code is
> way too complicated in this matter to be converted, just introduce local
> enum for transaction phase. Pass it around in local transaction
> structure.

I missed this typo here: s/anum/enum/.

> Signed-off-by: Jiri Pirko 

I find the renaming trick a bit hard to follow. I am wondering if
this patch could be used first, so that patch 1/6 can be dropped?

That way, you can first add the rocker_trans structure and set its ph
member to obj->trans in obj_add/attr_set, then the following patch
(currently 2/6) would just assign it to the new trans parameter.

Thanks,
-v


Re: PATCH: netdev: add a cast NLMSG_OK to avoid a GCC warning in users' code

2015-09-19 Thread D. Hugh Redelmeier
Fixes have been proposed for this problem at least twice before.



(These messages are not presented as a thread so I've put links to
each of them)

Problem report: 
Patch proposal: 
Reply suggesting improved presentation:  
Revised patch proposal: 
Duplicate revised patch proposal: 

Doron Tsur proposed:
-  (nlh)->nlmsg_len <= (len))
+  (int)(nlh)->nlmsg_len <= (len))

This would function correctly as long as nlmsg_len were <= INT_MAX.
I imagine that this would always be the case, since Linux isn't used on
machines with ints narrower than 32 bits.

It would cause a GCC warning on 32-bit machines when len has an
unsigned type that is the same width as int.

Programs conforming to the netlink documentation would not get
warnings.

There was no reply to the revised patch proposal.




Mike Frysinger proposed:
-#define NLMSG_OK(nlh,len) ((len) >= (int)sizeof(struct nlmsghdr) && \
+#define NLMSG_OK(nlh,len) ((len) >= sizeof(struct nlmsghdr) && \

This should function correctly as long as len has an unsigned type or
a non-negative value.  It won't work correctly if len is a size_t
holding (size_t)-1 (the error indicator).

David Miller replied:

I don't think we can change this.  If you get rid of the 'int'
cast then code is going to end up with a signed comparison for
the first test even if 'len' is signed, and that's a potential
security issue.

I don't understand this response.  If you get rid of the int cast, and
the type of len, after the "integral promotions", is the same width as
size_t, the "usual arithmetic conversions" of the C language will
cause the comparison to be done in unsigned.

I would agree with a variant of the reply:

I don't think we can change this.  If you get rid of the 'int'
cast then IN MOST ENVIRONMENTS the code is going to end up
with an UNSIGNED comparison for the first test even if 'len'
is signed, and that's a potential security issue.

Consider the case where len is a signed type with a negative
value, such as -1, the error indicator described in recv(2).

The example code in netlink(7) about reading netlink messages is a good
example of code that would misbehave if the cast were removed.
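The signed/unsigned behaviour at issue can be demonstrated with a small
stand-alone C sketch. The macro size and function names here are simplified
stand-ins for illustration, not the real <linux/netlink.h> definitions:

```c
#include <stddef.h>

/* Stand-in for sizeof(struct nlmsghdr); the exact value doesn't matter,
 * only that sizeof yields an unsigned size_t. */
#define HDR_SIZE ((size_t)16)

/* Current NLMSG_OK form: the (int) cast forces a signed comparison,
 * so a negative len (e.g. recv() returning -1) is correctly rejected. */
static int len_ok_with_cast(int len)
{
    return len >= (int)HDR_SIZE;
}

/* Proposed form without the cast: the usual arithmetic conversions
 * promote len to size_t, the comparison is unsigned, and -1 wraps to
 * SIZE_MAX -- the check wrongly passes, the potential security issue. */
static int len_ok_without_cast(int len)
{
    return (size_t)len >= HDR_SIZE;  /* same as plain: len >= HDR_SIZE */
}
```

With len = -1, the cast version rejects the message while the cast-free
version accepts it, which is exactly the scenario from recv(2).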


Re: [PATCH net-next 7/7] switchdev: update documentation on FDB ageing_time

2015-09-19 Thread Scott Feldman
On Sat, Sep 19, 2015 at 6:21 PM, roopa  wrote:
> On 9/18/15, 12:55 PM, sfel...@gmail.com wrote:
>>
>> From: Scott Feldman 
>>
>> Signed-off-by: Scott Feldman 
>> ---
>>   Documentation/networking/switchdev.txt |   24 
>>   1 file changed, 12 insertions(+), 12 deletions(-)
>>
>> diff --git a/Documentation/networking/switchdev.txt
>> b/Documentation/networking/switchdev.txt
>> index 476df04..67e43ee 100644
>> --- a/Documentation/networking/switchdev.txt
>> +++ b/Documentation/networking/switchdev.txt
>> @@ -239,20 +239,20 @@ The driver should initialize the attributes to the
>> hardware defaults.
>>   FDB Ageing
>>   ^^
>>   -There are two FDB ageing models supported: 1) ageing by the device, and
>> 2)
>> -ageing by the kernel.  Ageing by the device is preferred if many FDB
>> entries
>> -are supported.  The driver calls
>> call_switchdev_notifiers(SWITCHDEV_FDB_DEL,
>> -...) to age out the FDB entry.  In this model, ageing by the kernel
>> should be
>> -turned off.  XXX: how to turn off ageing in kernel on a per-port basis or
>> -otherwise prevent the kernel from ageing out the FDB entry?
>> -
>> -In the kernel ageing model, the standard bridge ageing mechanism is used
>> to age
>> -out stale FDB entries.  To keep an FDB entry "alive", the driver should
>> refresh
>> -the FDB entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD,
>> ...).  The
>> +The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and
>> it is
>> +the responsibility of the port driver/device to age out these entries.
>> If the
>> +port device supports ageing, when the FDB entry expires, it will notify
>> the
>> +driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL.  If
>> the
>> +device does not support ageing, the driver can simulate ageing using a
>> +garbage collection timer to monitor FDB entries.  Expired entries will be
>> +notified to the bridge using SWITCHDEV_FDB_DEL.  See the rocker driver for
>> +an example of a driver running an ageing timer.
>
> We do rely on the bridge driver ageing out entries. We have gone from
> hardware ageing to ageing in the switch driver to ultimately ageing in the
> bridge driver.  :). And we keep the fdb entries in the bridge driver "alive"
> by using 'NTF_USE' from the user-space driver.

Yes, your switch driver is in user-space so you have to use NTF_USE to
refresh the entry since you cannot use the kernel driver model to
call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...).  Consequently, your
entries are not marked with NTF_EXT_LEARNED, so this patch is a no-op
for you.  You can continue to use the bridge driver to age out your
entries.

>> +To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the
>> FDB
>> +entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...).  The
>>
> Even with your current patches, it looks like the switch driver will need to
> refresh the entries anyway to keep the "last-used" time current.
> In which case, is there much value in the switch driver doing the ageing?

"should" not "must".

Value is for the many learned FDB entries case, to move the ageing
function to hardware.

> I am thinking that keeping the default behavior of the bridge driver ageing,
> with anything else configurable, might be a better option.

I'd rather someone add that knob when it's actually needed.  When the
first in-kernel switchdev driver wants to use the bridge driver's
ageing function, we can make that adjustment.
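As a rough illustration of the device-side ageing model being discussed — a
garbage-collection sweep that expires NTF_EXT_LEARNED entries and would
notify the bridge with SWITCHDEV_FDB_DEL — here is a minimal userspace C
sketch. All structure and function names are hypothetical, not the rocker
driver's actual code:

```c
#include <stdbool.h>

#define AGEING_TIME 300UL  /* assumed ageing_time in seconds */

/* Hypothetical FDB entry as a driver might track it. */
struct fdb_entry {
    unsigned long touched;  /* last time the entry was refreshed */
    bool ext_learned;       /* NTF_EXT_LEARNED: aged by the driver, not bridge */
    bool deleted;           /* set where SWITCHDEV_FDB_DEL would be sent */
};

/* One pass of the ageing timer: expire driver-owned entries that have
 * not been refreshed within AGEING_TIME; bridge-owned entries are left
 * for the bridge's own ageing mechanism. */
static void fdb_gc_sweep(struct fdb_entry *tbl, int n, unsigned long now)
{
    for (int i = 0; i < n; i++) {
        if (tbl[i].ext_learned && !tbl[i].deleted &&
            now - tbl[i].touched > AGEING_TIME)
            tbl[i].deleted = true;  /* i.e. notify SWITCHDEV_FDB_DEL */
    }
}
```

Refreshing an entry (the SWITCHDEV_FDB_ADD path) would simply update
`touched`, which is why a driver that refreshes anyway gets ageing almost
for free once such a sweep exists.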


[PATCH 03/15] RDS: fix rds_sock reference bug while doing bind

2015-09-19 Thread Santosh Shilimkar
One needs to take a reference on an rds socket while using it and
release it once done with it. The rds_add_bound() code path does not do
that, so let's fix it.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/bind.c | 16 +++-
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/net/rds/bind.c b/net/rds/bind.c
index 01989e2..166c605 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -61,8 +61,10 @@ static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 
port,
cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
  be16_to_cpu(rs->rs_bound_port);
 
-   if (cmp == needle)
+   if (cmp == needle) {
+   rds_sock_addref(rs);
return rs;
+   }
}
 
if (insert) {
@@ -94,10 +96,10 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
rs = rds_bind_lookup(addr, port, NULL);
read_unlock_irqrestore(_bind_lock, flags);
 
-   if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
-   rds_sock_addref(rs);
-   else
+   if (rs && sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) {
+   rds_sock_put(rs);
rs = NULL;
+   }
 
rdsdebug("returning rs %p for %pI4:%u\n", rs, ,
ntohs(port));
@@ -123,14 +125,18 @@ static int rds_add_bound(struct rds_sock *rs, __be32 
addr, __be16 *port)
write_lock_irqsave(_bind_lock, flags);
 
do {
+   struct rds_sock *rrs;
if (rover == 0)
rover++;
-   if (!rds_bind_lookup(addr, cpu_to_be16(rover), rs)) {
+   rrs = rds_bind_lookup(addr, cpu_to_be16(rover), rs);
+   if (!rrs) {
*port = rs->rs_bound_port;
ret = 0;
rdsdebug("rs %p binding to %pI4:%d\n",
  rs, , (int)ntohs(*port));
break;
+   } else {
+   rds_sock_put(rrs);
}
} while (rover++ != last);
 
-- 
1.9.1
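The essence of the fix — the lookup itself takes the reference while the
entry is still known to be live, and every caller must drop it — can be
sketched in plain C. The names here are hypothetical stand-ins, not the RDS
code itself:

```c
#include <stddef.h>

/* Hypothetical stand-in for rds_sock with a simple refcount. */
struct rsock {
    int refs;
    int bound_port;
};

static void sock_addref(struct rsock *rs) { rs->refs++; }
static void sock_put(struct rsock *rs)    { rs->refs--; }

/* As in the patched rds_bind_lookup(): a successful lookup returns the
 * socket with a reference already held, so the entry cannot go away
 * between dropping the hash lock and the caller using it. */
static struct rsock *bind_lookup(struct rsock *tbl, int n, int port)
{
    for (int i = 0; i < n; i++) {
        if (tbl[i].bound_port == port) {
            sock_addref(&tbl[i]);
            return &tbl[i];
        }
    }
    return NULL;
}
```

A caller that only probes for a collision, as rds_add_bound() does, must
then drop the reference with sock_put() — which is what the `else` branch
added by the patch does.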



Re: [PATCH net-next 7/7] switchdev: update documentation on FDB ageing_time

2015-09-19 Thread roopa

On 9/18/15, 12:55 PM, sfel...@gmail.com wrote:

From: Scott Feldman 

Signed-off-by: Scott Feldman 
---
  Documentation/networking/switchdev.txt |   24 
  1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/Documentation/networking/switchdev.txt 
b/Documentation/networking/switchdev.txt
index 476df04..67e43ee 100644
--- a/Documentation/networking/switchdev.txt
+++ b/Documentation/networking/switchdev.txt
@@ -239,20 +239,20 @@ The driver should initialize the attributes to the 
hardware defaults.
  FDB Ageing
  ^^
  
-There are two FDB ageing models supported: 1) ageing by the device, and 2)

-ageing by the kernel.  Ageing by the device is preferred if many FDB entries
-are supported.  The driver calls call_switchdev_notifiers(SWITCHDEV_FDB_DEL,
-...) to age out the FDB entry.  In this model, ageing by the kernel should be
-turned off.  XXX: how to turn off ageing in kernel on a per-port basis or
-otherwise prevent the kernel from ageing out the FDB entry?
-
-In the kernel ageing model, the standard bridge ageing mechanism is used to age
-out stale FDB entries.  To keep an FDB entry "alive", the driver should refresh
-the FDB entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...).  The
+The bridge will skip ageing FDB entries marked with NTF_EXT_LEARNED and it is
+the responsibility of the port driver/device to age out these entries.  If the
+port device supports ageing, when the FDB entry expires, it will notify the
+driver which in turn will notify the bridge with SWITCHDEV_FDB_DEL.  If the
+device does not support ageing, the driver can simulate ageing using a
+garbage collection timer to monitor FDB entries.  Expired entries will be
+notified to the bridge using SWITCHDEV_FDB_DEL.  See the rocker driver for
+an example of a driver running an ageing timer.
We do rely on the bridge driver ageing out entries. We have gone from 
hardware ageing to ageing in the switch driver to ultimately ageing in 
the bridge driver.  :). And we keep the fdb entries in the bridge driver 
"alive" by using 'NTF_USE' from the user-space driver.



+
+To keep an NTF_EXT_LEARNED entry "alive", the driver should refresh the FDB
+entry by calling call_switchdev_notifiers(SWITCHDEV_FDB_ADD, ...).  The

Even with your current patches, it looks like the switch driver will need
to refresh the entries anyway to keep the "last-used" time current.

In which case, is there much value in the switch driver doing the ageing?

I am thinking that keeping the default behavior of the bridge driver
ageing, with anything else configurable, might be a better option.


Thanks,
Roopa





Re: [PATCH 0/7] Phy and mdiobus fixes

2015-09-19 Thread Florian Fainelli
On 09/18/15 02:46, Russell King - ARM Linux wrote:
> Hi,
> 
> While looking at the phy code, I identified a number of weaknesses
> where refcounting on device structures was being leaked, where
> modules could be removed while in-use, and where the fixed-phy could
> end up having unintended consequences caused by incorrect calls to
> fixed_phy_update_state().
> 
> This patch series resolves those issues, some of which were discovered
> with testing on an Armada 388 board.  Not all patches are fully tested,
> particularly the one which touches several network drivers.
> 
> When resolving the struct device refcounting problems, several different
> solutions were considered before settling on the implementation here -
> one of the considerations was to avoid touching many network drivers.
> The solution here is:
> 
>   phy_attach*() - takes a refcount
>   phy_detach*() - drops the phy_attach refcount
> 
> Provided drivers always attach and detach their phys, which they should
> already be doing, this should change nothing, even if they leak a refcount.
> 
>   of_phy_find_device() and of_* functions which use that take
>   a refcount.  Arrange for this refcount to be dropped once
>   the phy is attached.
> 
> This is the reason why the previous change is important - we can't drop
> this refcount taken by of_phy_find_device() until something else holds
> a reference on the device.  This resolves the leaked refcount caused by
> using of_phy_connect() or of_phy_attach().
> 
> Even without the above changes, these drivers are leaking by calling
> of_phy_find_device().  These drivers are addressed by adding the
> appropriate release of that refcount.
> 
> The mdiobus code also suffered from the same kind of leak, but thankfully
> this only happened in one place - the mdio-mux code.
> 
> I also found that the try_module_get() in the phy layer code was utterly
> useless: phydev->dev.driver was guaranteed to always be NULL, so
> try_module_get() was always being called with a NULL argument.  I proved
> this with my SFP code, which declares its own MDIO bus - the module use
> count was never incremented irrespective of how I set the MDIO bus up.
> This allowed the MDIO bus code to be removed from the kernel while there
> were still PHYs attached to it.
> 
> One other bug was discovered: while using in-band-status with mvneta, it
> was found that if a real phy is attached with in-band-status enabled,
> and another ethernet interface is using the fixed-phy infrastructure, the
> interface using the fixed-phy infrastructure is configured according to
> the other interface using the in-band-status - which is caused by the
> fixed-phy code not verifying that the phy_device passed in is actually
> a fixed-phy device, rather than a real MDIO phy.
> 
> Lastly, having mdio_bus reversing phy_device_register() internals seems
> like a layering violation - it's trivial to move that code to the phy
> device layer.

Reviewed-by: Florian Fainelli 

Thanks!
-- 
Florian


[PATCH 13/15] RDS: mark rds_ib_fmr_wq static

2015-09-19 Thread Santosh Shilimkar
Fix the following warning by marking rds_ib_fmr_wq static:

net/rds/ib_rdma.c:87:25: warning: symbol 'rds_ib_fmr_wq' was not declared. 
Should it be static?

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_rdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 52d889a..bb62024 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -83,7 +83,7 @@ struct rds_ib_mr_pool {
struct ib_fmr_attr  fmr_attr;
 };
 
-struct workqueue_struct *rds_ib_fmr_wq;
+static struct workqueue_struct *rds_ib_fmr_wq;
 
 int rds_ib_fmr_init(void)
 {
-- 
1.9.1



[PATCH 08/15] RDS: ack more receive completions to improve performance

2015-09-19 Thread Santosh Shilimkar
For better performance, we split the receive completion IRQ handler. That
lets us acknowledge several WC events in one call. We also limit the WCs
to a maximum of 32 to bound latency. Acknowledging several completions in
one call, instead of one call per completion, provides better performance
since fewer lock acquisitions are performed.

In the next patch, send completion handling is also split; it re-uses
poll_cq(), and hence the code is moved to ib_cm.c.
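The effect of the batching can be illustrated with a toy model: draining N
pending completions in chunks of at most 32 takes ceil(N/32) lock
round-trips instead of N. The constant mirrors RDS_IB_WC_MAX from the
patch, but the code is an illustrative sketch, not the driver itself:

```c
#define WC_MAX 32  /* mirrors RDS_IB_WC_MAX: caps work per poll to bound latency */

/* Toy drain loop: each iteration models one poll under one lock
 * acquisition, acknowledging up to WC_MAX completions at once.
 * Returns how many lock round-trips were needed. */
static int drain_cq(int pending)
{
    int polls = 0;

    while (pending > 0) {
        int batch = pending < WC_MAX ? pending : WC_MAX;

        pending -= batch;   /* ack the whole batch in one call */
        polls++;
    }
    return polls;
}
```

So 100 pending completions cost 4 lock round-trips instead of 100, while
the 32-entry cap keeps any single poll from monopolizing the CPU.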

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   |  28 +--
 net/rds/ib_cm.c|  70 ++-
 net/rds/ib_recv.c  | 136 +++--
 net/rds/ib_stats.c |   3 +-
 4 files changed, 132 insertions(+), 105 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index f1fd5ff..727759b 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -24,6 +24,8 @@
 
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
+#define RDS_IB_WC_MAX  32
+
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
 
@@ -89,6 +91,20 @@ struct rds_ib_work_ring {
atomic_tw_free_ctr;
 };
 
+/* Rings are posted with all the allocations they'll need to queue the
+ * incoming message to the receiving socket so this can't fail.
+ * All fragments start with a header, so we can make sure we're not receiving
+ * garbage, and we can tell a small 8 byte fragment from an ACK frame.
+ */
+struct rds_ib_ack_state {
+   u64 ack_next;
+   u64 ack_recv;
+   unsigned intack_required:1;
+   unsigned intack_next_valid:1;
+   unsigned intack_recv_valid:1;
+};
+
+
 struct rds_ib_device;
 
 struct rds_ib_connection {
@@ -102,6 +118,10 @@ struct rds_ib_connection {
struct ib_pd*i_pd;
struct ib_cq*i_send_cq;
struct ib_cq*i_recv_cq;
+   struct ib_wci_recv_wc[RDS_IB_WC_MAX];
+
+   /* interrupt handling */
+   struct tasklet_struct   i_recv_tasklet;
 
/* tx */
struct rds_ib_work_ring i_send_ring;
@@ -112,7 +132,6 @@ struct rds_ib_connection {
atomic_ti_signaled_sends;
 
/* rx */
-   struct tasklet_struct   i_recv_tasklet;
struct mutexi_recv_mutex;
struct rds_ib_work_ring i_recv_ring;
struct rds_ib_incoming  *i_ibinc;
@@ -199,13 +218,14 @@ struct rds_ib_statistics {
uint64_ts_ib_connect_raced;
uint64_ts_ib_listen_closed_stale;
uint64_ts_ib_tx_cq_call;
+   uint64_ts_ib_evt_handler_call;
+   uint64_ts_ib_tasklet_call;
uint64_ts_ib_tx_cq_event;
uint64_ts_ib_tx_ring_full;
uint64_ts_ib_tx_throttle;
uint64_ts_ib_tx_sg_mapping_failure;
uint64_ts_ib_tx_stalled;
uint64_ts_ib_tx_credit_updates;
-   uint64_ts_ib_rx_cq_call;
uint64_ts_ib_rx_cq_event;
uint64_ts_ib_rx_ring_empty;
uint64_ts_ib_rx_refill_from_cq;
@@ -324,7 +344,8 @@ void rds_ib_recv_free_caches(struct rds_ib_connection *ic);
 void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp);
 void rds_ib_inc_free(struct rds_incoming *inc);
 int rds_ib_inc_copy_to_user(struct rds_incoming *inc, struct iov_iter *to);
-void rds_ib_recv_cq_comp_handler(struct ib_cq *cq, void *context);
+void rds_ib_recv_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc,
+struct rds_ib_ack_state *state);
 void rds_ib_recv_tasklet_fn(unsigned long data);
 void rds_ib_recv_init_ring(struct rds_ib_connection *ic);
 void rds_ib_recv_clear_ring(struct rds_ib_connection *ic);
@@ -332,6 +353,7 @@ void rds_ib_recv_init_ack(struct rds_ib_connection *ic);
 void rds_ib_attempt_ack(struct rds_ib_connection *ic);
 void rds_ib_ack_send_complete(struct rds_ib_connection *ic);
 u64 rds_ib_piggyb_ack(struct rds_ib_connection *ic);
+void rds_ib_set_ack(struct rds_ib_connection *ic, u64 seq, int ack_required);
 
 /* ib_ring.c */
 void rds_ib_ring_init(struct rds_ib_work_ring *ring, u32 nr);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 9043f5c..28e0979 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -216,6 +216,72 @@ static void rds_ib_cq_event_handler(struct ib_event 
*event, void *data)
 event->event, ib_event_msg(event->event), data);
 }
 
+/* Plucking the oldest entry from the ring can be done concurrently with
+ * the thread refilling the ring.  Each ring operation is protected by
+ * spinlocks and the transient state of refilling doesn't change the
+ * recording of which entry is oldest.
+ *
+ * This relies on IB only calling one cq comp_handler for each cq so that
+ * there will only be one caller of rds_recv_incoming() per RDS connection.
+ */
+static void 

[PATCH 11/15] RDS: fix the rds_ib_fmr_wq kick call

2015-09-19 Thread Santosh Shilimkar
The RDS IB mr pool has its own workqueue, 'rds_ib_fmr_wq', so we need
to use queue_delayed_work() to kick the work. This was hurting
performance, since pool maintenance was triggered less often from this
path.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_rdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 872f523..b6644fa 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -319,7 +319,7 @@ static struct rds_ib_mr *rds_ib_alloc_fmr(struct 
rds_ib_device *rds_ibdev)
int err = 0, iter = 0;
 
if (atomic_read(>dirty_count) >= pool->max_items / 10)
-   schedule_delayed_work(>flush_worker, 10);
+   queue_delayed_work(rds_ib_fmr_wq, >flush_worker, 10);
 
while (1) {
ibmr = rds_ib_reuse_fmr(pool);
-- 
1.9.1



[PATCH 00/15] RDS: connection scalability and performance improvements

2015-09-19 Thread Santosh Shilimkar
This series addresses RDS connection bottlenecks on massive workloads and
improves the RDMA performance by almost 3X. RDS TCP also gets a small gain
of about 12%.

RDS is being used in massive systems with high scalability, where several
hundred thousand end points and tens of thousands of local processes
operate on tens of thousands of sockets. Being RC (reliable connection),
socket bind and release happen very often, and any inefficiency in bind
hash lookups hurts overall system performance. The RDS bind hash-table
uses a global spin-lock, which is the biggest bottleneck; to make matters
worse, it uses RCU inside the global lock for the hash buckets.
This is addressed by simply using a per-bucket rw lock, which makes the
locking simple and very efficient. The hash table size is also scaled up
accordingly.
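The per-bucket locking idea can be sketched in userspace C11, with a tiny
spinlock standing in for the kernel's per-bucket rw lock. All names are
hypothetical, and the trivial hash is for illustration only:

```c
#include <stdatomic.h>

#define NBUCKETS 8192  /* the scaled-up table size from this series */

/* One lock per bucket instead of one global lock: binds hashing to
 * different buckets no longer contend with each other. */
struct bind_bucket {
    atomic_flag lock;   /* stand-in for the kernel's per-bucket rwlock */
    int nentries;
};

static struct bind_bucket table[NBUCKETS];

static unsigned int hash_port(unsigned int port)
{
    return port % NBUCKETS;   /* trivial hash for illustration */
}

static void bucket_add(unsigned int port)
{
    struct bind_bucket *b = &table[hash_port(port)];

    while (atomic_flag_test_and_set(&b->lock))
        ;                      /* spin: only this one bucket is held */
    b->nentries++;
    atomic_flag_clear(&b->lock);
}
```

Two binds to ports in different buckets take different locks entirely; only
true hash collisions serialize, which is what removes the global-lock
bottleneck.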

For the RDS RDMA improvement, the completion handling is revamped so that
we can do batched completions. Both the send and receive completion
handlers are split logically to achieve this. RDS 8K messages being one of
the key usecases, the mr pool is adapted to have 8K mrs along with the
default 1M mrs. While doing this, a few fixes and a couple of bottlenecks
seen with rds_sendmsg() are addressed.

The series applies against 4.3-rc1 as well as net-next. It's tested on
Oracle hardware with an IB fabric for both bcopy as well as RDMA mode. RDS
TCP is tested with an iXGB NIC. Like last time, the iWARP transport is
untested with these changes.

As a side note, the IB HCA driver I used for testing misses at least 3
important patches in upstream needed to see the full-blown RDS IB
performance, and I am hoping to get those into mainline with their help.

Santosh Shilimkar (15):
  RDS: use kfree_rcu in rds_ib_remove_ipaddr
  RDS: make socket bind/release locking scheme simple and more efficient
  RDS: fix rds_sock reference bug while doing bind
  RDS: Use per-bucket rw lock for bind hash-table
  RDS: increase size of hash-table to 8K
  RDS: defer the over_batch work to send worker
  RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL
  RDS: ack more receive completions to improve performance
  RDS: split send completion handling and do batch ack
  RDS: handle rds_ibdev release case instead of crashing the kernel
  RDS: fix the rds_ib_fmr_wq kick call
  RDS: use already available pool handle from ibmr
  RDS: mark rds_ib_fmr_wq static
  RDS: use max_mr from HCA caps than max_fmr
  RDS: split mr pool to improve 8K messages performance

 net/rds/af_rds.c   |   8 +---
 net/rds/bind.c |  78 ++
 net/rds/ib.c   |  47 --
 net/rds/ib.h   |  78 +++---
 net/rds/ib_cm.c| 114 ++--
 net/rds/ib_rdma.c  | 116 ++---
 net/rds/ib_recv.c  | 136 +++--
 net/rds/ib_send.c  | 110 ---
 net/rds/ib_stats.c |  22 +
 net/rds/rds.h  |   1 +
 net/rds/send.c |  15 --
 net/rds/threads.c  |   2 +
 12 files changed, 446 insertions(+), 281 deletions(-)

-- 
1.9.1



[PATCH 10/15] RDS: handle rds_ibdev release case instead of crashing the kernel

2015-09-19 Thread Santosh Shilimkar
From: Santosh Shilimkar 

Just in case we are still handling the QP receive completion while the
rds_ibdev is released, drop the connection instead of crashing the kernel.

Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_cm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 8f51d0d..2b2370e 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -285,7 +285,8 @@ static void rds_ib_tasklet_fn_recv(unsigned long data)
struct rds_ib_device *rds_ibdev = ic->rds_ibdev;
struct rds_ib_ack_state state;
 
-   BUG_ON(!rds_ibdev);
+   if (!rds_ibdev)
+   rds_conn_drop(conn);
 
rds_ib_stats_inc(s_ib_tasklet_call);
 
-- 
1.9.1



[PATCH 01/15] RDS: use kfree_rcu in rds_ib_remove_ipaddr

2015-09-19 Thread Santosh Shilimkar
synchronize_rcu() unnecessarily slows down the socket shutdown path. It
is used just to kfree() the ip addresses in rds_ib_remove_ipaddr(), which
is a perfect usecase for kfree_rcu().

So let's use that to gain some speedup.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h  | 1 +
 net/rds/ib_rdma.c | 6 ++
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index aae60fd..f1fd5ff 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -164,6 +164,7 @@ struct rds_ib_connection {
 struct rds_ib_ipaddr {
struct list_headlist;
__be32  ipaddr;
+   struct rcu_head rcu;
 };
 
 struct rds_ib_device {
diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index 251d1ce..872f523 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -159,10 +159,8 @@ static void rds_ib_remove_ipaddr(struct rds_ib_device 
*rds_ibdev, __be32 ipaddr)
}
spin_unlock_irq(_ibdev->spinlock);
 
-   if (to_free) {
-   synchronize_rcu();
-   kfree(to_free);
-   }
+   if (to_free)
+   kfree_rcu(to_free, rcu);
 }
 
 int rds_ib_update_ipaddr(struct rds_ib_device *rds_ibdev, __be32 ipaddr)
-- 
1.9.1



[PATCH 12/15] RDS: use already available pool handle from ibmr

2015-09-19 Thread Santosh Shilimkar
rds_ib_mr already keeps the handle of the pool it is associated with.
Let's use that instead of the roundabout way of fetching it from
rds_ib_device.

No functional change.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib_rdma.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/net/rds/ib_rdma.c b/net/rds/ib_rdma.c
index b6644fa..52d889a 100644
--- a/net/rds/ib_rdma.c
+++ b/net/rds/ib_rdma.c
@@ -522,8 +522,7 @@ static void rds_ib_teardown_mr(struct rds_ib_mr *ibmr)
 
__rds_ib_teardown_mr(ibmr);
if (pinned) {
-   struct rds_ib_device *rds_ibdev = ibmr->device;
-   struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool;
+   struct rds_ib_mr_pool *pool = ibmr->pool;
 
atomic_sub(pinned, >free_pinned);
}
@@ -717,8 +716,8 @@ static void rds_ib_mr_pool_flush_worker(struct work_struct 
*work)
 void rds_ib_free_mr(void *trans_private, int invalidate)
 {
struct rds_ib_mr *ibmr = trans_private;
+   struct rds_ib_mr_pool *pool = ibmr->pool;
struct rds_ib_device *rds_ibdev = ibmr->device;
-   struct rds_ib_mr_pool *pool = rds_ibdev->mr_pool;
 
rdsdebug("RDS/IB: free_mr nents %u\n", ibmr->sg_len);
 
-- 
1.9.1



Re: [patch net-next RFC 0/6] switchdev: introduce tranction enfra and for pre-commit split

2015-09-19 Thread Vivien Didelot
Hi Jiri,

On Sep. Saturday 19 (38) 06:23 PM, Jiri Pirko wrote:
> Sat, Sep 19, 2015 at 03:35:51PM CEST, rami.ro...@intel.com wrote:
> >Hi,
> >
> >>introduce tranction enfra and for pre-commit split
> >
> >Typo:
> >Instead "tranction enfra" should be "transaction infrastructure".
> 
> Will fix. Thanks!

Just being picky, there are a couple more typos in:

2/6: s/separatelly/separately/
6/6: s/nore/more/ and s/separete/separate/

Thanks,
-v


Re: rfi: stmmac: creating an of mdio bus for attached dsa

2015-09-19 Thread Florian Fainelli
+Andrew,

On 09/18/15 00:26, Phil Reid wrote:
> G'day All,
> 
> Prior to submitting a patch I'd just like to get an idea on what the
> correct way is to create and register an mdio bus for use by the marvell
> dsa driver.
> On our system the cpu ethernet port is connected directly to a switch
> with a fixed link (1Gbit).
> So the driver needs to create and persist the mdio bus for the dsa
> driver using of_mdiobus_register.
> The trunk stmmac driver currently doesn't create the mdio bus if a fixed
> link is found.
> stmmac_probe_config_dt does the following check:
> if (plat->phy_node || plat->phy_bus_name)
> plat->mdio_bus_data = NULL;
> phy_node is set because a fixed-link is found above and setting
> mdio_bus_data to null skips mdio bus creation.
> removing the phy_node check gets things working.

It seems to me like you should have the stmmac driver always register
its MDIO bus driver; whether or not it ends up being used depends on
the information provided via OF/platform_data.

Even in topologies where a stmmac block is unused, you would expect
power/clock gating to be applied to the Ethernet MAC, but still have
the ability to use the MDIO bus of this stmmac instance if it connects
to a particular device. In your case, I would assume that the stmmac
Ethernet MAC and MDIO will be used, so this makes even more sense.

> 
> I've also modified stmmac_mdio_register to use of_mdiobus_register and
> setup the dt to probe for a phy (that doesn't really exist, the switch
> is on the mdio).
> This cause a fair bit of log spamming but does seem to work as the
> switch is detected.
> eg: eth0: PHY ID  at 27 IRQ POLL (stmmac-0:1b)
> Currently if no phy is found the mdio bus gets de registered so this
> seems to be required.
> 
> In summary what is the correct way to make sure the mdio bus stays
> registered when a fixed-link is in use?
> Am I configuring the link to the switch incorrectly?

This seems reasonable to me. One of the problems with DSA right now is
that you need to provide a fixed-link emulated PHY for your Ethernet MAC
controller to keep "working" (that is: get link status/parameters etc.),
while the switch sits on the MDIO bus and does not (yet) look like a
PHY device.

The "problem" here is that the STMMAC driver treats fixed link as there
is no MDIO device connected here, so why bother with registering a MDIO
bus driver in the first place, which is a valid design shortcut, but not
in the case of DSA right now.

In the future, I hope we can have DSA switch devices look like almost
regular PHY devices, such that we do not have to use the fixed PHY to
keep the Ethernet MAC controller happy with link parameters to/from
the switch; see [1].

[1]: https://lwn.net/Articles/643149/
-- 
Florian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 15/15] RDS: split mr pool to improve 8K messages performance

2015-09-19 Thread Santosh Shilimkar
8K message sizes are a pretty important use case for current RDS
workloads, so we make provision to have 8K MRs available from the pool.
Based on the number of SGs in the RDS message, we pick a pool to use.

Also, to make sure that we don't under-utilise MRs when, say, 8K
messages are dominating, which could lead to the 8K pool being
exhausted, we fall back to the 1M pool until the 8K pool recovers
for use.

This helps to push at least ~55 kB/s of bidirectional data, which
is a nice improvement.
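As a sketch of the selection logic described above (the names and the
SG threshold here are illustrative, not the patch's actual ones),
picking a pool from the SG count with a fall-back when the 8K pool is
exhausted could look like:

```c
#include <assert.h>

/* Illustrative threshold: messages whose SG count fits an 8K mapping
 * go to the small pool; larger ones go to the 1M pool. */
#define SKETCH_MAX_8K_SGS 2

enum sketch_pool { SKETCH_POOL_8K, SKETCH_POOL_1M };

/* Pick a pool from the SG count; fall back to the 1M pool when the
 * 8K pool has no free MRs, as the commit message describes. */
static enum sketch_pool sketch_pick_pool(unsigned int nents,
					 unsigned int free_8k_mrs)
{
	if (nents <= SKETCH_MAX_8K_SGS && free_8k_mrs > 0)
		return SKETCH_POOL_8K;
	return SKETCH_POOL_1M;
}
```

The fall-back keeps small messages flowing under 8K-pool pressure at
the cost of burning a 1M mapping on them until the 8K pool recovers.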

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.c   |  47 +
 net/rds/ib.h   |  43 ---
 net/rds/ib_rdma.c  | 101 +
 net/rds/ib_stats.c |  18 ++
 4 files changed, 147 insertions(+), 62 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 883813a..a833ab7 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -43,14 +43,14 @@
 #include "rds.h"
 #include "ib.h"
 
-static unsigned int fmr_pool_size = RDS_FMR_POOL_SIZE;
-unsigned int fmr_message_size = RDS_FMR_SIZE + 1; /* +1 allows for unaligned 
MRs */
+unsigned int rds_ib_fmr_1m_pool_size = RDS_FMR_1M_POOL_SIZE;
+unsigned int rds_ib_fmr_8k_pool_size = RDS_FMR_8K_POOL_SIZE;
 unsigned int rds_ib_retry_count = RDS_IB_DEFAULT_RETRY_COUNT;
 
-module_param(fmr_pool_size, int, 0444);
-MODULE_PARM_DESC(fmr_pool_size, " Max number of fmr per HCA");
-module_param(fmr_message_size, int, 0444);
-MODULE_PARM_DESC(fmr_message_size, " Max size of a RDMA transfer");
+module_param(rds_ib_fmr_1m_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_fmr_1m_pool_size, " Max number of 1M fmr per HCA");
+module_param(rds_ib_fmr_8k_pool_size, int, 0444);
+MODULE_PARM_DESC(rds_ib_fmr_8k_pool_size, " Max number of 8K fmr per HCA");
 module_param(rds_ib_retry_count, int, 0444);
 MODULE_PARM_DESC(rds_ib_retry_count, " Number of hw retries before reporting 
an error");
 
@@ -97,8 +97,10 @@ static void rds_ib_dev_free(struct work_struct *work)
struct rds_ib_device *rds_ibdev = container_of(work,
struct rds_ib_device, free_work);
 
-   if (rds_ibdev->mr_pool)
-   rds_ib_destroy_mr_pool(rds_ibdev->mr_pool);
+   if (rds_ibdev->mr_8k_pool)
+   rds_ib_destroy_mr_pool(rds_ibdev->mr_8k_pool);
+   if (rds_ibdev->mr_1m_pool)
+   rds_ib_destroy_mr_pool(rds_ibdev->mr_1m_pool);
if (rds_ibdev->pd)
ib_dealloc_pd(rds_ibdev->pd);
 
@@ -148,9 +150,13 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(dev_attr->max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = dev_attr->max_map_per_fmr?: 32;
-   rds_ibdev->max_fmrs = dev_attr->max_mr ?
-   min_t(unsigned int, dev_attr->max_mr, fmr_pool_size) :
-   fmr_pool_size;
+   rds_ibdev->max_1m_fmrs = dev_attr->max_mr ?
+   min_t(unsigned int, (dev_attr->max_mr / 2),
+ rds_ib_fmr_1m_pool_size) : rds_ib_fmr_1m_pool_size;
+
+   rds_ibdev->max_8k_fmrs = dev_attr->max_mr ?
+   min_t(unsigned int, ((dev_attr->max_mr / 2) * RDS_MR_8K_SCALE),
+ rds_ib_fmr_8k_pool_size) : rds_ib_fmr_8k_pool_size;
 
rds_ibdev->max_initiator_depth = dev_attr->max_qp_init_rd_atom;
rds_ibdev->max_responder_resources = dev_attr->max_qp_rd_atom;
@@ -162,12 +168,25 @@ static void rds_ib_add_one(struct ib_device *device)
goto put_dev;
}
 
-   rds_ibdev->mr_pool = rds_ib_create_mr_pool(rds_ibdev);
-   if (IS_ERR(rds_ibdev->mr_pool)) {
-   rds_ibdev->mr_pool = NULL;
+   rds_ibdev->mr_1m_pool =
+   rds_ib_create_mr_pool(rds_ibdev, RDS_IB_MR_1M_POOL);
+   if (IS_ERR(rds_ibdev->mr_1m_pool)) {
+   rds_ibdev->mr_1m_pool = NULL;
goto put_dev;
}
 
+   rds_ibdev->mr_8k_pool =
+   rds_ib_create_mr_pool(rds_ibdev, RDS_IB_MR_8K_POOL);
+   if (IS_ERR(rds_ibdev->mr_8k_pool)) {
+   rds_ibdev->mr_8k_pool = NULL;
+   goto put_dev;
+   }
+
+   rdsdebug("RDS/IB: max_mr = %d, max_wrs = %d, max_sge = %d, 
fmr_max_remaps = %d, max_1m_fmrs = %d, max_8k_fmrs = %d\n",
+dev_attr->max_fmr, rds_ibdev->max_wrs, rds_ibdev->max_sge,
+rds_ibdev->fmr_max_remaps, rds_ibdev->max_1m_fmrs,
+rds_ibdev->max_8k_fmrs);
+
	INIT_LIST_HEAD(&rds_ibdev->ipaddr_list);
	INIT_LIST_HEAD(&rds_ibdev->conn_list);
 
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 3a8cd31..f17d095 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -9,8 +9,11 @@
 #include "rds.h"
 #include "rdma_transport.h"
 
-#define RDS_FMR_SIZE   256
-#define RDS_FMR_POOL_SIZE  8192
+#define RDS_FMR_1M_POOL_SIZE   (8192 / 2)
+#define RDS_FMR_1M_MSG_SIZE256
+#define RDS_FMR_8K_MSG_SIZE   

[PATCH 05/15] RDS: increase size of hash-table to 8K

2015-09-19 Thread Santosh Shilimkar
Even with the per-bucket locking scheme, in a massively parallel
system with active RDS sockets, which could be in excess of multiples
of 10K, the rds_bind_lookup() workload is significant because of the
smaller hash-table size.

With some tests, it was found that we get a modest but still nice
reduction in rds_bind_lookup() with a bigger bucket count.

Hashtable  Baseline(1k)  Delta
2048:      8.28%         -2.45%
4096:      8.28%         -4.60%
8192:      8.28%         -6.46%
16384:     8.28%         -6.75%

Based on this data, we set 8K as the bind hash-table size.
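For reference, the bucket selection behind this table masks the hash
with BIND_HASH_SIZE - 1, which only works because the size is a power
of two. A userspace sketch (with a toy mixing function standing in for
the kernel's jhash_2words(); names here are illustrative):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BIND_HASH_SIZE 8192 /* must remain a power of two */

/* Toy mixing function standing in for jhash_2words(); only the
 * masking trick below is the point of this sketch. */
static uint32_t sketch_hash(uint32_t addr, uint32_t port)
{
	return (addr * 2654435761u) ^ (port * 40503u);
}

static unsigned int sketch_bucket_index(uint32_t addr, uint16_t port)
{
	/* Same shape as hash_to_bucket(): a mask instead of a modulo,
	 * valid only for power-of-two table sizes. */
	return sketch_hash(addr, port) & (SKETCH_BIND_HASH_SIZE - 1);
}
```

Growing the table is therefore just a matter of bumping the constant
to the next power of two, which is exactly what the one-line patch does.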

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/bind.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/rds/bind.c b/net/rds/bind.c
index bc6b93e..fb2d545 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -43,7 +43,7 @@ struct bind_bucket {
struct hlist_head   head;
 };
 
-#define BIND_HASH_SIZE 1024
+#define BIND_HASH_SIZE 8192
 static struct bind_bucket bind_hash_table[BIND_HASH_SIZE];
 
 static struct bind_bucket *hash_to_bucket(__be32 addr, __be16 port)
-- 
1.9.1



[PATCH 06/15] RDS: defer the over_batch work to send worker

2015-09-19 Thread Santosh Shilimkar
The current process gives up if its send work is over the batch limit.
The work queue will get kicked to finish off any other requests.
This fixes the remainder condition from commit 443be0e5affe ("RDS: make
sure not to loop forever inside rds_send_xmit").

The restart condition is only for the case where we reached the
over_batch code for some other reason, so we just retry again
before giving up.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/send.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index 4df61a5..f1e709c 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -423,7 +423,9 @@ over_batch:
	     !list_empty(&conn->c_send_queue)) &&
send_gen == conn->c_send_gen) {
rds_stats_inc(s_send_lock_queue_raced);
-   goto restart;
+   if (batch_count < 1024)
+   goto restart;
+   queue_delayed_work(rds_wq, &conn->c_send_w, 1);
}
}
 out:
-- 
1.9.1



[PATCH 04/15] RDS: Use per-bucket rw lock for bind hash-table

2015-09-19 Thread Santosh Shilimkar
One global lock protecting hash-tables with 1024 buckets isn't
efficient, and it shows up in massive systems with truckloads of
RDS sockets serving multiple databases. The perf data clearly
highlights the contention on the rwlock in these massive workloads.

When the contention gets worse, the code gets into a state where
it decides to back off on the lock. So while it has disabled interrupts,
it sits and backs off on this lock get. This causes the system to
become sluggish, and eventually all sorts of bad things happen.

The simple fix is to move the lock into the hash bucket and
use per-bucket lock to improve the scalability.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/af_rds.c |  2 ++
 net/rds/bind.c   | 47 ---
 net/rds/rds.h|  1 +
 3 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index dc08766..384ea1e 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -582,6 +582,8 @@ static int rds_init(void)
 {
int ret;
 
+   rds_bind_lock_init();
+
ret = rds_conn_init();
if (ret)
goto out;
diff --git a/net/rds/bind.c b/net/rds/bind.c
index 166c605..bc6b93e 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -38,22 +38,27 @@
 #include 
 #include "rds.h"
 
+struct bind_bucket {
+   rwlock_tlock;
+   struct hlist_head   head;
+};
+
 #define BIND_HASH_SIZE 1024
-static struct hlist_head bind_hash_table[BIND_HASH_SIZE];
-static DEFINE_RWLOCK(rds_bind_lock);
+static struct bind_bucket bind_hash_table[BIND_HASH_SIZE];
 
-static struct hlist_head *hash_to_bucket(__be32 addr, __be16 port)
+static struct bind_bucket *hash_to_bucket(__be32 addr, __be16 port)
 {
return bind_hash_table + (jhash_2words((u32)addr, (u32)port, 0) &
  (BIND_HASH_SIZE - 1));
 }
 
 /* must hold either read or write lock (write lock for insert != NULL) */
-static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
+static struct rds_sock *rds_bind_lookup(struct bind_bucket *bucket,
+   __be32 addr, __be16 port,
struct rds_sock *insert)
 {
struct rds_sock *rs;
-   struct hlist_head *head = hash_to_bucket(addr, port);
+   struct hlist_head *head = &bucket->head;
u64 cmp;
u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
 
@@ -91,10 +96,11 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 {
struct rds_sock *rs;
unsigned long flags;
+   struct bind_bucket *bucket = hash_to_bucket(addr, port);
 
-   read_lock_irqsave(&rds_bind_lock, flags);
-   rs = rds_bind_lookup(addr, port, NULL);
-   read_unlock_irqrestore(&rds_bind_lock, flags);
+   read_lock_irqsave(&bucket->lock, flags);
+   rs = rds_bind_lookup(bucket, addr, port, NULL);
+   read_unlock_irqrestore(&bucket->lock, flags);
 
if (rs && sock_flag(rds_rs_to_sk(rs), SOCK_DEAD)) {
rds_sock_put(rs);
@@ -113,6 +119,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, 
__be16 *port)
unsigned long flags;
int ret = -EADDRINUSE;
u16 rover, last;
+   struct bind_bucket *bucket;
 
if (*port != 0) {
rover = be16_to_cpu(*port);
@@ -122,13 +129,15 @@ static int rds_add_bound(struct rds_sock *rs, __be32 
addr, __be16 *port)
last = rover - 1;
}
 
-   write_lock_irqsave(&rds_bind_lock, flags);
-
do {
struct rds_sock *rrs;
if (rover == 0)
rover++;
-   rrs = rds_bind_lookup(addr, cpu_to_be16(rover), rs);
+
+   bucket = hash_to_bucket(addr, cpu_to_be16(rover));
+   write_lock_irqsave(&bucket->lock, flags);
+   rrs = rds_bind_lookup(bucket, addr, cpu_to_be16(rover), rs);
+   write_unlock_irqrestore(&bucket->lock, flags);
if (!rrs) {
*port = rs->rs_bound_port;
ret = 0;
@@ -140,16 +149,16 @@ static int rds_add_bound(struct rds_sock *rs, __be32 
addr, __be16 *port)
}
} while (rover++ != last);
 
-   write_unlock_irqrestore(&rds_bind_lock, flags);
-
return ret;
 }
 
 void rds_remove_bound(struct rds_sock *rs)
 {
unsigned long flags;
+   struct bind_bucket *bucket =
+   hash_to_bucket(rs->rs_bound_addr, rs->rs_bound_port);
 
-   write_lock_irqsave(&rds_bind_lock, flags);
+   write_lock_irqsave(&bucket->lock, flags);
 
if (rs->rs_bound_addr) {
rdsdebug("rs %p unbinding from %pI4:%d\n",
@@ -161,7 +170,7 @@ void rds_remove_bound(struct rds_sock *rs)
rs->rs_bound_addr = 0;
}
 
-   write_unlock_irqrestore(&rds_bind_lock, flags);
+   write_unlock_irqrestore(&bucket->lock, flags);
 }
 
 int rds_bind(struct 

[PATCH 02/15] RDS: make socket bind/release locking scheme simple and more efficient

2015-09-19 Thread Santosh Shilimkar
The RDS bind and release locking scheme is very inefficient. It
uses RCU for maintaining the bind hash-table, which is great, but
it also needs to hold a spinlock for [add/remove]_bound(). So for
the overall use case, the hash-table's concurrent speedup doesn't
pay off. In fact, the blocking nature of synchronize_rcu() makes
the RDS socket shutdown too slow, which hurts RDS performance,
since connection shutdown and re-connect happen quite often to
maintain the RC part of the protocol.

So we make the locking scheme simpler and more efficient by
replacing spinlocks with reader/writer locks and getting rid
of RCU for the bind hash-table.

In a subsequent patch, we also convert the global lock to a
per-bucket lock to reduce the global lock contention.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/af_rds.c |  6 --
 net/rds/bind.c   | 35 +++
 2 files changed, 15 insertions(+), 26 deletions(-)

diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index a2f28a6..dc08766 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -72,13 +72,7 @@ static int rds_release(struct socket *sock)
rds_clear_recv_queue(rs);
rds_cong_remove_socket(rs);
 
-   /*
-* the binding lookup hash uses rcu, we need to
-* make sure we synchronize_rcu before we free our
-* entry
-*/
rds_remove_bound(rs);
-   synchronize_rcu();
 
rds_send_drop_to(rs, NULL);
rds_rdma_drop_keys(rs);
diff --git a/net/rds/bind.c b/net/rds/bind.c
index dd666fb..01989e2 100644
--- a/net/rds/bind.c
+++ b/net/rds/bind.c
@@ -40,7 +40,7 @@
 
 #define BIND_HASH_SIZE 1024
 static struct hlist_head bind_hash_table[BIND_HASH_SIZE];
-static DEFINE_SPINLOCK(rds_bind_lock);
+static DEFINE_RWLOCK(rds_bind_lock);
 
 static struct hlist_head *hash_to_bucket(__be32 addr, __be16 port)
 {
@@ -48,6 +48,7 @@ static struct hlist_head *hash_to_bucket(__be32 addr, __be16 
port)
  (BIND_HASH_SIZE - 1));
 }
 
+/* must hold either read or write lock (write lock for insert != NULL) */
 static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 port,
struct rds_sock *insert)
 {
@@ -56,30 +57,24 @@ static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 
port,
u64 cmp;
u64 needle = ((u64)be32_to_cpu(addr) << 32) | be16_to_cpu(port);
 
-   rcu_read_lock();
-   hlist_for_each_entry_rcu(rs, head, rs_bound_node) {
+   hlist_for_each_entry(rs, head, rs_bound_node) {
cmp = ((u64)be32_to_cpu(rs->rs_bound_addr) << 32) |
  be16_to_cpu(rs->rs_bound_port);
 
-   if (cmp == needle) {
-   rcu_read_unlock();
+   if (cmp == needle)
return rs;
-   }
}
-   rcu_read_unlock();
 
if (insert) {
/*
 * make sure our addr and port are set before
-* we are added to the list, other people
-* in rcu will find us as soon as the
-* hlist_add_head_rcu is done
+* we are added to the list.
 */
insert->rs_bound_addr = addr;
insert->rs_bound_port = port;
rds_sock_addref(insert);
 
-   hlist_add_head_rcu(&insert->rs_bound_node, head);
+   hlist_add_head(&insert->rs_bound_node, head);
}
return NULL;
 }
@@ -93,8 +88,11 @@ static struct rds_sock *rds_bind_lookup(__be32 addr, __be16 
port,
 struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 {
struct rds_sock *rs;
+   unsigned long flags;
 
+   read_lock_irqsave(&rds_bind_lock, flags);
rs = rds_bind_lookup(addr, port, NULL);
+   read_unlock_irqrestore(&rds_bind_lock, flags);
 
if (rs && !sock_flag(rds_rs_to_sk(rs), SOCK_DEAD))
rds_sock_addref(rs);
@@ -103,6 +101,7 @@ struct rds_sock *rds_find_bound(__be32 addr, __be16 port)
 
rdsdebug("returning rs %p for %pI4:%u\n", rs, ,
ntohs(port));
+
return rs;
 }
 
@@ -121,7 +120,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, 
__be16 *port)
last = rover - 1;
}
 
-   spin_lock_irqsave(&rds_bind_lock, flags);
+   write_lock_irqsave(&rds_bind_lock, flags);
 
do {
if (rover == 0)
@@ -135,7 +134,7 @@ static int rds_add_bound(struct rds_sock *rs, __be32 addr, 
__be16 *port)
}
} while (rover++ != last);
 
-   spin_unlock_irqrestore(&rds_bind_lock, flags);
+   write_unlock_irqrestore(&rds_bind_lock, flags);
 
return ret;
 }
@@ -144,19 +143,19 @@ void rds_remove_bound(struct rds_sock *rs)
 {
unsigned long flags;
 
-   spin_lock_irqsave(&rds_bind_lock, flags);
+   write_lock_irqsave(&rds_bind_lock, flags);
 
if (rs->rs_bound_addr) {
rdsdebug("rs %p unbinding 

[PATCH 14/15] RDS: use max_mr from HCA caps than max_fmr

2015-09-19 Thread Santosh Shilimkar
From: Santosh Shilimkar 

All HCA drivers seem to populate the max_mr capability, and a few of
them populate both max_mr and max_fmr.

Hence, update the RDS code to make use of max_mr.

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index 2d3f2ab..883813a 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -148,8 +148,8 @@ static void rds_ib_add_one(struct ib_device *device)
rds_ibdev->max_sge = min(dev_attr->max_sge, RDS_IB_MAX_SGE);
 
rds_ibdev->fmr_max_remaps = dev_attr->max_map_per_fmr?: 32;
-   rds_ibdev->max_fmrs = dev_attr->max_fmr ?
-   min_t(unsigned int, dev_attr->max_fmr, fmr_pool_size) :
+   rds_ibdev->max_fmrs = dev_attr->max_mr ?
+   min_t(unsigned int, dev_attr->max_mr, fmr_pool_size) :
fmr_pool_size;
 
rds_ibdev->max_initiator_depth = dev_attr->max_qp_init_rd_atom;
-- 
1.9.1



[PATCH 09/15] RDS: split send completion handling and do batch ack

2015-09-19 Thread Santosh Shilimkar
Similar to what we did with receive CQ completion handling, we split
the transmit completion handler so that it lets us implement batched
work completion handling.

We re-use the cq_poll routine and make use of RDS_IB_SEND_OP to
identify the send vs. receive completion event handler invocation.
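A minimal sketch of the wr_id tagging this relies on (the bit value
matches RDS_IB_SEND_OP = BIT_ULL(63) in the patch; the helper names
are illustrative): reserving the top bit of the 64-bit wr_id marks
send completions while leaving the remaining bits for the ring index:

```c
#include <assert.h>
#include <stdint.h>

/* Top bit of the 64-bit wr_id marks a send completion, matching
 * RDS_IB_SEND_OP = BIT_ULL(63) in the patch. */
#define SKETCH_SEND_OP (1ULL << 63)

static uint64_t sketch_tag_send(uint64_t ring_index)
{
	return ring_index | SKETCH_SEND_OP;
}

static int sketch_is_send(uint64_t wr_id)
{
	return (wr_id & SKETCH_SEND_OP) != 0;
}

static uint64_t sketch_ring_index(uint64_t wr_id)
{
	return wr_id & ~SKETCH_SEND_OP;
}
```

This lets one shared poll_cq() path dispatch to the send or receive
completion handler without a separate completion queue context.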

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/ib.h   |   6 ++-
 net/rds/ib_cm.c|  45 --
 net/rds/ib_send.c  | 110 +
 net/rds/ib_stats.c |   1 -
 net/rds/send.c |   1 +
 5 files changed, 98 insertions(+), 65 deletions(-)

diff --git a/net/rds/ib.h b/net/rds/ib.h
index 727759b..3a8cd31 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -25,6 +25,7 @@
 #define RDS_IB_RECYCLE_BATCH_COUNT 32
 
 #define RDS_IB_WC_MAX  32
+#define RDS_IB_SEND_OP BIT_ULL(63)
 
 extern struct rw_semaphore rds_ib_devices_lock;
 extern struct list_head rds_ib_devices;
@@ -118,9 +119,11 @@ struct rds_ib_connection {
struct ib_pd*i_pd;
struct ib_cq*i_send_cq;
struct ib_cq*i_recv_cq;
+   struct ib_wci_send_wc[RDS_IB_WC_MAX];
struct ib_wci_recv_wc[RDS_IB_WC_MAX];
 
/* interrupt handling */
+   struct tasklet_struct   i_send_tasklet;
struct tasklet_struct   i_recv_tasklet;
 
/* tx */
@@ -217,7 +220,6 @@ struct rds_ib_device {
 struct rds_ib_statistics {
uint64_ts_ib_connect_raced;
uint64_ts_ib_listen_closed_stale;
-   uint64_ts_ib_tx_cq_call;
uint64_ts_ib_evt_handler_call;
uint64_ts_ib_tasklet_call;
uint64_ts_ib_tx_cq_event;
@@ -371,7 +373,7 @@ extern wait_queue_head_t rds_ib_ring_empty_wait;
 void rds_ib_xmit_complete(struct rds_connection *conn);
 int rds_ib_xmit(struct rds_connection *conn, struct rds_message *rm,
unsigned int hdr_off, unsigned int sg, unsigned int off);
-void rds_ib_send_cq_comp_handler(struct ib_cq *cq, void *context);
+void rds_ib_send_cqe_handler(struct rds_ib_connection *ic, struct ib_wc *wc);
 void rds_ib_send_init_ring(struct rds_ib_connection *ic);
 void rds_ib_send_clear_ring(struct rds_ib_connection *ic);
 int rds_ib_xmit_rdma(struct rds_connection *conn, struct rm_rdma_op *op);
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 28e0979..8f51d0d 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -250,11 +250,34 @@ static void poll_cq(struct rds_ib_connection *ic, struct 
ib_cq *cq,
rdsdebug("wc wr_id 0x%llx status %u byte_len %u 
imm_data %u\n",
 (unsigned long long)wc->wr_id, wc->status,
 wc->byte_len, be32_to_cpu(wc->ex.imm_data));
-   rds_ib_recv_cqe_handler(ic, wc, ack_state);
+
+   if (wc->wr_id & RDS_IB_SEND_OP)
+   rds_ib_send_cqe_handler(ic, wc);
+   else
+   rds_ib_recv_cqe_handler(ic, wc, ack_state);
}
}
 }
 
+static void rds_ib_tasklet_fn_send(unsigned long data)
+{
+   struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
+   struct rds_connection *conn = ic->conn;
+   struct rds_ib_ack_state state;
+
+   rds_ib_stats_inc(s_ib_tasklet_call);
+
+   memset(&state, 0, sizeof(state));
+   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+   ib_req_notify_cq(ic->i_send_cq, IB_CQ_NEXT_COMP);
+   poll_cq(ic, ic->i_send_cq, ic->i_send_wc, &state);
+
+   if (rds_conn_up(conn) &&
+   (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags) ||
+   test_bit(0, &conn->c_map_queued)))
+   rds_send_xmit(ic->conn);
+}
+
 static void rds_ib_tasklet_fn_recv(unsigned long data)
 {
struct rds_ib_connection *ic = (struct rds_ib_connection *)data;
@@ -304,6 +327,18 @@ static void rds_ib_qp_event_handler(struct ib_event 
*event, void *data)
}
 }
 
+static void rds_ib_cq_comp_handler_send(struct ib_cq *cq, void *context)
+{
+   struct rds_connection *conn = context;
+   struct rds_ib_connection *ic = conn->c_transport_data;
+
+   rdsdebug("conn %p cq %p\n", conn, cq);
+
+   rds_ib_stats_inc(s_ib_evt_handler_call);
+
+   tasklet_schedule(&ic->i_send_tasklet);
+}
+
 /*
  * This needs to be very careful to not leave IS_ERR pointers around for
  * cleanup to trip over.
@@ -337,7 +372,8 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
ic->i_pd = rds_ibdev->pd;
 
cq_attr.cqe = ic->i_send_ring.w_nr + 1;
-   ic->i_send_cq = ib_create_cq(dev, rds_ib_send_cq_comp_handler,
+
+   ic->i_send_cq = ib_create_cq(dev, rds_ib_cq_comp_handler_send,
 rds_ib_cq_event_handler, conn,
	 &cq_attr);
if (IS_ERR(ic->i_send_cq)) 

[PATCH 07/15] RDS: use rds_send_xmit() state instead of RDS_LL_SEND_FULL

2015-09-19 Thread Santosh Shilimkar
In the transport-independent rds_sendmsg(), we shouldn't make decisions
based on RDS_LL_SEND_FULL, which is used to manage the ring for
RDMA-based transports. We can safely issue rds_send_xmit() and then
use its return value to decide on deferred work. This will also fix
the scenario where at times we see connections stuck with the
LL_SEND_FULL bit set and never cleared.

We kick krdsd any time we see -ENOMEM or -EAGAIN from the
ring allocation code.
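The decision described above reduces to a small predicate on the xmit
return value (a sketch with an illustrative name; the real code calls
queue_delayed_work() when it holds):

```c
#include <assert.h>
#include <errno.h>

/* Defer to the send worker only when the ring allocation path
 * reported memory pressure or asked the caller to retry. */
static int sketch_needs_deferred_work(int xmit_ret)
{
	return xmit_ret == -ENOMEM || xmit_ret == -EAGAIN;
}
```

Any other return (including success) needs no kick, which is what lets
the patch drop the RDS_LL_SEND_FULL test from the generic path.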

Signed-off-by: Santosh Shilimkar 
Signed-off-by: Santosh Shilimkar 
---
 net/rds/send.c| 10 ++
 net/rds/threads.c |  2 ++
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/net/rds/send.c b/net/rds/send.c
index f1e709c..9d8b52d 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1122,8 +1122,9 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, 
size_t payload_len)
 */
rds_stats_inc(s_send_queued);
 
-   if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
-   rds_send_xmit(conn);
+   ret = rds_send_xmit(conn);
+   if (ret == -ENOMEM || ret == -EAGAIN)
+   queue_delayed_work(rds_wq, &conn->c_send_w, 1);
 
rds_message_put(rm);
return payload_len;
@@ -1179,8 +1180,9 @@ rds_send_pong(struct rds_connection *conn, __be16 dport)
rds_stats_inc(s_send_queued);
rds_stats_inc(s_send_pong);
 
-   if (!test_bit(RDS_LL_SEND_FULL, &conn->c_flags))
-   queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+   ret = rds_send_xmit(conn);
+   if (ret == -ENOMEM || ret == -EAGAIN)
+   queue_delayed_work(rds_wq, &conn->c_send_w, 1);
 
rds_message_put(rm);
return 0;
diff --git a/net/rds/threads.c b/net/rds/threads.c
index dc2402e..454aa6d 100644
--- a/net/rds/threads.c
+++ b/net/rds/threads.c
@@ -162,7 +162,9 @@ void rds_send_worker(struct work_struct *work)
int ret;
 
if (rds_conn_state(conn) == RDS_CONN_UP) {
+   clear_bit(RDS_LL_SEND_FULL, &conn->c_flags);
ret = rds_send_xmit(conn);
+   cond_resched();
rdsdebug("conn %p ret %d\n", conn, ret);
switch (ret) {
case -EAGAIN:
-- 
1.9.1



sr-iov and bridges (mlx4)

2015-09-19 Thread Matthew Monaco
Hello. I have a Mellanox ConnectX-3 Pro EN (MCX314A-BCCT). I'm only using a
single port so it must provide IP for my host as well as connectivity for VMs.
SR-IOV VFs are working great, my KVM VMs have Ethernet and RDMA.

However, I also want to support virtio VMs. Assuming eth0 is the first port on
my mlx nic, I've tried placing VMs on a bridge with the primary physical
interface, and giving an IP for management to a VF:

br0
|--- eth0
|--- VM
|--- VM
vf0 (IP)
vf1 -> VM
vf2 -> VM
vf3 -> VM

I've tried placing VMs on a bridge with one of the VFs and using the primary
iface for IP.

eth0 (IP)
br0
|--- vf0
|--- VM
|--- VM
vf1 -> VM
vf2 -> VM
vf3 -> VM

And I've also tried using a veth pair to really spread things out:

br0 (IP)
|--- eth0
|--- veth-a
br1   |
|--- veth-b
|--- VM
|--- VM
vf1 -> VM
vf2 -> VM
vf3 -> VM

In all cases, VMs with SR-IOV work fine, IP on the host works fine, and
outbound DHCP from the virtio VMs works fine, but inbound frames are not
making it back to the VM.

Is there a known limitation of mixing SR-IOV and bridges in general? Does
the SR-IOV switch specific to the mlx4 hardware not work well with Linux
bridges? ...?

Thanks!
Matt





197b:0250 JMicron JMC250 Gigabit ethernet doesn't work

2015-09-19 Thread Микола Дрючатий
[1.] One line summary of the problem: 197b:0250 JMicron JMC250 Gigabit
ethernet doesn't work

[2.] Full description of the problem/report:
The laptop ASUS X52JU can't connect to the router ASUS RT-AC68U via
ethernet. NetworkManager shows that the cable is unplugged. The router
has Gigabit ethernet ports and the laptop doesn't see them. I found
that it's a bug in the jme kernel module.
[4.] Kernel version (from /proc/version):
Linux version 4.3.0-040300rc1-generic (kernel@gomeisa) (gcc version
4.8.4 (Ubuntu 4.8.4-2ubuntu1~14.04) ) #201509160642 SMP Wed Sep 16
10:44:16 UTC 2015

[7.] Environment
Description: Ubuntu 14.04.3 LTS
Release: 14.04

[7.1.] Software
Linux nick-notebook 4.3.0-040300rc1-generic #201509160642 SMP Wed Sep
16 10:44:16 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

Gnu C  4.8
Gnu make   3.81
binutils   2.24
util-linux 2.20.1
mount  support
module-init-tools  15
e2fsprogs  1.42.9
pcmciautils018
PPP2.4.5
Linux C Library2.19
Dynamic linker (ldd)   2.19
Procps 3.3.9
Net-tools  1.60
Kbd1.15.5
Sh-utils   8.21
wireless-tools 30
Modules Loaded nls_iso8859_1 nls_utf8 isofs uas usb_storage
drbg ansi_cprng ctr ccm rfcomm bnep arc4 ath9k intel_powerclamp
coretemp ath9k_common ath9k_hw uvcvideo amdkfd videobuf2_vmalloc
videobuf2_memops amd_iommu_v2 ath radeon videobuf2_core mac80211
v4l2_common kvm_intel videodev hid_logitech_hidpp media ttm
drm_kms_helper snd_hda_codec_conexant snd_hda_codec_generic
snd_hda_codec_hdmi snd_hda_intel snd_hda_codec drm kvm cfg80211
snd_hda_core snd_seq_midi snd_seq_midi_event snd_rawmidi snd_hwdep
i2c_algo_bit btusb fb_sys_fops syscopyarea snd_pcm sysfillrect
sysimgblt snd_seq jmb38x_ms lpc_ich joydev btrtl snd_seq_device
memstick input_leds btbcm btintel serio_raw bluetooth snd_timer shpchp
mei_me snd soundcore mei asus_laptop sparse_keymap input_polldev
parport_pc video ppdev mac_hid lp parport hid_logitech_dj usbhid hid
psmouse ahci libahci jme mii sdhci_pci sdhci fjes

[7.2.] Processor information (from /proc/cpuinfo):
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 37
model name : Intel(R) Core(TM) i3 CPU   M 380  @ 2.53GHz
stepping : 5
microcode : 0x2
cpu MHz : 1066.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3
cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt lahf_lm arat dtherm
tpr_shadow vnmi flexpriority ept vpid
bugs :
bogomips : 5053.72
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 37
model name : Intel(R) Core(TM) i3 CPU   M 380  @ 2.53GHz
stepping : 5
microcode : 0x2
cpu MHz : 1066.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 2
apicid : 4
initial apicid : 4
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3
cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt lahf_lm arat dtherm
tpr_shadow vnmi flexpriority ept vpid
bugs :
bogomips : 5053.72
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 37
model name : Intel(R) Core(TM) i3 CPU   M 380  @ 2.53GHz
stepping : 5
microcode : 0x2
cpu MHz : 933.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx
rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology
nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3
cx16 xtpr pdcm pcid sse4_1 sse4_2 popcnt lahf_lm arat dtherm
tpr_shadow vnmi flexpriority ept vpid
bugs :
bogomips : 5053.72
clflush size : 64
cache_alignment : 64
address sizes : 36 bits physical, 48 bits virtual
power management:

processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 37
model name : Intel(R) Core(TM) i3 CPU   M 380  @ 2.53GHz
stepping : 5
microcode : 0x2
cpu MHz : 933.000
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 2
cpu cores : 2
apicid : 5
initial apicid : 5
fpu : yes

Re: [PATCH net-next 1/7] rocker: track when FDB entry is touched.

2015-09-19 Thread Jiri Pirko
Fri, Sep 18, 2015 at 09:55:45PM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>The entry is touched once when created, and touched again for each update.
>The touched time is used to calculate FDB entry age.
>
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 


Re: [PATCH net-next 3/7] rocker: adding port ageing_time for ageing out FDB entries

2015-09-19 Thread Jiri Pirko
Fri, Sep 18, 2015 at 09:55:47PM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>Follow-up patcheset will allow user to change ageing_time, but for now
>just hard-code it to a fixed value (the same value used as the default
>for the bridge driver).
>
>Signed-off-by: Scott Feldman 
>---
> drivers/net/ethernet/rocker/rocker.c |2 ++
> 1 file changed, 2 insertions(+)
>
>diff --git a/drivers/net/ethernet/rocker/rocker.c 
>b/drivers/net/ethernet/rocker/rocker.c
>index f55ed2c..eba22f5 100644
>--- a/drivers/net/ethernet/rocker/rocker.c
>+++ b/drivers/net/ethernet/rocker/rocker.c
>@@ -221,6 +221,7 @@ struct rocker_port {
>   __be16 internal_vlan_id;
>   int stp_state;
>   u32 brport_flags;
>+  unsigned long ageing_time;
>   bool ctrls[ROCKER_CTRL_MAX];
>   unsigned long vlan_bitmap[ROCKER_VLAN_BITMAP_LEN];
>   struct napi_struct napi_tx;
>@@ -4975,6 +4976,7 @@ static int rocker_probe_port(struct rocker *rocker, 
>unsigned int port_number)
>   rocker_port->port_number = port_number;
>   rocker_port->pport = port_number + 1;
>   rocker_port->brport_flags = BR_LEARNING | BR_LEARNING_SYNC;
>+  rocker_port->ageing_time = 300 * HZ;

How about to add also "BR_DEFAULT_AGEING_TIME" and use it here?
>


Re: [PATCH net-next 2/7] rocker: store rocker_port in fdb key rather than pport

2015-09-19 Thread Jiri Pirko
Fri, Sep 18, 2015 at 09:55:46PM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>We'll need more info from rocker_port than just pport when we age out fdb
>entries, so store rocker_port rather than pport in each fdb entry.
>
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 


Re: [PATCH net-next 4/7] bridge: define some min/max ageing time constants we'll use next

2015-09-19 Thread Jiri Pirko
Fri, Sep 18, 2015 at 09:55:48PM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>Signed-off-by: Scott Feldman 
>---
> include/linux/if_bridge.h |4 
> 1 file changed, 4 insertions(+)
>
>diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
>index dad8b00..6cc6dbc 100644
>--- a/include/linux/if_bridge.h
>+++ b/include/linux/if_bridge.h
>@@ -46,6 +46,10 @@ struct br_ip_list {
> #define BR_LEARNING_SYNC  BIT(9)
> #define BR_PROXYARP_WIFI  BIT(10)
> 
>+/* values as per ieee8021QBridgeFdbAgingTime */
>+#define BR_MIN_AGEING_TIME  (10 * HZ)
>+#define BR_MAX_AGEING_TIME  (100 * HZ)

I think that a bridge patch checking against these values should be
introduced along with these values, in the same patchset
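A check of the kind suggested here might look like the following userspace sketch. The function name and the HZ value are illustrative assumptions, not the bridge driver's actual code; the ieee8021QBridgeFdbAgingTime MIB object defines the valid range as 10 to 1,000,000 seconds.

```c
#include <assert.h>
#include <errno.h>

#define HZ 100				/* illustrative tick rate */
#define BR_MIN_AGEING_TIME (10UL * HZ)
#define BR_MAX_AGEING_TIME (1000000UL * HZ)

/* Reject ageing times outside the ieee8021QBridgeFdbAgingTime range
 * before the bridge (or an offloading driver) applies them. */
static int br_validate_ageing_time(unsigned long ageing_time)
{
	if (ageing_time < BR_MIN_AGEING_TIME ||
	    ageing_time > BR_MAX_AGEING_TIME)
		return -ERANGE;
	return 0;
}
```

With the default of 300 * HZ the check passes; anything below 10 seconds or above 1,000,000 seconds is rejected with -ERANGE.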



Re: [PATCH net-next 7/7] switchdev: update documentation on FDB ageing_time

2015-09-19 Thread Jiri Pirko
Fri, Sep 18, 2015 at 09:55:51PM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 


Re: [PATCH net-next 6/7] bridge: don't age externally added FDB entries

2015-09-19 Thread Jiri Pirko
Fri, Sep 18, 2015 at 09:55:50PM CEST, sfel...@gmail.com wrote:
>From: Siva Mannem 
>
>Signed-off-by: Siva Mannem 
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 


Re: [PATCH net-next 5/7] rocker: add FDB cleanup timer

2015-09-19 Thread Jiri Pirko
Fri, Sep 18, 2015 at 09:55:49PM CEST, sfel...@gmail.com wrote:
>From: Scott Feldman 
>
>Add a timer to each rocker switch to do FDB entry cleanup by ageing out
>expired entries.  The timer scheduling algo is copied from the bridge
>driver, for the most part, to keep the firing of the timer to a minimum.
>
>Signed-off-by: Scott Feldman 

Acked-by: Jiri Pirko 


Re: [PATCH] Revert "net/phy: Add Vitesse 8641 phy ID"

2015-09-19 Thread Kevin Hao
On Fri, Sep 18, 2015 at 09:36:42AM +, Shaohui Xie wrote:
> > -Original Message-
> > From: Kevin Hao [mailto:haoke...@gmail.com]
> > Sent: Friday, September 18, 2015 3:43 PM
> > To: netdev@vger.kernel.org
> > Cc: Florian Fainelli; Xie Shaohui-B21989
> > Subject: [PATCH] Revert "net/phy: Add Vitesse 8641 phy ID"
> > 
> > This reverts commit 1298267b548a78840bd4b3e030993ff8747ca5e6.
> > 
> > That commit claim that the Vitesse VSC8641 is compatible with Vitesse
> > 82xx. But this is not true. It seems that all the registers used in
> > Vitesse phy driver are not compatible between 8641 and 82xx.
> [S.H] There are differences between some register's bit define. 
> But some are not used by driver. 
> 
> > It does cause malfunction of the Ethernet on p1010rdb-pa board.
> [S.H] Which exact register's setting caused problem?
> If it needs different setting,
> It can be handled by distinguishing phy_id, or replacing relative API for 
> VSC8641.

In my case, the malfunction of the Ethernet is caused by writing the wrong
value for the skew timing. The Ethernet can work if I skip the setting of
skew in phy driver. But as I said in the commit log, all the registers used
in the current Vitesse phy driver are not compatible between 8641 and 82xx.
The following are the main differences between these registers:


Auxiliary Control & Status Register (0x1c):
        8641                           8244
 6: reserved                           6: ActiPHY Mode Enable
 1: Sticky Reset Enable                1-0: ActiPHYTM Sleep Timer

Extended PHY Control Set 1 (0x17):
        8641                           8244
 8: RGMII skew timing                  11-10: RGMII TX_CLK Skew Selection
                                       9-8: RGMII RX_CLK Skew Selection
 5: ActiPHY mode enable                5: RX Idle Clock Enable
 3: reserved                           3: Far End Loopback Mode Enable
 1: GMII transmit pin reversal         2-1: MAC/Media Interface Mode Select
 0: reserved                           0: EEPROM Status

MII_VSC82X4_EXT_PAGE_16E (0x10):
        8641                           8244
 Enhanced LED Method Select register   Reserved register

MII_VSC82X4_EXT_PAGE_17E (0x11):
        8641                           8244
 Enhanced LED Behavior register        CLK125 micro Clock Enable

MII_VSC82X4_EXT_PAGE_18E (0x12):
        8641                           8244
 CRC Good Counter register             Reserved register

As you can see, I don't think it is a better option to sprinkle the checking
phy id for each writing of these register. Of course we can add the specific
API for 8641 to fix this problem. But since the generic phy driver works well
with the 8641 phy, this commit does seem a regression for me.  So I prefer a
simple revert to this commit and merge the revert into the stable kernel tree
to fix the regression first. We can add the corresponding 8641 phy API in the
following patches.

Thanks,
Kevin
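If 8641 support is reintroduced later, one way to avoid sprinkling phy_id checks across every register write is to centralize the dispatch, e.g. when computing the skew bits for Extended PHY Control Set 1 (0x17). A rough userspace sketch; the ID values and the helper name are illustrative assumptions, not the vitesse driver's actual code:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative PHY IDs; real values come from the PHY identifier registers. */
#define PHY_ID_VSC8641 0x00070431u
#define PHY_ID_VSC8244 0x000fc6c0u

/* The skew bits live in different positions on the two families,
 * so compute them per-ID in one place. */
static uint16_t vsc_rgmii_skew_bits(uint32_t phy_id)
{
	if (phy_id == PHY_ID_VSC8641)
		return 1u << 8;			/* 8641: single RGMII skew-timing bit */
	return (3u << 10) | (3u << 8);		/* 82xx: TX_CLK and RX_CLK skew selects */
}
```

The real driver would more likely use separate phy_driver entries per ID, but the sketch shows how the diverging layouts could be kept in one helper.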


pgpansPo0zADy.pgp
Description: PGP signature


Re: [PATCH net-next] tcp: Fix CWV being too strict on thin streams

2015-09-19 Thread Neal Cardwell
On Fri, Sep 18, 2015 at 7:38 PM, Bendik Rønning Opstad wrote:
>
> Application limited streams such as thin streams, that transmit small
> amounts of payload in relatively few packets per RTT, are prevented from
> growing the CWND after experiencing loss. This leads to increased
> sojourn times for data segments in streams that often transmit
> time-dependent data.
>
> After the CWND is reduced due to loss, and an ACK has made room in the
> send window for more packets to be transmitted, the CWND will not grow
> unless there is more unsent data buffered in the output queue than the
> CWND permits to be sent. That is because tcp_cwnd_validate(), which
> updates tp->is_cwnd_limited, is only called in tcp_write_xmit() when at
> least one packet with new data has been sent. However, if all the
> buffered data in the output queue was sent within the current CWND,
> is_cwnd_limited will remain false even when there is no more room in the
> CWND. While the CWND is fully utilized, any new data put on the output
> queue will be held back (i.e. limited by the CWND), but
> tp->is_cwnd_limited will not be updated as no packets were transmitted.
>
> Fix by updating tp->is_cwnd_limited if no packets are sent due to the
> CWND being fully utilized.

Thanks for this report!

When you say "CWND is reduced due to loss", are you talking about RTO
or Fast Recovery? Do you have any traces you can share that illustrate
this issue?

Have you verified that this patch fixes the issue you identified? I
think the fix may be in the wrong place for that scenario, or at least
incomplete.

In the scenario you describe, you say "all the buffered data in the
output queue was sent within the current CWND", which means that there
is no unsent data, so in tcp_write_xmit() the call to tcp_send_head()
would return NULL, so we would not enter the while loop, and could not
set is_cwnd_limited to true.

In the scenario you describe, I think we'd need to check for being
cwnd-limited in tcp_xmit_retransmit_queue().

How about something like the following patch:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d0ad355..8e6a772 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2145,6 +2145,7 @@ repair:
tcp_cwnd_validate(sk, is_cwnd_limited);
return false;
}
+   tp->is_cwnd_limited |= is_cwnd_limited;
return !tp->packets_out && tcp_send_head(sk);
 }

@@ -2762,8 +2763,10 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
 * packet to be MSS sized and all the
 * packet counting works out.
 */
-   if (tcp_packets_in_flight(tp) >= tp->snd_cwnd)
+   if (tcp_packets_in_flight(tp) >= tp->snd_cwnd) {
+   tp->is_cwnd_limited = true;
return;
+   }

if (fwd_rexmitting) {
 begin_fwd:

neal
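The scenario under discussion can be reduced to a tiny model of the transmit loop (schematic only; the real tcp_write_xmit() is far more involved): with an empty send queue the loop body never runs, so is_cwnd_limited is never set even though the window is fully used.

```c
#include <assert.h>
#include <stdbool.h>

/* send_head: number of unsent segments queued; returns whether the loop
 * ever observed a full congestion window. */
static bool model_write_xmit(int send_head, int packets_out, int cwnd)
{
	bool is_cwnd_limited = false;

	while (send_head > 0) {			/* tcp_send_head() != NULL */
		if (packets_out >= cwnd) {	/* tcp_cwnd_test() fails */
			is_cwnd_limited = true;
			break;
		}
		packets_out++;			/* transmit one new segment */
		send_head--;
	}
	return is_cwnd_limited;
}
```

model_write_xmit(0, 10, 10) returns false although the window is full, while model_write_xmit(1, 10, 10) returns true: only a queued-but-unsendable segment makes the loop notice the full window.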


[patch net-next RFC 4/6] switchdev: move transaction phase enum under transaction structure

2015-09-19 Thread Jiri Pirko
Before it disappears completely, move transaction phase enum under
transaction structure and make attr/obj structures a bit cleaner.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker.c |  4 ++--
 include/net/switchdev.h  |  3 +--
 net/dsa/slave.c  | 18 ++
 net/switchdev/switchdev.c| 12 ++--
 4 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
index de1a367..9750840 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -4356,7 +4356,7 @@ static int rocker_port_attr_set(struct net_device *dev,
 {
struct rocker_port *rocker_port = netdev_priv(dev);
struct rocker_trans rtrans = {
-   .ph = attr->trans_ph,
+   .ph = trans->ph,
};
int err = 0;
 
@@ -,7 +,7 @@ static int rocker_port_obj_add(struct net_device *dev,
 {
struct rocker_port *rocker_port = netdev_priv(dev);
struct rocker_trans rtrans = {
-   .ph = obj->trans_ph,
+   .ph = trans->ph,
};
const struct switchdev_obj_ipv4_fib *fib4;
int err = 0;
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 1e394f1..368a642 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -32,6 +32,7 @@ struct switchdev_trans_item {
 
 struct switchdev_trans {
struct list_head item_list;
+   enum switchdev_trans_ph ph;
 };
 
 enum switchdev_attr_id {
@@ -43,7 +44,6 @@ enum switchdev_attr_id {
 
 struct switchdev_attr {
enum switchdev_attr_id id;
-   enum switchdev_trans_ph trans_ph;
u32 flags;
union {
struct netdev_phys_item_id ppid;/* PORT_PARENT_ID */
@@ -63,7 +63,6 @@ enum switchdev_obj_id {
 
 struct switchdev_obj {
enum switchdev_obj_id id;
-   enum switchdev_trans_ph trans_ph;
int (*cb)(struct net_device *dev, struct switchdev_obj *obj);
union {
struct switchdev_obj_vlan { /* PORT_VLAN */
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index ac76fd1..748cc63 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -242,7 +242,8 @@ static int dsa_bridge_check_vlan_range(struct dsa_switch *ds,
 }
 
 static int dsa_slave_port_vlan_add(struct net_device *dev,
-  struct switchdev_obj *obj)
+  struct switchdev_obj *obj,
+  struct switchdev_trans *trans)
 {
	struct switchdev_obj_vlan *vlan = &obj->u.vlan;
struct dsa_slave_priv *p = netdev_priv(dev);
@@ -250,7 +251,7 @@ static int dsa_slave_port_vlan_add(struct net_device *dev,
u16 vid;
int err;
 
-   switch (obj->trans_ph) {
+   switch (trans->ph) {
case SWITCHDEV_TRANS_PREPARE:
if (!ds->drv->port_vlan_add || !ds->drv->port_pvid_set)
return -EOPNOTSUPP;
@@ -347,16 +348,17 @@ static int dsa_slave_port_vlan_dump(struct net_device *dev,
 }
 
 static int dsa_slave_port_fdb_add(struct net_device *dev,
- struct switchdev_obj *obj)
+ struct switchdev_obj *obj,
+ struct switchdev_trans *trans)
 {
	struct switchdev_obj_fdb *fdb = &obj->u.fdb;
struct dsa_slave_priv *p = netdev_priv(dev);
struct dsa_switch *ds = p->parent;
int ret = -EOPNOTSUPP;
 
-   if (obj->trans_ph == SWITCHDEV_TRANS_PREPARE)
+   if (trans->ph == SWITCHDEV_TRANS_PREPARE)
ret = ds->drv->port_fdb_add ? 0 : -EOPNOTSUPP;
-   else if (obj->trans_ph == SWITCHDEV_TRANS_COMMIT)
+   else if (trans->ph == SWITCHDEV_TRANS_COMMIT)
ret = ds->drv->port_fdb_add(ds, p->port, fdb->addr, fdb->vid);
 
return ret;
@@ -463,7 +465,7 @@ static int dsa_slave_port_attr_set(struct net_device *dev,
 
switch (attr->id) {
case SWITCHDEV_ATTR_PORT_STP_STATE:
-   if (attr->trans_ph == SWITCHDEV_TRANS_COMMIT)
+   if (trans->ph == SWITCHDEV_TRANS_COMMIT)
ret = dsa_slave_stp_update(dev, attr->u.stp_state);
break;
default:
@@ -487,10 +489,10 @@ static int dsa_slave_port_obj_add(struct net_device *dev,
 
switch (obj->id) {
case SWITCHDEV_OBJ_PORT_FDB:
-   err = dsa_slave_port_fdb_add(dev, obj);
+   err = dsa_slave_port_fdb_add(dev, obj, trans);
break;
case SWITCHDEV_OBJ_PORT_VLAN:
-   err = dsa_slave_port_vlan_add(dev, obj);
+   err = dsa_slave_port_vlan_add(dev, obj, trans);
break;
default:
err = -EOPNOTSUPP;
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index a3647bf..82f8bcd 100644
--- a/net/switchdev/switchdev.c

[patch net-next RFC 1/6] switchdev: rename "trans" to "trans_ph".

2015-09-19 Thread Jiri Pirko
This is temporary, name "trans" will be used for something else and
"trans_ph" will eventually disappear.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker.c | 382 +--
 include/net/switchdev.h  |   6 +-
 net/dsa/slave.c  |   8 +-
 net/switchdev/switchdev.c|  12 +-
 4 files changed, 204 insertions(+), 204 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
index 34ac41a..b5f2ff8 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -340,7 +340,7 @@ static bool rocker_port_is_ovsed(const struct rocker_port *rocker_port)
 #define ROCKER_OP_FLAG_REFRESH BIT(3)
 
 static void *__rocker_port_mem_alloc(struct rocker_port *rocker_port,
-enum switchdev_trans trans, int flags,
+enum switchdev_trans_ph trans_ph, int flags,
 size_t size)
 {
struct list_head *elem = NULL;
@@ -356,7 +356,7 @@ static void *__rocker_port_mem_alloc(struct rocker_port *rocker_port,
 * memory used in the commit phase.
 */
 
-   switch (trans) {
+   switch (trans_ph) {
case SWITCHDEV_TRANS_PREPARE:
elem = kzalloc(size + sizeof(*elem), gfp_flags);
if (!elem)
@@ -381,20 +381,20 @@ static void *__rocker_port_mem_alloc(struct rocker_port *rocker_port,
 }
 
 static void *rocker_port_kzalloc(struct rocker_port *rocker_port,
-enum switchdev_trans trans, int flags,
+enum switchdev_trans_ph trans_ph, int flags,
 size_t size)
 {
-   return __rocker_port_mem_alloc(rocker_port, trans, flags, size);
+   return __rocker_port_mem_alloc(rocker_port, trans_ph, flags, size);
 }
 
 static void *rocker_port_kcalloc(struct rocker_port *rocker_port,
-enum switchdev_trans trans, int flags,
+enum switchdev_trans_ph trans_ph, int flags,
 size_t n, size_t size)
 {
-   return __rocker_port_mem_alloc(rocker_port, trans, flags, n * size);
+   return __rocker_port_mem_alloc(rocker_port, trans_ph, flags, n * size);
 }
 
-static void rocker_port_kfree(enum switchdev_trans trans, const void *mem)
+static void rocker_port_kfree(enum switchdev_trans_ph trans_ph, const void *mem)
 {
struct list_head *elem;
 
@@ -403,7 +403,7 @@ static void rocker_port_kfree(enum switchdev_trans trans, const void *mem)
 * commit phase.
 */
 
-   if (trans == SWITCHDEV_TRANS_PREPARE)
+   if (trans_ph == SWITCHDEV_TRANS_PREPARE)
return;
 
elem = (struct list_head *)mem - 1;
@@ -430,22 +430,22 @@ static void rocker_wait_init(struct rocker_wait *wait)
 }
 
 static struct rocker_wait *rocker_wait_create(struct rocker_port *rocker_port,
- enum switchdev_trans trans,
+ enum switchdev_trans_ph trans_ph,
  int flags)
 {
struct rocker_wait *wait;
 
-   wait = rocker_port_kzalloc(rocker_port, trans, flags, sizeof(*wait));
+   wait = rocker_port_kzalloc(rocker_port, trans_ph, flags, sizeof(*wait));
if (!wait)
return NULL;
rocker_wait_init(wait);
return wait;
 }
 
-static void rocker_wait_destroy(enum switchdev_trans trans,
+static void rocker_wait_destroy(enum switchdev_trans_ph trans_ph,
struct rocker_wait *wait)
 {
-   rocker_port_kfree(trans, wait);
+   rocker_port_kfree(trans_ph, wait);
 }
 
 static bool rocker_wait_event_timeout(struct rocker_wait *wait,
@@ -1463,7 +1463,7 @@ static int rocker_event_link_change(const struct rocker *rocker,
 }
 
 static int rocker_port_fdb(struct rocker_port *rocker_port,
-  enum switchdev_trans trans,
+  enum switchdev_trans_ph trans_ph,
   const unsigned char *addr,
   __be16 vlan_id, int flags);
 
@@ -1582,7 +1582,7 @@ typedef int (*rocker_cmd_proc_cb_t)(const struct rocker_port *rocker_port,
void *priv);
 
 static int rocker_cmd_exec(struct rocker_port *rocker_port,
-  enum switchdev_trans trans, int flags,
+  enum switchdev_trans_ph trans_ph, int flags,
   rocker_cmd_prep_cb_t prepare, void *prepare_priv,
   rocker_cmd_proc_cb_t process, void *process_priv)
 {
@@ -1593,7 +1593,7 @@ static int rocker_cmd_exec(struct rocker_port *rocker_port,
unsigned long lock_flags;
int err;
 
-   wait = rocker_wait_create(rocker_port, trans, flags);
+   

[patch net-next RFC 5/6] rocker: use switchdev transaction queue for allocated memory

2015-09-19 Thread Jiri Pirko
Benefit from previously introduced infra and remove rocker specific
transaction memory management.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker.c | 64 
 1 file changed, 13 insertions(+), 51 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
index 9750840..0735d90 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -226,7 +226,6 @@ struct rocker_port {
struct napi_struct napi_rx;
struct rocker_dma_ring_info tx_ring;
struct rocker_dma_ring_info rx_ring;
-   struct list_head trans_mem;
 };
 
 struct rocker {
@@ -347,6 +346,7 @@ enum rocker_trans_ph {
 };
 
 struct rocker_trans {
+   struct switchdev_trans *trans;
enum rocker_trans_ph ph;
 };
 
@@ -354,7 +354,7 @@ static void *__rocker_port_mem_alloc(struct rocker_port *rocker_port,
 struct rocker_trans *rtrans, int flags,
 size_t size)
 {
-   struct list_head *elem = NULL;
+   struct switchdev_trans_item *elem = NULL;
gfp_t gfp_flags = (flags & ROCKER_OP_FLAG_NOWAIT) ?
  GFP_ATOMIC : GFP_KERNEL;
 
@@ -369,20 +369,15 @@ static void *__rocker_port_mem_alloc(struct rocker_port *rocker_port,
 
switch (rtrans->ph) {
case ROCKER_TRANS_PH_PREPARE:
-   elem = kzalloc(size + sizeof(*elem), gfp_flags);
+   elem = kzalloc(size + sizeof(elem), gfp_flags);
if (!elem)
return NULL;
-   list_add_tail(elem, &rocker_port->trans_mem);
-   break;
+   switchdev_trans_item_enqueue(rtrans->trans, elem, kfree, elem);
case ROCKER_TRANS_PH_COMMIT:
-   BUG_ON(list_empty(&rocker_port->trans_mem));
-   elem = rocker_port->trans_mem.next;
-   list_del_init(elem);
+   elem = switchdev_trans_item_dequeue(rtrans->trans);
break;
case ROCKER_TRANS_PH_NONE:
-   elem = kzalloc(size + sizeof(*elem), gfp_flags);
-   if (elem)
-   INIT_LIST_HEAD(elem);
+   elem = kzalloc(size + sizeof(elem), gfp_flags);
break;
default:
break;
@@ -407,18 +402,16 @@ static void *rocker_port_kcalloc(struct rocker_port *rocker_port,
 
 static void rocker_port_kfree(struct rocker_trans *rtrans, const void *mem)
 {
-   struct list_head *elem;
+   struct switchdev_trans_item *elem;
 
-   /* Frees are ignored if in transaction prepare phase.  The
-* memory remains on the per-port list until freed in the
-* commit phase.
+   /* Free only in case of NONE phase, otherwise, switchdev core
+* will take care of the cleanup
 */
 
-   if (rtrans->ph == ROCKER_TRANS_PH_PREPARE)
+   if (rtrans->ph != ROCKER_TRANS_PH_NONE)
return;
 
-   elem = (struct list_head *)mem - 1;
-   BUG_ON(!list_empty(elem));
+   elem = (struct switchdev_trans_item *) mem - 1;
kfree(elem);
 }
 
@@ -4322,16 +4315,6 @@ static int rocker_port_attr_get(struct net_device *dev,
return 0;
 }
 
-static void rocker_port_trans_abort(const struct rocker_port *rocker_port)
-{
-   struct list_head *mem, *tmp;
-
-   list_for_each_safe(mem, tmp, &rocker_port->trans_mem) {
-   list_del(mem);
-   kfree(mem);
-   }
-}
-
 static int rocker_port_brport_flags_set(struct rocker_port *rocker_port,
struct rocker_trans *rtrans,
unsigned long brport_flags)
@@ -4357,20 +4340,10 @@ static int rocker_port_attr_set(struct net_device *dev,
struct rocker_port *rocker_port = netdev_priv(dev);
struct rocker_trans rtrans = {
.ph = trans->ph,
+   .trans = trans,
};
int err = 0;
 
-   switch (rtrans.ph) {
-   case ROCKER_TRANS_PH_PREPARE:
-   BUG_ON(!list_empty(&rocker_port->trans_mem));
-   break;
-   case ROCKER_TRANS_PH_ABORT:
-   rocker_port_trans_abort(rocker_port);
-   return 0;
-   default:
-   break;
-   }
-
switch (attr->id) {
case SWITCHDEV_ATTR_PORT_STP_STATE:
err = rocker_port_stp_update(rocker_port, ,
@@ -4445,21 +4418,11 @@ static int rocker_port_obj_add(struct net_device *dev,
struct rocker_port *rocker_port = netdev_priv(dev);
struct rocker_trans rtrans = {
.ph = trans->ph,
+   .trans = trans,
};
const struct switchdev_obj_ipv4_fib *fib4;
int err = 0;
 
-   switch (rtrans.ph) {
-   case ROCKER_TRANS_PH_PREPARE:
-   BUG_ON(!list_empty(&rocker_port->trans_mem));
-   break;
-   case ROCKER_TRANS_PH_ABORT:
-   

[patch net-next RFC 3/6] rocker: switch to local transaction phase enum

2015-09-19 Thread Jiri Pirko
Since the switchdev_trans_ph enum is going to be removed, and the rocker code is
way too complicated in this matter to be converted, just introduce local
enum for transaction phase. Pass it around in local transaction
structure.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker.c | 469 ++-
 1 file changed, 245 insertions(+), 224 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
index 92e1520..de1a367 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -339,8 +339,19 @@ static bool rocker_port_is_ovsed(const struct rocker_port *rocker_port)
 #define ROCKER_OP_FLAG_LEARNED BIT(2)
 #define ROCKER_OP_FLAG_REFRESH BIT(3)
 
+enum rocker_trans_ph {
+   ROCKER_TRANS_PH_NONE,
+   ROCKER_TRANS_PH_PREPARE,
+   ROCKER_TRANS_PH_ABORT,
+   ROCKER_TRANS_PH_COMMIT,
+};
+
+struct rocker_trans {
+   enum rocker_trans_ph ph;
+};
+
 static void *__rocker_port_mem_alloc(struct rocker_port *rocker_port,
-enum switchdev_trans_ph trans_ph, int flags,
+struct rocker_trans *rtrans, int flags,
 size_t size)
 {
struct list_head *elem = NULL;
@@ -356,19 +367,19 @@ static void *__rocker_port_mem_alloc(struct rocker_port *rocker_port,
 * memory used in the commit phase.
 */
 
-   switch (trans_ph) {
-   case SWITCHDEV_TRANS_PREPARE:
+   switch (rtrans->ph) {
+   case ROCKER_TRANS_PH_PREPARE:
elem = kzalloc(size + sizeof(*elem), gfp_flags);
if (!elem)
return NULL;
list_add_tail(elem, &rocker_port->trans_mem);
break;
-   case SWITCHDEV_TRANS_COMMIT:
+   case ROCKER_TRANS_PH_COMMIT:
BUG_ON(list_empty(&rocker_port->trans_mem));
elem = rocker_port->trans_mem.next;
list_del_init(elem);
break;
-   case SWITCHDEV_TRANS_NONE:
+   case ROCKER_TRANS_PH_NONE:
elem = kzalloc(size + sizeof(*elem), gfp_flags);
if (elem)
INIT_LIST_HEAD(elem);
@@ -381,20 +392,20 @@ static void *__rocker_port_mem_alloc(struct rocker_port *rocker_port,
 }
 
 static void *rocker_port_kzalloc(struct rocker_port *rocker_port,
-enum switchdev_trans_ph trans_ph, int flags,
+struct rocker_trans *rtrans, int flags,
 size_t size)
 {
-   return __rocker_port_mem_alloc(rocker_port, trans_ph, flags, size);
+   return __rocker_port_mem_alloc(rocker_port, rtrans, flags, size);
 }
 
 static void *rocker_port_kcalloc(struct rocker_port *rocker_port,
-enum switchdev_trans_ph trans_ph, int flags,
+struct rocker_trans *rtrans, int flags,
 size_t n, size_t size)
 {
-   return __rocker_port_mem_alloc(rocker_port, trans_ph, flags, n * size);
+   return __rocker_port_mem_alloc(rocker_port, rtrans, flags, n * size);
 }
 
-static void rocker_port_kfree(enum switchdev_trans_ph trans_ph, const void *mem)
+static void rocker_port_kfree(struct rocker_trans *rtrans, const void *mem)
 {
struct list_head *elem;
 
@@ -403,7 +414,7 @@ static void rocker_port_kfree(enum switchdev_trans_ph trans_ph, const void *mem)
 * commit phase.
 */
 
-   if (trans_ph == SWITCHDEV_TRANS_PREPARE)
+   if (rtrans->ph == ROCKER_TRANS_PH_PREPARE)
return;
 
elem = (struct list_head *)mem - 1;
@@ -430,22 +441,22 @@ static void rocker_wait_init(struct rocker_wait *wait)
 }
 
 static struct rocker_wait *rocker_wait_create(struct rocker_port *rocker_port,
- enum switchdev_trans_ph trans_ph,
+ struct rocker_trans *rtrans,
  int flags)
 {
struct rocker_wait *wait;
 
-   wait = rocker_port_kzalloc(rocker_port, trans_ph, flags, sizeof(*wait));
+   wait = rocker_port_kzalloc(rocker_port, rtrans, flags, sizeof(*wait));
if (!wait)
return NULL;
rocker_wait_init(wait);
return wait;
 }
 
-static void rocker_wait_destroy(enum switchdev_trans_ph trans_ph,
+static void rocker_wait_destroy(struct rocker_trans *rtrans,
struct rocker_wait *wait)
 {
-   rocker_port_kfree(trans_ph, wait);
+   rocker_port_kfree(rtrans, wait);
 }
 
 static bool rocker_wait_event_timeout(struct rocker_wait *wait,
@@ -1408,7 +1419,7 @@ static irqreturn_t rocker_cmd_irq_handler(int irq, void *dev_id)
wait = rocker_desc_cookie_ptr_get(desc_info);
if (wait->nowait) {

[patch net-next RFC 0/6] switchdev: introduce transaction infra and prepare-commit split

2015-09-19 Thread Jiri Pirko
Jiri Pirko (6):
  switchdev: rename "trans" to "trans_ph".
  switchdev: introduce transaction infrastructure for attr_set and
obj_add
  rocker: switch to local transaction phase enum
  switchdev: move transaction phase enum under transaction structure
  rocker: use switchdev transaction queue for allocated memory
  switchdev: split commit and prepare phase into two callbacks

 drivers/net/ethernet/rocker/rocker.c | 579 ++-
 include/net/switchdev.h  |  39 ++-
 net/dsa/slave.c  | 129 +---
 net/switchdev/switchdev.c| 153 +++--
 4 files changed, 545 insertions(+), 355 deletions(-)

-- 
1.9.3



[patch net-next RFC 2/6] switchdev: introduce transaction infrastructure for attr_set and obj_add

2015-09-19 Thread Jiri Pirko
Now, the memory allocation in prepare/commit state is done separately
in each driver (rocker). Introduce the similar mechanism in generic
switchdev code, in form of queue. That can be used not only for memory
allocations, but also for different items. Commit/abort item destruction
is handled as well.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker.c |  6 ++-
 include/net/switchdev.h  | 24 --
 net/dsa/slave.c  |  6 ++-
 net/switchdev/switchdev.c| 87 ++--
 4 files changed, 103 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
index b5f2ff8..92e1520 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -4340,7 +4340,8 @@ static int rocker_port_brport_flags_set(struct rocker_port *rocker_port,
 }
 
 static int rocker_port_attr_set(struct net_device *dev,
-   struct switchdev_attr *attr)
+   struct switchdev_attr *attr,
+   struct switchdev_trans *trans)
 {
struct rocker_port *rocker_port = netdev_priv(dev);
int err = 0;
@@ -4424,7 +4425,8 @@ static int rocker_port_fdb_add(struct rocker_port *rocker_port,
 }
 
 static int rocker_port_obj_add(struct net_device *dev,
-  struct switchdev_obj *obj)
+  struct switchdev_obj *obj,
+  struct switchdev_trans *trans)
 {
struct rocker_port *rocker_port = netdev_priv(dev);
const struct switchdev_obj_ipv4_fib *fib4;
diff --git a/include/net/switchdev.h b/include/net/switchdev.h
index 494f510..1e394f1 100644
--- a/include/net/switchdev.h
+++ b/include/net/switchdev.h
@@ -1,6 +1,6 @@
 /*
  * include/net/switchdev.h - Switch device API
- * Copyright (c) 2014 Jiri Pirko 
+ * Copyright (c) 2014-2015 Jiri Pirko 
  * Copyright (c) 2014-2015 Scott Feldman 
  *
  * This program is free software; you can redistribute it and/or modify
@@ -13,6 +13,7 @@
 
 #include 
 #include 
+#include 
 
 #define SWITCHDEV_F_NO_RECURSE BIT(0)
 
@@ -23,6 +24,16 @@ enum switchdev_trans_ph {
SWITCHDEV_TRANS_COMMIT,
 };
 
+struct switchdev_trans_item {
+   struct list_head list;
+   void *data;
+   void (*destructor)(const void *data);
+};
+
+struct switchdev_trans {
+   struct list_head item_list;
+};
+
 enum switchdev_attr_id {
SWITCHDEV_ATTR_UNDEFINED,
SWITCHDEV_ATTR_PORT_PARENT_ID,
@@ -77,6 +88,11 @@ struct switchdev_obj {
} u;
 };
 
+void switchdev_trans_item_enqueue(struct switchdev_trans *trans,
+ void *data, void (*destructor)(void const *),
+ struct switchdev_trans_item *tritem);
+void *switchdev_trans_item_dequeue(struct switchdev_trans *trans);
+
 /**
  * struct switchdev_ops - switchdev operations
  *
@@ -94,9 +110,11 @@ struct switchdev_ops {
int (*switchdev_port_attr_get)(struct net_device *dev,
   struct switchdev_attr *attr);
int (*switchdev_port_attr_set)(struct net_device *dev,
-  struct switchdev_attr *attr);
+  struct switchdev_attr *attr,
+  struct switchdev_trans *trans);
int (*switchdev_port_obj_add)(struct net_device *dev,
- struct switchdev_obj *obj);
+ struct switchdev_obj *obj,
+ struct switchdev_trans *trans);
int (*switchdev_port_obj_del)(struct net_device *dev,
  struct switchdev_obj *obj);
int (*switchdev_port_obj_dump)(struct net_device *dev,
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 7f50b74..ac76fd1 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -456,7 +456,8 @@ static int dsa_slave_stp_update(struct net_device *dev, u8 state)
 }
 
 static int dsa_slave_port_attr_set(struct net_device *dev,
-  struct switchdev_attr *attr)
+  struct switchdev_attr *attr,
+  struct switchdev_trans *trans)
 {
int ret = 0;
 
@@ -474,7 +475,8 @@ static int dsa_slave_port_attr_set(struct net_device *dev,
 }
 
 static int dsa_slave_port_obj_add(struct net_device *dev,
- struct switchdev_obj *obj)
+ struct switchdev_obj *obj,
+ struct switchdev_trans *trans)
 {
int err;
 
diff --git a/net/switchdev/switchdev.c b/net/switchdev/switchdev.c
index df5a544..a3647bf 100644
--- a/net/switchdev/switchdev.c
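The enqueue/dequeue mechanism the commit message describes, queue items with a destructor during the prepare phase, consume them during commit, and destroy leftovers on abort, can be modeled in plain C. This is a userspace sketch with simplified names, not the switchdev API itself:

```c
#include <assert.h>
#include <stddef.h>

struct trans_item {
	struct trans_item *next;
	void *data;
	void (*destructor)(const void *data);
};

struct trans {
	struct trans_item *head, *tail;
};

/* Prepare phase: remember the item so commit can pick it up again. */
static void trans_item_enqueue(struct trans *t, struct trans_item *item,
			       void *data, void (*destructor)(const void *))
{
	item->next = NULL;
	item->data = data;
	item->destructor = destructor;
	if (t->tail)
		t->tail->next = item;
	else
		t->head = item;
	t->tail = item;
}

/* Commit phase: hand back the data queued during prepare, in order. */
static void *trans_item_dequeue(struct trans *t)
{
	struct trans_item *item = t->head;

	if (!item)
		return NULL;
	t->head = item->next;
	if (!t->head)
		t->tail = NULL;
	return item->data;
}

/* Abort: anything still queued is destroyed via its destructor. */
static void trans_abort(struct trans *t)
{
	struct trans_item *item;

	while ((item = t->head) != NULL) {
		t->head = item->next;
		if (item->destructor)
			item->destructor(item->data);
	}
	t->tail = NULL;
}
```

The FIFO ordering matters: the commit phase must see allocations in exactly the order the prepare phase made them, which is what lets a driver like rocker replay its prepare-phase allocations during commit.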

[patch net-next RFC 6/6] switchdev: split commit and prepare phase into two callbacks

2015-09-19 Thread Jiri Pirko
It is more convenient to have prepare and commit phases for attr_set and
obj_add as separate callbacks. If a driver needs to do it differently, it
can easily do in inside its code.

Signed-off-by: Jiri Pirko 
---
 drivers/net/ethernet/rocker/rocker.c |  88 +++---
 include/net/switchdev.h  |  18 +++---
 net/dsa/slave.c  | 117 +++
 net/switchdev/switchdev.c|  74 +-
 4 files changed, 210 insertions(+), 87 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
index 0735d90..42aa86c 100644
--- a/drivers/net/ethernet/rocker/rocker.c
+++ b/drivers/net/ethernet/rocker/rocker.c
@@ -4333,35 +4333,55 @@ static int rocker_port_brport_flags_set(struct rocker_port *rocker_port,
return err;
 }
 
-static int rocker_port_attr_set(struct net_device *dev,
-   struct switchdev_attr *attr,
-   struct switchdev_trans *trans)
+static int __rocker_port_attr_set(struct rocker_port *rocker_port,
+ struct switchdev_attr *attr,
+ struct rocker_trans *rtrans)
 {
-   struct rocker_port *rocker_port = netdev_priv(dev);
-   struct rocker_trans rtrans = {
-   .ph = trans->ph,
-   .trans = trans,
-   };
int err = 0;
 
switch (attr->id) {
case SWITCHDEV_ATTR_PORT_STP_STATE:
-   err = rocker_port_stp_update(rocker_port, &rtrans,
+   err = rocker_port_stp_update(rocker_port, rtrans,
 ROCKER_OP_FLAG_NOWAIT,
 attr->u.stp_state);
break;
case SWITCHDEV_ATTR_PORT_BRIDGE_FLAGS:
-   err = rocker_port_brport_flags_set(rocker_port, &rtrans,
+   err = rocker_port_brport_flags_set(rocker_port, rtrans,
   attr->u.brport_flags);
break;
default:
err = -EOPNOTSUPP;
break;
}
-
return err;
 }
 
+static int rocker_port_attr_pre_set(struct net_device *dev,
+   struct switchdev_attr *attr,
+   struct switchdev_trans *trans)
+{
+   struct rocker_port *rocker_port = netdev_priv(dev);
+   struct rocker_trans rtrans = {
+   .ph = ROCKER_TRANS_PH_PREPARE,
+   .trans = trans,
+   };
+
+   return __rocker_port_attr_set(rocker_port, attr, &rtrans);
+}
+
+static int rocker_port_attr_set(struct net_device *dev,
+   struct switchdev_attr *attr,
+   struct switchdev_trans *trans)
+{
+   struct rocker_port *rocker_port = netdev_priv(dev);
+   struct rocker_trans rtrans = {
+   .ph = ROCKER_TRANS_PH_COMMIT,
+   .trans = trans,
+   };
+
+   return __rocker_port_attr_set(rocker_port, attr, &rtrans);
+}
+
 static int rocker_port_vlan_add(struct rocker_port *rocker_port,
struct rocker_trans *rtrans, u16 vid, u16 flags)
 {
@@ -4411,40 +4431,60 @@ static int rocker_port_fdb_add(struct rocker_port 
*rocker_port,
return rocker_port_fdb(rocker_port, rtrans, fdb->addr, vlan_id, flags);
 }
 
-static int rocker_port_obj_add(struct net_device *dev,
-  struct switchdev_obj *obj,
-  struct switchdev_trans *trans)
+static int __rocker_port_obj_add(struct rocker_port *rocker_port,
+struct switchdev_obj *obj,
+struct rocker_trans *rtrans)
 {
-   struct rocker_port *rocker_port = netdev_priv(dev);
-   struct rocker_trans rtrans = {
-   .ph = trans->ph,
-   .trans = trans,
-   };
const struct switchdev_obj_ipv4_fib *fib4;
int err = 0;
 
switch (obj->id) {
case SWITCHDEV_OBJ_PORT_VLAN:
-   err = rocker_port_vlans_add(rocker_port, &rtrans,
+   err = rocker_port_vlans_add(rocker_port, rtrans,
    &obj->u.vlan);
break;
case SWITCHDEV_OBJ_IPV4_FIB:
fib4 = &obj->u.ipv4_fib;
-   err = rocker_port_fib_ipv4(rocker_port, &rtrans,
+   err = rocker_port_fib_ipv4(rocker_port, rtrans,
   htonl(fib4->dst), fib4->dst_len,
   fib4->fi, fib4->tb_id, 0);
break;
case SWITCHDEV_OBJ_PORT_FDB:
-   err = rocker_port_fdb_add(rocker_port, &rtrans, &obj->u.fdb);
+   err = rocker_port_fdb_add(rocker_port, rtrans, &obj->u.fdb);
break;
default:
err = -EOPNOTSUPP;
break;
}
-
return err;
 }
 
+static int 

Re: epoll, missed opportunity?

2015-09-19 Thread Eric Dumazet
On Fri, 2015-09-18 at 22:51 -0600, Jonathan Marler wrote:
> I'm curious why there wasn't another field added to the epoll_event
> struct for the application to store the descriptor's context. Any
> useful multi-plexing application will have a context that will need to
> be retrieved every time a descriptor needs to be serviced. Since the
> epoll api has no way of storing this context, it has to be looked up
> using the descriptor, which will take more time/memory as the number
> of descriptors increase. The memory saved from omitting this context
> can't be worth it since you'll have to allocate the memory in the
> application anyway, plus you're now adding the extra lookup.
> 
> This "lookup" problem has always existed in multi-plexed applications.
> It was impossible to fix with older polling interfaces, however, since
> epoll is stateful, it provides an opportunity to fix this problem by
> storing the descriptor context in epoll's "state". What was the reason
> for not doing this?  Was it an oversight or am I missing something?


typedef union epoll_data
{
  void *ptr;
  int fd;
  uint32_t u32;
  uint64_t u64;
} epoll_data_t;

struct epoll_event
{
  uint32_t events;  /* Epoll events */
  epoll_data_t data;/* User data variable */
} __EPOLL_PACKED;



Application is free to use whatever is needed in epoll_data_t

You can store a pointer to your own data (ptr)
Or a 32 bit cookie (u32)
Or a 64 bit cookie (u64)

(But it is a union, so you have to pick one of them)

Nothing forces you to use 'fd', kernel does not care.




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: epoll, missed opportunity?

2015-09-19 Thread Jonathan Marler
The data field holds the file descriptor you are waiting on, it has to
be the file descriptor, otherwise, how would the kernel know which
file descriptor you are trying to wait on?

On Sat, Sep 19, 2015 at 9:21 AM, Eric Dumazet  wrote:
> On Fri, 2015-09-18 at 22:51 -0600, Jonathan Marler wrote:
>> I'm curious why there wasn't another field added to the epoll_event
>> struct for the application to store the descriptor's context. Any
>> useful multi-plexing application will have a context that will need to
>> be retrieved every time a descriptor needs to be serviced. Since the
>> epoll api has no way of storing this context, it has to be looked up
>> using the descriptor, which will take more time/memory as the number
>> of descriptors increase. The memory saved from omitting this context
>> can't be worth it since you'll have to allocate the memory in the
>> application anyway, plus you're now adding the extra lookup.
>>
>> This "lookup" problem has always existed in multi-plexed applications.
>> It was impossible to fix with older polling interfaces, however, since
>> epoll is stateful, it provides an opportunity to fix this problem by
>> storing the descriptor context in epoll's "state". What was the reason
>> for not doing this?  Was it an oversight or am I missing something?
>
>
> typedef union epoll_data
> {
>   void *ptr;
>   int fd;
>   uint32_t u32;
>   uint64_t u64;
> } epoll_data_t;
>
> struct epoll_event
> {
>   uint32_t events;  /* Epoll events */
>   epoll_data_t data;/* User data variable */
> } __EPOLL_PACKED;
>
>
>
> Application is free to use whatever is needed in poll_data_t
>
> You can store a pointer to your own data (ptr)
> Or a 32 bit cookie (u32)
> Or a 64 bit cookie (u64)
>
> (But is an union, you have to pick one of them)
>
> Nothing forces you to use 'fd', kernel does not care.
>
>
>
>


RE: [patch net-next RFC 0/6] switchdev: introduce tranction enfra and for pre-commit split

2015-09-19 Thread Rosen, Rami
Hi,

>introduce tranction enfra and for pre-commit split

Typo:
Instead "tranction enfra" should be "transaction infrastructure".

Regards,
Rami Rosen
Intel Corporation


Re: [linux-next] oops in ip_route_input_noref

2015-09-19 Thread David Ahern

On 9/18/15 5:06 PM, Andrew Morton wrote:


I've been hitting this as well.  An oops on boot in
ip_route_input_slow(), here:


Fixed in net-next. bde6f9ded1bd37ff27a042dcb968e104d92b02c1

David



Re: [PATCH net-next 3/7] rocker: adding port ageing_time for ageing out FDB entries

2015-09-19 Thread Scott Feldman
On Fri, Sep 18, 2015 at 11:30 PM, Jiri Pirko  wrote:
> Fri, Sep 18, 2015 at 09:55:47PM CEST, sfel...@gmail.com wrote:
>>From: Scott Feldman 
>>
>>Follow-up patcheset will allow user to change ageing_time, but for now
>>just hard-code it to a fixed value (the same value used as the default
>>for the bridge driver).
>>
>>Signed-off-by: Scott Feldman 
>>---
>> drivers/net/ethernet/rocker/rocker.c |2 ++
>> 1 file changed, 2 insertions(+)
>>
>>diff --git a/drivers/net/ethernet/rocker/rocker.c 
>>b/drivers/net/ethernet/rocker/rocker.c
>>index f55ed2c..eba22f5 100644
>>--- a/drivers/net/ethernet/rocker/rocker.c
>>+++ b/drivers/net/ethernet/rocker/rocker.c
>>@@ -221,6 +221,7 @@ struct rocker_port {
>>   __be16 internal_vlan_id;
>>   int stp_state;
>>   u32 brport_flags;
>>+  unsigned long ageing_time;
>>   bool ctrls[ROCKER_CTRL_MAX];
>>   unsigned long vlan_bitmap[ROCKER_VLAN_BITMAP_LEN];
>>   struct napi_struct napi_tx;
>>@@ -4975,6 +4976,7 @@ static int rocker_probe_port(struct rocker *rocker, 
>>unsigned int port_number)
>>   rocker_port->port_number = port_number;
>>   rocker_port->pport = port_number + 1;
>>   rocker_port->brport_flags = BR_LEARNING | BR_LEARNING_SYNC;
>>+  rocker_port->ageing_time = 300 * HZ;
>
> How about to add also "BR_DEFAULT_AGEING_TIME" and use it here?

Yes, added for v2


Re: [PATCH net] af_unix: return data from multiple SKBs on recv() with MSG_PEEK flag

2015-09-19 Thread Sergei Shtylyov

Hello.

On 9/18/2015 7:04 PM, Aaron Conole wrote:


AF_UNIX sockets now return multiple skbs from recv() when the MSG_PEEK
flag is set.

This is referenced in kernel bugzilla #12323 @
https://bugzilla.kernel.org/show_bug.cgi?id=12323

As described both in the BZ and lkml thread @
http://lkml.org/lkml/2008/1/8/444 calling recv() with MSG_PEEK on an
AF_UNIX socket only reads a single skb, where the desired effect is
to return as much skb data has been queued, until hitting the recv
buffer size (whichever comes first).

The modified MSG_PEEK path will now move to the next skb in the tree
and jump to the again: label, rather than following the natural loop
structure. This requires duplicating some of the loop head actions.

This was tested using the python socketpair() code attached to the
bugzilla issue.

Signed-off-by: Aaron Conole 


   Your patch doesn't comply to the Linux CodingStyle.


---
  net/unix/af_unix.c | 17 +++++++++++++++--
  1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 03ee4d3..d2fd342 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -2179,9 +2181,22 @@ unlock:
if (UNIXCB(skb).fp)
scm.fp = scm_fp_dup(UNIXCB(skb).fp);

-   sk_peek_offset_fwd(sk, chunk);
+   if (skip)
+   {


if (skip) {


+   sk_peek_offset_fwd(sk, chunk);
+   skip -= chunk;
+   }

-   break;
+   if (UNIXCB(skb).fp)
+   break;
+
+   /* XXX - this is ugly; better would be rewrite the 
function  */
+   last = skb;
+   last_len = skb->len;
+   unix_state_lock(sk);
+   skb = skb_peek_next(skb, &sk->sk_receive_queue);
+   if (skb) goto again;


if (skb)
goto again;


+   goto unlock;
}
} while (size);



MBR, Sergei



Re: [PATCH net-next 4/7] bridge: define some min/max ageing time constants we'll use next

2015-09-19 Thread Scott Feldman
On Fri, Sep 18, 2015 at 11:45 PM, Jiri Pirko  wrote:
> Fri, Sep 18, 2015 at 09:55:48PM CEST, sfel...@gmail.com wrote:
>>From: Scott Feldman 
>>
>>Signed-off-by: Scott Feldman 
>>---
>> include/linux/if_bridge.h |4 
>> 1 file changed, 4 insertions(+)
>>
>>diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
>>index dad8b00..6cc6dbc 100644
>>--- a/include/linux/if_bridge.h
>>+++ b/include/linux/if_bridge.h
>>@@ -46,6 +46,10 @@ struct br_ip_list {
>> #define BR_LEARNING_SYNC  BIT(9)
>> #define BR_PROXYARP_WIFI  BIT(10)
>>
>>+/* values as per ieee8021QBridgeFdbAgingTime */
>>+#define BR_MIN_AGEING_TIME(10 * HZ)
>>+#define BR_MAX_AGEING_TIME(100 * HZ)
>
> I think that a bridge patch checking against these values should be
> introduced along with these values, in the same patchset

I need the MIN value for this patchset in rocker's ageing timer, so
it's introduced here.  MIN/MAX will be used again in a follow-on patch
Prem is going to send to range-check user input.


Re: [patch net-next RFC 0/6] switchdev: introduce tranction enfra and for pre-commit split

2015-09-19 Thread Scott Feldman
On Sat, Sep 19, 2015 at 5:29 AM, Jiri Pirko  wrote:
> Jiri Pirko (6):
>   switchdev: rename "trans" to "trans_ph".
>   switchdev: introduce transaction infrastructure for attr_set and
> obj_add
>   rocker: switch to local transaction phase enum
>   switchdev: move transaction phase enum under transaction structure
>   rocker: use switchdev transaction queue for allocated memory
>   switchdev: split commit and prepare phase into two callbacks

Whew, that's a lot of work!  Seems like a good idea to up-level this
for other drivers to share.  Let me apply the patches and run my tests
and get back to you.


Re: epoll, missed opportunity?

2015-09-19 Thread Jonathan Marler
Wow how did I miss that?! This is perfect though, there is a context
pointer!  Finally my dream of a perfect polling interface exists in
linux.  Thanks so much for the quick response.

On Sat, Sep 19, 2015 at 9:46 AM, Tom Herbert  wrote:
> On Sat, Sep 19, 2015 at 8:30 AM, Jonathan Marler  
> wrote:
>> The data field holds the file descriptor you are waiting on, it has to
>> be the file descriptor, otherwise, how would the kernel know which
>> file descriptor you are trying to wait on?
>>
> fd is the third argument in epoll_ctl.
>
> int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);
>
>> On Sat, Sep 19, 2015 at 9:21 AM, Eric Dumazet  wrote:
>>> On Fri, 2015-09-18 at 22:51 -0600, Jonathan Marler wrote:
 I'm curious why there wasn't another field added to the epoll_event
 struct for the application to store the descriptor's context. Any
 useful multi-plexing application will have a context that will need to
 be retrieved every time a descriptor needs to be serviced. Since the
 epoll api has no way of storing this context, it has to be looked up
 using the descriptor, which will take more time/memory as the number
 of descriptors increase. The memory saved from omitting this context
 can't be worth it since you'll have to allocate the memory in the
 application anyway, plus you're now adding the extra lookup.

 This "lookup" problem has always existed in multi-plexed applications.
 It was impossible to fix with older polling interfaces, however, since
 epoll is stateful, it provides an opportunity to fix this problem by
 storing the descriptor context in epoll's "state". What was the reason
 for not doing this?  Was it an oversight or am I missing something?
>>>
>>>
>>> typedef union epoll_data
>>> {
>>>   void *ptr;
>>>   int fd;
>>>   uint32_t u32;
>>>   uint64_t u64;
>>> } epoll_data_t;
>>>
>>> struct epoll_event
>>> {
>>>   uint32_t events;  /* Epoll events */
>>>   epoll_data_t data;/* User data variable */
>>> } __EPOLL_PACKED;
>>>
>>>
>>>
>>> Application is free to use whatever is needed in poll_data_t
>>>
>>> You can store a pointer to your own data (ptr)
>>> Or a 32 bit cookie (u32)
>>> Or a 64 bit cookie (u64)
>>>
>>> (But is an union, you have to pick one of them)
>>>
>>> Nothing forces you to use 'fd', kernel does not care.
>>>
>>>
>>>
>>>


[PATCH net] tcp/dccp: fix timewait races in timer handling

2015-09-19 Thread Eric Dumazet
From: Eric Dumazet 

When creating a timewait socket, we need to arm the timer before
allowing other cpus to find it. The signal allowing cpus to find
the socket is setting tw_refcnt to non zero value.

As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
call inet_twsk_schedule() first.

This also means we need to remove tw_refcnt changes from
inet_twsk_schedule() and let the caller handle it.

Note that because we use mod_timer_pinned(), we have the guarantee
the timer won't expire before we set tw_refcnt, as we run in BH context.

To make things more readable I introduced the inet_twsk_reschedule() helper.

When rearming the timer, we can use mod_timer_pending() to make sure
we do not rearm a canceled timer.

Note: This bug can possibly trigger if packets of a flow can hit
multiple cpus. This does not normally happen, unless flow steering
is broken somehow. This explains why this bug was spotted ~5 months
after its introduction.

A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
but will be provided in a separate patch for proper tracking.

Fixes: 789f558cfb36 ("tcp/dccp: get rid of central timewait timer")
Signed-off-by: Eric Dumazet 
Reported-by: Ying Cai 
---
 include/net/inet_timewait_sock.h |   14 +-
 net/dccp/minisocks.c |4 ++--
 net/ipv4/inet_timewait_sock.c|   16 ++--
 net/ipv4/tcp_minisocks.c |   13 ++---
 4 files changed, 31 insertions(+), 16 deletions(-)

diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 879d6e5a973b..186f3a1e1b1f 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -110,7 +110,19 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct 
sock *sk,
 void __inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
   struct inet_hashinfo *hashinfo);
 
-void inet_twsk_schedule(struct inet_timewait_sock *tw, const int timeo);
+void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo,
+ bool rearm);
+
+static void inline inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo)
+{
+   __inet_twsk_schedule(tw, timeo, false);
+}
+
+static void inline inet_twsk_reschedule(struct inet_timewait_sock *tw, int 
timeo)
+{
+   __inet_twsk_schedule(tw, timeo, true);
+}
+
 void inet_twsk_deschedule_put(struct inet_timewait_sock *tw);
 
 void inet_twsk_purge(struct inet_hashinfo *hashinfo,
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 30addee2dd03..838f524cf11a 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -48,8 +48,6 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
tw->tw_ipv6only = sk->sk_ipv6only;
}
 #endif
-   /* Linkage updates. */
-   __inet_twsk_hashdance(tw, sk, _hashinfo);
 
/* Get the TIME_WAIT timeout firing. */
if (timeo < rto)
@@ -60,6 +58,8 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
timeo = DCCP_TIMEWAIT_LEN;
 
inet_twsk_schedule(tw, timeo);
+   /* Linkage updates. */
+   __inet_twsk_hashdance(tw, sk, _hashinfo);
inet_twsk_put(tw);
} else {
/* Sorry, if we're out of memory, just CLOSE this
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index ae22cc24fbe8..c67f9bd7699c 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -123,13 +123,15 @@ void __inet_twsk_hashdance(struct inet_timewait_sock *tw, 
struct sock *sk,
/*
 * Step 2: Hash TW into tcp ehash chain.
 * Notes :
-* - tw_refcnt is set to 3 because :
+* - tw_refcnt is set to 4 because :
 * - We have one reference from bhash chain.
 * - We have one reference from ehash chain.
+* - We have one reference from timer.
+* - One reference for ourself (our caller will release it).
 * We can use atomic_set() because prior spin_lock()/spin_unlock()
 * committed into memory all tw fields.
 */
-   atomic_set(&tw->tw_refcnt, 1 + 1 + 1);
+   atomic_set(&tw->tw_refcnt, 4);
inet_twsk_add_node_rcu(tw, &ehead->chain);
 
/* Step 3: Remove SK from hash chain */
@@ -217,7 +219,7 @@ void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
 }
 EXPORT_SYMBOL(inet_twsk_deschedule_put);
 
-void inet_twsk_schedule(struct inet_timewait_sock *tw, const int timeo)
+void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo, bool rearm)
 {
/* timeout := RTO * 3.5
 *
@@ -245,12 +247,14 @@ void inet_twsk_schedule(struct inet_timewait_sock *tw, 
const int timeo)
 */
 
tw->tw_kill = timeo <= 4*HZ;
-   if (!mod_timer_pinned(&tw->tw_timer, jiffies + timeo)) {
-   atomic_inc(&tw->tw_refcnt);
+

Re: [patch net-next RFC 0/6] switchdev: introduce tranction enfra and for pre-commit split

2015-09-19 Thread Jiri Pirko
Sat, Sep 19, 2015 at 03:35:51PM CEST, rami.ro...@intel.com wrote:
>Hi,
>
>>introduce tranction enfra and for pre-commit split
>
>Typo:
>Instead "tranction enfra" should be "transaction infrastructure".

Will fix. Thanks!


Re: epoll, missed opportunity?

2015-09-19 Thread Tom Herbert
On Sat, Sep 19, 2015 at 8:30 AM, Jonathan Marler  wrote:
> The data field holds the file descriptor you are waiting on, it has to
> be the file descriptor, otherwise, how would the kernel know which
> file descriptor you are trying to wait on?
>
fd is the third argument in epoll_ctl.

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

> On Sat, Sep 19, 2015 at 9:21 AM, Eric Dumazet  wrote:
>> On Fri, 2015-09-18 at 22:51 -0600, Jonathan Marler wrote:
>>> I'm curious why there wasn't another field added to the epoll_event
>>> struct for the application to store the descriptor's context. Any
>>> useful multi-plexing application will have a context that will need to
>>> be retrieved every time a descriptor needs to be serviced. Since the
>>> epoll api has no way of storing this context, it has to be looked up
>>> using the descriptor, which will take more time/memory as the number
>>> of descriptors increase. The memory saved from omitting this context
>>> can't be worth it since you'll have to allocate the memory in the
>>> application anyway, plus you're now adding the extra lookup.
>>>
>>> This "lookup" problem has always existed in multi-plexed applications.
>>> It was impossible to fix with older polling interfaces, however, since
>>> epoll is stateful, it provides an opportunity to fix this problem by
>>> storing the descriptor context in epoll's "state". What was the reason
>>> for not doing this?  Was it an oversight or am I missing something?
>>
>>
>> typedef union epoll_data
>> {
>>   void *ptr;
>>   int fd;
>>   uint32_t u32;
>>   uint64_t u64;
>> } epoll_data_t;
>>
>> struct epoll_event
>> {
>>   uint32_t events;  /* Epoll events */
>>   epoll_data_t data;/* User data variable */
>> } __EPOLL_PACKED;
>>
>>
>>
>> Application is free to use whatever is needed in poll_data_t
>>
>> You can store a pointer to your own data (ptr)
>> Or a 32 bit cookie (u32)
>> Or a 64 bit cookie (u64)
>>
>> (But is an union, you have to pick one of them)
>>
>> Nothing forces you to use 'fd', kernel does not care.
>>
>>
>>
>>


[PATCH net] inet: fix races in reqsk_queue_hash_req()

2015-09-19 Thread Eric Dumazet
From: Eric Dumazet 

Before allowing lockless LISTEN processing, we need to make
sure to arm the SYN_RECV timer before the req socket is visible
in hash tables.

Also, req->rsk_hash should be written before we set rsk_refcnt
to a non zero value.

Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet 
Cc: Ying Cai 
---
 net/ipv4/inet_connection_sock.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 134957159c27..7bb9c39e0a4d 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -685,20 +685,20 @@ void reqsk_queue_hash_req(struct request_sock_queue 
*queue,
req->num_timeout = 0;
req->sk = NULL;
 
+   setup_timer(&req->rsk_timer, reqsk_timer_handler, (unsigned long)req);
+   mod_timer_pinned(&req->rsk_timer, jiffies + timeout);
+   req->rsk_hash = hash;
+
/* before letting lookups find us, make sure all req fields
 * are committed to memory and refcnt initialized.
 */
smp_wmb();
atomic_set(&req->rsk_refcnt, 2);
-   setup_timer(&req->rsk_timer, reqsk_timer_handler, (unsigned long)req);
-   req->rsk_hash = hash;
 
spin_lock(&queue->syn_wait_lock);
req->dl_next = lopt->syn_table[hash];
lopt->syn_table[hash] = req;
spin_unlock(&queue->syn_wait_lock);
-
-   mod_timer_pinned(&req->rsk_timer, jiffies + timeout);
 }
 EXPORT_SYMBOL(reqsk_queue_hash_req);
 

