Re: Regression in throughput between kvm guests over virtual bridge

2017-09-19 Thread Jason Wang



On 2017-09-19 02:11, Matthew Rosato wrote:

On 09/18/2017 03:36 AM, Jason Wang wrote:


On 2017-09-18 11:13, Jason Wang wrote:


On 2017-09-16 03:19, Matthew Rosato wrote:

It looks like vhost is slowed down for some reason, which leads to more
idle time on 4.13+VHOST_RX_BATCH=1. It would be appreciated if you could
collect a perf.diff on the host, one for rx and one for tx.


perf data below for the associated vhost threads, baseline=4.12,
delta1=4.13, delta2=4.13+VHOST_RX_BATCH=1

Client vhost:

60.12%  -11.11%  -12.34%  [kernel.vmlinux]   [k] raw_copy_from_user
13.76%   -1.28%   -0.74%  [kernel.vmlinux]   [k] get_page_from_freelist
   2.00%   +3.69%   +3.54%  [kernel.vmlinux]   [k] __wake_up_sync_key
   1.19%   +0.60%   +0.66%  [kernel.vmlinux]   [k] __alloc_pages_nodemask
   1.12%   +0.76%   +0.86%  [kernel.vmlinux]   [k] copy_page_from_iter
   1.09%   +0.28%   +0.35%  [vhost][k] vhost_get_vq_desc
   1.07%   +0.31%   +0.26%  [kernel.vmlinux]   [k] alloc_skb_with_frags
   0.94%   +0.42%   +0.65%  [kernel.vmlinux]   [k] alloc_pages_current
   0.91%   -0.19%   -0.18%  [kernel.vmlinux]   [k] memcpy
   0.88%   +0.26%   +0.30%  [kernel.vmlinux]   [k] __next_zones_zonelist
   0.85%   +0.05%   +0.12%  [kernel.vmlinux]   [k] iov_iter_advance
   0.79%   +0.09%   +0.19%  [vhost][k] __vhost_add_used_n
   0.74%                    [kernel.vmlinux]   [k] get_task_policy.part.7
   0.74%   -0.01%   -0.05%  [kernel.vmlinux]   [k] tun_net_xmit
   0.60%   +0.17%   +0.33%  [kernel.vmlinux]   [k] policy_nodemask
   0.58%   -0.15%   -0.12%  [ebtables] [k] ebt_do_table
   0.52%   -0.25%   -0.22%  [kernel.vmlinux]   [k] __alloc_skb
 ...
   0.42%   +0.58%   +0.59%  [kernel.vmlinux]   [k] eventfd_signal
 ...
   0.32%   +0.96%   +0.93%  [kernel.vmlinux]   [k] finish_task_switch
 ...
   +1.50%   +1.16%  [kernel.vmlinux]   [k] get_task_policy.part.9
   +0.40%   +0.42%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
   +0.39%   +0.40%  [kernel.vmlinux]   [k] _copy_from_iter_full
   +0.24%   +0.23%  [vhost_net][k] vhost_net_buf_peek

Server vhost:

61.93%  -10.72%  -10.91%  [kernel.vmlinux]   [k] raw_copy_to_user
   9.25%   +0.47%   +0.86%  [kernel.vmlinux]   [k] free_hot_cold_page
   5.16%   +1.41%   +1.57%  [vhost][k] vhost_get_vq_desc
   5.12%   -3.81%   -3.78%  [kernel.vmlinux]   [k] skb_release_data
   3.30%   +0.42%   +0.55%  [kernel.vmlinux]   [k] raw_copy_from_user
   1.29%   +2.20%   +2.28%  [kernel.vmlinux]   [k] copy_page_to_iter
   1.24%   +1.65%   +0.45%  [vhost_net][k] handle_rx
   1.08%   +3.03%   +2.85%  [kernel.vmlinux]   [k] __wake_up_sync_key
   0.96%   +0.70%   +1.10%  [vhost][k] translate_desc
   0.69%   -0.20%   -0.22%  [kernel.vmlinux]   [k] tun_do_read.part.10
   0.69%                    [kernel.vmlinux]   [k] tun_peek_len
   0.67%   +0.75%   +0.78%  [kernel.vmlinux]   [k] eventfd_signal
   0.52%   +0.96%   +0.98%  [kernel.vmlinux]   [k] finish_task_switch
   0.50%   +0.05%   +0.09%  [vhost][k] vhost_add_used_n
 ...
   +0.63%   +0.58%  [vhost_net][k] vhost_net_buf_peek
   +0.32%   +0.32%  [kernel.vmlinux]   [k] _copy_to_iter
   +0.19%   +0.19%  [kernel.vmlinux]   [k] __skb_get_hash_symmetr
   +0.11%   +0.21%  [vhost][k] vhost_umem_interval_tr


Looks like something unknown is leading to more wakeups.

Could you please try the attached patch to see if it solves or mitigates
the issue?

Thanks

My bad, please try this.

Thanks

Thanks Jason.  Built 4.13 + supplied patch, I see some decrease in
wakeups, but there's still quite a bit more compared to 4.12
(baseline=4.12, delta1=4.13, delta2=4.13+patch):

client:
  2.00%   +3.69%   +2.55%  [kernel.vmlinux]   [k] __wake_up_sync_key

server:
  1.08%   +3.03%   +1.85%  [kernel.vmlinux]   [k] __wake_up_sync_key


Throughput was roughly equivalent to base 4.13 (so, still seeing the
regression w/ this patch applied).



Seems to make some progress on wakeup mitigation. The previous patch tries
to reduce unnecessary traversal of the waitqueue during rx. The attached
patch goes even further and disables rx polling while processing tx.
Please try it to see if it makes any difference.
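For reference, the rough shape of the idea is sketched below. This is an
illustration only, not the attached patch: vhost_net_disable_vq() and
vhost_net_enable_vq() are existing helpers in drivers/vhost/net.c, but
their exact placement here is assumed.

/* Sketch: stop watching the rx socket while tx is being processed, so
 * tx work does not trigger spurious rx wakeups, then re-arm rx polling.
 */
static void handle_tx(struct vhost_net *net)
{
        struct vhost_net_virtqueue *rx_nvq = &net->vqs[VHOST_NET_VQ_RX];

        vhost_net_disable_vq(net, &rx_nvq->vq);

        /* ... existing tx descriptor processing loop ... */

        vhost_net_enable_vq(net, &rx_nvq->vq);
}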


And two questions:
- Does the issue exist if you run uperf between 2 VMs (instead of 4 VMs)?
- Does enabling batching in the tap of the sending VM improve performance
(ethtool -C $tap rx-frames 64)?


Thanks
From d57ad96083fc57205336af1b5ea777e5185f1581 Mon Sep 17 00:00:00 2001
From: Jason Wang 
Date: Wed, 20 Sep 2017 11:44:49 +0800
Subject: [PATCH] vhost_net: avoid unnecessary wakeups during tx

Signed-off-by: Jason Wang 
---
 drivers/vhost/net.c | 21 ++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ed476fa..e7349cf 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -444,8 +444,11 @@ static bool vhost_exceeds_maxpend(struct vhost_net *n

[PATCH net-next v5 2/4] bpf: add a test case for helper bpf_perf_event_read_value

2017-09-19 Thread Yonghong Song
The bpf sample program tracex6 is enhanced to use the new
helper to read enabled/running time as well.

Signed-off-by: Yonghong Song 
---
 samples/bpf/tracex6_kern.c| 26 ++
 samples/bpf/tracex6_user.c| 13 -
 tools/include/uapi/linux/bpf.h|  3 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 +++
 4 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/samples/bpf/tracex6_kern.c b/samples/bpf/tracex6_kern.c
index e7d1803..46c557a 100644
--- a/samples/bpf/tracex6_kern.c
+++ b/samples/bpf/tracex6_kern.c
@@ -15,6 +15,12 @@ struct bpf_map_def SEC("maps") values = {
.value_size = sizeof(u64),
.max_entries = 64,
 };
+struct bpf_map_def SEC("maps") values2 = {
+   .type = BPF_MAP_TYPE_HASH,
+   .key_size = sizeof(int),
+   .value_size = sizeof(struct bpf_perf_event_value),
+   .max_entries = 64,
+};
 
 SEC("kprobe/htab_map_get_next_key")
 int bpf_prog1(struct pt_regs *ctx)
@@ -37,5 +43,25 @@ int bpf_prog1(struct pt_regs *ctx)
return 0;
 }
 
+SEC("kprobe/htab_map_lookup_elem")
+int bpf_prog2(struct pt_regs *ctx)
+{
+   u32 key = bpf_get_smp_processor_id();
+   struct bpf_perf_event_value *val, buf;
+   int error;
+
+   error = bpf_perf_event_read_value(&counters, key, &buf, sizeof(buf));
+   if (error)
+   return 0;
+
+   val = bpf_map_lookup_elem(&values2, &key);
+   if (val)
+   *val = buf;
+   else
+   bpf_map_update_elem(&values2, &key, &buf, BPF_NOEXIST);
+
+   return 0;
+}
+
 char _license[] SEC("license") = "GPL";
 u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex6_user.c b/samples/bpf/tracex6_user.c
index a05a99a..3341a96 100644
--- a/samples/bpf/tracex6_user.c
+++ b/samples/bpf/tracex6_user.c
@@ -22,6 +22,7 @@
 
 static void check_on_cpu(int cpu, struct perf_event_attr *attr)
 {
+   struct bpf_perf_event_value value2;
int pmu_fd, error = 0;
cpu_set_t set;
__u64 value;
@@ -46,8 +47,18 @@ static void check_on_cpu(int cpu, struct perf_event_attr *attr)
fprintf(stderr, "Value missing for CPU %d\n", cpu);
error = 1;
goto on_exit;
+   } else {
+   fprintf(stderr, "CPU %d: %llu\n", cpu, value);
+   }
+   /* The above bpf_map_lookup_elem should trigger the second kprobe */
+   if (bpf_map_lookup_elem(map_fd[2], &cpu, &value2)) {
+   fprintf(stderr, "Value2 missing for CPU %d\n", cpu);
+   error = 1;
+   goto on_exit;
+   } else {
+   fprintf(stderr, "CPU %d: counter: %llu, enabled: %llu, running: 
%llu\n", cpu,
+   value2.counter, value2.enabled, value2.running);
}
-   fprintf(stderr, "CPU %d: %llu\n", cpu, value);
 
 on_exit:
assert(bpf_map_delete_elem(map_fd[0], &cpu) == 0 || error);
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 461811e..79eb529 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -632,7 +632,8 @@ union bpf_attr {
FN(skb_adjust_room),\
FN(redirect_map),   \
FN(sk_redirect_map),\
-   FN(sock_map_update),
+   FN(sock_map_update),\
+   FN(perf_event_read_value),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 36fb916..08e6f8c 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -70,6 +70,9 @@ static int (*bpf_sk_redirect_map)(void *map, int key, int flags) =
 static int (*bpf_sock_map_update)(void *map, void *key, void *value,
  unsigned long long flags) =
(void *) BPF_FUNC_sock_map_update;
+static int (*bpf_perf_event_read_value)(void *map, unsigned long long flags,
+   void *buf, unsigned int buf_size) =
+   (void *) BPF_FUNC_perf_event_read_value;
 
 
 /* llvm builtin functions that eBPF C program may use to
-- 
2.9.5



[PATCH net-next v5 0/4] bpf: add two helpers to read perf event enabled/running time

2017-09-19 Thread Yonghong Song
Hardware pmu counters are limited resources. When there are more
pmu based perf events opened than available counters, kernel will
multiplex these events so each event gets certain percentage
(but not 100%) of the pmu time. In case that multiplexing happens,
the number of samples or counter value will not reflect the
case compared to no multiplexing. This makes comparison between
different runs difficult.

Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
  normalized_num_samples = num_samples * time_enabled / time_running
  normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time the event was enabled and time_running is
the time the event was actually running since the last normalization.
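As a concrete illustration of the formulas above (plain C, not part of the
patch set; assumes the 64-bit multiply does not overflow for the
magnitudes involved):

#include <stdint.h>

/* Scale a raw sample count or counter value by enabled/running time.
 * With no multiplexing, enabled == running and the value is unchanged.
 * Example: value = 1000, enabled = 200us, running = 100us (the event
 * got half of the pmu time) -> normalized value = 2000. */
static uint64_t normalize(uint64_t value, uint64_t enabled, uint64_t running)
{
        if (running == 0)
                return 0;       /* the event never ran in this interval */
        return value * enabled / running;
}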
 
This patch set implements two helper functions.
The helper bpf_perf_event_read_value reads counter/time_enabled/time_running
for a perf event array map. The helper bpf_perf_prog_read_value reads
counter/time_enabled/time_running for a bpf prog with type
BPF_PROG_TYPE_PERF_EVENT.
 
Changelogs:
v4->v5:
  . fix some coding style issues
  . memset the input buffer in case of error for ARG_PTR_TO_UNINIT_MEM
type of argument.
v3->v4:
  . fix a build failure
v2->v3:
  . counters should be read together with enabled/running time. This is to
prevent counters and enabled/running time from being read separately.
v1->v2:
  . reading enabled/running time should be done together with reading
counters, which contains the logic to ensure the result is valid.

Yonghong Song (4):
  bpf: add helper bpf_perf_event_read_value for perf event array map
  bpf: add a test case for helper bpf_perf_event_read_value
  bpf: add helper bpf_perf_prog_read_value
  bpf: add a test case for helper bpf_perf_prog_read_value

 include/linux/perf_event.h|  7 ++-
 include/uapi/linux/bpf.h  | 27 ++-
 kernel/bpf/arraymap.c |  2 +-
 kernel/bpf/verifier.c |  4 +-
 kernel/events/core.c  | 16 +--
 kernel/trace/bpf_trace.c  | 74 ---
 samples/bpf/trace_event_kern.c| 10 +
 samples/bpf/trace_event_user.c| 13 +++---
 samples/bpf/tracex6_kern.c| 26 +++
 samples/bpf/tracex6_user.c| 13 +-
 tools/include/uapi/linux/bpf.h|  4 +-
 tools/testing/selftests/bpf/bpf_helpers.h |  6 +++
 12 files changed, 182 insertions(+), 20 deletions(-)

-- 
2.9.5



[PATCH net-next v5 4/4] bpf: add a test case for helper bpf_perf_prog_read_value

2017-09-19 Thread Yonghong Song
The bpf sample program trace_event is enhanced to use the new
helper to print out enabled/running time.

Signed-off-by: Yonghong Song 
---
 samples/bpf/trace_event_kern.c| 10 ++
 samples/bpf/trace_event_user.c| 13 -
 tools/include/uapi/linux/bpf.h|  3 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 +++
 4 files changed, 23 insertions(+), 6 deletions(-)

diff --git a/samples/bpf/trace_event_kern.c b/samples/bpf/trace_event_kern.c
index 41b6115..a77a583d 100644
--- a/samples/bpf/trace_event_kern.c
+++ b/samples/bpf/trace_event_kern.c
@@ -37,10 +37,14 @@ struct bpf_map_def SEC("maps") stackmap = {
 SEC("perf_event")
 int bpf_prog1(struct bpf_perf_event_data *ctx)
 {
+   char time_fmt1[] = "Time Enabled: %llu, Time Running: %llu";
+   char time_fmt2[] = "Get Time Failed, ErrCode: %d";
char fmt[] = "CPU-%d period %lld ip %llx";
u32 cpu = bpf_get_smp_processor_id();
+   struct bpf_perf_event_value value_buf;
struct key_t key;
u64 *val, one = 1;
+   int ret;
 
if (ctx->sample_period < 1)
/* ignore warmup */
@@ -54,6 +58,12 @@ int bpf_prog1(struct bpf_perf_event_data *ctx)
return 0;
}
 
+   ret = bpf_perf_prog_read_value(ctx, (void *)&value_buf, sizeof(struct bpf_perf_event_value));
+   if (!ret)
+ bpf_trace_printk(time_fmt1, sizeof(time_fmt1), value_buf.enabled, value_buf.running);
+   else
+ bpf_trace_printk(time_fmt2, sizeof(time_fmt2), ret);
+
val = bpf_map_lookup_elem(&counts, &key);
if (val)
(*val)++;
diff --git a/samples/bpf/trace_event_user.c b/samples/bpf/trace_event_user.c
index 7bd827b..bf4f1b6 100644
--- a/samples/bpf/trace_event_user.c
+++ b/samples/bpf/trace_event_user.c
@@ -127,6 +127,9 @@ static void test_perf_event_all_cpu(struct perf_event_attr *attr)
int *pmu_fd = malloc(nr_cpus * sizeof(int));
int i, error = 0;
 
+   /* system wide perf event, no need to inherit */
+   attr->inherit = 0;
+
/* open perf_event on all cpus */
for (i = 0; i < nr_cpus; i++) {
pmu_fd[i] = sys_perf_event_open(attr, -1, i, -1, 0);
@@ -154,6 +157,11 @@ static void test_perf_event_task(struct perf_event_attr *attr)
 {
int pmu_fd;
 
+   /* per task perf event, enable inherit so the "dd ..." command can be traced properly.
+* Enabling inherit will cause bpf_perf_prog_read_time helper failure.
+*/
+   attr->inherit = 1;
+
/* open task bound event */
pmu_fd = sys_perf_event_open(attr, 0, -1, -1, 0);
if (pmu_fd < 0) {
@@ -175,14 +183,12 @@ static void test_bpf_perf_event(void)
.freq = 1,
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
-   .inherit = 1,
};
struct perf_event_attr attr_type_sw = {
.sample_freq = SAMPLE_FREQ,
.freq = 1,
.type = PERF_TYPE_SOFTWARE,
.config = PERF_COUNT_SW_CPU_CLOCK,
-   .inherit = 1,
};
struct perf_event_attr attr_hw_cache_l1d = {
.sample_freq = SAMPLE_FREQ,
@@ -192,7 +198,6 @@ static void test_bpf_perf_event(void)
PERF_COUNT_HW_CACHE_L1D |
(PERF_COUNT_HW_CACHE_OP_READ << 8) |
(PERF_COUNT_HW_CACHE_RESULT_ACCESS << 16),
-   .inherit = 1,
};
struct perf_event_attr attr_hw_cache_branch_miss = {
.sample_freq = SAMPLE_FREQ,
@@ -202,7 +207,6 @@ static void test_bpf_perf_event(void)
PERF_COUNT_HW_CACHE_BPU |
(PERF_COUNT_HW_CACHE_OP_READ << 8) |
(PERF_COUNT_HW_CACHE_RESULT_MISS << 16),
-   .inherit = 1,
};
struct perf_event_attr attr_type_raw = {
.sample_freq = SAMPLE_FREQ,
@@ -210,7 +214,6 @@ static void test_bpf_perf_event(void)
.type = PERF_TYPE_RAW,
/* Intel Instruction Retired */
.config = 0xc0,
-   .inherit = 1,
};
 
printf("Test HW_CPU_CYCLES\n");
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 79eb529..50d2bcd 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -633,7 +633,8 @@ union bpf_attr {
FN(redirect_map),   \
FN(sk_redirect_map),\
FN(sock_map_update),\
-   FN(perf_event_read_value),
+   FN(perf_event_read_value),  \
+   FN(perf_prog_read_value),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index 08e6f8c..1d3dcd4 100644
--- a/tools/testing/selftests

[PATCH net-next v5 1/4] bpf: add helper bpf_perf_event_read_value for perf event array map

2017-09-19 Thread Yonghong Song
Hardware pmu counters are limited resources. When there are more
pmu based perf events opened than available counters, kernel will
multiplex these events so each event gets certain percentage
(but not 100%) of the pmu time. In case that multiplexing happens,
the number of samples or counter value will not reflect the
case compared to no multiplexing. This makes comparison between
different runs difficult.

Typically, the number of samples or counter value should be
normalized before comparing to other experiments. The typical
normalization is done like:
  normalized_num_samples = num_samples * time_enabled / time_running
  normalized_counter_value = counter_value * time_enabled / time_running
where time_enabled is the time the event was enabled and time_running is
the time the event was actually running since the last normalization.

This patch adds helper bpf_perf_event_read_value for kprobe-based perf
event array maps, to read the perf counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.
To achieve a scaling factor between two bpf invocations, users can
use cpu_id as the key (which is typical for the perf array usage model)
to remember the previous value and do the calculation inside the
bpf program.
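For illustration, a sketch of that usage pattern follows. 'counters' is
the perf event array from the tracex6 sample; 'prev_values' is an assumed
hash map (declared like 'values2' in the test-case patch) that remembers
the previous reading per cpu:

SEC("kprobe/htab_map_get_next_key")
int scaling_example(struct pt_regs *ctx)
{
        u32 key = bpf_get_smp_processor_id();
        struct bpf_perf_event_value cur, *prev;

        if (bpf_perf_event_read_value(&counters, key, &cur, sizeof(cur)))
                return 0;

        prev = bpf_map_lookup_elem(&prev_values, &key);
        if (prev) {
                u64 dcount = cur.counter - prev->counter;
                u64 denabled = cur.enabled - prev->enabled;
                u64 drunning = cur.running - prev->running;

                /* normalized delta for this interval, per the formula
                 * above; store or report it as needed */
                if (drunning)
                        dcount = dcount * denabled / drunning;
        }
        /* remember this reading for the next invocation on this cpu */
        bpf_map_update_elem(&prev_values, &key, &cur, BPF_ANY);
        return 0;
}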

Signed-off-by: Yonghong Song 
---
 include/linux/perf_event.h |  6 --
 include/uapi/linux/bpf.h   | 19 ++-
 kernel/bpf/arraymap.c  |  2 +-
 kernel/bpf/verifier.c  |  4 +++-
 kernel/events/core.c   | 15 ---
 kernel/trace/bpf_trace.c   | 46 +-
 6 files changed, 79 insertions(+), 13 deletions(-)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 8e22f24..21d8c12 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -884,7 +884,8 @@ perf_event_create_kernel_counter(struct perf_event_attr *attr,
void *context);
 extern void perf_pmu_migrate_context(struct pmu *pmu,
int src_cpu, int dst_cpu);
-int perf_event_read_local(struct perf_event *event, u64 *value);
+int perf_event_read_local(struct perf_event *event, u64 *value,
+ u64 *enabled, u64 *running);
 extern u64 perf_event_read_value(struct perf_event *event,
 u64 *enabled, u64 *running);
 
@@ -1286,7 +1287,8 @@ static inline const struct perf_event_attr *perf_event_attrs(struct perf_event *
 {
return ERR_PTR(-EINVAL);
 }
-static inline int perf_event_read_local(struct perf_event *event, u64 *value)
+static inline int perf_event_read_local(struct perf_event *event, u64 *value,
+   u64 *enabled, u64 *running)
 {
return -EINVAL;
 }
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 43ab5c4..ccfe1b1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -582,6 +582,14 @@ union bpf_attr {
  * @map: pointer to sockmap to update
  * @key: key to insert/update sock in map
  * @flags: same flags as map update elem
+ *
+ * int bpf_perf_event_read_value(map, flags, buf, buf_size)
+ * read perf event counter value and perf event enabled/running time
+ * @map: pointer to perf_event_array map
+ * @flags: index of event in the map or bitmask flags
+ * @buf: buf to fill
+ * @buf_size: size of the buf
+ * Return: 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -638,6 +646,7 @@ union bpf_attr {
FN(redirect_map),   \
FN(sk_redirect_map),\
FN(sock_map_update),\
+   FN(perf_event_read_value),  \
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
@@ -681,7 +690,9 @@ enum bpf_func_id {
 #define BPF_F_ZERO_CSUM_TX (1ULL << 1)
 #define BPF_F_DONT_FRAGMENT(1ULL << 2)
 
-/* BPF_FUNC_perf_event_output and BPF_FUNC_perf_event_read flags. */
+/* BPF_FUNC_perf_event_output, BPF_FUNC_perf_event_read and
+ * BPF_FUNC_perf_event_read_value flags.
+ */
 #define BPF_F_INDEX_MASK   0xULL
 #define BPF_F_CURRENT_CPU  BPF_F_INDEX_MASK
 /* BPF_FUNC_perf_event_output for sk_buff input context. */
@@ -864,4 +875,10 @@ enum {
 #define TCP_BPF_IW 1001/* Set TCP initial congestion window */
 #define TCP_BPF_SNDCWND_CLAMP  1002/* Set sndcwnd_clamp */
 
+struct bpf_perf_event_value {
+   __u64 counter;
+   __u64 enabled;
+   __u64 running;
+};
+
 #endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 98c0f00..68d8666 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -492,7 +492,7 @@ static void *perf_event_fd_array_get_ptr(struct bpf_map 
*map,
 
ee = ERR_PTR(-EOPNOTSUPP);
event = perf_file->private_data;
-   if (perf_event_read_loca

[PATCH net-next v5 3/4] bpf: add helper bpf_perf_prog_read_value

2017-09-19 Thread Yonghong Song
This patch adds helper bpf_perf_prog_read_value for perf event based bpf
programs, to read the event counter and enabled/running time.
The enabled/running time is accumulated since the perf event open.

The typical use case for a perf event based bpf program is to attach itself
to a single event. In such cases, if it is desirable to get the scaling
factor between two bpf invocations, users can save the time values in a
map, and use the value from the map and the current value to calculate
the scaling factor.
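A sketch of that pattern, mirroring the previous example but using the
ctx-based helper ('saved' is an assumed hash map keyed by cpu id):

SEC("perf_event")
int prog_scaling_example(struct bpf_perf_event_data *ctx)
{
        u32 key = bpf_get_smp_processor_id();
        struct bpf_perf_event_value cur, *prev;

        if (bpf_perf_prog_read_value(ctx, &cur, sizeof(cur)))
                return 0;

        prev = bpf_map_lookup_elem(&saved, &key);
        if (prev) {
                /* (cur.enabled - prev->enabled) and
                 * (cur.running - prev->running) give the enabled/running
                 * deltas, i.e. the scaling factor for this interval */
        }
        bpf_map_update_elem(&saved, &key, &cur, BPF_ANY);
        return 0;
}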

Signed-off-by: Yonghong Song 
---
 include/linux/perf_event.h |  1 +
 include/uapi/linux/bpf.h   |  8 
 kernel/events/core.c   |  1 +
 kernel/trace/bpf_trace.c   | 28 
 4 files changed, 38 insertions(+)

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 21d8c12..79b18a2 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -806,6 +806,7 @@ struct perf_output_handle {
 struct bpf_perf_event_data_kern {
struct pt_regs *regs;
struct perf_sample_data *data;
+   struct perf_event *event;
 };
 
 #ifdef CONFIG_CGROUP_PERF
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index ccfe1b1..f3eeae2 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -590,6 +590,13 @@ union bpf_attr {
  * @buf: buf to fill
  * @buf_size: size of the buf
  * Return: 0 on success or negative error code
+ *
+ * int bpf_perf_prog_read_value(ctx, buf, buf_size)
+ * read perf prog attached perf event counter and enabled/running time
+ * @ctx: pointer to ctx
+ * @buf: buf to fill
+ * @buf_size: size of the buf
+ * Return: 0 on success or negative error code
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -647,6 +654,7 @@ union bpf_attr {
FN(sk_redirect_map),\
FN(sock_map_update),\
FN(perf_event_read_value),  \
+   FN(perf_prog_read_value),   \
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends to call
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2d5bbe5..d039086 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8081,6 +8081,7 @@ static void bpf_overflow_handler(struct perf_event *event,
struct bpf_perf_event_data_kern ctx = {
.data = data,
.regs = regs,
+   .event = event,
};
int ret = 0;
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 686dfa1..c4d617a 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -612,6 +612,32 @@ static const struct bpf_func_proto bpf_get_stackid_proto_tp = {
.arg3_type  = ARG_ANYTHING,
 };
 
+BPF_CALL_3(bpf_perf_prog_read_value_tp, struct bpf_perf_event_data_kern *, ctx,
+  struct bpf_perf_event_value *, buf, u32, size)
+{
+   int err;
+
+   if (unlikely(size != sizeof(struct bpf_perf_event_value)))
+   return -EINVAL;
+
+   err = perf_event_read_local(ctx->event, &buf->counter, &buf->enabled,
+   &buf->running);
+   if (unlikely(err)) {
+   memset(buf, 0, size);
+   return err;
+   }
+   return 0;
+}
+
+static const struct bpf_func_proto bpf_perf_prog_read_value_proto_tp = {
+ .func   = bpf_perf_prog_read_value_tp,
+ .gpl_only   = true,
+ .ret_type   = RET_INTEGER,
+ .arg1_type  = ARG_PTR_TO_CTX,
+ .arg2_type  = ARG_PTR_TO_UNINIT_MEM,
+ .arg3_type  = ARG_CONST_SIZE,
+};
+
static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
 {
switch (func_id) {
@@ -619,6 +645,8 @@ static const struct bpf_func_proto *tp_prog_func_proto(enum bpf_func_id func_id)
return &bpf_perf_event_output_proto_tp;
case BPF_FUNC_get_stackid:
return &bpf_get_stackid_proto_tp;
+   case BPF_FUNC_perf_prog_read_value:
+   return &bpf_perf_prog_read_value_proto_tp;
default:
return tracing_func_proto(func_id);
}
-- 
2.9.5



Re: [PATCH,net-next,0/2] Improve code coverage of syzkaller

2017-09-19 Thread David Miller
From: Petar Penkov 
Date: Tue, 19 Sep 2017 21:26:14 -0700

> Furthermore, in a way testing already requires specific kernel
> configuration.  In this particular example, syzkaller prefers
> synchronous operation and therefore needs 4KSTACKS disabled. Other
> features that require rebuilding are KASAN and dbx. From this point
> of view, I still think that having the TUN_NAPI flag has value.

Then I think this path could be enabled/disabled with a runtime flag
just as easily, no?


Re: [PATCH net-next 2/4] qed: Add iWARP out of order support

2017-09-19 Thread Kalderon, Michal
From: Leon Romanovsky 
Sent: Tuesday, September 19, 2017 8:45 PM
On Tue, Sep 19, 2017 at 08:26:17PM +0300, Michal Kalderon wrote:
>> iWARP requires OOO support which is already provided by the ll2
>> interface (until now it was used only for iSCSI offload).
>> The changes mostly include opening an ll2 dedicated connection for
>> OOO and notifying the FW about the handle id.
>>
>> Signed-off-by: Michal Kalderon 
>> Signed-off-by: Ariel Elior 
>> ---
>>  drivers/net/ethernet/qlogic/qed/qed_iwarp.c | 44 +
>>  drivers/net/ethernet/qlogic/qed/qed_iwarp.h | 11 +++-
>>  drivers/net/ethernet/qlogic/qed/qed_rdma.c  |  7 +++--
>>  3 files changed, 59 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
>> index 9d989c9..568e985 100644
>> --- a/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
>> +++ b/drivers/net/ethernet/qlogic/qed/qed_iwarp.c
>> @@ -41,6 +41,7 @@
>>  #include "qed_rdma.h"
>>  #include "qed_reg_addr.h"
>>  #include "qed_sp.h"
>> +#include "qed_ooo.h"
>>
>>  #define QED_IWARP_ORD_DEFAULT32
>>  #define QED_IWARP_IRD_DEFAULT32
>> @@ -119,6 +120,13 @@ static void qed_iwarp_cid_cleaned(struct qed_hwfn *p_hwfn, u32 cid)
>>   spin_unlock_bh(&p_hwfn->p_rdma_info->lock);
>>  }
>>
>> +void qed_iwarp_init_fw_ramrod(struct qed_hwfn *p_hwfn,
>> +   struct iwarp_init_func_params *p_ramrod)
>> +{
>> + p_ramrod->ll2_ooo_q_index = RESC_START(p_hwfn, QED_LL2_QUEUE) +
>> + p_hwfn->p_rdma_info->iwarp.ll2_ooo_handle;
>> +}
>> +
>>  static int qed_iwarp_alloc_cid(struct qed_hwfn *p_hwfn, u32 *cid)
>>  {
>>   int rc;
>> @@ -1876,6 +1884,16 @@ static int qed_iwarp_ll2_stop(struct qed_hwfn *p_hwfn, struct qed_ptt *p_ptt)
>>   iwarp_info->ll2_syn_handle = QED_IWARP_HANDLE_INVAL;
>>   }
>>
>> + if (iwarp_info->ll2_ooo_handle != QED_IWARP_HANDLE_INVAL) {
>> + rc = qed_ll2_terminate_connection(p_hwfn,
>> +   iwarp_info->ll2_ooo_handle);
>> + if (rc)
>> + DP_INFO(p_hwfn, "Failed to terminate ooo 
>> connection\n");
>
>What exactly will you do with this knowledge? Anyway you are not
>interested in return values of qed_ll2_terminate_connection function in
>this place and other places too.
>
>Why don't you handle EAGAIN returned from the qed_ll2_terminate_connection()?
>
>Thanks
Thanks for pointing this out, you're right we could have ignored the
return code, as there's not much we can do at this point if it failed.
But I still feel failures are worth knowing about, and could help in
analysis if they unexpectedly lead to another issue.
As for EAGAIN, it is very unlikely that we'll get this return code. Will
consider adding generic handling for this as a separate patch, as this
currently isn't handled in any of the ll2 flows.
thanks,

[PATCH net-next] cxgb4: add new T5 pci device id's

2017-09-19 Thread Ganesh Goudar
Add 0x50a5, 0x50a6, 0x50a7, 0x50a8 and 0x50a9 T5 device
id's.

Signed-off-by: Ganesh Goudar 
---
 drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h | 5 +
 1 file changed, 5 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h 
b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
index aa28299..37d90d6 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
@@ -176,6 +176,11 @@ CH_PCI_DEVICE_ID_TABLE_DEFINE_BEGIN
CH_PCI_ID_TABLE_FENTRY(0x50a2), /* Custom T540-KR4 */
CH_PCI_ID_TABLE_FENTRY(0x50a3), /* Custom T580-KR4 */
CH_PCI_ID_TABLE_FENTRY(0x50a4), /* Custom 2x T540-CR */
+   CH_PCI_ID_TABLE_FENTRY(0x50a5), /* Custom T522-BT */
+   CH_PCI_ID_TABLE_FENTRY(0x50a6), /* Custom T522-BT-SO */
+   CH_PCI_ID_TABLE_FENTRY(0x50a7), /* Custom T580-CR */
+   CH_PCI_ID_TABLE_FENTRY(0x50a8), /* Custom T580-KR */
+   CH_PCI_ID_TABLE_FENTRY(0x50a9), /* Custom T580-KR */
 
/* T6 adapters:
 */
-- 
2.1.0



Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers

2017-09-19 Thread Richard Cochran
On Tue, Sep 19, 2017 at 05:19:18PM -0700, Vinicius Costa Gomes wrote:
> (I think LaunchTime is something specific to the i210, right?)

Levi just told us:

   Recent SoCs from NXP (the i.MX 6 SoloX, and all the i.MX 7 and 8
   parts) support Qav shaping as well as scheduled launch
   functionality;

Thanks,
Richard


Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers

2017-09-19 Thread Richard Cochran
On Tue, Sep 19, 2017 at 07:59:11PM -0600, levipear...@gmail.com wrote:
> If some endpoint device shows up with direct Qbv support, this interface would
> probably work well there too, although a talker would need to be able to
> schedule its transmits pretty precisely to achieve the lowest possible 
> latency.

This is an argument for SO_TXTIME.

> One concern here is calling the base-time parameter an interval; it's really
> an absolute time with respect to the PTP timescale. Good documentation will
> be important to this one, since the specification discusses some subtleties
> regarding the impact of different time values chosen here.
> 
> The format for specifying the actual intervals such as cycle-time could prove
> to be an important detail as well; Qbv specifies cycle-time as a ratio of two
> integers expressed in seconds, while extension-time is specified as an integer
> number of nanoseconds.
> 
> Precision with the cycle-time is especially important, since base-time can be
> almost arbitrarily far in the past or future, and any given cycle start should
> be calculable from the base-time plus/minus some integer multiple of cycle-
> time.

The above three points also.

Thanks,
Richard


Re: TSN Scorecard, was Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers

2017-09-19 Thread Richard Cochran
On Tue, Sep 19, 2017 at 11:17:54PM -0600, levipear...@gmail.com wrote:
> In addition to OpenAvnu, Renesas has a number of github repositories
> with what looks like a fairly complete media streaming system:

Is it a generic stack or a set of hacks for their HW?

> Although your SO_TXTIME proposal could certainly form the basis of an
> endpoint's implementation of Qbv, I think it is a stretch to consider it
> a Qbv implementation in itself, if that's what you mean by this.

No, that is not what I meant.  We need some minimal additional kernel
support in order to fully implement the TSN family of standards.  Of
course, the bulk will have to be done in user space.  It would be a
mistake to cram the stuff that belongs in userland into the kernel.

Looking at the table, and reading your descriptions of the state of
OpenAVB, I remained convinced that the kernel needs only three
additions:

1. SO_TXTIME
2. CBS Qdisc
3. ALSA support for DAC clock control (but that is another story)

> The proper interfaces for the Qbv configuration and managing of
> switch-level PTP timestamps are not yet in place, so there's nothing
> even at RFC stage to present yet, but Qbv-capable Linux-managed switch
> hardware is available and we hope to get some reusable code published
> even if it's not yet ready to be integrated in the kernel.

Right, configuring Qbv in an attached DSA switch needs its own
interface.

Regarding PHC support for DSA switches, I have something in the works
to be published soon.

> A bit of progress has been made since that was written, although it is
> true that it's still not quite complete and certainly not turnkey.

So OpenAVB is neither complete nor turnkey.  That was my impression,
too.

> Things are maybe a bit farther along than they seemed, but there is
> still important kernel work to be done to reduce the need for
> out-of-tree drivers and to get everyone on the same interfaces. I plan
> to be an active participant going forward.

You mentioned a couple of different kernel things you implemented.
I would encourage you to post the work already done.

Thanks,
Richard


Re: [PATCH net-next 3/4] qed: Fix maximum number of CQs for iWARP

2017-09-19 Thread Kalderon, Michal
From: Leon Romanovsky 
Sent: Tuesday, September 19, 2017 8:46 PM
On Tue, Sep 19, 2017 at 08:26:18PM +0300, Michal Kalderon wrote:
>> The maximum number of CQs supported is bound to the number
>> of connections supported, which differs between RoCE and iWARP.
>>
>> This fixes a crash that occurred in iWARP when running 1000 sessions
>> using perftest.
>>
>> Signed-off-by: Michal Kalderon 
>> Signed-off-by: Ariel Elior 
>> ---
>
>It is worth to add Fixes line.
>
>Thanks
The original code was there before we had iWARP support, so this doesn't
exactly fix an older commit, but fixes iWARP code in general.



Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers

2017-09-19 Thread Richard Cochran
On Tue, Sep 19, 2017 at 05:19:18PM -0700, Vinicius Costa Gomes wrote:
> One of the problems with OpenAVNU is that it's too coupled with the i210
> NIC. One of the things we want is to decouple OpenAVNU from the
> controller.

Yes, I want that, too.

> The way we thought best was to propose interfaces (that
> would work along side to the Linux networking stack) as close as
> possible to what the current standards define, that means the IEEE
> 802.1Q family of specifications, in the hope that network controller
> vendors would also look at the specifications when designing their
> controllers.

These standards define the *behavior*, not the programming APIs.  Our
task as kernel developers is to invent the best interfaces for
supporting 802.1Q and other standards, the hardware capabilities, and
the widest range of applications (not just AVB).

> Our objective with the Qdiscs we are proposing (both cbs and taprio) is
> to provide a sane way to configure controllers that support TSN features
> (we were looking specifically at the IEEE specs).

I can see how your proposed Qdiscs are inspired by the IEEE standards.
However, in the case of time based transmission, I think there is a
better way to do it, namely with SO_TXTIME (which BTW was originally
proposed by Eric Mann).
 
> After we have some rough consensus on the interfaces to use, then we can
> start working on OpenAVNU.

Did you see my table in the other mail?  Any comments?

> (Sorry if I am being annoying here, but the idea of an opaque schedule
> is not ours, that comes from the people who wrote the Qbv specification)

The schedule is easy to implement using SO_TXTIME.
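For instance, a talker could derive each frame's launch time from the Qbv
schedule and attach it as ancillary data on a sendmsg() call. A sketch
follows; the SCM_TXTIME name and the u64 nanosecond payload are taken
from the SO_TXTIME RFC discussion and may change:

uint64_t txtime = base_time_ns + n * cycle_time_ns; /* from the schedule */
char cbuf[CMSG_SPACE(sizeof(txtime))] = {0};
struct msghdr msg = {0};        /* msg_name/msg_iov set up elsewhere */
struct cmsghdr *cm;

msg.msg_control = cbuf;
msg.msg_controllen = sizeof(cbuf);
cm = CMSG_FIRSTHDR(&msg);
cm->cmsg_level = SOL_SOCKET;
cm->cmsg_type = SCM_TXTIME;
cm->cmsg_len = CMSG_LEN(sizeof(txtime));
memcpy(CMSG_DATA(cm), &txtime, sizeof(txtime));
sendmsg(fd, &msg, 0);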
 
> I have a question, what about a controller that doesn't provide a way to
> set a per-packet transmission time, but it supports Qbv/Qbu. What would
> be your proposal to configure it?

SO_TXTIME will have a generic SW fallback.

BTW, regarding the i210, there is no sensible way to configure both
CBS and time based transmission at the same time.  The card performs a
logical AND to make the launch decision.  The effect of this is that
each and every packet needs a LaunchTime, and the driver would be
forced to guess the time for a packet before entering it into its
queue.

So if we end up merging CBS and SO_TXTIME, then we'll have to make
them exclusive of each other (in the case of the i210) and manage the
i210 queue configurations correctly.

> (I think LaunchTime is something specific to the i210, right?)

To my knowledge yes.  However, if TSN does take hold, then other MAC
vendors will copy it.

Thanks,
Richard


TSN Scorecard, was Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers

2017-09-19 Thread levipearson
On Mon, Sep 18, 2017, Richard Cochran wrote:
> Just for the record, here is my score card showing the current status
> of TSN support in Linux.  Comments and corrections are most welcome.
> 
> Thanks,
> Richard
> 
> 
> | FEATURE                                        | STANDARD            | STATUS                       |
> |------------------------------------------------+---------------------+------------------------------|
> | Synchronization                                | 802.1AS-2011        | Implemented in               |
> |                                                |                     | - Linux kernel PHC subsystem |
> |                                                |                     | - linuxptp (userspace)       |
> |------------------------------------------------+---------------------+------------------------------|

An alternate implementation of the userspace portion of gPTP is also
available at [1]

> | Forwarding and Queuing Enhancements            | 802.1Q-2014 sec. 34 | RFC posted (this thread)     |
> | for Time-Sensitive Streams (FQTSS)             |                     |                              |
> |------------------------------------------------+---------------------+------------------------------|
> | Stream Reservation Protocol (SRP)              | 802.1Q-2014 sec. 35 | in Open-AVB [1]              |
> |------------------------------------------------+---------------------+------------------------------|
> | Audio Video Transport Protocol (AVTP)          | IEEE 1722-2011      | DNE                          |
> |------------------------------------------------+---------------------+------------------------------|
> | Audio/Video Device Discovery, Enumeration,     | IEEE 1722.1-2013    | jdksavdecc-c [2]             |
> | Connection Management and Control (AVDECC)     |                     |                              |
> | AVDECC Connection Management Protocol (ACMP)   |                     |                              |
> | AVDECC Enumeration and Control Protocol (AECP) |                     |                              |
> | MAC Address Acquisition Protocol (MAAP)        |                     | in Open-AVB                  |
> |------------------------------------------------+---------------------+------------------------------|

All of the above are available to some degree in the AVTP Pipeline part
of [1], specifically at this location:
https://github.com/AVnu/OpenAvnu/tree/master/lib/avtp_pipeline

The code is very modular and configurable, although some parts are in
better shape than others. The AVTP portion can use the custom userspace
driver for the i210, which can be configured to use launch scheduling, or
it can use standard kernel interfaces via sendmsg or PACKET_MMAP. It runs
as-is when configured for standard interfaces with any network hardware
that supports gPTP. I previously implemented a CMSG-based launch time
scheduling mechanism like the one you have proposed, and I have a socket
backend for it that could easily be ported to your proposal. It is not
part of the repository yet since there's no kernel support for it outside
of my prototype and your RFC.

It is currently tied to the OpenAvnu gPTP daemon rather than linuxptp, as
it uses a shared memory interface to get the current rate-ratio and
offset information between the various clocks. There may be better ways
to do this, but that's how the initial port of the codebase was done. It
would be nice to get it working with linuxptp's userspace tools at some
point as well, though.

The libraries under avtp_pipeline are designed to be used separately, but
a simple integrated application is provided and is built by the CI system.

In addition to OpenAvnu, Renesas has a number of github repositories with
what looks like a fairly complete media streaming system:

https://github.com/renesas-rcar/avb-mse
https://github.com/renesas-rcar/avb-streaming
https://github.com/renesas-rcar/avb-applications

I haven't examined them in great detail yet, though.


> | Frame Preemption                               | P802.1Qbu           | DNE                          |
> | Scheduled Traffic                              | P802.1Qbv           | RFC posted (SO_TXTIME)       |
> | SRP Enhancements and Performance Improvements  | P802.1Qcc           | DNE                          |
> 
> DNE = Does Not Exist (to my knowledge)

Although your SO_TXTIME proposal could certainly form the basis of an
endpoint's implementation of Qbv, I think it is a stretch to consider it
a Qbv implementation in itself, if that's what you mean by this.

I have been working with colleagues on some experiments relating to a
Linux-controlled DSA switch (a Marvell Topaz) that are a part of this
effort in TSN:

http://ieee802.org/1/files/public/docs2017/tsn-cgunther-802-3cg-multidrop-0917-v01.pdf

The proper interfaces for the Qbv configurat

Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-19 Thread Eric Dumazet
On Tue, 2017-09-19 at 21:59 -0700, Samudrala, Sridhar wrote:
> On 9/19/2017 5:48 PM, Tom Herbert wrote:
> > On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar
> >  wrote:
> > > On 9/12/2017 3:53 PM, Tom Herbert wrote:
> > > > On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
> > > >  wrote:
> > > > > 
> > > > > On 9/12/2017 8:47 AM, Eric Dumazet wrote:
> > > > > > On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
> > > > > > > On 9/11/2017 8:53 PM, Eric Dumazet wrote:
> > > > > > > > On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
> > > > > > > > 
> > > > > > > > > Two ints in sock_common for this purpose is quite expensive 
> > > > > > > > > and the
> > > > > > > > > use case for this is limited-- even if a RX->TX queue mapping 
> > > > > > > > > were
> > > > > > > > > introduced to eliminate the queue pair assumption this still 
> > > > > > > > > won't
> > > > > > > > > help if the receive and transmit interfaces are different for 
> > > > > > > > > the
> > > > > > > > > connection. I think we really need to see some very compelling
> > > > > > > > > results
> > > > > > > > > to be able to justify this.
> > > > > > > Will try to collect and post some perf data with symmetric queue
> > > > > > > configuration.
> > > 
> > > Here is some performance data I collected with memcached workload over
> > > ixgbe 10Gb NIC with mcblaster benchmark.
> > > ixgbe is configured with 16 queues and rx-usecs is set to 1000 for a
> > > very low interrupt rate.
> > >   ethtool -L p1p1 combined 16
> > >   ethtool -C p1p1 rx-usecs 1000
> > > and busy poll is set to 1000usecs
> > >   sysctl net.core.busy_poll = 1000
> > > 
> > > 16 threads  800K requests/sec
> > > =================================================================
> > >                    rtt(min/avg/max) usecs   intr/sec   contextswitch/sec
> > > -----------------------------------------------------------------
> > > Default            2/182/10641              23391      61163
> > > Symmetric Queues   2/50/6311                20457      32843
> > > 
> > > 32 threads  800K requests/sec
> > > =================================================================
> > >                    rtt(min/avg/max) usecs   intr/sec   contextswitch/sec
> > > -----------------------------------------------------------------
> > > Default            2/162/6390               32168      69450
> > > Symmetric Queues   2/50/3853                35044      35847
> > > 
> > No idea what "Default" configuration is. Please report how xps_cpus is
> > being set, how many RSS queues there are, and what the mapping is
> > between RSS queues and CPUs and shared caches. Also, whether and
> > threads are pinned.
> Default is linux 4.13 with the settings i listed above.
> ethtool -L p1p1 combined 16
> ethtool -C p1p1 rx-usecs 1000
> sysctl net.core.busy_poll = 1000
> 
> # ethtool -x p1p1
> RX flow hash indirection table for p1p1 with 16 RX ring(s):
>     0:      0     1     2     3     4     5     6     7
>     8:      8     9    10    11    12    13    14    15
>    16:      0     1     2     3     4     5     6     7
>    24:      8     9    10    11    12    13    14    15
>    32:      0     1     2     3     4     5     6     7
>    40:      8     9    10    11    12    13    14    15
>    48:      0     1     2     3     4     5     6     7
>    56:      8     9    10    11    12    13    14    15
>    64:      0     1     2     3     4     5     6     7
>    72:      8     9    10    11    12    13    14    15
>    80:      0     1     2     3     4     5     6     7
>    88:      8     9    10    11    12    13    14    15
>    96:      0     1     2     3     4     5     6     7
>   104:      8     9    10    11    12    13    14    15
>   112:      0     1     2     3     4     5     6     7
>   120:      8     9    10    11    12    13    14    15
> 
> smp_affinity for the 16 queuepairs
> 141 p1p1-TxRx-0 ,0001
> 142 p1p1-TxRx-1 ,0002
> 143 p1p1-TxRx-2 ,0004
> 144 p1p1-TxRx-3 ,0008
> 145 p1p1-TxRx-4 ,0010
> 146 p1p1-TxRx-5 ,0020
> 147 p1p1-TxRx-6 ,0040
> 148 p1p1-TxRx-7 ,0080
> 149 p1p1-TxRx-8 ,0100
> 150 p1p1-TxRx-9 ,0200
> 151 p1p1-TxRx-10 ,0400
> 152 p1p1-TxRx-11 ,0800
> 153 p1p1-TxRx-12 ,1000
> 154 p1p1-TxRx-13 ,2000
> 155 p1p1-TxRx-14 ,4000
> 156 p1p1-TxRx-15 ,8000
> xps_cpus for the 16 Tx queues
> ,0001
> ,0002
> ,0004
> ,0008
> ,0010
> ,0020
> ,0040
> ,0080
> ,0100
> ,0200
> ,0400
> ,0800
> ,1000
> ,2000
> ,4000
> ,8000
> memcached threads are not pinned.
> 

...

I urge you to take the time 

Re: [PATCH,net-next,0/2] Improve code coverage of syzkaller

2017-09-19 Thread Petar Penkov
On Tue, Sep 19, 2017 at 4:01 PM, David Miller  wrote:
> From: Petar Penkov 
> Date: Tue, 19 Sep 2017 00:34:00 -0700
>
>> The following patches address this by providing the user(syzkaller)
>> with the ability to send via napi_gro_receive() and napi_gro_frags().
>> Additionally, syzkaller can specify how many fragments there are and
>> how much data per fragment there is. This is done by exploiting the
>> convenient structure of iovecs. Finally, this patch series adds
>> support for exercising the flow dissector during fuzzing.
>>
>> The code path including napi_gro_receive() can be enabled via the
>> CONFIG_TUN_NAPI compile-time flag, and can be used by users other than
>> syzkaller. The remainder of the changes in this patch series give the
>> user significantly more control over packets entering the kernel. To
>> avoid potential security vulnerabilities, hide the ability to send
>> custom skbs and the flow dissector code paths behind a run-time flag
>> IFF_NAPI_FRAGS that is advertised and accepted only if CONFIG_TUN_NAPI
>> is enabled.
>>
>> The patch series will be followed with changes to packetdrill, where
>> these additions to the TUN driver are exercised and demonstrated.
>> This will give the ability to write regression tests for specific
>> parts of the early networking stack.
>>
>> Patch 1/ Add NAPI struct per receive queue, enable NAPI, and use
>>napi_gro_receive()
>> Patch 2/ Use NAPI skb and napi_gro_frags(), exercise flow
>>dissector, and allow custom skbs.
>
> I'm happy with everything except the TUN_NAPI Kconfig knob
> requirement.
>
> Rebuilding something just to test things isn't going to fly very well.
>
> Please make it secure somehow, enable this stuff by default.
>
> Thanks.

Without a compile-time option, the TUN/TAP driver will have a code path
that allows user control over kernel memory allocation, and specifically
over the SKBs that enter the kernel. That path might be hard to exploit
as it requires some user privileges, but it does exist and increases the
attack surface of the kernel. While the flag certainly inconveniences
testing, I think the layer of security it adds outweighs its
disadvantages.

Furthermore, in a way testing already requires specific kernel
configuration. In this particular example, syzkaller prefers synchronous
operation and therefore needs 4KSTACKS disabled. Other features that
require rebuilding are KASAN and dbx. From this point of view, I still
think that having the TUN_NAPI flag has value.


[PATCH net] net/ncsi: Don't assume last available channel exists

2017-09-19 Thread Samuel Mendoza-Jonas
When handling new VLAN tags in NCSI we check the maximum allowed number
of filters on the last active ("hot") channel. However if the 'add'
callback is called before NCSI has configured a channel, this causes a
NULL dereference.

Check that we actually have a hot channel, and warn if it is missing.

Signed-off-by: Samuel Mendoza-Jonas 
---
 net/ncsi/ncsi-manage.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 3fd3c39e6278..fc800f934beb 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -1420,7 +1420,10 @@ int ncsi_vlan_rx_add_vid(struct net_device *dev, __be16 proto, u16 vid)
}
 
ndp = TO_NCSI_DEV_PRIV(nd);
-   ncf = ndp->hot_channel->filters[NCSI_FILTER_VLAN];
+   if (!ndp) {
+   netdev_warn(dev, "ncsi: No ncsi_dev_priv?\n");
+   return 0;
+   }
 
/* Add the VLAN id to our internal list */
list_for_each_entry_rcu(vlan, &ndp->vlan_vids, list) {
@@ -1432,11 +1435,17 @@ int ncsi_vlan_rx_add_vid(struct net_device *dev, __be16 proto, u16 vid)
}
}
 
-   if (n_vids >= ncf->total) {
-   netdev_info(dev,
-   "NCSI Channel supports up to %u VLAN tags but %u 
are already set\n",
-   ncf->total, n_vids);
-   return -EINVAL;
+   if (!ndp->hot_channel) {
+   netdev_warn(dev,
+   "ncsi: no available filter to check maximum\n");
+   } else {
+   ncf = ndp->hot_channel->filters[NCSI_FILTER_VLAN];
+   if (n_vids >= ncf->total) {
+   netdev_info(dev,
+   "NCSI Channel supports up to %u VLAN tags 
but %u are already set\n",
+   ncf->total, n_vids);
+   return -EINVAL;
+   }
}
 
vlan = kzalloc(sizeof(*vlan), GFP_KERNEL);
-- 
2.14.1



Re: Latest net-next from GIT panic

2017-09-19 Thread Eric Dumazet
On Wed, 2017-09-20 at 02:06 +0200, Paweł Staszewski wrote:
> Just checked kernel 4.13.2 and same problem
> 
> Just after starting all 6 bgp sessions - when the kernel starts to learn
> routes - it panics.
> 
> https://bugzilla.kernel.org/attachment.cgi?id=258509
> 


Unfortunately we have not enough information from these traces.

Can you get a full stack trace ?

Alternatively, can you bisect ?

Thanks.




Re: [PATCH,net-next,2/2] tun: enable napi_gro_frags() for TUN/TAP driver

2017-09-19 Thread Eric Dumazet
On Tue, 2017-09-19 at 00:34 -0700, Petar Penkov wrote:
> Add a TUN/TAP receive mode that exercises the napi_gro_frags()
> interface. This mode is available only in TAP mode, as the interface
> expects packets with Ethernet headers.
> 
> Furthermore, packets follow the layout of the iovec_iter that was
> received. The first iovec is the linear data, and every one after the
> first is a fragment. If there are more fragments than the max number,
> drop the packet. Additionally, invoke eth_get_headlen() to exercise flow
> dissector code and to verify that the header resides in the linear data.
> 
> The napi_gro_frags() mode requires setting the IFF_NAPI_FRAGS option.
> This is imposed because this mode is intended for testing via tools like
> syzkaller and packetdrill, and the increased flexibility it provides can
> introduce security vulnerabilities.
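
(For testers, enabling the mode is a one-flag change to the usual tap
setup; a sketch, where IFF_NAPI_FRAGS is the flag added by this series:

struct ifreq ifr = {0};
int fd = open("/dev/net/tun", O_RDWR);

/* IFF_NAPI_FRAGS is only accepted in TAP mode, and only when the
 * kernel was built with CONFIG_TUN_NAPI */
ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_NAPI_FRAGS;
strncpy(ifr.ifr_name, "tap0", IFNAMSIZ - 1);
if (ioctl(fd, TUNSETIFF, &ifr) < 0)
        perror("TUNSETIFF");
)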
> 
> Signed-off-by: Petar Penkov 
> Cc: Eric Dumazet 
> Cc: Mahesh Bandewar 
> Cc: Willem de Bruijn 
> Cc: da...@davemloft.net
> Cc: ppen...@stanford.edu
> ---

Again, very nice, thanks a lot Petar.

Acked-by: Eric Dumazet 




Re: [PATCH,net-next,1/2] tun: enable NAPI for TUN/TAP driver

2017-09-19 Thread Eric Dumazet
On Tue, 2017-09-19 at 00:34 -0700, Petar Penkov wrote:
> Changes TUN driver to use napi_gro_receive() upon receiving packets
> rather than netif_rx_ni(). Adds flag CONFIG_TUN_NAPI that enables
> these changes and operation is not affected if the flag is disabled.
> SKBs are constructed upon packet arrival and are queued to be
> processed later.
> 

> 
> Signed-off-by: Petar Penkov 
> Cc: Eric Dumazet 
> Cc: Mahesh Bandewar 
> Cc: Willem de Bruijn 
> Cc: da...@davemloft.net
> Cc: ppen...@stanford.edu
> ---

Very nice, thanks a lot Petar.

Acked-by: Eric Dumazet 




Re: [PATCH v5 05/10] dt-bindings: net: dwmac-sun8i: update documentation about integrated PHY

2017-09-19 Thread Rob Herring
On Thu, Sep 14, 2017 at 2:19 PM, Andrew Lunn  wrote:
>> > Is the MDIO controller "allwinner,sun8i-h3-emac" or "snps,dwmac-mdio"?
>> > If the latter, then I think the node is fine, but then the mux should be
>> > a child node of it. IOW, the child of an MDIO controller should either
>> > be a mux node or slave devices.
>
> Hi Rob
>
> Up until now, children of an MDIO bus have been MDIO devices. Those
> MDIO devices are either Ethernet PHYs, Ethernet Switches, or the
> oddball devices that Broadcom iProc has, like generic PHYs.
>
> We have never had MDIO-muxes as MDIO children. A Mux is not an MDIO
> device, and does not have the properties of an MDIO device. It is not
> addressable on the MDIO bus. The current MUXes are addressed via GPIOs
> or MMIO.

The DT parent/child relationship defines the bus topology. We describe
MDIO buses in that way and if a mux is sitting between the controller
and the devices, then the DT hierarchy should reflect that. Now
sometimes we have 2 options for what interface has the parent/child
relationship (e.g. an I2C controlled USB hub chip), but in this case
we don't.

> There other similar cases. i2c-mux-gpio is not a child of an i2c bus,
> nor i2c-mux-reg or gpio-mux. nxp,pca9548 is however a child of the i2c
> bus, because it is an i2c device itself...

Some are i2c controlled mux devices, but some can be GPIO controlled.

>
> If the MDIO mux was an MDIO device, I would agree with you. But it is
> not, so let's not make it a child.
>
> Andrew


RE: [PATCH RFC v1 0/3] Support for tap user-space access with veth interfaces

2017-09-19 Thread Grandhi, Sainath
Just a reminder for feedback.

> -Original Message-
> From: Grandhi, Sainath
> Sent: Wednesday, September 06, 2017 5:34 PM
> To: netdev@vger.kernel.org
> Cc: da...@davemloft.net; Grandhi, Sainath 
> Subject: [PATCH RFC v1 0/3] Support for tap user-space access with veth
> interfaces
> 
> From: Sainath Grandhi 
> 
> This patchset adds a tap device driver for the veth virtual network
> interface. With this implementation, a tap character interface can be
> added only to the peer veth interface. Adding a tap interface to veth is
> for use cases that forward packets between host and VMs. This eliminates
> the need for an additional software bridge. This can be extended to
> create both of the peer interfaces as tap interfaces. These patches are
> a step in that direction.
> 
> Sainath Grandhi (3):
>   net: Adding API to parse IFLA_LINKINFO attribute
>   net: Abstracting out common routines from veth for use by vethtap
>   vethtap: veth based tap driver
> 
>  drivers/net/Kconfig |   1 +
>  drivers/net/Makefile|   2 +
>  drivers/net/{veth.c => veth_main.c} |  80 ++---
>  drivers/net/vethtap.c   | 216 
> 
>  include/linux/if_veth.h |  13 +++
>  include/net/rtnetlink.h |   3 +
>  net/core/rtnetlink.c|   8 ++
>  7 files changed, 308 insertions(+), 15 deletions(-)
>  rename drivers/net/{veth.c => veth_main.c} (89%)
>  create mode 100644 drivers/net/vethtap.c
>  create mode 100644 include/linux/if_veth.h
> 
> --
> 2.7.4



RE: [PATCH v4 2/3] net: fec: remove unused interrupt FEC_ENET_TS_TIMER

2017-09-19 Thread Andy Duan
From: Troy Kisky  Sent: Wednesday, September 20, 2017 8:33 AM
>FEC_ENET_TS_TIMER is not checked in the interrupt routine so there is no
>need to enable it.
>
>Signed-off-by: Troy Kisky 
>Acked-by: Fugang Duan 
>
>---
>v4: Added Acked-by
>
>Signed-off-by: Troy Kisky 
>---

Troy, thank you for submitting this version.
The patch series has already been acked by me.

Thanks again.

> drivers/net/ethernet/freescale/fec.h | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/ethernet/freescale/fec.h
>b/drivers/net/ethernet/freescale/fec.h
>index 38c7b21..ede1876 100644
>--- a/drivers/net/ethernet/freescale/fec.h
>+++ b/drivers/net/ethernet/freescale/fec.h
>@@ -374,8 +374,8 @@ struct bufdesc_ex {
> #define FEC_ENET_TS_AVAIL   ((uint)0x0001)
> #define FEC_ENET_TS_TIMER   ((uint)0x8000)
>
>-#define FEC_DEFAULT_IMASK (FEC_ENET_TXF | FEC_ENET_RXF | FEC_ENET_MII | FEC_ENET_TS_TIMER)
>-#define FEC_NAPI_IMASK (FEC_ENET_MII | FEC_ENET_TS_TIMER)
>+#define FEC_DEFAULT_IMASK (FEC_ENET_TXF | FEC_ENET_RXF | FEC_ENET_MII)
>+#define FEC_NAPI_IMASK FEC_ENET_MII
> #define FEC_RX_DISABLED_IMASK (FEC_DEFAULT_IMASK & (~FEC_ENET_RXF))
>
> /* ENET interrupt coalescing macro define */
>--
>2.7.4


Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers

2017-09-19 Thread levipearson

On Thu, Aug 31, 2017 at 06:26:20PM -0700, Vinicius Costa Gomes wrote:
> Hi,
> 
> This patchset is an RFC on a proposal of how the Traffic Control subsystem can
> be used to offload the configuration of traffic shapers into network devices
> that provide support for them in HW. Our goal here is to start upstreaming
> support for features related to the Time-Sensitive Networking (TSN) set of
> standards into the kernel.

I'm very excited to see these features moving into the kernel! I am one of the
maintainers of the OpenAvnu project and I've been involved in building AVB/TSN
systems and working on the standards for around 10 years, so the support that's
been slowly making it into more silicon and now Linux drivers is very
encouraging.

My team at Harman is working on endpoint code based on what's in the OpenAvnu
project and a few Linux-based platforms. The Qav interface you've proposed will
fit nicely with our traffic shaper management daemon, which already uses mqprio
as a base but uses the htb shaper to approximate the Qav credit-based shaper on
platforms where launch time scheduling isn't available.
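
For context, the parameters such a daemon manages come from the
credit-based shaper math in 802.1Q-2014 Annex L. A sketch of the
arithmetic (rates in bits/s, sizes in bits; an illustration, not
OpenAvnu code):

/* 802.1Q-2014 Annex L credit parameters for one traffic class. */
struct cbs_params {
        long long idleslope;    /* bandwidth reserved for the class */
        long long sendslope;    /* slope while transmitting (negative) */
        long long hicredit;
        long long locredit;
};

static void compute_cbs(struct cbs_params *p, long long idleslope,
                        long long port_rate, long long max_frame,
                        long long max_interference)
{
        p->idleslope = idleslope;
        p->sendslope = idleslope - port_rate;
        p->hicredit  = max_interference * idleslope / port_rate;
        p->locredit  = max_frame * p->sendslope / port_rate;
}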

I've applied your patches and plan on testing them in conjunction with our
shaper manager to see if we run into any hitches, but I don't expect any
problems.

> As part of this work, we've assessed previous public discussions related
> to TSN enabling: patches from Henrik Austad (Cisco), the presentation
> from Eric Mann at Linux Plumbers 2012, patches from Gangfeng Huang
> (National Instruments) and the current state of the OpenAVNU project
> (https://github.com/AVnu/OpenAvnu/).
> 
> Please note that the patches provided as part of this RFC are
> implementing what is needed only for 802.1Qav (FQTSS), but we'd like to
> take advantage of this discussion and share our WIP ideas for the
> 802.1Qbv and 802.1Qbu interfaces as well. The current patches are only
> providing support for HW offload of the configs.
> 
> 
> Overview
> 
> 
> Time-sensitive Networking (TSN) is a set of standards that aim to
> address resource availability for providing bandwidth reservation and
> bounded latency on Ethernet based LANs. The proposal described here aims
> to cover mainly what is needed to enable the following standards:
> 802.1Qat, 802.1Qav, 802.1Qbv and 802.1Qbu.
> 
> The initial target of this work is the Intel i210 NIC, but other controllers'
> datasheet were also taken into account, like the Renesas RZ/A1H RZ/A1M group 
> and
> the Synopsis DesignWare Ethernet QoS controller.

Recent SoCs from NXP (the i.MX 6 SoloX, and all the i.MX 7 and 8 parts) support
Qav shaping as well as scheduled launch functionality; these are the parts I 
have been mostly working with. Marvell silicon (some subset of Armada processors
and Link Street DSA switches) generally supports traffic shaping as well.

I think a lack of an interface like this has probably slowed upstream driver
support for this functionality where it exists; most vendors have an out-of-
tree version of their driver with TSN functionality enabled via non-standard
interfaces. Hopefully making it available will encourage vendors to upstream
their driver support!

> Proposal
> 
> 
> Feature-wise, what is covered here are configuration interfaces for HW
> implementations of the Credit-Based shaper (CBS, 802.1Qav), Time-Aware shaper
> (802.1Qbv) and Frame Preemption (802.1Qbu). CBS is a per-queue shaper, while
> Qbv and Qbu must be configured per port, with the configuration covering all
> queues. Given that these features are related to traffic shaping, and that the
> traffic control subsystem already provides a queueing discipline that offloads
> config into the device driver (i.e. mqprio), designing new qdiscs for the
> specific purpose of offloading the config for each shaper seemed like a good
> fit.

This makes sense to me too. The 802.1Q standards are all based on the sort of
mappings between priority, traffic class, and hardware queues that the existing
tc infrastructure seems to be modeling. I believe the mqprio module's mapping
scheme is flexible enough to meet any TSN needs in conjunction with the other
parts of the kernel qdisc system.
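
For concreteness, a minimal sketch of the kind of mapping I mean (interface
name, queue layout and priority values are hypothetical): an mqprio root
maps priorities to traffic classes and classes to queue ranges,

$ tc qdisc add dev eth0 root handle 100: mqprio num_tc 3 \
  map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

and an application opts into a class with SO_PRIORITY:

	int prio = 3;	/* prio 3 -> TC 0 -> TX queue 0 with the map above */
	setsockopt(fd, SOL_SOCKET, SO_PRIORITY, &prio, sizeof(prio));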

> For steering traffic into the correct queues, we use the socket option
> SO_PRIORITY and then a mechanism to map priority to traffic classes /
> Tx queues. The qdisc mqprio is currently used in our tests.
> 
> As for the shapers config interface:
> 
>  * CBS (802.1Qav)
> 
>This patchset is proposing a new qdisc called 'cbs'. Its 'tc' cmd line is:
>$ tc qdisc add dev IFACE parent ID cbs locredit N hicredit M sendslope S \
>  idleslope I
> 
>Note that the parameters for this qdisc are the ones defined by the
>802.1Q-2014 spec, so no hardware specific functionality is exposed here.

These parameters look good to me as a baseline; some additional optional
parameters may be useful for software-based implementations--such as setting an
interval at which to recalculate queues--but 

[PATCH] isdn/i4l: fetch the ppp_write buffer in one shot

2017-09-19 Thread Meng Xu
In isdn_ppp_write(), the header (i.e., protobuf) of the buffer is
fetched twice from userspace. The first fetch is used to peek at the
protocol of the message and reset the huptimer if necessary, while the
second fetch copies in the whole buffer. However, given that buf resides
in userspace memory, a user process can race to change its memory content
across fetches. By doing so, we can either avoid resetting the huptimer
for any type of packet (by first setting proto to PPP_LCP and later
changing it to the actual type) or force resetting the huptimer for LCP
packets.
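
As a minimal illustration of the race (a hypothetical pthread racer, not
part of the patch): a second thread flips the PPP protocol field of the
write() buffer between the kernel's two copy_from_user() calls. Note that
PPP_PROTOCOL() reads bytes 2 and 3 of the header.

	/* flip the protocol field while isdn_ppp_write() runs */
	static void *flip_proto(void *arg)
	{
		volatile unsigned char *p = arg;	/* buf passed to write() */

		for (;;) {
			p[2] = 0xc0; p[3] = 0x21;	/* PPP_LCP */
			p[2] = 0x00; p[3] = 0x21;	/* PPP_IP  */
		}
		return NULL;
	}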

This patch changes this double-fetch behavior into two single fetches
decided by the condition (lp->isdn_device < 0 || lp->isdn_channel < 0).
A more detailed discussion can be found at
https://marc.info/?l=linux-kernel&m=150586376926123&w=2

Signed-off-by: Meng Xu 
---
 drivers/isdn/i4l/isdn_ppp.c | 37 +
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/isdn/i4l/isdn_ppp.c b/drivers/isdn/i4l/isdn_ppp.c
index 6c44609..cd2b3c6 100644
--- a/drivers/isdn/i4l/isdn_ppp.c
+++ b/drivers/isdn/i4l/isdn_ppp.c
@@ -825,7 +825,6 @@ isdn_ppp_write(int min, struct file *file, const char 
__user *buf, int count)
isdn_net_local *lp;
struct ippp_struct *is;
int proto;
-   unsigned char protobuf[4];
 
is = file->private_data;
 
@@ -839,24 +838,28 @@ isdn_ppp_write(int min, struct file *file, const char 
__user *buf, int count)
if (!lp)
printk(KERN_DEBUG "isdn_ppp_write: lp == NULL\n");
else {
-   /*
-* Don't reset huptimer for
-* LCP packets. (Echo requests).
-*/
-   if (copy_from_user(protobuf, buf, 4))
-   return -EFAULT;
-   proto = PPP_PROTOCOL(protobuf);
-   if (proto != PPP_LCP)
-   lp->huptimer = 0;
+   if (lp->isdn_device < 0 || lp->isdn_channel < 0) {
+   unsigned char protobuf[4];
+   /*
+* Don't reset huptimer for
+* LCP packets. (Echo requests).
+*/
+   if (copy_from_user(protobuf, buf, 4))
+   return -EFAULT;
+
+   proto = PPP_PROTOCOL(protobuf);
+   if (proto != PPP_LCP)
+   lp->huptimer = 0;
 
-   if (lp->isdn_device < 0 || lp->isdn_channel < 0)
return 0;
+   }
 
if ((dev->drv[lp->isdn_device]->flags & DRV_FLAG_RUNNING) &&
lp->dialstate == 0 &&
(lp->flags & ISDN_NET_CONNECTED)) {
unsigned short hl;
struct sk_buff *skb;
+   unsigned char *cpy_buf;
/*
 * we need to reserve enough space in front of
 * sk_buff. old call to dev_alloc_skb only reserved
@@ -869,11 +872,21 @@ isdn_ppp_write(int min, struct file *file, const char 
__user *buf, int count)
return count;
}
skb_reserve(skb, hl);
-   if (copy_from_user(skb_put(skb, count), buf, count))
+   cpy_buf = skb_put(skb, count);
+   if (copy_from_user(cpy_buf, buf, count))
{
kfree_skb(skb);
return -EFAULT;
}
+
+   /*
+* Don't reset huptimer for
+* LCP packets. (Echo requests).
+*/
+   proto = PPP_PROTOCOL(cpy_buf);
+   if (proto != PPP_LCP)
+   lp->huptimer = 0;
+
if (is->debug & 0x40) {
printk(KERN_DEBUG "ppp xmit: len %d\n", (int) 
skb->len);
isdn_ppp_frame_log("xmit", skb->data, skb->len, 
32, is->unit, lp->ppp_slot);
-- 
2.7.4



[PATCHv3 iproute2 2/2] lib/libnetlink: update rtnl_talk to support malloc buff at run time

2017-09-19 Thread Hangbin Liu
This is an update for 460c03f3f3cc ("iplink: double the buffer size also in
iplink_get()"). After this update, we will no longer need to double the
buffer size every time the number of VFs increases.

For calls like rtnl_talk(&rth, &req.n, NULL, 0), we can simply remove the
length parameter.

For calls like rtnl_talk(&rth, nlh, nlh, sizeof(req)), I add a new variable
'answer' to avoid overwriting data in nlh, because there may be more info
after nlh. This also avoids the issue of the nlh buffer being too small.

The caller needs to free 'answer' after use.
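
A minimal sketch of the new calling convention (shape as in this series;
the caller owns and frees the answer buffer):

	struct nlmsghdr *answer = NULL;

	if (rtnl_talk(&rth, &req.n, &answer) < 0)
		return -1;
	/* ... parse attributes from answer ... */
	free(answer);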

Signed-off-by: Hangbin Liu 
Signed-off-by: Phil Sutter 
---
 bridge/fdb.c |  2 +-
 bridge/link.c|  2 +-
 bridge/mdb.c |  2 +-
 bridge/vlan.c|  2 +-
 genl/ctrl.c  | 19 ---
 include/libnetlink.h |  6 +++---
 ip/ipaddress.c   |  4 ++--
 ip/ipaddrlabel.c |  4 ++--
 ip/ipfou.c   |  4 ++--
 ip/ipila.c   |  4 ++--
 ip/ipl2tp.c  |  8 
 ip/iplink.c  | 38 +++---
 ip/iplink_vrf.c  | 44 
 ip/ipmacsec.c|  2 +-
 ip/ipneigh.c |  2 +-
 ip/ipnetns.c | 23 ++-
 ip/ipntable.c|  2 +-
 ip/iproute.c | 26 +-
 ip/iprule.c  |  6 +++---
 ip/ipseg6.c  |  8 +---
 ip/iptoken.c |  2 +-
 ip/link_gre.c| 11 +++
 ip/link_gre6.c   | 11 +++
 ip/link_ip6tnl.c | 11 +++
 ip/link_iptnl.c  | 10 ++
 ip/link_vti.c| 11 +++
 ip/link_vti6.c   | 11 +++
 ip/tcp_metrics.c |  8 +---
 ip/xfrm_policy.c | 25 +
 ip/xfrm_state.c  | 30 --
 lib/libgenl.c|  9 +++--
 lib/libnetlink.c | 24 +++-
 misc/ss.c|  2 +-
 tc/m_action.c| 12 ++--
 tc/tc_class.c|  2 +-
 tc/tc_filter.c   |  8 +---
 tc/tc_qdisc.c|  2 +-
 37 files changed, 220 insertions(+), 177 deletions(-)

diff --git a/bridge/fdb.c b/bridge/fdb.c
index e5cebf9..807914f 100644
--- a/bridge/fdb.c
+++ b/bridge/fdb.c
@@ -535,7 +535,7 @@ static int fdb_modify(int cmd, int flags, int argc, char 
**argv)
return -1;
}
 
-   if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+   if (rtnl_talk(&rth, &req.n, NULL) < 0)
return -1;
 
return 0;
diff --git a/bridge/link.c b/bridge/link.c
index 93472ad..cc29a2a 100644
--- a/bridge/link.c
+++ b/bridge/link.c
@@ -426,7 +426,7 @@ static int brlink_modify(int argc, char **argv)
addattr_nest_end(&req.n, nest);
}
 
-   if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+   if (rtnl_talk(&rth, &req.n, NULL) < 0)
return -1;
 
return 0;
diff --git a/bridge/mdb.c b/bridge/mdb.c
index 748091b..f38e326 100644
--- a/bridge/mdb.c
+++ b/bridge/mdb.c
@@ -440,7 +440,7 @@ static int mdb_modify(int cmd, int flags, int argc, char 
**argv)
entry.vid = vid;
addattr_l(&req.n, sizeof(req), MDBA_SET_ENTRY, &entry, sizeof(entry));
 
-   if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+   if (rtnl_talk(&rth, &req.n, NULL) < 0)
return -1;
 
return 0;
diff --git a/bridge/vlan.c b/bridge/vlan.c
index ebcdace..5d68359 100644
--- a/bridge/vlan.c
+++ b/bridge/vlan.c
@@ -133,7 +133,7 @@ static int vlan_modify(int cmd, int argc, char **argv)
 
addattr_nest_end(&req.n, afspec);
 
-   if (rtnl_talk(&rth, &req.n, NULL, 0) < 0)
+   if (rtnl_talk(&rth, &req.n, NULL) < 0)
return -1;
 
return 0;
diff --git a/genl/ctrl.c b/genl/ctrl.c
index 448988e..a6d31b0 100644
--- a/genl/ctrl.c
+++ b/genl/ctrl.c
@@ -55,6 +55,7 @@ int genl_ctrl_resolve_family(const char *family)
};
struct nlmsghdr *nlh = &req.n;
struct genlmsghdr *ghdr = &req.g;
+   struct nlmsghdr *answer = NULL;
 
if (rtnl_open_byproto(&rth, 0, NETLINK_GENERIC) < 0) {
fprintf(stderr, "Cannot open generic netlink socket\n");
@@ -63,19 +64,19 @@ int genl_ctrl_resolve_family(const char *family)
 
addattr_l(nlh, 128, CTRL_ATTR_FAMILY_NAME, family, strlen(family) + 1);
 
-   if (rtnl_talk(&rth, nlh, nlh, sizeof(req)) < 0) {
+   if (rtnl_talk(&rth, nlh, &answer) < 0) {
fprintf(stderr, "Error talking to the kernel\n");
goto errout;
}
 
{
struct rtattr *tb[CTRL_ATTR_MAX + 1];
-   int len = nlh->nlmsg_len;
+   int len = answer->nlmsg_len;
struct rtattr *attrs;
 
-   if (nlh->nlmsg_type !=  GENL_ID_CTRL) {
+   if (answer->nlmsg_type !=  GENL_ID_CTRL) {
fprintf(stderr, "Not a controller message, nlmsg_len=%d 
"
-   "nlmsg_type=0x%x\n", nlh->nlmsg_len, 
nlh->nlmsg_type);
+   "nlmsg_type=0x%x\n", an

[PATCHv3 iproute2 0/2] libnetlink: malloc correct buff at run time

2017-09-19 Thread Hangbin Liu
With commit 72b365e8e0fd ("libnetlink: Double the dump buffer size") and
460c03f3f3cc ("iplink: double the buffer size also in iplink_get()"), we
extended the buffer size to avoid truncated messages with large numbers of
VFs. But just as Michal said, this is not future-proof since the number of
VFs keeps increasing. We even have customers with 220+ VFs now.

It does not make sense to hard-code the buffer size and keep increasing it.
Let's just malloc a buffer of the correct size at run time.

Tested with most ip commands and all look good.

---
v2 -> v3:
* rtnl_recvmsg():
  * free buf before each return.
  * return errno when recvmsg failed.
v1 -> v2 by Phil:
* rtnl_recvmsg():
  * Rename output buffer pointer arg to 'answer'.
  * Use realloc() and make sure old buffer is freed on error.
  * Always return a newly allocated buffer for caller to free.
  * Retry on EINTR or EAGAIN so caller doesn't have to.
  * Return well-known negative error codes instead of just -1 on error.
  * Simplify goto label names.
  * If no answer pointer was passed, just free the buffer.
* rtnl_dump_filter_l():
  * Don't retry if rtnl_recvmsg() returns 0 as this can't happen
anymore.
  * Free buffer returned by rtnl_recvmsg().
* __rtnl_talk():
  * Don't retry if rtnl_recvmsg() returns 0 as this can't happen
anymore.
  * Free buffer returned by rtnl_recvmsg().
  * Return a newly allocated buffer for callers to free.
* genl_ctrl_resolve_family()
  * Replace 'ghdr + GENL_HDRLEN' to 'answer + NLMSG_LENGTH(GENL_HDRLEN)'
* tc_action_gd()
  * Call print_action() only if cmd == RTM_GETACTION
* Change callers of rtnl_talk*() to always free the answer buffer if
  they passed one.
* Drop extra request buffer space in callers if only used for holding
  output data.
* Drop initialization of answer pointer if not necessary.
* Change callers to pass NULL instead of answer pointer if they don't
  use it afterwards.


Hangbin Liu (2):
  lib/libnetlink: re malloc buff if size is not enough
  lib/libnetlink: update rtnl_talk to support malloc buff at run time

 bridge/fdb.c |   2 +-
 bridge/link.c|   2 +-
 bridge/mdb.c |   2 +-
 bridge/vlan.c|   2 +-
 genl/ctrl.c  |  19 +---
 include/libnetlink.h |   6 +--
 ip/ipaddress.c   |   4 +-
 ip/ipaddrlabel.c |   4 +-
 ip/ipfou.c   |   4 +-
 ip/ipila.c   |   4 +-
 ip/ipl2tp.c  |   8 +--
 ip/iplink.c  |  38 +++
 ip/iplink_vrf.c  |  44 -
 ip/ipmacsec.c|   2 +-
 ip/ipneigh.c |   2 +-
 ip/ipnetns.c |  23 +
 ip/ipntable.c|   2 +-
 ip/iproute.c |  26 ++
 ip/iprule.c  |   6 +--
 ip/ipseg6.c  |   8 +--
 ip/iptoken.c |   2 +-
 ip/link_gre.c|  11 +++--
 ip/link_gre6.c   |  11 +++--
 ip/link_ip6tnl.c |  11 +++--
 ip/link_iptnl.c  |  10 ++--
 ip/link_vti.c|  11 +++--
 ip/link_vti6.c   |  11 +++--
 ip/tcp_metrics.c |   8 +--
 ip/xfrm_policy.c |  25 +-
 ip/xfrm_state.c  |  30 ++--
 lib/libgenl.c|   9 +++-
 lib/libnetlink.c | 134 ++-
 misc/ss.c|   2 +-
 tc/m_action.c|  12 ++---
 tc/tc_class.c|   2 +-
 tc/tc_filter.c   |   8 +--
 tc/tc_qdisc.c|   2 +-
 37 files changed, 298 insertions(+), 209 deletions(-)

-- 
2.5.5



[PATCHv3 iproute2 1/2] lib/libnetlink: re malloc buff if size is not enough

2017-09-19 Thread Hangbin Liu
With commit 72b365e8e0fd ("libnetlink: Double the dump buffer size")
we doubled the buffer size to support more VFs. But the number of VFs is
increasing all the time. Some customers even use more than 200 VFs now.

We cannot keep doubling the buffer every time it is not big enough. Let's
stop hard-coding the buffer size and malloc the correct amount at run time.

Introduce a function rtnl_recvmsg() that always returns a newly allocated
buffer. The caller needs to free it after use.
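
The core trick, sketched (as implemented in rtnl_recvmsg() below):
recvmsg() with MSG_PEEK | MSG_TRUNC reports the full length of the pending
message without consuming it, so the buffer can be sized before the real
read.

	/* peek to learn the real message length */
	len = recvmsg(fd, &msg, MSG_PEEK | MSG_TRUNC);
	if (len > buf_len) {
		/* grow the buffer to len ... */
	}
	/* ... then receive for real */
	len = recvmsg(fd, &msg, 0);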

Signed-off-by: Hangbin Liu 
Signed-off-by: Phil Sutter 
---
 lib/libnetlink.c | 114 ++-
 1 file changed, 80 insertions(+), 34 deletions(-)

diff --git a/lib/libnetlink.c b/lib/libnetlink.c
index be7ac86..ab45b48 100644
--- a/lib/libnetlink.c
+++ b/lib/libnetlink.c
@@ -402,6 +402,64 @@ static void rtnl_dump_error(const struct rtnl_handle *rth,
}
 }
 
+static int rtnl_recvmsg(int fd, struct msghdr *msg, char **answer)
+{
+   struct iovec *iov;
+   int len = -1, buf_len = 32768;
+   char *bufp, *buf = NULL;
+
+   int flag = MSG_PEEK | MSG_TRUNC;
+
+realloc:
+   bufp = realloc(buf, buf_len);
+
+   if (bufp == NULL) {
+   fprintf(stderr, "malloc error: not enough buffer\n");
+   free(buf);
+   return -ENOMEM;
+   }
+   buf = bufp;
+   iov = msg->msg_iov;
+   iov->iov_base = buf;
+   iov->iov_len = buf_len;
+
+recv:
+   len = recvmsg(fd, msg, flag);
+
+   if (len < 0) {
+   if (errno == EINTR || errno == EAGAIN)
+   goto recv;
+   fprintf(stderr, "netlink receive error %s (%d)\n",
+   strerror(errno), errno);
+   free(buf);
+   return -errno;
+   }
+
+   if (len == 0) {
+   fprintf(stderr, "EOF on netlink\n");
+   free(buf);
+   return -ENODATA;
+   }
+
+   if (len > buf_len) {
+   buf_len = len;
+   flag = 0;
+   goto realloc;
+   }
+
+   if (flag != 0) {
+   flag = 0;
+   goto recv;
+   }
+
+   if (answer)
+   *answer = buf;
+   else
+   free(buf);
+
+   return len;
+}
+
 int rtnl_dump_filter_l(struct rtnl_handle *rth,
   const struct rtnl_dump_filter_arg *arg)
 {
@@ -413,31 +471,18 @@ int rtnl_dump_filter_l(struct rtnl_handle *rth,
.msg_iov = &iov,
.msg_iovlen = 1,
};
-   char buf[32768];
+   char *buf;
int dump_intr = 0;
 
-   iov.iov_base = buf;
while (1) {
int status;
const struct rtnl_dump_filter_arg *a;
int found_done = 0;
int msglen = 0;
 
-   iov.iov_len = sizeof(buf);
-   status = recvmsg(rth->fd, &msg, 0);
-
-   if (status < 0) {
-   if (errno == EINTR || errno == EAGAIN)
-   continue;
-   fprintf(stderr, "netlink receive error %s (%d)\n",
-   strerror(errno), errno);
-   return -1;
-   }
-
-   if (status == 0) {
-   fprintf(stderr, "EOF on netlink\n");
-   return -1;
-   }
+   status = rtnl_recvmsg(rth->fd, &msg, &buf);
+   if (status < 0)
+   return status;
 
if (rth->dump_fp)
fwrite(buf, 1, NLMSG_ALIGN(status), rth->dump_fp);
@@ -462,8 +507,10 @@ int rtnl_dump_filter_l(struct rtnl_handle *rth,
 
if (h->nlmsg_type == NLMSG_DONE) {
err = rtnl_dump_done(h);
-   if (err < 0)
+   if (err < 0) {
+   free(buf);
return -1;
+   }
 
found_done = 1;
break; /* process next filter */
@@ -471,19 +518,23 @@ int rtnl_dump_filter_l(struct rtnl_handle *rth,
 
if (h->nlmsg_type == NLMSG_ERROR) {
rtnl_dump_error(rth, h);
+   free(buf);
return -1;
}
 
if (!rth->dump_fp) {
err = a->filter(&nladdr, h, a->arg1);
-   if (err < 0)
+   if (err < 0) {
+   free(buf);
return err;
+   }
}
 
 sk

[PATCH v4 3/3] net: fec: return IRQ_HANDLED if fec_ptp_check_pps_event handled it

2017-09-19 Thread Troy Kisky
fec_ptp_check_pps_event will return 1 if FEC_T_TF_MASK caused
an interrupt. Don't return IRQ_NONE in this case.

Signed-off-by: Troy Kisky 
Acked-by: Fugang Duan 

---
v3: New patch, came from feedback from another patch.
v4: Added Acked-by

Signed-off-by: Troy Kisky 
---
 drivers/net/ethernet/freescale/fec_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c 
b/drivers/net/ethernet/freescale/fec_main.c
index 464055f..3dc2d77 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1604,8 +1604,8 @@ fec_enet_interrupt(int irq, void *dev_id)
}
 
if (fep->ptp_clock)
-   fec_ptp_check_pps_event(fep);
-
+   if (fec_ptp_check_pps_event(fep))
+   ret = IRQ_HANDLED;
return ret;
 }
 
-- 
2.7.4



[PATCH v4 2/3] net: fec: remove unused interrupt FEC_ENET_TS_TIMER

2017-09-19 Thread Troy Kisky
FEC_ENET_TS_TIMER is not checked in the interrupt routine
so there is no need to enable it.

Signed-off-by: Troy Kisky 
Acked-by: Fugang Duan 

---
v4: Added Acked-by

Signed-off-by: Troy Kisky 
---
 drivers/net/ethernet/freescale/fec.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec.h 
b/drivers/net/ethernet/freescale/fec.h
index 38c7b21..ede1876 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -374,8 +374,8 @@ struct bufdesc_ex {
 #define FEC_ENET_TS_AVAIL   ((uint)0x0001)
 #define FEC_ENET_TS_TIMER   ((uint)0x8000)
 
-#define FEC_DEFAULT_IMASK (FEC_ENET_TXF | FEC_ENET_RXF | FEC_ENET_MII | 
FEC_ENET_TS_TIMER)
-#define FEC_NAPI_IMASK (FEC_ENET_MII | FEC_ENET_TS_TIMER)
+#define FEC_DEFAULT_IMASK (FEC_ENET_TXF | FEC_ENET_RXF | FEC_ENET_MII)
+#define FEC_NAPI_IMASK FEC_ENET_MII
 #define FEC_RX_DISABLED_IMASK (FEC_DEFAULT_IMASK & (~FEC_ENET_RXF))
 
 /* ENET interrupt coalescing macro define */
-- 
2.7.4



[PATCH v4 1/3] net: fec: only check queue 0 if RXF_0/TXF_0 interrupt is set

2017-09-19 Thread Troy Kisky
Before, queue 0 was always checked whenever any queue caused an interrupt.
It is better to mark queue 0 only if queue 0 itself has caused an interrupt.

Signed-off-by: Troy Kisky 
Acked-by: Fugang Duan 

---
v3: add Acked-by
v4: no change

Signed-off-by: Troy Kisky 
---
 drivers/net/ethernet/freescale/fec_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/freescale/fec_main.c 
b/drivers/net/ethernet/freescale/fec_main.c
index 56f56d6..464055f 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -1559,14 +1559,14 @@ fec_enet_collect_events(struct fec_enet_private *fep, 
uint int_events)
if (int_events == 0)
return false;
 
-   if (int_events & FEC_ENET_RXF)
+   if (int_events & FEC_ENET_RXF_0)
fep->work_rx |= (1 << 2);
if (int_events & FEC_ENET_RXF_1)
fep->work_rx |= (1 << 0);
if (int_events & FEC_ENET_RXF_2)
fep->work_rx |= (1 << 1);
 
-   if (int_events & FEC_ENET_TXF)
+   if (int_events & FEC_ENET_TXF_0)
fep->work_tx |= (1 << 2);
if (int_events & FEC_ENET_TXF_1)
fep->work_tx |= (1 << 0);
-- 
2.7.4



linux-next: manual merge of the net-next tree with the driver-core.current tree

2017-09-19 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  lib/kobject_uevent.c

between commit:

  6878e7de6af7 ("driver core: suppress sending MODALIAS in UNBIND uevents")

from the driver-core.current tree and commit:

  16dff336b33d ("kobject: add kobject_uevent_net_broadcast()")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc lib/kobject_uevent.c
index f237a09a5862,147db91c10d0..
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@@ -294,26 -294,55 +294,75 @@@ static void cleanup_uevent_env(struct s
  }
  #endif
  
 +static void zap_modalias_env(struct kobj_uevent_env *env)
 +{
 +  static const char modalias_prefix[] = "MODALIAS=";
 +  int i;
 +
 +  for (i = 0; i < env->envp_idx;) {
 +  if (strncmp(env->envp[i], modalias_prefix,
 +  sizeof(modalias_prefix) - 1)) {
 +  i++;
 +  continue;
 +  }
 +
 +  if (i != env->envp_idx - 1)
 +  memmove(&env->envp[i], &env->envp[i + 1],
 +  sizeof(env->envp[i]) * env->envp_idx - 1);
 +
 +  env->envp_idx--;
 +  }
 +}
 +
+ static int kobject_uevent_net_broadcast(struct kobject *kobj,
+   struct kobj_uevent_env *env,
+   const char *action_string,
+   const char *devpath)
+ {
+   int retval = 0;
+ #if defined(CONFIG_NET)
+   struct sk_buff *skb = NULL;
+   struct uevent_sock *ue_sk;
+ 
+   /* send netlink message */
+   list_for_each_entry(ue_sk, &uevent_sock_list, list) {
+   struct sock *uevent_sock = ue_sk->sk;
+ 
+   if (!netlink_has_listeners(uevent_sock, 1))
+   continue;
+ 
+   if (!skb) {
+   /* allocate message with the maximum possible size */
+   size_t len = strlen(action_string) + strlen(devpath) + 
2;
+   char *scratch;
+ 
+   retval = -ENOMEM;
+   skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+   if (!skb)
+   continue;
+ 
+   /* add header */
+   scratch = skb_put(skb, len);
+   sprintf(scratch, "%s@%s", action_string, devpath);
+ 
+   skb_put_data(skb, env->buf, env->buflen);
+ 
+   NETLINK_CB(skb).dst_group = 1;
+   }
+ 
+   retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
+   0, 1, GFP_KERNEL,
+   kobj_bcast_filter,
+   kobj);
+   /* ENOBUFS should be handled in userspace */
+   if (retval == -ENOBUFS || retval == -ESRCH)
+   retval = 0;
+   }
+   consume_skb(skb);
+ #endif
+   return retval;
+ }
+ 
  /**
   * kobject_uevent_env - send an uevent with environmental data
   *


[PATCH 0/2] blackfin: Drop non-functional DSA code

2017-09-19 Thread Florian Fainelli
Hi David,

I sent these many months ago in the hope that the bfin-linux people
would pick those patches up, but nobody seems to be responding. Can you
queue them via net-next, since this affects DSA?

Thanks!

Florian Fainelli (2):
  blackfin: tcm-bf518: Remove dsa.h inclusion
  blackfin: ezbrd: Remove non-functional DSA/KSZ8893M code

 arch/blackfin/mach-bf518/boards/ezbrd.c | 47 -
 arch/blackfin/mach-bf518/boards/tcm-bf518.c |  1 -
 2 files changed, 48 deletions(-)

-- 
2.11.0



[PATCH 1/2] blackfin: tcm-bf518: Remove dsa.h inclusion

2017-09-19 Thread Florian Fainelli
Nothing in that file uses definitions from that header, so just get rid of it.

Signed-off-by: Florian Fainelli 
---
 arch/blackfin/mach-bf518/boards/tcm-bf518.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/blackfin/mach-bf518/boards/tcm-bf518.c 
b/arch/blackfin/mach-bf518/boards/tcm-bf518.c
index 240d5cb1f02c..37d868085f6a 100644
--- a/arch/blackfin/mach-bf518/boards/tcm-bf518.c
+++ b/arch/blackfin/mach-bf518/boards/tcm-bf518.c
@@ -25,7 +25,6 @@
 #include 
 #include 
 #include 
-#include 
 
 /*
  * Name the Board for the /proc/cpuinfo
-- 
2.11.0



[PATCH 2/2] blackfin: ezbrd: Remove non-functional DSA/KSZ8893M code

2017-09-19 Thread Florian Fainelli
There is no in-tree driver for the KSZ8893M switch, so just get rid of
the code in that board file.

Signed-off-by: Florian Fainelli 
---
 arch/blackfin/mach-bf518/boards/ezbrd.c | 47 -
 1 file changed, 47 deletions(-)

diff --git a/arch/blackfin/mach-bf518/boards/ezbrd.c 
b/arch/blackfin/mach-bf518/boards/ezbrd.c
index d022112927c2..c51d1b810ac3 100644
--- a/arch/blackfin/mach-bf518/boards/ezbrd.c
+++ b/arch/blackfin/mach-bf518/boards/ezbrd.c
@@ -25,7 +25,6 @@
 #include 
 #include 
 #include 
-#include 
 
 /*
  * Name the Board for the /proc/cpuinfo
@@ -105,11 +104,7 @@ static const unsigned short bfin_mac_peripherals[] = {
 
 static struct bfin_phydev_platform_data bfin_phydev_data[] = {
{
-#if IS_ENABLED(CONFIG_NET_DSA_KSZ8893M)
-   .addr = 3,
-#else
.addr = 1,
-#endif
.irq = IRQ_MAC_PHYINT,
},
 };
@@ -119,9 +114,6 @@ static struct bfin_mii_bus_platform_data bfin_mii_bus_data 
= {
.phydev_data = bfin_phydev_data,
.phy_mode = PHY_INTERFACE_MODE_MII,
.mac_peripherals = bfin_mac_peripherals,
-#if IS_ENABLED(CONFIG_NET_DSA_KSZ8893M)
-   .phy_mask = 0xfff7, /* Only probe the port phy connect to the on chip 
MAC */
-#endif
.vlan1_mask = 1,
.vlan2_mask = 2,
 };
@@ -140,29 +132,6 @@ static struct platform_device bfin_mac_device = {
}
 };
 
-#if IS_ENABLED(CONFIG_NET_DSA_KSZ8893M)
-static struct dsa_chip_data ksz8893m_switch_chip_data = {
-   .mii_bus = &bfin_mii_bus.dev,
-   .port_names = {
-   NULL,
-   "eth%d",
-   "eth%d",
-   "cpu",
-   },
-};
-static struct dsa_platform_data ksz8893m_switch_data = {
-   .nr_chips = 1,
-   .netdev = &bfin_mac_device.dev,
-   .chip = &ksz8893m_switch_chip_data,
-};
-
-static struct platform_device ksz8893m_switch_device = {
-   .name   = "dsa",
-   .id = 0,
-   .num_resources  = 0,
-   .dev.platform_data = &ksz8893m_switch_data,
-};
-#endif
 #endif
 
 #if IS_ENABLED(CONFIG_MTD_M25P80)
@@ -228,19 +197,6 @@ static struct spi_board_info bfin_spi_board_info[] 
__initdata = {
},
 #endif
 
-#if IS_ENABLED(CONFIG_BFIN_MAC)
-#if IS_ENABLED(CONFIG_NET_DSA_KSZ8893M)
-   {
-   .modalias = "ksz8893m",
-   .max_speed_hz = 500,
-   .bus_num = 0,
-   .chip_select = 1,
-   .platform_data = NULL,
-   .mode = SPI_MODE_3,
-   },
-#endif
-#endif
-
 #if IS_ENABLED(CONFIG_MMC_SPI)
{
.modalias = "mmc_spi",
@@ -714,9 +670,6 @@ static struct platform_device *stamp_devices[] __initdata = 
{
 #if IS_ENABLED(CONFIG_BFIN_MAC)
&bfin_mii_bus,
&bfin_mac_device,
-#if IS_ENABLED(CONFIG_NET_DSA_KSZ8893M)
-   &ksz8893m_switch_device,
-#endif
 #endif
 
 #if IS_ENABLED(CONFIG_SPI_BFIN5XX)
-- 
2.11.0



[PATCH net-next] net: dsa: Utilize dsa_slave_dev_check()

2017-09-19 Thread Florian Fainelli
Instead of open coding the check.

Signed-off-by: Florian Fainelli 
---
 net/dsa/slave.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index d51b10450e1b..6fc9eb094267 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -1294,7 +1294,7 @@ static int dsa_slave_netdevice_event(struct 
notifier_block *nb,
 {
struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 
-   if (dev->netdev_ops != &dsa_slave_netdev_ops)
+   if (!dsa_slave_dev_check(dev))
return NOTIFY_DONE;
 
if (event == NETDEV_CHANGEUPPER)
-- 
2.9.3



Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-19 Thread Tom Herbert
On Tue, Sep 19, 2017 at 5:34 PM, Samudrala, Sridhar
 wrote:
> On 9/12/2017 3:53 PM, Tom Herbert wrote:
>>
>> On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
>>  wrote:
>>>
>>>
>>> On 9/12/2017 8:47 AM, Eric Dumazet wrote:

 On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:
>
> On 9/11/2017 8:53 PM, Eric Dumazet wrote:
>>
>> On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:
>>
>>> Two ints in sock_common for this purpose is quite expensive and the
>>> use case for this is limited-- even if a RX->TX queue mapping were
>>> introduced to eliminate the queue pair assumption this still won't
>>> help if the receive and transmit interfaces are different for the
>>> connection. I think we really need to see some very compelling
>>> results
>>> to be able to justify this.
>
> Will try to collect and post some perf data with symmetric queue
> configuration.
>
>
> Here is some performance data I collected with a memcached workload over an
> ixgbe 10Gb NIC with the mcblaster benchmark.
> ixgbe is configured with 16 queues and rx-usecs is set to 1000 for a very
> low interrupt rate.
>   ethtool -L p1p1 combined 16
>   ethtool -C p1p1 rx-usecs 1000
> and busy poll is set to 1000usecs
>   sysctl net.core.busy_poll = 1000
>
> 16 threads, 800K requests/sec
> ==========================================================
>                    rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
> ----------------------------------------------------------
> Default            2/182/10641            23391     61163
> Symmetric Queues   2/50/6311              20457     32843
>
> 32 threads, 800K requests/sec
> ==========================================================
>                    rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
> ----------------------------------------------------------
> Default            2/162/6390             32168     69450
> Symmetric Queues   2/50/3853              35044     35847
>
No idea what the "Default" configuration is. Please report how xps_cpus is
being set, how many RSS queues there are, and what the mapping is
between RSS queues and CPUs and shared caches. Also, whether the
threads are pinned.

Thanks,
Tom


Re: [PATCH] ipv6_skip_exthdr: use ipv6_authlen for AH hdrlen

2017-09-19 Thread Tom Herbert
On Tue, Sep 19, 2017 at 5:59 AM, Xiang Gao  wrote:
> In ipv6_skip_exthdr, the length of the AH header is computed manually
> as (hp->hdrlen+2)<<2. However, in include/linux/ipv6.h, a macro
> named ipv6_authlen is already defined for exactly the same job. This
> commit replaces the manual computation code with the macro.

This isn't directly related to this patch, but I notice that the flow
dissector doesn't use the ipv6_optlen macro and doesn't have
NEXTHDR_AUTH or NEXTHDR_NONE in the ip_proto switch statement. That would
be a nice fix. The NEXTHDR_NONE case is probably just a break, but it
would be nice to have it there for completeness.
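
Something along these lines, perhaps (a hypothetical helper, not an actual
patch; AH encodes its length in 4-octet units plus two, unlike the 8-octet
units used by the other extension headers):

	static int exthdr_len(u8 nexthdr, const struct ipv6_opt_hdr *hp)
	{
		if (nexthdr == NEXTHDR_FRAGMENT)
			return 8;
		if (nexthdr == NEXTHDR_AUTH)
			return ipv6_authlen(hp);	/* (hp->hdrlen + 2) << 2 */
		return ipv6_optlen(hp);			/* (hp->hdrlen + 1) << 3 */
	}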

Tom

> ---
>  net/ipv6/exthdrs_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/ipv6/exthdrs_core.c b/net/ipv6/exthdrs_core.c
> index 305e2ed730bf..115d60919f72 100644
> --- a/net/ipv6/exthdrs_core.c
> +++ b/net/ipv6/exthdrs_core.c
> @@ -99,7 +99,7 @@ int ipv6_skip_exthdr(const struct sk_buff *skb, int start, 
> u8 *nexthdrp,
> break;
> hdrlen = 8;
> } else if (nexthdr == NEXTHDR_AUTH)
> -   hdrlen = (hp->hdrlen+2)<<2;
> +   hdrlen = ipv6_authlen(hp);
> else
> hdrlen = ipv6_optlen(hp);
>
> --
> 2.14.1
>


Re: [RFC PATCH] net: Introduce a socket option to enable picking tx queue based on rx queue.

2017-09-19 Thread Samudrala, Sridhar

On 9/12/2017 3:53 PM, Tom Herbert wrote:

On Tue, Sep 12, 2017 at 3:31 PM, Samudrala, Sridhar
 wrote:


On 9/12/2017 8:47 AM, Eric Dumazet wrote:

On Mon, 2017-09-11 at 23:27 -0700, Samudrala, Sridhar wrote:

On 9/11/2017 8:53 PM, Eric Dumazet wrote:

On Mon, 2017-09-11 at 20:12 -0700, Tom Herbert wrote:


Two ints in sock_common for this purpose is quite expensive and the
use case for this is limited-- even if a RX->TX queue mapping were
introduced to eliminate the queue pair assumption this still won't
help if the receive and transmit interfaces are different for the
connection. I think we really need to see some very compelling results
to be able to justify this.

Will try to collect and post some perf data with symmetric queue
configuration.


Here is some performance data I collected with a memcached workload over an
ixgbe 10Gb NIC with the mcblaster benchmark.
ixgbe is configured with 16 queues and rx-usecs is set to 1000 for a very
low interrupt rate.
  ethtool -L p1p1 combined 16
  ethtool -C p1p1 rx-usecs 1000
and busy poll is set to 1000usecs
  sysctl net.core.busy_poll = 1000

16 threads, 800K requests/sec
==========================================================
                   rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
----------------------------------------------------------
Default            2/182/10641            23391     61163
Symmetric Queues   2/50/6311              20457     32843

32 threads, 800K requests/sec
==========================================================
                   rtt(min/avg/max)usecs  intr/sec  contextswitch/sec
----------------------------------------------------------
Default            2/162/6390             32168     69450
Symmetric Queues   2/50/3853              35044     35847




Yes, this is an unreasonable cost.

XPS should really cover the case already.


Eric,

Can you clarify how XPS covers the RX -> TX queue mapping case?
Is it possible to configure XPS to select the TX queue based on the RX
queue of a flow?
IIUC, it is based either on the CPU of the thread doing the transmit or on
the skb->priority to TC mapping.
It may be possible to get this effect if the threads are pinned to a core,
but if the app threads are freely moving, I am not sure how XPS can be
configured to select the TX queue based on the RX queue of a flow.
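
(For reference, XPS is configured per TX queue via sysfs by mapping CPUs to
the queue; the interface name here is hypothetical:

  # transmits from CPUs 0-3 go out TX queue 0
  echo f > /sys/class/net/eth0/queues/tx-0/xps_cpus

so the selection follows the transmitting CPU, not the RX queue of the
flow.)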

If the application is freely moving, how can the NIC properly select the RX
queue so that packets arrive on the appropriate queue?

The RX queue is selected via RSS and we don't want to move the flow based on
where the thread is running.

Unless flow director is enabled on the Intel device... This was, I
believe, one of the first attempts to introduce a queue pair notion to
general purpose NICs. The idea was that the device records the TX
queue for a flow and then uses that to determine receive queue in a
symmetric fashion. aRFS is similar, but was under SW control how the
mapping is done. As Eric mentioned there are scalability issues with
these mechanisms, but we also found that flow director can easily
reorder packets whenever the thread moves.


You must be referring to the ATR (application targeted routing) feature on
Intel NICs, where a flow director entry is added for a flow based on the TX
queue used for that flow. Instead, we would like to select the TX queue
based on the RX queue of a flow.






This is called aRFS, and it does not scale to millions of flows.
We tried this in the past, and it went nowhere really, since the setup cost
is prohibitive and it is DDoS-vulnerable.

XPS will follow the thread, since selection is done on current cpu.

The problem is the RX side. If the application is free to migrate, then
special support (aRFS) is needed from the hardware.

This may be true if most of the RX processing is happening in interrupt
context. But with busy polling, I think we don't need aRFS, as a thread
should be able to poll any queue irrespective of where it is running.

It's not just a problem with interrupt processing; in general we like
to have all receive processing and the subsequent transmit of a reply
done on one CPU. Silo'ing is good for performance and parallelism.
This can sometimes be relaxed in situations where CPUs share a cache,
so crossing CPUs is not costly.


Yes. We would like to get this behavior even without binding the app 
thread to a CPU.







At least for passive connections, we already have all the support in the
kernel so that you can have one thread per NIC queue, dealing with
sockets that have incoming packets all received on one NIC RX queue.
(And of course all TX packets will use the symmetric TX queue)

SO_REUSEPORT plus appropriate BPF filter can achieve that.

Say you have 32 queues, 32 cpus.

Simply use 32 listeners, 32 threads (or 32 pools of threads)
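
A minimal sketch of the reuseport BPF Eric describes (assuming one listener
per RX queue, created in queue order, so that the packet's RX queue indexes
the socket in the group; SO_REUSEPORT must be set on each fd before bind):

	struct sock_filter code[] = {
		/* A = skb->queue_mapping (the RX queue) */
		BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_QUEUE),
		/* return A: selects the A-th socket in the group */
		BPF_STMT(BPF_RET | BPF_A, 0),
	};
	struct sock_fprog prog = {
		.len = sizeof(code) / sizeof(code[0]),
		.filter = code,
	};

	setsockopt(fd, SOL_SOCKET, SO_ATTACH_REUSEPORT_CBPF,
		   &prog, sizeof(prog));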

Yes. This will work if each thread is pinned to a core associated with the
RX interrupt. It may not be possible to pin the threads to a core.
Instead we want to associate a thread with a queue and do all the RX and TX
completion of a queue in the same thread context.

Re: Latest net-next from GIT panic

2017-09-19 Thread Paweł Staszewski

The latest working kernel with the same configuration and kernel config is
4.12.13.

There is no panic after the routes from all 6 BGP sessions are learned.

ip r | wc -l
653112




On 2017-09-20 at 02:06, Paweł Staszewski wrote:

Just checked kernel 4.13.2 and it has the same problem.

Just after all 6 BGP sessions start and the kernel begins to learn routes,
it panics.


https://bugzilla.kernel.org/attachment.cgi?id=258509



On 2017-09-20 at 02:01, Paweł Staszewski wrote:

Some information about the environment:
The server is acting as an IP router with BGP.
There are 6 BGP sessions, each with a full BGP table of ~600k prefixes.

And it looks like the panic appears after the BGP sessions are connected,
not because of traffic; at the time the panic occurred there was almost
no traffic.


Also, when I run this server without turning on BGP and push traffic
through it with pktgen, there is no panic.


It panics just after it learns the routes.







On 2017-09-20 at 01:45, Paweł Staszewski wrote:
Added a few more screenshots from kernels 4.14-rc1 (net-next) and
4.14-rc1 (linux-next).


https://bugzilla.kernel.org/show_bug.cgi?id=197005


On 2017-09-20 at 00:35, Paweł Staszewski wrote:

Just tried the latest net-next git and found a kernel panic.

Below link to bugzilla.

https://bugzilla.kernel.org/attachment.cgi?id=258499


















Re: [RFC net-next 0/5] TSN: Add qdisc-based config interfaces for traffic shapers

2017-09-19 Thread Vinicius Costa Gomes
Hi Richard,

Richard Cochran  writes:

> On Mon, Sep 18, 2017 at 04:06:28PM -0700, Vinicius Costa Gomes wrote:
>> That's the point, the application does not need to know that, and asking
>> that would be stupid.
>
> On the contrary, this information is essential to the application.
> Probably you have never seen an actual Ethernet field bus in
> operation?  In any case, you are missing the point.
>
>> (And that's another nice point of how 802.1Qbv works, applications do
>> not need to be changed to use it, and I think we should work to achieve
>> this on the Linux side)
>
> Once you start to care about real time performance, then you need to
> consider the applications.  This is industrial control, not streaming
> your tunes from your ipod.
>
>> That being said, that only works for kinds of traffic that maps well to
>> this configuration in advance model, which is the model that the IEEE
>> (see 802.1Qcc) and the AVNU Alliance[1] are pushing for.
>
> Again, you are missing the point of what they aiming for.  I have
> looked at a number of production systems, and in each case the
> developers want total control over the transmission, in order to
> reduce latency to an absolute minimum.  Typically the data to be sent
> are available only microseconds before the transmission deadline.
>
> Consider OpenAVB on github that people are already using.  Take a look
> at simple_talker.c and explain how "applications do not need to be
> changed to use it."

Just let me use the mention of OpenAVNU as a hook to explain what we
(the team I am part of) are working to do; perhaps it will make our
choices and designs clearer.

One of the problems with OpenAVNU is that it's too tightly coupled to the
i210 NIC. One of the things we want is to decouple OpenAVNU from the
controller. The way we thought best was to propose interfaces (that
would work alongside the Linux networking stack) as close as
possible to what the current standards define, that is, the IEEE
802.1Q family of specifications, in the hope that network controller
vendors would also look at the specifications when designing their
controllers.

Our objective with the Qdiscs we are proposing (both cbs and taprio) is
to provide a sane way to configure controllers that support TSN features
(we were looking specifically at the IEEE specs).

After we have some rough consensus on the interfaces to use, then we can
start working on OpenAVNU.

>
>> [1]
>> http://avnu.org/theory-of-operation-for-tsn-enabled-industrial-systems/
>
> Did you even read this?
>
> [page 24]
>
> As described in section 2, some industrial control systems require
> predictable, very low latency and cycle-to-cycle variation to meet
> hard real-time application requirements. In these systems,
> multiple distributed controllers commonly synchronize their
> sensor/actuator operations with other controllers by scheduling
> these operations in time, typically using a repeating control
> cycle.
> ...
> The gate control mechanism is itself a time-aware PTP application
> operating within a bridge or end station port.
>
> It is an application, not a "god box."
>
>> In short, I see a per-packet transmission time and a per-queue schedule
>> as solutions to different problems.
>
> Well, I can agree with that.  For some non real-time applications,
> bandwidth shaping is enough, and your Qdisc idea is sufficient.  For
> the really challenging TSN targets (industrial control, automotive),
> your idea of an opaque schedule file won't fly.

(Sorry if I am being annoying here, but the idea of an opaque schedule
is not ours; it comes from the people who wrote the Qbv specification.)

I have a question, what about a controller that doesn't provide a way to
set a per-packet transmission time, but it supports Qbv/Qbu. What would
be your proposal to configure it?

(I think LaunchTime is something specific to the i210, right?)


Cheers,
--
Vinicius


Re: [PATCH net-next 07/14] gtp: Support encapsulation of IPv6 packets

2017-09-19 Thread Tom Herbert
On Tue, Sep 19, 2017 at 10:42 AM, David Miller  wrote:
> From: Harald Welte 
> Date: Tue, 19 Sep 2017 20:12:45 +0800
>
>> Hi Dave,
>>
>> On Mon, Sep 18, 2017 at 09:19:08PM -0700, David Miller wrote:
>>
>>> > +static inline u32 ipv6_hashfn(const struct in6_addr *a)
>>> > +{
>>> > +  return __ipv6_addr_jhash(a, gtp_h_initval);
>>> > +}
>>>
>>> I know you are just following the pattern of the existing "ipv4_hashfn()" 
>>> here
>>> but this kind of stuff is not very global namespace friendly.  Even simply
>>> adding a "gtp_" prefix to these hash functions would be a lot better.
>>
>> I would agree if this was an inline function defined in a header file or
>> a non-static function.  But where is the global namespace concern in
>> case of static inline functions defined and used in the same .c file?
>
> The problem is if we create a generic ipv6_hashfn() in linux/ipv6.h or
> something like that, then this driver stops building.

It was a carry over since ipv4_hashfn was already defined in the file.
I will prefix both functions.


Re: Latest net-next from GIT panic

2017-09-19 Thread Paweł Staszewski

Just checked kernel 4.13.2 and it has the same problem.

Just after all 6 BGP sessions start and the kernel begins to learn routes,
it panics.


https://bugzilla.kernel.org/attachment.cgi?id=258509



On 2017-09-20 at 02:01, Paweł Staszewski wrote:

Some information about the environment:
The server is acting as an IP router with BGP.
There are 6 BGP sessions, each with a full BGP table of ~600k prefixes.

And it looks like the panic appears after the BGP sessions are connected,
not because of traffic; at the time the panic occurred there was almost
no traffic.


Also, when I run this server without turning on BGP and push traffic
through it with pktgen, there is no panic.


It panics just after it learns the routes.







On 2017-09-20 at 01:45, Paweł Staszewski wrote:
Added a few more screenshots from kernels 4.14-rc1 (net-next) and
4.14-rc1 (linux-next).


https://bugzilla.kernel.org/show_bug.cgi?id=197005


On 2017-09-20 at 00:35, Paweł Staszewski wrote:

Just tried the latest net-next git and found a kernel panic.

Below link to bugzilla.

https://bugzilla.kernel.org/attachment.cgi?id=258499















Re: Latest net-next from GIT panic

2017-09-19 Thread Paweł Staszewski

Some information about the environment:
The server is acting as an IP router with BGP.
There are 6 BGP sessions, each with a full BGP table of ~600k prefixes.

And it looks like the panic appears after the BGP sessions are connected,
not because of traffic; at the time the panic occurred there was almost
no traffic.


Also, when I run this server without turning on BGP and push traffic
through it with pktgen, there is no panic.


It panics just after it learns the routes.







On 2017-09-20 at 01:45, Paweł Staszewski wrote:
Added a few more screenshots from kernels 4.14-rc1 (net-next) and
4.14-rc1 (linux-next).


https://bugzilla.kernel.org/show_bug.cgi?id=197005


On 2017-09-20 at 00:35, Paweł Staszewski wrote:

Just tried the latest net-next git and found a kernel panic.

Below link to bugzilla.

https://bugzilla.kernel.org/attachment.cgi?id=258499












Re: Latest net-next from GIT panic

2017-09-19 Thread Paweł Staszewski
Added a few more screenshots from kernels 4.14-rc1 (net-next) and
4.14-rc1 (linux-next).


https://bugzilla.kernel.org/show_bug.cgi?id=197005


On 2017-09-20 at 00:35, Paweł Staszewski wrote:

Just tried the latest net-next git and found a kernel panic.

Below link to bugzilla.

https://bugzilla.kernel.org/attachment.cgi?id=258499









Re: [PATCH net-next 00/14] gtp: Additional feature support

2017-09-19 Thread Tom Herbert
On Tue, Sep 19, 2017 at 4:19 PM, Harald Welte  wrote:
> Hi Tom,
>
> On Tue, Sep 19, 2017 at 08:59:28AM -0700, Tom Herbert wrote:
>> On Tue, Sep 19, 2017 at 5:43 AM, Harald Welte 
>> wrote:
>> > On Mon, Sep 18, 2017 at 05:38:50PM -0700, Tom Herbert wrote:
>> >>   - IPv6 support
>> >
>> > see my detailed comments in other mails.  It's unfortunately only
>> > support for the already "deprecated" IPv6-only PDP contexts, not the
>> > more modern v4v6 type.  In order to interoperate with old and new
>> > approach, all three cases (v4, v6 and v4v6) should be supported from
>> > one code base.
>> >
>> It sounds like something that can be subsequently added.
>
> Not entirely, at least on the netlink (and any other configuration
> interface) you will have to reflect this from the very beginning.  You
> have to have an explicit PDP type and cannot rely on the address type to
> specify the type of PDP context.  Whatever interfaces are introduced
> now will have to remain compatible to any future change.
>
> My strategy to avoid any such possible 'road blocks' from being
> introduced would be to simply add v4v6 and v6 support in one go.  The
> differences are marginal (having both an IPv6 prefix and a v4 address in
> parallel, rather than mutually exclusive only).
>
>> Do you have a reference to the spec?
>
> See http://osmocom.org/issues/2418#note-7 which lists Section 11.2.1.3.2
> of 3GPP TS 29.061 in combination with RFC3314, RFC7066, RFC6459 and
> 3GPP TS 23.060 9.2.1 as well as a summary of my understanding of it some
> months ago.
>
>> >>   - Configurable networking interfaces so that GTP kernel can be
>> >>   used and tested without needing GSN network emulation (i.e. no
>> >>   user space daemon needed).
>> >
>> > We have some pretty decent userspace utilities for configuring the
>> > GTP interfaces and tunnels in the libgtpnl repository, but if it
>> > helps people to have another way of configuration, I won't be
>> > against it.
>> >
>> AFAIK those userspace utilities don't support IPv6.
>
> Of course not [yet]. libgtpnl and the command line tools have been
> implemented specifically for the in-kernel GTP driver, and you have to
> make sure to add related support on both the kernel and the userspace
> side (libgtpnl). So there's little point in adding features on either
> side before the other side.  There would be no way to test...
>
>> Being able to configure GTP like any other encapsulation will
>> facilitate development of IPv6 and other features.
>
> That may very well be the case, but adding "IPv6 support" to kernel GTP
> in a way that is not in line with the existing userspace libraries and
> control-plane implementations means that you're developing those
> features in an artificial environment that doesn't resemble real 3GPP
> interoperable networks out there.
>
> As indicated, I'm not against adding additional interfaces, but we have
> to make sure that we add IPv6 support (or any new feature support) to at
> least libgtpnl, and to make sure we test interoperability with existing
> 3GPP network equipment such as real IPv6 capable phones and SGSNs.
>
>> > I'm not sure if this is a useful feature.  GTP is used only in
>> > operator-controlled networks and only on standard ports.  It's not
>> > possible to negotiate any non-standard ports on the signaling plane
>> > either.
>> >
>> Bear in mind that we're not required to do everything the GTP spec
>> says.
>
> Yes, we are, at least as long as it affects interoperability with other
> implementations out there.
>
> GTP uses well-known port numbers on *both* sides of the tunnel, and you
> cannot deviate from that.
>
> There's no point in having all kinds of features in the GTP user plane
> which are not interoperable with other implementations, and which are
> completely outside of the information model / architecture of GTP.
>
> In the real world, GTP-U is only used in combination with GTP-C.  And in
> GTP-C you can only negotiate the IP address of both sides of GTP-U, and
> not the port number information.  As a result, the port numbers are
> static on both sides.
>
>> My impression is GTP designers probably didn't think in terms of
>> getting best performance. But we can ;-)
>
> I think it's wasted efforts if it's about "random udp ports" as no
> standards-compliant implementation out there with which you will have to
> interoperate will be able to support it.
>
> GTP is used between home and roaming operator.  If you want to introduce
> changes to how it works, you will have to have control over both sides
> of the implementation of both the GTP-C and the GTP-u plane, which is
> very unlikely and rather the exception in the hundreds of operators you
> interoperate with.  Also keep in mind that there often are various
> "middleboxes" that will suddenly have to reflect your changes.  That
> starts from packet filters at various locations in the operator networks
> and/or roaming hubs, down to GTP hubs and others.
>
> My opinion is: Non-standard GTP ports

Development Setup

2017-09-19 Thread sdnlabs Janakaraj
Dear all,
I am a newbie, and I am curious to know which development tools, with
Ubuntu as the host OS, will best fit people entering Linux kernel
development focusing on Netlink, netdev and wireless MAC.


I have read many blogs describing the basic setup and things like
that. But I felt that input from current developers in the same field
would be more useful.





-devprabhu-


Re: [PATCH net] ipv6: fix net.ipv6.conf.all interface DAD handlers

2017-09-19 Thread David Miller
From: Matteo Croce 
Date: Tue, 12 Sep 2017 17:46:37 +0200

> Currently, writing into
> net.ipv6.conf.all.{accept_dad,use_optimistic,optimistic_dad} has no effect.
> Fix handling of these flags by:
> 
> - using the maximum of global and per-interface values for the
>   accept_dad flag. That is, if at least one of the two values is
>   non-zero, enable DAD on the interface. If at least one value is
>   set to 2, enable DAD and disable IPv6 operation on the interface if
>   MAC-based link-local address was found
> 
> - using the logical OR of global and per-interface values for the
>   optimistic_dad flag. If at least one of them is set to one, optimistic
>   duplicate address detection (RFC 4429) is enabled on the interface
> 
> - using the logical OR of global and per-interface values for the
>   use_optimistic flag. If at least one of them is set to one,
>   optimistic addresses won't be marked as deprecated during source address
>   selection on the interface.
> 
> While at it, as we're modifying the prototype for ipv6_use_optimistic_addr(),
> drop inline, and let the compiler decide.
> 
> Fixes: 7fd2561e4ebd ("net: ipv6: Add a sysctl to make optimistic addresses 
> useful candidates")
> Signed-off-by: Matteo Croce 
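
For reference, the knobs in question are the global and per-interface
sysctls, e.g. (interface name hypothetical):

  $ sysctl -w net.ipv6.conf.all.optimistic_dad=1
  $ sysctl -w net.ipv6.conf.eth0.use_optimistic=1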

Applied, thank you.


Re: [PATCH] net: ipv6: fix regression of no RTM_DELADDR sent after DAD failure

2017-09-19 Thread David Miller
From: Mike Manning 
Date: Mon, 18 Sep 2017 14:06:40 +0100

> In the absence of a reply from Mahesh, I would be most grateful for
> anyone familiar with the IPv6 code to review this 1-line fix.
> 
> Or if not, then I request that the commit f784ad3d79e5 is backed out,
> as its intention is to remove the redundant but harmless RTM_DELADDR
> for addresses in tentative state, but is also incorrectly removing the
> very necessary RTM_DELADDR when an address is deleted that was previously
> notified with an RTM_NEWADDR as being in tentative dadfailed state.

I've applied your patch, and queued it up for -stable, thanks.


Re: [PATCH net v2] bpf: fix ri->map_owner pointer on bpf_prog_realloc

2017-09-19 Thread David Miller
From: Daniel Borkmann 
Date: Wed, 20 Sep 2017 00:44:21 +0200

> Commit 109980b894e9 ("bpf: don't select potentially stale
> ri->map from buggy xdp progs") passed the pointer to the prog
> itself to be loaded into r4 prior on bpf_redirect_map() helper
> call, so that we can store the owner into ri->map_owner out of
> the helper.
> 
> Issue with that is that the actual address of the prog is still
> subject to change when subsequent rewrites occur that require
> slow path in bpf_prog_realloc() to alloc more memory, e.g. from
> patching inlining helper functions or constant blinding. Thus,
> we really need to take prog->aux as the address we're holding,
> which also works with prog clones as they share the same aux
> object.
> 
> Instead of then fetching aux->prog during runtime, which could
> potentially incur cache misses due to false sharing, we are
> going to just use aux for comparison on the map owner. This
> will also keep the patchlet of the same size, and later check
> in xdp_map_invalid() only accesses read-only aux pointer from
> the prog, it's also in the same cacheline already from prior
> access when calling bpf_func.
> 
> Fixes: 109980b894e9 ("bpf: don't select potentially stale ri->map from buggy 
> xdp progs")
> Signed-off-by: Daniel Borkmann 
> Acked-by: Alexei Starovoitov 
> ---
>  v1->v2:
>   - Decided to go with prog->aux instead.
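
For reference, a sketch of the check being described (assumed shape, not
the verbatim hunk):

	static bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
				    unsigned long aux)
	{
		/* aux was recorded as ri->map_owner at redirect time;
		 * prog->aux is stable across bpf_prog_realloc() and is
		 * shared by prog clones, so the comparison stays valid */
		return (unsigned long)xdp_prog->aux != aux;
	}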

Applied, thanks Daniel.


Re: [PATCH v2 net-next 0/7] net: speedup netns create/delete time

2017-09-19 Thread David Miller
From: Eric Dumazet 
Date: Tue, 19 Sep 2017 16:27:02 -0700

> When rate of netns creation/deletion is high enough,
> we observe softlockups in cleanup_net() caused by huge list
> of netns and way too many rcu_barrier() calls.
> 
> This patch series does some optimizations in kobject,
> and add batching to tunnels so that netns dismantles are
> less costly.
 ...

Series applied, thanks Eric.


Re: [PATCH net-next] net_sched: no need to free qdisc in RCU callback

2017-09-19 Thread David Miller
From: Cong Wang 
Date: Tue, 19 Sep 2017 13:15:42 -0700

> gen estimator has been rewritten in commit 1c0d32fde5bd
> ("net_sched: gen_estimator: complete rewrite of rate estimators"),
> the caller no longer needs to wait for a grace period. So this
> patch gets rid of it.
> 
> Cc: Jamal Hadi Salim 
> Cc: Eric Dumazet 
> Signed-off-by: Cong Wang 

Nice.

Applied, thanks.


Re: [PATCH] isdn/i4l: check the message proto does not change across fetches

2017-09-19 Thread David Miller
From: Meng Xu 
Date: Tue, 19 Sep 2017 14:52:58 -0400

> In isdn_ppp_write(), the header (i.e., protobuf) of the buffer is fetched
> twice from userspace. The first fetch is used to peek at the protocol
> of the message and reset the huptimer if necessary; while the second
> fetch copies in the whole buffer. However, given that buf resides in
> userspace memory, a user process can race to change its memory content
> across fetches. By doing so, we can either avoid resetting the huptimer
> for any type of packets (by first setting proto to PPP_LCP and later
> change to the actual type) or force resetting the huptimer for LCP packets.
> 
> This patch does a memcmp between the two fetches and abort if changes to
> the protobuf is detected across fetches.
> 
> Signed-off-by: Meng Xu 

Doing a memcmp() for every buffer is expensive, ugly, and not the
way we usually handle this kind of issue.

Instead, atomically copy the entire buffer, as needed.

Something like:

struct sk_buff *skb = NULL;
unsigned char protobuf[4];
unsigned char *cpy_buf;

if (lp->isdn_device >= 0 && lp->isdn_channel >= 0 &&
(dev->drv[lp->isdn_device]->flags & DRV_FLAG_RUNNING) &&
lp->dialstate == 0 &&
(lp->flags & ISDN_NET_CONNECTED)) {
/*
 * we need to reserve enough space in front of
 * sk_buff. old call to dev_alloc_skb only reserved
 * 16 bytes, now we are looking what the driver want
 */
hl = dev->drv[lp->isdn_device]->interface->hl_hdrlen;
skb = alloc_skb(hl + count, GFP_ATOMIC);
if (!skb) {
printk(KERN_WARNING "isdn_ppp_write: out of 
memory!\n");
return count;
}
skb_reserve(skb, hl);
cpy_buf = skb_put(skb, count);
} else {
cpy_buf = protobuf;
count = sizeof(protobuf);
}
if (copy_from_user(cpy_buf, buf, count)) {
kfree_skb(skb);
return -EFAULT;
}
proto = PPP_PROTOCOL(cpy_buf);
if (proto != PPP_LCP)
lp->huptimer = 0;
...


[PATCH v2 net-next 1/7] kobject: add kobject_uevent_net_broadcast()

2017-09-19 Thread Eric Dumazet
This removes some #ifdef pollution and will ease follow-up patches.

Signed-off-by: Eric Dumazet 
---
 lib/kobject_uevent.c | 96 +---
 1 file changed, 53 insertions(+), 43 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 
e590523ea4761425df5e112a2c2aab873dbaa90d..4f48cc3b11d566e44c4115cc7716bc3b1cdf96df
 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -294,6 +294,57 @@ static void cleanup_uevent_env(struct subprocess_info 
*info)
 }
 #endif
 
+static int kobject_uevent_net_broadcast(struct kobject *kobj,
+   struct kobj_uevent_env *env,
+   const char *action_string,
+   const char *devpath)
+{
+   int retval = 0;
+#if defined(CONFIG_NET)
+   struct uevent_sock *ue_sk;
+
+   /* send netlink message */
+   list_for_each_entry(ue_sk, &uevent_sock_list, list) {
+   struct sock *uevent_sock = ue_sk->sk;
+   struct sk_buff *skb;
+   size_t len;
+
+   if (!netlink_has_listeners(uevent_sock, 1))
+   continue;
+
+   /* allocate message with the maximum possible size */
+   len = strlen(action_string) + strlen(devpath) + 2;
+   skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+   if (skb) {
+   char *scratch;
+   int i;
+
+   /* add header */
+   scratch = skb_put(skb, len);
+   sprintf(scratch, "%s@%s", action_string, devpath);
+
+   /* copy keys to our continuous event payload buffer */
+   for (i = 0; i < env->envp_idx; i++) {
+   len = strlen(env->envp[i]) + 1;
+   scratch = skb_put(skb, len);
+   strcpy(scratch, env->envp[i]);
+   }
+
+   NETLINK_CB(skb).dst_group = 1;
+   retval = netlink_broadcast_filtered(uevent_sock, skb,
+   0, 1, GFP_KERNEL,
+   kobj_bcast_filter,
+   kobj);
+   /* ENOBUFS should be handled in userspace */
+   if (retval == -ENOBUFS || retval == -ESRCH)
+   retval = 0;
+   } else
+   retval = -ENOMEM;
+   }
+#endif
+   return retval;
+}
+
 /**
  * kobject_uevent_env - send an uevent with environmental data
  *
@@ -316,9 +367,6 @@ int kobject_uevent_env(struct kobject *kobj, enum 
kobject_action action,
const struct kset_uevent_ops *uevent_ops;
int i = 0;
int retval = 0;
-#ifdef CONFIG_NET
-   struct uevent_sock *ue_sk;
-#endif
 
pr_debug("kobject: '%s' (%p): %s\n",
 kobject_name(kobj), kobj, __func__);
@@ -427,46 +475,8 @@ int kobject_uevent_env(struct kobject *kobj, enum 
kobject_action action,
mutex_unlock(&uevent_sock_mutex);
goto exit;
}
-
-#if defined(CONFIG_NET)
-   /* send netlink message */
-   list_for_each_entry(ue_sk, &uevent_sock_list, list) {
-   struct sock *uevent_sock = ue_sk->sk;
-   struct sk_buff *skb;
-   size_t len;
-
-   if (!netlink_has_listeners(uevent_sock, 1))
-   continue;
-
-   /* allocate message with the maximum possible size */
-   len = strlen(action_string) + strlen(devpath) + 2;
-   skb = alloc_skb(len + env->buflen, GFP_KERNEL);
-   if (skb) {
-   char *scratch;
-
-   /* add header */
-   scratch = skb_put(skb, len);
-   sprintf(scratch, "%s@%s", action_string, devpath);
-
-   /* copy keys to our continuous event payload buffer */
-   for (i = 0; i < env->envp_idx; i++) {
-   len = strlen(env->envp[i]) + 1;
-   scratch = skb_put(skb, len);
-   strcpy(scratch, env->envp[i]);
-   }
-
-   NETLINK_CB(skb).dst_group = 1;
-   retval = netlink_broadcast_filtered(uevent_sock, skb,
-   0, 1, GFP_KERNEL,
-   kobj_bcast_filter,
-   kobj);
-   /* ENOBUFS should be handled in userspace */
-   if (retval == -ENOBUFS || retval == -ESRCH)
-   retval = 0;
-   } else
-   

[PATCH v2 net-next 5/7] tcp: batch tcp_net_metrics_exit

2017-09-19 Thread Eric Dumazet
When dealing with a list of dismantling netns, we can scan
tcp_metrics once, saving cpu cycles.

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_metrics.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c
index 
102b2c90bb807d3a88d31b59324baf72cf901cdf..0ab78abc811bef0388089befed672e3d4ee9d881
 100644
--- a/net/ipv4/tcp_metrics.c
+++ b/net/ipv4/tcp_metrics.c
@@ -892,10 +892,14 @@ static void tcp_metrics_flush_all(struct net *net)
 
for (row = 0; row < max_rows; row++, hb++) {
struct tcp_metrics_block __rcu **pp;
+   bool match;
+
spin_lock_bh(&tcp_metrics_lock);
pp = &hb->chain;
for (tm = deref_locked(*pp); tm; tm = deref_locked(*pp)) {
-   if (net_eq(tm_net(tm), net)) {
+   match = net ? net_eq(tm_net(tm), net) :
+   !atomic_read(&tm_net(tm)->count);
+   if (match) {
*pp = tm->tcpm_next;
kfree_rcu(tm, rcu_head);
} else {
@@ -1018,14 +1022,14 @@ static int __net_init tcp_net_metrics_init(struct net 
*net)
return 0;
 }
 
-static void __net_exit tcp_net_metrics_exit(struct net *net)
+static void __net_exit tcp_net_metrics_exit_batch(struct list_head *net_exit_list)
 {
-   tcp_metrics_flush_all(net);
+   tcp_metrics_flush_all(NULL);
 }
 
 static __net_initdata struct pernet_operations tcp_net_metrics_ops = {
-   .init   =   tcp_net_metrics_init,
-   .exit   =   tcp_net_metrics_exit,
+   .init   =   tcp_net_metrics_init,
+   .exit_batch =   tcp_net_metrics_exit_batch,
 };
 
 void __init tcp_metrics_init(void)
-- 
2.14.1.690.gbb1197296e-goog
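
The net == NULL convention introduced above can be restated as a tiny
predicate (a sketch for readability only; tm_should_flush() is a
hypothetical name, the logic mirrors the hunk above):

static bool tm_should_flush(struct tcp_metrics_block *tm,
			    const struct net *net)
{
	/* Specific netns: flush only its entries. */
	if (net)
		return net_eq(tm_net(tm), net);
	/* Batch path: flush entries of any netns already torn down,
	 * i.e. whose refcount dropped to zero in cleanup_net().
	 */
	return !atomic_read(&tm_net(tm)->count);
}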



[PATCH v2 net-next 6/7] ipv6: speedup ipv6 tunnels dismantle

2017-09-19 Thread Eric Dumazet
Implement exit_batch() method to dismantle more devices
per round.

(rtnl_lock() ...
 unregister_netdevice_many() ...
 rtnl_unlock())

Tested:
$ cat add_del_unshare.sh
for i in `seq 1 40`
do
 (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
done
wait ; grep net_namespace /proc/slabinfo

Before patch :
$ time ./add_del_unshare.sh
net_namespace110267   550412 : tunables840 : 
slabdata110267  0

real3m25.292s
user0m0.644s
sys 0m40.153s

After patch:

$ time ./add_del_unshare.sh
net_namespace126282   550412 : tunables840 : 
slabdata126282  0

real1m38.965s
user0m0.688s
sys 0m37.017s

Signed-off-by: Eric Dumazet 
---
 net/ipv6/ip6_gre.c|  8 +---
 net/ipv6/ip6_tunnel.c | 20 +++-
 net/ipv6/ip6_vti.c| 23 ++-
 net/ipv6/sit.c|  9 ++---
 4 files changed, 36 insertions(+), 24 deletions(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 
b7a72d40933441f835708f55e2d8af371661a5fb..c82d41ef25e283ff92b1eed1f8b927c9d7b8f333
 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -1155,19 +1155,21 @@ static int __net_init ip6gre_init_net(struct net *net)
return err;
 }
 
-static void __net_exit ip6gre_exit_net(struct net *net)
+static void __net_exit ip6gre_exit_batch_net(struct list_head *net_list)
 {
+   struct net *net;
LIST_HEAD(list);
 
rtnl_lock();
-   ip6gre_destroy_tunnels(net, &list);
+   list_for_each_entry(net, net_list, exit_list)
+   ip6gre_destroy_tunnels(net, &list);
unregister_netdevice_many(&list);
rtnl_unlock();
 }
 
 static struct pernet_operations ip6gre_net_ops = {
.init = ip6gre_init_net,
-   .exit = ip6gre_exit_net,
+   .exit_batch = ip6gre_exit_batch_net,
.id   = &ip6gre_net_id,
.size = sizeof(struct ip6gre_net),
 };
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index 
ae73164559d5c4d7f2650ae63c56d76dc93b165c..3d6df489b39f00014f330340927c4d11a64911c2
 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -2167,17 +2167,16 @@ static struct xfrm6_tunnel ip6ip6_handler __read_mostly 
= {
.priority   =   1,
 };
 
-static void __net_exit ip6_tnl_destroy_tunnels(struct net *net)
+static void __net_exit ip6_tnl_destroy_tunnels(struct net *net, struct 
list_head *list)
 {
struct ip6_tnl_net *ip6n = net_generic(net, ip6_tnl_net_id);
struct net_device *dev, *aux;
int h;
struct ip6_tnl *t;
-   LIST_HEAD(list);
 
for_each_netdev_safe(net, dev, aux)
if (dev->rtnl_link_ops == &ip6_link_ops)
-   unregister_netdevice_queue(dev, &list);
+   unregister_netdevice_queue(dev, list);
 
for (h = 0; h < IP6_TUNNEL_HASH_SIZE; h++) {
t = rtnl_dereference(ip6n->tnls_r_l[h]);
@@ -2186,12 +2185,10 @@ static void __net_exit ip6_tnl_destroy_tunnels(struct 
net *net)
 * been added to the list by the previous loop.
 */
if (!net_eq(dev_net(t->dev), net))
-   unregister_netdevice_queue(t->dev, &list);
+   unregister_netdevice_queue(t->dev, list);
t = rtnl_dereference(t->next);
}
}
-
-   unregister_netdevice_many(&list);
 }
 
 static int __net_init ip6_tnl_init_net(struct net *net)
@@ -2235,16 +2232,21 @@ static int __net_init ip6_tnl_init_net(struct net *net)
return err;
 }
 
-static void __net_exit ip6_tnl_exit_net(struct net *net)
+static void __net_exit ip6_tnl_exit_batch_net(struct list_head *net_list)
 {
+   struct net *net;
+   LIST_HEAD(list);
+
rtnl_lock();
-   ip6_tnl_destroy_tunnels(net);
+   list_for_each_entry(net, net_list, exit_list)
+   ip6_tnl_destroy_tunnels(net, &list);
+   unregister_netdevice_many(&list);
rtnl_unlock();
 }
 
 static struct pernet_operations ip6_tnl_net_ops = {
.init = ip6_tnl_init_net,
-   .exit = ip6_tnl_exit_net,
+   .exit_batch = ip6_tnl_exit_batch_net,
.id   = &ip6_tnl_net_id,
.size = sizeof(struct ip6_tnl_net),
 };
diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index 
79444a4bfd6d245b66a7edcefe2b5b32801bf2c0..714914d1bb987c46cc98817903ec7bcc367a1b2d
 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -1052,23 +1052,22 @@ static struct rtnl_link_ops vti6_link_ops __read_mostly 
= {
.get_link_net   = ip6_tnl_get_link_net,
 };
 
-static void __net_exit vti6_destroy_tunnels(struct vti6_net *ip6n)
+static void __net_exit vti6_destroy_tunnels(struct vti6_net *ip6n,
+   struct list_head *list)
 {
int h;
struct ip6_tnl *t;
-   LIST_HEAD(list);
 
for (h = 0; h < IP6_VTI_HASH_SIZE; h++) {
  

[PATCH v2 net-next 4/7] ipv6: addrlabel: per netns list

2017-09-19 Thread Eric Dumazet
Having a global list of labels does not scale to thousands of
netns in the cloud era. It causes quadratic behavior on
netns creation and deletion.

It is time to switch to a per netns list of ~10 labels.

Tested:

$ time perf record (for f in `seq 1 3000` ; do ip netns add tast$f; done)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 3.637 MB perf.data (~158898 samples) ]

real0m20.837s # instead of 0m24.227s
user0m0.328s
sys 0m20.338s # instead of 0m23.753s

16.17%   ip  [kernel.kallsyms]  [k] netlink_broadcast_filtered
12.30%   ip  [kernel.kallsyms]  [k] netlink_has_listeners
 6.76%   ip  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
 5.78%   ip  [kernel.kallsyms]  [k] memset_erms
 5.77%   ip  [kernel.kallsyms]  [k] kobject_uevent_env
 5.18%   ip  [kernel.kallsyms]  [k] refcount_sub_and_test
 4.96%   ip  [kernel.kallsyms]  [k] _raw_read_lock
 3.82%   ip  [kernel.kallsyms]  [k] refcount_inc_not_zero
 3.33%   ip  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 2.11%   ip  [kernel.kallsyms]  [k] unmap_page_range
 1.77%   ip  [kernel.kallsyms]  [k] __wake_up
 1.69%   ip  [kernel.kallsyms]  [k] strlen
 1.17%   ip  [kernel.kallsyms]  [k] __wake_up_common
 1.09%   ip  [kernel.kallsyms]  [k] insert_header
 1.04%   ip  [kernel.kallsyms]  [k] page_remove_rmap
 1.01%   ip  [kernel.kallsyms]  [k] consume_skb
 0.98%   ip  [kernel.kallsyms]  [k] netlink_trim
 0.51%   ip  [kernel.kallsyms]  [k] kernfs_link_sibling
 0.51%   ip  [kernel.kallsyms]  [k] filemap_map_pages
 0.46%   ip  [kernel.kallsyms]  [k] memcpy_erms

Signed-off-by: Eric Dumazet 
---
 include/net/netns/ipv6.h |  5 +++
 net/ipv6/addrlabel.c | 81 ++--
 2 files changed, 35 insertions(+), 51 deletions(-)

diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 
2544f9760a4263b7f1b8d622331ca63038586137..2ea1ed341ef81901b4fa271b0f7f4592e17c4f8a
 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -89,6 +89,11 @@ struct netns_ipv6 {
atomic_tfib6_sernum;
struct seg6_pernet_data *seg6_data;
struct fib_notifier_ops *notifier_ops;
+   struct {
+   struct hlist_head head;
+   spinlock_t  lock;
+   u32 seq;
+   } ip6addrlbl_table;
 };
 
 #if IS_ENABLED(CONFIG_NF_DEFRAG_IPV6)
diff --git a/net/ipv6/addrlabel.c b/net/ipv6/addrlabel.c
index 
b055bc79f56d555c89684116c1580984950f77a8..c6311d7108f651c7385cd6316752ba4a86667dcc
 100644
--- a/net/ipv6/addrlabel.c
+++ b/net/ipv6/addrlabel.c
@@ -30,7 +30,6 @@
  * Policy Table
  */
 struct ip6addrlbl_entry {
-   possible_net_t lbl_net;
struct in6_addr prefix;
int prefixlen;
int ifindex;
@@ -41,19 +40,6 @@ struct ip6addrlbl_entry {
struct rcu_head rcu;
 };
 
-static struct ip6addrlbl_table
-{
-   struct hlist_head head;
-   spinlock_t lock;
-   u32 seq;
-} ip6addrlbl_table;
-
-static inline
-struct net *ip6addrlbl_net(const struct ip6addrlbl_entry *lbl)
-{
-   return read_pnet(&lbl->lbl_net);
-}
-
 /*
  * Default policy table (RFC6724 + extensions)
  *
@@ -148,13 +134,10 @@ static inline void ip6addrlbl_put(struct ip6addrlbl_entry 
*p)
 }
 
 /* Find label */
-static bool __ip6addrlbl_match(struct net *net,
-  const struct ip6addrlbl_entry *p,
+static bool __ip6addrlbl_match(const struct ip6addrlbl_entry *p,
   const struct in6_addr *addr,
   int addrtype, int ifindex)
 {
-   if (!net_eq(ip6addrlbl_net(p), net))
-   return false;
if (p->ifindex && p->ifindex != ifindex)
return false;
if (p->addrtype && p->addrtype != addrtype)
@@ -169,8 +152,9 @@ static struct ip6addrlbl_entry *__ipv6_addr_label(struct 
net *net,
  int type, int ifindex)
 {
struct ip6addrlbl_entry *p;
-   hlist_for_each_entry_rcu(p, &ip6addrlbl_table.head, list) {
-   if (__ip6addrlbl_match(net, p, addr, type, ifindex))
+
+   hlist_for_each_entry_rcu(p, &net->ipv6.ip6addrlbl_table.head, list) {
+   if (__ip6addrlbl_match(p, addr, type, ifindex))
return p;
}
return NULL;
@@ -196,8 +180,7 @@ u32 ipv6_addr_label(struct net *net,
 }
 
 /* allocate one entry */
-static struct ip6addrlbl_entry *ip6addrlbl_alloc(struct net *net,
-const struct in6_addr *prefix,
+static struct ip6addrlbl_entry *ip6addrlbl_alloc(const struct in6_addr *prefix,
 int prefixlen, int ifindex,
 u32 label)
 {
@@ -236,24 +219,23 @@ static struct ip6addrlbl_entry *ip6addrl

[PATCH v2 net-next 3/7] kobject: factorize skb setup in kobject_uevent_net_broadcast()

2017-09-19 Thread Eric Dumazet
We can build one skb and let it be cloned in netlink.

This is much faster, and uses less memory (all clones will
share the same skb->head).

Tested:

time perf record (for f in `seq 1 3000` ; do ip netns add tast$f; done)
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 4.110 MB perf.data (~179584 samples) ]

real0m24.227s # instead of 0m52.554s
user0m0.329s
sys 0m23.753s # instead of 0m51.375s

14.77%   ip  [kernel.kallsyms]  [k] __ip6addrlbl_add
14.56%   ip  [kernel.kallsyms]  [k] netlink_broadcast_filtered
11.65%   ip  [kernel.kallsyms]  [k] netlink_has_listeners
 6.19%   ip  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
 5.66%   ip  [kernel.kallsyms]  [k] kobject_uevent_env
 4.97%   ip  [kernel.kallsyms]  [k] memset_erms
 4.67%   ip  [kernel.kallsyms]  [k] refcount_sub_and_test
 4.41%   ip  [kernel.kallsyms]  [k] _raw_read_lock
 3.59%   ip  [kernel.kallsyms]  [k] refcount_inc_not_zero
 3.13%   ip  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 1.55%   ip  [kernel.kallsyms]  [k] __wake_up
 1.20%   ip  [kernel.kallsyms]  [k] strlen
 1.03%   ip  [kernel.kallsyms]  [k] __wake_up_common
 0.93%   ip  [kernel.kallsyms]  [k] consume_skb
 0.92%   ip  [kernel.kallsyms]  [k] netlink_trim
 0.87%   ip  [kernel.kallsyms]  [k] insert_header
 0.63%   ip  [kernel.kallsyms]  [k] unmap_page_range

Signed-off-by: Eric Dumazet 
---
 lib/kobject_uevent.c | 34 +++---
 1 file changed, 19 insertions(+), 15 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 
78b2a7e378c0deda3b32b1178d7f44203702c3f2..147db91c10d06485868ff56626a5a9b073a8a846
 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -301,23 +301,26 @@ static int kobject_uevent_net_broadcast(struct kobject 
*kobj,
 {
int retval = 0;
 #if defined(CONFIG_NET)
+   struct sk_buff *skb = NULL;
struct uevent_sock *ue_sk;
 
/* send netlink message */
list_for_each_entry(ue_sk, &uevent_sock_list, list) {
struct sock *uevent_sock = ue_sk->sk;
-   struct sk_buff *skb;
-   size_t len;
 
if (!netlink_has_listeners(uevent_sock, 1))
continue;
 
-   /* allocate message with the maximum possible size */
-   len = strlen(action_string) + strlen(devpath) + 2;
-   skb = alloc_skb(len + env->buflen, GFP_KERNEL);
-   if (skb) {
+   if (!skb) {
+   /* allocate message with the maximum possible size */
+   size_t len = strlen(action_string) + strlen(devpath) + 2;
char *scratch;
 
+   retval = -ENOMEM;
+   skb = alloc_skb(len + env->buflen, GFP_KERNEL);
+   if (!skb)
+   continue;
+
/* add header */
scratch = skb_put(skb, len);
sprintf(scratch, "%s@%s", action_string, devpath);
@@ -325,16 +328,17 @@ static int kobject_uevent_net_broadcast(struct kobject 
*kobj,
skb_put_data(skb, env->buf, env->buflen);
 
NETLINK_CB(skb).dst_group = 1;
-   retval = netlink_broadcast_filtered(uevent_sock, skb,
-   0, 1, GFP_KERNEL,
-   kobj_bcast_filter,
-   kobj);
-   /* ENOBUFS should be handled in userspace */
-   if (retval == -ENOBUFS || retval == -ESRCH)
-   retval = 0;
-   } else
-   retval = -ENOMEM;
+   }
+
+   retval = netlink_broadcast_filtered(uevent_sock, skb_get(skb),
+   0, 1, GFP_KERNEL,
+   kobj_bcast_filter,
+   kobj);
+   /* ENOBUFS should be handled in userspace */
+   if (retval == -ENOBUFS || retval == -ESRCH)
+   retval = 0;
}
+   consume_skb(skb);
 #endif
return retval;
 }
-- 
2.14.1.690.gbb1197296e-goog
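
The reference counting that makes the shared-skb approach above safe is
worth spelling out (an annotated restatement of the loop, not new code):

/* Per listener socket:
 *   skb = alloc_skb(...);                  skb->users == 1 (ours)
 *   netlink_broadcast_filtered(sk, skb_get(skb), ...);
 *                                          skb_get() -> users == 2;
 *                                          the broadcast consumes one
 * After the loop:
 *   consume_skb(skb);                      drops our last reference
 *
 * Each receiver gets an skb clone sharing the same skb->head, so the
 * header and env payload are formatted exactly once.
 */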



[PATCH v2 net-next 7/7] ipv4: speedup ipv6 tunnels dismantle

2017-09-19 Thread Eric Dumazet
Implement exit_batch() method to dismantle more devices
per round.

(rtnl_lock() ...
 unregister_netdevice_many() ...
 rtnl_unlock())

Tested:
$ cat add_del_unshare.sh
for i in `seq 1 40`
do
 (for j in `seq 1 100` ; do unshare -n /bin/true >/dev/null ; done) &
done
wait ; grep net_namespace /proc/slabinfo

Before patch :
$ time ./add_del_unshare.sh
net_namespace126282   550412 : tunables840 : 
slabdata126282  0

real1m38.965s
user0m0.688s
sys 0m37.017s

After patch:
$ time ./add_del_unshare.sh
net_namespace135291   550412 : tunables840 : 
slabdata135291  0

real0m22.117s
user0m0.728s
sys 0m35.328s

Signed-off-by: Eric Dumazet 
---
 include/net/ip_tunnels.h |  3 ++-
 net/ipv4/ip_gre.c| 22 +-
 net/ipv4/ip_tunnel.c | 12 +---
 net/ipv4/ip_vti.c|  7 +++
 net/ipv4/ipip.c  |  7 +++
 5 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 
992652856fe8c7c1032e0f5f92ce7ee5aa0119da..b41a1e057fcec9d6e4c5a0c1cafd1f1d537ccd53
 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -258,7 +258,8 @@ int ip_tunnel_get_iflink(const struct net_device *dev);
 int ip_tunnel_init_net(struct net *net, unsigned int ip_tnl_net_id,
   struct rtnl_link_ops *ops, char *devname);
 
-void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops);
+void ip_tunnel_delete_nets(struct list_head *list_net, unsigned int id,
+  struct rtnl_link_ops *ops);
 
 void ip_tunnel_xmit(struct sk_buff *skb, struct net_device *dev,
const struct iphdr *tnl_params, const u8 protocol);
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 
0162fb955b33abf18514cbfd482e72a0ebce6e48..9cee986ac6b8ed04ff95e193fe1e8e60e74d84a9
 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -1013,15 +1013,14 @@ static int __net_init ipgre_init_net(struct net *net)
return ip_tunnel_init_net(net, ipgre_net_id, &ipgre_link_ops, NULL);
 }
 
-static void __net_exit ipgre_exit_net(struct net *net)
+static void __net_exit ipgre_exit_batch_net(struct list_head *list_net)
 {
-   struct ip_tunnel_net *itn = net_generic(net, ipgre_net_id);
-   ip_tunnel_delete_net(itn, &ipgre_link_ops);
+   ip_tunnel_delete_nets(list_net, ipgre_net_id, &ipgre_link_ops);
 }
 
 static struct pernet_operations ipgre_net_ops = {
.init = ipgre_init_net,
-   .exit = ipgre_exit_net,
+   .exit_batch = ipgre_exit_batch_net,
.id   = &ipgre_net_id,
.size = sizeof(struct ip_tunnel_net),
 };
@@ -1540,15 +1539,14 @@ static int __net_init ipgre_tap_init_net(struct net 
*net)
return ip_tunnel_init_net(net, gre_tap_net_id, &ipgre_tap_ops, "gretap0");
 }
 
-static void __net_exit ipgre_tap_exit_net(struct net *net)
+static void __net_exit ipgre_tap_exit_batch_net(struct list_head *list_net)
 {
-   struct ip_tunnel_net *itn = net_generic(net, gre_tap_net_id);
-   ip_tunnel_delete_net(itn, &ipgre_tap_ops);
+   ip_tunnel_delete_nets(list_net, gre_tap_net_id, &ipgre_tap_ops);
 }
 
 static struct pernet_operations ipgre_tap_net_ops = {
.init = ipgre_tap_init_net,
-   .exit = ipgre_tap_exit_net,
+   .exit_batch = ipgre_tap_exit_batch_net,
.id   = &gre_tap_net_id,
.size = sizeof(struct ip_tunnel_net),
 };
@@ -1559,16 +1557,14 @@ static int __net_init erspan_init_net(struct net *net)
  &erspan_link_ops, "erspan0");
 }
 
-static void __net_exit erspan_exit_net(struct net *net)
+static void __net_exit erspan_exit_batch_net(struct list_head *net_list)
 {
-   struct ip_tunnel_net *itn = net_generic(net, erspan_net_id);
-
-   ip_tunnel_delete_net(itn, &erspan_link_ops);
+   ip_tunnel_delete_nets(net_list, erspan_net_id, &erspan_link_ops);
 }
 
 static struct pernet_operations erspan_net_ops = {
.init = erspan_init_net,
-   .exit = erspan_exit_net,
+   .exit_batch = erspan_exit_batch_net,
.id   = &erspan_net_id,
.size = sizeof(struct ip_tunnel_net),
 };
diff --git a/net/ipv4/ip_tunnel.c b/net/ipv4/ip_tunnel.c
index 
e9805ad664ac24c3405ad015cfaab89dc1c95279..fe6fee728ce49d01b55aa478698e1a3bcf9a3bdb
 100644
--- a/net/ipv4/ip_tunnel.c
+++ b/net/ipv4/ip_tunnel.c
@@ -1061,16 +1061,22 @@ static void ip_tunnel_destroy(struct ip_tunnel_net 
*itn, struct list_head *head,
}
 }
 
-void ip_tunnel_delete_net(struct ip_tunnel_net *itn, struct rtnl_link_ops *ops)
+void ip_tunnel_delete_nets(struct list_head *net_list, unsigned int id,
+  struct rtnl_link_ops *ops)
 {
+   struct ip_tunnel_net *itn;
+   struct net *net;
LIST_HEAD(list);
 
rtnl_lock();
-   ip_tunnel_destroy(itn, &list, ops);
+   list_for_each_entry(net, net_list, exit_list) {
+   itn =

[PATCH v2 net-next 2/7] kobject: copy env blob in one go

2017-09-19 Thread Eric Dumazet
No need to iterate over strings, just copy in one efficient memcpy() call.

Tested:
time perf record "(for f in `seq 1 3000` ; do ip netns add tast$f; done)"
[ perf record: Woken up 10 times to write data ]
[ perf record: Captured and wrote 8.224 MB perf.data (~359301 samples) ]

real0m52.554s  # instead of 1m7.492s
user0m0.309s
sys 0m51.375s # instead of 1m6.875s

 9.88%   ip  [kernel.kallsyms]  [k] netlink_broadcast_filtered
 8.86%   ip  [kernel.kallsyms]  [k] string
 7.37%   ip  [kernel.kallsyms]  [k] __ip6addrlbl_add
 5.68%   ip  [kernel.kallsyms]  [k] netlink_has_listeners
 5.52%   ip  [kernel.kallsyms]  [k] memcpy_erms
 4.76%   ip  [kernel.kallsyms]  [k] __alloc_skb
 4.54%   ip  [kernel.kallsyms]  [k] vsnprintf
 3.94%   ip  [kernel.kallsyms]  [k] format_decode
 3.80%   ip  [kernel.kallsyms]  [k] kmem_cache_alloc_node_trace
 3.71%   ip  [kernel.kallsyms]  [k] kmem_cache_alloc_node
 3.66%   ip  [kernel.kallsyms]  [k] kobject_uevent_env
 3.38%   ip  [kernel.kallsyms]  [k] strlen
 2.65%   ip  [kernel.kallsyms]  [k] _raw_spin_lock_irqsave
 2.20%   ip  [kernel.kallsyms]  [k] kfree
 2.09%   ip  [kernel.kallsyms]  [k] memset_erms
 2.07%   ip  [kernel.kallsyms]  [k] ___cache_free
 1.95%   ip  [kernel.kallsyms]  [k] kmem_cache_free
 1.91%   ip  [kernel.kallsyms]  [k] _raw_read_lock
 1.45%   ip  [kernel.kallsyms]  [k] ksize
 1.25%   ip  [kernel.kallsyms]  [k] _raw_spin_unlock_irqrestore
 1.00%   ip  [kernel.kallsyms]  [k] widen_string

Signed-off-by: Eric Dumazet 
---
 lib/kobject_uevent.c | 8 +---
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 
4f48cc3b11d566e44c4115cc7716bc3b1cdf96df..78b2a7e378c0deda3b32b1178d7f44203702c3f2
 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -317,18 +317,12 @@ static int kobject_uevent_net_broadcast(struct kobject 
*kobj,
skb = alloc_skb(len + env->buflen, GFP_KERNEL);
if (skb) {
char *scratch;
-   int i;
 
/* add header */
scratch = skb_put(skb, len);
sprintf(scratch, "%s@%s", action_string, devpath);
 
-   /* copy keys to our continuous event payload buffer */
-   for (i = 0; i < env->envp_idx; i++) {
-   len = strlen(env->envp[i]) + 1;
-   scratch = skb_put(skb, len);
-   strcpy(scratch, env->envp[i]);
-   }
+   skb_put_data(skb, env->buf, env->buflen);
 
NETLINK_CB(skb).dst_group = 1;
retval = netlink_broadcast_filtered(uevent_sock, skb,
-- 
2.14.1.690.gbb1197296e-goog



[PATCH v2 net-next 0/7] net: speedup netns create/delete time

2017-09-19 Thread Eric Dumazet
When rate of netns creation/deletion is high enough,
we observe softlockups in cleanup_net() caused by huge list
of netns and way too many rcu_barrier() calls.

This patch series does some optimizations in kobject,
and adds batching to tunnels so that netns dismantles are
less costly.

IPv6 addrlabels also get a per netns list, and tcp_metrics
also benefits from batch flushing.

This gives me one order of magnitude gain.
(~50 ms -> ~5 ms for one netns create/delete pair)

Tested:

for i in `seq 1 40`
do
 (for j in `seq 1 100` ; do  unshare -n /bin/true >/dev/null ; done) &
done
wait ; grep net_namespace /proc/slabinfo

Before patch series :

$ time ./add_del_unshare.sh
net_namespace116258   550412 : tunables840 : 
slabdata116258  0

real3m24.910s
user0m0.747s
sys 0m43.162s

After :
$ time ./add_del_unshare.sh
net_namespace135291   550412 : tunables840 : 
slabdata135291  0

real0m22.117s
user0m0.728s
sys 0m35.328s


Eric Dumazet (7):
  kobject: add kobject_uevent_net_broadcast()
  kobject: copy env blob in one go
  kobject: factorize skb setup in kobject_uevent_net_broadcast()
  ipv6: addrlabel: per netns list
  tcp: batch tcp_net_metrics_exit
  ipv6: speedup ipv6 tunnels dismantle
  ipv4: speedup ipv6 tunnels dismantle

 include/net/ip_tunnels.h |  3 +-
 include/net/netns/ipv6.h |  5 +++
 lib/kobject_uevent.c | 94 ++--
 net/ipv4/ip_gre.c| 22 +---
 net/ipv4/ip_tunnel.c | 12 +--
 net/ipv4/ip_vti.c|  7 ++--
 net/ipv4/ipip.c  |  7 ++--
 net/ipv4/tcp_metrics.c   | 14 +---
 net/ipv6/addrlabel.c | 81 -
 net/ipv6/ip6_gre.c   |  8 +++--
 net/ipv6/ip6_tunnel.c| 20 ++-
 net/ipv6/ip6_vti.c   | 23 +++-
 net/ipv6/sit.c   |  9 +++--
 13 files changed, 157 insertions(+), 148 deletions(-)

-- 
2.14.1.690.gbb1197296e-goog
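
The per-module changes in this series all follow the same shape; a generic
sketch of the pattern (the foo_* names are placeholders, not from the
series):

static void __net_exit foo_exit_batch_net(struct list_head *net_list)
{
	struct net *net;
	LIST_HEAD(list);

	/* One rtnl_lock()/unregister_netdevice_many() round covers every
	 * dismantling netns, instead of one round per netns.
	 */
	rtnl_lock();
	list_for_each_entry(net, net_list, exit_list)
		foo_queue_tunnels(net, &list);	/* placeholder helper */
	unregister_netdevice_many(&list);
	rtnl_unlock();
}

static struct pernet_operations foo_net_ops = {
	.init       = foo_init_net,		/* placeholder */
	.exit_batch = foo_exit_batch_net,
	.id         = &foo_net_id,
	.size       = sizeof(struct foo_net),
};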



Re: [PATCH net-next 00/14] gtp: Additional feature support

2017-09-19 Thread Harald Welte
Hi Tom,

On Tue, Sep 19, 2017 at 08:59:28AM -0700, Tom Herbert wrote:
> On Tue, Sep 19, 2017 at 5:43 AM, Harald Welte 
> wrote:
> > On Mon, Sep 18, 2017 at 05:38:50PM -0700, Tom Herbert wrote:
> >>   - IPv6 support
> >
> > see my detailed comments in other mails.  It's unfortunately only
> > support for the already "deprecated" IPv6-only PDP contexts, not the
> > more modern v4v6 type.  In order to interoperate with old and new
> > approach, all three cases (v4, v6 and v4v6) should be supported from
> > one code base.
> >
> It sounds like something that can be subsequently added. 

Not entirely; at least on the netlink interface (and any other configuration
interface) you will have to reflect this from the very beginning.  You
have to have an explicit PDP type and cannot rely on the address type to
specify the type of PDP context.  Whatever interfaces are introduced
now will have to remain compatible to any future change.

My strategy to avoid any such possible 'road blocks' from being
introduced would be to simply add v4v6 and v6 support in one go.  The
differences are marginal (having both an IPv6 prefix and a v4 address in
parallel, rather than mutually exclusive only).

> Do you have a reference to the spec?

See http://osmocom.org/issues/2418#note-7 which lists Section 11.2.1.3.2
of 3GPP TS 29.061 in combination with RFC3314, RFC7066, RFC6459 and
3GPP TS 23.060 9.2.1 as well as a summary of my understanding of it some
months ago.

> >>   - Configurable networking interfaces so that GTP kernel can be
> >>   used and tested without needing GSN network emulation (i.e. no
> >>   user space daemon needed).
> >
> > We have some pretty decent userspace utilities for configuring the
> > GTP interfaces and tunnels in the libgtpnl repository, but if it
> > helps people to have another way of configuration, I won't be
> > against it.
> >
> AFAIK those userspace utilities don't support IPv6. 

Of course not [yet]. libgtpnl and the command line tools have been
implemented specifically for the in-kernel GTP driver, and you have to
make sure to add related support on both the kernel and the userspace
side (libgtpnl). So there's little point in adding features on either
side before the other side.  There would be no way to test...

> Being able to configure GTP like any other encapsulation will
> facilitate development of IPv6 and other features.

That may very well be the case, but adding "IPv6 support" to kernel GTP
in a way that is not in line with the existing userspace libraries and
control-plane implementations means that you're developing those
features in an artificial environment that doesn't resemble real 3GPP
interoperable networks out there.

As indicated, I'm not against adding additional interfaces, but we have
to make sure that we add IPv6 support (or any new feature support) to at
least libgtpnl, and to make sure we test interoperability with existing
3GPP network equipment such as real IPv6 capable phones and SGSNs.

> > I'm not sure if this is a useful feature.  GTP is used only in
> > operator-controlled networks and only on standard ports.  It's not
> > possible to negotiate any non-standard ports on the signaling plane
> > either.
> >
> Bear in mind that we're not required to do everything the GTP spec
> says. 

Yes, we are, at least as long as it affects interoperability with other
implementations out there.

GTP uses well-known port numbers on *both* sides of the tunnel, and you
cannot deviate from that.

There's no point in having all kinds of features in the GTP user plane
which are not interoperable with other implementations, and which are
completely outside of the information model / architecture of GTP.

In the real world, GTP-U is only used in combination with GTP-C.  And in
GTP-C you can only negotiate the IP address of both sides of GTP-U, and
not the port number information.  As a result, the port numbers are
static on both sides.

> My impression is GTP designers probably didn't think in terms of
> getting best performance. But we can ;-)

I think it's wasted effort if it's about "random UDP ports", as no
standards-compliant implementation out there that you will have to
interoperate with will be able to support it.

GTP is used between home and roaming operator.  If you want to introduce
changes to how it works, you will have to have control over both sides
of the implementation of both the GTP-C and the GTP-u plane, which is
very unlikely and rather the exception in the hundreds of operators you
interoperate with.  Also keep in mind that there often are various
"middleboxes" that will suddenly have to reflect your changes.  That
starts from packet filters at various locations in the operator networks
and/or roaming hubs, down to GTP hubs and others.

My opinion is: Non-standard GTP ports are not going to happen.

> I also brought up open_ggsn. ggsn to sgsn.

That's good to hear.  For both v4 and v6 PDP contexts?  Which phones
did you use for testing?  Particularly given how convoluted the add

Re: [PATCH] net: emac: Fix napi poll list corruption

2017-09-19 Thread David Miller
From: Christian Lamparter 
Date: Tue, 19 Sep 2017 19:35:18 +0200

> This patch is pretty much a carbon copy of
> commit 3079c652141f ("caif: Fix napi poll list corruption")
> with "caif" replaced by "emac".
> 
> The commit d75b1ade567f ("net: less interrupt masking in NAPI")
> breaks emac.
> 
> It is now required that if the entire budget is consumed when poll
> returns, the napi poll_list must remain empty.  However, like some
> other drivers emac tries to do a last-ditch check and if there is
> more work it will call napi_reschedule and then immediately process
> some of this new work.  Should the entire budget be consumed while
> processing such new work then we will violate the new caller
> contract.
> 
> This patch fixes this by not touching any work when we reschedule
> in emac.
> 
> Signed-off-by: Christian Lamparter 

Applied and queued up for -stable, thanks.
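
For reference, the poll() contract the fix restores can be sketched as
follows (foo_* names are placeholders; the actual patch only drops the
last-ditch recheck in emac):

static int foo_poll(struct napi_struct *napi, int budget)
{
	struct foo_priv *priv = container_of(napi, struct foo_priv, napi);
	int work = foo_process_rx(priv, budget);	/* assumed helper */

	/* Only complete NAPI (and re-enable IRQs) when under budget. */
	if (work < budget && napi_complete_done(napi, work))
		foo_enable_irq(priv);			/* assumed helper */

	/* No napi_reschedule() plus extra processing here: consuming
	 * the budget on such new work would return with a non-empty
	 * poll_list, violating the rule from d75b1ade567f.
	 */
	return work;
}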


Re: [PATCH net-next 0/7] net: speedup netns create/delete time

2017-09-19 Thread Eric Dumazet
On Tue, 2017-09-19 at 16:02 -0700, David Miller wrote:
> From: Eric Dumazet 
> Date: Mon, 18 Sep 2017 12:07:26 -0700
> 
> > When rate of netns creation/deletion is high enough,
> > we observe softlockups in cleanup_net() caused by huge list
> > of netns and way too many rcu_barrier() calls.
> > 
> > This patch series does some optimizations in kobject,
> > and adds batching to tunnels so that netns dismantles are
> > less costly.
> > 
> > IPv6 addrlabels also get a per netns list, and tcp_metrics
> > also benefits from batch flushing.
> > 
> > This gives me one order of magnitude gain.
> > (~50 ms -> ~5 ms for one netns create/delete pair)
> 
> I like it.
> 
> Please address the feedback about using skb_put_data() and
> resubmit.

Sure, will also remove a spurious // comment I accidentally left in
patch 7/7.






Re: [patch net-next] team: fall back to hash if table entry is empty

2017-09-19 Thread David Miller
From: Jim Hanko 
Date: Tue, 19 Sep 2017 11:33:39 -0700

> If the hash to port mapping table does not have a valid port (i.e. when
> a port goes down), fall back to the simple hashing mechanism to avoid
> dropping packets.
> 
> Signed-off-by: Jim Hanko 
> Acked-by: Jiri Pirko 

Applied, thanks.
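
The fallback can be sketched as a two-step port selection (hypothetical
helper names; the real change lives in the team loadbalance mode code):

static struct team_port *lb_pick_port(struct team *team,
				      struct sk_buff *skb, u32 hash)
{
	/* Prefer the user-programmed hash->port mapping table entry. */
	struct team_port *port = lb_htpm_lookup(team, hash);	/* assumed */

	/* Empty slot (e.g. the mapped port went down): fall back to
	 * plain hash-based selection rather than dropping the packet.
	 */
	if (!port)
		port = lb_hash_select(team, skb, hash);		/* assumed */
	return port;
}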


Re: [PATCH net] tcp: fastopen: fix on syn-data transmit failure

2017-09-19 Thread David Miller
From: Eric Dumazet 
Date: Tue, 19 Sep 2017 10:05:57 -0700

> From: Eric Dumazet 
> 
> Our recent change exposed a bug in TCP Fastopen Client that syzkaller
> found right away [1]
> 
> When we prepare skb with SYN+DATA, we attempt to transmit it,
> and we update socket state as if the transmit was a success.
> 
> In socket RTX queue we have two skbs, one with the SYN alone,
> and a second one containing the DATA.
> 
> When (malicious) ACK comes in, we now complain that second one had no
> skb_mstamp.
> 
> The proper fix is to make sure that if the transmit failed, we do not
> pretend we sent the DATA skb, and make it our send_head.
> 
> When 3WHS completes, we can now send the DATA right away, without having
> to wait for a timeout.
> 
> [1]
 ...
> Fixes: 8c72c65b426b ("tcp: update skb->skb_mstamp more carefully")
> Fixes: 783237e8daf1 ("net-tcp: Fast Open client - sending SYN-data")
> Signed-off-by: Eric Dumazet 
> Reported-by: Dmitry Vyukov 

Applied, thanks Eric.


Re: [net-next v2 0/4] test_rhashtable: don't allocate huge static array

2017-09-19 Thread David Miller
From: Florian Westphal 
Date: Wed, 20 Sep 2017 01:12:10 +0200

> Add a test case for the rhlist interface.
> While at it, clean up the current rhashtable test a bit and add a check
> for max_size support.
> 
> No changes since v1, except in last patch.
> kbuild robot complained about large onstack allocation caused by
> struct rhltable when lockdep is enabled.

Looks good, series applied, thanks Florian.


[net-next v2 3/4] test_rhashtable: add a check for max_size

2017-09-19 Thread Florian Westphal
add a test that tries to insert more than max_size elements.

Signed-off-by: Florian Westphal 
---
 lib/test_rhashtable.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 69f5b3849980..1eee90e6e394 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -246,6 +246,43 @@ static s64 __init test_rhashtable(struct rhashtable *ht, 
struct test_obj *array,
 
 static struct rhashtable ht;
 
+static int __init test_rhashtable_max(struct test_obj *array,
+ unsigned int entries)
+{
+   unsigned int i, insert_retries = 0;
+   int err;
+
+   test_rht_params.max_size = roundup_pow_of_two(entries / 8);
+   err = rhashtable_init(&ht, &test_rht_params);
+   if (err)
+   return err;
+
+   for (i = 0; i < ht.max_elems; i++) {
+   struct test_obj *obj = &array[i];
+
+   obj->value.id = i * 2;
+   err = insert_retry(&ht, obj, test_rht_params);
+   if (err > 0)
+   insert_retries += err;
+   else if (err)
+   return err;
+   }
+
+   err = insert_retry(&ht, &array[ht.max_elems], test_rht_params);
+   if (err == -E2BIG) {
+   err = 0;
+   } else {
+   pr_info("insert element %u should have failed with %d, got 
%d\n",
+   ht.max_elems, -E2BIG, err);
+   if (err == 0)
+   err = -1;
+   }
+
+   rhashtable_destroy(&ht);
+
+   return err;
+}
+
 static int thread_lookup_test(struct thread_data *tdata)
 {
unsigned int entries = tdata->entries;
@@ -386,7 +423,11 @@ static int __init test_rht_init(void)
total_time += time;
}
 
+   pr_info("test if its possible to exceed max_size %d: %s\n",
+   test_rht_params.max_size, test_rhashtable_max(objs, entries) == 0 ?
+   "no, ok" : "YES, failed");
vfree(objs);
+
do_div(total_time, runs);
pr_info("Average test time: %llu\n", total_time);
 
-- 
2.13.5



[net-next v2 4/4] test_rhashtable: add test case for rhl_table interface

2017-09-19 Thread Florian Westphal
Also test rhltable.  rhltable remove operations are slow, as
deletions require a list walk; thus, test with 1/16th of the given
entry count to get a run duration similar to the rhashtable one.

Signed-off-by: Florian Westphal 
---
 change since v1:
 place struct rhltable in initdata section to avoid large onstack
 allocation warnings when lockdep is enabled.

 lib/test_rhashtable.c | 196 +-
 1 file changed, 194 insertions(+), 2 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 1eee90e6e394..de4d0584631a 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #define MAX_ENTRIES100
@@ -66,6 +67,11 @@ struct test_obj {
struct rhash_head   node;
 };
 
+struct test_obj_rhl {
+   struct test_obj_val value;
+   struct rhlist_head  list_node;
+};
+
 struct thread_data {
unsigned int entries;
int id;
@@ -245,6 +251,186 @@ static s64 __init test_rhashtable(struct rhashtable *ht, 
struct test_obj *array,
 }
 
 static struct rhashtable ht;
+static struct rhltable rhlt __initdata;
+
+static int __init test_rhltable(unsigned int entries)
+{
+   struct test_obj_rhl *rhl_test_objects;
+   unsigned long *obj_in_table;
+   unsigned int i, j, k;
+   int ret, err;
+
+   if (entries == 0)
+   entries = 1;
+
+   rhl_test_objects = vzalloc(sizeof(*rhl_test_objects) * entries);
+   if (!rhl_test_objects)
+   return -ENOMEM;
+
+   ret = -ENOMEM;
+   obj_in_table = vzalloc(BITS_TO_LONGS(entries) * sizeof(unsigned long));
+   if (!obj_in_table)
+   goto out_free;
+
+   /* nulls_base not supported in rhlist interface */
+   test_rht_params.nulls_base = 0;
+   err = rhltable_init(&rhlt, &test_rht_params);
+   if (WARN_ON(err))
+   goto out_free;
+
+   k = prandom_u32();
+   ret = 0;
+   for (i = 0; i < entries; i++) {
+   rhl_test_objects[i].value.id = k;
+   err = rhltable_insert(&rhlt, &rhl_test_objects[i].list_node,
+ test_rht_params);
+   if (WARN(err, "error %d on element %d\n", err, i))
+   break;
+   if (err == 0)
+   set_bit(i, obj_in_table);
+   }
+
+   if (err)
+   ret = err;
+
+   pr_info("test %d add/delete pairs into rhlist\n", entries);
+   for (i = 0; i < entries; i++) {
+   struct rhlist_head *h, *pos;
+   struct test_obj_rhl *obj;
+   struct test_obj_val key = {
+   .id = k,
+   };
+   bool found;
+
+   rcu_read_lock();
+   h = rhltable_lookup(&rhlt, &key, test_rht_params);
+   if (WARN(!h, "key not found during iteration %d of %d", i, entries)) {
+   rcu_read_unlock();
+   break;
+   }
+
+   if (i) {
+   j = i - 1;
+   rhl_for_each_entry_rcu(obj, pos, h, list_node) {
+   if (WARN(pos == &rhl_test_objects[j].list_node, "old element found, should be gone"))
+   break;
+   }
+   }
+
+   cond_resched_rcu();
+
+   found = false;
+
+   rhl_for_each_entry_rcu(obj, pos, h, list_node) {
+   if (pos == &rhl_test_objects[i].list_node) {
+   found = true;
+   break;
+   }
+   }
+
+   rcu_read_unlock();
+
+   if (WARN(!found, "element %d not found", i))
+   break;
+
+   err = rhltable_remove(&rhlt, &rhl_test_objects[i].list_node, test_rht_params);
+   WARN(err, "rhltable_remove: err %d for iteration %d\n", err, i);
+   if (err == 0)
+   clear_bit(i, obj_in_table);
+   }
+
+   if (ret == 0 && err)
+   ret = err;
+
+   for (i = 0; i < entries; i++) {
+   WARN(test_bit(i, obj_in_table), "elem %d allegedly still present", i);
+
+   err = rhltable_insert(&rhlt, &rhl_test_objects[i].list_node,
+ test_rht_params);
+   if (WARN(err, "error %d on element %d\n", err, i))
+   break;
+   if (err == 0)
+   set_bit(i, obj_in_table);
+   }
+
+   pr_info("test %d random rhlist add/delete operations\n", entries);
+   for (j = 0; j < entries; j++) {
+   u32 i = prandom_u32_max(entries);
+   u32 prand = prandom_u32();
+
+   cond_resched();
+
+   if (prand == 0)
+   prand = prandom_u32();
+
+

[net-next v2 1/4] test_rhashtable: don't allocate huge static array

2017-09-19 Thread Florian Westphal
Signed-off-by: Florian Westphal 
---
 lib/test_rhashtable.c | 27 ---
 1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index 0ffca990a833..c40d6e636f33 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -72,8 +72,6 @@ struct thread_data {
struct test_obj *objs;
 };
 
-static struct test_obj array[MAX_ENTRIES];
-
 static struct rhashtable_params test_rht_params = {
.head_offset = offsetof(struct test_obj, node),
.key_offset = offsetof(struct test_obj, value),
@@ -85,7 +83,7 @@ static struct rhashtable_params test_rht_params = {
 static struct semaphore prestart_sem;
 static struct semaphore startup_sem = __SEMAPHORE_INITIALIZER(startup_sem, 0);
 
-static int insert_retry(struct rhashtable *ht, struct rhash_head *obj,
+static int insert_retry(struct rhashtable *ht, struct test_obj *obj,
 const struct rhashtable_params params)
 {
int err, retries = -1, enomem_retries = 0;
@@ -93,7 +91,7 @@ static int insert_retry(struct rhashtable *ht, struct 
rhash_head *obj,
do {
retries++;
cond_resched();
-   err = rhashtable_insert_fast(ht, obj, params);
+   err = rhashtable_insert_fast(ht, &obj->node, params);
if (err == -ENOMEM && enomem_retry) {
enomem_retries++;
err = -EBUSY;
@@ -107,7 +105,7 @@ static int insert_retry(struct rhashtable *ht, struct 
rhash_head *obj,
return err ? : retries;
 }
 
-static int __init test_rht_lookup(struct rhashtable *ht)
+static int __init test_rht_lookup(struct rhashtable *ht, struct test_obj 
*array)
 {
unsigned int i;
 
@@ -186,7 +184,7 @@ static void test_bucket_stats(struct rhashtable *ht)
pr_warn("Test failed: Total count mismatch ^^^");
 }
 
-static s64 __init test_rhashtable(struct rhashtable *ht)
+static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj 
*array)
 {
struct test_obj *obj;
int err;
@@ -203,7 +201,7 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
struct test_obj *obj = &array[i];
 
obj->value.id = i * 2;
-   err = insert_retry(ht, &obj->node, test_rht_params);
+   err = insert_retry(ht, obj, test_rht_params);
if (err > 0)
insert_retries += err;
else if (err)
@@ -216,7 +214,7 @@ static s64 __init test_rhashtable(struct rhashtable *ht)
 
test_bucket_stats(ht);
rcu_read_lock();
-   test_rht_lookup(ht);
+   test_rht_lookup(ht, array);
rcu_read_unlock();
 
test_bucket_stats(ht);
@@ -286,7 +284,7 @@ static int threadfunc(void *data)
for (i = 0; i < entries; i++) {
tdata->objs[i].value.id = i;
tdata->objs[i].value.tid = tdata->id;
-   err = insert_retry(&ht, &tdata->objs[i].node, test_rht_params);
+   err = insert_retry(&ht, &tdata->objs[i], test_rht_params);
if (err > 0) {
insert_retries += err;
} else if (err) {
@@ -349,6 +347,10 @@ static int __init test_rht_init(void)
test_rht_params.max_size = max_size ? : roundup_pow_of_two(entries);
test_rht_params.nelem_hint = size;
 
+   objs = vzalloc((test_rht_params.max_size + 1) * sizeof(struct test_obj));
+   if (!objs)
+   return -ENOMEM;
+
pr_info("Running rhashtable test nelem=%d, max_size=%d, shrinking=%d\n",
size, max_size, shrinking);
 
@@ -356,7 +358,8 @@ static int __init test_rht_init(void)
s64 time;
 
pr_info("Test %02d:\n", i);
-   memset(&array, 0, sizeof(array));
+   memset(objs, 0, test_rht_params.max_size * sizeof(struct test_obj));
+
err = rhashtable_init(&ht, &test_rht_params);
if (err < 0) {
pr_warn("Test failed: Unable to initialize hashtable: 
%d\n",
@@ -364,9 +367,10 @@ static int __init test_rht_init(void)
continue;
}
 
-   time = test_rhashtable(&ht);
+   time = test_rhashtable(&ht, objs);
rhashtable_destroy(&ht);
if (time < 0) {
+   vfree(objs);
pr_warn("Test failed: return code %lld\n", time);
return -EINVAL;
}
@@ -374,6 +378,7 @@ static int __init test_rht_init(void)
total_time += time;
}
 
+   vfree(objs);
do_div(total_time, runs);
pr_info("Average test time: %llu\n", total_time);
 
-- 
2.13.5



[net-next v2 2/4] test_rhashtable: don't use global entries variable

2017-09-19 Thread Florian Westphal
Pass the entries to test as an argument instead.
A follow-up patch will add an rhlist test case; rhlist delete operations
are slow, so we need to use a smaller number to test it.

Signed-off-by: Florian Westphal 
---
 lib/test_rhashtable.c | 37 +++--
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/lib/test_rhashtable.c b/lib/test_rhashtable.c
index c40d6e636f33..69f5b3849980 100644
--- a/lib/test_rhashtable.c
+++ b/lib/test_rhashtable.c
@@ -28,9 +28,9 @@
 #define MAX_ENTRIES100
 #define TEST_INSERT_FAIL INT_MAX
 
-static int entries = 5;
-module_param(entries, int, 0);
-MODULE_PARM_DESC(entries, "Number of entries to add (default: 5)");
+static int parm_entries = 5;
+module_param(parm_entries, int, 0);
+MODULE_PARM_DESC(parm_entries, "Number of entries to add (default: 5)");
 
 static int runs = 4;
 module_param(runs, int, 0);
@@ -67,6 +67,7 @@ struct test_obj {
 };
 
 struct thread_data {
+   unsigned int entries;
int id;
struct task_struct *task;
struct test_obj *objs;
@@ -105,11 +106,12 @@ static int insert_retry(struct rhashtable *ht, struct 
test_obj *obj,
return err ? : retries;
 }
 
-static int __init test_rht_lookup(struct rhashtable *ht, struct test_obj *array)
+static int __init test_rht_lookup(struct rhashtable *ht, struct test_obj *array,
+ unsigned int entries)
 {
unsigned int i;
 
-   for (i = 0; i < entries * 2; i++) {
+   for (i = 0; i < entries; i++) {
struct test_obj *obj;
bool expected = !(i % 2);
struct test_obj_val key = {
@@ -142,7 +144,7 @@ static int __init test_rht_lookup(struct rhashtable *ht, 
struct test_obj *array)
return 0;
 }
 
-static void test_bucket_stats(struct rhashtable *ht)
+static void test_bucket_stats(struct rhashtable *ht, unsigned int entries)
 {
unsigned int err, total = 0, chain_len = 0;
struct rhashtable_iter hti;
@@ -184,7 +186,8 @@ static void test_bucket_stats(struct rhashtable *ht)
pr_warn("Test failed: Total count mismatch ^^^");
 }
 
-static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj *array)
+static s64 __init test_rhashtable(struct rhashtable *ht, struct test_obj *array,
+ unsigned int entries)
 {
struct test_obj *obj;
int err;
@@ -212,12 +215,12 @@ static s64 __init test_rhashtable(struct rhashtable *ht, 
struct test_obj *array)
pr_info("  %u insertions retried due to memory pressure\n",
insert_retries);
 
-   test_bucket_stats(ht);
+   test_bucket_stats(ht, entries);
rcu_read_lock();
-   test_rht_lookup(ht, array);
+   test_rht_lookup(ht, array, entries);
rcu_read_unlock();
 
-   test_bucket_stats(ht);
+   test_bucket_stats(ht, entries);
 
pr_info("  Deleting %d keys\n", entries);
for (i = 0; i < entries; i++) {
@@ -245,6 +248,7 @@ static struct rhashtable ht;
 
 static int thread_lookup_test(struct thread_data *tdata)
 {
+   unsigned int entries = tdata->entries;
int i, err = 0;
 
for (i = 0; i < entries; i++) {
@@ -281,7 +285,7 @@ static int threadfunc(void *data)
if (down_interruptible(&startup_sem))
pr_err("  thread[%d]: down_interruptible failed\n", tdata->id);
 
-   for (i = 0; i < entries; i++) {
+   for (i = 0; i < tdata->entries; i++) {
tdata->objs[i].value.id = i;
tdata->objs[i].value.tid = tdata->id;
err = insert_retry(&ht, &tdata->objs[i], test_rht_params);
@@ -305,7 +309,7 @@ static int threadfunc(void *data)
}
 
for (step = 10; step > 0; step--) {
-   for (i = 0; i < entries; i += step) {
+   for (i = 0; i < tdata->entries; i += step) {
if (tdata->objs[i].value.id == TEST_INSERT_FAIL)
continue;
err = rhashtable_remove_fast(&ht, &tdata->objs[i].node,
@@ -336,12 +340,16 @@ static int threadfunc(void *data)
 
 static int __init test_rht_init(void)
 {
+   unsigned int entries;
int i, err, started_threads = 0, failed_threads = 0;
u64 total_time = 0;
struct thread_data *tdata;
struct test_obj *objs;
 
-   entries = min(entries, MAX_ENTRIES);
+   if (parm_entries < 0)
+   parm_entries = 1;
+
+   entries = min(parm_entries, MAX_ENTRIES);
 
test_rht_params.automatic_shrinking = shrinking;
test_rht_params.max_size = max_size ? : roundup_pow_of_two(entries);
@@ -367,7 +375,7 @@ static int __init test_rht_init(void)
continue;
}
 
-   time = test_rhashtable(&ht, objs);
+   time = test_rhashtable(&ht, objs, entries);
rhashtable_destroy(&ht);
if (time < 0) {

[net-next v2 0/4] test_rhashtable: don't allocate huge static array

2017-09-19 Thread Florian Westphal
Add a test case for the rhlist interface.
While at it, clean up the current rhashtable test a bit and add a check
for max_size support.

No changes since v1, except in last patch.
kbuild robot complained about large onstack allocation caused by
struct rhltable when lockdep is enabled.



Re: [PATCH net-next v3 00/12] net: dsa: b53/bcm_sf2 cleanups

2017-09-19 Thread David Miller
From: Florian Fainelli 
Date: Tue, 19 Sep 2017 10:46:42 -0700

> This patch series is a first-pass set of clean-ups to reduce the number of
> LOCs between b53 and bcm_sf2 and to share as many functions as possible.
> 
> There is a number of additional cleanups queued up locally that require more
> thorough testing.

Series applied, thanks.


Re: [PATCH V2 net 0/7] Bug fixes for the HNS3 Ethernet Driver for Hip08 SoC

2017-09-19 Thread David Miller
From: Salil Mehta 
Date: Tue, 19 Sep 2017 17:17:09 +0100

> This patch set presents some bug fixes for the HNS3 Ethernet driver identified
> during internal testing & stabilization efforts.
> 
> Change Log:
> Patch V2: Resolved comments from Leon Romanovsky
> Patch V1: Initial Submit

Series applied, thank you.


Re: [PATCH net-next 0/4] net: dsa: move master ethtool code

2017-09-19 Thread David Miller
From: Florian Fainelli 
Date: Tue, 19 Sep 2017 13:04:56 -0700

> On 09/19/2017 08:56 AM, Vivien Didelot wrote:
>> The DSA core overrides the master device's ethtool_ops structure so that
>> it can inject statistics and such of its dedicated switch CPU port.
>> 
>> This ethtool code is currently called on unnecessary conditions or
>> before the master interface and its switch CPU port get wired up.
>> This patchset fixes this.
>> 
>> Similarly to slave.c where the DSA slave net_device is the entry point
>> of the dsa_slave_* functions, this patchset also isolates the master's
>> ethtool code in a new master.c file, where the DSA master net_device is
>> the entry point of the dsa_master_* functions.
>> 
>> This is a first step towards better control of the master device and
>> support for multiple CPU ports.
> 
> Tested-by: Florian Fainelli 
> 
> * ethtool -S eth0 -> switch port CPU stats are still correctly overlayed
> * ethtool -s gphy wol g -> both switch port and CPU port correctly
> enable WoL
> * ethtool -i eth0 -> driver still reports correct information

Series applied, thanks everyone.


Re: [PATCH net-next 0/7] net: speedup netns create/delete time

2017-09-19 Thread David Miller
From: Eric Dumazet 
Date: Mon, 18 Sep 2017 12:07:26 -0700

> When rate of netns creation/deletion is high enough,
> we observe softlockups in cleanup_net() caused by huge list
> of netns and way too many rcu_barrier() calls.
> 
> This patch series does some optimizations in kobject,
> and adds batching to tunnels so that netns dismantles are
> less costly.
> 
> IPv6 addrlabels also get a per netns list, and tcp_metrics
> also benefits from batch flushing.
> 
> This gives me one order of magnitude gain.
> (~50 ms -> ~5 ms for one netns create/delete pair)

I like it.

Please address the feedback about using skb_put_data() and
resubmit.

Thanks.


Re: [PATCH,net-next,0/2] Improve code coverage of syzkaller

2017-09-19 Thread David Miller
From: Petar Penkov 
Date: Tue, 19 Sep 2017 00:34:00 -0700

> The following patches address this by providing the user (syzkaller)
> with the ability to send via napi_gro_receive() and napi_gro_frags().
> Additionally, syzkaller can specify how many fragments there are and
> how much data per fragment there is. This is done by exploiting the
> convenient structure of iovecs. Finally, this patch series adds
> support for exercising the flow dissector during fuzzing.
> 
> The code path including napi_gro_receive() can be enabled via the
> CONFIG_TUN_NAPI compile-time flag, and can be used by users other than
> syzkaller. The remainder of the changes in this patch series give the
> user significantly more control over packets entering the kernel. To
> avoid potential security vulnerabilities, hide the ability to send
> custom skbs and the flow dissector code paths behind a run-time flag
> IFF_NAPI_FRAGS that is advertised and accepted only if CONFIG_TUN_NAPI
> is enabled.
> 
> The patch series will be followed with changes to packetdrill, where
> these additions to the TUN driver are exercised and demonstrated.
> This will give the ability to write regression tests for specific
> parts of the early networking stack.
> 
> Patch 1/ Add NAPI struct per receive queue, enable NAPI, and use
>napi_gro_receive() 
> Patch 2/ Use NAPI skb and napi_gro_frags(), exercise flow
>dissector, and allow custom skbs.

I'm happy with everything except the TUN_NAPI Kconfig knob
requirement.

Rebuilding something just to test things isn't going to fly very well.

Please make it secure somehow, enable this stuff by default.

Thanks.
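
For completeness, opening a tap fd with the proposed flag would look
roughly like this (a sketch; IFF_NAPI_FRAGS is the run-time flag named in
the cover letter and remains an assumption until the series lands):

#include <fcntl.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/if.h>
#include <linux/if_tun.h>

int open_tap_napi_frags(const char *name)
{
	struct ifreq ifr;
	int fd = open("/dev/net/tun", O_RDWR);

	if (fd < 0)
		return -1;
	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_NAPI_FRAGS;
	/* Kernels without the series reject the unknown flag here. */
	if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
		close(fd);
		return -1;
	}
	return fd;	/* writev() iovecs now map to skb frags */
}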


Re: [PATCH] net: ethernet: aquantia: default to no in config

2017-09-19 Thread David Miller
From: vcap...@pengaru.com
Date: Tue, 19 Sep 2017 16:02:49 -0700

> Out of curiosity, what's the rationale for that decision?

So that you don't need to know what special vendor knob needs to be
switched in order to even be offered the config knob for the driver
you are interested in.


Re: [PATCH] net: ethernet: aquantia: default to no in config

2017-09-19 Thread vcaputo
On Tue, Sep 19, 2017 at 03:52:31PM -0700, David Miller wrote:
> From: Vito Caputo 
> Date: Tue, 19 Sep 2017 15:43:15 -0700
> 
> > NET_VENDOR_AQUANTIA was "default y" for some reason, which seems
> > obviously inappropriate.
> 
> It is appropriate.
> 
> We make all vendor guards default to yes.

Thanks for the quick response.

Out of curiosity, what's the rationale for that decision?


Re: [PATCH] net: ethernet: aquantia: default to no in config

2017-09-19 Thread David Miller
From: Vito Caputo 
Date: Tue, 19 Sep 2017 15:43:15 -0700

> NET_VENDOR_AQUANTIA was "default y" for some reason, which seems
> obviously inappropriate.

It is appropriate.

We make all vendor guards default to yes.


[PATCH] net: ethernet: aquantia: default to no in config

2017-09-19 Thread Vito Caputo
NET_VENDOR_AQUANTIA was "default y" for some reason, which seems
obviously inappropriate.
---
 drivers/net/ethernet/aquantia/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/aquantia/Kconfig 
b/drivers/net/ethernet/aquantia/Kconfig
index cdf78e069a39..6167b13cf349 100644
--- a/drivers/net/ethernet/aquantia/Kconfig
+++ b/drivers/net/ethernet/aquantia/Kconfig
@@ -4,7 +4,7 @@
 
 config NET_VENDOR_AQUANTIA
bool "aQuantia devices"
-   default y
+   default n
---help---
  Set this to y if you have an Ethernet network cards that uses the aQuantia
  AQC107/AQC108 chipset.
-- 
2.11.0



[PATCH net v2] bpf: fix ri->map_owner pointer on bpf_prog_realloc

2017-09-19 Thread Daniel Borkmann
Commit 109980b894e9 ("bpf: don't select potentially stale
ri->map from buggy xdp progs") passed the pointer to the prog
itself to be loaded into r4 prior to the bpf_redirect_map() helper
call, so that we can store the owner into ri->map_owner from within
the helper.

Issue with that is that the actual address of the prog is still
subject to change when subsequent rewrites occur that require
slow path in bpf_prog_realloc() to alloc more memory, e.g. from
patching inlining helper functions or constant blinding. Thus,
we really need to take prog->aux as the address we're holding,
which also works with prog clones as they share the same aux
object.

Instead of then fetching aux->prog during runtime, which could
potentially incur cache misses due to false sharing, we are
going to just use aux for comparison on the map owner. This
will also keep the patchlet the same size. The later check
in xdp_map_invalid() only accesses the read-only aux pointer from
the prog, which is already in the same cacheline from the prior
access when calling bpf_func.

Fixes: 109980b894e9 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
---
 v1->v2:
  - Decided to go with prog->aux instead.

 kernel/bpf/verifier.c |  7 ++-
 net/core/filter.c | 24 +++-
 2 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 799b245..b914fbe 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -4205,7 +4205,12 @@ static int fixup_bpf_calls(struct bpf_verifier_env *env)
}
 
if (insn->imm == BPF_FUNC_redirect_map) {
-   u64 addr = (unsigned long)prog;
+   /* Note, we cannot use prog directly as imm as subsequent
+    * rewrites would still change the prog pointer. The only
+    * stable address we can use is aux, which also works with
+    * prog clones during blinding.
+    */
+   u64 addr = (unsigned long)prog->aux;
struct bpf_insn r4_ld[] = {
BPF_LD_IMM64(BPF_REG_4, addr),
*insn,
diff --git a/net/core/filter.c b/net/core/filter.c
index 24dd33d..82edad5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -1794,7 +1794,7 @@ struct redirect_info {
u32 flags;
struct bpf_map *map;
struct bpf_map *map_to_flush;
-   const struct bpf_prog *map_owner;
+   unsigned long   map_owner;
 };
 
 static DEFINE_PER_CPU(struct redirect_info, redirect_info);
@@ -2500,11 +2500,17 @@ void xdp_do_flush_map(void)
 }
 EXPORT_SYMBOL_GPL(xdp_do_flush_map);
 
+static inline bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
+  unsigned long aux)
+{
+   return (unsigned long)xdp_prog->aux != aux;
+}
+
 static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
   struct bpf_prog *xdp_prog)
 {
struct redirect_info *ri = this_cpu_ptr(&redirect_info);
-   const struct bpf_prog *map_owner = ri->map_owner;
+   unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
struct net_device *fwd = NULL;
u32 index = ri->ifindex;
@@ -2512,9 +2518,9 @@ static int xdp_do_redirect_map(struct net_device *dev, struct xdp_buff *xdp,
 
ri->ifindex = 0;
ri->map = NULL;
-   ri->map_owner = NULL;
+   ri->map_owner = 0;
 
-   if (unlikely(map_owner != xdp_prog)) {
+   if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) {
err = -EFAULT;
map = NULL;
goto err;
@@ -2574,7 +2580,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
struct bpf_prog *xdp_prog)
 {
struct redirect_info *ri = this_cpu_ptr(&redirect_info);
-   const struct bpf_prog *map_owner = ri->map_owner;
+   unsigned long map_owner = ri->map_owner;
struct bpf_map *map = ri->map;
struct net_device *fwd = NULL;
u32 index = ri->ifindex;
@@ -2583,10 +2589,10 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
 
ri->ifindex = 0;
ri->map = NULL;
-   ri->map_owner = NULL;
+   ri->map_owner = 0;
 
if (map) {
-   if (unlikely(map_owner != xdp_prog)) {
+   if (unlikely(xdp_map_invalid(xdp_prog, map_owner))) {
err = -EFAULT;
map = NULL;
goto err;
@@ -2632,7 +2638,7 @@ int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
ri->ifindex = ifindex;
ri->flags = flags;
ri->map = NULL;
-   ri->map_owner = NULL;
+   ri->map_owner = 0;
 
return XDP_REDIRECT;
 }
@@ -2646,7 +2652,7 @@ 
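
As a sketch of why prog->aux is the stable handle (assumed shape,
loosely following bpf_prog_realloc() in kernel/bpf/core.c; the
allocator helper below is hypothetical): reallocation hands back a new
struct bpf_prog but copies the old contents, including the aux pointer,
so fp->aux survives moves and is shared by prog clones.

struct bpf_prog *prog_realloc_sketch(struct bpf_prog *fp_old,
				     unsigned int pages)
{
	/* hypothetical allocator standing in for the real slow path */
	struct bpf_prog *fp = alloc_prog_pages(pages);

	if (fp) {
		memcpy(fp, fp_old, fp_old->pages * PAGE_SIZE);
		fp->pages = pages;
		/* fp != fp_old now, but fp->aux == fp_old->aux, which is
		 * why ri->map_owner keyed on aux stays valid */
	}
	return fp;
}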

Re: [PATCH net] bpf: do not disable/enable BH in bpf_map_free_id()

2017-09-19 Thread David Miller
From: Eric Dumazet 
Date: Tue, 19 Sep 2017 09:15:59 -0700

> From: Eric Dumazet 
> 
> syzkaller reported the following splat [1]
> 
> Since hard irq are disabled by the caller, bpf_map_free_id()
> should not try to enable/disable BH.
> 
> Another solution would be to change htab_map_delete_elem() to
> defer the free_htab_elem() call until after
> raw_spin_unlock_irqrestore(&b->lock, flags), but this might not be
> enough to cover other code paths.
> 
> [1]
 ...
> Fixes: f3f1c054c288 ("bpf: Introduce bpf_map ID")
> Signed-off-by: Eric Dumazet 
> Cc: Martin KaFai Lau 

Applied and queued up for -stable, thanks Eric.
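
To sketch the constraint being fixed (shapes assumed from the report,
not the exact code): the element free runs under a raw spinlock with
hard IRQs off, so the id-freeing path must not toggle BH.

	/* caller (htab_map_delete_elem(), sketched): */
	raw_spin_lock_irqsave(&b->lock, flags);		/* hard IRQs off */
	free_htab_elem(htab, l);			/* -> bpf_map_free_id() */
	raw_spin_unlock_irqrestore(&b->lock, flags);

	/* in bpf_map_free_id(), therefore: */
	spin_lock_bh(&map_idr_lock);			/* wrong: the matching
							 * spin_unlock_bh() would
							 * re-enable BH with hard
							 * IRQs still disabled */
	spin_lock_irqsave(&map_idr_lock, flags);	/* ok: save/restore */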


Latest net-next from GIT panic

2017-09-19 Thread Paweł Staszewski

Just tried the latest net-next git and found a kernel panic.

Below link to bugzilla.

https://bugzilla.kernel.org/attachment.cgi?id=258499


Re: [PATCH] ipv6_skip_exthdr: use ipv6_authlen for AH hdrlen

2017-09-19 Thread David Miller
From: Xiang Gao 
Date: Tue, 19 Sep 2017 08:59:50 -0400

> In ipv6_skip_exthdr(), the length of the AH header is computed manually
> as (hp->hdrlen+2)<<2. However, in include/linux/ipv6.h, a macro
> named ipv6_authlen is already defined for exactly the same job. This
> commit replaces the manual computation with the macro.

All patch submissions must have a proper signoff.

Also, please use a proper subsystem prefix in your Subject line.
"[PATCH] ipv6: Use ipv6_authlen for AH hdrlen in ipv6_skip_exthdr()"
would have been much better, as "ipv6: " is the appropriate
subsystem prefix to use here.

Thanks.
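
For reference, the macro being substituted (as defined in
include/linux/ipv6.h; the AH hdrlen field counts 32-bit words minus
two, hence the +2 and the <<2):

/* byte length of the AH header from its hdrlen field */
#define ipv6_authlen(p)	(((p)->hdrlen + 2) << 2)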


Re: [PATCH net-next] selftests: rtnetlink.sh: add test case for device ifalias

2017-09-19 Thread David Miller
From: Florian Westphal 
Date: Tue, 19 Sep 2017 14:42:17 +0200

> Signed-off-by: Florian Westphal 

Applied, thanks Florian.


Re: [PATCH net] tcp: fastopen: fix on syn-data transmit failure

2017-09-19 Thread Yuchung Cheng
On Tue, Sep 19, 2017 at 10:05 AM, Eric Dumazet  wrote:
> From: Eric Dumazet 
>
> Our recent change exposed a bug in TCP Fastopen Client that syzkaller
> found right away [1]
>
> When we prepare an skb with SYN+DATA, we attempt to transmit it,
> and we update the socket state as if the transmit was a success.
>
> In the socket RTX queue we then have two skbs, one with the SYN alone,
> and a second one containing the DATA.
>
> When a (malicious) ACK comes in, we now complain that the second one
> had no skb_mstamp.
>
> The proper fix is to make sure that if the transmit failed, we do not
> pretend we sent the DATA skb, and instead make it our send_head.
>
> When the 3WHS completes, we can now send the DATA right away, without
> having to wait for a timeout.
>
> [1]
> WARNING: CPU: 0 PID: 100189 at net/ipv4/tcp_input.c:3117 tcp_clean_rtx_queue+0x2057/0x2ab0 net/ipv4/tcp_input.c:3117()
>
>  WARN_ON_ONCE(last_ackt == 0);
>
> Modules linked in:
> CPU: 0 PID: 100189 Comm: syz-executor1 Not tainted
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
>   8800b35cb1d8 81cad00d 
>  828a4347 88009f86c080 8316eb20 0d7f
>  8800b35cb220 812c33c2 8800baad2440 0009d46575c0
> Call Trace:
>  [] __dump_stack
>  [] dump_stack+0xc1/0x124
>  [] warn_slowpath_common+0xe2/0x150
>  [] warn_slowpath_null+0x2e/0x40
>  [] tcp_clean_rtx_queue+0x2057/0x2ab0
>  [] tcp_ack+0x151d/0x3930
>  [] tcp_rcv_state_process+0x1c69/0x4fd0
>  [] tcp_v4_do_rcv+0x54f/0x7c0
>  [] sk_backlog_rcv
>  [] __release_sock+0x12b/0x3a0
>  [] release_sock+0x5e/0x1c0
>  [] inet_wait_for_connect
>  [] __inet_stream_connect+0x545/0xc50
>  [] tcp_sendmsg_fastopen
>  [] tcp_sendmsg+0x2298/0x35a0
>  [] inet_sendmsg+0xe5/0x520
>  [] sock_sendmsg_nosec
>  [] sock_sendmsg+0xcf/0x110
>
> Fixes: 8c72c65b426b ("tcp: update skb->skb_mstamp more carefully")
> Fixes: 783237e8daf1 ("net-tcp: Fast Open client - sending SYN-data")
> Signed-off-by: Eric Dumazet 
> Reported-by: Dmitry Vyukov 
> Cc: Neal Cardwell 
> Cc: Yuchung Cheng 
Acked-by: Yuchung Cheng 

Thanks Eric for fixing this. The current arrangement of the SYN plus
data packet seems to require more code for the error cases. I am
wondering whether a (subsequent) refactoring patch could make this
simpler by updating the state only after a successful transmission
(instead of updating and then reverting).

> ---
>  net/ipv4/tcp_output.c |9 +
>  1 file changed, 9 insertions(+)
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 517d737059d18d8821b65dcdf54d9bb3448784c2..0bc9e46a53696578eb6e911f2f75e6b34c80894f 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -3389,6 +3389,10 @@ static int tcp_send_syn_data(struct sock *sk, struct sk_buff *syn)
> goto done;
> }
>
> +   /* data was not sent, this is our new send_head */
> +   sk->sk_send_head = syn_data;
> +   tp->packets_out -= tcp_skb_pcount(syn_data);
> +
>  fallback:
> /* Send a regular SYN with Fast Open cookie request option */
> if (fo->cookie.len > 0)
> @@ -3441,6 +3445,11 @@ int tcp_connect(struct sock *sk)
>  */
> tp->snd_nxt = tp->write_seq;
> tp->pushed_seq = tp->write_seq;
> +   buff = tcp_send_head(sk);
> +   if (unlikely(buff)) {
> +   tp->snd_nxt = TCP_SKB_CB(buff)->seq;
> +   tp->pushed_seq  = TCP_SKB_CB(buff)->seq;
> +   }
> TCP_INC_STATS(sock_net(sk), TCP_MIB_ACTIVEOPENS);
>
> /* Timer for repeating the SYN until an answer. */
>
>


Re: [PATCH v2 net-next] net: sk_buff rbnode reorg

2017-09-19 Thread David Miller
From: Eric Dumazet 
Date: Tue, 19 Sep 2017 05:14:24 -0700

> From: Eric Dumazet 
> 
> skb->rbnode shares space with skb->next, skb->prev and skb->tstamp
> 
> Current uses (TCP receive ofo queue and netem) need to save/restore
> tstamp, while skb->dev is either NULL (TCP) or a constant for a given
> queue (netem).
> 
> Since we plan to use an RB tree for the TCP retransmit queue to speed up
> SACK processing with large BDP, this patch exchanges skb->dev and
> skb->tstamp.
> 
> This saves some overhead in both TCP and netem.
> 
> v2: removes the swtstamp field from struct tcp_skb_cb
> 
> Signed-off-by: Eric Dumazet 

Looks great, applied, thanks Eric.
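
A simplified sketch of the layout change being described (the real
struct sk_buff in include/linux/skbuff.h has far more fields; this only
shows the overlap): skb->rbnode overlays the two list pointers plus one
more word. Before the patch that extra word was tstamp; after it, it is
dev, so the rbnode users (TCP OFO queue, netem) no longer have to
save/restore tstamp around rbtree use.

struct sk_buff_sketch {
	union {
		struct {
			struct sk_buff		*next;
			struct sk_buff		*prev;
			union {
				struct net_device *dev;	/* moved into union */
				unsigned long	  dev_scratch;
			};
		};
		struct rb_node	rbnode;	/* 3 words, same size as above */
	};
	ktime_t	tstamp;			/* moved out: survives rbtree use */
	/* ... */
};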


Re: [PATCH] rhashtable: Documentation tweak

2017-09-19 Thread David Miller
From: Andreas Gruenbacher 
Date: Tue, 19 Sep 2017 12:41:37 +0200

> Clarify that rhashtable_walk_{stop,start} will not reset the iterator to
> the beginning of the hash table.  Confusion between rhashtable_walk_enter
> and rhashtable_walk_start has already led to a bug.
> 
> Signed-off-by: Andreas Gruenbacher 

Applied, thanks.
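
A minimal usage sketch of the semantics being documented (assuming the
4.14-era API; my_ht, my_obj and process() are hypothetical).
walk_stop()/walk_start() only drop and re-take RCU so the walker may
sleep in between; they resume where the walk left off. Only
walk_enter()/walk_exit() bracket a whole new walk.

	struct rhashtable_iter iter;
	struct my_obj *obj;

	rhashtable_walk_enter(&my_ht, &iter);
	rhashtable_walk_start(&iter);	/* may return -EAGAIN on resize;
					 * the walk itself remains valid */

	while ((obj = rhashtable_walk_next(&iter)) != NULL) {
		if (IS_ERR(obj))
			continue;	/* -EAGAIN: concurrent resize, some
					 * objects may be seen twice */

		process(obj);		/* must not sleep while started */

		rhashtable_walk_stop(&iter);	/* leaves RCU: may sleep */
		cond_resched();
		rhashtable_walk_start(&iter);	/* NOT a rewind: continues */
	}

	rhashtable_walk_stop(&iter);
	rhashtable_walk_exit(&iter);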


Reporting transceiver with ethtool_link_ksettings

2017-09-19 Thread Florian Fainelli
Hi,

After tracking down why all network interfaces that use PHYLIB and
phy_ethtool_link_ksettings_get report "Transceiver: internal", it
became clear that this is because ethtool_link_ksettings deprecated
that field...

Deprecating the ability to set the transceiver makes sense, but we
should not have deprecated getting the transceiver type, which is
useful information.

So what are the options here? Would the following be acceptable? (Note
that it keeps struct ethtool_link_settings the same size: one __u8 plus
a 3-byte pad plus seven __u32s replace the original eight reserved
__u32s, 32 bytes either way.)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 06fc04c73079..bb9b55806bf4 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -1752,7 +1752,9 @@ struct ethtool_link_settings {
__u8eth_tp_mdix;
__u8eth_tp_mdix_ctrl;
__s8link_mode_masks_nwords;
-   __u32   reserved[8];
+   __u8transceiver;
+   __u8reserved1[3];
+   __u32   reserved[7];
__u32   link_mode_masks[0];
/* layout of link_mode_masks fields:
 * __u32 map_supported[link_mode_masks_nwords];

--
Florian


Re: [PATCH] tcp: avoid bogus warning in tcp_clean_rtx_queue

2017-09-19 Thread David Miller
From: Arnd Bergmann 
Date: Tue, 19 Sep 2017 23:32:33 +0200

> On Tue, Sep 19, 2017 at 11:02 PM, David Miller  wrote:
>> What cpu did you test the object code generation upon and does that
>> cpu have branch prediction hints in the target you are building for?
> 
> This was a randconfig build targeting ARMv5. I'm pretty sure that has
> no such hint instructions.

I just tested on sparc64 and it changed the branch prediction:

 .L2157:
-   brz,pn  %i3, .L1898 ! first_ackt,
+   brz,pt  %i2, .L1898 ! first_ackt,
     mov    -1, %o2 !, seq_rtt_us

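For background, the annotations in question map to __builtin_expect(),
which steers gcc's block layout and, on architectures that have them,
the branch-prediction hints seen above (the stock definitions from
include/linux/compiler.h, simplified):

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

Under CONFIG_PROFILE_ANNOTATED_BRANCHES the kernel redefines these to
also record hit/miss statistics per call site, which appears to be what
perturbed gcc-4.9's uninitialized-variable analysis in the report above.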


Re: [PATCH] tcp: avoid bogus warning in tcp_clean_rtx_queue

2017-09-19 Thread Arnd Bergmann
On Tue, Sep 19, 2017 at 11:02 PM, David Miller  wrote:
> From: Arnd Bergmann 
> Date: Mon, 18 Sep 2017 22:48:47 +0200
>
>> gcc-4.9 warns that it cannot trace the state of the 'last_ackt'
>> variable since the change to the TCP timestamping code, when
>> CONFIG_PROFILE_ANNOTATED_BRANCHES is set:
>>
>> net/ipv4/tcp_input.c: In function 'tcp_clean_rtx_queue':
>> include/net/tcp.h:757:23: error: 'last_ackt' may be used uninitialized in this function [-Werror=maybe-uninitialized]
>>
>> Other gcc versions, both older and newer, do not show this
>> warning. Removing the 'likely' annotation makes it go away,
>> and has no effect on the object code without
>> CONFIG_PROFILE_ANNOTATED_BRANCHES, as tested with gcc-4.9
>> and gcc-7.1.1, so this seems to be a safe workaround.
>>
>> Fixes: 9a568de4818d ("tcp: switch TCP TS option (RFC 7323) to 1ms clock")
>> Signed-off-by: Arnd Bergmann 
>
> This reaches the limits at which I am willing to work around compiler
> stuff.

I see. It is definitely a really obscure case, so if there is any doubt
that the workaround is harmless, then we shouldn't take it. The warning
only shows up on gcc-4.9 and not on anything newer, and we disable
-Wmaybe-uninitialized on all older versions because of the false
positives.

It's also possible that it needed a combination of multiple other options,
not just CONFIG_PROFILE_ANNOTATED_BRANCHES. I build-tested
with gcc-4.9 to see if anything would show up that we don't also get a
warning for in gcc-7, and this came up once in several hundred randconfig
builds across multiple architectures (no other new warnings appeared
with gcc-4.9).

> What cpu did you test the object code generation upon and does that
> cpu have branch prediction hints in the target you are building for?

This was a randconfig build targeting ARMv5. I'm pretty sure that has
no such hint instructions.

   Arnd


Re: [PATCH 3/3][v2] selftests: silence test output by default

2017-09-19 Thread Shuah Khan
On 09/19/2017 07:51 AM, jo...@toxicpanda.com wrote:
> From: Josef Bacik 
> 
> Some of the networking tests are very noisy and make it impossible to
> see if we actually passed the tests as they run.  Default to suppressing
> the output from any tests run in order to make it easier to track what
> failed.
> 
> Signed-off-by: Josef Bacik 
> ---
> v1->v2:
> - dump output into /tmp/testname instead of /dev/null
> 

Thanks for the fix. Applied to linux-kselftest for 4.14-rc2

-- Shuah


Re: [PATCH net-next 0/3] Implement delete for BPF LPM trie

2017-09-19 Thread Daniel Mack
On 09/19/2017 11:29 PM, David Miller wrote:
> From: Craig Gallek 
> Date: Tue, 19 Sep 2017 17:16:13 -0400
> 
>> On Tue, Sep 19, 2017 at 5:13 PM, Daniel Mack  wrote:
>>> On 09/19/2017 10:55 PM, David Miller wrote:
 From: Craig Gallek 
 Date: Mon, 18 Sep 2017 15:30:54 -0400

> This was previously left as a TODO.  Add the implementation and
> extend the test to cover it.

 Series applied, thanks.

>>>
>>> Hmm, I think these patches need some more discussion regarding the IM
>>> nodes handling; see the reply I sent an hour ago. Could you wait for
>>> that before pushing your tree?
>>
>> I can follow up with a patch to implement your suggestion.  It's
>> really just an efficiency improvement, though, so I think it's ok to
>> handle independently. (Sorry, I haven't had a chance to play with the
>> implementation details yet).
> 
> Sorry, I thought the core implementation had been agreed upon and the
> series was OK.  All that was asked for were simplifications and/or
> optimizations, which could be done via follow-up patches.
> 
> It's already pushed out to my tree, so I would need to do a real
> revert.
> 
> I hope that won't be necessary.
> 

Nah, it's okay I guess. I trust Craig to send follow-up patches. After
all, efficiency is what this whole exercise is all about, so I think it
should be done correctly :)



Thanks,
Daniel


Re: [PATCH 2/3][v2] selftests: actually run the various net selftests

2017-09-19 Thread Shuah Khan
On 09/19/2017 07:51 AM, jo...@toxicpanda.com wrote:
> From: Josef Bacik 
> 
> These self tests are just self-contained binaries; they are not run by
> any of the scripts in the directory.  This means they need to be marked
> with TEST_GEN_PROGS to actually be run, not TEST_GEN_FILES.
> 
> Signed-off-by: Josef Bacik 
> ---
> v1->v2:
> - Moved msg_zerocopy to TEST_GEN_FILES since it's not runnable in its
>   current state
> 

I usually don't send new tests; however, since this is a test for a
regression, I applied it to linux-kselftest fixes for 4.14-rc2.

thanks,
-- Shuah



Re: [PATCH 1/3][v2] selftest: add a reuseaddr test

2017-09-19 Thread Shuah Khan
On 09/19/2017 07:51 AM, jo...@toxicpanda.com wrote:
> From: Josef Bacik 
> 
> This is to test for a regression introduced by
> 
> b9470c27607b ("inet: kill smallest_size and smallest_port")
> 
> which introduced a problem with reuseaddr and bind conflicts.
> 
> Signed-off-by: Josef Bacik 

I usually don't send new tests; however, since this is a test for a
regression, I applied it to linux-kselftest fixes for 4.14-rc2.

thanks,
-- Shuah


Re: [PATCH net-next 0/3] Implement delete for BPF LPM trie

2017-09-19 Thread David Miller
From: Craig Gallek 
Date: Tue, 19 Sep 2017 17:16:13 -0400

> On Tue, Sep 19, 2017 at 5:13 PM, Daniel Mack  wrote:
>> On 09/19/2017 10:55 PM, David Miller wrote:
>>> From: Craig Gallek 
>>> Date: Mon, 18 Sep 2017 15:30:54 -0400
>>>
 This was previously left as a TODO.  Add the implementation and
 extend the test to cover it.
>>>
>>> Series applied, thanks.
>>>
>>
>> Hmm, I think these patches need some more discussion regarding the IM
>> nodes handling; see the reply I sent an hour ago. Could you wait for
>> that before pushing your tree?
> 
> I can follow up with a patch to implement your suggestion.  It's
> really just an efficiency improvement, though, so I think it's ok to
> handle independently. (Sorry, I haven't had a chance to play with the
> implementation details yet).

Sorry, I thought the core implementation had been agreed upon and the
series was OK.  All that was asked for were simplifications and/or
optimizations, which could be done via follow-up patches.

It's already pushed out to my tree, so I would need to do a real
revert.

I hope that won't be necessary.


Re: [PATCH net-next] net_sched: no need to free qdisc in RCU callback

2017-09-19 Thread Eric Dumazet
On Tue, 2017-09-19 at 13:15 -0700, Cong Wang wrote:
> The gen estimator was rewritten in commit 1c0d32fde5bd
> ("net_sched: gen_estimator: complete rewrite of rate estimators"),
> so the caller no longer needs to wait for a grace period before
> freeing the qdisc. This patch gets rid of the RCU callback.
> 
> Cc: Jamal Hadi Salim 
> Cc: Eric Dumazet 
> Signed-off-by: Cong Wang 
> ---

Acked-by: Eric Dumazet 
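
To sketch the resulting simplification (hypothetical names, not the
actual qdisc code): once no RCU reader can still hold the estimator at
destroy time, the deferred free can become a direct one.

	/* before: wait one grace period before freeing */
	call_rcu(&qdisc->rcu_head, qdisc_free_rcu);

	/* after: free immediately */
	qdisc_free(qdisc);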



